Netdev List


* Re: [PATCH bpf-next] bpf: Extend the sk_lookup() helper to XDP hookpoint.
From: Nitin Hande @ 2018-10-18 23:32 UTC
  To: Daniel Borkmann; +Cc: Joe Stringer, netdev, ast, Jesper Brouer, john fastabend
In-Reply-To: <f13124bf-bb4f-b681-d014-299993305262@iogearbox.net>

On Thu, 18 Oct 2018 23:20:17 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/18/2018 11:06 PM, Joe Stringer wrote:
> > On Thu, 18 Oct 2018 at 11:54, Nitin Hande <nitin.hande@gmail.com> wrote:  
> [...]
> >> Open Issue
> >> * The underlying code relies on presence of an skb to find out the
> >> right sk for the case of REUSEPORT socket option. Since there is
> >> no skb available at XDP hookpoint, the helper function will return
> >> the first available sk based off the 5 tuple hash. If the desire
> >> is to return a particular sk matching reuseport_cb function, please
> >> suggest way to tackle it, which can be addressed in a future commit.  
> 
> >> Signed-off-by: Nitin Hande <Nitin.Hande@gmail.com>  
> > 
> > Thanks Nitin, LGTM overall.
> > 
> > The REUSEPORT thing suggests that the usage of this helper from XDP
> > layer may lead to a different socket being selected vs. the equivalent
> > call at TC hook, or other places where the selection may occur. This
> > could be a bit counter-intuitive.
> > 
> > One thought I had to work around this was to introduce a flag,
> > something like BPF_F_FIND_REUSEPORT_SK_BY_HASH. This flag would
> > effectively communicate in the API that the bpf_sk_lookup_xxx()
> > functions will only select a REUSEPORT socket based on the hash and
> > not by, for example BPF_PROG_TYPE_SK_REUSEPORT programs. The absence
> > of the flag would support finding REUSEPORT sockets by other
> > mechanisms (which would be allowed for now from TC hooks but would be
> > disallowed from XDP, since there's no specific plan to support this). 

Thanks Joe for the quick response. This certainly looks feasible. With the
flag, both tc and XDP hookpoints will be consistent in their approach.
 
> 
> Hmm, given skb is NULL here the only way to lookup the socket in such
> scenario is based on hash, that is, inet_ehashfn() / inet6_ehashfn(),
> perhaps alternative is to pass this hash in from XDP itself to the
> helper so it could be custom selector.

Interesting. Would this be an additional helper, or would it be done within
this sk_lookup() helper?

> Do you have a specific use case
> on this for XDP (just curious)?

Yes, this is for a dual-stack solution. The XDP program functions as a
demux: if there is a receiver, the packet is passed up into the Linux
networking stack; otherwise it enters the other stack's path.

Thanks
Nitin

> 
> Thanks,
> Daniel
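
For reference, a minimal userspace sketch of the hash-only selection
discussed above, i.e. what is left for picking a REUSEPORT group member
when no skb is available to run a BPF_PROG_TYPE_SK_REUSEPORT program. The
scaling step follows the arithmetic of the kernel's reciprocal_scale();
struct reuseport_group, scale_hash(), pick_by_hash() and the integer socket
ids are hypothetical names used only for illustration, not kernel
structures or the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map a 32-bit hash onto [0, n) without a division, using the same
 * arithmetic as the kernel's reciprocal_scale(). */
static uint32_t scale_hash(uint32_t hash, uint32_t n)
{
	return (uint32_t)(((uint64_t)hash * n) >> 32);
}

/* Hypothetical model of a REUSEPORT group: socket ids only. */
struct reuseport_group {
	uint32_t num_socks;
	int socks[8];
};

/* Select a member purely from the flow hash.
 * Returns -1 for an empty group. */
static int pick_by_hash(const struct reuseport_group *grp, uint32_t hash)
{
	assert(grp != NULL && grp->num_socks <= 8);
	if (grp->num_socks == 0)
		return -1;
	return grp->socks[scale_hash(hash, grp->num_socks)];
}
```

With this model, the same hash always maps to the same group member, so a
caller that passes in its own hash (as suggested above) would effectively
act as a custom selector.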


* Re: [PATCH v4 bpf-next 0/2] bpf: add cg_skb_is_valid_access
From: Alexei Starovoitov @ 2018-10-18 23:34 UTC
  To: Song Liu; +Cc: netdev, ast, daniel, kernel-team
In-Reply-To: <20181018160649.1611530-1-songliubraving@fb.com>

On Thu, Oct 18, 2018 at 09:06:47AM -0700, Song Liu wrote:
> Changes v3 -> v4:
> 1. Fixed crash issue reported by Alexei.
> 
> Changes v2 -> v3:
> 1. Added helper function bpf_compute_and_save_data_pointers() and
>    bpf_restore_data_pointers().
> 
> Changes v1 -> v2:
> 1. Updated the list of read-only fields, and read-write fields.
> 2. Added dummy sk to bpf_prog_test_run_skb().
> 
> This set enables BPF program of type BPF_PROG_TYPE_CGROUP_SKB to access
> some __sk_buff data directly.

Applied, Thanks


* linux-next: Signed-off-by missing for commits in the net-next tree
From: Stephen Rothwell @ 2018-10-19  7:30 UTC
  To: David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Saeed Mahameed


Hi all,

Commits

  2a4c42980250 ("net/mlx5: Remove counter from idr after removing it from list")
  fd3307130344 ("net/mlx5: Take fs_counters dellist before addlist")

are missing a Signed-off-by from their committer.

-- 
Cheers,
Stephen Rothwell



* Re: [PATCH net-next] cxgb4: fix the error path of cxgb4_uld_register()
From: David Miller @ 2018-10-18 23:26 UTC
  To: ganeshgr; +Cc: netdev, nirranjan, indranil, dt, harsh, linux-crypto
In-Reply-To: <1539871459-6921-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Thu, 18 Oct 2018 19:34:19 +0530

> On multi adapter setup if the uld registration fails even on
> one adapter, the allocated resources for the uld on all the
> adapters are freed, rendering the functioning adapters unusable.
> 
> This commit fixes the issue by freeing the allocated resources
> only for the failed adapter.
> 
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied, thank you.


* [PATCH] net: ethernet: lpc_eth: add device and device node local variables
From: Vladimir Zapolskiy @ 2018-10-18 23:25 UTC
  To: David S. Miller; +Cc: Sylvain Lemieux, netdev

Trivial non-functional change: add local variables for the device and its
device node to simplify the repeated references to them in
lpc_eth_drv_probe().

Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
---
NB, to avoid a merge conflict the change should be applied
after https://marc.info/?l=linux-netdev&m=153990402012273

 drivers/net/ethernet/nxp/lpc_eth.c | 40 ++++++++++++++++--------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index e275d64007af..ecc7f464c238 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -1242,17 +1242,19 @@ static const struct net_device_ops lpc_netdev_ops = {
 
 static int lpc_eth_drv_probe(struct platform_device *pdev)
 {
-	struct resource *res;
-	struct net_device *ndev;
+	struct device *dev = &pdev->dev;
+	struct device_node *np = dev->of_node;
 	struct netdata_local *pldat;
+	struct net_device *ndev;
 	dma_addr_t dma_handle;
+	struct resource *res;
 	int irq, ret;
 	u32 tmp;
 
 	/* Setup network interface for RMII or MII mode */
 	tmp = __raw_readl(LPC32XX_CLKPWR_MACCLK_CTRL);
 	tmp &= ~LPC32XX_CLKPWR_MACCTRL_PINS_MSK;
-	if (lpc_phy_interface_mode(&pdev->dev) == PHY_INTERFACE_MODE_MII)
+	if (lpc_phy_interface_mode(dev) == PHY_INTERFACE_MODE_MII)
 		tmp |= LPC32XX_CLKPWR_MACCTRL_USE_MII_PINS;
 	else
 		tmp |= LPC32XX_CLKPWR_MACCTRL_USE_RMII_PINS;
@@ -1262,7 +1264,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
 	irq = platform_get_irq(pdev, 0);
 	if (!res || irq < 0) {
-		dev_err(&pdev->dev, "error getting resources.\n");
+		dev_err(dev, "error getting resources.\n");
 		ret = -ENXIO;
 		goto err_exit;
 	}
@@ -1270,12 +1272,12 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 	/* Allocate net driver data structure */
 	ndev = alloc_etherdev(sizeof(struct netdata_local));
 	if (!ndev) {
-		dev_err(&pdev->dev, "could not allocate device.\n");
+		dev_err(dev, "could not allocate device.\n");
 		ret = -ENOMEM;
 		goto err_exit;
 	}
 
-	SET_NETDEV_DEV(ndev, &pdev->dev);
+	SET_NETDEV_DEV(ndev, dev);
 
 	pldat = netdev_priv(ndev);
 	pldat->pdev = pdev;
@@ -1287,9 +1289,9 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 	ndev->irq = irq;
 
 	/* Get clock for the device */
-	pldat->clk = clk_get(&pdev->dev, NULL);
+	pldat->clk = clk_get(dev, NULL);
 	if (IS_ERR(pldat->clk)) {
-		dev_err(&pdev->dev, "error getting clock.\n");
+		dev_err(dev, "error getting clock.\n");
 		ret = PTR_ERR(pldat->clk);
 		goto err_out_free_dev;
 	}
@@ -1302,14 +1304,14 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 	/* Map IO space */
 	pldat->net_base = ioremap(res->start, resource_size(res));
 	if (!pldat->net_base) {
-		dev_err(&pdev->dev, "failed to map registers\n");
+		dev_err(dev, "failed to map registers\n");
 		ret = -ENOMEM;
 		goto err_out_disable_clocks;
 	}
 	ret = request_irq(ndev->irq, __lpc_eth_interrupt, 0,
 			  ndev->name, ndev);
 	if (ret) {
-		dev_err(&pdev->dev, "error requesting interrupt.\n");
+		dev_err(dev, "error requesting interrupt.\n");
 		goto err_out_iounmap;
 	}
 
@@ -1323,7 +1325,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 		sizeof(struct txrx_desc_t) + sizeof(struct rx_status_t));
 	pldat->dma_buff_base_v = 0;
 
-	if (use_iram_for_net(&pldat->pdev->dev)) {
+	if (use_iram_for_net(dev)) {
 		dma_handle = LPC32XX_IRAM_BASE;
 		if (pldat->dma_buff_size <= lpc32xx_return_iram_size())
 			pldat->dma_buff_base_v =
@@ -1334,7 +1336,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 	}
 
 	if (pldat->dma_buff_base_v == 0) {
-		ret = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+		ret = dma_coerce_mask_and_coherent(dev, DMA_BIT_MASK(32));
 		if (ret)
 			goto err_out_free_irq;
 
@@ -1343,7 +1345,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 		/* Allocate a chunk of memory for the DMA ethernet buffers
 		   and descriptors */
 		pldat->dma_buff_base_v =
-			dma_alloc_coherent(&pldat->pdev->dev,
+			dma_alloc_coherent(dev,
 					   pldat->dma_buff_size, &dma_handle,
 					   GFP_KERNEL);
 		if (pldat->dma_buff_base_v == NULL) {
@@ -1368,7 +1370,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 	__lpc_get_mac(pldat, ndev->dev_addr);
 
 	if (!is_valid_ether_addr(ndev->dev_addr)) {
-		const char *macaddr = of_get_mac_address(pdev->dev.of_node);
+		const char *macaddr = of_get_mac_address(np);
 		if (macaddr)
 			memcpy(ndev->dev_addr, macaddr, ETH_ALEN);
 	}
@@ -1398,7 +1400,7 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 
 	ret = register_netdev(ndev);
 	if (ret) {
-		dev_err(&pdev->dev, "Cannot register net device, aborting.\n");
+		dev_err(dev, "Cannot register net device, aborting.\n");
 		goto err_out_dma_unmap;
 	}
 	platform_set_drvdata(pdev, ndev);
@@ -1410,17 +1412,17 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 	netdev_info(ndev, "LPC mac at 0x%08x irq %d\n",
 	       res->start, ndev->irq);
 
-	device_init_wakeup(&pdev->dev, 1);
-	device_set_wakeup_enable(&pdev->dev, 0);
+	device_init_wakeup(dev, 1);
+	device_set_wakeup_enable(dev, 0);
 
 	return 0;
 
 err_out_unregister_netdev:
 	unregister_netdev(ndev);
 err_out_dma_unmap:
-	if (!use_iram_for_net(&pldat->pdev->dev) ||
+	if (!use_iram_for_net(dev) ||
 	    pldat->dma_buff_size > lpc32xx_return_iram_size())
-		dma_free_coherent(&pldat->pdev->dev, pldat->dma_buff_size,
+		dma_free_coherent(dev, pldat->dma_buff_size,
 				  pldat->dma_buff_base_v,
 				  pldat->dma_buff_base_p);
 err_out_free_irq:
-- 
2.17.1


* [PATCH] net: ethernet: lpc_eth: remove unused local variable
From: Vladimir Zapolskiy @ 2018-10-18 23:06 UTC
  To: David S. Miller; +Cc: Sylvain Lemieux, netdev

A trivial change which removes an unused local variable. The issue
is reported as a compile-time warning:

  drivers/net/ethernet/nxp/lpc_eth.c: In function 'lpc_eth_drv_probe':
  drivers/net/ethernet/nxp/lpc_eth.c:1250:21: warning: variable 'phydev' set but not used [-Wunused-but-set-variable]
    struct phy_device *phydev;
                       ^~~~~~

Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
---
 drivers/net/ethernet/nxp/lpc_eth.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index 922b5b5b5c01..e275d64007af 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -1245,7 +1245,6 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 	struct resource *res;
 	struct net_device *ndev;
 	struct netdata_local *pldat;
-	struct phy_device *phydev;
 	dma_addr_t dma_handle;
 	int irq, ret;
 	u32 tmp;
@@ -1411,8 +1410,6 @@ static int lpc_eth_drv_probe(struct platform_device *pdev)
 	netdev_info(ndev, "LPC mac at 0x%08x irq %d\n",
 	       res->start, ndev->irq);
 
-	phydev = ndev->phydev;
-
 	device_init_wakeup(&pdev->dev, 1);
 	device_set_wakeup_enable(&pdev->dev, 0);
 
-- 
2.17.1


* Re: [PATCH net-next 0/6] net/ncsi: Allow enabling multiple packages & channels
From: Samuel Mendoza-Jonas @ 2018-10-18 23:05 UTC
  To: David Miller; +Cc: netdev, Justin.Lee1, linux-kernel, openbmc
In-Reply-To: <20181018.155647.1045018243241594303.davem@davemloft.net>

On Thu, 2018-10-18 at 15:56 -0700, David Miller wrote:
> From: Samuel Mendoza-Jonas <sam@mendozajonas.com>
> Date: Thu, 18 Oct 2018 14:59:11 +1100
> 
> > This series extends the NCSI driver to configure multiple packages
> > and/or channels simultaneously. Since the RFC series this includes a few
> > extra changes to fix areas in the driver that either made this harder or
> > were roadblocks due to deviations from the NCSI specification.
> > 
> > Patches 1 & 2 fix two issues where the driver made assumptions about the
> > capabilities of the NCSI topology.
> > Patches 3 & 4 change some internal semantics slightly to make multi-mode
> > easier.
> > Patch 5 introduces a cleaner way of reconfiguring the NCSI configuration
> > and keeping track of channel states.
> > Patch 6 implements the main multi-package/multi-channel configuration,
> > configured via the Netlink interface.
> > 
> > Readers who have an interesting NCSI setup - especially multi-package
> > with HWA - please test! I think I've covered all permutations but I
> > don't have infinite hardware to test on.
> 
> This doesn't apply cleanly to net-next.  Does it depend upon changes
> applied elsewhere?  You must always make that explicit.

Ah, my mistake; I hadn't updated my net-next branch recently enough and
missed Vijay's OEM command patch. Will rebase.

> 
> Also, please explain this locking in ncsi_reset_dev():
> 
> +       NCSI_FOR_EACH_PACKAGE(ndp, np) {
> +               NCSI_FOR_EACH_CHANNEL(np, nc) {
> +                       spin_lock_irqsave(&nc->lock, flags);
> +                       enabled = nc->monitor.enabled;
> +                       state = nc->state;
> +                       spin_unlock_irqrestore(&nc->lock, flags);
> +
> +                       if (enabled)
> +                               ncsi_stop_channel_monitor(nc);
> +                       if (state == NCSI_CHANNEL_ACTIVE) {
> +                               active = nc;
> +                               break;
> +                       }
> 
> Is that really protecting anything?
> 
> Right after you drop nc->lock those two values can change, the state
> of the 'nc' can change such that it isn't NCSI_CHANNEL_ACTIVE anymore
> etc.
> 
> At best this locking makes sure that enabled and state are consistent
> with respect to each other, only.  It doesn't guarantee anything about
> the stability of the state of the object at all, and it can change
> right from under you.

You're right. I'll fix this up in the rebase so that the lock actually
protects the checks of the channel and monitor state.

Thanks,
Sam
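
As a generic illustration of the point raised above (not the eventual NCSI
fix), the usual remedy for a "snapshot under the lock, act after dropping
it" pattern is to make the check and the resulting state change one
critical section. The sketch below is a userspace model using a pthread
mutex in place of the channel spinlock; struct chan, enum chan_state and
chan_claim_if_active() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

enum chan_state { CHAN_INACTIVE, CHAN_ACTIVE };

/* Hypothetical stand-in for an NCSI channel. */
struct chan {
	pthread_mutex_t lock;
	enum chan_state state;
	bool monitor_enabled;
};

/* Check the state and act on it inside one critical section, so the
 * state cannot change between the test and the transition. */
static bool chan_claim_if_active(struct chan *c)
{
	bool claimed = false;

	assert(c != NULL);
	pthread_mutex_lock(&c->lock);
	if (c->state == CHAN_ACTIVE) {
		c->state = CHAN_INACTIVE;
		c->monitor_enabled = false;
		claimed = true;
	}
	pthread_mutex_unlock(&c->lock);
	return claimed;
}
```

Only the caller that observes CHAN_ACTIVE while holding the lock gets
"true" back; a second caller sees the already updated state.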


* RE: [PATCH] net: ethernet: fec: Add missing SPEED_
From: Andy Duan @ 2018-10-19  7:07 UTC
  To: Heiner Kallweit, Florian Fainelli, LABBE Corentin
  Cc: andrew@lunn.ch, davem@davemloft.net, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <f1933c0a-d124-a7ac-afab-2fa8f88907ab@gmail.com>

From: Heiner Kallweit <hkallweit1@gmail.com> Sent: October 19, 2018 4:41
> On 18.10.2018 22:10, Florian Fainelli wrote:
> > On 10/18/2018 12:59 PM, LABBE Corentin wrote:
> >> On Thu, Oct 18, 2018 at 12:38:32PM -0700, Florian Fainelli wrote:
> >>> On 10/18/2018 12:16 PM, LABBE Corentin wrote:
> >>>> On Thu, Oct 18, 2018 at 11:55:49AM -0700, Florian Fainelli wrote:
> >>>>> On 10/18/2018 11:47 AM, LABBE Corentin wrote:
> >>>>>> On Thu, Oct 18, 2018 at 11:39:24AM -0700, Florian Fainelli wrote:
> >>>>>>> On 10/18/2018 08:05 AM, Corentin Labbe wrote:
> >>>>>>>> Since commit 58056c1e1b0e ("net: ethernet: Use
> >>>>>>>> phy_set_max_speed() to limit advertised speed"), the fec
> >>>>>>>> driver is unable to get any link.
> >>>>>>>> This is due to missing SPEED_.
> >>>>>>>
> >>>>>>> But SPEED_1000 is defined in include/uapi/linux/ethtool.h as
> >>>>>>> 1000, so surely this would amount to the same code paths being
> >>>>>>> taken or am I missing something here?
> >>>>>>
> >>>>>> The bisect session pointed to your patch; reverting it fixes the issue.
> >>>>>> BUT since the fix seemed trivial, I sent the patch without testing
> >>>>>> it beyond compiling it.
> >>>>>> Sorry, I found out a few minutes ago that it didn't fix the issue.
> >>>>>>
> >>>>>> But your patch is still the cause for sure.
> >>>>>>
> >>>>>
> >>>>> What you are writing is really lowering the confidence level,
> >>>>> first Andrew is the author of that patch, and second "just
> >>>>> compiling" and pretending this fixes a problem when it does not is
> >>>>> not quite what I would expect.
> >>>>>
> >>>>> I don't have a problem helping you find the solution or the right
> >>>>> fix though, even if it is not my patch, but please get the author
> >>>>> and actual problem right so we can move forward in confidence,
> >>>>> thanks!
> >>>>
> >>>> Sorry again, I wanted to acknowledge my error but I did it too
> >>>> hastily and too late.
> >>>> And sorry for confusing you with Andrew.
> >>>
> >>> No worries, here to help, let us know what your bisection points to.
> >>> Thanks
> >>
> >> I have added printing of phydev->supported. My working kernel (on top
> >> of 58056c1e1b0e + revert patch) got:
> >> [    5.550838] fec_enet_mii_probe 2ff (gbit features)
> >> [    5.555848] fec_enet_mii_probe 2ef (without 1000baseT_Half)
> >> [    5.561620] fec_enet_mii_probe 22ef final (after pause)
> >> [    5.566914] Micrel KSZ9021 Gigabit PHY 2188000.ethernet-1:06: attached PHY driver [Micrel KSZ9021 Gigabit PHY] (mii_bus:phy_addr=2188000.ethernet-1:06, irq=POLL)
> >> [    8.730751] fec 2188000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> >> [    8.788311] Sending DHCP requests ., OK
> >> [    8.832357] IP-Config: Got DHCP answer from 192.168.66.1, my address is 192.168.66.58
> >>
> >> the non-working kernel (next-20181015)
> >> [    7.308917] fec_enet_mii_probe 62ff after phy_set_max_speed
> >> [    7.314545] fec_enet_mii_probe 62ef after phy_remove_link_mode
> >> [    7.320418] fec_enet_mii_probe 62ef after pause
> >> and then no link
> >>
> >> So it seems that phy_set_max_speed adds bit 14
> >> (ETHTOOL_LINK_MODE_Asym_Pause_BIT)
> >
> > It's not masking it so it must be coming from phy_probe().
> >
> See df8ed346d4a8 ("net: phy: fix flag masking in __set_phy_supported").
> phy_set_max_speed() used to (unintentionally) mask the pause bits and it
> seems that the fec driver used this bug as a feature.
> 
> >>
> >> I have patched by adding:
> >> phy_remove_link_mode(phy_dev, ETHTOOL_LINK_MODE_Asym_Pause_BIT);
> 
> Instead of programmatically removing the feature bit it should be possible
> to do this in the PHY driver configuration. See also this part of
> phy_probe().
> 
> 	if (phydrv->features & (SUPPORTED_Pause | SUPPORTED_Asym_Pause)) {
> 		phydev->supported &= ~(SUPPORTED_Pause | SUPPORTED_Asym_Pause);
> 		phydev->supported |= phydrv->features &
> 				     (SUPPORTED_Pause | SUPPORTED_Asym_Pause);
> 	} else {
> 		phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;
> 	}

The KSZ9021 PHY driver doesn't set any Pause feature, so phylib enables both SUPPORTED_Pause and SUPPORTED_Asym_Pause.
Micrel.c:
        .phy_id         = PHY_ID_KSZ9021,
        .phy_id_mask    = 0x000ffffe,
        .name           = "Micrel KSZ9021 Gigabit PHY",
        .features       = PHY_GBIT_FEATURES,

From @LABBE Corentin's debug output, it seems the KSZ9021 cannot advertise both Pause and Asym Pause, otherwise the link does not come up.
The KSZ9021 datasheet describes the bits as below. I don't understand what "Symmetric & Asymmetric PAUSE (local device)" is supposed to mean.
4.11:10 Pause
[0,0] = No PAUSE
[1,0] = Asymmetric PAUSE (link partner)
[0,1] = Symmetric PAUSE
[1,1] = Symmetric & Asymmetric PAUSE (local device)

@LABBE Corentin, you can try this:
Micrel.c:
        .phy_id         = PHY_ID_KSZ9021,
        .phy_id_mask    = 0x000ffffe,
        .name           = "Micrel KSZ9021 Gigabit PHY",
        .features       = PHY_GBIT_FEATURES | SUPPORTED_Pause,


In fact, I tested the net tree without any changes on an AR8031 and there was no link problem.


Andy
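
To make the numbers in the logs easier to follow, here is a small
userspace model of the phy_probe() fragment quoted above. The two bit
values are the ones defined in include/uapi/linux/ethtool.h;
apply_pause_defaults() is a hypothetical helper name, and 0x2ff is the
"gbit features" value taken from the debug output in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/ethtool.h. */
#define SUPPORTED_Pause		(1u << 13)
#define SUPPORTED_Asym_Pause	(1u << 14)

/* Userspace model of the quoted phy_probe() fragment: keep the driver's
 * pause bits if it declares any, otherwise enable both. */
static uint32_t apply_pause_defaults(uint32_t supported, uint32_t drv_features)
{
	const uint32_t mask = SUPPORTED_Pause | SUPPORTED_Asym_Pause;

	if (drv_features & mask) {
		supported &= ~mask;
		supported |= drv_features & mask;
	} else {
		supported |= mask;
	}

	/* Either branch leaves at least one pause bit set. */
	assert((supported & mask) != 0);
	return supported;
}
```

With a driver that declares no pause bits, 0x2ff becomes 0x62ff, which is
the value seen on the non-working kernel. With SUPPORTED_Pause added to
the driver features as suggested, 0x2ff becomes 0x22ff, and 0x22ef once
1000baseT_Half is removed, matching the working log.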


* Re: [net PATCH] net: sched: Fix for duplicate class dump
From: David Miller @ 2018-10-18 23:00 UTC
  To: phil; +Cc: netdev, edumazet
In-Reply-To: <20181018083426.6623-1-phil@nwl.cc>

From: Phil Sutter <phil@nwl.cc>
Date: Thu, 18 Oct 2018 10:34:26 +0200

> When dumping classes by parent, kernel would return classes twice:
> 
> | # tc qdisc add dev lo root prio
> | # tc class show dev lo
> | class prio 8001:1 parent 8001:
> | class prio 8001:2 parent 8001:
> | class prio 8001:3 parent 8001:
> | # tc class show dev lo parent 8001:
> | class prio 8001:1 parent 8001:
> | class prio 8001:2 parent 8001:
> | class prio 8001:3 parent 8001:
> | class prio 8001:1 parent 8001:
> | class prio 8001:2 parent 8001:
> | class prio 8001:3 parent 8001:
> 
> This comes from qdisc_match_from_root() potentially returning the root
> qdisc itself if its handle matched. Though in that case, root's classes
> were already dumped a few lines above.
> 
> Fixes: cb395b2010879 ("net: sched: optimize class dumps")
> Signed-off-by: Phil Sutter <phil@nwl.cc>

Applied and queued up for -stable, thanks Phil.
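
The duplicate described in the commit message can be reproduced with a
flattened userspace model. Everything here is hypothetical (struct
qdisc_model, match_from_root(), dump_classes()), and the "skip the match
when it is the root" check is an assumption about one way to avoid the
second pass, since the patch body is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, flattened model of a qdisc: a handle and a class count. */
struct qdisc_model {
	unsigned int handle;
	int nclasses;
};

/* Stand-in for the lookup; in this model it can only match the root. */
static const struct qdisc_model *
match_from_root(const struct qdisc_model *root, unsigned int handle)
{
	return root->handle == handle ? root : NULL;
}

/* Returns how many class entries a dump filtered by parent emits. */
static int dump_classes(const struct qdisc_model *root, unsigned int parent,
			int skip_root_match)
{
	const struct qdisc_model *q;
	int emitted;

	assert(root != NULL);
	emitted = root->nclasses;	/* root's classes are dumped first */
	if (parent) {
		q = match_from_root(root, parent);
		if (q && !(skip_root_match && q == root))
			emitted += q->nclasses;	/* second pass over the match */
	}
	return emitted;
}
```

For a root prio qdisc with handle 8001: and three classes, the model emits
six entries without the check and three with it, as in the tc output
quoted above.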


* Re: [PATCH net-next] bnxt_en: Copy and paste bug in extended tx_stats
From: David Miller @ 2018-10-18 22:59 UTC
  To: dan.carpenter; +Cc: michael.chan, netdev, kernel-janitors
In-Reply-To: <20181018080239.z2egoh4bw4beb3r3@kili.mountain>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Thu, 18 Oct 2018 11:02:39 +0300

> The struct type was copied from the line before but it should be "tx"
> instead of "rx".  I have reviewed the code and I can't immediately see
> that this bug causes a runtime issue.
> 
> Fixes: 36e53349b60b ("bnxt_en: Add additional extended port statistics.")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> ---
> This is from static analysis and I don't have a way to test it.

Applied.


* Re: [PATCH v2 net] r8169: fix NAPI handling under high load
From: Francois Romieu @ 2018-10-18 22:59 UTC
  To: Eric Dumazet
  Cc: Heiner Kallweit, David Miller, Realtek linux nic maintainers,
	netdev@vger.kernel.org
In-Reply-To: <05eb4233-bf98-4aeb-73f6-94fec46ca3de@gmail.com>

Eric Dumazet <eric.dumazet@gmail.com> :
[...]
> One has to wonder why rtl8169_poll(), which might be called in a loop under DOS,
> has to call rtl_ack_events() ?

So as to cover a wider temporal range before any event can trigger an
extra irq. I was more worried about irq cost than about IO cost (and
I still am).

-- 
Ueimor


* [PATCH] net: ethernet: lpc_eth: remove CONFIG_OF guard from the driver
From: Vladimir Zapolskiy @ 2018-10-18 22:58 UTC
  To: David S. Miller; +Cc: Sylvain Lemieux, netdev

The MAC controller device is available on NXP LPC32xx platform only,
and the LPC32xx platform supports OF builds only, so additional
checks in the device driver are not needed.

Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
---
 drivers/net/ethernet/nxp/lpc_eth.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index ed02e8e18f25..922b5b5b5c01 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -1518,13 +1518,11 @@ static int lpc_eth_drv_resume(struct platform_device *pdev)
 }
 #endif
 
-#ifdef CONFIG_OF
 static const struct of_device_id lpc_eth_match[] = {
 	{ .compatible = "nxp,lpc-eth" },
 	{ }
 };
 MODULE_DEVICE_TABLE(of, lpc_eth_match);
-#endif
 
 static struct platform_driver lpc_eth_driver = {
 	.probe		= lpc_eth_drv_probe,
@@ -1535,7 +1533,7 @@ static struct platform_driver lpc_eth_driver = {
 #endif
 	.driver		= {
 		.name	= MODNAME,
-		.of_match_table = of_match_ptr(lpc_eth_match),
+		.of_match_table = lpc_eth_match,
 	},
 };
 
-- 
2.17.1


* [PATCH] net: ethernet: lpc_eth: clean up the list of included headers
From: Vladimir Zapolskiy @ 2018-10-18 22:53 UTC
  To: David S. Miller; +Cc: Sylvain Lemieux, netdev

The change removes all unnecessary included headers from the driver
source code and sorts the remaining list in alphabetical order.

Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
---
 drivers/net/ethernet/nxp/lpc_eth.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index 08381ef8bdb4..ed02e8e18f25 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -19,34 +19,18 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
-#include <linux/module.h>
-#include <linux/kernel.h>
-#include <linux/sched.h>
-#include <linux/slab.h>
-#include <linux/delay.h>
-#include <linux/interrupt.h>
-#include <linux/errno.h>
-#include <linux/ioport.h>
-#include <linux/crc32.h>
-#include <linux/platform_device.h>
-#include <linux/spinlock.h>
-#include <linux/ethtool.h>
-#include <linux/mii.h>
 #include <linux/clk.h>
-#include <linux/workqueue.h>
-#include <linux/netdevice.h>
+#include <linux/crc32.h>
 #include <linux/etherdevice.h>
-#include <linux/skbuff.h>
-#include <linux/phy.h>
-#include <linux/dma-mapping.h>
-#include <linux/of.h>
+#include <linux/module.h>
 #include <linux/of_net.h>
-#include <linux/types.h>
+#include <linux/phy.h>
+#include <linux/platform_device.h>
+#include <linux/spinlock.h>
 
-#include <linux/io.h>
 #include <mach/board.h>
-#include <mach/platform.h>
 #include <mach/hardware.h>
+#include <mach/platform.h>
 
 #define MODNAME "lpc-eth"
 #define DRV_VERSION "1.00"
-- 
2.17.1


* Re: [net-next 00/13][pull request] Intel Wired LAN Doc Updates 2018-10-18
From: David Miller @ 2018-10-18 22:46 UTC
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20181018203259.4175-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu, 18 Oct 2018 13:32:46 -0700

> This series contains documentation fixes and updates for Intel wired LAN
> drivers.
> 
> The following was done:
>  - Updated incorrect URLs
>  - removed document references which did not apply to the current
>    in-kernel drivers
>  - added documentation for fm10k driver
>  - added missing documentation on existing or new features
>  - added SPDX headers to all the documentation files
> 
> Lastly, the documentation was converted over to the RST (reStructured
> Text) format, so that 'make htmldocs' produces pretty html driver
> documentation for our drivers.
> 
> The following are changes since commit 3a3295bfa6f484bba91de0a804c1d0bf7d31dbd3:
>   Merge branch 'sctp-fix-sk_wmem_queued-and-use-it-to-check-for-writable-space'
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE

Pulled, thanks Jeff.


* [PATCH v4.14-stable] sch_netem: restore skb->dev after dequeuing from the rbtree
From: Christoph Paasch @ 2018-10-18 20:38 UTC
  To: netdev, stable, David Miller, gregkh
  Cc: Stephen Hemminger, Eric Dumazet, Soheil Hassas Yeganeh, Wei Wang,
	Willem de Bruijn

Upstream commit bffa72cf7f9d ("net: sk_buff rbnode reorg") got
backported as commit 6b921536f170 ("net: sk_buff rbnode reorg") into the
v4.14.x-tree.

However, the backport does not include the changes in sch_netem.c

We need these, as otherwise the skb->dev pointer is not set when
dequeueing from the netem rbtree, resulting in a panic:

[   15.427748] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[   15.428863] IP: netif_skb_features+0x24/0x230
[   15.429402] PGD 0 P4D 0
[   15.429733] Oops: 0000 [#1] SMP PTI
[   15.430169] Modules linked in:
[   15.430614] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.77.mptcp #77
[   15.431497] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
[   15.432568] task: ffff88042db19680 task.stack: ffffc90000070000
[   15.433356] RIP: 0010:netif_skb_features+0x24/0x230
[   15.433977] RSP: 0018:ffff88043fd83e70 EFLAGS: 00010286
[   15.434665] RAX: ffff880429ad80c0 RBX: ffff88042bd0e400 RCX: ffff880429ad8000
[   15.435585] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88042bd0e400
[   15.436551] RBP: ffff88042bd0e400 R08: ffff88042a4b6c9c R09: 0000000000000001
[   15.437485] R10: 0000000000000004 R11: 0000000000000000 R12: ffff88042c700000
[   15.438393] R13: ffff88042c700000 R14: ffff88042a4b6c00 R15: ffff88042c6bb000
[   15.439315] FS:  0000000000000000(0000) GS:ffff88043fd80000(0000) knlGS:0000000000000000
[   15.440314] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.441084] CR2: 00000000000000d0 CR3: 000000042c374000 CR4: 00000000000006e0
[   15.442016] Call Trace:
[   15.442333]  <IRQ>
[   15.442596]  validate_xmit_skb+0x17/0x270
[   15.443134]  validate_xmit_skb_list+0x38/0x60
[   15.443698]  sch_direct_xmit+0x102/0x190
[   15.444198]  __qdisc_run+0xe3/0x240
[   15.444671]  net_tx_action+0x121/0x140
[   15.445177]  __do_softirq+0xe2/0x224
[   15.445654]  irq_exit+0xbf/0xd0
[   15.446072]  smp_apic_timer_interrupt+0x5d/0x90
[   15.446654]  apic_timer_interrupt+0x7d/0x90
[   15.447185]  </IRQ>
[   15.447460] RIP: 0010:native_safe_halt+0x2/0x10
[   15.447992] RSP: 0018:ffffc90000073f10 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff10
[   15.449008] RAX: ffffffff816667d0 RBX: ffffffff820946b0 RCX: 0000000000000000
[   15.449895] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[   15.450768] RBP: ffffffff82026940 R08: 00000004e858e5e1 R09: ffff88042a4b6d58
[   15.451643] R10: 0000000000000000 R11: 000000d0d56879bb R12: 0000000000000000
[   15.452478] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   15.453340]  ? __sched_text_end+0x2/0x2
[   15.453835]  default_idle+0xf/0x20
[   15.454259]  do_idle+0x170/0x200
[   15.454653]  cpu_startup_entry+0x14/0x20
[   15.455142]  secondary_startup_64+0xa5/0xb0
[   15.455715] Code: 1f 84 00 00 00 00 00 55 53 48 89 fd 48 83 ec 08 8b 87 bc 00 00 00 48 8b 8f c0 00 00 00 0f b6 97 81 00 00 00 48 8b 77 10 48 01 c8 <48> 8b 9
[   15.458138] RIP: netif_skb_features+0x24/0x230 RSP: ffff88043fd83e70
[   15.458933] CR2: 00000000000000d0
[   15.459352] ---[ end trace 083925903ae60570 ]---

Fixes: 6b921536f170 ("net: sk_buff rbnode reorg")
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
---

Notes:
    This patch should only make it into v4.14-stable as that's the only branch where
    the offending commit has been backported to.

 net/sched/sch_netem.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2a2ab6bfe5d8..3d325b840802 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -624,6 +624,10 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 			skb->next = NULL;
 			skb->prev = NULL;
 			skb->tstamp = netem_skb_cb(skb)->tstamp_save;
+			/* skb->dev shares skb->rbnode area,
+			 * we need to restore its value.
+			 */
+			skb->dev = qdisc_dev(sch);
 
 #ifdef CONFIG_NET_CLS_ACT
 			/*
-- 
2.16.2
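
The reason skb->dev has to be restored can be shown with a userspace model
of the storage sharing that the comment in the patch refers to. The types
below (struct dev_model, struct rb_model, struct pkt_model) and the
helpers tree_link() and tree_unlink() are hypothetical and only mimic the
layout idea: a three-word tree node overlaying the next/prev/dev fields.

```c
#include <assert.h>
#include <stddef.h>

struct dev_model { int ifindex; };

/* Three-word node, shaped like the kernel's struct rb_node. */
struct rb_model {
	unsigned long parent_color;
	struct rb_model *right;
	struct rb_model *left;
};

/* Hypothetical packet: list linkage plus dev overlay the tree node. */
struct pkt_model {
	union {
		struct {
			struct pkt_model *next;
			struct pkt_model *prev;
			struct dev_model *dev;
		};
		struct rb_model rbnode;
	};
};

/* Linking into a tree writes all of rbnode, which overwrites dev. */
static void tree_link(struct pkt_model *p)
{
	assert(p != NULL);
	p->rbnode.parent_color = 0;
	p->rbnode.right = NULL;
	p->rbnode.left = NULL;
}

/* After unlinking, reset the list pointers and restore dev from the
 * owner, as the patch does with qdisc_dev(sch). */
static void tree_unlink(struct pkt_model *p, struct dev_model *owner)
{
	assert(p != NULL);
	p->next = NULL;
	p->prev = NULL;
	p->dev = owner;
}
```

In this model a packet that was linked into the tree comes back with a
NULL dev unless tree_unlink() restores it, which corresponds to the NULL
pointer dereference in netif_skb_features() shown in the trace.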


* Re: [PATCH 0/4] Add SOCFPGA System Manager
From: Lee Jones @ 2018-10-19  6:16 UTC
  To: Thor Thayer
  Cc: peppe.cavallaro, dinguyen, linux, alexandre.torgue, joabreu,
	davem, mchehab+samsung, catalin.marinas, akpm, arnd, aisheng.dong,
	linux-kernel, netdev, linux-arm-kernel
In-Reply-To: <a095071e-bd46-e440-fc01-7e2bbaf5964a@linux.intel.com>

On Wed, 17 Oct 2018, Thor Thayer wrote:


> On 10/10/2018 09:42 AM, Thor Thayer wrote:
> > Hi
> > On 09/24/2018 05:09 PM, thor.thayer@linux.intel.com wrote:
> > > From: Thor Thayer <thor.thayer@linux.intel.com>
> > > 
> > > Add MFD driver for ARM64 SOCFPGA System Manager to steer
> > > System Manager calls appropriately.
> > > The SOCFPGA System Manager includes registers from several
> > > SOC peripherals.
> > > 
> > > On ARM32, syscon handles this aggregated register grouping.
> > > Redirect System Manager calls to syscon for ARM32 SOCFPGA
> > > systems.
> > > 
> > > The ARM64 System Manager can only be accessed from priority
> > > level EL3 so this new MFD driver handles the calls to EL3.
> > > 
> > > Thor Thayer (4):
> > >    mfd: altera-sysmgr: Add SOCFPGA System Manager abstraction
> > >    ARM: socfpga_defconfig: Enable CONFIG_MTD_ALTERA_SYSMGR
> > >    arm64: defconfig: Enable CONFIG_MTD_ALTERA_SYSMGR
> > >    net: stmmac: socfpga: Convert to shared System Manager driver
> > > 
> > >   MAINTAINERS                                        |   6 +
> > >   arch/arm/configs/socfpga_defconfig                 |   1 +
> > >   arch/arm64/configs/defconfig                       |   1 +
> > >   drivers/mfd/Kconfig                                |   9 +
> > >   drivers/mfd/Makefile                               |   1 +
> > >   drivers/mfd/altera-sysmgr.c                        | 310 +++++++++++++++++++++
> > >   .../net/ethernet/stmicro/stmmac/dwmac-socfpga.c    |   4 +-
> > >   include/linux/mfd/altera-sysmgr.h                  | 113 ++++++++
> > >   8 files changed, 444 insertions(+), 1 deletion(-)
> > >   create mode 100644 drivers/mfd/altera-sysmgr.c
> > >   create mode 100644 include/linux/mfd/altera-sysmgr.h
> > > 
> > Gentle ping.
> 
> Gentle ping again...
> 
> Any comments on this patch series?

"Please don't send content free pings and please allow a reasonable time
for review.  People get busy, go on holiday, attend conferences and so
on so unless there is some reason for urgency (like critical bug fixes)
please allow at least a couple of weeks for review.  If there have been
review comments then people may be waiting for those to be addressed.
Sending content free pings just adds to the mail volume (if they are
seen at all) and if something has gone wrong you'll have to resend the
patches anyway so resending with any comments addressed is generally a
much better approach."

In this case, we are too late in the series to have these
applied.  Maintainers are generally preparing their submissions for
the merge-window.  The MFD component of this set is marked as "To
Review" and I will get around to it when time is more abundant.

-- 
Lee Jones [李琼斯]
Linaro Services Technical Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog

^ permalink raw reply

* Re: [PATCH bpf-next] bpf: emit audit messages upon successful prog load and unload
From: Steve Grubb @ 2018-10-18 22:09 UTC (permalink / raw)
  To: linux-audit
  Cc: Richard Guy Briggs, Jesper Dangaard Brouer, Daniel Borkmann,
	netdev, ast, acme, Jiri Olsa, Jiri Olsa, Alexei Starovoitov
In-Reply-To: <20181018195306.td5ijww3lk3knswx@madcap2.tricolour.ca>

On Thursday, October 18, 2018 3:53:06 PM EDT Richard Guy Briggs wrote:
> On 2018-10-07 18:19, Jesper Dangaard Brouer wrote:
> > On Sat, 6 Oct 2018 00:05:22 +0200 Jiri Olsa <jolsa@redhat.com> wrote:
> > > On Fri, Oct 05, 2018 at 11:44:35AM -0700, Alexei Starovoitov wrote:
> > > > On Fri, Oct 05, 2018 at 08:14:09AM +0200, Jiri Olsa wrote:
> > > > > On Thu, Oct 04, 2018 at 03:10:15PM -0700, Alexei Starovoitov wrote:
> > > > > > > On Thu, Oct 04, 2018 at 10:22:31PM +0200, Jesper Dangaard Brouer wrote:
> > > > > > > > On Thu, 4 Oct 2018 21:41:17 +0200 Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > > > > > > On 10/04/2018 08:39 PM, Jesper Dangaard Brouer wrote:
> > > > > > > > > > On Thu, 4 Oct 2018 10:11:43 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > > > > > > > >> On Thu, Oct 04, 2018 at 03:50:38PM +0200, Daniel Borkmann wrote:
> > > > > > > [...]
> > > > > > > 
> > > > > > > > >> If the purpose of the patch is to give user space
> > > > > > > > >> visibility into
> > > > > > > > >> bpf prog load/unload as a notification, then I completely
> > > > > > > > >> agree that
> > > > > > > > >> some notification mechanism is necessary.
> > > > > > > > 
> > > > > > > > Yeah, I did only regard it as only that, nothing more. Some
> > > > > > > > means
> > > > > > > > of timeline and notification that can be kept in a record in
> > > > > > > > user
> > > > > > > > space and later retrieved e.g. for introspection on what has
> > > > > > > > been
> > > > > > > > loaded.
> > > > > > > > 
> > > > > > > > >> I've started working on such mechanism via perf ring
> > > > > > > > >> buffer which is
> > > > > > > > >> the fastest mechanism we have in the kernel so far.
> > > > > > > > >> See long discussion here:
> > > > > > > > >> https://patchwork.ozlabs.org/patch/971970/
> > 
> > [...]
> > 
> > > > > > > > That one is definitely needed in any case to resolve the
> > > > > > > > kallsyms
> > > > > > > > limitations, and it does have overlap in that in either case
> > > > > > > > we
> > > > > > > > want to look at past BPF programs that have been unloaded in
> > > > > > > > the
> > > > > > > > meantime, so I don't have a strong preference either way, and
> > > > > > > > the
> > > > > > > > former is needed in any case. Though thought was that audit
> > > > > > > > might
> > > > > > > > be an option for those not running profiling daemons 24/7,
> > > > > > > > but
> > > > > > > > presumably bpftool could be extended to record these events
> > > > > > > > as
> > > > > > > > well if we don't want to reuse audit infra.
> > > > > > > 
> > > > > > > Yes, exactly, I don't want to run a profiling daemon 24/7 to
> > > > > > > record
> > > > > > > these events.  I do acknowledge that this perf event is
> > > > > > > relevant,
> > > > > > > especially for catching the kernel symbols (I need that
> > > > > > > myself), but it
> > > > > > > does not cover my use-case.
> > > > > > > 
> > > > > > > My use-case is to 24/7 collect and keep records in userspace,
> > > > > > > and have a
> > > > > > > timeline of these notifications, for later retrieval.  The idea
> > > > > > > is that
> > > > > > > our support engineers can look at these records when
> > > > > > > troubleshooting
> > > > > > > the system.  And the plan is also to collect these records as
> > > > > > > part of
> > > > > > > our sosreport tool, which is part of the support case.
> > > > > > 
> > > > > > I don't think you're implying that prog load/unload should be
> > > > > > spamming dmesg and auditd not even running...
> > > > > 
> > > > > I think the problem Jesper implied is that in order to collect
> > > > > those logs you'll need perf tool running all the time.. which
> > > > > it's not equipped for yet
> > > > 
> > > > I'm not proposing to run 'perf' binary all the time.
> > > > Setting up perf ring buffer just for these new bpf prog load/unload
> > > > events and epolling it is simple enough to do from any application
> > > > including auditd. selftests/bpf/ do it for bpf output events.
> > > 
> > > ok, did not think about the possibility to teach auditd talk to perf,
> > > time to get that tool evsel/evlist/rb library ready ;-)
> > 
> > Interesting, I also didn't consider teaching auditd to get its 'bpf'
> > events from a separate perf ring-buffer, that might work.  I do wonder
> > how the audit people will take this suggestion.

I'm not sure exactly what the issue is. You can audit for a specific syscall
and argument. So, if you want to see loads, then you can make a rule like:

-a always,exit -F arch=b64 -S bpf -F a0=5
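
For reference, a0 in that rule is the first argument of the bpf()
syscall, i.e. the command number. A minimal sketch in C of why 5 selects
program loads, assuming the enum bpf_cmd ordering from
include/uapi/linux/bpf.h. The SKETCH_ enum and the helper name are
illustrative only and are not part of the kernel or audit APIs:

```c
#include <stdbool.h>

/* Local mirror of the first few enum bpf_cmd values from
 * include/uapi/linux/bpf.h. Redefined here with a SKETCH_ prefix
 * (an assumption of this example) so it builds without kernel headers.
 */
enum sketch_bpf_cmd {
	SKETCH_BPF_MAP_CREATE,		/* 0 */
	SKETCH_BPF_MAP_LOOKUP_ELEM,	/* 1 */
	SKETCH_BPF_MAP_UPDATE_ELEM,	/* 2 */
	SKETCH_BPF_MAP_DELETE_ELEM,	/* 3 */
	SKETCH_BPF_MAP_GET_NEXT_KEY,	/* 4 */
	SKETCH_BPF_PROG_LOAD,		/* 5 */
};

/* Hypothetical helper: would "-S bpf -F a0=5" match a bpf() call
 * whose first argument is a0?
 */
static bool sketch_rule_matches_prog_load(unsigned long a0)
{
	return a0 == SKETCH_BPF_PROG_LOAD;
}
```

Note that a rule of this shape only sees the load syscall. Unloads do
not go through a dedicated bpf() command, so they would not be covered.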

-Steve

^ permalink raw reply

* RE: Issues in error queue polling
From: Keller, Jacob E @ 2018-10-18 21:54 UTC (permalink / raw)
  To: pasqualirb@gmail.com, netdev@vger.kernel.org
  Cc: Gomes, Vinicius, davem@davemloft.net
In-Reply-To: <20181018182644.GA11634@localhost.localdomain>

Hi,

> -----Original Message-----
> From: Ricardo Biehl Pasquali [mailto:pasqualirb@gmail.com]
> Sent: Thursday, October 18, 2018 11:27 AM
> To: netdev@vger.kernel.org
> Cc: Keller, Jacob E <jacob.e.keller@intel.com>; Gomes, Vinicius
> <vinicius.gomes@intel.com>; davem@davemloft.net
> Subject: Issues in error queue polling
> 
> The commit 7d4c04fc170087119727 ("net: add option to enable
> error queue packets waking select") (2013-03-28) introduced
> SO_SELECT_ERR_QUEUE, which masks POLLPRI with POLLERR event
> return in some socket poll callbacks.
> 
> POLLERR event issued with sock_queue_err_skb() did not wake
> up a poll when POLLERR is the only requested event because
> sk_data_ready() (sock_def_readable()) was used and it
> doesn't mask POLLERR in poll wake up:
> 

Right.

> wake_up_interruptible_sync_poll(&wq->wait,
>                                 EPOLLIN | EPOLLPRI |
>                                 EPOLLRDNORM | EPOLLRDBAND);
> 
> If POLLIN or POLLPRI are requested, for example, poll does
> wake up.
> 
> POLLERR wakeup by requesting POLLPRI is possible without
> setting SO_SELECT_ERR_QUEUE. All the option does is mask
> POLLPRI as a returned event before poll returns. poll
> would return anyway because of POLLERR.
> 

Yes. The problem is that the application thread is not ready to handle POLLPRI, so they want to avoid the application waking up to that event.

> Also, the sentence "[...] enable software to wait on error
> queue packets without waking up for regular data on the
> socket." from the above commit is not true.
> 

Not entirely true but...

> A POLLIN event issued via sock_def_readable() wakes up
> threads waiting for POLLPRI, and vice versa. However,
> poll() does not return, sleeping again, as the requested
> events do not match events.
> 

The thread wakes up, but the application handling the events doesn't because the thread goes right back to sleep.

> The commit 6e5d58fdc9bedd0255a8 ("skbuff: Fix not waking
> applications when errors are enqueued") (2018-03-14) makes
> POLLERR alone wake up poll. It replaces sk_data_ready()
> (sock_def_readable()) with sk_error_report()
> (sock_def_error_report()). This makes "POLLERR wake up by
> requesting POLLPRI" obsolete.
> 

Yep, this is a better solution, and I wish it had been thought of before we introduced SO_SELECT_ERR_QUEUE.

> Rationale:
> 
> POLLIN-only and POLLERR-only wakeups are useful when there
> is a receiving thread, a sending thread, and a thread that
> gets transmit timestamps. The thread polling on POLLERR will
> not wake up when regular data arrives (POLLIN). The thread
> polling on POLLIN will not wake up when tx timestamps are
> ready (POLLERR).

Right. This is the goal for applications like ptp4l.
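
As a rough illustration of the POLLERR-only thread, here is a minimal
sketch in C. It assumes a kernel that has the 6e5d58fdc9be change, and
the function name is made up for this example. It relies only on the
fact that poll() always reports POLLERR, even when events is 0:

```c
#include <poll.h>

/* Wait until the descriptor reports an error condition, e.g. a tx
 * timestamp queued on a socket's error queue. events is left at 0 on
 * purpose: POLLERR is reported regardless of the requested events,
 * while regular data (POLLIN) does not make poll() return.
 *
 * Returns 1 if POLLERR was raised, 0 on timeout (or if only
 * POLLHUP/POLLNVAL was reported), -1 if poll() itself failed.
 */
static int sketch_wait_for_errqueue(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = 0 };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	return (ret > 0 && (pfd.revents & POLLERR)) ? 1 : 0;
}
```

On a return of 1 the timestamp thread would then typically call
recvmsg() with MSG_ERRQUEUE to drain the error queue, while the receive
thread keeps polling for POLLIN on its own.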

> 
> One solution is adding an option that disables POLLERR as
> a requested event. This is in the Virtual File System
> subsystem, not in the network, though.
> 
> This solves the problem of waking up other threads that
> are not interested in the error queue, thus allowing a separate
> thread to take care of the error queue (useful for receiving
> transmit timestamps).

Yes, this makes sense to me.

Thanks,
Jake

^ permalink raw reply

* Re: Kernel oops with mlx5 and dual XDP redirect programs
From: Toke Høiland-Jørgensen @ 2018-10-18 21:53 UTC (permalink / raw)
  To: Saeed Mahameed, netdev@vger.kernel.org
  Cc: Eran Ben Elisha, Tariq Toukan, brouer@redhat.com
In-Reply-To: <4e2cfdc3db244f4b9483a0c3dfc62fae55238bb3.camel@mellanox.com>

Saeed Mahameed <saeedm@mellanox.com> writes:

> I think that the mlx5 driver doesn't know how to tell the other device
> to stop transmitting to it while it is resetting.. Maybe tariq or
> Jesper know more about this ?
> I will look at this tomorrow after noon and will try to repro...

Hi Saeed

Did you have a chance to poke at this? :)

-Toke

^ permalink raw reply

* [bpf-next PATCH 3/3] bpf: test_sockmap add options to use msg_push_data
From: John Fastabend @ 2018-10-18 21:52 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, john.fastabend
In-Reply-To: <20181018215041.22362.36585.stgit@john-Precision-Tower-5810>

Add options to run msg_push_data. This patch creates two more flags
in test_sockmap that can be used to specify the offset and length
of bytes to be added. The new options are --txmsg_start_push to
specify where bytes should be inserted and --txmsg_end_push to
specify how many bytes. This is analogous to the options that are
used to pull data, --txmsg_start and --txmsg_end.

In addition to adding the options, tests are added to the test
suite, similar to what was done for msg_pull_data.
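
A stand-alone sketch of how the two long options map onto getopt_long()
is below. It is a simplified illustration and not the test_sockmap code
itself: the sketch_ names are hypothetical and only the two new options
are handled.

```c
#include <getopt.h>
#include <stdlib.h>

/* Illustrative stand-ins for the test_sockmap globals. */
static int sketch_start_push;	/* offset at which bytes are inserted */
static int sketch_end_push;	/* number of bytes to insert */

static const struct option sketch_long_options[] = {
	{"txmsg_start_push", required_argument, NULL, 'p'},
	{"txmsg_end_push",   required_argument, NULL, 'q'},
	{0, 0, 0, 0}
};

/* Parse only the two push options. Returns 0 on success, -1 on an
 * unknown option or a missing argument.
 */
static int sketch_parse_push_opts(int argc, char **argv)
{
	int opt;

	optind = 1;	/* restart scanning so the function can be reused */
	while ((opt = getopt_long(argc, argv, ":p:q:",
				  sketch_long_options, NULL)) != -1) {
		switch (opt) {
		case 'p':
			sketch_start_push = atoi(optarg);
			break;
		case 'q':
			sketch_end_push = atoi(optarg);
			break;
		default:
			return -1;
		}
	}
	return 0;
}
```

With the arguments "--txmsg_start_push 4 --txmsg_end_push 10" the sketch
ends up requesting 10 bytes inserted at offset 4. Despite its name, the
"end" option carries a length rather than an end offset.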

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/testing/selftests/bpf/test_sockmap.c      |   58 +++++++++++++-
 tools/testing/selftests/bpf/test_sockmap_kern.h |   97 ++++++++++++++++++-----
 2 files changed, 129 insertions(+), 26 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index cbd1c0b..622ade0 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -77,6 +77,8 @@
 int txmsg_cork;
 int txmsg_start;
 int txmsg_end;
+int txmsg_start_push;
+int txmsg_end_push;
 int txmsg_ingress;
 int txmsg_skb;
 int ktls;
@@ -100,6 +102,8 @@
 	{"txmsg_cork",	required_argument,	NULL, 'k'},
 	{"txmsg_start", required_argument,	NULL, 's'},
 	{"txmsg_end",	required_argument,	NULL, 'e'},
+	{"txmsg_start_push", required_argument,	NULL, 'p'},
+	{"txmsg_end_push",   required_argument,	NULL, 'q'},
 	{"txmsg_ingress", no_argument,		&txmsg_ingress, 1 },
 	{"txmsg_skb", no_argument,		&txmsg_skb, 1 },
 	{"ktls", no_argument,			&ktls, 1 },
@@ -903,6 +907,30 @@ static int run_options(struct sockmap_options *options, int cg_fd,  int test)
 			}
 		}
 
+		if (txmsg_start_push) {
+			i = 2;
+			err = bpf_map_update_elem(map_fd[5],
+						  &i, &txmsg_start_push, BPF_ANY);
+			if (err) {
+				fprintf(stderr,
+					"ERROR: bpf_map_update_elem (txmsg_start_push):  %d (%s)\n",
+					err, strerror(errno));
+				goto out;
+			}
+		}
+
+		if (txmsg_end_push) {
+			i = 3;
+			err = bpf_map_update_elem(map_fd[5],
+						  &i, &txmsg_end_push, BPF_ANY);
+			if (err) {
+				fprintf(stderr,
+					"ERROR: bpf_map_update_elem %i@%i (txmsg_end_push):  %d (%s)\n",
+					txmsg_end_push, i, err, strerror(errno));
+				goto out;
+			}
+		}
+
 		if (txmsg_ingress) {
 			int in = BPF_F_INGRESS;
 
@@ -1235,6 +1263,8 @@ static int test_mixed(int cgrp)
 	txmsg_pass = txmsg_noisy = txmsg_redir_noisy = txmsg_drop = 0;
 	txmsg_apply = txmsg_cork = 0;
 	txmsg_start = txmsg_end = 0;
+	txmsg_start_push = txmsg_end_push = 0;
+
 	/* Test small and large iov_count values with pass/redir/apply/cork */
 	txmsg_pass = 1;
 	txmsg_redir = 0;
@@ -1351,6 +1381,8 @@ static int test_start_end(int cgrp)
 	/* Test basic start/end with lots of iov_count and iov_lengths */
 	txmsg_start = 1;
 	txmsg_end = 2;
+	txmsg_start_push = 1;
+	txmsg_end_push = 2;
 	err = test_txmsg(cgrp);
 	if (err)
 		goto out;
@@ -1364,6 +1396,8 @@ static int test_start_end(int cgrp)
 	for (i = 99; i <= 1600; i += 500) {
 		txmsg_start = 0;
 		txmsg_end = i;
+		txmsg_start_push = 0;
+		txmsg_end_push = i;
 		err = test_exec(cgrp, &opt);
 		if (err)
 			goto out;
@@ -1373,6 +1407,8 @@ static int test_start_end(int cgrp)
 	for (i = 199; i <= 1600; i += 500) {
 		txmsg_start = 100;
 		txmsg_end = i;
+		txmsg_start_push = 100;
+		txmsg_end_push = i;
 		err = test_exec(cgrp, &opt);
 		if (err)
 			goto out;
@@ -1381,6 +1417,8 @@ static int test_start_end(int cgrp)
 	/* Test start/end with cork pulling last sg entry */
 	txmsg_start = 1500;
 	txmsg_end = 1600;
+	txmsg_start_push = 1500;
+	txmsg_end_push = 1600;
 	err = test_exec(cgrp, &opt);
 	if (err)
 		goto out;
@@ -1388,6 +1426,8 @@ static int test_start_end(int cgrp)
 	/* Test start/end pull of single byte in last page */
 	txmsg_start = 1111;
 	txmsg_end = 1112;
+	txmsg_start_push = 1111;
+	txmsg_end_push = 1112;
 	err = test_exec(cgrp, &opt);
 	if (err)
 		goto out;
@@ -1395,6 +1435,8 @@ static int test_start_end(int cgrp)
 	/* Test start/end with end < start */
 	txmsg_start = 1111;
 	txmsg_end = 0;
+	txmsg_start_push = 1111;
+	txmsg_end_push = 0;
 	err = test_exec(cgrp, &opt);
 	if (err)
 		goto out;
@@ -1402,6 +1444,8 @@ static int test_start_end(int cgrp)
 	/* Test start/end with end > data */
 	txmsg_start = 0;
 	txmsg_end = 1601;
+	txmsg_start_push = 0;
+	txmsg_end_push = 1601;
 	err = test_exec(cgrp, &opt);
 	if (err)
 		goto out;
@@ -1409,6 +1453,8 @@ static int test_start_end(int cgrp)
 	/* Test start/end with start > data */
 	txmsg_start = 1601;
 	txmsg_end = 1600;
+	txmsg_start_push = 1601;
+	txmsg_end_push = 1600;
 	err = test_exec(cgrp, &opt);
 
 out:
@@ -1424,7 +1470,7 @@ static int test_start_end(int cgrp)
 	"sock_map_redir",
 	"sock_apply_bytes",
 	"sock_cork_bytes",
-	"sock_pull_bytes",
+	"sock_bytes",
 	"sock_redir_flags",
 	"sock_skb_opts",
 };
@@ -1531,7 +1577,7 @@ static int __test_suite(int cg_fd, char *bpf_file)
 	}
 
 	/* Tests basic commands and APIs with range of iov values */
-	txmsg_start = txmsg_end = 0;
+	txmsg_start = txmsg_end = txmsg_start_push = txmsg_end_push = 0;
 	err = test_txmsg(cg_fd);
 	if (err)
 		goto out;
@@ -1580,7 +1626,7 @@ int main(int argc, char **argv)
 	if (argc < 2)
 		return test_suite(-1);
 
-	while ((opt = getopt_long(argc, argv, ":dhvc:r:i:l:t:",
+	while ((opt = getopt_long(argc, argv, ":dhvc:r:i:l:t:p:q:",
 				  long_options, &longindex)) != -1) {
 		switch (opt) {
 		case 's':
@@ -1589,6 +1635,12 @@ int main(int argc, char **argv)
 		case 'e':
 			txmsg_end = atoi(optarg);
 			break;
+		case 'p':
+			txmsg_start_push = atoi(optarg);
+			break;
+		case 'q':
+			txmsg_end_push = atoi(optarg);
+			break;
 		case 'a':
 			txmsg_apply = atoi(optarg);
 			break;
diff --git a/tools/testing/selftests/bpf/test_sockmap_kern.h b/tools/testing/selftests/bpf/test_sockmap_kern.h
index 8e8e417..14b8bba 100644
--- a/tools/testing/selftests/bpf/test_sockmap_kern.h
+++ b/tools/testing/selftests/bpf/test_sockmap_kern.h
@@ -70,11 +70,11 @@ struct bpf_map_def SEC("maps") sock_cork_bytes = {
 	.max_entries = 1
 };
 
-struct bpf_map_def SEC("maps") sock_pull_bytes = {
+struct bpf_map_def SEC("maps") sock_bytes = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(int),
 	.value_size = sizeof(int),
-	.max_entries = 2
+	.max_entries = 4
 };
 
 struct bpf_map_def SEC("maps") sock_redir_flags = {
@@ -181,8 +181,8 @@ int bpf_sockmap(struct bpf_sock_ops *skops)
 SEC("sk_msg1")
 int bpf_prog4(struct sk_msg_md *msg)
 {
-	int *bytes, zero = 0, one = 1;
-	int *start, *end;
+	int *bytes, zero = 0, one = 1, two = 2, three = 3;
+	int *start, *end, *start_push, *end_push;
 
 	bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
 	if (bytes)
@@ -190,18 +190,24 @@ int bpf_prog4(struct sk_msg_md *msg)
 	bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero);
 	if (bytes)
 		bpf_msg_cork_bytes(msg, *bytes);
-	start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
-	end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
+	start = bpf_map_lookup_elem(&sock_bytes, &zero);
+	end = bpf_map_lookup_elem(&sock_bytes, &one);
 	if (start && end)
 		bpf_msg_pull_data(msg, *start, *end, 0);
+	start_push = bpf_map_lookup_elem(&sock_bytes, &two);
+	end_push = bpf_map_lookup_elem(&sock_bytes, &three);
+	if (start_push && end_push)
+		bpf_msg_push_data(msg, *start_push, *end_push, 0);
 	return SK_PASS;
 }
 
 SEC("sk_msg2")
 int bpf_prog5(struct sk_msg_md *msg)
 {
-	int err1 = -1, err2 = -1, zero = 0, one = 1;
-	int *bytes, *start, *end, len1, len2;
+	int zero = 0, one = 1, two = 2, three = 3;
+	int *start, *end, *start_push, *end_push;
+	int *bytes, len1, len2 = 0, len3;
+	int err1 = -1, err2 = -1;
 
 	bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
 	if (bytes)
@@ -210,8 +216,8 @@ int bpf_prog5(struct sk_msg_md *msg)
 	if (bytes)
 		err2 = bpf_msg_cork_bytes(msg, *bytes);
 	len1 = (__u64)msg->data_end - (__u64)msg->data;
-	start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
-	end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
+	start = bpf_map_lookup_elem(&sock_bytes, &zero);
+	end = bpf_map_lookup_elem(&sock_bytes, &one);
 	if (start && end) {
 		int err;
 
@@ -225,6 +231,23 @@ int bpf_prog5(struct sk_msg_md *msg)
 		bpf_printk("sk_msg2: length update %i->%i\n",
 			   len1, len2);
 	}
+
+	start_push = bpf_map_lookup_elem(&sock_bytes, &two);
+	end_push = bpf_map_lookup_elem(&sock_bytes, &three);
+	if (start_push && end_push) {
+		int err;
+
+		bpf_printk("sk_msg2: push(%i:%i)\n",
+			   start_push ? *start_push : 0,
+			   end_push ? *end_push : 0);
+		err = bpf_msg_push_data(msg, *start_push, *end_push, 0);
+		if (err)
+			bpf_printk("sk_msg2: push_data err %i\n", err);
+		len3 = (__u64)msg->data_end - (__u64)msg->data;
+		bpf_printk("sk_msg2: length push_update %i->%i\n",
+			   len2 ? len2 : len1, len3);
+	}
+
 	bpf_printk("sk_msg2: data length %i err1 %i err2 %i\n",
 		   len1, err1, err2);
 	return SK_PASS;
@@ -233,8 +256,8 @@ int bpf_prog5(struct sk_msg_md *msg)
 SEC("sk_msg3")
 int bpf_prog6(struct sk_msg_md *msg)
 {
-	int *bytes, zero = 0, one = 1, key = 0;
-	int *start, *end, *f;
+	int *bytes, *start, *end, *start_push, *end_push, *f;
+	int zero = 0, one = 1, two = 2, three = 3, key = 0;
 	__u64 flags = 0;
 
 	bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
@@ -243,10 +266,17 @@ int bpf_prog6(struct sk_msg_md *msg)
 	bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero);
 	if (bytes)
 		bpf_msg_cork_bytes(msg, *bytes);
-	start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
-	end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
+
+	start = bpf_map_lookup_elem(&sock_bytes, &zero);
+	end = bpf_map_lookup_elem(&sock_bytes, &one);
 	if (start && end)
 		bpf_msg_pull_data(msg, *start, *end, 0);
+
+	start_push = bpf_map_lookup_elem(&sock_bytes, &two);
+	end_push = bpf_map_lookup_elem(&sock_bytes, &three);
+	if (start_push && end_push)
+		bpf_msg_push_data(msg, *start_push, *end_push, 0);
+
 	f = bpf_map_lookup_elem(&sock_redir_flags, &zero);
 	if (f && *f) {
 		key = 2;
@@ -262,8 +292,9 @@ int bpf_prog6(struct sk_msg_md *msg)
 SEC("sk_msg4")
 int bpf_prog7(struct sk_msg_md *msg)
 {
-	int err1 = 0, err2 = 0, zero = 0, one = 1, key = 0;
-	int *f, *bytes, *start, *end, len1, len2;
+	int zero = 0, one = 1, two = 2, three = 3, len1, len2 = 0, len3;
+	int *bytes, *start, *end, *start_push, *end_push, *f;
+	int err1 = 0, err2 = 0, key = 0;
 	__u64 flags = 0;
 
 		int err;
@@ -274,10 +305,10 @@ int bpf_prog7(struct sk_msg_md *msg)
 	if (bytes)
 		err2 = bpf_msg_cork_bytes(msg, *bytes);
 	len1 = (__u64)msg->data_end - (__u64)msg->data;
-	start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
-	end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
-	if (start && end) {
 
+	start = bpf_map_lookup_elem(&sock_bytes, &zero);
+	end = bpf_map_lookup_elem(&sock_bytes, &one);
+	if (start && end) {
 		bpf_printk("sk_msg2: pull(%i:%i)\n",
 			   start ? *start : 0, end ? *end : 0);
 		err = bpf_msg_pull_data(msg, *start, *end, 0);
@@ -288,6 +319,22 @@ int bpf_prog7(struct sk_msg_md *msg)
 		bpf_printk("sk_msg2: length update %i->%i\n",
 			   len1, len2);
 	}
+
+	start_push = bpf_map_lookup_elem(&sock_bytes, &two);
+	end_push = bpf_map_lookup_elem(&sock_bytes, &three);
+	if (start_push && end_push) {
+		bpf_printk("sk_msg4: push(%i:%i)\n",
+			   start_push ? *start_push : 0,
+			   end_push ? *end_push : 0);
+		err = bpf_msg_push_data(msg, *start_push, *end_push, 0);
+		if (err)
+			bpf_printk("sk_msg4: push_data err %i\n",
+				   err);
+		len3 = (__u64)msg->data_end - (__u64)msg->data;
+		bpf_printk("sk_msg4: length push_update %i->%i\n",
+			   len2 ? len2 : len1, len3);
+	}
+
 	f = bpf_map_lookup_elem(&sock_redir_flags, &zero);
 	if (f && *f) {
 		key = 2;
@@ -342,8 +389,8 @@ int bpf_prog9(struct sk_msg_md *msg)
 SEC("sk_msg7")
 int bpf_prog10(struct sk_msg_md *msg)
 {
-	int *bytes, zero = 0, one = 1;
-	int *start, *end;
+	int *bytes, *start, *end, *start_push, *end_push;
+	int zero = 0, one = 1, two = 2, three = 3;
 
 	bytes = bpf_map_lookup_elem(&sock_apply_bytes, &zero);
 	if (bytes)
@@ -351,10 +398,14 @@ int bpf_prog10(struct sk_msg_md *msg)
 	bytes = bpf_map_lookup_elem(&sock_cork_bytes, &zero);
 	if (bytes)
 		bpf_msg_cork_bytes(msg, *bytes);
-	start = bpf_map_lookup_elem(&sock_pull_bytes, &zero);
-	end = bpf_map_lookup_elem(&sock_pull_bytes, &one);
+	start = bpf_map_lookup_elem(&sock_bytes, &zero);
+	end = bpf_map_lookup_elem(&sock_bytes, &one);
 	if (start && end)
 		bpf_msg_pull_data(msg, *start, *end, 0);
+	start_push = bpf_map_lookup_elem(&sock_bytes, &two);
+	end_push = bpf_map_lookup_elem(&sock_bytes, &three);
+	if (start_push && end_push)
+		bpf_msg_push_data(msg, *start_push, *end_push, 0);
 
 	return SK_DROP;
 }

^ permalink raw reply related

* [bpf-next PATCH 2/3] bpf: libbpf support for msg_push_data
From: John Fastabend @ 2018-10-18 21:52 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, john.fastabend
In-Reply-To: <20181018215041.22362.36585.stgit@john-Precision-Tower-5810>

Add support for new bpf_msg_push_data in libbpf.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 tools/include/uapi/linux/bpf.h            |   20 +++++++++++++++++++-
 tools/testing/selftests/bpf/bpf_helpers.h |    2 ++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 5e46f67..1e9fbc5 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2215,6 +2215,23 @@ struct bpf_stack_build_id {
  *		pointer that was returned from bpf_sk_lookup_xxx\ ().
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_push_data(struct sk_buff *skb, u32 start, u32 len, u64 flags)
+ *	Description
+ *		For socket policies, insert *len* bytes into msg at offset
+ *		*start*.
+ *
+ *		If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a
+ *		*msg* it may want to insert metadata or options into the msg.
+ *		This can later be read and used by any of the lower layer BPF
+ *		hooks.
+ *
+ *		This helper may fail if under memory pressure (a malloc
+ *		fails) in these cases BPF programs will get an appropriate
+ *		error and BPF programs will need to handle them.
+ *
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2303,7 +2320,8 @@ struct bpf_stack_build_id {
 	FN(skb_ancestor_cgroup_id),	\
 	FN(sk_lookup_tcp),		\
 	FN(sk_lookup_udp),		\
-	FN(sk_release),
+	FN(sk_release),			\
+	FN(msg_push_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index fda8c16..4e33511 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -104,6 +104,8 @@ static int (*bpf_msg_cork_bytes)(void *ctx, int len) =
 	(void *) BPF_FUNC_msg_cork_bytes;
 static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
 	(void *) BPF_FUNC_msg_pull_data;
+static int (*bpf_msg_push_data)(void *ctx, int start, int end, int flags) =
+	(void *) BPF_FUNC_msg_push_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
 	(void *) BPF_FUNC_bind;
 static int (*bpf_xdp_adjust_tail)(void *ctx, int offset) =

^ permalink raw reply related

* [bpf-next PATCH 1/3] bpf: sk_msg program helper bpf_msg_push_data
From: John Fastabend @ 2018-10-18 21:52 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, john.fastabend
In-Reply-To: <20181018215041.22362.36585.stgit@john-Precision-Tower-5810>

This allows the user to push data into a msg using sk_msg program types.
The format is as follows,

	bpf_msg_push_data(msg, offset, len, flags)

this will insert 'len' bytes at offset 'offset'. For example, to
prepend 10 bytes at the front of the message the user can call,

	bpf_msg_push_data(msg, 0, 10, 0);

This will invalidate data bounds, so the BPF user will have to recheck
data bounds after calling this. After this the msg size will have been
updated and the user is free to write into the added bytes. We allow
any offset/len as long as it is within the (data, data_end) range.
However, a copy will be required if the ring is full, and it's possible
for the helper to fail with ENOMEM or EINVAL errors, which need to be
handled by the BPF program.

This can be used similarly to XDP metadata to pass data between sk_msg
layer and lower layers.
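
To make the visible effect concrete, here is a small user-space model in
C. It is only a sketch under simplifying assumptions: the message is one
flat buffer rather than a scatterlist ring, the function name is
hypothetical, and -ENOMEM merely stands in for a failed page allocation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Flat-buffer model of the helper's visible effect. buf holds *size
 * valid bytes and has room for cap bytes in total (cap >= *size).
 * Opens a gap of len bytes at offset start. The content of the gap is
 * left unspecified for the caller to fill in.
 */
static int model_push_data(unsigned char *buf, size_t *size, size_t cap,
			   size_t start, size_t len)
{
	if (start >= *size)
		return -EINVAL;	/* offset outside the message */
	if (len > cap - *size)
		return -ENOMEM;	/* stands in for a failed allocation */

	/* Shift the tail right to make room for the new bytes. */
	memmove(buf + start + len, buf + start, *size - start);
	*size += len;
	return 0;
}
```

As with the real helper, the caller has to re-read the size after a
successful call before touching the new bytes, and has to handle both
error returns.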

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h    |    5 ++
 include/uapi/linux/bpf.h |   20 +++++++
 net/core/filter.c        |  134 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 22347b0..677b673 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -207,6 +207,11 @@ static inline struct scatterlist *sk_msg_elem(struct sk_msg *msg, int which)
 	return &msg->sg.data[which];
 }
 
+static inline struct scatterlist sk_msg_elem_cpy(struct sk_msg *msg, int which)
+{
+	return msg->sg.data[which];
+}
+
 static inline struct page *sk_msg_page(struct sk_msg *msg, int which)
 {
 	return sg_page(sk_msg_elem(msg, which));
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5e46f67..1e9fbc5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2215,6 +2215,23 @@ struct bpf_stack_build_id {
  *		pointer that was returned from bpf_sk_lookup_xxx\ ().
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_msg_push_data(struct sk_buff *skb, u32 start, u32 len, u64 flags)
+ *	Description
+ *		For socket policies, insert *len* bytes into msg at offset
+ *		*start*.
+ *
+ *		If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a
+ *		*msg* it may want to insert metadata or options into the msg.
+ *		This can later be read and used by any of the lower layer BPF
+ *		hooks.
+ *
+ *		This helper may fail if under memory pressure (a malloc
+ *		fails) in these cases BPF programs will get an appropriate
+ *		error and BPF programs will need to handle them.
+ *
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2303,7 +2320,8 @@ struct bpf_stack_build_id {
 	FN(skb_ancestor_cgroup_id),	\
 	FN(sk_lookup_tcp),		\
 	FN(sk_lookup_udp),		\
-	FN(sk_release),
+	FN(sk_release),			\
+	FN(msg_push_data),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 1a3ac6c..4bcf238 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2297,6 +2297,137 @@ int skb_do_redirect(struct sk_buff *skb)
 	.arg4_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
+	   u32, len, u64, flags)
+{
+	struct scatterlist sge, nsge, nnsge, rsge = {0}, *psge;
+	int new, i = 0, l, space, copy = 0, offset = 0;
+	u8 *raw, *to, *from;
+	struct page *page;
+
+	if (unlikely(flags))
+		return -EINVAL;
+
+	/* First find the starting scatterlist element */
+	i = msg->sg.start;
+	do {
+		l = sk_msg_elem(msg, i)->length;
+
+		if (start < offset + l)
+			break;
+		offset += l;
+		sk_msg_iter_var_next(i);
+	} while (i != msg->sg.end);
+
+	if (start >= offset + l)
+		return -EINVAL;
+
+	space = MAX_MSG_FRAGS - sk_msg_elem_used(msg);
+
+	/* If no space is available we fall back to copy. We need at
+	 * least one free scatterlist elem to push data into when
+	 * start aligns to the beginning of an element, or two when
+	 * it falls inside an element. We handle the start equals
+	 * offset case because it's the common case for inserting a
+	 * header.
+	 */
+	if (!space || (space == 1 && start != offset))
+		copy = msg->sg.data[i].length;
+
+	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+			   get_order(copy + len));
+	if (unlikely(!page))
+		return -ENOMEM;
+
+	if (copy) {
+		int front, back;
+
+		raw = page_address(page);
+
+		psge = sk_msg_elem(msg, i);
+		front = start - offset;
+		back = psge->length - front;
+		from = sg_virt(psge);
+
+		if (front)
+			memcpy(raw, from, front);
+
+		if (back) {
+			from += front;
+			to = raw + front + len;
+
+			memcpy(to, from, back);
+		}
+
+		put_page(sg_page(psge));
+	} else if (start - offset) {
+		psge = sk_msg_elem(msg, i);
+		rsge = sk_msg_elem_cpy(msg, i);
+
+		psge->length = start - offset;
+		rsge.length -= psge->length;
+		rsge.offset += start;
+
+		sk_msg_iter_var_next(i);
+		sg_unmark_end(psge);
+		sk_msg_iter_next(msg, end);
+	}
+
+	/* Slot(s) to place newly allocated data */
+	new = i;
+
+	/* Shift one or two slots as needed */
+	if (!copy) {
+		sge = sk_msg_elem_cpy(msg, i);
+
+		sk_msg_iter_var_next(i);
+		sg_unmark_end(&sge);
+		sk_msg_iter_next(msg, end);
+
+		nsge = sk_msg_elem_cpy(msg, i);
+		if (rsge.length) {
+			sk_msg_iter_var_next(i);
+			nnsge = sk_msg_elem_cpy(msg, i);
+		}
+
+		while (i != msg->sg.end) {
+			msg->sg.data[i] = sge;
+			sge = nsge;
+			sk_msg_iter_var_next(i);
+			if (rsge.length) {
+				nsge = nnsge;
+				nnsge = sk_msg_elem_cpy(msg, i);
+			} else {
+				nsge = sk_msg_elem_cpy(msg, i);
+			}
+		}
+	}
+
+	/* Place newly allocated data buffer */
+	sk_mem_charge(msg->sk, len);
+	msg->sg.size += len;
+	msg->sg.copy[new] = false;
+	sg_set_page(&msg->sg.data[new], page, len + copy, 0);
+	if (rsge.length) {
+		get_page(sg_page(&rsge));
+		sk_msg_iter_var_next(new);
+		msg->sg.data[new] = rsge;
+	}
+
+	sk_msg_compute_data_pointers(msg);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_msg_push_data_proto = {
+	.func		= bpf_msg_push_data,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_1(bpf_get_cgroup_classid, const struct sk_buff *, skb)
 {
 	return task_get_classid(skb);
@@ -4854,6 +4985,7 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_xdp_adjust_head ||
 	    func == bpf_xdp_adjust_meta ||
 	    func == bpf_msg_pull_data ||
+	    func == bpf_msg_push_data ||
 	    func == bpf_xdp_adjust_tail ||
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
 	    func == bpf_lwt_seg6_store_bytes ||
@@ -5124,6 +5256,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 		return &bpf_msg_cork_bytes_proto;
 	case BPF_FUNC_msg_pull_data:
 		return &bpf_msg_pull_data_proto;
+	case BPF_FUNC_msg_push_data:
+		return &bpf_msg_push_data_proto;
 	case BPF_FUNC_get_local_storage:
 		return &bpf_get_local_storage_proto;
 	default:
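
For readers following the helper above, here is a rough userspace
sketch of its first two steps: locating the element that contains
*start*, then deciding whether enough free slots remain or whether the
element has to be copied. It is only a model under stated assumptions.
It uses a plain array of lengths instead of the sk_msg ring, the names
plan_push, push_plan and MODEL_MAX_FRAGS are hypothetical, and the
value 17 merely stands in for MAX_MSG_FRAGS. The model also rejects any
start at or past the end of the data, which may not match the patch's
bounds check exactly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_FRAGS 17	/* assumption: stands in for MAX_MSG_FRAGS */

/* Outcome of locating byte offset "start" in a list of element lengths. */
struct push_plan {
	int elem;		/* index of the element containing start */
	int free_slots_needed;	/* 1 on an element boundary, otherwise 2 */
	int must_copy;		/* nonzero when too few free slots remain */
};

/* lens[] holds the lengths of the "used" elements, in message order. */
static int plan_push(const unsigned int *lens, int used, unsigned int start,
		     struct push_plan *plan)
{
	unsigned int offset = 0;
	int i, space;

	assert(lens && plan && used > 0 && used <= MODEL_MAX_FRAGS);

	/* Walk the elements until the one covering start is found. */
	for (i = 0; i < used; i++) {
		if (start < offset + lens[i])
			break;
		offset += lens[i];
	}
	if (i == used)
		return -EINVAL;	/* model's choice: start is past the data */

	space = MODEL_MAX_FRAGS - used;
	plan->elem = i;
	/* A boundary insert needs one new slot; a split needs two. */
	plan->free_slots_needed = (start == offset) ? 1 : 2;
	plan->must_copy = space < plan->free_slots_needed;
	return 0;
}
```

The must_copy test is meant to be equivalent to the patch's
"!space || (space == 1 && start != offset)" condition, just phrased in
terms of how many slots the insert would consume.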


* [bpf-next PATCH 0/3] sockmap, bpf_msg_push_data helper
From: John Fastabend @ 2018-10-18 21:51 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, john.fastabend

This series adds a new helper bpf_msg_push_data to be used by
sk_msg programs. The helper can be used to insert extra bytes into
the message that can then be used by the program as metadata tags
among other things.

The first patch adds the helper, second patch the libbpf support,
and last patch updates test_sockmap to run msg_push_data tests.
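
To illustrate what "insert extra bytes into the message" means from the
program's point of view, here is a minimal userspace sketch on a flat
buffer. It is not the kernel implementation: struct flat_msg and
flat_push_data are hypothetical names, the real helper works on a
scatterlist, and the zero fill of the new bytes is an assumption of
this model only (the patch allocates the page without asking for
zeroed memory).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flat-buffer stand-in for an sk_msg; not a kernel type. */
struct flat_msg {
	unsigned char *data;	/* heap buffer owned by the message */
	size_t size;		/* number of valid bytes in data */
};

/* Open a gap of len bytes at offset start, keeping the bytes on both
 * sides. Returns 0 on success or a negative errno, like the helper.
 */
static int flat_push_data(struct flat_msg *m, size_t start, size_t len)
{
	unsigned char *n;

	assert(m && m->data);

	if (start >= m->size)
		return -EINVAL;	/* model's choice: start must be in the data */

	n = malloc(m->size + len);
	if (!n)
		return -ENOMEM;	/* callers must handle allocation failure */

	memcpy(n, m->data, start);			/* bytes before start */
	memset(n + start, 0, len);			/* new room (zeroed here) */
	memcpy(n + start + len, m->data + start,	/* bytes from start on */
	       m->size - start);

	free(m->data);
	m->data = n;
	m->size += len;
	return 0;
}
```

Pushing 3 bytes at offset 2 of "HELLO" in this model leaves "HE",
three new bytes, then "LLO". A program would then write its metadata
tag into the new bytes.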

---

John Fastabend (3):
      bpf: sk_msg program helper bpf_msg_push_data
      bpf: libbpf support for msg_push_data
      bpf: test_sockmap add options to use msg_push_data


 include/linux/skmsg.h                           |    5 +
 include/uapi/linux/bpf.h                        |   20 +++
 net/core/filter.c                               |  134 +++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h                  |   20 +++
 tools/testing/selftests/bpf/bpf_helpers.h       |    2 
 tools/testing/selftests/bpf/test_sockmap.c      |   58 +++++++++-
 tools/testing/selftests/bpf/test_sockmap_kern.h |   97 +++++++++++++----
 7 files changed, 308 insertions(+), 28 deletions(-)


* Re: linux-next: build failure after merge of the net-next tree
From: Or Gerlitz @ 2018-10-19  5:48 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, Linux Netdev List, Doug Ledford, Jason Gunthorpe,
	linux-next, Linux Kernel, Mark Bloch, Leon Romanovsky,
	Paul Blakey, Saeed Mahameed
In-Reply-To: <20181019111805.28860f73@canb.auug.org.au>

On Fri, Oct 19, 2018 at 3:19 AM Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Hi all,
>
> After merging the net-next tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
>
> drivers/infiniband/hw/mlx5/flow.c: In function 'mlx5_ib_handler_MLX5_IB_METHOD_CREATE_FLOW':
> drivers/infiniband/hw/mlx5/flow.c:163:12: error: 'struct mlx5_flow_act' has no member named 'has_flow_tag'; did you mean 'flow_tag'?
>    flow_act.has_flow_tag = true;
>             ^~~~~~~~~~~~
>             flow_tag
>
> Caused by commit
>
>   d5634fee245f ("net/mlx5: Add a no-append flow insertion mode")
>
> interacting with commit
>
>   ba4a41198324 ("RDMA/mlx5: Add support for flow tag to raw create flow")
>
> from the rdma tree.
>
> I have applied the following merge fix patch for today:
>
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Fri, 19 Oct 2018 11:10:39 +1100
> Subject: [PATCH] net/mlx5: fix up for has_flow_tag changing to a flag
>
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
> ---
>  drivers/infiniband/hw/mlx5/flow.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/flow.c b/drivers/infiniband/hw/mlx5/flow.c
> index e57435cb6d96..f86cdcafdafc 100644
> --- a/drivers/infiniband/hw/mlx5/flow.c
> +++ b/drivers/infiniband/hw/mlx5/flow.c
> @@ -160,7 +160,7 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_CREATE_FLOW)(
>                         ret = -EINVAL;
>                         goto err_out;
>                 }
> -               flow_act.has_flow_tag = true;
> +               flow_act.flags |= FLOW_ACT_HAS_TAG;
>         }
>
>         flow_handler = mlx5_ib_raw_fs_rule_add(dev, fs_matcher, &flow_act,


Yeah, this is the correct resolution. Thanks for addressing it!
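
For anyone hitting the same conflict, the change behind the merge fix
is the usual move from a bool member to a bit in a flags word. A small
sketch of that pattern follows. The struct, the MODEL_ names and the
bit values are hypothetical stand-ins, not the mlx5 definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; bit values are illustrative, not mlx5's. */
enum {
	MODEL_FLOW_ACT_HAS_TAG   = 1u << 0,
	MODEL_FLOW_ACT_NO_APPEND = 1u << 1,
};

struct model_flow_act {
	uint32_t flow_tag;
	uint32_t flags;		/* replaces a former "bool has_flow_tag" */
};

/* Record a tag and mark it valid without disturbing other flag bits. */
static void model_set_tag(struct model_flow_act *act, uint32_t tag)
{
	assert(act);
	act->flow_tag = tag;
	act->flags |= MODEL_FLOW_ACT_HAS_TAG;	/* was: has_flow_tag = true */
}

/* Readers test the bit instead of reading the old bool member. */
static bool model_has_tag(const struct model_flow_act *act)
{
	return (act->flags & MODEL_FLOW_ACT_HAS_TAG) != 0;
}
```

Using "|=" rather than plain assignment matters here, since other
users of the same flags word may already have set their own bits.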


* Re: [PATCH net-next] MAINTAINERS: Update contact info for VRF entry
From: David Miller @ 2018-10-18 21:30 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, shrijeet, dsahern
In-Reply-To: <20181018191549.64041-1-dsahern@kernel.org>

From: David Ahern <dsahern@kernel.org>
Date: Thu, 18 Oct 2018 13:15:49 -0600

> From: David Ahern <dsahern@gmail.com>
> 
> Update Shrijeet's email address for the VRF entry.
> 
> Signed-off-by: David Ahern <dsahern@gmail.com>

Applied.
