Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next] r8169: allocate rx buffers using alloc_pages_node
From: Heiner Kallweit @ 2019-08-07 19:38 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller; +Cc: netdev@vger.kernel.org

We allocate 16kb per rx buffer, so we can avoid some overhead by using
alloc_pages_node directly instead of bothering kmalloc_node. Due to
this change buffers are page-aligned now, therefore the alignment check
can be removed.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 45 ++++++++++-------------
 1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 3c7af6669..4f0b32280 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -642,7 +642,7 @@ struct rtl8169_private {
 	struct RxDesc *RxDescArray;	/* 256-aligned Rx descriptor ring */
 	dma_addr_t TxPhyAddr;
 	dma_addr_t RxPhyAddr;
-	void *Rx_databuff[NUM_RX_DESC];	/* Rx data buffers */
+	struct page *Rx_databuff[NUM_RX_DESC];	/* Rx data buffers */
 	struct ring_info tx_skb[NUM_TX_DESC];	/* Tx data buffers */
 	u16 cp_cmd;
 	u16 irq_mask;
@@ -5261,12 +5261,13 @@ static inline void rtl8169_make_unusable_by_asic(struct RxDesc *desc)
 }
 
 static void rtl8169_free_rx_databuff(struct rtl8169_private *tp,
-				     void **data_buff, struct RxDesc *desc)
+				     struct page **data_buff,
+				     struct RxDesc *desc)
 {
-	dma_unmap_single(tp_to_dev(tp), le64_to_cpu(desc->addr),
-			 R8169_RX_BUF_SIZE, DMA_FROM_DEVICE);
+	dma_unmap_page(tp_to_dev(tp), le64_to_cpu(desc->addr),
+		       R8169_RX_BUF_SIZE, DMA_FROM_DEVICE);
 
-	kfree(*data_buff);
+	__free_pages(*data_buff, get_order(R8169_RX_BUF_SIZE));
 	*data_buff = NULL;
 	rtl8169_make_unusable_by_asic(desc);
 }
@@ -5281,38 +5282,30 @@ static inline void rtl8169_mark_to_asic(struct RxDesc *desc)
 	desc->opts1 = cpu_to_le32(DescOwn | eor | R8169_RX_BUF_SIZE);
 }
 
-static struct sk_buff *rtl8169_alloc_rx_data(struct rtl8169_private *tp,
-					     struct RxDesc *desc)
+static struct page *rtl8169_alloc_rx_data(struct rtl8169_private *tp,
+					  struct RxDesc *desc)
 {
-	void *data;
-	dma_addr_t mapping;
 	struct device *d = tp_to_dev(tp);
 	int node = dev_to_node(d);
+	dma_addr_t mapping;
+	struct page *data;
 
-	data = kmalloc_node(R8169_RX_BUF_SIZE, GFP_KERNEL, node);
+	data = alloc_pages_node(node, GFP_KERNEL, get_order(R8169_RX_BUF_SIZE));
 	if (!data)
 		return NULL;
 
-	/* Memory should be properly aligned, but better check. */
-	if (!IS_ALIGNED((unsigned long)data, 8)) {
-		netdev_err_once(tp->dev, "RX buffer not 8-byte-aligned\n");
-		goto err_out;
-	}
-
-	mapping = dma_map_single(d, data, R8169_RX_BUF_SIZE, DMA_FROM_DEVICE);
+	mapping = dma_map_page(d, data, 0, R8169_RX_BUF_SIZE, DMA_FROM_DEVICE);
 	if (unlikely(dma_mapping_error(d, mapping))) {
 		if (net_ratelimit())
 			netif_err(tp, drv, tp->dev, "Failed to map RX DMA!\n");
-		goto err_out;
+		__free_pages(data, get_order(R8169_RX_BUF_SIZE));
+		return NULL;
 	}
 
 	desc->addr = cpu_to_le64(mapping);
 	rtl8169_mark_to_asic(desc);
-	return data;
 
-err_out:
-	kfree(data);
-	return NULL;
+	return data;
 }
 
 static void rtl8169_rx_clear(struct rtl8169_private *tp)
@@ -5337,7 +5330,7 @@ static int rtl8169_rx_fill(struct rtl8169_private *tp)
 	unsigned int i;
 
 	for (i = 0; i < NUM_RX_DESC; i++) {
-		void *data;
+		struct page *data;
 
 		data = rtl8169_alloc_rx_data(tp, tp->RxDescArray + i);
 		if (!data) {
@@ -5892,6 +5885,7 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
 
 	for (rx_left = min(budget, NUM_RX_DESC); rx_left > 0; rx_left--, cur_rx++) {
 		unsigned int entry = cur_rx % NUM_RX_DESC;
+		const void *rx_buf = page_address(tp->Rx_databuff[entry]);
 		struct RxDesc *desc = tp->RxDescArray + entry;
 		u32 status;
 
@@ -5946,9 +5940,8 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
 				goto release_descriptor;
 			}
 
-			prefetch(tp->Rx_databuff[entry]);
-			skb_copy_to_linear_data(skb, tp->Rx_databuff[entry],
-						pkt_size);
+			prefetch(rx_buf);
+			skb_copy_to_linear_data(skb, rx_buf, pkt_size);
 			skb->tail += pkt_size;
 			skb->len = pkt_size;
 
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH net] netdevsim: Restore per-network namespace accounting for fib entries
From: David Ahern @ 2019-08-07 19:31 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Jakub Kicinski, David Ahern, davem, netdev
In-Reply-To: <20190807130739.GA2201@nanopsycho>

On 8/7/19 7:07 AM, Jiri Pirko wrote:
> 
> Yeah. I believe it was a mistake to add it in the first place. Abuses
> netdevsim for something it is not. I'm fine to use devlink the way you
> want to after we conclude 2), but outside netdevsim.
> 
> Again, netdevsim is there for config api testing purposes. If things
> got broken, it is not that bit deal. I broke the way it is
> instantiated significantly for example (iplink->sysfs).
>

netdevsim was created as a way of testing hardware focused kernel APIs
and code paths without hardware. yes?

The devlink api was added to netdevsim to test fib notifiers failing by
handlers. The notifiers were added for mlxsw - a hardware driver - as a
way to get notifications of fib changes.

The easiest way to test the error paths was to code limits to fib
entries very similar to what mlxsw implements with its 'devlink
resource' implementation. ie., to implement 'devlink resource' for a
software emulated device. netdevsim was the most logical place for it.

See the commit logs starting at:

commit b349e0b5ec5d7be57ac243fb08ae8b994c928165
Merge: 6e2135ce54b7 37923ed6b8ce
Author: David S. Miller <davem@davemloft.net>
Date:   Thu Mar 29 14:10:31 2018 -0400

    Merge branch 'net-Allow-FIB-notifiers-to-fail-add-and-replace'

    David Ahern says:

    ====================
    net: Allow FIB notifiers to fail add and replace

Everything about that implementation is using the s/w device to test
code paths targeted at hardware offloads.

^ permalink raw reply

* Re: [PATCH v4 bpf-next 01/14] libbpf: add helpers for working with BTF types
From: Alexei Starovoitov @ 2019-08-07 19:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, yhs, andrii.nakryiko, kernel-team
In-Reply-To: <20190807053806.1534571-2-andriin@fb.com>

On Tue, Aug 06, 2019 at 10:37:53PM -0700, Andrii Nakryiko wrote:
> Add lots of frequently used helpers that simplify working with BTF
> types.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
..
> +/* get bitfield size of a member, assuming t is BTF_KIND_STRUCT or
> + * BTF_KIND_UNION. If member is not a bitfield, zero is returned. */

Invalid comment style.


^ permalink raw reply

* Re: [PATCH v4 bpf-next 02/14] libbpf: convert libbpf code to use new btf helpers
From: Alexei Starovoitov @ 2019-08-07 19:30 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, ast, daniel, yhs, andrii.nakryiko, kernel-team
In-Reply-To: <20190807053806.1534571-3-andriin@fb.com>

On Tue, Aug 06, 2019 at 10:37:54PM -0700, Andrii Nakryiko wrote:
> Simplify code by relying on newly added BTF helper functions.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
..
>  
> -	for (i = 0, vsi = (struct btf_var_secinfo *)(t + 1);
> -	     i < vars; i++, vsi++) {
> +	for (i = 0, vsi = (void *)btf_var_secinfos(t); i < vars; i++, vsi++) {

> +			struct btf_member *m = (void *)btf_members(t);
...
>  		case BTF_KIND_ENUM: {
> -			struct btf_enum *m = (struct btf_enum *)(t + 1);
> -			__u16 vlen = BTF_INFO_VLEN(t->info);
> +			struct btf_enum *m = (void *)btf_enum(t);
> +			__u16 vlen = btf_vlen(t);
...
>  		case BTF_KIND_FUNC_PROTO: {
> -			struct btf_param *m = (struct btf_param *)(t + 1);
> -			__u16 vlen = BTF_INFO_VLEN(t->info);
> +			struct btf_param *m = (void *)btf_params(t);
> +			__u16 vlen = btf_vlen(t);

So all of these 'void *' type hacks are only to drop const-ness ?
May be the helpers shouldn't be taking const then?


^ permalink raw reply

* Re: [PATCH net-next v4 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Heiner Kallweit @ 2019-08-07 19:18 UTC (permalink / raw)
  To: Tao Ren, Andrew Lunn, Florian Fainelli, David S . Miller,
	Arun Parameswaran, Justin Chen, Vladimir Oltean,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	openbmc@lists.ozlabs.org
In-Reply-To: <fe0d39ea-91f3-0cac-f13b-3d46ea1748a3@fb.com>

On 06.08.2019 23:42, Tao Ren wrote:
> Hi Andrew / Heiner / Vladimir,
> 
> On 8/6/19 2:09 PM, Tao Ren wrote:
>> The BCM54616S PHY cannot work properly in RGMII->1000Base-KX mode (for
>> example, on Facebook CMM BMC platform), mainly because genphy functions
>> are designed for copper links, and 1000Base-X (clause 37) auto negotiation
>> needs to be handled differently.
>>
>> This patch enables 1000Base-X support for BCM54616S by customizing 3
>> driver callbacks:
>>
>>   - probe: probe callback detects PHY's operation mode based on
>>     INTERF_SEL[1:0] pins and 1000X/100FX selection bit in SerDES 100-FX
>>     Control register.
>>
>>   - config_aneg: bcm54616s_config_aneg_1000bx function is added for auto
>>     negotiation in 1000Base-X mode.
>>
>>   - read_status: BCM54616S and BCM5482 PHY share the same read_status
>>     callback which manually set link speed and duplex mode in 1000Base-X
>>     mode.
>>
>> Signed-off-by: Tao Ren <taoren@fb.com>
> 
> I customized config_aneg function for BCM54616S 1000Base-X mode and link-down issue is also fixed: the patch is tested on Facebook CMM and Minipack BMC and everything looks normal. Please kindly review when you have bandwidth and let me know if you have further suggestions.
> 
> BTW, I would be happy to help if we decide to add a set of genphy functions for clause 37, although that may mean I need more help/guidance from you :-)

You want to have standard clause 37 aneg and this should be generic in phylib.
I hacked together a first version that is compile-tested only:
https://patchwork.ozlabs.org/patch/1143631/
It supports fixed mode too.

It doesn't support half duplex mode because phylib doesn't know 1000BaseX HD yet.
Not sure whether half duplex mode is used at all in reality.

You could test the new core functions in your own config_aneg and read_status
callback implementations.

> 
> 
> Cheers,
> 
> Tao
> 
Heiner

^ permalink raw reply

* Re: [v3,2/4] tools: bpftool: add net detach command to detach XDP on interface
From: Jakub Kicinski @ 2019-08-07 19:14 UTC (permalink / raw)
  To: Y Song; +Cc: Daniel T. Lee, Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <CAH3MdRW4LgdLoqSpLsWUOwjnNhJA1sodHqSD2Z14JY6aHMaKxg@mail.gmail.com>

On Wed, 7 Aug 2019 10:02:04 -0700, Y Song wrote:
> -bash-4.4$ sudo ./bpftool net detach x dev v2
> -bash-4.4$ sudo ./bpftool net
> xdp:
> v1(4) driver id 1172
> 
> tc:
> eth0(2) clsact/ingress fbflow_icmp id 29 act []
> eth0(2) clsact/egress cls_fg_dscp_section id 27 act []
> eth0(2) clsact/egress fbflow_egress id 28
> eth0(2) clsact/egress fbflow_sslwall_egress id 35
> 
> flow_dissector:
> 
> -bash-4.4$
> 
> Basically detaching may fail due to wrong dev name or wrong type, etc.
> But the tool did not return an error. Is this expected?
> This may be related to this funciton "bpf_set_link_xdp_fd()".
> So this patch itself should be okay.

Yup, I'm pretty sure kernel doesn't return errors on unset if there 
is no program on the interface.

^ permalink raw reply

* [PATCH RFC] net: phy: add support for clause 37 auto-negotiation
From: Heiner Kallweit @ 2019-08-07 19:11 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org

This patch adds support for clause 37 1000Base-X auto-negotiation.
It's compile-tested only as I don't have fiber equipment.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/phy_device.c | 139 +++++++++++++++++++++++++++++++++++
 include/linux/phy.h          |   5 ++
 2 files changed, 144 insertions(+)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 7ddd91df9..f24e3e7e5 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1617,6 +1617,40 @@ static int genphy_config_advert(struct phy_device *phydev)
 	return changed;
 }
 
+/**
+ * genphy_c37_config_advert - sanitize and advertise auto-negotiation parameters
+ * @phydev: target phy_device struct
+ *
+ * Description: Writes MII_ADVERTISE with the appropriate values,
+ *   after sanitizing the values to make sure we only advertise
+ *   what is supported.  Returns < 0 on error, 0 if the PHY's advertisement
+ *   hasn't changed, and > 0 if it has changed. This function is intended
+ *   for Clause 37 1000Base-X mode.
+ */
+static int genphy_c37_config_advert(struct phy_device *phydev)
+{
+	u16 adv = 0;
+
+	/* Only allow advertising what this PHY supports */
+	linkmode_and(phydev->advertising, phydev->advertising,
+		     phydev->supported);
+
+	if (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseX_Full_BIT,
+			      phydev->advertising))
+		adv |= ADVERTISE_1000XFULL;
+	if (linkmode_test_bit(ETHTOOL_LINK_MODE_Pause_BIT,
+			      phydev->advertising))
+		adv |= ADVERTISE_1000XPAUSE;
+	if (linkmode_test_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT,
+			      phydev->advertising))
+		adv |= ADVERTISE_1000XPSE_ASYM;
+
+	return phy_modify_changed(phydev, MII_ADVERTISE,
+				  ADVERTISE_1000XFULL | ADVERTISE_1000XPAUSE |
+				  ADVERTISE_1000XHALF | ADVERTISE_1000XPSE_ASYM,
+				  adv);
+}
+
 /**
  * genphy_config_eee_advert - disable unwanted eee mode advertisement
  * @phydev: target phy_device struct
@@ -1726,6 +1760,54 @@ int genphy_config_aneg(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(genphy_config_aneg);
 
+/**
+ * genphy_c37_config_aneg - restart auto-negotiation or write BMCR
+ * @phydev: target phy_device struct
+ *
+ * Description: If auto-negotiation is enabled, we configure the
+ *   advertising, and then restart auto-negotiation.  If it is not
+ *   enabled, then we write the BMCR. This function is intended
+ *   for use with Clause 37 1000Base-X mode.
+ */
+int genphy_c37_config_aneg(struct phy_device *phydev)
+{
+	int err, changed;
+
+	if (AUTONEG_ENABLE != phydev->autoneg)
+		return genphy_setup_forced(phydev);
+
+	err = phy_modify(phydev, MII_BMCR, BMCR_SPEED1000 | BMCR_SPEED100,
+			 BMCR_SPEED1000);
+	if (err)
+		return err;
+
+	changed = genphy_c37_config_advert(phydev);
+	if (changed < 0) /* error */
+		return changed;
+
+	if (!changed) {
+		/* Advertisement hasn't changed, but maybe aneg was never on to
+		 * begin with?  Or maybe phy was isolated?
+		 */
+		int ctl = phy_read(phydev, MII_BMCR);
+
+		if (ctl < 0)
+			return ctl;
+
+		if (!(ctl & BMCR_ANENABLE) || (ctl & BMCR_ISOLATE))
+			changed = 1; /* do restart aneg */
+	}
+
+	/* Only restart aneg if we are advertising something different
+	 * than we were before.
+	 */
+	if (changed > 0)
+		return genphy_restart_aneg(phydev);
+
+	return 0;
+}
+EXPORT_SYMBOL(genphy_c37_config_aneg);
+
 /**
  * genphy_aneg_done - return auto-negotiation status
  * @phydev: target phy_device struct
@@ -1864,6 +1946,63 @@ int genphy_read_status(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(genphy_read_status);
 
+/**
+ * genphy_c37_read_status - check the link status and update current link state
+ * @phydev: target phy_device struct
+ *
+ * Description: Check the link, then figure out the current state
+ *   by comparing what we advertise with what the link partner
+ *   advertises. This function is for Clause 37 1000Base-X mode.
+ */
+int genphy_c37_read_status(struct phy_device *phydev)
+{
+	int lpa, err, old_link = phydev->link;
+
+	/* Update the link, but return if there was an error */
+	err = genphy_update_link(phydev);
+	if (err)
+		return err;
+
+	/* why bother the PHY if nothing can have changed */
+	if (phydev->autoneg == AUTONEG_ENABLE && old_link && phydev->link)
+		return 0;
+
+	phydev->duplex = DUPLEX_UNKNOWN;
+	phydev->pause = 0;
+	phydev->asym_pause = 0;
+
+	if (phydev->autoneg == AUTONEG_ENABLE && phydev->autoneg_complete) {
+		lpa = phy_read(phydev, MII_LPA);
+		if (lpa < 0)
+			return lpa;
+
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_Autoneg_BIT,
+				 phydev->lp_advertising, lpa & LPA_LPACK);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseX_Full_BIT,
+				 phydev->lp_advertising, lpa & LPA_1000XFULL);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_Pause_BIT,
+				 phydev->lp_advertising, lpa & LPA_1000XPAUSE);
+		linkmode_mod_bit(ETHTOOL_LINK_MODE_Asym_Pause_BIT,
+				 phydev->lp_advertising,
+				 lpa & LPA_1000XPAUSE_ASYM);
+
+		phy_resolve_aneg_linkmode(phydev);
+	} else if (phydev->autoneg == AUTONEG_DISABLE) {
+		int bmcr = phy_read(phydev, MII_BMCR);
+
+		if (bmcr < 0)
+			return bmcr;
+
+		if (bmcr & BMCR_FULLDPLX)
+			phydev->duplex = DUPLEX_FULL;
+		else
+			phydev->duplex = DUPLEX_HALF;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(genphy_c37_read_status);
+
 /**
  * genphy_soft_reset - software reset the PHY via BMCR_RESET bit
  * @phydev: target phy_device struct
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 462b90b73..36cbdfe84 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1077,6 +1077,11 @@ int genphy_suspend(struct phy_device *phydev);
 int genphy_resume(struct phy_device *phydev);
 int genphy_loopback(struct phy_device *phydev, bool enable);
 int genphy_soft_reset(struct phy_device *phydev);
+
+/* Clause 37 */
+int genphy_c37_aneg_done(struct phy_device *phydev);
+int genphy_c37_read_status(struct phy_device *phydev);
+
 static inline int genphy_no_soft_reset(struct phy_device *phydev)
 {
 	return 0;
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH iproute2-next v2] ip tunnel: add json output
From: David Ahern @ 2019-08-07 19:08 UTC (permalink / raw)
  To: Andrea Claudi, netdev; +Cc: stephen, dsahern
In-Reply-To: <a7a161117f3a68e5a0cea008a8ca7e80b42bf2fa.1564766777.git.aclaudi@redhat.com>

On 8/2/19 11:38 AM, Andrea Claudi wrote:
> Add json support on iptunnel and ip6tunnel.
> The plain text output format should remain the same.
> 
> Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
> ---
> Changes since v1:
>  * Use print_color_* for ifname and ip addresses;
>  * Use print_null() instead of print_bool() where appropriate;
>  * Reduce indentation level on tnl_print_gre_flags()
> ---
>  ip/ip6tunnel.c | 66 ++++++++++++++++++++++++---------------
>  ip/iptunnel.c  | 84 ++++++++++++++++++++++++++++++++------------------
>  ip/tunnel.c    | 37 +++++++++++++++++-----
>  3 files changed, 125 insertions(+), 62 deletions(-)
> 

applied to iproute2-next. Thanks



^ permalink raw reply

* Re: [PATCH iproute2-next] rdma: Add driver QP type string
From: David Ahern @ 2019-08-07 19:07 UTC (permalink / raw)
  To: Gal Pressman, Stephen Hemminger; +Cc: netdev, linux-rdma, Leon Romanovsky
In-Reply-To: <20190804080756.58364-1-galpress@amazon.com>

On 8/4/19 2:07 AM, Gal Pressman wrote:
> RDMA resource tracker now tracks driver QPs as well, add driver QP type
> string to qp_types_to_str function.
> 
> Signed-off-by: Gal Pressman <galpress@amazon.com>
> ---
>  rdma/res.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 

applied to iproute2-next. Thanks



^ permalink raw reply

* Re: [RFC] implicit per-namespace devlink instance to set kernel resource limitations
From: Jakub Kicinski @ 2019-08-07 18:57 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrew Lunn, Jiri Pirko, netdev, davem, mlxsw, f.fainelli,
	vivien.didelot, mkubecek, stephen, daniel, brouer, eric.dumazet
In-Reply-To: <153eb34b-05dd-4a85-88d8-e5723f41bbe3@gmail.com>

On Tue, 6 Aug 2019 21:10:40 -0600, David Ahern wrote:
> On 8/6/19 8:59 PM, Andrew Lunn wrote:
> > However, zoom out a bit, from networking to the whole kernel. In
> > general, across the kernel as a whole, resource management is done
> > with cgroups. cgroups is the consistent operational model across the
> > kernel as a whole.
> > 
> > So i think you need a second leg to your argument. You have said why
> > devlink is the right way to do this. But you should also be able to
> > say to Tejun Heo why cgroups is the wrong way to do this, going
> > against the kernel as a whole model. Why is networking special?
> >   
> 
> So you are saying mlxsw should be using a cgroups based API for its
> resources? netdevsim is for testing kernel APIs sans hardware. Is that
> not what the fib controller netdevsim is doing? It is from my perspective.

Why would all the drivers have to pay attention to resource limits?
Shouldn't we try to implement that at a higher layer?

> I am not the one arguing to change code and functionality that has
> existed for 16 months. I am arguing that the existing resource
> controller satisfies all existing goals (testing in kernel APIs) and
> even satisfies additional ones - like a consistent user experience
> managing networking resources. ie.., I see no reason to change what exists.

Please don't use the netdevsim code as an argument that something
already exists. The only legitimate use of that code is to validate
the devlink resource API and that the notifier can fail the insertion.

We try to encourage adding tests and are generally more willing to
merge test code. Possible abuse of that for establishing precedents 
is worrying.

^ permalink raw reply

* Re: [RFC] implicit per-namespace devlink instance to set kernel resource limitations
From: Jakub Kicinski @ 2019-08-07 18:49 UTC (permalink / raw)
  To: David Ahern
  Cc: Andrew Lunn, Jiri Pirko, netdev, davem, mlxsw, f.fainelli,
	vivien.didelot, mkubecek, stephen, daniel, brouer, eric.dumazet
In-Reply-To: <e0047c07-11a0-423c-9560-3806328a0d76@gmail.com>

On Tue, 6 Aug 2019 20:33:47 -0600, David Ahern wrote:
> Some time back supported was added for devlink 'resources'. The idea is
> that hardware (mlxsw) has limited resources (e.g., memory) that can be
> allocated in certain ways (e.g., kvd for mlxsw) thus implementing
> restrictions on the number of programmable entries (e.g., routes,
> neighbors) by userspace.
> 
> I contend:
> 
> 1. The kernel is an analogy to the hardware: it is programmed by
> userspace, has limited resources (e.g., memory), and that users want to
> control (e.g., limit) the number of networking entities that can be
> programmed - routes, rules, nexthop objects etc and by address family
> (ipv4, ipv6).

Memory hierarchy for ASIC is more complex and changes more often than
we want to change the model and kernel ABIs. The API in devlink is
intended for TCAM partitioning.

> 2. A consistent operational model across use cases - s/w forwarding, XDP
> forwarding and hardware forwarding - is good for users deploying systems
> based on the Linux networking stack. This aligns with my basic point at
> LPC last November about better integration of XDP and kernel tables.
> 
> The existing devlink API is the right one for all use cases. Most
> notably that the kernel can mimic the hardware from a resource
> management. Trying to say 'use cgroups for s/w forwarding and devlink
> for h/w forwarding' is complicating the lives of users. It is just a
> model and models can apply to more than some rigid definition.

This argument holds no water. Only a tiny fraction of Linux networking
users will have an high performance forwarding ASIC attached to their
CPUs. So we'll make 99.9% of users who never seen devlink learn the
tool for device control to control kernel resource?

Perhaps I'm misinterpreting your point there.

> As for the namespace piece of this, the kernel's tables for networking
> are *per namespace*, and so the resource controller must be per
> namespace. This aligns with another consistent theme I have promoted
> over the years - the ability to divide up a single ASIC into multiple,
> virtual switches which are managed per namespace. This is a very popular
> feature from a certain legacy vendor and one that would be good for open
> networking to achieve. This is the basis of my response last week about
> the devlink instance per namespace, and I thought Jiri was moving in
> that direction until our chat today. Jiri's intention is something
> different; we can discuss that on the next version of his patches.

Resource limits per namespace make perfect sense. Just not configured
via devlink..

^ permalink raw reply

* Re: [PATCH net v2] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Willem de Bruijn @ 2019-08-07 18:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Willem de Bruijn, David Miller, Network Development, davejwatson,
	borisp, aviadye, John Fastabend, Daniel Borkmann, Eric Dumazet,
	Alexei Starovoitov, oss-drivers
In-Reply-To: <20190807110042.690cf50a@cakuba.netronome.com>

On Wed, Aug 7, 2019 at 2:01 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Wed, 7 Aug 2019 12:59:00 -0400, Willem de Bruijn wrote:
> > On Wed, Aug 7, 2019 at 2:06 AM Jakub Kicinski wrote:
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index d57b0cc995a0..0f9619b0892f 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -1992,6 +1992,20 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
> > >  }
> > >  EXPORT_SYMBOL(skb_set_owner_w);
> > >
> > > +static bool can_skb_orphan_partial(const struct sk_buff *skb)
> > > +{
> > > +#ifdef CONFIG_TLS_DEVICE
> > > +       /* Drivers depend on in-order delivery for crypto offload,
> > > +        * partial orphan breaks out-of-order-OK logic.
> > > +        */
> > > +       if (skb->decrypted)
> > > +               return false;
> > > +#endif
> > > +       return (IS_ENABLED(CONFIG_INET) &&
> > > +               skb->destructor == tcp_wfree) ||
> >
> > Please add parentheses around IS_ENABLED(CONFIG_INET) &&
> > skb->destructor == tcp_wfree
>
> Mm.. there are parenthesis around them, maybe I'm being slow,
> could you show me how?

I mean

    return (skb->destructor == sock_wfree ||
               (IS_ENABLED(CONFIG_INET) && skb->destructor == tcp_wfree))

In other words, (a || (b && c)) instead of (a || b && c). Though the
existing code also eschews the extra parentheses.

> > I was also surprised that this works when tcp_wfree is not defined if
> > !CONFIG_INET. But apparently it does (at -O2?) :)
>
> I was surprised to but in essence it should work the same as
>
>         if (IS_ENABLED(CONFIG_xyz))
>                 call_some_xyz_code();
>
> from compiler's perspective, and we do that a lot. Perhaps kbuild
> bot will prove us wrong :)
>
> > > @@ -984,6 +984,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> > >                         if (!skb)
> > >                                 goto wait_for_memory;
> > >
> > > +#ifdef CONFIG_TLS_DEVICE
> > > +                       skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
> > > +#endif
> >
> > Nothing is stopping userspace from passing this new flag. In send
> > (tcp_sendmsg_locked) it is ignored. But can it reach do_tcp_sendpages
> > through tcp_bpf_sendmsg?
>
> Ah, I think you're right, thanks for checking that :( I don't entirely
> follow how 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through
> ULP") is safe then.
>
> One option would be to clear the flags kernel would previously ignore
> in tcp_bpf_sendmsg(). But I feel like we should just go back to marking
> the socket, since we don't need the per-message flexibility of a flag.
>
> WDYT?

I don't feel strongly either way. Passing flags from send through
tcp_bpf_sendmsg is probably unintentional, so should probably be
addressed anyway? Then this is a bit simpler.

> > >                         skb_entail(sk, skb);
> > >                         copy = size_goal;
> > >                 }
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 6e4afc48d7bb..979520e46e33 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -1320,6 +1320,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
> > >         buff = sk_stream_alloc_skb(sk, nsize, gfp, true);
> > >         if (!buff)
> > >                 return -ENOMEM; /* We'll just try again later. */
> > > +       skb_copy_decrypted(buff, skb);
> >
> > This code has to copy timestamps, tx_flags, zerocopy state and now
> > this in three locations. Eventually we'll want a single helper for all
> > of them..
>
> Ack, should I take an action on that for net-next or was it a
> note-to-self? :)

Note-to-self :)

As a matter of fact, your patch showed me that we actually miss the
tstamp case in tcp_mtu_probe..

^ permalink raw reply

* Re: [PATCH net-next] net/ncsi: allow to customize BMC MAC Address offset
From: Andrew Lunn @ 2019-08-07 18:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Tao Ren, Samuel Mendoza-Jonas, David S . Miller, netdev,
	linux-kernel, openbmc, William Kennington, Joel Stanley
In-Reply-To: <20190807112518.644a21a2@cakuba.netronome.com>

On Wed, Aug 07, 2019 at 11:25:18AM -0700, Jakub Kicinski wrote:
> On Tue, 6 Aug 2019 17:21:18 -0700, Tao Ren wrote:
> > Currently BMC's MAC address is calculated by adding 1 to NCSI NIC's base
> > MAC address when CONFIG_NCSI_OEM_CMD_GET_MAC option is enabled. The logic
> > doesn't work for platforms with different BMC MAC offset: for example,
> > Facebook Yamp BMC's MAC address is calculated by adding 2 to NIC's base
> > MAC address ("BaseMAC + 1" is reserved for Host use).
> > 
> > This patch adds NET_NCSI_MC_MAC_OFFSET config option to customize offset
> > between NIC's Base MAC address and BMC's MAC address. Its default value is
> > set to 1 to avoid breaking existing users.
> > 
> > Signed-off-by: Tao Ren <taoren@fb.com>
> 
> Maybe someone more knowledgeable like Andrew has an opinion here, 
> but to me it seems a bit strange to encode what seems to be platfrom
> information in the kernel config :(

Yes, this is not a good idea. It makes it impossible to have a 'BMC
distro' kernel which you install on a number of different BMCs.

A device tree property would be better. Ideally it would be the
standard local-mac-address, or mac-address.

  Andrew

^ permalink raw reply

* Re: [v3,2/4] tools: bpftool: add net detach command to detach XDP on interface
From: Maciej Fijalkowski @ 2019-08-07 18:30 UTC (permalink / raw)
  To: Y Song; +Cc: Daniel T. Lee, Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <CAH3MdRW4LgdLoqSpLsWUOwjnNhJA1sodHqSD2Z14JY6aHMaKxg@mail.gmail.com>

On Wed, 7 Aug 2019 10:02:04 -0700
Y Song <ys114321@gmail.com> wrote:

> On Tue, Aug 6, 2019 at 7:25 PM Daniel T. Lee <danieltimlee@gmail.com> wrote:
> >
> > By this commit, using `bpftool net detach`, the attached XDP prog can
> > be detached. Detaching the BPF prog will be done through libbpf
> > 'bpf_set_link_xdp_fd' with the progfd set to -1.
> >
> > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> > ---
> >  tools/bpf/bpftool/net.c | 42 ++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 41 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
> > index c05a3fac5cac..7be96acb08e0 100644
> > --- a/tools/bpf/bpftool/net.c
> > +++ b/tools/bpf/bpftool/net.c
> > @@ -343,6 +343,43 @@ static int do_attach(int argc, char **argv)
> >         return 0;
> >  }
> >
> > +static int do_detach(int argc, char **argv)
> > +{
> > +       enum net_attach_type attach_type;
> > +       int progfd, ifindex, err = 0;
> > +
> > +       /* parse detach args */
> > +       if (!REQ_ARGS(3))
> > +               return -EINVAL;
> > +
> > +       attach_type = parse_attach_type(*argv);
> > +       if (attach_type == max_net_attach_type) {
> > +               p_err("invalid net attach/detach type");
> > +               return -EINVAL;
> > +       }
> > +
> > +       NEXT_ARG();
> > +       ifindex = net_parse_dev(&argc, &argv);
> > +       if (ifindex < 1)
> > +               return -EINVAL;
> > +
> > +       /* detach xdp prog */
> > +       progfd = -1;
> > +       if (is_prefix("xdp", attach_type_strings[attach_type]))
> > +               err = do_attach_detach_xdp(progfd, attach_type, ifindex, NULL);  
> 
> I found an issue here. This is probably related to do_attach_detach_xdp.
> 
> -bash-4.4$ sudo ./bpftool net attach x pinned /sys/fs/bpf/xdp_example dev v1
> -bash-4.4$ sudo ./bpftool net
> xdp:
> v1(4) driver id 1172
> 
> tc:
> eth0(2) clsact/ingress fbflow_icmp id 29 act []
> eth0(2) clsact/egress cls_fg_dscp_section id 27 act []
> eth0(2) clsact/egress fbflow_egress id 28
> eth0(2) clsact/egress fbflow_sslwall_egress id 35
> 
> flow_dissector:
> 
> -bash-4.4$ sudo ./bpftool net detach x dev v2

Shouldn't this be v1 as dev?

> -bash-4.4$ sudo ./bpftool net
> xdp:
> v1(4) driver id 1172
> 
> tc:
> eth0(2) clsact/ingress fbflow_icmp id 29 act []
> eth0(2) clsact/egress cls_fg_dscp_section id 27 act []
> eth0(2) clsact/egress fbflow_egress id 28
> eth0(2) clsact/egress fbflow_sslwall_egress id 35
> 
> flow_dissector:
> 
> -bash-4.4$
> 
> Basically detaching may fail due to wrong dev name or wrong type, etc.
> But the tool did not return an error. Is this expected?
> This may be related to this funciton "bpf_set_link_xdp_fd()".
> So this patch itself should be okay.
> 
> > +
> > +       if (err < 0) {
> > +               p_err("interface %s detach failed",
> > +                     attach_type_strings[attach_type]);
> > +               return err;
> > +       }
> > +
> > +       if (json_output)
> > +               jsonw_null(json_wtr);
> > +
> > +       return 0;
> > +}
> > +
> >  static int do_show(int argc, char **argv)
> >  {
> >         struct bpf_attach_info attach_info = {};
> > @@ -419,6 +456,7 @@ static int do_help(int argc, char **argv)
> >         fprintf(stderr,
> >                 "Usage: %s %s { show | list } [dev <devname>]\n"
> >                 "       %s %s attach ATTACH_TYPE PROG dev <devname> [ overwrite ]\n"
> > +               "       %s %s detach ATTACH_TYPE dev <devname>\n"
> >                 "       %s %s help\n"
> >                 "\n"
> >                 "       " HELP_SPEC_PROGRAM "\n"
> > @@ -429,7 +467,8 @@ static int do_help(int argc, char **argv)
> >                 "      to dump program attachments. For program types\n"
> >                 "      sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
> >                 "      consult iproute2.\n",
> > -               bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
> > +               bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
> > +               bin_name, argv[-2]);
> >
> >         return 0;
> >  }
> > @@ -438,6 +477,7 @@ static const struct cmd cmds[] = {
> >         { "show",       do_show },
> >         { "list",       do_show },
> >         { "attach",     do_attach },
> > +       { "detach",     do_detach },
> >         { "help",       do_help },
> >         { 0 }
> >  };
> > --
> > 2.20.1
> >  


^ permalink raw reply

* Re: [PATCH net-next] net/ncsi: allow to customize BMC MAC Address offset
From: Jakub Kicinski @ 2019-08-07 18:25 UTC (permalink / raw)
  To: Tao Ren
  Cc: Samuel Mendoza-Jonas, David S . Miller, netdev, linux-kernel,
	openbmc, William Kennington, Joel Stanley, Andrew Lunn
In-Reply-To: <20190807002118.164360-1-taoren@fb.com>

On Tue, 6 Aug 2019 17:21:18 -0700, Tao Ren wrote:
> Currently BMC's MAC address is calculated by adding 1 to NCSI NIC's base
> MAC address when CONFIG_NCSI_OEM_CMD_GET_MAC option is enabled. The logic
> doesn't work for platforms with different BMC MAC offset: for example,
> Facebook Yamp BMC's MAC address is calculated by adding 2 to NIC's base
> MAC address ("BaseMAC + 1" is reserved for Host use).
> 
> This patch adds NET_NCSI_MC_MAC_OFFSET config option to customize offset
> between NIC's Base MAC address and BMC's MAC address. Its default value is
> set to 1 to avoid breaking existing users.
> 
> Signed-off-by: Tao Ren <taoren@fb.com>

Maybe someone more knowledgeable like Andrew has an opinion here, 
but to me it seems a bit strange to encode what seems to be platfrom
information in the kernel config :(

> diff --git a/net/ncsi/Kconfig b/net/ncsi/Kconfig
> index 2f1e5756c03a..be8efe1ed99e 100644
> --- a/net/ncsi/Kconfig
> +++ b/net/ncsi/Kconfig
> @@ -17,3 +17,11 @@ config NCSI_OEM_CMD_GET_MAC
>  	---help---
>  	  This allows to get MAC address from NCSI firmware and set them back to
>  		controller.
> +config NET_NCSI_MC_MAC_OFFSET
> +	int
> +	prompt "Offset of Management Controller's MAC Address"
> +	depends on NCSI_OEM_CMD_GET_MAC
> +	default 1
> +	help
> +	  This defines the offset between Network Controller's (base) MAC
> +	  address and Management Controller's MAC address.
> diff --git a/net/ncsi/ncsi-rsp.c b/net/ncsi/ncsi-rsp.c
> index 7581bf919885..24a791f9ebf5 100644
> --- a/net/ncsi/ncsi-rsp.c
> +++ b/net/ncsi/ncsi-rsp.c
> @@ -656,6 +656,11 @@ static int ncsi_rsp_handler_oem_bcm_gma(struct ncsi_request *nr)
>  	struct ncsi_rsp_oem_pkt *rsp;
>  	struct sockaddr saddr;
>  	int ret = 0;
> +#ifdef CONFIG_NET_NCSI_MC_MAC_OFFSET
> +	int mac_offset = CONFIG_NET_NCSI_MC_MAC_OFFSET;
> +#else
> +	int mac_offset = 1;
> +#endif
>  
>  	/* Get the response header */
>  	rsp = (struct ncsi_rsp_oem_pkt *)skb_network_header(nr->rsp);
> @@ -663,8 +668,14 @@ static int ncsi_rsp_handler_oem_bcm_gma(struct ncsi_request *nr)
>  	saddr.sa_family = ndev->type;
>  	ndev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
>  	memcpy(saddr.sa_data, &rsp->data[BCM_MAC_ADDR_OFFSET], ETH_ALEN);
> -	/* Increase mac address by 1 for BMC's address */
> -	eth_addr_inc((u8 *)saddr.sa_data);
> +
> +	/* Management Controller's MAC address is calculated by adding
> +	 * the offset to Network Controller's (base) MAC address.
> +	 * Note: negative offset is "ignored", and BMC will use the Base
> +	 * MAC address in this case.
> +	 */
> +	while (mac_offset-- > 0)
> +		eth_addr_inc((u8 *)saddr.sa_data);
>  	if (!is_valid_ether_addr((const u8 *)saddr.sa_data))
>  		return -ENXIO;
>  


^ permalink raw reply

* Re: [PATCH v5 4/4] net: phy: realtek: Add LED configuration support for RTL8211E
From: Andrew Lunn @ 2019-08-07 18:24 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: David S . Miller, Rob Herring, Mark Rutland, Florian Fainelli,
	Heiner Kallweit, netdev, devicetree, linux-kernel,
	Douglas Anderson
In-Reply-To: <20190807170449.205378-5-mka@chromium.org>

> +static int rtl8211e_config_led(struct phy_device *phydev, int led,
> +			       struct phy_led_config *cfg)
> +{
> +	u16 lacr_bits = 0, lcr_bits = 0;
> +	int oldpage, ret;
> +
> +	switch (cfg->trigger.t) {
> +	case PHY_LED_TRIGGER_LINK:
> +		lcr_bits = 7 << (led * 4);
> +		break;
> +
> +	case PHY_LED_TRIGGER_LINK_10M:
> +		lcr_bits = 1 << (led * 4);
> +		break;
> +
> +	case PHY_LED_TRIGGER_LINK_100M:
> +		lcr_bits = 2 << (led * 4);
> +		break;
> +
> +	case PHY_LED_TRIGGER_LINK_1G:
> +		lcr_bits |= 4 << (led * 4);
> +		break;
> +
> +	case PHY_LED_TRIGGER_NONE:
> +		break;
> +
> +	default:
> +		phydev_warn(phydev,
> +			    "unknown trigger for LED%d: %d\n",
> +			    led, cfg->trigger.t);

Lets do this once in the core, not in every driver.

> +		return -EINVAL;
> +	}
> +
> +	if (cfg->trigger.activity)
> +		lacr_bits = BIT(RLT8211E_LACR_LEDACTCTRL_SHIFT + led);
> +
> +	rtl8211e_disable_eee_led_mode(phydev);
> +
> +	oldpage = rtl8211x_select_ext_page(phydev, 0x2c);
> +	if (oldpage < 0) {
> +		phydev_err(phydev, "failed to select extended page: %d\n", oldpage);
> +		return oldpage;
> +	}
> +
> +	ret = __phy_modify(phydev, RTL8211E_LACR,
> +			   LACR_MASK(led), lacr_bits);
> +	if (ret) {
> +		phydev_err(phydev, "failed to write LACR reg: %d\n",
> +			   ret);
> +		goto err;
> +	}
> +
> +	ret = __phy_modify(phydev, RTL8211E_LCR,
> +			   LCR_MASK(led), lcr_bits);
> +	if (ret)
> +		phydev_err(phydev, "failed to write LCR reg: %d\n",
> +			   ret);

That is a lot of phydev_err() calls. The rest of the driver does not
use any. Also, mdio operations are very unlikely to fail. So i would
just leave it to the layer above to report there is a problem.

     Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/5] r8152: replace array with linking list for rx information
From: Jakub Kicinski @ 2019-08-07 18:21 UTC (permalink / raw)
  To: Hayes Wang
  Cc: netdev@vger.kernel.org, nic_swsd, linux-kernel@vger.kernel.org,
	linux-usb@vger.kernel.org
In-Reply-To: <0835B3720019904CB8F7AA43166CEEB2F18D04FA@RTITMBSVM03.realtek.com.tw>

On Wed, 7 Aug 2019 04:34:24 +0000, Hayes Wang wrote:
> Jakub Kicinski [mailto:jakub.kicinski@netronome.com]
> > >  static int rtl_stop_rx(struct r8152 *tp)
> > >  {
> > > -	int i;
> > > +	struct list_head *cursor, *next, tmp_list;
> > > +	unsigned long flags;
> > > +
> > > +	INIT_LIST_HEAD(&tmp_list);
> > >
> > > -	for (i = 0; i < RTL8152_MAX_RX; i++)
> > > -		usb_kill_urb(tp->rx_info[i].urb);
> > > +	/* The usb_kill_urb() couldn't be used in atomic.
> > > +	 * Therefore, move the list of rx_info to a tmp one.
> > > +	 * Then, list_for_each_safe could be used without
> > > +	 * spin lock.
> > > +	 */  
> > 
> > Would you mind explaining in a little more detail why taking the
> > entries from the list for a brief period of time is safe?  
> 
> Usually, it needs the spin lock before accessing the entry
> of the list "tp->rx_info". However, for some reasons,
> if we want to access the entry without spin lock, we
> cloud move all entries to a local list temporally. Then,
> we could make sure no other one could access the entries
> included in the temporal local list.
> 
> For this case, when I move all entries to a temporal 
> local list, no other one could access them. Therefore,
> I could access the entries included in the temporal local
> list without spin lock.

I see, thanks for the explanation.

^ permalink raw reply

* Re: [PATCH v5 2/4] net: phy: Add support for generic LED configuration through the DT
From: Andrew Lunn @ 2019-08-07 18:18 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: David S . Miller, Rob Herring, Mark Rutland, Florian Fainelli,
	Heiner Kallweit, netdev, devicetree, linux-kernel,
	Douglas Anderson
In-Reply-To: <20190807170449.205378-3-mka@chromium.org>

> +static void of_phy_config_leds(struct phy_device *phydev)
> +{
> +	struct device_node *np, *child;
> +	struct phy_led_config cfg;
> +	const char *trigger;
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_OF_MDIO) || !phydev->drv->config_led)
> +		return;
> +
> +	np = of_find_node_by_name(phydev->mdio.dev.of_node, "leds");
> +	if (!np)
> +		return;
> +
> +	for_each_child_of_node(np, child) {
> +		u32 led;
> +
> +		if (of_property_read_u32(child, "reg", &led))
> +			goto skip_config;
> +
> +		ret = of_property_read_string(child, "linux,default-trigger",
> +					      &trigger);
> +		if (ret)
> +			trigger = "none";
> +
> +		memset(&cfg, 0, sizeof(cfg));
> +
> +		if (!strcmp(trigger, "none")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_NONE;
> +		} else if (!strcmp(trigger, "phy-link")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK;
> +		} else if (!strcmp(trigger, "phy-link-10m")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK_10M;
> +		} else if (!strcmp(trigger, "phy-link-100m")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK_100M;
> +		} else if (!strcmp(trigger, "phy-link-1g")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK_1G;
> +		} else if (!strcmp(trigger, "phy-link-10g")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK_10G;
> +		} else if (!strcmp(trigger, "phy-link-activity")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK;
> +			cfg.trigger.activity = true;
> +		} else if (!strcmp(trigger, "phy-link-10m-activity")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK_10M;
> +			cfg.trigger.activity = true;
> +		} else if (!strcmp(trigger, "phy-link-100m-activity")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK_100M;
> +			cfg.trigger.activity = true;
> +		} else if (!strcmp(trigger, "phy-link-1g-activity")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK_1G;
> +			cfg.trigger.activity = true;
> +		} else if (!strcmp(trigger, "phy-link-10g-activity")) {
> +			cfg.trigger.t = PHY_LED_TRIGGER_LINK_10G;
> +			cfg.trigger.activity = true;
> +		} else {
> +			phydev_warn(phydev, "trigger '%s' for LED%d is invalid\n",
> +				    trigger, led);
> +			goto skip_config;
> +		}
> +
> +		phydev->drv->config_led(phydev, led, &cfg);

We should not ignore the return value. I expect drivers will return
-EINVAL, or EOPNOTSUPP if asked to configure an LED in a way they
don't support. We need to report this. So i would add a phydev_warn:

> +			phydev_warn(phydev, "trigger '%s' for LED%d not supported\n",
> +				    trigger, led);

  Andrew

^ permalink raw reply

* Re: [PATCH v5 2/4] net: phy: Add support for generic LED configuration through the DT
From: Andrew Lunn @ 2019-08-07 18:15 UTC (permalink / raw)
  To: Matthias Kaehlcke
  Cc: David S . Miller, Rob Herring, Mark Rutland, Florian Fainelli,
	Heiner Kallweit, netdev, devicetree, linux-kernel,
	Douglas Anderson
In-Reply-To: <20190807170449.205378-3-mka@chromium.org>

> +	for_each_child_of_node(np, child) {
> +		u32 led;
> +
> +		if (of_property_read_u32(child, "reg", &led))
> +			goto skip_config;


> +
> +	skip_config:
> +		of_node_put(child);

There is no need for this put. So long as you don't break out of
for_each_child_of_node() with a return, it will correctly release
child at the end of each loop. A continue statement is also O.K.

      Andrew

^ permalink raw reply

* Re: [bpf-next PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
From: Y Song @ 2019-08-07 18:05 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, Xdp, a.s.protopopov,
	David Ahern
In-Reply-To: <156518138817.5636.3626699603179533211.stgit@firesoul>

On Wed, Aug 7, 2019 at 5:38 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> Make it clear that this XDP program depend on the network
> stack to do the ARP resolution.  This is connected with the
> BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup().
>
> Another common mistake (seen via XDP-tutorial) is that users
> don't realize that sysctl net.ipv{4,6}.conf.all.forwarding
> setting is honored by bpf_fib_lookup.
>
> Reported-by: Anton Protopopov <a.s.protopopov@gmail.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply

* Re: [bpf-next PATCH 2/3] samples/bpf: make xdp_fwd more practically usable via devmap lookup
From: Y Song @ 2019-08-07 18:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, Xdp, a.s.protopopov,
	David Ahern
In-Reply-To: <156518138310.5636.13064696265479533742.stgit@firesoul>

On Wed, Aug 7, 2019 at 5:37 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> This address the TODO in samples/bpf/xdp_fwd_kern.c, which points out
> that the chosen egress index should be checked for existence in the
> devmap. This can now be done via taking advantage of Toke's work in
> commit 0cdbb4b09a06 ("devmap: Allow map lookups from eBPF").
>
> This change makes xdp_fwd more practically usable, as this allows for
> a mixed environment, where IP-forwarding fallback to network stack, if
> the egress device isn't configured to use XDP.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  samples/bpf/xdp_fwd_kern.c |   20 ++++++++++++++------
>  samples/bpf/xdp_fwd_user.c |   36 +++++++++++++++++++++++++-----------
>  2 files changed, 39 insertions(+), 17 deletions(-)
>
> diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
> index e6ffc4ea06f4..4a5ad381ed2a 100644
> --- a/samples/bpf/xdp_fwd_kern.c
> +++ b/samples/bpf/xdp_fwd_kern.c
> @@ -104,13 +104,21 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
>
>         rc = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
>
> -       /* verify egress index has xdp support
> -        * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
> -        *       cannot pass map_type 14 into func bpf_map_lookup_elem#1:
> -        * NOTE: without verification that egress index supports XDP
> -        *       forwarding packets are dropped.
> -        */
>         if (rc == 0) {
> +               int *val;
> +
> +               /* Verify egress index has been configured as TX-port.
> +                * (Note: User can still have inserted an egress ifindex that
> +                * doesn't support XDP xmit, which will result in packet drops).
> +                *
> +                * Note: lookup in devmap supported since 0cdbb4b09a0.
> +                * If not supported will fail with:
> +                *  cannot pass map_type 14 into func bpf_map_lookup_elem#1:
> +                */
> +               val = bpf_map_lookup_elem(&tx_port, &fib_params.ifindex);

It should be "xdp_tx_ports". Otherwise, you will have compilation errors.

> +               if (!val)
> +                       return XDP_PASS;

Also, maybe we can do
         if (!bpf_map_lookup_elem(&tx_port, &fib_params.ifindex))
            return XDP_PASS;
so we do not need to define val at all.

> +
>                 if (h_proto == htons(ETH_P_IP))
>                         ip_decrease_ttl(iph);
>                 else if (h_proto == htons(ETH_P_IPV6))
> diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c
> index ba012d9f93dd..20951bc27477 100644
> --- a/samples/bpf/xdp_fwd_user.c
> +++ b/samples/bpf/xdp_fwd_user.c
> @@ -27,14 +27,20 @@
>  #include "libbpf.h"
>  #include <bpf/bpf.h>
>
> -
> -static int do_attach(int idx, int fd, const char *name)
> +static int do_attach(int idx, int prog_fd, int map_fd, const char *name)
>  {
>         int err;
>
> -       err = bpf_set_link_xdp_fd(idx, fd, 0);
> -       if (err < 0)
> +       err = bpf_set_link_xdp_fd(idx, prog_fd, 0);
> +       if (err < 0) {
>                 printf("ERROR: failed to attach program to %s\n", name);
> +               return err;
> +       }
> +
> +       /* Adding ifindex as a possible egress TX port */
> +       err = bpf_map_update_elem(map_fd, &idx, &idx, 0);
> +       if (err)
> +               printf("ERROR: failed using device %s as TX-port\n", name);
>
>         return err;
>  }
> @@ -47,6 +53,9 @@ static int do_detach(int idx, const char *name)
>         if (err < 0)
>                 printf("ERROR: failed to detach program from %s\n", name);
>
> +       /* TODO: Remember to cleanup map, when adding use of shared map
> +        *  bpf_map_delete_elem((map_fd, &idx);
> +        */
>         return err;
>  }
>
> @@ -67,10 +76,10 @@ int main(int argc, char **argv)
>         };
>         const char *prog_name = "xdp_fwd";
>         struct bpf_program *prog;
> +       int prog_fd, map_fd = -1;
>         char filename[PATH_MAX];
>         struct bpf_object *obj;
>         int opt, i, idx, err;
> -       int prog_fd, map_fd;
>         int attach = 1;
>         int ret = 0;
>
> @@ -103,8 +112,17 @@ int main(int argc, char **argv)
>                         return 1;
>                 }
>
> -               if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> +               err = bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd);
> +               if (err) {
> +                       if (err == -22) {

-EINVAL?

For -EINVAL, many things could go wrong. But maybe the blow error
is the most common one so I am fine with that.

> +                               printf("Does kernel support devmap lookup?\n");
> +                               /* If not, the error message will be:
> +                                * "cannot pass map_type 14 into func
> +                                * bpf_map_lookup_elem#1"
> +                                */
> +                       }
>                         return 1;
> +               }
>
>                 prog = bpf_object__find_program_by_title(obj, prog_name);
>                 prog_fd = bpf_program__fd(prog);
> @@ -119,10 +137,6 @@ int main(int argc, char **argv)
>                         return 1;
>                 }
>         }
> -       if (attach) {
> -               for (i = 1; i < 64; ++i)
> -                       bpf_map_update_elem(map_fd, &i, &i, 0);
> -       }
>
>         for (i = optind; i < argc; ++i) {
>                 idx = if_nametoindex(argv[i]);
> @@ -138,7 +152,7 @@ int main(int argc, char **argv)
>                         if (err)
>                                 ret = err;
>                 } else {
> -                       err = do_attach(idx, prog_fd, argv[i]);
> +                       err = do_attach(idx, prog_fd, map_fd, argv[i]);
>                         if (err)
>                                 ret = err;
>                 }
>

^ permalink raw reply

* Re: [PATCH v3] mlx5: Use refcount_t for refcount
From: Saeed Mahameed @ 2019-08-07 18:03 UTC (permalink / raw)
  To: leon@kernel.org
  Cc: jgg@ziepe.ca, dledford@redhat.com, davem@davemloft.net,
	linux-kernel@vger.kernel.org, hslester96@gmail.com,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190807031717.GB4832@mtr-leonro.mtl.com>

On Wed, 2019-08-07 at 06:17 +0300, Leon Romanovsky wrote:
> On Tue, Aug 06, 2019 at 08:40:11PM +0000, Saeed Mahameed wrote:
> > On Tue, 2019-08-06 at 09:59 +0800, Chuhong Yuan wrote:
> > > Reference counters are preferred to use refcount_t instead of
> > > atomic_t.
> > > This is because the implementation of refcount_t can prevent
> > > overflows and detect possible use-after-free.
> > > So convert atomic_t ref counters to refcount_t.
> > > 
> > > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > > ---
> > > Changes in v3:
> > >   - Merge v2 patches together.
> > > 
> > >  drivers/infiniband/hw/mlx5/srq_cmd.c         | 6 +++---
> > >  drivers/net/ethernet/mellanox/mlx5/core/qp.c | 6 +++---
> > >  include/linux/mlx5/driver.h                  | 3 ++-
> > >  3 files changed, 8 insertions(+), 7 deletions(-)
> > > 
> > 
> > LGTM, Leon, let me know if you are happy with this version,
> > this should go to mlx5-next.
> 
> Thanks,
> Acked-by: Leon Romanovsky <leonro@mellanox.com>

Applied to mlx5-next.

Thanks !

^ permalink raw reply

* Re: [PATCH net v2] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-08-07 18:00 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Network Development, davejwatson, borisp, aviadye,
	John Fastabend, Daniel Borkmann, Eric Dumazet, Alexei Starovoitov,
	oss-drivers
In-Reply-To: <CA+FuTScYkHho4hqrGf9q6=4iao-f2P2s258rjtQTCgn+nF-CYg@mail.gmail.com>

On Wed, 7 Aug 2019 12:59:00 -0400, Willem de Bruijn wrote:
> On Wed, Aug 7, 2019 at 2:06 AM Jakub Kicinski wrote:
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index d57b0cc995a0..0f9619b0892f 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1992,6 +1992,20 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
> >  }
> >  EXPORT_SYMBOL(skb_set_owner_w);
> >
> > +static bool can_skb_orphan_partial(const struct sk_buff *skb)
> > +{
> > +#ifdef CONFIG_TLS_DEVICE
> > +       /* Drivers depend on in-order delivery for crypto offload,
> > +        * partial orphan breaks out-of-order-OK logic.
> > +        */
> > +       if (skb->decrypted)
> > +               return false;
> > +#endif
> > +       return (IS_ENABLED(CONFIG_INET) &&
> > +               skb->destructor == tcp_wfree) ||  
> 
> Please add parentheses around IS_ENABLED(CONFIG_INET) &&
> skb->destructor == tcp_wfree

Mm.. there are parenthesis around them, maybe I'm being slow, 
could you show me how?

> I was also surprised that this works when tcp_wfree is not defined if
> !CONFIG_INET. But apparently it does (at -O2?) :)

I was surprised to but in essence it should work the same as

	if (IS_ENABLED(CONFIG_xyz))
		call_some_xyz_code();

from compiler's perspective, and we do that a lot. Perhaps kbuild 
bot will prove us wrong :)

> > @@ -984,6 +984,9 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> >                         if (!skb)
> >                                 goto wait_for_memory;
> >
> > +#ifdef CONFIG_TLS_DEVICE
> > +                       skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
> > +#endif  
> 
> Nothing is stopping userspace from passing this new flag. In send
> (tcp_sendmsg_locked) it is ignored. But can it reach do_tcp_sendpages
> through tcp_bpf_sendmsg?

Ah, I think you're right, thanks for checking that :( I don't entirely
follow how 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through
ULP") is safe then.

One option would be to clear the flags kernel would previously ignore
in tcp_bpf_sendmsg(). But I feel like we should just go back to marking
the socket, since we don't need the per-message flexibility of a flag.

WDYT?

> >                         skb_entail(sk, skb);
> >                         copy = size_goal;
> >                 }
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 6e4afc48d7bb..979520e46e33 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1320,6 +1320,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
> >         buff = sk_stream_alloc_skb(sk, nsize, gfp, true);
> >         if (!buff)
> >                 return -ENOMEM; /* We'll just try again later. */
> > +       skb_copy_decrypted(buff, skb);  
> 
> This code has to copy timestamps, tx_flags, zerocopy state and now
> this in three locations. Eventually we'll want a single helper for all
> of them..

Ack, should I take an action on that for net-next or was it a
note-to-self? :)

^ permalink raw reply

* Re: [bpf-next PATCH 3/3] samples/bpf: xdp_fwd explain bpf_fib_lookup return codes
From: David Ahern @ 2019-08-07 17:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, Daniel Borkmann,
	Alexei Starovoitov
  Cc: xdp-newbies, a.s.protopopov
In-Reply-To: <156518138817.5636.3626699603179533211.stgit@firesoul>

On 8/7/19 6:36 AM, Jesper Dangaard Brouer wrote:
> Make it clear that this XDP program depend on the network
> stack to do the ARP resolution.  This is connected with the
> BPF_FIB_LKUP_RET_NO_NEIGH return code from bpf_fib_lookup().
> 
> Another common mistake (seen via XDP-tutorial) is that users
> don't realize that sysctl net.ipv{4,6}.conf.all.forwarding
> setting is honored by bpf_fib_lookup.
> 
> Reported-by: Anton Protopopov <a.s.protopopov@gmail.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  samples/bpf/xdp_fwd_kern.c |   19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 

Reviewed-by: David Ahern <dsahern@gmail.com>



^ permalink raw reply

* Re: [bpf-next PATCH 1/3] samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports
From: Y Song @ 2019-08-07 17:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Daniel Borkmann, Alexei Starovoitov, Xdp, a.s.protopopov,
	David Ahern
In-Reply-To: <156518137803.5636.11766023213864836956.stgit@firesoul>

On Wed, Aug 7, 2019 at 5:37 AM Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
> The devmap name 'tx_port' came from a copy-paste from xdp_redirect_map
> which only have a single TX port. Change name to xdp_tx_ports
> to make it more descriptive.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Acked-by: Yonghong Song <yhs@fb.com>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox