Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 2/2] l2tp: add peer_offset parameter
From: Guillaume Nault @ 2017-12-28 14:53 UTC (permalink / raw)
  To: Lorenzo Bianconi; +Cc: davem, netdev, jchapman, liuhangbin
In-Reply-To: <d51dc65a40c5e335984e95a1e6db0eea00d9fd13.1513951129.git.lorenzo.bianconi@redhat.com>

On Fri, Dec 22, 2017 at 03:10:18PM +0100, Lorenzo Bianconi wrote:
> Introduce peer_offset parameter in order to add the capability
> to specify two different values for payload offset on tx/rx side.
> If just offset is provided by userspace use it for rx side as well
> in order to maintain compatibility with older l2tp versions
> 
Sorry for being late on this, I originally missed this patchset and
only noticed it yesterday.

Lorenzo, can you give some context around this new feature?
Quite frankly I can't see the point of it. I've never heard of offsets
in L2TPv3, and for L2TPv2, the offset value is already encoded in the
header.

After a quick review of L2TPv3 and pseudowires RFCs, I still don't see
how adding some padding between the L2TPv3 header and the payload could
constitute a valid frame. Of course, the base feature is already there,
but after a quick test, it looks like the padding bits aren't
initialised and leak memory.

So, unless I missed something, setting offsets in L2TPv3 is
non-compliant, the current implementation is buggy and most likely
unused. I'd really prefer getting the implementation fixed, or even
removed entirely. Extending it to allow asymmetrical offset values
looks wrong to me, unless you have a use case in mind.

Regards,

Guillaume

PS: I also noticed that iproute2 has a "peer_offset" option, but it's a
noop.

^ permalink raw reply

* Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
From: Florian Fainelli @ 2017-12-28 15:02 UTC (permalink / raw)
  To: Antoine Tenart, Andrew Lunn
  Cc: thomas.petazzoni, ymarkman, jason, netdev, linux-kernel,
	Russell King - ARM Linux, kishon, nadavh, miquel.raynal,
	gregory.clement, stefanc, mw, davem, linux-arm-kernel,
	sebastian.hesselbarth
In-Reply-To: <20171228100519.GE2626@kwain>



On 12/28/2017 02:05 AM, Antoine Tenart wrote:
> Hi Andrew,
> 
> On Thu, Dec 28, 2017 at 08:46:23AM +0100, Andrew Lunn wrote:
>> On Wed, Dec 27, 2017 at 10:24:01PM +0000, Russell King - ARM Linux wrote:
>>> On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote:
>>>>  
>>>> +&cps_eth2 {
>>>> +	/* CPS Lane 5 */
>>>> +	status = "okay";
>>>> +	phy-mode = "2500base-x";
>>>> +	/* Generic PHY, providing serdes lanes */
>>>> +	phys = <&cps_comphy5 2>;
>>>> +};
>>>> +
>>>
>>> This is wrong.  This lane is connected to a SFP cage which can support
>>> more than just 2500base-X.  Tying it in this way to 2500base-X means
>>> that this port does not support conenctions at 1000base-X, despite
>>> that's one of the most popular and more standardised speeds.
>>>
>>
>> I agree with Russell here. SFP modules are hot pluggable, and support
>> a range of interface modes. You need to query what the SFP module is
>> in order to know how to configure the SERDES interface. The phylink
>> infrastructure does that for you.
> 
> Sure, I understand. We'll be able to support such interfaces only when
> the phylink PPv2 support lands in.

Should we expect PHYLINK support to make it as the first patch in your
v2 of this patch series, or is someone else doing that?
-- 
Florian

^ permalink raw reply

* [PATCH net-next 4/4] net/mlx4_en: Change default QoS settings
From: Tariq Toukan @ 2017-12-28 14:26 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eran Ben Elisha, Moni Shoua, Maor Gottlieb, Tariq Toukan
In-Reply-To: <1514471171-3894-1-git-send-email-tariqt@mellanox.com>

From: Moni Shoua <monis@mellanox.com>

Change the default mapping between TC and TCG as follows:

Prio     |             TC/TCG
         |      from             to
         |    (set by FW)      (set by SW)
---------+-----------------------------------
0        |      0/0              0/7
1        |      1/0              0/6
2        |      2/0              0/5
3        |      3/0              0/4
4        |      4/0              0/3
5        |      5/0              0/2
6        |      6/0              0/1
7        |      7/0              0/0

These new settings cause that a pause frame for any prio stops
traffic for all prios.

Fixes: 564c274c3df0 ("net/mlx4_en: DCB QoS support")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c | 5 +++++
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 7 +++++++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   | 1 +
 3 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
index 5f41dc92aa68..1a0c3bf86ead 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
@@ -310,6 +310,7 @@ static int mlx4_en_ets_validate(struct mlx4_en_priv *priv, struct ieee_ets *ets)
 		}
 
 		switch (ets->tc_tsa[i]) {
+		case IEEE_8021QAZ_TSA_VENDOR:
 		case IEEE_8021QAZ_TSA_STRICT:
 			break;
 		case IEEE_8021QAZ_TSA_ETS:
@@ -347,6 +348,10 @@ static int mlx4_en_config_port_scheduler(struct mlx4_en_priv *priv,
 	/* higher TC means higher priority => lower pg */
 	for (i = IEEE_8021QAZ_MAX_TCS - 1; i >= 0; i--) {
 		switch (ets->tc_tsa[i]) {
+		case IEEE_8021QAZ_TSA_VENDOR:
+			pg[i] = MLX4_EN_TC_VENDOR;
+			tc_tx_bw[i] = MLX4_EN_BW_MAX;
+			break;
 		case IEEE_8021QAZ_TSA_STRICT:
 			pg[i] = num_strict++;
 			tc_tx_bw[i] = MLX4_EN_BW_MAX;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 99051a294fa6..21bc17fa3854 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -3336,6 +3336,13 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	priv->msg_enable = MLX4_EN_MSG_LEVEL;
 #ifdef CONFIG_MLX4_EN_DCB
 	if (!mlx4_is_slave(priv->mdev->dev)) {
+		u8 prio;
+
+		for (prio = 0; prio < IEEE_8021QAZ_MAX_TCS; ++prio) {
+			priv->ets.prio_tc[prio] = prio;
+			priv->ets.tc_tsa[prio]  = IEEE_8021QAZ_TSA_VENDOR;
+		}
+
 		priv->dcbx_cap = DCB_CAP_DCBX_VER_CEE | DCB_CAP_DCBX_HOST |
 			DCB_CAP_DCBX_VER_IEEE;
 		priv->flags |= MLX4_EN_DCB_ENABLED;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 2b72677eccd4..7db3d0d9bfce 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -479,6 +479,7 @@ struct mlx4_en_frag_info {
 #define MLX4_EN_BW_MIN 1
 #define MLX4_EN_BW_MAX 100 /* Utilize 100% of the line */
 
+#define MLX4_EN_TC_VENDOR 0
 #define MLX4_EN_TC_ETS 7
 
 enum dcb_pfc_type {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 3/4] net/mlx4_core: Cleanup FMR unmapping flow
From: Tariq Toukan @ 2017-12-28 14:26 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Tariq Toukan, Moshe Shemesh
In-Reply-To: <1514471171-3894-1-git-send-email-tariqt@mellanox.com>

Remove redundant and not essential operations in fmr unmap/free.
According to device spec, in FMR unmap it is sufficient to set
ownership bit to SW. This allows remapping afterwards.

Fixes: 8ad11fb6b073 ("IB/mlx4: Implement FMRs")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/mr.c | 40 +++++++++++++++++----------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c
index c7c0764991c9..2e84f10f59ba 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -1103,30 +1103,16 @@ int mlx4_fmr_enable(struct mlx4_dev *dev, struct mlx4_fmr *fmr)
 void mlx4_fmr_unmap(struct mlx4_dev *dev, struct mlx4_fmr *fmr,
 		    u32 *lkey, u32 *rkey)
 {
-	struct mlx4_cmd_mailbox *mailbox;
-	int err;
-
 	if (!fmr->maps)
 		return;
 
-	fmr->maps = 0;
+	/* To unmap: it is sufficient to take back ownership from HW */
+	*(u8 *)fmr->mpt = MLX4_MPT_STATUS_SW;
 
-	mailbox = mlx4_alloc_cmd_mailbox(dev);
-	if (IS_ERR(mailbox)) {
-		err = PTR_ERR(mailbox);
-		pr_warn("mlx4_ib: mlx4_alloc_cmd_mailbox failed (%d)\n", err);
-		return;
-	}
+	/* Make sure MPT status is visible */
+	wmb();
 
-	err = mlx4_HW2SW_MPT(dev, NULL,
-			     key_to_hw_index(fmr->mr.key) &
-			     (dev->caps.num_mpts - 1));
-	mlx4_free_cmd_mailbox(dev, mailbox);
-	if (err) {
-		pr_warn("mlx4_ib: mlx4_HW2SW_MPT failed (%d)\n", err);
-		return;
-	}
-	fmr->mr.enabled = MLX4_MPT_EN_SW;
+	fmr->maps = 0;
 }
 EXPORT_SYMBOL_GPL(mlx4_fmr_unmap);
 
@@ -1136,6 +1122,22 @@ int mlx4_fmr_free(struct mlx4_dev *dev, struct mlx4_fmr *fmr)
 
 	if (fmr->maps)
 		return -EBUSY;
+	if (fmr->mr.enabled == MLX4_MPT_EN_HW) {
+		/* In case of FMR was enabled and unmapped
+		 * make sure to give ownership of MPT back to HW
+		 * so HW2SW_MPT command will success.
+		 */
+		*(u8 *)fmr->mpt = MLX4_MPT_STATUS_SW;
+		/* Make sure MPT status is visible before changing MPT fields */
+		wmb();
+		fmr->mpt->length = 0;
+		fmr->mpt->start  = 0;
+		/* Make sure MPT data is visible after changing MPT status */
+		wmb();
+		*(u8 *)fmr->mpt = MLX4_MPT_STATUS_HW;
+		/* make sure MPT status is visible */
+		wmb();
+	}
 
 	ret = mlx4_mr_free(dev, &fmr->mr);
 	if (ret)
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 1/4] net/mlx4_en: RX csum, remove redundant branches and checks
From: Tariq Toukan @ 2017-12-28 14:26 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Tariq Toukan
In-Reply-To: <1514471171-3894-1-git-send-email-tariqt@mellanox.com>

Do not check IPv6 bit in cqe status if CONFIG_IPV6 is not enabled.
Function check_csum() is reached only with IPv4 or IPv6 set (if enabled),
if IPv6 is not set (or is not enabled) it is redundant to test the
IPv4 bit.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 85e28efcda33..680bd3fbaa60 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -617,6 +617,10 @@ static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff *skb,
 	return 0;
 }
 #endif
+
+/* We reach this function only after checking that any of
+ * the (IPv4 | IPv6) bits are set in cqe->status.
+ */
 static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 		      netdev_features_t dev_features)
 {
@@ -632,13 +636,11 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 		hdr += sizeof(struct vlan_hdr);
 	}
 
-	if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4))
-		return get_fixed_ipv4_csum(hw_checksum, skb, hdr);
 #if IS_ENABLED(CONFIG_IPV6)
 	if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV6))
 		return get_fixed_ipv6_csum(hw_checksum, skb, hdr);
 #endif
-	return 0;
+	return get_fixed_ipv4_csum(hw_checksum, skb, hdr);
 }
 
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
@@ -830,7 +832,11 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			} else {
 				if (priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
 				    (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4 |
+#if IS_ENABLED(CONFIG_IPV6)
 							       MLX4_CQE_STATUS_IPV6))) {
+#else
+							       0))) {
+#endif
 					if (check_csum(cqe, skb, va, dev->features)) {
 						goto csum_none;
 					} else {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 2/4] net/mlx4_en: RX csum, reorder branches
From: Tariq Toukan @ 2017-12-28 14:26 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Tariq Toukan
In-Reply-To: <1514471171-3894-1-git-send-email-tariqt@mellanox.com>

Use early goto commands, and save else branches.
This uses less indentations and brackets, making the code
more readable.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 46 ++++++++++++++----------------
 1 file changed, 21 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 680bd3fbaa60..5f9dbc9a7f5b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -816,37 +816,33 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
-				if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
-				    cqe->checksum == cpu_to_be16(0xffff)) {
-					bool l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
-						(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
-
-					ip_summed = CHECKSUM_UNNECESSARY;
-					hash_type = PKT_HASH_TYPE_L4;
-					if (l2_tunnel)
-						skb->csum_level = 1;
-					ring->csum_ok++;
-				} else {
+				bool l2_tunnel;
+
+				if (!((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) &&
+				      cqe->checksum == cpu_to_be16(0xffff)))
 					goto csum_none;
-				}
+
+				l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
+					(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
+				ip_summed = CHECKSUM_UNNECESSARY;
+				hash_type = PKT_HASH_TYPE_L4;
+				if (l2_tunnel)
+					skb->csum_level = 1;
+				ring->csum_ok++;
 			} else {
-				if (priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
-				    (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4 |
+				if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP &&
+				      (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPV4 |
 #if IS_ENABLED(CONFIG_IPV6)
-							       MLX4_CQE_STATUS_IPV6))) {
+								 MLX4_CQE_STATUS_IPV6))))
 #else
-							       0))) {
+								 0))))
 #endif
-					if (check_csum(cqe, skb, va, dev->features)) {
-						goto csum_none;
-					} else {
-						ip_summed = CHECKSUM_COMPLETE;
-						hash_type = PKT_HASH_TYPE_L3;
-						ring->csum_complete++;
-					}
-				} else {
 					goto csum_none;
-				}
+				if (check_csum(cqe, skb, va, dev->features))
+					goto csum_none;
+				ip_summed = CHECKSUM_COMPLETE;
+				hash_type = PKT_HASH_TYPE_L3;
+				ring->csum_complete++;
 			}
 		} else {
 csum_none:
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next 0/4] mlx4 misc for 4.16
From: Tariq Toukan @ 2017-12-28 14:26 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eran Ben Elisha, Tariq Toukan

Hi Dave,

This patchset contains misc cleanups and improvements
to the mlx4 Core and Eth drivers.

In patches 1 and 2 I reduce and reorder the branches in the RX csum flow.
In patch 3 I align the FMR unmapping flow with the device spec, to allow
  a remapping afterwards.
Patch 4 by Moni changes the default QoS settings so that a pause
  frame stops all traffic regardless of its prio.

Series generated against net-next commit:
836df24a7062 net: hns3: hns3_get_channels() can be static

Thanks,
Tariq.

Moni Shoua (1):
  net/mlx4_en: Change default QoS settings

Tariq Toukan (3):
  net/mlx4_en: RX csum, remove redundant branches and checks
  net/mlx4_en: RX csum, reorder branches
  net/mlx4_core: Cleanup FMR unmapping flow

 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c |  5 +++
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  7 ++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 56 +++++++++++++-------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  1 +
 drivers/net/ethernet/mellanox/mlx4/mr.c        | 40 +++++++++---------
 5 files changed, 63 insertions(+), 46 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [PATCH net-next 1/6] phy: add 2.5G SGMII mode to the phy_mode enum
From: Florian Fainelli @ 2017-12-28 14:16 UTC (permalink / raw)
  To: Antoine Tenart, Andrew Lunn
  Cc: thomas.petazzoni, ymarkman, jason, netdev, linux-kernel, linux,
	kishon, nadavh, miquel.raynal, gregory.clement, stefanc, mw,
	davem, linux-arm-kernel, sebastian.hesselbarth
In-Reply-To: <20171228100656.GF2626@kwain>



On 12/28/2017 02:06 AM, Antoine Tenart wrote:
> Hi Andrew,
> 
> On Thu, Dec 28, 2017 at 08:20:53AM +0100, Andrew Lunn wrote:
>> On Wed, Dec 27, 2017 at 11:14:41PM +0100, Antoine Tenart wrote:
>>> This patch adds one more generic PHY mode to the phy_mode enum, to allow
>>> configuring generic PHYs to the 2.5G SGMII mode by using the set_mode
>>> callback.
>>>
>>> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
>>> ---
>>>  include/linux/phy/phy.h | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/include/linux/phy/phy.h b/include/linux/phy/phy.h
>>> index 4f8423a948d5..70459a28f3a1 100644
>>> --- a/include/linux/phy/phy.h
>>> +++ b/include/linux/phy/phy.h
>>> @@ -28,6 +28,7 @@ enum phy_mode {
>>>  	PHY_MODE_USB_DEVICE,
>>>  	PHY_MODE_USB_OTG,
>>>  	PHY_MODE_SGMII,
>>> +	PHY_MODE_SGMII_2_5G,
>>>  	PHY_MODE_10GKR,
>>>  	PHY_MODE_UFS_HS_A,
>>>  	PHY_MODE_UFS_HS_B,
>>
>> There was a discussion maybe last month about adding 2.5G SGMII. I
>> would prefer 2500SGMII. Putting the number first makes it uniform with
>> the other defines, 1000BASEX, 25000BASEX, 10GKR.
> 
> Good to know. I wasn't completely sure how to name this mode properly,
> but I'm fine with PHY_MODE_2500SGMII. I'll update the patches and send a
> v2 (without the dt part).

And since you are respinning, please make sure you update phy_modes() in
the same header file as well as
Documentation/devicetree/bindings/net/ethernet.txt with the newly added
PHY interface mode.
-- 
Florian

^ permalink raw reply

* Re: INFO: task hung in cleanup_net
From: Dmitry Vyukov @ 2017-12-28 13:56 UTC (permalink / raw)
  To: David Ahern
  Cc: syzbot, LKML, Ingo Molnar, Peter Zijlstra, syzkaller-bugs,
	David Miller, Florian Westphal, Daniel Borkmann, Xin Long,
	jakub.kicinski, mschiffer, Vladislav Yasevich, Jiri Benc, netdev
In-Reply-To: <c41d642e-71ea-422e-2d6d-048a83e45143@gmail.com>

On Tue, Dec 19, 2017 at 7:17 PM, David Ahern <dsahern@gmail.com> wrote:
> On 12/19/17 5:47 AM, Dmitry Vyukov wrote:
>> On Tue, Dec 19, 2017 at 1:33 PM, syzbot
>> <bot+b17f10c8a8c693b40723d40d6553fbc54d197679@syzkaller.appspotmail.com>
>> wrote:
>>> Hello,
>>>
>>> syzkaller hit the following crash on
>>> e40fd8d6b4d9f59b160faa1736f78fc07533ff37
>>> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master
>>> compiler: gcc (GCC) 7.1.1 20170620
>>> .config is attached
>>> Raw console output is attached.
>>>
>>> Unfortunately, I don't have any reproducer for this bug yet.
>>>
>>>
>>> sctp: sctp_transport_update_pmtu: Reported pmtu 508 too low, using default
>>> minimum of 512
>>> sctp: sctp_transport_update_pmtu: Reported pmtu 508 too low, using default
>>> minimum of 512
>>> sctp: sctp_transport_update_pmtu: Reported pmtu 508 too low, using default
>>> minimum of 512
>>> INFO: task kworker/u4:11:3785 blocked for more than 120 seconds.
>>>       Not tainted 4.15.0-rc2-next-20171207+ #61
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> kworker/u4:11   D16448  3785      2 0x80000000
>>> Workqueue: netns cleanup_net
>>> Call Trace:
>>>  context_switch kernel/sched/core.c:2800 [inline]
>>>  __schedule+0x8eb/0x2060 kernel/sched/core.c:3376
>>>  schedule+0xf5/0x430 kernel/sched/core.c:3435
>>>  schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3493
>>>  __mutex_lock_common kernel/locking/mutex.c:833 [inline]
>>>  __mutex_lock+0xaad/0x1a80 kernel/locking/mutex.c:893
>>>  mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
>>>  rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
>>>  cleanup_net+0x24c/0xb60 net/core/net_namespace.c:453
>>>  process_one_work+0xbfd/0x1bc0 kernel/workqueue.c:2113
>>>  worker_thread+0x223/0x1990 kernel/workqueue.c:2247
>>>  kthread+0x37a/0x440 kernel/kthread.c:238
>>>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:524
>>>
>>> Showing all locks held in the system:
>>> 2 locks held by khungtaskd/672:
>>>  #0:  (rcu_read_lock){....}, at: [<00000000b9bef8ab>]
>>> check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline]
>>>  #0:  (rcu_read_lock){....}, at: [<00000000b9bef8ab>] watchdog+0x1c5/0xd60
>>> kernel/hung_task.c:249
>>>  #1:  (tasklist_lock){.+.+}, at: [<000000006d6acf8b>]
>>> debug_show_all_locks+0xd3/0x400 kernel/locking/lockdep.c:4554
>>> 1 lock held by rsyslogd/2970:
>>>  #0:  (&f->f_pos_lock){+.+.}, at: [<000000001cede688>]
>>> __fdget_pos+0x131/0x1a0 fs/file.c:765
>>> 2 locks held by getty/3052:
>>>  #0:  (&tty->ldisc_sem){++++}, at: [<000000004e0245ba>]
>>> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
>>>  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000ff863b18>]
>>> n_tty_read+0x2f2/0x1a10 drivers/tty/n_tty.c:2131
>>> 2 locks held by getty/3053:
>>>  #0:  (&tty->ldisc_sem){++++}, at: [<000000004e0245ba>]
>>> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
>>>  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000ff863b18>]
>>> n_tty_read+0x2f2/0x1a10 drivers/tty/n_tty.c:2131
>>> 2 locks held by getty/3054:
>>>  #0:  (&tty->ldisc_sem){++++}, at: [<000000004e0245ba>]
>>> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
>>>  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000ff863b18>]
>>> n_tty_read+0x2f2/0x1a10 drivers/tty/n_tty.c:2131
>>> 2 locks held by getty/3055:
>>>  #0:  (&tty->ldisc_sem){++++}, at: [<000000004e0245ba>]
>>> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
>>>  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000ff863b18>]
>>> n_tty_read+0x2f2/0x1a10 drivers/tty/n_tty.c:2131
>>> 2 locks held by getty/3056:
>>>  #0:  (&tty->ldisc_sem){++++}, at: [<000000004e0245ba>]
>>> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
>>>  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000ff863b18>]
>>> n_tty_read+0x2f2/0x1a10 drivers/tty/n_tty.c:2131
>>> 2 locks held by getty/3057:
>>>  #0:  (&tty->ldisc_sem){++++}, at: [<000000004e0245ba>]
>>> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
>>>  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000ff863b18>]
>>> n_tty_read+0x2f2/0x1a10 drivers/tty/n_tty.c:2131
>>> 2 locks held by getty/3058:
>>>  #0:  (&tty->ldisc_sem){++++}, at: [<000000004e0245ba>]
>>> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
>>>  #1:  (&ldata->atomic_read_lock){+.+.}, at: [<00000000ff863b18>]
>>> n_tty_read+0x2f2/0x1a10 drivers/tty/n_tty.c:2131
>>> 1 lock held by syz-executor2/3320:
>>>  #0:  (rtnl_mutex){+.+.}, at: [<0000000063c47ced>] rtnl_lock+0x17/0x20
>>> net/core/rtnetlink.c:74
>>> 4 locks held by kworker/u4:11/3785:
>>>  #0:  ((wq_completion)"%s""netns"){+.+.}, at: [<0000000087d81703>]
>>> __write_once_size include/linux/compiler.h:212 [inline]
>>>  #0:  ((wq_completion)"%s""netns"){+.+.}, at: [<0000000087d81703>]
>>> atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>>>  #0:  ((wq_completion)"%s""netns"){+.+.}, at: [<0000000087d81703>]
>>> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>>>  #0:  ((wq_completion)"%s""netns"){+.+.}, at: [<0000000087d81703>]
>>> set_work_data kernel/workqueue.c:619 [inline]
>>>  #0:  ((wq_completion)"%s""netns"){+.+.}, at: [<0000000087d81703>]
>>> set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
>>>  #0:  ((wq_completion)"%s""netns"){+.+.}, at: [<0000000087d81703>]
>>> process_one_work+0xad4/0x1bc0 kernel/workqueue.c:2084
>>>  #1:  (net_cleanup_work){+.+.}, at: [<000000008a817b5c>]
>>> process_one_work+0xb2f/0x1bc0 kernel/workqueue.c:2088
>>>  #2:  (net_mutex){+.+.}, at: [<0000000079150fd8>] cleanup_net+0x247/0xb60
>>> net/core/net_namespace.c:450
>>>  #3:  (rtnl_mutex){+.+.}, at: [<0000000063c47ced>] rtnl_lock+0x17/0x20
>>> net/core/rtnetlink.c:74
>>> 3 locks held by kworker/0:6/14978:
>>>  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<0000000087d81703>]
>>> __write_once_size include/linux/compiler.h:212 [inline]
>>>  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<0000000087d81703>]
>>> atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
>>>  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<0000000087d81703>]
>>> atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
>>>  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<0000000087d81703>]
>>> set_work_data kernel/workqueue.c:619 [inline]
>>>  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<0000000087d81703>]
>>> set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
>>>  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<0000000087d81703>]
>>> process_one_work+0xad4/0x1bc0 kernel/workqueue.c:2084
>>>  #1:  ((addr_chk_work).work){+.+.}, at: [<000000008a817b5c>]
>>> process_one_work+0xb2f/0x1bc0 kernel/workqueue.c:2088
>>>  #2:  (rtnl_mutex){+.+.}, at: [<0000000063c47ced>] rtnl_lock+0x17/0x20
>>> net/core/rtnetlink.c:74
>>> 1 lock held by syz-executor2/15852:
>>>  #0:  (rtnl_mutex){+.+.}, at: [<0000000063c47ced>] rtnl_lock+0x17/0x20
>>> net/core/rtnetlink.c:74
>>> 1 lock held by syz-executor7/15958:
>>>  #0:  (rtnl_mutex){+.+.}, at: [<0000000063c47ced>] rtnl_lock+0x17/0x20
>>> net/core/rtnetlink.c:74
>>> 1 lock held by syz-executor7/15961:
>>>  #0:  (rtnl_mutex){+.+.}, at: [<0000000063c47ced>] rtnl_lock+0x17/0x20
>>> net/core/rtnetlink.c:74
>>>
>>> =============================================
>>>
>>> NMI backtrace for cpu 1
>>> CPU: 1 PID: 672 Comm: khungtaskd Not tainted 4.15.0-rc2-next-20171207+ #61
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>>> Google 01/01/2011
>>> Call Trace:
>>>  __dump_stack lib/dump_stack.c:17 [inline]
>>>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>>  nmi_cpu_backtrace+0x1d2/0x210 lib/nmi_backtrace.c:103
>>>  nmi_trigger_cpumask_backtrace+0x122/0x180 lib/nmi_backtrace.c:62
>>>  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
>>>  trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
>>>  check_hung_task kernel/hung_task.c:132 [inline]
>>>  check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
>>>  watchdog+0x90c/0xd60 kernel/hung_task.c:249
>>>  kthread+0x37a/0x440 kernel/kthread.c:238
>>>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:524
>>> Sending NMI from CPU 1 to CPUs 0:
>>> NMI backtrace for cpu 0
>>> CPU: 0 PID: 15938 Comm: syz-executor7 Not tainted 4.15.0-rc2-next-20171207+
>>> #61
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>>> Google 01/01/2011
>>> RIP: 0010:inb arch/x86/include/asm/io.h:348 [inline]
>>> RIP: 0010:io_serial_in+0x6b/0x90 drivers/tty/serial/8250/8250_port.c:434
>>> RSP: 0018:ffff8801c1c173e8 EFLAGS: 00000002
>>> RAX: dffffc0000000000 RBX: 00000000000003fd RCX: 0000000000000000
>>> RDX: 00000000000003fd RSI: ffffc90001c88000 RDI: ffffffff880fa9c0
>>> RBP: ffff8801c1c173f8 R08: 0000000000000002 R09: 000000000000000c
>>> R10: 0000000000000000 R11: ffffffff87896d20 R12: ffffffff880fa980
>>> R13: 0000000000000020 R14: fffffbfff101f577 R15: fffffbfff101f53a
>>> FS:  00007f1ef8fad700(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00007fbea91b6000 CR3: 00000001d0aaa000 CR4: 00000000001406f0
>>> DR0: 0000000020001000 DR1: 0000000020001000 DR2: 0000000020001020
>>> DR3: 0000000020000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
>>> Call Trace:
>>>  serial_in drivers/tty/serial/8250/8250.h:111 [inline]
>>>  wait_for_xmitr+0x93/0x1e0 drivers/tty/serial/8250/8250_port.c:2033
>>>  serial8250_console_putchar+0x1f/0x60
>>> drivers/tty/serial/8250/8250_port.c:3170
>>>  uart_console_write+0xac/0xe0 drivers/tty/serial/serial_core.c:1858
>>>  serial8250_console_write+0x647/0xa20
>>> drivers/tty/serial/8250/8250_port.c:3236
>>>  univ8250_console_write+0x5f/0x70 drivers/tty/serial/8250/8250_core.c:590
>>>  call_console_drivers kernel/printk/printk.c:1574 [inline]
>>>  console_unlock+0x788/0xd70 kernel/printk/printk.c:2233
>>>  vprintk_emit+0x4ad/0x590 kernel/printk/printk.c:1757
>>>  vprintk_default+0x28/0x30 kernel/printk/printk.c:1796
>>>  vprintk_func+0x57/0xc0 kernel/printk/printk_safe.c:379
>>>  printk+0xaa/0xca kernel/printk/printk.c:1829
>>>  __dev_set_promiscuity+0x2a4/0x630 net/core/dev.c:6609
>>>  __dev_change_flags+0x559/0x990 net/core/dev.c:6826
>>>  dev_change_flags+0x88/0x140 net/core/dev.c:6886
>>>  dev_ifsioc+0x60d/0x9b0 net/core/dev_ioctl.c:257
>>>  dev_ioctl+0x2c2/0xf90 net/core/dev_ioctl.c:566
>>>  sock_do_ioctl+0x94/0xb0 net/socket.c:971
>>>  sock_ioctl+0x2c2/0x440 net/socket.c:1061
>>>  vfs_ioctl fs/ioctl.c:46 [inline]
>>>  do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:686
>>>  SYSC_ioctl fs/ioctl.c:701 [inline]
>>>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>>>  entry_SYSCALL_64_fastpath+0x1f/0x96
>>> RIP: 0033:0x452a39
>>> RSP: 002b:00007f1ef8facc58 EFLAGS: 00000212 ORIG_RAX: 0000000000000010
>>> RAX: ffffffffffffffda RBX: 00007f1ef8fad700 RCX: 0000000000452a39
>>> RDX: 00000000208a3fe0 RSI: 0000000000008914 RDI: 0000000000000019
>>> RBP: 0000000000a6f880 R08: 0000000000000000 R09: 0000000000000000
>>> R10: 0000000000000000 R11: 0000000000000212 R12: 0000000000000000
>>> R13: 0000000000a6f7ff R14: 00007f1ef8fad9c0 R15: 0000000000000002
>>> Code: 24 d9 00 00 00 49 8d 7c 24 40 48 b8 00 00 00 00 00 fc ff df 48 89 fa
>>> 48 c1 ea 03 d3 e3 80 3c 02 00 75 17 41 03 5c 24 40 89 da ec <5b> 0f b6 c0 41
>>> 5c 5d c3 e8 88 ad 18 ff eb c2 e8 e1 ad 18 ff eb
>>
>>
>> This looks like +rtnetlink issue.
>
> Perhaps related to / fixed by: http://patchwork.ozlabs.org/patch/850957/

If you think so, let's tell syzbot about it:

#syz fix: net: Fix double free and memory corruption in get_net_ns_by_id()

^ permalink raw reply

* Re: [PATCH 1/3] net: Fix possible race in peernet2id_alloc()
From: Kirill Tkhai @ 2017-12-28 12:55 UTC (permalink / raw)
  To: Eric W. Biederman, davem; +Cc: netdev, eric.dumazet
In-Reply-To: <54b5eb20-0357-d0f2-75ef-3de093c3a283@virtuozzo.com>

Hi, David,

I see the status of patchset has changed on https://patchwork.ozlabs.org/patch/851928/
and now it is "Changes Requested".

There was Eric's reply, and my reply on his message. After that there wasn't objections on
my reply or other patches.

Could you please clarify the status or what I should do with the patchset
(because it's not clear for me)?

Thanks,
Kirill

On 22.12.2017 13:30, Kirill Tkhai wrote:
> Hi, Eric,
> 
> On 21.12.2017 20:39, Eric W. Biederman wrote:
>> Kirill Tkhai <ktkhai@virtuozzo.com> writes:
>>
>>> peernet2id_alloc() is racy without rtnl_lock() as atomic_read(&peer->count)
>>> under net->nsid_lock does not guarantee, peer is alive:
>>>
>>> rcu_read_lock()
>>> peernet2id_alloc()                            ..
>>>   spin_lock_bh(&net->nsid_lock)               ..
>>>   atomic_read(&peer->count) == 1              ..
>>>   ..                                          put_net()
>>>   ..                                            cleanup_net()
>>>   ..                                              for_each_net(tmp)
>>>   ..                                                spin_lock_bh(&tmp->nsid_lock)
>>>   ..                                                __peernet2id(tmp, net) == -1
>>>   ..                                                    ..
>>>   ..                                                    ..
>>>     __peernet2id_alloc(alloc == true)                   ..
>>>   ..                                                    ..
>>> rcu_read_unlock()                                       ..
>>> ..                                                synchronize_rcu()
>>> ..                                                kmem_cache_free(net)
>>>
>>> After the above situation, net::netns_id contains id pointing to freed memory,
>>> and any other dereferencing by the id will operate with this freed memory.
>>>
>>> Currently, peernet2id_alloc() is used under rtnl_lock() everywhere except
>>> ovs_vport_cmd_fill_info(), and this race can't occur. But peernet2id_alloc()
>>> is generic interface, and better we fix it before someone really starts
>>> use it in wrong context.
>>
>> So it comes down to this piece of code from ovs and just let me say ick.
>> 	if (!net_eq(net, dev_net(vport->dev))) {
>> 		int id = peernet2id_alloc(net, dev_net(vport->dev));
>>
>> 		if (nla_put_s32(skb, OVS_VPORT_ATTR_NETNSID, id))
>> 			goto nla_put_failure;
>> 	}
>>
>> Without the rtnl lock dev_net can cange between the test and the
>> call of peernet2id_alloc.
> 
> The places like this are usually protected via netdevice notifier. There is
> ovs_dp_device_notifier in openvswitch and it seems to be a handler for
> the situation above. But I'm not sure it's safe, and it's not clear for
> me why it does not take ovl_mutex to synchronize with ovs_vport_cmd_fill_info(),
> and whether there is another synchronization.
> 
>> At first glance it looks like the bug is that we are running a control
>> path of the networking stack without the rtnl lock. So it may be that
>> ASSERT_RTNL() is the better fix.
> 
> Do you mean inserting ASSERT_RTNL() into peernet2id_alloc()?
> If so, it's not need, as all nsid are protected via separate lock starting from
> 
> commit 95f38411df05 "netns: use a spin_lock to protect nsid management"
> 
> and commit de133464c9e7 "netns: make nsid_lock per net".
> 
> nsid primitives are aimed to be safe without respect to rtnl_lock().
> My patch just adds some sanity checks to make them more safe.
>  
>> Given that it would be nice to reduce the scope of the rtnl lock this
>> might not be a bad direction.  Let me see.
>>> Is rtnl_notify safe without the rtnl lock?
> 
> It seems they have to be safe, as there is only sending a skb to netlink.
> rtnl_notify() is used without rtnl_lock() in most other places of net code.
> 
> The only reason rtnl_lock() is needed in cleanup_net() is we unlink net from the list.
> rtnl_lock() around nsid cleanup is leftover after nsid are made protected via
> separate lock. So we may relax rtnl_lock() in cleanup_net() and I'm going to do that
> in one of next patches (not yet sent).
> 
>>>
>>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>>> ---
>>>  net/core/net_namespace.c |   23 +++++++++++++++++++----
>>>  1 file changed, 19 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>>> index 60a71be75aea..6a4eab438221 100644
>>> --- a/net/core/net_namespace.c
>>> +++ b/net/core/net_namespace.c
>>> @@ -221,17 +221,32 @@ static void rtnl_net_notifyid(struct net *net, int cmd, int id);
>>>   */
>>>  int peernet2id_alloc(struct net *net, struct net *peer)
>>>  {
>>> -	bool alloc;
>>> +	bool alloc = false, alive = false;
>>>  	int id;
>>
>>         ^^^ Perhaps we want "ASSERT_RTNL();" here?
>>>  
>>> -	if (atomic_read(&net->count) == 0)
>>> -		return NETNSA_NSID_NOT_ASSIGNED;
>>
>> Moving this hunk is of no benefit.  The code must be called with a valid
>> reference to net.   Which means net->count is a fancy way of testing to
>> see if the code is in cleanup_net.  In all other cases net->count should
>> be non-zero and it should remain that way because of our caller must
>> keep a reference.
> 
> Let me discuss about this.
> This interface is exported to drivers, and they don't always follow the rules
> about what is allowed or disallowed. This hunk adds sanity checks, which
> protects us from bad drivers code etc. Some people want to use peernet2id_alloc()
> in generic functions, which may be called in different context (also, for net
> obtained from rcu list). Instead of every user makes such a test in its function,
> this hunks introduces generic valid check, which allows people do not insert n+1
> their own checks.
> 
> Also, unlocked check does not give performance advantages as later we take the lock
> *almost always*. So, my suggestion is to make peernet2id_alloc() more safe for
> end user.
>  
>>>  	spin_lock_bh(&net->nsid_lock);
>>> -	alloc = atomic_read(&peer->count) == 0 ? false : true;
>>> +	/* Spinlock guarantees we never hash a peer to net->netns_ids
>>> +	 * after idr_destroy(&net->netns_ids) occurs in cleanup_net().
>>> +	 */
>>> +	if (atomic_read(&net->count) == 0) {
>>> +		id = NETNSA_NSID_NOT_ASSIGNED;
>>> +		goto unlock;
>>> +	}
>>> +	/*
>>> +	 * When peer is obtained from RCU lists, we may race with
>>> +	 * its cleanup. Check whether it's alive, and this guarantees
>>> +	 * we never hash a peer back to net->netns_ids, after it has
>>> +	 * just been idr_remove()'d from there in cleanup_net().
>>> +	 */
>>> +	if (maybe_get_net(peer))
>>> +		alive = alloc = true;
>>
>> Yes this does seem reasonable.  The more obvious looking code which
>> would return NETNSA_NSID_NOT_ASSIGNED if the peer has a count of 0, is
>> silly as it makes would make it appear that a peer is momentary outside
>> of a network namespace when the peer is in fact moving from one network
>> namespace to another.
>>         
>>>  	id = __peernet2id_alloc(net, peer, &alloc);
>>> +unlock:
>>>  	spin_unlock_bh(&net->nsid_lock);
>>>  	if (alloc && id >= 0)
>>>  		rtnl_net_notifyid(net, RTM_NEWNSID, id);
>>                 ^^^^^^
>>                 Is this safe without the rtnl lock?
> 
> Yes, it seems to be safe. Please, look above.
> 
>>> +	if (alive)
>>> +		put_net(peer);
>>>  	return id;
>>>  }
>>>  EXPORT_SYMBOL_GPL(peernet2id_alloc);
> 
> Kirill
> 

^ permalink raw reply

* [RFC] Supporting namespaces in the connector driver
From: Elad Wexler @ 2017-12-28 12:05 UTC (permalink / raw)
  To: netdev, linux-kernel, zbr

Hi,

I have spent some time looking at the cn_proc monitor capabilities
which use the connector driver for monitoring processes.

Currently the cn_proc (and all other modules which are using the connector)
only works on the main init_user_ns namespace.
If I am working inside a docker container (as an example), I won't be able to monitor process
creation, for example fork()/exec() etc ...

Of course I will be able to monitor any process from the host namespace (init_user_ns)
but I would like also to be able to monitor the processes that belong to the same
docker container, (belong to the same namespace)

I wonder if there is a plan to add pernet support for the connector?
If you think it there is a good reason to add that? (I can provide a few patches that do that)

I will be happy to work on to add a support for that, and actually I have started to
do some modification to make it works. and still testing it.

Thanks

^ permalink raw reply

* [PATCH iproute2] gre/tunnel: Print erspan_index using print_uint()
From: Serhey Popovych @ 2017-12-28 11:12 UTC (permalink / raw)
  To: netdev

One is missing in JSON output because fprintf()
is used instead of print_uint().

Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
---
 ip/link_gre.c  |    3 ++-
 ip/link_gre6.c |    4 +++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index 896bb19..1e331c8 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -476,7 +476,8 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 	if (tb[IFLA_GRE_ERSPAN_INDEX]) {
 		__u32 erspan_idx = rta_getattr_u32(tb[IFLA_GRE_ERSPAN_INDEX]);
 
-		fprintf(f, "erspan_index %u ", erspan_idx);
+		print_uint(PRINT_ANY,
+			   "erspan_index", "erspan_index %u ", erspan_idx);
 	}
 
 	if (tb[IFLA_GRE_ENCAP_TYPE] &&
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 7ae4b49..2687a62 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -532,7 +532,9 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 
 	if (tb[IFLA_GRE_ERSPAN_INDEX]) {
 		__u32 erspan_idx = rta_getattr_u32(tb[IFLA_GRE_ERSPAN_INDEX]);
-		fprintf(f, "erspan_index %u ", erspan_idx);
+
+		print_uint(PRINT_ANY,
+			   "erspan_index", "erspan_index %u ", erspan_idx);
 	}
 
 	if (tb[IFLA_GRE_ENCAP_TYPE] &&
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH iproute2 3/3] ip/tunnel: Document "external" parameter
From: Serhey Popovych @ 2017-12-28 11:11 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1514374096-1473-4-git-send-email-serhe.popovych@gmail.com>

Add it to ip-link(8) "type gre" output help message
as well as to ip-link(8) page.

Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
---
 ip/link_gre.c         |    1 +
 man/man8/ip-link.8.in |    7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index 896bb19..3c0b6d6 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -43,6 +43,7 @@ static void print_usage(FILE *f)
 		"                            [ [no]encap-csum ]\n"
 		"                            [ [no]encap-csum6 ]\n"
 		"                            [ [no]encap-remcsum ]\n"
+		"                            [ external ]\n"
 		"                            [ fwmark MARK ]\n"
 		"                            [ erspan IDX ]\n"
 		"\n"
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 94ecbec..9ddf0ff 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -698,6 +698,8 @@ the following additional arguments are supported:
 .I " mode " { ip6ip | ipip | mplsip | any } "
 ] [
 .BR erspan " \fIIDX "
+] [
+.BR external
 ]
 
 .in +8
@@ -749,6 +751,11 @@ IPv6-Over-IPv4 is not supported for IPIP.
 indicates a 20 bit index/port number associated with the ERSPAN
 traffic's source port and direction.
 
+.sp
+.BR external
+- make this tunnel externally controlled
+.RB "(e.g. " "ip route encap" ).
+
 .in -8
 
 .TP
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH net v1 1/1] tipc: fix hanging poll() for stream sockets
From: Parthasarathy Bhuvaragan @ 2017-12-28 11:03 UTC (permalink / raw)
  To: netdev; +Cc: tipc-discussion, jon.maloy, Parthasarathy Bhuvaragan

In commit 42b531de17d2f6 ("tipc: Fix missing connection request
handling"), we replaced unconditional wakeup() with condtional
wakeup for clients with flags POLLIN | POLLRDNORM | POLLRDBAND.

This breaks the applications which do a connect followed by poll
with POLLOUT flag. These applications are not woken when the
connection is ESTABLISHED and hence sleep forever.

In this commit, we fix it by including the POLLOUT event for
sockets in TIPC_CONNECTING state.

Fixes: 42b531de17d2f6 ("tipc: Fix missing connection request handling")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com>
---
 net/tipc/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index 41127d0b925e..3b4084480377 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -727,11 +727,11 @@ static unsigned int tipc_poll(struct file *file, struct socket *sock,

 	switch (sk->sk_state) {
 	case TIPC_ESTABLISHED:
+	case TIPC_CONNECTING:
 		if (!tsk->cong_link_cnt && !tsk_conn_cong(tsk))
 			revents |= POLLOUT;
 		/* fall thru' */
 	case TIPC_LISTEN:
-	case TIPC_CONNECTING:
 		if (!skb_queue_empty(&sk->sk_receive_queue))
 			revents |= POLLIN | POLLRDNORM;
 		break;
-- 
2.11.0

^ permalink raw reply related

* [PATCH iproute2 1/3] vxcan,veth: Forbid "type" for peer device
From: Serhey Popovych @ 2017-12-28 11:01 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20171226090535.51ac43ef@xeon-e3>

It is already given for original device we configure this
peer for.

Results from following command before/after change applied
are shown below:

  $ ip link add dev veth1a type veth peer name veth1b \
                           type veth peer name veth1c

Before:
-------

<no output, no netdevs created>

After:
------

Error: duplicate "type": "veth" is the second value.

Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
---
 ip/iplink_vxcan.c |    3 +++
 ip/link_veth.c    |    3 +++
 2 files changed, 6 insertions(+)

diff --git a/ip/iplink_vxcan.c b/ip/iplink_vxcan.c
index c13224c..ed0ad8b 100644
--- a/ip/iplink_vxcan.c
+++ b/ip/iplink_vxcan.c
@@ -65,6 +65,9 @@ static int vxcan_parse_opt(struct link_util *lu, int argc, char **argv,
 	if (err < 0)
 		return err;
 
+	if (type)
+		duparg("type", argv[err]);
+
 	if (name) {
 		addattr_l(hdr, 1024,
 			  IFLA_IFNAME, name, strlen(name) + 1);
diff --git a/ip/link_veth.c b/ip/link_veth.c
index fcfd1ef..fddb7ac 100644
--- a/ip/link_veth.c
+++ b/ip/link_veth.c
@@ -63,6 +63,9 @@ static int veth_parse_opt(struct link_util *lu, int argc, char **argv,
 	if (err < 0)
 		return err;
 
+	if (type)
+		duparg("type", argv[err]);
+
 	if (name) {
 		addattr_l(hdr, 1024,
 			  IFLA_IFNAME, name, strlen(name) + 1);
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH iproute2 1/3] vxcan,veth: Forbid "type" for peer device
From: Serhey Popovych @ 2017-12-28 10:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20171226090535.51ac43ef@xeon-e3>


[-- Attachment #1.1: Type: text/plain, Size: 1170 bytes --]



Stephen Hemminger wrote:
> On Wed, 20 Dec 2017 09:37:29 +0200
> Serhey Popovych <serhe.popovych@gmail.com> wrote:
> 
>> It is already given for original device we configure this
>> peer for.
>>
>> Results from following command before/after change applied
>> are shown below:
>>
>>   $ ip link add dev veth1a type veth peer name veth1b \
>>                            type veth peer name veth1c
>>
>> Before:
>> -------
>>
>> <no output, no netdevs created>
>>
>> After:
>> ------
>>
>> Error: argument "type" is wrong: not supported for peer
>>
>> Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com>
> 
> This makes a lot of sense.
> Maybe the better error message would be to use duparg() since
> that is the most likely cause of the user error.

Not sure Stephen. With duparg() I get following error message for
test case described in above.

Error: duplicate "type": "veth" is the second value.

It looks good for me, but we have two "type veth" in command and
which one is wrong might not be very clear to user.

Anyway I send v2 for this particular change to get and option
to pick the best one.

Thanks.
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 1/6] phy: add 2.5G SGMII mode to the phy_mode enum
From: Antoine Tenart @ 2017-12-28 10:06 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Antoine Tenart, davem, kishon, jason, sebastian.hesselbarth,
	gregory.clement, linux, mw, stefanc, ymarkman, thomas.petazzoni,
	miquel.raynal, nadavh, netdev, linux-arm-kernel, linux-kernel
In-Reply-To: <20171228072053.GA27336@lunn.ch>

Hi Andrew,

On Thu, Dec 28, 2017 at 08:20:53AM +0100, Andrew Lunn wrote:
> On Wed, Dec 27, 2017 at 11:14:41PM +0100, Antoine Tenart wrote:
> > This patch adds one more generic PHY mode to the phy_mode enum, to allow
> > configuring generic PHYs to the 2.5G SGMII mode by using the set_mode
> > callback.
> > 
> > Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> > ---
> >  include/linux/phy/phy.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/include/linux/phy/phy.h b/include/linux/phy/phy.h
> > index 4f8423a948d5..70459a28f3a1 100644
> > --- a/include/linux/phy/phy.h
> > +++ b/include/linux/phy/phy.h
> > @@ -28,6 +28,7 @@ enum phy_mode {
> >  	PHY_MODE_USB_DEVICE,
> >  	PHY_MODE_USB_OTG,
> >  	PHY_MODE_SGMII,
> > +	PHY_MODE_SGMII_2_5G,
> >  	PHY_MODE_10GKR,
> >  	PHY_MODE_UFS_HS_A,
> >  	PHY_MODE_UFS_HS_B,
> 
> There was a discussion maybe last month about adding 2.5G SGMII. I
> would prefer 2500SGMII. Putting the number first makes it uniform with
> the other defines, 1000BASEX, 25000BASEX, 10GKR.

Good to know. I wasn't completely sure how to name this mode properly,
but I'm fine with PHY_MODE_2500SGMII. I'll update the patches and send a
v2 (without the dt part).

Thanks!
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
From: Antoine Tenart @ 2017-12-28 10:05 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Russell King - ARM Linux, Antoine Tenart, davem, kishon, jason,
	sebastian.hesselbarth, gregory.clement, mw, stefanc, ymarkman,
	thomas.petazzoni, miquel.raynal, nadavh, netdev, linux-arm-kernel,
	linux-kernel
In-Reply-To: <20171228074623.GA28444@lunn.ch>

Hi Andrew,

On Thu, Dec 28, 2017 at 08:46:23AM +0100, Andrew Lunn wrote:
> On Wed, Dec 27, 2017 at 10:24:01PM +0000, Russell King - ARM Linux wrote:
> > On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote:
> > >  
> > > +&cps_eth2 {
> > > +	/* CPS Lane 5 */
> > > +	status = "okay";
> > > +	phy-mode = "2500base-x";
> > > +	/* Generic PHY, providing serdes lanes */
> > > +	phys = <&cps_comphy5 2>;
> > > +};
> > > +
> > 
> > This is wrong.  This lane is connected to a SFP cage which can support
> > more than just 2500base-X.  Tying it in this way to 2500base-X means
> > that this port does not support conenctions at 1000base-X, despite
> > that's one of the most popular and more standardised speeds.
> > 
> 
> I agree with Russell here. SFP modules are hot pluggable, and support
> a range of interface modes. You need to query what the SFP module is
> in order to know how to configure the SERDES interface. The phylink
> infrastructure does that for you.

Sure, I understand. We'll be able to support such interfaces only when
the phylink PPv2 support lands in.

Thanks!
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
From: Antoine Tenart @ 2017-12-28 10:04 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Antoine Tenart, davem, kishon, andrew, jason,
	sebastian.hesselbarth, gregory.clement, mw, stefanc, ymarkman,
	thomas.petazzoni, miquel.raynal, nadavh, netdev, linux-arm-kernel,
	linux-kernel, Jon Nettleton
In-Reply-To: <20171227231959.GU10595@n2100.armlinux.org.uk>

Hi Russell,

On Wed, Dec 27, 2017 at 11:20:00PM +0000, Russell King - ARM Linux wrote:
> On Wed, Dec 27, 2017 at 11:42:52PM +0100, Antoine Tenart wrote:
> > 
> > What do you suggest to describe this in the dt, to enable a port using
> > the current PPv2 driver?
> 
> I don't - I'm merely pointing out that you're bodging support for the
> SFP cage rather than productively discussing phylink for mvpp2.
> 
> As far as I remember, the discussion stalled at this point:
> 
> - You think there's modes that mvpp2 supports that are not supportable
>   if you use phylink.
> 
> - I've described what phylink supports, and I've asked you for details
>   about what you can't support.

That's not what I remembered. You had some valid points, and others
related to PHY modes the driver wasn't supporting before the phylink
transition. My understanding of this was that you wanted a full
featured support while I only wanted to convert the already supported
modes.

> Unfortunately, no details have been forthcoming, and no further
> discussion has occurred - the ball is entirely in your court to
> progress this issue since I requested information from you and that
> is where things seem to have stalled.
> 
> The result is that, with your patch, you're locking the port to 2.5G
> speeds, meaning that only 4.3Mbps Fibrechannel SFPs can be used with
> the port, and it can only be used with another device that supports
> 2.5G speeds.  You can't use a copper RJ45 module, and you can't use
> a standard 1000base-X module either in this configuration.

You're probably right about not wanting this dt patch. The non-dt
patches still are relevant regardless of phylink being supported in the
PPv2 driver. I'll send a v2 without the dt parts.

Regarding the phylink patch it's stalled for now as I have other
priorities, but I do agree it's a topic that needs to be worked on for a
proper support. I initially made a patch to be nice as it was mentioned
on a previous series, but given the feedback I got I decided to delay it
until my other tasks were completed.

So let's delay the fourth interface support on the mcbin for now.

> What I'm most concerned about, given the bindings for comphy that
> have been merged, is that Free Electrons is pushing forward seemingly
> with no regard to the requirement that the serdes lanes are dynamically
> reconfigurable, and that's a basic requirement for SFP, and for the
> 88x3310 PHYs on the Macchiatobin platform.

The main idea behind the comphy driver is to provide a way to
reconfigure the serdes lanes at runtime. Could you develop what are
blocking points to properly support SFP, regarding the current comphy
support?

Thanks,
Antoine

-- 
Antoine Ténart, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

^ permalink raw reply

* Re: lost connection to test machine (3)
From: Dmitry Vyukov @ 2017-12-28  8:50 UTC (permalink / raw)
  To: Florian Westphal
  Cc: syzbot, LKML, syzkaller-bugs, Pablo Neira Ayuso, Jozsef Kadlecsik,
	David Miller, netfilter-devel, coreteam, netdev
In-Reply-To: <20171227213627.GC23214@breakpoint.cc>

On Wed, Dec 27, 2017 at 10:36 PM, Florian Westphal <fw@strlen.de> wrote:
> Dmitry Vyukov <dvyukov@google.com> wrote:
>> On Wed, Dec 27, 2017 at 7:18 PM, syzbot
>> <syzbot+4396883fa8c4f64e0175@syzkaller.appspotmail.com> wrote:
>> > Hello,
>> >
>> > syzkaller hit the following crash on
>> > beacbc68ac3e23821a681adb30b45dc55b17488d
>> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>> > compiler: gcc (GCC) 7.1.1 20170620
>> > .config is attached
>> > Raw console output is attached.
>> > C reproducer is attached
>> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>> > for information about syzkaller reproducers
>> >
>> >
>> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> > Reported-by: <syzbot+4396883fa8c4f64e0175@syzkaller.appspotmail.com>
>> > It will help syzbot understand when the bug is fixed. See footer for
>> > details.
>> > If you forward the report, please keep this part and the footer.
>>
>> +netfilter maintainers
>>
>> Here is cleaned reproducer:
>>
>> // autogenerated by syzkaller (http://github.com/google/syzkaller)
>> #include <sys/types.h>
>> #include <sys/socket.h>
>> #include <netinet/in.h>
>> #include <netinet/tcp.h>
>> #include <linux/if.h>
>> #include <linux/netfilter_ipv4/ip_tables.h>
>>
>> int main()
>> {
>>   int fd;
>>
>>   fd = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
>>   struct ipt_replace opt = {};
>>   opt.num_counters = 1;
>>   opt.size = -1;
>>   setsockopt(fd, SOL_IP, 0x40, &opt, 0x4);
>>   return 0;
>> }
>>
>>
>> What happens there is that here:
>>
>> struct xt_table_info *xt_alloc_table_info(unsigned int size)
>> {
>>     ...
>>     if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
>>         return NULL;
>>
>> size = -1 and SMP_ALIGN(size) = 0, so this still tries to allocate
>> 4GB+delta bytes.
>>
>> I don't understand why this uses SMP_ALIGN since we add 2 pages on
>> top, it seems that we could just drop SMP_ALIGN and local SMP_ALIGN
>> definition altogether.
>
> Looking at history.git this seems to be a left over from back when
> iptables allocated size * num_cpus() (and used an SMP_ALIGN based offset
> for each cpu).
>
> So yes, I think we can just toss/drop this.


Thanks.
I've mailed "netfilter: fix int overflow in xt_alloc_table_info()" to fix this.

^ permalink raw reply

* [PATCH] netfilter: fix int overflow in xt_alloc_table_info()
From: Dmitry Vyukov @ 2017-12-28  8:48 UTC (permalink / raw)
  To: pablo, kadlec, fw, davem
  Cc: Dmitry Vyukov, netfilter-devel, coreteam, netdev, linux-kernel

syzkaller triggered OOM kills by passing ipt_replace.size = -1
to IPT_SO_SET_REPLACE. The root cause is that SMP_ALIGN() in
xt_alloc_table_info() causes int overflow and the size check passes
when it should not. SMP_ALIGN() is no longer needed leftover.

Remove SMP_ALIGN() call in xt_alloc_table_info().

Reported-by: syzbot+4396883fa8c4f64e0175@syzkaller.appspotmail.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 net/netfilter/x_tables.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 55802e97f906..e02a21549c99 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -39,7 +39,6 @@ MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Harald Welte <laforge@netfilter.org>");
 MODULE_DESCRIPTION("{ip,ip6,arp,eb}_tables backend module");
 
-#define SMP_ALIGN(x) (((x) + SMP_CACHE_BYTES-1) & ~(SMP_CACHE_BYTES-1))
 #define XT_PCPU_BLOCK_SIZE 4096
 
 struct compat_delta {
@@ -1000,7 +999,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
 		return NULL;
 
 	/* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
-	if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
+	if ((size >> PAGE_SHIFT) + 2 > totalram_pages)
 		return NULL;
 
 	info = kvmalloc(sz, GFP_KERNEL);
-- 
2.15.1.620.gb9897f4670-goog


^ permalink raw reply related

* RE: [patch net-next v2 00/10] Add support for resource abstraction
From: Yuval Mintz @ 2017-12-28  8:25 UTC (permalink / raw)
  To: David Ahern, Jiri Pirko
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Arkadi Sharshevsky,
	mlxsw, andrew@lunn.ch, vivien.didelot@savoirfairelinux.com,
	f.fainelli@gmail.com, michael.chan@broadcom.com,
	ganeshgr@chelsio.com, Saeed Mahameed, Matan Barak,
	Leon Romanovsky, Ido Schimmel, jakub.kicinski@netronome.com,
	ast@kernel.org, daniel@iogearbox.net,
	"simon.horman@netronome.com" <simon.horman
In-Reply-To: <ae70d810-8277-899b-b2a9-6b2dbdd5eb21@cumulusnetworks.com>

> >>> Many of the ASIC's internal resources are limited and are shared
> between
> >>> several hardware procedures. For example, unified hash-based memory
> can
> >>> be used for many lookup purposes, like FDB and LPM. In many cases the
> user
> >>> can provide a partitioning scheme for such a resource in order to
> perform
> >>> fine tuning for his application. In such cases performing driver reload is
> >>> needed for the changes to take place, thus this patchset also adds support
> >>> for hot reload.
> >>>
> >>> Such an abstraction can be coupled with devlink's dpipe interface, which
> >>> models the ASIC's pipeline as a graph of match/action tables. By
> modeling
> >>> the hardware resource object, and by coupling it to several dpipe tables,
> >>> further visibility can be achieved in order to debug ASIC-wide issues.
> >>>
> >>> The proposed interface will provide the user the ability to understand the
> >>> limitations of the hardware, and receive notification regarding its
> occupancy.
> >>> Furthermore, monitoring the resource occupancy can be done in real-
> time and
> >>> can be useful in many cases.
> >>
> >> In the last RFC (not v1, but RFC) I asked for some kind of description
> >> for each resource, and you and Arkadi have pushed back. Let's walk
> >> through an example to see what I mean:
> >>
> >> $ devlink resource show pci/0000:03:00.0
> >> pci/0000:03:00.0:
> >>  name kvd size 245760 size_valid true
> >>  resources:
> >>    name linear size 98304 occ 0
> >>    name hash_double size 60416
> >>    name hash_single size 87040
> >>
> >> So this 2700 has 3 resources that can be managed -- some table or
> >> resource or something named 'kvd' with linear, hash_double and
> >> hash_single sub-resources. What are these names referring too? The
> above
> >> output gives no description, and 'kvd' is not an industry term. Further,
> >
> > This are internal resources specific to the ASIC. Would you like some
> > description to each or something like that?
> 
> devlink has some nice self-documenting capabilities. What's missing here
> is a description of what the resource is used for in standard terms --
> ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a
> short list versus an exhaustive list of everything it is used for. e.g.,
> Why would a user decrease linear and increase hash_single or vice versa?
> 
> >
> >
> >> what are these sizes that a user can control? The output contains no
> >> units, no description, nothing. In short, the above output provides
> >> random numbers associated with random names.
> >
> > Units are now exposed from kernel, just this version of iproute2 patch
> > does not display it.
> 
> please provide an iproute2 patch that does so the full context if this
> patch set can be reviewed from a user perspective.
> 
> >
> >
> >>
> >> I can see dpipe tables exported by this device:
> >>
> >> $ devlink dpipe header show pci/0000:03:00.0
> >>
> >> pci/0000:03:00.0:
> >>  name mlxsw_meta
> >>  field:
> >>    name erif_port bitwidth 32 mapping_type ifindex
> >>    name l3_forward bitwidth 1
> >>    name l3_drop bitwidth 1
> >>    name adj_index bitwidth 32
> >>    name adj_size bitwidth 32
> >>    name adj_hash_index bitwidth 32
> >>
> >>  name ipv6
> >>  field:
> >>    name destination ip bitwidth 128
> >>
> >>  name ipv4
> >>  field:
> >>    name destination ip bitwidth 32
> >>
> >>  name ethernet
> >>  field:
> >>    name destination mac bitwidth 48
> >>
> >> but none mention 'kvd' or 'linear' or 'hash" and none of the other
> >> various devlink options:
> >>
> >> $ devlink
> >> Usage: devlink [ OPTIONS ] OBJECT { COMMAND | help }
> >> where  OBJECT := { dev | port | sb | monitor | dpipe }
> >>
> >> seem to related to resources.
> >>
> >> So how does a user know what they are controlling by this 'resource'
> >> option? Is the user expected to have a PRM or user guide on hand for the
> >> specific device model that is being configured?
> >
> > The relation of specific dpipe table to specific resource is exposed by
> > the kernel as well. Probably the iproute2 patch just does not display
> > it.
> 
> please provide an iproute2 patch that does so the full context if this
> patch set can be reviewed from a user perspective.
> 
> >
> >
> >>
> >> Again, I have no objections to kvd, linear, hash, etc terms as they do
> >> relate to Mellanox products. But kvd/linear, for example, does correlate
> >> to industry standard concepts in some way. My request is that the
> >> resource listing guide the user in some way, stating what these
> >> resources mean.
> >
> > So the showed relation to dpipe table would be enougn or you would still
> > like to see some description? I don't like the description concept here
> > as the relations to dpipe table should tell user exactly what he needs
> > to know.
> 
> I believe it is useful to have a 1-line, short description that gives
> the user some memory jogger as to what the resource is used for. It does
> not have to be an exhaustive list, but the user should not have to do
> mental jumping jacks running a bunch of commands to understand the
> resources for vendor specific asics.

Perhaps we can simply have devlink utility output the dpipe
table[s] associated with the resource when showing the resource?
It would contain live information as well as prevent the need for
'mental jumping jacks'.

^ permalink raw reply

* RE: [patch net-next v2 00/10] Add support for resource abstraction
From: Yuval Mintz @ 2017-12-28  8:21 UTC (permalink / raw)
  To: Roopa Prabhu, David Ahern
  Cc: Jiri Pirko, netdev@vger.kernel.org, David Miller,
	Arkadi Sharshevsky, mlxsw, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Michael Chan, ganeshgr@chelsio.com,
	Saeed Mahameed, Matan Barak, Leon Romanovsky, Ido Schimmel,
	jakub.kicinski@netronome.com, ast@kernel.org, Daniel Borkmann,
	Simon Horman, "pieter.jansenvanvuuren@
In-Reply-To: <CAJieiUi2UeTLS1Vcgy0JOEOohNEyaqpnQ2vp_qCxvyOZYCz6bA@mail.gmail.com>

> >>>> Many of the ASIC's internal resources are limited and are shared
> between
> >>>> several hardware procedures. For example, unified hash-based memory
> can
> >>>> be used for many lookup purposes, like FDB and LPM. In many cases the
> user
> >>>> can provide a partitioning scheme for such a resource in order to
> perform
> >>>> fine tuning for his application. In such cases performing driver reload is
> >>>> needed for the changes to take place, thus this patchset also adds
> support
> >>>> for hot reload.
> >>>>
> >>>> Such an abstraction can be coupled with devlink's dpipe interface, which
> >>>> models the ASIC's pipeline as a graph of match/action tables. By
> modeling
> >>>> the hardware resource object, and by coupling it to several dpipe tables,
> >>>> further visibility can be achieved in order to debug ASIC-wide issues.
> >>>>
> >>>> The proposed interface will provide the user the ability to understand
> the
> >>>> limitations of the hardware, and receive notification regarding its
> occupancy.
> >>>> Furthermore, monitoring the resource occupancy can be done in real-
> time and
> >>>> can be useful in many cases.
> >>>
> >>> In the last RFC (not v1, but RFC) I asked for some kind of description
> >>> for each resource, and you and Arkadi have pushed back. Let's walk
> >>> through an example to see what I mean:
> >>>
> >>> $ devlink resource show pci/0000:03:00.0
> >>> pci/0000:03:00.0:
> >>>  name kvd size 245760 size_valid true
> >>>  resources:
> >>>    name linear size 98304 occ 0
> >>>    name hash_double size 60416
> >>>    name hash_single size 87040
> >>>
> >>> So this 2700 has 3 resources that can be managed -- some table or
> >>> resource or something named 'kvd' with linear, hash_double and
> >>> hash_single sub-resources. What are these names referring too? The
> above
> >>> output gives no description, and 'kvd' is not an industry term. Further,
> >>
> >> This are internal resources specific to the ASIC. Would you like some
> >> description to each or something like that?
> >
> > devlink has some nice self-documenting capabilities. What's missing here
> > is a description of what the resource is used for in standard terms --
> > ipv4 host routes, fdb, nexthops, rifs, etc. Even if the description is a
> > short list versus an exhaustive list of everything it is used for. e.g.,
> > Why would a user decrease linear and increase hash_single or vice versa?
> 
> 
> Arkadi, on what david says above, can the resource names and ids not
> be driver specific, but moved up to the switchdev layer and just map
> to fdb, host routes, nexthops table sizes etc ?.
> Can these generic networking resources then in-turn be mapped to kvd
> sizes by the driver ?

I think it goes the other way around. The dpipe tables are the ones that
can be translated to functionality; The resources are internal and HW-specific
representing the possible internal division of resources -
but a given resource sn't necessarily mapped to a single networking feature.
[It might be in some cases, but not in the general case]

You could always move to a structured approach where each resource
in the hierarchy is further split to sub-resources until each leaf represents
a single network concepts - but that would stop be an abstraction of the
HW resources and become a SW implementation instead, as SW would
have to be the one to maintain and enforce the resource distribution.
And that's not what we're trying to achieve here.

^ permalink raw reply

* Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
From: Masami Hiramatsu @ 2017-12-28  8:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Masami Hiramatsu, Alexei Starovoitov, Josef Bacik,
	mingo, davem, netdev, linux-kernel, ast, kernel-team, daniel,
	linux-btrfs, darrick.wong, Josef Bacik, Akinobu Mita
In-Reply-To: <c66a969f-6854-ffa5-502d-bd194c5536bc@fb.com>

On Wed, 27 Dec 2017 20:32:07 -0800
Alexei Starovoitov <ast@fb.com> wrote:

> On 12/27/17 8:16 PM, Steven Rostedt wrote:
> > On Wed, 27 Dec 2017 19:45:42 -0800
> > Alexei Starovoitov <ast@fb.com> wrote:
> >
> >> I don't think that's the case. My reading of current
> >> trace_kprobe_ftrace() -> arch_check_ftrace_location()
> >> is that it will not be true for old mcount case.
> >
> > In the old mcount case, you can't use ftrace to return without calling
> > the function. That is, no modification of the return ip, unless you
> > created a trampoline that could handle arbitrary stack frames, and
> > remove them from the stack before returning back to the function.
> 
> correct. I was saying that trace_kprobe_ftrace() won't let us do
> bpf_override_return with old mcount.

No, trace_kprobe_ftrace() just checks the given address will be
managed by ftrace. you can see arch_check_ftrace_location() in kernel/kprobes.c.

FYI, CONFIG_KPROBES_ON_FTRACE depends on DYNAMIC_FTRACE_WITH_REGS, and
DYNAMIC_FTRACE_WITH_REGS doesn't depend on CC_USING_FENTRY.
This means if you compile kernel with old gcc and enable DYNAMIC_FTRACE,
kprobes uses ftrace on mcount address which is NOT the entry point
of target function.

On the other hand, changing IP feature has been implemented originaly
by kprobes with int3 (sw breakpoint). This means you can use kprobes
at correct address (the entry address of the function) you can hijack
the function, as jprobe did.

> >> As far as the rest of your arguments it very much puzzles me that
> >> you claim that this patch suppose to work based on historical
> >> reasoning whereas you did NOT test it.
> >
> > I believe that Masami is saying that the modification of the IP from
> > kprobes has been very well tested. But I'm guessing that you still want
> > a test case for using kprobes in this particular instance. It's not the
> > implementation of modifying the IP that you are worried about, but the
> > implementation of BPF using it in this case. Right?
> 
> exactly. No doubt that old code works.
> But it doesn't mean that bpf_override_return() will continue to
> work in kprobes that are not ftrace based.
> I suspect Josef's existing test case will cover this situation.
> Probably only special .config is needed to disable ftrace, so
> "kprobe on entry but not ftrace" check will kick in.

Right. If you need to test it, you can run Josef's test case without
CONFIG_DYNAMIC_FTRACE.

> But I didn't get an impression that this situation was tested.
> Instead I see only logical reasoning that it's _supposed_ to work.
> That's not enough.

OK, so would you just ask me to run samples/bpf ?

Thanks,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: [PATCH bpf-next v3 1/3] libbpf: add function to setup XDP
From: Toshiaki Makita @ 2017-12-28  8:18 UTC (permalink / raw)
  To: Eric Leblond, daniel; +Cc: Alexei Starovoitov, netdev, linux-kernel
In-Reply-To: <20171228080437.16933-2-eric@regit.org>

On 2017/12/28 17:04, Eric Leblond wrote:
> Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
> slightly modified to be library compliant.
> 
> Signed-off-by: Eric Leblond <eric@regit.org>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> ---
...
> +int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
...
> +	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
> +		ret = -errno;
> +		goto cleanup;
> +	}
> +
> +	addrlen = sizeof(sa);
> +	if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) {
> +		ret = errno;

Still errno is not inverted,

> +		goto cleanup;
> +	}
> +
> +	if (addrlen != sizeof(sa)) {
> +		ret = errno;

And not set here.

> +		goto cleanup;
> +	}

-- 
Toshiaki Makita

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox