Netdev List
 help / color / mirror / Atom feed
* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Toshiaki Makita @ 2019-08-15 10:26 UTC (permalink / raw)
  To: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <20190814170715.GJ2820@mini-arch>

On 2019/08/15 2:07, Stanislav Fomichev wrote:
> On 08/13, Toshiaki Makita wrote:
>> * Implementation
>>
>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>> program dynamically but a prebuilt program is embedded in UMH. This is
>> mainly because flow insertion is considerably frequent. If we generate
>> and load an eBPF program on each insertion of a flow, the latency of the
>> first packet of ping in above test will incease, which I want to avoid.
> Can this be instead implemented with a new hook that will be called
> for TC events? This hook can write to perf event buffer and control
> plane will insert/remove/modify flow tables in the BPF maps (contol
> plane will also install xdp program).
> 
> Why do we need UMH? What am I missing?

So you suggest doing everything in xdp_flow kmod?
I also thought about that. There are two phases so let's think about them separately.

1) TC block (qdisc) creation / eBPF load

I saw eBPF maintainers repeatedly saying eBPF program loading needs to be
done from userland, not from kernel, to run the verifier for safety.
However xdp_flow eBPF program is prebuilt and embedded in kernel so we may
allow such programs to be loaded from kernel? I currently don't have the will
to make such an API as loading can be done with current UMH mechanism.

2) flow insertion / eBPF map update

Not sure if this needs to be done from userland. One concern is that eBPF maps can
be modified by unrelated processes and we need to handle all unexpected state of maps.
Such handling tends to be difficult and may cause unexpected kernel behavior.
OTOH updating maps from kmod may reduces the latency of flow insertion drastically.

Alexei, Daniel, what do you think?

Toshiaki Makita

^ permalink raw reply

* Re: [PATCH] net: hns: hns_enet: Add of_node_put in hns_nic_dev_probe()
From: Yonglong Liu @ 2019-08-15  9:59 UTC (permalink / raw)
  To: Nishka Dasgupta, yisen.zhuang, salil.mehta, davem, netdev
In-Reply-To: <20190815062837.6015-1-nishkadg.linux@gmail.com>



On 2019/8/15 14:28, Nishka Dasgupta wrote:
> The local variable ae_node in function hns_nic_dev_probe takes the
> return value of of_parse_phandle, which gets a node but does not put it.
> This may cause a memory leak. Hence put ae_node after the last time it
> is invoked.
> Issue found with Coccinelle.
> 
> Signed-off-by: Nishka Dasgupta <nishkadg.linux@gmail.com>
> ---
>  drivers/net/ethernet/hisilicon/hns/hns_enet.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> index 2235dd55fab2..b26e84929e1e 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> @@ -2309,6 +2309,7 @@ static int hns_nic_dev_probe(struct platform_device *pdev)
>  			goto out_read_prop_fail;
>  		}
>  		priv->fwnode = &ae_node->fwnode;
> +		of_node_put(ae_node);
>  	} else if (is_acpi_node(dev->fwnode)) {
>  		struct fwnode_reference_args args;
>  
> 

Hi, Nishka:
This patch is wrong, we put the node in hns_nic_dev_remove().

The following patch had solved this problem:
263c6d75f9a5 (net: hns: Fix for missing of_node_put() after of_parse_phandle())


^ permalink raw reply

* Re: [PATCH v2] virtio-net: lower min ring num_free for efficiency
From: Jason Wang @ 2019-08-15  9:55 UTC (permalink / raw)
  To: ? jiang, mst@redhat.com
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, hawk@kernel.org,
	john.fastabend@gmail.com, kafai@fb.com, songliubraving@fb.com,
	yhs@fb.com, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	xdp-newbies@vger.kernel.org, bpf@vger.kernel.org,
	jiangran.jr@alibaba-inc.com
In-Reply-To: <BYAPR14MB32058F4B2AD162F5421BB9B4A6AC0@BYAPR14MB3205.namprd14.prod.outlook.com>


On 2019/8/15 下午5:42, ? jiang wrote:
> This change lowers ring buffer reclaim threshold from 1/2*queue to budget
> for better performance. According to our test with qemu + dpdk, packet
> dropping happens when the guest is not able to provide free buffer in
> avail ring timely with default 1/2*queue. The value in the patch has been
> tested and does show better performance.
>
> Test setup: iperf3 to generate packets to guest (total 30mins, pps 400k, UDP)
> avg packets drop before: 2842
> avg packets drop after: 360(-87.3%)
>
> Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
> ---
>   drivers/net/virtio_net.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0d4115c9e20b..bc08be7925eb 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1331,7 +1331,7 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>   		}
>   	}
>   
> -	if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
> +	if (rq->vq->num_free > min((unsigned int)budget, virtqueue_get_vring_size(rq->vq)) / 2) {
>   		if (!try_fill_recv(vi, rq, GFP_ATOMIC))
>   			schedule_delayed_work(&vi->refill, 0);
>   	}


Acked-by: Jason Wang <jasowang@redhat.com>



^ permalink raw reply

* [PATCH net-next 2/2] r8169: use the generic EEE management functions
From: Heiner Kallweit @ 2019-08-15  9:47 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <4a6878bf-344e-2df5-df00-b80c7c0982d1@gmail.com>

Now that the Realtek PHY driver maps the vendor-specific EEE registers
to the standard MMD registers, we can remove all special handling and
use the generic functions phy_ethtool_get/set_eee.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 172 +++-------------------
 1 file changed, 24 insertions(+), 148 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 7dd75c047..bd9077f85 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -758,6 +758,13 @@ static bool rtl_is_8168evl_up(struct rtl8169_private *tp)
 	       tp->mac_version <= RTL_GIGA_MAC_VER_51;
 }
 
+static bool rtl_supports_eee(struct rtl8169_private *tp)
+{
+	return tp->mac_version >= RTL_GIGA_MAC_VER_34 &&
+	       tp->mac_version != RTL_GIGA_MAC_VER_37 &&
+	       tp->mac_version != RTL_GIGA_MAC_VER_39;
+}
+
 static void rtl_read_mac_from_reg(struct rtl8169_private *tp, u8 *mac, int reg)
 {
 	int i;
@@ -2014,144 +2021,40 @@ static int rtl_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
 	return 0;
 }
 
-static int rtl_get_eee_supp(struct rtl8169_private *tp)
-{
-	struct phy_device *phydev = tp->phydev;
-	int ret;
-
-	switch (tp->mac_version) {
-	case RTL_GIGA_MAC_VER_34:
-	case RTL_GIGA_MAC_VER_35:
-	case RTL_GIGA_MAC_VER_36:
-	case RTL_GIGA_MAC_VER_38:
-		ret = phy_read_mmd(phydev, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE);
-		break;
-	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
-		ret = phy_read_paged(phydev, 0x0a5c, 0x12);
-		break;
-	default:
-		ret = -EPROTONOSUPPORT;
-		break;
-	}
-
-	return ret;
-}
-
-static int rtl_get_eee_lpadv(struct rtl8169_private *tp)
-{
-	struct phy_device *phydev = tp->phydev;
-	int ret;
-
-	switch (tp->mac_version) {
-	case RTL_GIGA_MAC_VER_34:
-	case RTL_GIGA_MAC_VER_35:
-	case RTL_GIGA_MAC_VER_36:
-	case RTL_GIGA_MAC_VER_38:
-		ret = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_LPABLE);
-		break;
-	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
-		ret = phy_read_paged(phydev, 0x0a5d, 0x11);
-		break;
-	default:
-		ret = -EPROTONOSUPPORT;
-		break;
-	}
-
-	return ret;
-}
-
-static int rtl_get_eee_adv(struct rtl8169_private *tp)
-{
-	struct phy_device *phydev = tp->phydev;
-	int ret;
-
-	switch (tp->mac_version) {
-	case RTL_GIGA_MAC_VER_34:
-	case RTL_GIGA_MAC_VER_35:
-	case RTL_GIGA_MAC_VER_36:
-	case RTL_GIGA_MAC_VER_38:
-		ret = phy_read_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_ADV);
-		break;
-	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
-		ret = phy_read_paged(phydev, 0x0a5d, 0x10);
-		break;
-	default:
-		ret = -EPROTONOSUPPORT;
-		break;
-	}
-
-	return ret;
-}
-
-static int rtl_set_eee_adv(struct rtl8169_private *tp, int val)
-{
-	struct phy_device *phydev = tp->phydev;
-	int ret = 0;
-
-	switch (tp->mac_version) {
-	case RTL_GIGA_MAC_VER_34:
-	case RTL_GIGA_MAC_VER_35:
-	case RTL_GIGA_MAC_VER_36:
-	case RTL_GIGA_MAC_VER_38:
-		ret = phy_write_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_ADV, val);
-		break;
-	case RTL_GIGA_MAC_VER_40 ... RTL_GIGA_MAC_VER_51:
-		phy_write_paged(phydev, 0x0a5d, 0x10, val);
-		break;
-	default:
-		ret = -EPROTONOSUPPORT;
-		break;
-	}
-
-	return ret;
-}
-
 static int rtl8169_get_eee(struct net_device *dev, struct ethtool_eee *data)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	struct device *d = tp_to_dev(tp);
 	int ret;
 
+	if (!rtl_supports_eee(tp))
+		return -EOPNOTSUPP;
+
 	pm_runtime_get_noresume(d);
 
 	if (!pm_runtime_active(d)) {
 		ret = -EOPNOTSUPP;
-		goto out;
+	} else {
+		ret = phy_ethtool_get_eee(tp->phydev, data);
 	}
 
-	/* Get Supported EEE */
-	ret = rtl_get_eee_supp(tp);
-	if (ret < 0)
-		goto out;
-	data->supported = mmd_eee_cap_to_ethtool_sup_t(ret);
-
-	/* Get advertisement EEE */
-	ret = rtl_get_eee_adv(tp);
-	if (ret < 0)
-		goto out;
-	data->advertised = mmd_eee_adv_to_ethtool_adv_t(ret);
-	data->eee_enabled = !!data->advertised;
-
-	/* Get LP advertisement EEE */
-	ret = rtl_get_eee_lpadv(tp);
-	if (ret < 0)
-		goto out;
-	data->lp_advertised = mmd_eee_adv_to_ethtool_adv_t(ret);
-	data->eee_active = !!(data->advertised & data->lp_advertised);
-out:
 	pm_runtime_put_noidle(d);
-	return ret < 0 ? ret : 0;
+
+	return ret;
 }
 
 static int rtl8169_set_eee(struct net_device *dev, struct ethtool_eee *data)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
 	struct device *d = tp_to_dev(tp);
-	int old_adv, adv = 0, cap, ret;
+	int ret;
+
+	if (!rtl_supports_eee(tp))
+		return -EOPNOTSUPP;
 
 	pm_runtime_get_noresume(d);
 
-	if (!dev->phydev || !pm_runtime_active(d)) {
+	if (!pm_runtime_active(d)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
@@ -2162,38 +2065,10 @@ static int rtl8169_set_eee(struct net_device *dev, struct ethtool_eee *data)
 		goto out;
 	}
 
-	/* Get Supported EEE */
-	ret = rtl_get_eee_supp(tp);
-	if (ret < 0)
-		goto out;
-	cap = ret;
-
-	ret = rtl_get_eee_adv(tp);
-	if (ret < 0)
-		goto out;
-	old_adv = ret;
-
-	if (data->eee_enabled) {
-		adv = !data->advertised ? cap :
-		      ethtool_adv_to_mmd_eee_adv_t(data->advertised) & cap;
-		/* Mask prohibited EEE modes */
-		adv &= ~dev->phydev->eee_broken_modes;
-	}
-
-	if (old_adv != adv) {
-		ret = rtl_set_eee_adv(tp, adv);
-		if (ret < 0)
-			goto out;
-
-		/* Restart autonegotiation so the new modes get sent to the
-		 * link partner.
-		 */
-		ret = phy_restart_aneg(dev->phydev);
-	}
-
+	ret = phy_ethtool_set_eee(tp->phydev, data);
 out:
 	pm_runtime_put_noidle(d);
-	return ret < 0 ? ret : 0;
+	return ret;
 }
 
 static const struct ethtool_ops rtl8169_ethtool_ops = {
@@ -2220,10 +2095,11 @@ static const struct ethtool_ops rtl8169_ethtool_ops = {
 
 static void rtl_enable_eee(struct rtl8169_private *tp)
 {
-	int supported = rtl_get_eee_supp(tp);
+	struct phy_device *phydev = tp->phydev;
+	int supported = phy_read_mmd(phydev, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE);
 
 	if (supported > 0)
-		rtl_set_eee_adv(tp, supported);
+		phy_write_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_ADV, supported);
 }
 
 static void rtl8169_get_mac_version(struct rtl8169_private *tp)
-- 
2.22.1



^ permalink raw reply related

* [PATCH net-next 1/2] net: phy: realtek: add support for EEE registers on integrated PHY's
From: Heiner Kallweit @ 2019-08-15  9:46 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <4a6878bf-344e-2df5-df00-b80c7c0982d1@gmail.com>

EEE-related registers on newer integrated PHY's have the standard
layout, but are accessible not via MMD but via vendor-specific
registers. Emulating the standard MMD registers allows to use the
generic functions for EEE control.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/phy/realtek.c | 43 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
index c49a1fb13..2635ad1ff 100644
--- a/drivers/net/phy/realtek.c
+++ b/drivers/net/phy/realtek.c
@@ -266,6 +266,45 @@ static int rtl8366rb_config_init(struct phy_device *phydev)
 	return ret;
 }
 
+static int rtlgen_read_mmd(struct phy_device *phydev, int devnum, u16 regnum)
+{
+	int ret;
+
+	if (devnum == MDIO_MMD_PCS && regnum == MDIO_PCS_EEE_ABLE) {
+		rtl821x_write_page(phydev, 0xa5c);
+		ret = __phy_read(phydev, 0x12);
+		rtl821x_write_page(phydev, 0);
+	} else if (devnum == MDIO_MMD_AN && regnum == MDIO_AN_EEE_ADV) {
+		rtl821x_write_page(phydev, 0xa5d);
+		ret = __phy_read(phydev, 0x10);
+		rtl821x_write_page(phydev, 0);
+	} else if (devnum == MDIO_MMD_AN && regnum == MDIO_AN_EEE_LPABLE) {
+		rtl821x_write_page(phydev, 0xa5d);
+		ret = __phy_read(phydev, 0x11);
+		rtl821x_write_page(phydev, 0);
+	} else {
+		ret = -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
+static int rtlgen_write_mmd(struct phy_device *phydev, int devnum, u16 regnum,
+			    u16 val)
+{
+	int ret;
+
+	if (devnum == MDIO_MMD_AN && regnum == MDIO_AN_EEE_ADV) {
+		rtl821x_write_page(phydev, 0xa5d);
+		ret = __phy_write(phydev, 0x10, val);
+		rtl821x_write_page(phydev, 0);
+	} else {
+		ret = -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
 static int rtl8125_get_features(struct phy_device *phydev)
 {
 	int val;
@@ -422,6 +461,8 @@ static struct phy_driver realtek_drvs[] = {
 		.resume		= genphy_resume,
 		.read_page	= rtl821x_read_page,
 		.write_page	= rtl821x_write_page,
+		.read_mmd	= rtlgen_read_mmd,
+		.write_mmd	= rtlgen_write_mmd,
 	}, {
 		.name		= "RTL8125 2.5Gbps internal",
 		.match_phy_device = rtl8125_match_phy_device,
@@ -432,6 +473,8 @@ static struct phy_driver realtek_drvs[] = {
 		.resume		= genphy_resume,
 		.read_page	= rtl821x_read_page,
 		.write_page	= rtl821x_write_page,
+		.read_mmd	= rtlgen_read_mmd,
+		.write_mmd	= rtlgen_write_mmd,
 	}, {
 		PHY_ID_MATCH_EXACT(0x001cc961),
 		.name		= "RTL8366RB Gigabit Ethernet",
-- 
2.22.1



^ permalink raw reply related

* [PATCH net-next 0/2] net: phy: realtek: map vendor-specific EEE registers to standard MMD registers
From: Heiner Kallweit @ 2019-08-15  9:45 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, David Miller; +Cc: netdev@vger.kernel.org

EEE-related registers on newer integrated PHY's have the standard
layout, but are accessible not via MMD but via vendor-specific
registers. Emulating the standard MMD registers allows to use the
generic functions for EEE control and to significantly simplify
the r8169 driver.

Heiner Kallweit (2):
  net: phy: realtek: add support for EEE registers on integrated PHY's
  r8169: use the generic EEE management functions

 drivers/net/ethernet/realtek/r8169_main.c | 172 +++-------------------
 drivers/net/phy/realtek.c                 |  43 ++++++
 2 files changed, 67 insertions(+), 148 deletions(-)

-- 
2.22.1


^ permalink raw reply

* Re: [PATCH 08/16] mips: prefer __section from compiler_attributes.h
From: Paul Burton @ 2019-08-15  9:38 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: akpm@linux-foundation.org, sedat.dilek@gmail.com,
	jpoimboe@redhat.com, yhs@fb.com, miguel.ojeda.sandonis@gmail.com,
	clang-built-linux@googlegroups.com, Ralf Baechle, James Hogan,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	linux-mips@vger.kernel.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, bpf@vger.kernel.org
In-Reply-To: <20190812215052.71840-8-ndesaulniers@google.com>

Hi Nick,

On Mon, Aug 12, 2019 at 02:50:41PM -0700, Nick Desaulniers wrote:
> Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
> Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

It would be good to add a commit message, even if it's just a line
repeating the subject & preferably describing the motivation.

> ---
>  arch/mips/include/asm/cache.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/mips/include/asm/cache.h b/arch/mips/include/asm/cache.h
> index 8b14c2706aa5..af2d943580ee 100644
> --- a/arch/mips/include/asm/cache.h
> +++ b/arch/mips/include/asm/cache.h
> @@ -14,6 +14,6 @@
>  #define L1_CACHE_SHIFT		CONFIG_MIPS_L1_CACHE_SHIFT
>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
>  
> -#define __read_mostly __attribute__((__section__(".data..read_mostly")))
> +#define __read_mostly __section(.data..read_mostly)
>  
>  #endif /* _ASM_CACHE_H */
> -- 
> 2.23.0.rc1.153.gdeed80330f-goog

I'm not copied on the rest of the series so I'm not sure what your
expectations are about where this should be applied. Let me know if
you'd prefer this to go through mips-next, otherwise:

    Acked-by: Paul Burton <paul.burton@mips.com>

Thanks,
    Paul

^ permalink raw reply

* [PATCH net-next] net/rds: Add RDS6_INFO_SOCKETS and RDS6_INFO_RECV_MESSAGES options
From: Ka-Cheong Poon @ 2019-08-15  9:36 UTC (permalink / raw)
  To: netdev; +Cc: santosh.shilimkar, davem, rds-devel

Add support of the socket options RDS6_INFO_SOCKETS and
RDS6_INFO_RECV_MESSAGES which update the RDS_INFO_SOCKETS and
RDS_INFO_RECV_MESSAGES options respectively.  The old options work
for IPv4 sockets only.

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
---
 net/rds/af_rds.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 90 insertions(+), 3 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 2b969f9..e7b082a 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2006, 2018 Oracle and/or its affiliates. All rights reserved.
+ * Copyright (c) 2006, 2019 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -741,6 +741,10 @@ static void rds_sock_inc_info(struct socket *sock, unsigned int len,
 	spin_lock_bh(&rds_sock_lock);
 
 	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		/* This option only supports IPv4 sockets. */
+		if (!ipv6_addr_v4mapped(&rs->rs_bound_addr))
+			continue;
+
 		read_lock(&rs->rs_recv_lock);
 
 		/* XXX too lazy to maintain counts.. */
@@ -762,21 +766,60 @@ static void rds_sock_inc_info(struct socket *sock, unsigned int len,
 	lens->each = sizeof(struct rds_info_message);
 }
 
+#if IS_ENABLED(CONFIG_IPV6)
+static void rds6_sock_inc_info(struct socket *sock, unsigned int len,
+			       struct rds_info_iterator *iter,
+			       struct rds_info_lengths *lens)
+{
+	struct rds_incoming *inc;
+	unsigned int total = 0;
+	struct rds_sock *rs;
+
+	len /= sizeof(struct rds6_info_message);
+
+	spin_lock_bh(&rds_sock_lock);
+
+	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		read_lock(&rs->rs_recv_lock);
+
+		list_for_each_entry(inc, &rs->rs_recv_queue, i_item) {
+			total++;
+			if (total <= len)
+				rds6_inc_info_copy(inc, iter, &inc->i_saddr,
+						   &rs->rs_bound_addr, 1);
+		}
+
+		read_unlock(&rs->rs_recv_lock);
+	}
+
+	spin_unlock_bh(&rds_sock_lock);
+
+	lens->nr = total;
+	lens->each = sizeof(struct rds6_info_message);
+}
+#endif
+
 static void rds_sock_info(struct socket *sock, unsigned int len,
 			  struct rds_info_iterator *iter,
 			  struct rds_info_lengths *lens)
 {
 	struct rds_info_socket sinfo;
+	unsigned int cnt = 0;
 	struct rds_sock *rs;
 
 	len /= sizeof(struct rds_info_socket);
 
 	spin_lock_bh(&rds_sock_lock);
 
-	if (len < rds_sock_count)
+	if (len < rds_sock_count) {
+		cnt = rds_sock_count;
 		goto out;
+	}
 
 	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		/* This option only supports IPv4 sockets. */
+		if (!ipv6_addr_v4mapped(&rs->rs_bound_addr))
+			continue;
 		sinfo.sndbuf = rds_sk_sndbuf(rs);
 		sinfo.rcvbuf = rds_sk_rcvbuf(rs);
 		sinfo.bound_addr = rs->rs_bound_addr_v4;
@@ -786,15 +829,51 @@ static void rds_sock_info(struct socket *sock, unsigned int len,
 		sinfo.inum = sock_i_ino(rds_rs_to_sk(rs));
 
 		rds_info_copy(iter, &sinfo, sizeof(sinfo));
+		cnt++;
 	}
 
 out:
-	lens->nr = rds_sock_count;
+	lens->nr = cnt;
 	lens->each = sizeof(struct rds_info_socket);
 
 	spin_unlock_bh(&rds_sock_lock);
 }
 
+#if IS_ENABLED(CONFIG_IPV6)
+static void rds6_sock_info(struct socket *sock, unsigned int len,
+			   struct rds_info_iterator *iter,
+			   struct rds_info_lengths *lens)
+{
+	struct rds6_info_socket sinfo6;
+	struct rds_sock *rs;
+
+	len /= sizeof(struct rds6_info_socket);
+
+	spin_lock_bh(&rds_sock_lock);
+
+	if (len < rds_sock_count)
+		goto out;
+
+	list_for_each_entry(rs, &rds_sock_list, rs_item) {
+		sinfo6.sndbuf = rds_sk_sndbuf(rs);
+		sinfo6.rcvbuf = rds_sk_rcvbuf(rs);
+		sinfo6.bound_addr = rs->rs_bound_addr;
+		sinfo6.connected_addr = rs->rs_conn_addr;
+		sinfo6.bound_port = rs->rs_bound_port;
+		sinfo6.connected_port = rs->rs_conn_port;
+		sinfo6.inum = sock_i_ino(rds_rs_to_sk(rs));
+
+		rds_info_copy(iter, &sinfo6, sizeof(sinfo6));
+	}
+
+ out:
+	lens->nr = rds_sock_count;
+	lens->each = sizeof(struct rds6_info_socket);
+
+	spin_unlock_bh(&rds_sock_lock);
+}
+#endif
+
 static void rds_exit(void)
 {
 	sock_unregister(rds_family_ops.family);
@@ -808,6 +887,10 @@ static void rds_exit(void)
 	rds_bind_lock_destroy();
 	rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info);
 	rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
+#if IS_ENABLED(CONFIG_IPV6)
+	rds_info_deregister_func(RDS6_INFO_SOCKETS, rds6_sock_info);
+	rds_info_deregister_func(RDS6_INFO_RECV_MESSAGES, rds6_sock_inc_info);
+#endif
 }
 module_exit(rds_exit);
 
@@ -845,6 +928,10 @@ static int rds_init(void)
 
 	rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info);
 	rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
+#if IS_ENABLED(CONFIG_IPV6)
+	rds_info_register_func(RDS6_INFO_SOCKETS, rds6_sock_info);
+	rds_info_register_func(RDS6_INFO_RECV_MESSAGES, rds6_sock_inc_info);
+#endif
 
 	goto out;
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH] tun: fix use-after-free when register netdev failed
From: Eric Dumazet @ 2019-08-15  9:35 UTC (permalink / raw)
  To: Yang Yingliang, netdev; +Cc: jasowang, xiyou.wangcong, davem
In-Reply-To: <1565857122-24660-1-git-send-email-yangyingliang@huawei.com>



On 8/15/19 10:18 AM, Yang Yingliang wrote:
> I got a UAF repport in tun driver when doing fuzzy test:
> 
>
> [  466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0
> [  466.371582] flags: 0x2fffff80010200(slab|head)
> [  466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00
> [  466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
> [  466.377778] page dumped because: kasan: bad access detected
> [  466.379730]
> [  466.380288] Memory state around the buggy address:
> [  466.381844]  ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.384009]  ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.388257]                                                  ^
> [  466.390234]  ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.392512]  ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.394667] ==================================================================
> 
> tun_chr_read_iter() accessed the memory which freed by free_netdev()
> called by tun_set_iff():
> 
> 	CPUA				CPUB
>     tun_set_iff()
>       alloc_netdev_mqs()
>       tun_attach()
> 				    tun_chr_read_iter()
> 				      tun_get()
>       register_netdevice()
>       tun_detach_all()
>         synchronize_net()
> 				      tun_do_read()
> 				        tun_ring_recv()
> 				          schedule()
>       free_netdev()
> 				      tun_put() <-- UAF

UAF on what exactly ? The dev_hold() should prevent the free_netdev().

> 
> Set a new bit in tun->flag if register_netdevice() successed,
> without this bit, tun_get() returns NULL to avoid using a
> freed tun pointer.
> 
> Fixes: eb0fb363f920 ("tuntap: attach queue 0 before registering netdevice")
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
> ---
>  drivers/net/tun.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index db16d7a13e00..cbd60c276c40 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -115,6 +115,7 @@ do {								\
>  /* High bits in flags field are unused. */
>  #define TUN_VNET_LE     0x80000000
>  #define TUN_VNET_BE     0x40000000
> +#define TUN_DEV_REGISTERED	0x20000000
>  
>  #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
>  		      IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
> @@ -719,8 +720,10 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
>  			netif_carrier_off(tun->dev);
>  
>  			if (!(tun->flags & IFF_PERSIST) &&
> -			    tun->dev->reg_state == NETREG_REGISTERED)
> +			    tun->dev->reg_state == NETREG_REGISTERED) {
>  				unregister_netdevice(tun->dev);
> +				tun->flags &= ~TUN_DEV_REGISTERED;

Isn't this done too late ?

> +			}
>  		}
>  		if (tun)
>  			xdp_rxq_info_unreg(&tfile->xdp_rxq);
> @@ -884,8 +887,10 @@ static struct tun_struct *tun_get(struct tun_file *tfile)
>  
>  	rcu_read_lock();
>  	tun = rcu_dereference(tfile->tun);
> -	if (tun)
> +	if (tun && (tun->flags & TUN_DEV_REGISTERED))
>  		dev_hold(tun->dev);
> +	else
> +		tun = NULL;
>  	rcu_read_unlock();
>  
>  	return tun;
> @@ -2836,6 +2841,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>  		err = register_netdevice(tun->dev);
>  		if (err < 0)
>  			goto err_detach;
> +		tun->flags |= TUN_DEV_REGISTERED;
>  	}
>  
>  	netif_carrier_on(tun->dev);
> 


So tun_get() will return NULL as long as  tun_set_iff() (TUNSETIFF ioctl()) has not yet been called ?

This could break some applications, since tun_get() is used from poll() and other syscalls.


^ permalink raw reply

* [PATCH bpf-next v5 2/2] xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP
From: Björn Töpel @ 2019-08-15  9:30 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, jonathan.lemon,
	bruce.richardson, songliubraving, bpf
In-Reply-To: <20190815093014.31174-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
an entry. This patch addresses that.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 kernel/bpf/xskmap.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 16031d489173..4cc28e226398 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -226,8 +226,6 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 		return -EINVAL;
 	if (unlikely(i >= m->map.max_entries))
 		return -E2BIG;
-	if (unlikely(map_flags == BPF_NOEXIST))
-		return -EEXIST;
 
 	sock = sockfd_lookup(fd, &err);
 	if (!sock)
@@ -257,6 +255,12 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 	if (old_xs == xs) {
 		err = 0;
 		goto out;
+	} else if (old_xs && map_flags == BPF_NOEXIST) {
+		err = -EEXIST;
+		goto out;
+	} else if (!old_xs && map_flags == BPF_EXIST) {
+		err = -ENOENT;
+		goto out;
 	}
 	xsk_map_sock_add(xs, node);
 	WRITE_ONCE(*map_entry, xs);
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next v5 1/2] xsk: remove AF_XDP socket from map when the socket is released
From: Björn Töpel @ 2019-08-15  9:30 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, jonathan.lemon,
	bruce.richardson, songliubraving, bpf
In-Reply-To: <20190815093014.31174-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.

In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.

Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h |  18 ++++++
 kernel/bpf/xskmap.c    | 125 ++++++++++++++++++++++++++++++++++-------
 net/xdp/xsk.c          |  50 +++++++++++++++++
 3 files changed, 173 insertions(+), 20 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69796d264f06..066e3ae446a8 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -50,6 +50,16 @@ struct xdp_umem {
 	struct list_head xsk_list;
 };
 
+/* Nodes are linked in the struct xdp_sock map_list field, and used to
+ * track which maps a certain socket reside in.
+ */
+struct xsk_map;
+struct xsk_map_node {
+	struct list_head node;
+	struct xsk_map *map;
+	struct xdp_sock **map_entry;
+};
+
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
@@ -75,6 +85,9 @@ struct xdp_sock {
 	/* Protects generic receive. */
 	spinlock_t rx_lock;
 	u64 rx_dropped;
+	struct list_head map_list;
+	/* Protects map_list */
+	spinlock_t map_list_lock;
 };
 
 struct xdp_buff;
@@ -96,6 +109,11 @@ struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
 void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
 struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);
 
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+			     struct xdp_sock **map_entry);
+int xsk_map_inc(struct xsk_map *map);
+void xsk_map_put(struct xsk_map *map);
+
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
 	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9bb96ace9fa1..16031d489173 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -13,8 +13,71 @@ struct xsk_map {
 	struct bpf_map map;
 	struct xdp_sock **xsk_map;
 	struct list_head __percpu *flush_list;
+	spinlock_t lock; /* Synchronize map updates */
 };
 
+int xsk_map_inc(struct xsk_map *map)
+{
+	struct bpf_map *m = &map->map;
+
+	m = bpf_map_inc(m, false);
+	return IS_ERR(m) ? PTR_ERR(m) : 0;
+}
+
+void xsk_map_put(struct xsk_map *map)
+{
+	bpf_map_put(&map->map);
+}
+
+static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
+					       struct xdp_sock **map_entry)
+{
+	struct xsk_map_node *node;
+	int err;
+
+	node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
+	if (!node)
+		return NULL;
+
+	err = xsk_map_inc(map);
+	if (err) {
+		kfree(node);
+		return ERR_PTR(err);
+	}
+
+	node->map = map;
+	node->map_entry = map_entry;
+	return node;
+}
+
+static void xsk_map_node_free(struct xsk_map_node *node)
+{
+	xsk_map_put(node->map);
+	kfree(node);
+}
+
+static void xsk_map_sock_add(struct xdp_sock *xs, struct xsk_map_node *node)
+{
+	spin_lock_bh(&xs->map_list_lock);
+	list_add_tail(&node->node, &xs->map_list);
+	spin_unlock_bh(&xs->map_list_lock);
+}
+
+static void xsk_map_sock_delete(struct xdp_sock *xs,
+				struct xdp_sock **map_entry)
+{
+	struct xsk_map_node *n, *tmp;
+
+	spin_lock_bh(&xs->map_list_lock);
+	list_for_each_entry_safe(n, tmp, &xs->map_list, node) {
+		if (map_entry == n->map_entry) {
+			list_del(&n->node);
+			xsk_map_node_free(n);
+		}
+	}
+	spin_unlock_bh(&xs->map_list_lock);
+}
+
 static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 {
 	struct xsk_map *m;
@@ -34,6 +97,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 		return ERR_PTR(-ENOMEM);
 
 	bpf_map_init_from_attr(&m->map, attr);
+	spin_lock_init(&m->lock);
 
 	cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
 	cost += sizeof(struct list_head) * num_possible_cpus();
@@ -71,21 +135,9 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
 static void xsk_map_free(struct bpf_map *map)
 {
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
-	int i;
 
 	bpf_clear_redirect_map(map);
 	synchronize_net();
-
-	for (i = 0; i < map->max_entries; i++) {
-		struct xdp_sock *xs;
-
-		xs = m->xsk_map[i];
-		if (!xs)
-			continue;
-
-		sock_put((struct sock *)xs);
-	}
-
 	free_percpu(m->flush_list);
 	bpf_map_area_free(m->xsk_map);
 	kfree(m);
@@ -164,8 +216,9 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 			       u64 map_flags)
 {
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xdp_sock *xs, *old_xs, **map_entry;
 	u32 i = *(u32 *)key, fd = *(u32 *)value;
-	struct xdp_sock *xs, *old_xs;
+	struct xsk_map_node *node;
 	struct socket *sock;
 	int err;
 
@@ -192,32 +245,64 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 		return -EOPNOTSUPP;
 	}
 
-	sock_hold(sock->sk);
+	map_entry = &m->xsk_map[i];
+	node = xsk_map_node_alloc(m, map_entry);
+	if (IS_ERR(node)) {
+		sockfd_put(sock);
+		return PTR_ERR(node);
+	}
 
-	old_xs = xchg(&m->xsk_map[i], xs);
+	spin_lock_bh(&m->lock);
+	old_xs = READ_ONCE(*map_entry);
+	if (old_xs == xs) {
+		err = 0;
+		goto out;
+	}
+	xsk_map_sock_add(xs, node);
+	WRITE_ONCE(*map_entry, xs);
 	if (old_xs)
-		sock_put((struct sock *)old_xs);
-
+		xsk_map_sock_delete(old_xs, map_entry);
+	spin_unlock_bh(&m->lock);
 	sockfd_put(sock);
 	return 0;
+
+out:
+	spin_unlock_bh(&m->lock);
+	sockfd_put(sock);
+	xsk_map_node_free(node);
+	return err;
 }
 
 static int xsk_map_delete_elem(struct bpf_map *map, void *key)
 {
 	struct xsk_map *m = container_of(map, struct xsk_map, map);
-	struct xdp_sock *old_xs;
+	struct xdp_sock *old_xs, **map_entry;
 	int k = *(u32 *)key;
 
 	if (k >= map->max_entries)
 		return -EINVAL;
 
-	old_xs = xchg(&m->xsk_map[k], NULL);
+	spin_lock_bh(&m->lock);
+	map_entry = &m->xsk_map[k];
+	old_xs = xchg(map_entry, NULL);
 	if (old_xs)
-		sock_put((struct sock *)old_xs);
+		xsk_map_sock_delete(old_xs, map_entry);
+	spin_unlock_bh(&m->lock);
 
 	return 0;
 }
 
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+			     struct xdp_sock **map_entry)
+{
+	spin_lock_bh(&map->lock);
+	if (READ_ONCE(*map_entry) == xs) {
+		WRITE_ONCE(*map_entry, NULL);
+		xsk_map_sock_delete(xs, map_entry);
+	}
+	spin_unlock_bh(&map->lock);
+}
+
 const struct bpf_map_ops xsk_map_ops = {
 	.map_alloc = xsk_map_alloc,
 	.map_free = xsk_map_free,
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 59b57d708697..c3d027aa693d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -362,6 +362,52 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
 	dev_put(dev);
 }
 
+static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
+					      struct xdp_sock ***map_entry)
+{
+	struct xsk_map *map = NULL;
+	struct xsk_map_node *node;
+
+	*map_entry = NULL;
+
+	spin_lock_bh(&xs->map_list_lock);
+	node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
+					node);
+	if (node) {
+		WARN_ON(xsk_map_inc(node->map));
+		map = node->map;
+		*map_entry = node->map_entry;
+	}
+	spin_unlock_bh(&xs->map_list_lock);
+	return map;
+}
+
+static void xsk_delete_from_maps(struct xdp_sock *xs)
+{
+	/* This function removes the current XDP socket from all the
+	 * maps it resides in. We need to take extra care here, due to
+	 * the two locks involved. Each map has a lock synchronizing
+	 * updates to the entries, and each socket has a lock that
+	 * synchronizes access to the list of maps (map_list). For
+	 * deadlock avoidance the locks need to be taken in the order
+	 * "map lock"->"socket map list lock". We start off by
+	 * accessing the socket map list, and take a reference to the
+	 * map to guarantee existence between the
+	 * xsk_get_map_list_entry() and xsk_map_try_sock_delete()
+	 * calls. Then we ask the map to remove the socket, which
+	 * tries to remove the socket from the map. Note that there
+	 * might be updates to the map between
+	 * xsk_get_map_list_entry() and xsk_map_try_sock_delete().
+	 */
+	struct xdp_sock **map_entry = NULL;
+	struct xsk_map *map;
+
+	while ((map = xsk_get_map_list_entry(xs, &map_entry))) {
+		xsk_map_try_sock_delete(map, xs, map_entry);
+		xsk_map_put(map);
+	}
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -381,6 +427,7 @@ static int xsk_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	local_bh_enable();
 
+	xsk_delete_from_maps(xs);
 	xsk_unbind_dev(xs);
 
 	xskq_destroy(xs->rx);
@@ -855,6 +902,9 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 	spin_lock_init(&xs->rx_lock);
 	spin_lock_init(&xs->tx_completion_lock);
 
+	INIT_LIST_HEAD(&xs->map_list);
+	spin_lock_init(&xs->map_list_lock);
+
 	mutex_lock(&net->xdp.lock);
 	sk_add_node_rcu(sk, &net->xdp.list);
 	mutex_unlock(&net->xdp.lock);
-- 
2.20.1


^ permalink raw reply related

* [PATCH bpf-next v5 0/2] net: xdp: XSKMAP improvements
From: Björn Töpel @ 2019-08-15  9:30 UTC (permalink / raw)
  To: ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, jonathan.lemon,
	bjorn.topel, bruce.richardson, songliubraving, bpf

This series (v5 and counting) add two improvements for the XSKMAP,
used by AF_XDP sockets.

1. Automatic cleanup when an AF_XDP socket goes out of scope/is
   released. Instead of require that the user manually clears the
   "released" state socket from the map, this is done
   automatically. Each socket tracks which maps it resides in, and
   remove itself from those maps at relase. A notable implementation
   change, is that the sockets references the map, instead of the map
   referencing the sockets. Which implies that when the XSKMAP is
   freed, it is by definition cleared of sockets.

2. The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flag on insert,
   which this patch addresses.


Thanks,
Björn

v1->v2: Fixed deadlock and broken cleanup. (Daniel)
v2->v3: Rebased onto bpf-next
v3->v4: {READ, WRITE}_ONCE consistency. (Daniel)
        Socket release/map update race. (Daniel)
v4->v5: Avoid use-after-free on XSKMAP self-assignment [1]. (Daniel)
        Removed redundant assignment in xsk_map_update_elem().
        Variable name consistency; Use map_entry everywhere.

[1] https://lore.kernel.org/bpf/20190802081154.30962-1-bjorn.topel@gmail.com/T/#mc68439e97bc07fa301dad9fc4850ed5aa392f385

Björn Töpel (2):
  xsk: remove AF_XDP socket from map when the socket is released
  xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP

 include/net/xdp_sock.h |  18 ++++++
 kernel/bpf/xskmap.c    | 133 ++++++++++++++++++++++++++++++++++-------
 net/xdp/xsk.c          |  50 ++++++++++++++++
 3 files changed, 179 insertions(+), 22 deletions(-)

-- 
2.20.1


^ permalink raw reply

* Re: [PATCH] virtio-net: lower min ring num_free for efficiency
From: Jason Wang @ 2019-08-15  9:25 UTC (permalink / raw)
  To: 冉 jiang, mst@redhat.com
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, hawk@kernel.org,
	john.fastabend@gmail.com, kafai@fb.com, songliubraving@fb.com,
	yhs@fb.com, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	xdp-newbies@vger.kernel.org, bpf@vger.kernel.org,
	jiangran.jr@alibaba-inc.com
In-Reply-To: <BYAPR14MB320512CCA27487548DDAA57FA6AC0@BYAPR14MB3205.namprd14.prod.outlook.com>


On 2019/8/15 下午4:36, 冉 jiang wrote:
> On 2019/8/15 11:17, Jason Wang wrote:
>> On 2019/8/15 上午11:11, 冉 jiang wrote:
>>> On 2019/8/15 11:01, Jason Wang wrote:
>>>> On 2019/8/14 上午10:06, ? jiang wrote:
>>>>> This change lowers ring buffer reclaim threshold from 1/2*queue to
>>>>> budget
>>>>> for better performance. According to our test with qemu + dpdk, packet
>>>>> dropping happens when the guest is not able to provide free buffer in
>>>>> avail ring timely with default 1/2*queue. The value in the patch has
>>>>> been
>>>>> tested and does show better performance.
>>>> Please add your tests setup and result here.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
>>>>> ---
>>>>>     drivers/net/virtio_net.c | 2 +-
>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 0d4115c9e20b..bc08be7925eb 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -1331,7 +1331,7 @@ static int virtnet_receive(struct receive_queue
>>>>> *rq, int budget,
>>>>>             }
>>>>>         }
>>>>>     -    if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
>>>>> +    if (rq->vq->num_free > min((unsigned int)budget,
>>>>> virtqueue_get_vring_size(rq->vq)) / 2) {
>>>>>             if (!try_fill_recv(vi, rq, GFP_ATOMIC))
>>>>>                 schedule_delayed_work(&vi->refill, 0);
>>>>>         }
>>> Sure, here are the details:
>>
>> Thanks for the details, but I meant it's better if you could summarize
>> you test result in the commit log in a compact way.
>>
>> Btw, some comments, see below:
>>
>>
>>>
>>> Test setup & result:
>>>
>>> ----------------------------------------------------
>>>
>>> Below is the snippet from our test result. Test1 was done with default
>>> driver with the value of 1/2 * queue, while test2 is with my patch. We
>>> can see average
>>> drop packets do decrease a lot in test2.
>>>
>>> test1Time    avgDropPackets    test2Time    avgDropPackets pps
>>>
>>> 16:21.0    12.295    56:50.4    0 300k
>>> 17:19.1    15.244    56:50.4    0    300k
>>> 18:17.5    18.789    56:50.4    0    300k
>>> 19:15.1    14.208    56:50.4    0    300k
>>> 20:13.2    20.818    56:50.4    0.267    300k
>>> 21:11.2    12.397    56:50.4    0    300k
>>> 22:09.3    12.599    56:50.4    0    300k
>>> 23:07.3    15.531    57:48.4    0    300k
>>> 24:05.5    13.664    58:46.5    0    300k
>>> 25:03.7    13.158    59:44.5    4.73    300k
>>> 26:01.1    2.486    00:42.6    0    300k
>>> 26:59.1    11.241    01:40.6    0    300k
>>> 27:57.2    20.521    02:38.6    0    300k
>>> 28:55.2    30.094    03:36.7    0    300k
>>> 29:53.3    16.828    04:34.7    0.963    300k
>>> 30:51.3    46.916    05:32.8    0    400k
>>> 31:49.3    56.214    05:32.8    0    400k
>>> 32:47.3    58.69    05:32.8    0    400k
>>> 33:45.3    61.486    05:32.8    0    400k
>>> 34:43.3    72.175    05:32.8    0.598    400k
>>> 35:41.3    56.699    05:32.8    0    400k
>>> 36:39.3    61.071    05:32.8    0    400k
>>> 37:37.3    43.355    06:30.8    0    400k
>>> 38:35.4    44.644    06:30.8    0    400k
>>> 39:33.4    72.336    06:30.8    0    400k
>>> 40:31.4    70.676    06:30.8    0    400k
>>> 41:29.4    108.009    06:30.8    0    400k
>>> 42:27.4    65.216    06:30.8    0    400k
>>
>> Why there're difference in test time? Could you summarize them like:
>>
>> Test setup: e.g testpmd or pktgen to generate packets to guest
>>
>> avg packets drop before: XXX
>>
>> avg packets drop after: YYY(-ZZZ%)
>>
>> Thanks
>>
>>
>>>
>>> Data to prove why the patch helps:
>>>
>>> ----------------------------------------------------
>>>
>>> We did have completed several rounds of test with setting the value to
>>> budget (64 as the default value). It does improve a lot with pps is
>>> below 400pps for a single stream. We are confident that it runs out
>>> of free
>>> buffer in avail ring when packet dropping happens with below systemtap:
>>>
>>> Just a snippet:
>>>
>>> probe module("virtio_ring").function("virtqueue_get_buf")
>>> {
>>>         x = (@cast($_vq, "vring_virtqueue")->vring->used->idx)-
>>> (@cast($_vq, "vring_virtqueue")->last_used_idx) ---> we use this one
>>> to verify if the queue is full, which means guest is not able to take
>>> buffer from the queue timely
>>>
>>>         if (x<0 && (x+65535)<4096)
>>>             x = x+65535
>>>
>>>         if((x==1024) && @cast($_vq, "vring_virtqueue")->vq->callback ==
>>> callback_addr)
>>>             netrxcount[x] <<< gettimeofday_s()
>>> }
>>>
>>>
>>> probe module("virtio_ring").function("virtqueue_add_inbuf")
>>> {
>>>         y = (@cast($vq, "vring_virtqueue")->vring->avail->idx)-
>>> (@cast($vq, "vring_virtqueue")->vring->used->idx) ---> we use this one
>>> to verify if we run out of free buffer in avail ring
>>>         if (y<0 && (y+65535)<4096)
>>>             y = y+65535
>>>
>>>         if(@2=="debugon")
>>>         {
>>>             if(y==0 && @cast($vq, "vring_virtqueue")->vq->callback ==
>>> callback_addr)
>>>             {
>>>                 netrxfreecount[y] <<< gettimeofday_s()
>>>
>>>                 printf("no avail ring left seen, printing most recent 5
>>> num free, vq: %lx, current index: %d\n", $vq, recentfreecount)
>>>                 for(i=recentfreecount; i!=((recentfreecount+4) % 5);
>>> i=((i+1) % 5))
>>>                 {
>>>                     printf("index: %d, num free: %d\n", i,
>>> recentfree[$vq,
>>> i])
>>>                 }
>>>
>>>                 printf("index: %d, num free: %d\n", i, recentfree[$vq,
>>> i])
>>>                 //exit()
>>>             }
>>>         }
>>> }
>>>
>>>
>>> probe
>>> module("virtio_net").statement("virtnet_receive@drivers/net/virtio_net.c:732")
>>>
>>>
>>> {
>>>         recentfreecount++
>>>         recentfreecount = recentfreecount % 5
>>>         recentfree[$rq->vq, recentfreecount] = $rq->vq->num_free --->
>>> record the num_free for the last 5 calls to virtnet_receive, so we can
>>> see if lowering the bar helps.
>>> }
>>>
>>>
>>> Here is the result:
>>>
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 1
>>> index: 1, num free: 561
>>> index: 2, num free: 305
>>> index: 3, num free: 369
>>> index: 4, num free: 433
>>> index: 0, num free: 497
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 1
>>> index: 1, num free: 543
>>> index: 2, num free: 463
>>> index: 3, num free: 469
>>> index: 4, num free: 476
>>> index: 0, num free: 479
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 2
>>> index: 2, num free: 555
>>> index: 3, num free: 414
>>> index: 4, num free: 420
>>> index: 0, num free: 427
>>> index: 1, num free: 491
>>>
>>> We can see in the last 4 calls to virtnet_receive before we run out
>>> of free buffer and start to relaim, num_free is quite high. So if we
>>> can do the reclaim earlier, it will certainly help.
>>>
>>> Jiang
>>
>> Right, but I think there's no need to put those thing in the commit log.
>>
>> Thanks
>>
>>
> Sure, here is the info:
>
>
> Test setup: iperf3 to generate packets to guest (total 30mins, pps 400k)


Please also note that type of packets e.g TCP or UDP.

Thanks


>
> avg packets drop before: 2842
>
> avg packets drop after: 360(-87.3%)
>
>
> Just let me know if it looks good enough. Thx.
>
> Jiang
>

^ permalink raw reply

* Re: [PATCH] virtio-net: lower min ring num_free for efficiency
From: Jason Wang @ 2019-08-15  9:22 UTC (permalink / raw)
  To: 冉 jiang, mst@redhat.com
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net,
	jakub.kicinski@netronome.com, hawk@kernel.org,
	john.fastabend@gmail.com, kafai@fb.com, songliubraving@fb.com,
	yhs@fb.com, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	xdp-newbies@vger.kernel.org, bpf@vger.kernel.org,
	jiangran.jr@alibaba-inc.com
In-Reply-To: <BYAPR14MB320512CCA27487548DDAA57FA6AC0@BYAPR14MB3205.namprd14.prod.outlook.com>


On 2019/8/15 下午4:36, 冉 jiang wrote:
> On 2019/8/15 11:17, Jason Wang wrote:
>> On 2019/8/15 上午11:11, 冉 jiang wrote:
>>> On 2019/8/15 11:01, Jason Wang wrote:
>>>> On 2019/8/14 上午10:06, ? jiang wrote:
>>>>> This change lowers ring buffer reclaim threshold from 1/2*queue to
>>>>> budget
>>>>> for better performance. According to our test with qemu + dpdk, packet
>>>>> dropping happens when the guest is not able to provide free buffer in
>>>>> avail ring timely with default 1/2*queue. The value in the patch has
>>>>> been
>>>>> tested and does show better performance.
>>>> Please add your tests setup and result here.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> Signed-off-by: jiangkidd <jiangkidd@hotmail.com>
>>>>> ---
>>>>>     drivers/net/virtio_net.c | 2 +-
>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>> index 0d4115c9e20b..bc08be7925eb 100644
>>>>> --- a/drivers/net/virtio_net.c
>>>>> +++ b/drivers/net/virtio_net.c
>>>>> @@ -1331,7 +1331,7 @@ static int virtnet_receive(struct receive_queue
>>>>> *rq, int budget,
>>>>>             }
>>>>>         }
>>>>>     -    if (rq->vq->num_free > virtqueue_get_vring_size(rq->vq) / 2) {
>>>>> +    if (rq->vq->num_free > min((unsigned int)budget,
>>>>> virtqueue_get_vring_size(rq->vq)) / 2) {
>>>>>             if (!try_fill_recv(vi, rq, GFP_ATOMIC))
>>>>>                 schedule_delayed_work(&vi->refill, 0);
>>>>>         }
>>> Sure, here are the details:
>>
>> Thanks for the details, but I meant it's better if you could summarize
>> you test result in the commit log in a compact way.
>>
>> Btw, some comments, see below:
>>
>>
>>>
>>> Test setup & result:
>>>
>>> ----------------------------------------------------
>>>
>>> Below is the snippet from our test result. Test1 was done with default
>>> driver with the value of 1/2 * queue, while test2 is with my patch. We
>>> can see average
>>> drop packets do decrease a lot in test2.
>>>
>>> test1Time    avgDropPackets    test2Time    avgDropPackets pps
>>>
>>> 16:21.0    12.295    56:50.4    0 300k
>>> 17:19.1    15.244    56:50.4    0    300k
>>> 18:17.5    18.789    56:50.4    0    300k
>>> 19:15.1    14.208    56:50.4    0    300k
>>> 20:13.2    20.818    56:50.4    0.267    300k
>>> 21:11.2    12.397    56:50.4    0    300k
>>> 22:09.3    12.599    56:50.4    0    300k
>>> 23:07.3    15.531    57:48.4    0    300k
>>> 24:05.5    13.664    58:46.5    0    300k
>>> 25:03.7    13.158    59:44.5    4.73    300k
>>> 26:01.1    2.486    00:42.6    0    300k
>>> 26:59.1    11.241    01:40.6    0    300k
>>> 27:57.2    20.521    02:38.6    0    300k
>>> 28:55.2    30.094    03:36.7    0    300k
>>> 29:53.3    16.828    04:34.7    0.963    300k
>>> 30:51.3    46.916    05:32.8    0    400k
>>> 31:49.3    56.214    05:32.8    0    400k
>>> 32:47.3    58.69    05:32.8    0    400k
>>> 33:45.3    61.486    05:32.8    0    400k
>>> 34:43.3    72.175    05:32.8    0.598    400k
>>> 35:41.3    56.699    05:32.8    0    400k
>>> 36:39.3    61.071    05:32.8    0    400k
>>> 37:37.3    43.355    06:30.8    0    400k
>>> 38:35.4    44.644    06:30.8    0    400k
>>> 39:33.4    72.336    06:30.8    0    400k
>>> 40:31.4    70.676    06:30.8    0    400k
>>> 41:29.4    108.009    06:30.8    0    400k
>>> 42:27.4    65.216    06:30.8    0    400k
>>
>> Why there're difference in test time? Could you summarize them like:
>>
>> Test setup: e.g testpmd or pktgen to generate packets to guest
>>
>> avg packets drop before: XXX
>>
>> avg packets drop after: YYY(-ZZZ%)
>>
>> Thanks
>>
>>
>>>
>>> Data to prove why the patch helps:
>>>
>>> ----------------------------------------------------
>>>
>>> We did have completed several rounds of test with setting the value to
>>> budget (64 as the default value). It does improve a lot with pps is
>>> below 400pps for a single stream. We are confident that it runs out
>>> of free
>>> buffer in avail ring when packet dropping happens with below systemtap:
>>>
>>> Just a snippet:
>>>
>>> probe module("virtio_ring").function("virtqueue_get_buf")
>>> {
>>>         x = (@cast($_vq, "vring_virtqueue")->vring->used->idx)-
>>> (@cast($_vq, "vring_virtqueue")->last_used_idx) ---> we use this one
>>> to verify if the queue is full, which means guest is not able to take
>>> buffer from the queue timely
>>>
>>>         if (x<0 && (x+65535)<4096)
>>>             x = x+65535
>>>
>>>         if((x==1024) && @cast($_vq, "vring_virtqueue")->vq->callback ==
>>> callback_addr)
>>>             netrxcount[x] <<< gettimeofday_s()
>>> }
>>>
>>>
>>> probe module("virtio_ring").function("virtqueue_add_inbuf")
>>> {
>>>         y = (@cast($vq, "vring_virtqueue")->vring->avail->idx)-
>>> (@cast($vq, "vring_virtqueue")->vring->used->idx) ---> we use this one
>>> to verify if we run out of free buffer in avail ring
>>>         if (y<0 && (y+65535)<4096)
>>>             y = y+65535
>>>
>>>         if(@2=="debugon")
>>>         {
>>>             if(y==0 && @cast($vq, "vring_virtqueue")->vq->callback ==
>>> callback_addr)
>>>             {
>>>                 netrxfreecount[y] <<< gettimeofday_s()
>>>
>>>                 printf("no avail ring left seen, printing most recent 5
>>> num free, vq: %lx, current index: %d\n", $vq, recentfreecount)
>>>                 for(i=recentfreecount; i!=((recentfreecount+4) % 5);
>>> i=((i+1) % 5))
>>>                 {
>>>                     printf("index: %d, num free: %d\n", i,
>>> recentfree[$vq,
>>> i])
>>>                 }
>>>
>>>                 printf("index: %d, num free: %d\n", i, recentfree[$vq,
>>> i])
>>>                 //exit()
>>>             }
>>>         }
>>> }
>>>
>>>
>>> probe
>>> module("virtio_net").statement("virtnet_receive@drivers/net/virtio_net.c:732")
>>>
>>>
>>> {
>>>         recentfreecount++
>>>         recentfreecount = recentfreecount % 5
>>>         recentfree[$rq->vq, recentfreecount] = $rq->vq->num_free --->
>>> record the num_free for the last 5 calls to virtnet_receive, so we can
>>> see if lowering the bar helps.
>>> }
>>>
>>>
>>> Here is the result:
>>>
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 1
>>> index: 1, num free: 561
>>> index: 2, num free: 305
>>> index: 3, num free: 369
>>> index: 4, num free: 433
>>> index: 0, num free: 497
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 1
>>> index: 1, num free: 543
>>> index: 2, num free: 463
>>> index: 3, num free: 469
>>> index: 4, num free: 476
>>> index: 0, num free: 479
>>> no avail ring left seen, printing most recent 5 num free, vq:
>>> ffff9c13c1200000, current index: 2
>>> index: 2, num free: 555
>>> index: 3, num free: 414
>>> index: 4, num free: 420
>>> index: 0, num free: 427
>>> index: 1, num free: 491
>>>
>>> We can see in the last 4 calls to virtnet_receive before we run out
>>> of free buffer and start to relaim, num_free is quite high. So if we
>>> can do the reclaim earlier, it will certainly help.
>>>
>>> Jiang
>>
>> Right, but I think there's no need to put those thing in the commit log.
>>
>> Thanks
>>
>>
> Sure, here is the info:
>
>
> Test setup: iperf3 to generate packets to guest (total 30mins, pps 400k)
>
> avg packets drop before: 2842
>
> avg packets drop after: 360(-87.3%)
>
>
> Just let me know if it looks good enough. Thx.
>
> Jiang


Looks good, please post a V2 and include the above result in the commit log.

Thanks

>

^ permalink raw reply

* Re: [PATCH] tun: fix use-after-free when register netdev failed
From: Jason Wang @ 2019-08-15  9:21 UTC (permalink / raw)
  To: Yang Yingliang, netdev; +Cc: xiyou.wangcong, davem
In-Reply-To: <1565857122-24660-1-git-send-email-yangyingliang@huawei.com>


On 2019/8/15 下午4:18, Yang Yingliang wrote:
> I got a UAF repport in tun driver when doing fuzzy test:
>
> [  466.269490] ==================================================================
> [  466.271792] BUG: KASAN: use-after-free in tun_chr_read_iter+0x2ca/0x2d0
> [  466.271806] Read of size 8 at addr ffff888372139250 by task tun-test/2699
> [  466.271810]
> [  466.271824] CPU: 1 PID: 2699 Comm: tun-test Not tainted 5.3.0-rc1-00001-g5a9433db2614-dirty #427
> [  466.271833] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> [  466.271838] Call Trace:
> [  466.271858]  dump_stack+0xca/0x13e
> [  466.271871]  ? tun_chr_read_iter+0x2ca/0x2d0
> [  466.271890]  print_address_description+0x79/0x440
> [  466.271906]  ? vprintk_func+0x5e/0xf0
> [  466.271920]  ? tun_chr_read_iter+0x2ca/0x2d0
> [  466.271935]  __kasan_report+0x15c/0x1df
> [  466.271958]  ? tun_chr_read_iter+0x2ca/0x2d0
> [  466.271976]  kasan_report+0xe/0x20
> [  466.271987]  tun_chr_read_iter+0x2ca/0x2d0
> [  466.272013]  do_iter_readv_writev+0x4b7/0x740
> [  466.272032]  ? default_llseek+0x2d0/0x2d0
> [  466.272072]  do_iter_read+0x1c5/0x5e0
> [  466.272110]  vfs_readv+0x108/0x180
> [  466.299007]  ? compat_rw_copy_check_uvector+0x440/0x440
> [  466.299020]  ? fsnotify+0x888/0xd50
> [  466.299040]  ? __fsnotify_parent+0xd0/0x350
> [  466.299064]  ? fsnotify_first_mark+0x1e0/0x1e0
> [  466.304548]  ? vfs_write+0x264/0x510
> [  466.304569]  ? ksys_write+0x101/0x210
> [  466.304591]  ? do_preadv+0x116/0x1a0
> [  466.304609]  do_preadv+0x116/0x1a0
> [  466.309829]  do_syscall_64+0xc8/0x600
> [  466.309849]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  466.309861] RIP: 0033:0x4560f9
> [  466.309875] Code: 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> [  466.309889] RSP: 002b:00007ffffa5166e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000127
> [  466.322992] RAX: ffffffffffffffda RBX: 0000000000400460 RCX: 00000000004560f9
> [  466.322999] RDX: 0000000000000003 RSI: 00000000200008c0 RDI: 0000000000000003
> [  466.323007] RBP: 00007ffffa516700 R08: 0000000000000004 R09: 0000000000000000
> [  466.323014] R10: 0000000000000000 R11: 0000000000000206 R12: 000000000040cb10
> [  466.323021] R13: 0000000000000000 R14: 00000000006d7018 R15: 0000000000000000
> [  466.323057]
> [  466.323064] Allocated by task 2605:
> [  466.335165]  save_stack+0x19/0x80
> [  466.336240]  __kasan_kmalloc.constprop.8+0xa0/0xd0
> [  466.337755]  kmem_cache_alloc+0xe8/0x320
> [  466.339050]  getname_flags+0xca/0x560
> [  466.340229]  user_path_at_empty+0x2c/0x50
> [  466.341508]  vfs_statx+0xe6/0x190
> [  466.342619]  __do_sys_newstat+0x81/0x100
> [  466.343908]  do_syscall_64+0xc8/0x600
> [  466.345303]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  466.347034]
> [  466.347517] Freed by task 2605:
> [  466.348471]  save_stack+0x19/0x80
> [  466.349476]  __kasan_slab_free+0x12e/0x180
> [  466.350726]  kmem_cache_free+0xc8/0x430
> [  466.351874]  putname+0xe2/0x120
> [  466.352921]  filename_lookup+0x257/0x3e0
> [  466.354319]  vfs_statx+0xe6/0x190
> [  466.355498]  __do_sys_newstat+0x81/0x100
> [  466.356889]  do_syscall_64+0xc8/0x600
> [  466.358037]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  466.359567]
> [  466.360050] The buggy address belongs to the object at ffff888372139100
> [  466.360050]  which belongs to the cache names_cache of size 4096
> [  466.363735] The buggy address is located 336 bytes inside of
> [  466.363735]  4096-byte region [ffff888372139100, ffff88837213a100)
> [  466.367179] The buggy address belongs to the page:
> [  466.368604] page:ffffea000dc84e00 refcount:1 mapcount:0 mapping:ffff8883df1b4f00 index:0x0 compound_mapcount: 0
> [  466.371582] flags: 0x2fffff80010200(slab|head)
> [  466.372910] raw: 002fffff80010200 dead000000000100 dead000000000122 ffff8883df1b4f00
> [  466.375209] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
> [  466.377778] page dumped because: kasan: bad access detected
> [  466.379730]
> [  466.380288] Memory state around the buggy address:
> [  466.381844]  ffff888372139100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.384009]  ffff888372139180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.386131] >ffff888372139200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.388257]                                                  ^
> [  466.390234]  ffff888372139280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.392512]  ffff888372139300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [  466.394667] ==================================================================
>
> tun_chr_read_iter() accessed the memory which freed by free_netdev()
> called by tun_set_iff():
>
> 	CPUA				CPUB
>      tun_set_iff()
>        alloc_netdev_mqs()
>        tun_attach()
> 				    tun_chr_read_iter()
> 				      tun_get()
>        register_netdevice()
>        tun_detach_all()
>          synchronize_net()
> 				      tun_do_read()
> 				        tun_ring_recv()
> 				          schedule()
>        free_netdev()
> 				      tun_put() <-- UAF
>
> Set a new bit in tun->flag if register_netdevice() successed,
> without this bit, tun_get() returns NULL to avoid using a
> freed tun pointer.


Good catch.

Some comments inline.


>
> Fixes: eb0fb363f920 ("tuntap: attach queue 0 before registering netdevice")
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
> ---
>   drivers/net/tun.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index db16d7a13e00..cbd60c276c40 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -115,6 +115,7 @@ do {								\
>   /* High bits in flags field are unused. */
>   #define TUN_VNET_LE     0x80000000
>   #define TUN_VNET_BE     0x40000000
> +#define TUN_DEV_REGISTERED	0x20000000
>   
>   #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
>   		      IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
> @@ -719,8 +720,10 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
>   			netif_carrier_off(tun->dev);
>   
>   			if (!(tun->flags & IFF_PERSIST) &&
> -			    tun->dev->reg_state == NETREG_REGISTERED)
> +			    tun->dev->reg_state == NETREG_REGISTERED) {
>   				unregister_netdevice(tun->dev);
> +				tun->flags &= ~TUN_DEV_REGISTERED;
> +			}
>   		}
>   		if (tun)
>   			xdp_rxq_info_unreg(&tfile->xdp_rxq);
> @@ -884,8 +887,10 @@ static struct tun_struct *tun_get(struct tun_file *tfile)
>   
>   	rcu_read_lock();
>   	tun = rcu_dereference(tfile->tun);
> -	if (tun)
> +	if (tun && (tun->flags & TUN_DEV_REGISTERED))
>   		dev_hold(tun->dev);
> +	else
> +		tun = NULL;
>   	rcu_read_unlock();
>   
>   	return tun;
> @@ -2836,6 +2841,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>   		err = register_netdevice(tun->dev);
>   		if (err < 0)
>   			goto err_detach;
> +		tun->flags |= TUN_DEV_REGISTERED;
>   	}
>   
>   	netif_carrier_on(tun->dev);


This looks just a duplicated of netdev->state? However it lacks 
sufficient synchronization like barriers or locks. How about:

- call tun_set_real_num_queues() before register_netdevice() this can 
have the same result as what  eb0fb363f920 did.
- move tun_attach() after register_netdevice() this makes sure we won't 
publish tfile->tun until we are sure at least one refcnt is held by 
register_netdevice()?

Thanks


^ permalink raw reply

* Re: [PATCH net] tunnel: fix dev null pointer dereference when send pkg larger than mtu in collect_md mode
From: Eric Dumazet @ 2019-08-15  9:16 UTC (permalink / raw)
  To: Hangbin Liu, netdev
  Cc: Stefano Brivio, wenxu, Alexei Starovoitov, David S . Miller
In-Reply-To: <20190815060904.19426-1-liuhangbin@gmail.com>



On 8/15/19 8:09 AM, Hangbin Liu wrote:
> When we send a packet larger than PMTU, we need to reply with
> icmp_send(ICMP_FRAG_NEEDED) or icmpv6_send(ICMPV6_PKT_TOOBIG).
> 
> But in collect_md mode, kernel will crash while accessing the dst dev
> as __metadata_dst_init() init dst->dev to NULL by default. Here is what
> the code path looks like, for GRE:
> 
> - ip6gre_tunnel_xmit
>   - ip6gre_xmit_ipv4
>     - __gre6_xmit
>       - ip6_tnl_xmit
>         - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
>     - icmp_send
>       - net = dev_net(rt->dst.dev); <-- here
>   - ip6gre_xmit_ipv6
>     - __gre6_xmit
>       - ip6_tnl_xmit
>         - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
>     - icmpv6_send
>       ...
>       - decode_session4
>         - oif = skb_dst(skb)->dev->ifindex; <-- here
>       - decode_session6
>         - oif = skb_dst(skb)->dev->ifindex; <-- here
> 
> Fix it by updating the dst dev if not set.
> 
> The reproducer is easy:
> 
> ovs-vsctl add-br br0
> ip link set br0 up
> ovs-vsctl add-port br0 gre0 -- \
> 	  set interface gre0 type=gre options:remote_ip=$dst_addr
> ip link set gre0 up
> ip addr add ${local_gre6}/64 dev br0
> ping6 $remote_gre6 -s 1500
> 
> Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
> Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
> Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  net/ipv4/ip_tunnel.c  |  3 +++
>  net/ipv6/ip6_tunnel.c | 13 +++++++++----
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
> index 38c02bb62e2c..c6713c7287df 100644
> --- a/net/ipv4/ip_tunnel.c
> +++ b/net/ipv4/ip_tunnel.c
> @@ -597,6 +597,9 @@ void ip_md_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
>  		goto tx_error;
>  	}
>  
> +	if (skb_dst(skb) && !skb_dst(skb)->dev)
> +		skb_dst(skb)->dev = rt->dst.dev;
> +


IMO this looks wrong.
This dst seems shared. 
Once set, we will reuse the same dev ?

If intended, why not doing this in __metadata_dst_init() instead of in the fast path ?

>  	if (key->tun_flags & TUNNEL_DONT_FRAGMENT)
>  		df = htons(IP_DF);
>  	if (tnl_update_pmtu(dev, skb, rt, df, inner_iph, tunnel_hlen,

^ permalink raw reply

* Re: [PATCH 12/16] arm64: prefer __section from compiler_attributes.h
From: Miguel Ojeda @ 2019-08-15  9:12 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Will Deacon, Andrew Morton, Sedat Dilek, Josh Poimboeuf,
	Yonghong Song, clang-built-linux, Catalin Marinas,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Andrey Konovalov, Greg Kroah-Hartman, Enrico Weigelt,
	Suzuki K Poulose, Thomas Gleixner, Masayoshi Mizuma,
	Shaokun Zhang, Alexios Zavras, Allison Randal, Linux ARM,
	linux-kernel, Network Development, bpf
In-Reply-To: <CANiq72mGoGpx7EAVUPcGuhVkLit8sB3bR-k1XBDyeM8HBUaDZw@mail.gmail.com>

On Thu, Aug 15, 2019 at 11:08 AM Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
>
> On Thu, Aug 15, 2019 at 12:20 AM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > This lone patch of the series is just cosmetic, but patch 14/16 fixes
> > a real boot issue:
> > https://github.com/ClangBuiltLinux/linux/issues/619
> > Miguel, I'd like to get that one landed ASAP; the rest are just for consistency.
>
> Ah, interesting. It would be best to have sent that one independently
> to the others, plus adding a commit message mentioning this in
> particular. Let's talk about that in the thread.

Btw, I guess that is the Oops you were mentioning in the cover letter?

Cheers,
Miguel

^ permalink raw reply

* Re: [PATCH 12/16] arm64: prefer __section from compiler_attributes.h
From: Miguel Ojeda @ 2019-08-15  9:08 UTC (permalink / raw)
  To: Nick Desaulniers
  Cc: Will Deacon, Andrew Morton, Sedat Dilek, Josh Poimboeuf,
	Yonghong Song, clang-built-linux, Catalin Marinas,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Andrey Konovalov, Greg Kroah-Hartman, Enrico Weigelt,
	Suzuki K Poulose, Thomas Gleixner, Masayoshi Mizuma,
	Shaokun Zhang, Alexios Zavras, Allison Randal, Linux ARM,
	linux-kernel, Network Development, bpf
In-Reply-To: <CAKwvOdk4hca8WzWzhcPEvxXnJVLbXGnhBdDZbeL_W_H91Ttjqw@mail.gmail.com>

On Thu, Aug 15, 2019 at 12:20 AM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> This lone patch of the series is just cosmetic, but patch 14/16 fixes
> a real boot issue:
> https://github.com/ClangBuiltLinux/linux/issues/619
> Miguel, I'd like to get that one landed ASAP; the rest are just for consistency.

Ah, interesting. It would be best to have sent that one independently
to the others, plus adding a commit message mentioning this in
particular. Let's talk about that in the thread.

> Miguel, how do you want to take the rest of these patches? Will picked
> up the arm64 one, I think the SuperH one got picked up.  There was
> feedback to add more info to individual commits' commit messages.

Yes, I told Will I would pick up whatever is not already picked up by
individual maintainers.

> I kept these tree wide changes separate to improve the likelihood that
> they'd backport to stable cleanly, but could always squash if you'd
> prefer to have 1 patch instead of a series.  Just let me know.

Since you already did the splitting work, let's take advantage of it.
I prefer them to be split anyway, since that gives maintainers a
chance to pick them up individually if they prefer to do so.

Cheers,
Miguel

^ permalink raw reply

* Re: [PATCH] net: ethernet: mediatek: Add MT7628/88 SoC support
From: Stefan Roese @ 2019-08-15  9:08 UTC (permalink / raw)
  To: René van Dorst
  Cc: netdev, linux-mediatek, Sean Wang, John Crispin, Daniel Golle
In-Reply-To: <20190814130856.Horde.wzHL8_VRawJ8NIIk--BD18e@www.vdorst.com>

Hi Rene,

On 14.08.19 15:08, René van Dorst wrote:
> Quoting Stefan Roese <sr@denx.de>:
> 
>> Hi Rene,
>>
>> On 14.08.19 11:26, René van Dorst wrote:
> 
> <snip>
> 
>>> Great, Thanks for addressing this issue.
>>>
>>> I hope we can collaborate to also support mt76x8 in my PHYLINK
>>> patches [0][1].
>>> I am close to posting V2 of the patches but I am currently waiting on some
>>> fiber modules to test the changes better.
>>
>> I do have a "hackish" DSA driver for the integrated switch (ESW) in my
>> tree. If time permits, I'll work on upstreaming this one as well. And
>> yes, hopefully we can collaborate on your PHYLINK work too.
> 
> It is not only the switch driver but also the Mediatek ethernet driver that is
> converted to PHYLINK. So we have a conflict in each others work.

Yes, I am aware of this.
  
> I don't no what the right way is to go but I was thinking about 2 options
> 
> 1. Lets say your work goes in first. I rebase my patches on your changes.
>      We collaborate to create an extra PHYLINK patch ontop of my work
> for your SOC.
> 2. My patches goes in first and you adapt your patches to that.
> 
> What do you think?

It really depends on the timing, when the patches arrive in the kernel
(net-next). If yours makes it first, I'll rebase my patch on top of
your work. Otherwise you will need to rebase yours.
  
> I have latest changes here [0].
> 
> Also my modules did arrive so I can test my changes.

Thanks,
Stefan

^ permalink raw reply

* Re: [PATCH v5] perf machine: arm/arm64: Improve completeness for kernel address space
From: Adrian Hunter @ 2019-08-15  8:54 UTC (permalink / raw)
  To: Leo Yan, Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, linux-kernel, netdev,
	bpf, clang-built-linux
  Cc: Mathieu Poirier, Peter Zijlstra, Suzuki Poulouse, coresight,
	linux-arm-kernel
In-Reply-To: <20190815082521.16885-1-leo.yan@linaro.org>

On 15/08/19 11:25 AM, Leo Yan wrote:
> Arm and arm64 architecture reserve some memory regions prior to the
> symbol '_stext' and these memory regions later will be used by device
> module and BPF jit.  The current code misses to consider these memory
> regions thus any address in the regions will be taken as user space
> mode, but perf cannot find the corresponding dso with the wrong CPU
> mode so we misses to generate samples for device module and BPF
> related trace data.
> 
> This patch parse the link scripts to get the memory size prior to start
> address and reduce this size from 'machine>->kernel_start', then can
> get a fixed up kernel start address which contain memory regions for
> device module and BPF.  Finally, machine__get_kernel_start() can reflect
> more complete kernel memory regions and perf can successfully generate
> samples.
> 
> The reason for parsing the link scripts is Arm architecture changes text
> offset dependent on different platforms, which define multiple text
> offsets in $kernel/arch/arm/Makefile.  This offset is decided when build
> kernel and the final value is extended in the link script, so we can
> extract the used value from the link script.  We use the same way to
> parse arm64 link script as well.  If fail to find the link script, the
> pre start memory size is assumed as zero, in this case it has no any
> change caused with this patch.
> 
> Below is detailed info for testing this patch:
> 
> - Install or build LLVM/Clang;
> 
> - Configure perf with ~/.perfconfig:
> 
>   root@debian:~# cat ~/.perfconfig
>   # this file is auto-generated.
>   [llvm]
>           clang-path = /mnt/build/llvm-build/build/install/bin/clang
>           kbuild-dir = /mnt/linux-kernel/linux-cs-dev/
>           clang-opt = "-g"
>           dump-obj = true
> 
>   [trace]
>           show_zeros = yes
>           show_duration = no
>           no_inherit = yes
>           show_timestamp = no
>           show_arg_names = no
>           args_alignment = 40
>           show_prefix = yes
> 
> - Run 'perf trace' command with eBPF event:
> 
>   root@debian:~# perf trace -e string \
>       -e $kernel/tools/perf/examples/bpf/augmented_raw_syscalls.c
> 
> - Read eBPF program memory mapping in kernel:
> 
>   root@debian:~# echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
>   root@debian:~# cat /proc/kallsyms | grep -E "bpf_prog_.+_sys_[enter|exit]"
>   ffff00000008a0d0 t bpf_prog_e470211b846088d5_sys_enter  [bpf]
>   ffff00000008c6a4 t bpf_prog_29c7ae234d79bd5c_sys_exit   [bpf]
> 
> - Launch any program which accesses file system frequently so can hit
>   the system calls trace flow with eBPF event;
> 
> - Capture CoreSight trace data with filtering eBPF program:
> 
>   root@debian:~# perf record -e cs_etm/@tmc_etr0/ \
> 	--filter 'filter 0xffff00000008a0d0/0x800' -a sleep 5s
> 
> - Decode the eBPF program symbol 'bpf_prog_f173133dc38ccf87_sys_enter':
> 
>   root@debian:~# perf script -F,ip,sym
>   Frame deformatter: Found 4 FSYNCS
>                   0 [unknown]
>    ffff00000008a1ac bpf_prog_e470211b846088d5_sys_enter
>    ffff00000008a250 bpf_prog_e470211b846088d5_sys_enter
>                   0 [unknown]
>    ffff00000008a124 bpf_prog_e470211b846088d5_sys_enter
>                   0 [unknown]
>    ffff00000008a14c bpf_prog_e470211b846088d5_sys_enter
>    ffff00000008a13c bpf_prog_e470211b846088d5_sys_enter
>    ffff00000008a14c bpf_prog_e470211b846088d5_sys_enter
>                   0 [unknown]
>    ffff00000008a180 bpf_prog_e470211b846088d5_sys_enter
>                   0 [unknown]
>    ffff00000008a1ac bpf_prog_e470211b846088d5_sys_enter
>    ffff00000008a190 bpf_prog_e470211b846088d5_sys_enter
>    ffff00000008a1ac bpf_prog_e470211b846088d5_sys_enter
>    ffff00000008a250 bpf_prog_e470211b846088d5_sys_enter
>                   0 [unknown]
>    ffff00000008a124 bpf_prog_e470211b846088d5_sys_enter
>                   0 [unknown]
>    ffff00000008a14c bpf_prog_e470211b846088d5_sys_enter
>                   0 [unknown]
>    ffff00000008a180 bpf_prog_e470211b846088d5_sys_enter
>    [...]
> 
> Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: coresight@lists.linaro.org
> Cc: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
>  tools/perf/Makefile.config | 22 ++++++++++++++++++++++
>  tools/perf/util/machine.c  | 15 ++++++++++++++-
>  2 files changed, 36 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
> index e4988f49ea79..d7ff839d8b20 100644
> --- a/tools/perf/Makefile.config
> +++ b/tools/perf/Makefile.config
> @@ -48,9 +48,20 @@ ifeq ($(SRCARCH),x86)
>    NO_PERF_REGS := 0
>  endif
>  
> +ARM_PRE_START_SIZE := 0
> +
>  ifeq ($(SRCARCH),arm)
>    NO_PERF_REGS := 0
>    LIBUNWIND_LIBS = -lunwind -lunwind-arm
> +  ifneq ($(wildcard $(srctree)/arch/$(SRCARCH)/kernel/vmlinux.lds),)
> +    # Extract info from lds:
> +    #   . = ((0xC0000000)) + 0x00208000;
> +    # ARM_PRE_START_SIZE := 0x00208000
> +    ARM_PRE_START_SIZE := $(shell egrep ' \. \= \({2}0x[0-9a-fA-F]+\){2}' \
> +      $(srctree)/arch/$(SRCARCH)/kernel/vmlinux.lds | \
> +      sed -e 's/[(|)|.|=|+|<|;|-]//g' -e 's/ \+/ /g' -e 's/^[ \t]*//' | \
> +      awk -F' ' '{printf "0x%x", $$2}' 2>/dev/null)
> +  endif
>  endif
>  
>  ifeq ($(SRCARCH),arm64)
> @@ -58,8 +69,19 @@ ifeq ($(SRCARCH),arm64)
>    NO_SYSCALL_TABLE := 0
>    CFLAGS += -I$(OUTPUT)arch/arm64/include/generated
>    LIBUNWIND_LIBS = -lunwind -lunwind-aarch64
> +  ifneq ($(wildcard $(srctree)/arch/$(SRCARCH)/kernel/vmlinux.lds),)
> +    # Extract info from lds:
> +    #  . = ((((((((0xffffffffffffffff)) - (((1)) << (48)) + 1) + (0)) + (0x08000000))) + (0x08000000))) + 0x00080000;
> +    # ARM_PRE_START_SIZE := (0x08000000 + 0x08000000 + 0x00080000) = 0x10080000
> +    ARM_PRE_START_SIZE := $(shell egrep ' \. \= \({8}0x[0-9a-fA-F]+\){2}' \
> +      $(srctree)/arch/$(SRCARCH)/kernel/vmlinux.lds | \
> +      sed -e 's/[(|)|.|=|+|<|;|-]//g' -e 's/ \+/ /g' -e 's/^[ \t]*//' | \
> +      awk -F' ' '{printf "0x%x", $$6+$$7+$$8}' 2>/dev/null)
> +  endif

So, that is not going to work if you take a perf.data file to a non-arm machine?

How come you cannot use kallsyms to get the information?

>  endif
>  
> +CFLAGS += -DARM_PRE_START_SIZE=$(ARM_PRE_START_SIZE)
> +
>  ifeq ($(SRCARCH),csky)
>    NO_PERF_REGS := 0
>  endif
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index f6ee7fbad3e4..e993f891bb82 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -2687,13 +2687,26 @@ int machine__get_kernel_start(struct machine *machine)
>  	machine->kernel_start = 1ULL << 63;
>  	if (map) {
>  		err = map__load(map);
> +		if (err)
> +			return err;
> +
>  		/*
>  		 * On x86_64, PTI entry trampolines are less than the
>  		 * start of kernel text, but still above 2^63. So leave
>  		 * kernel_start = 1ULL << 63 for x86_64.
>  		 */
> -		if (!err && !machine__is(machine, "x86_64"))
> +		if (!machine__is(machine, "x86_64"))
>  			machine->kernel_start = map->start;
> +
> +		/*
> +		 * On arm/arm64, the kernel uses some memory regions which are
> +		 * prior to '_stext' symbol; to reflect the complete kernel
> +		 * address space, compensate these pre-defined regions for
> +		 * kernel start address.
> +		 */
> +		if (!strcmp(perf_env__arch(machine->env), "arm") ||
> +		    !strcmp(perf_env__arch(machine->env), "arm64"))
> +			machine->kernel_start -= ARM_PRE_START_SIZE;
>  	}
>  	return err;
>  }
> 


^ permalink raw reply

* Re: WARNING in is_bpf_text_address
From: Daniel Borkmann @ 2019-08-15  8:52 UTC (permalink / raw)
  To: Hillf Danton, syzbot
  Cc: ast@kernel.org, bpf@vger.kernel.org, bvanassche@acm.org,
	dvyukov@google.com, hawk@kernel.org, jakub.kicinski@netronome.com,
	johannes.berg@intel.com, johannes@sipsolutions.net,
	john.fastabend@gmail.com, linux-kernel@vger.kernel.org,
	longman@redhat.com, netdev@vger.kernel.org,
	syzkaller-bugs@googlegroups.com, will
In-Reply-To: <20190811083658.10748-1-hdanton@sina.com>

On 8/11/19 10:36 AM, Hillf Danton wrote:
> On Sun, 11 Aug 2019 08:24:09 +0800
>>
>> syzbot has found a reproducer for the following crash on:
>>
>> HEAD commit:    451577f3 Merge tag 'kbuild-fixes-v5.3-3' of git://git.kern..
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=120850a6600000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=2031e7d221391b8a
>> dashboard link: https://syzkaller.appspot.com/bug?extid=bd3bba6ff3fcea7a6ec6
>> compiler:       clang version 9.0.0 (/home/glider/llvm/clang 80fee25776c2fb61e74c1ecb1a523375c2500b69)
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=130ffe4a600000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17137d2c600000
>>
>> The bug was bisected to:
>>
>> commit a0b0fd53e1e67639b303b15939b9c653dbe7a8c4
>> Author: Bart Van Assche <bvanassche@acm.org>
>> Date:   Thu Feb 14 23:00:46 2019 +0000
>>
>>       locking/lockdep: Free lock classes that are no longer in use

Hey Bart, don't think it's related in any way to your commit. I'll allocate some
time on working on this issue today, thanks!

>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=152f6a9da00000
>> final crash:    https://syzkaller.appspot.com/x/report.txt?x=172f6a9da00000
>> console output: https://syzkaller.appspot.com/x/log.txt?x=132f6a9da00000
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+bd3bba6ff3fcea7a6ec6@syzkaller.appspotmail.com
>> Fixes: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use")
>>
>> WARNING: CPU: 0 PID: 9604 at kernel/bpf/core.c:851 bpf_jit_free+0x1a8/0x1f0
>> Kernel panic - not syncing: panic_on_warn set ...
>> CPU: 0 PID: 9604 Comm: kworker/0:5 Not tainted 5.3.0-rc3+ #71
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Workqueue: events bpf_prog_free_deferred
>> Call Trace:
>>    __dump_stack lib/dump_stack.c:77 [inline]
>>    dump_stack+0x1d8/0x2f8 lib/dump_stack.c:113
>>    panic+0x25c/0x799 kernel/panic.c:219
>>    __warn+0x22f/0x230 kernel/panic.c:576
>>    report_bug+0x190/0x290 lib/bug.c:186
>> BUG: unable to handle page fault for address: fffffbfff4001000
>> #PF: supervisor read access in kernel mode
>> #PF: error_code(0x0000) - not-present page
>> PGD 21ffee067 P4D 21ffee067 PUD 21ffed067 PMD 936de067 PTE 0
>> Oops: 0000 [#1] PREEMPT SMP KASAN
>> CPU: 0 PID: 9604 Comm: kworker/0:5 Not tainted 5.3.0-rc3+ #71
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Workqueue: events bpf_prog_free_deferred
>> RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:537 [inline]
>> RIP: 0010:bpf_tree_comp kernel/bpf/core.c:600 [inline]
>> RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
>> RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
>> RIP: 0010:bpf_prog_kallsyms_find kernel/bpf/core.c:674 [inline]
>> RIP: 0010:is_bpf_text_address+0x201/0x3b0 kernel/bpf/core.c:709
>> Code: 85 c4 f5 ff 4d 39 f4 76 10 e8 7b c2 f5 ff 49 83 c7 10 eb 46 0f 1f 44
>> 00 00 4c 89 e0 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <0f> b6 04 08 84
>> c0 75 7d 41 8b 1c 24 48 c1 e3 0c 4c 01 e3 48 89 df
>> RSP: 0018:ffff888097eff828 EFLAGS: 00010806
>> RAX: 1ffffffff4001000 RBX: 0000000000000001 RCX: dffffc0000000000
>> RDX: ffff88809f1e0280 RSI: ffffffffff7a5520 RDI: ffffffffa0008000
>> RBP: ffff888097eff860 R08: ffffffff817dc73b R09: 0000000000000001
>> R10: fffffbfff117be6d R11: 0000000000000000 R12: ffffffffa0008000
>> R13: 0000000000000000 R14: ffffffffff7a5520 R15: ffff88809a46b2f8
>> FS:  0000000000000000(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: fffffbfff4001000 CR3: 0000000095d73000 CR4: 00000000001406f0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
> [pruned]
> 
> Pair bpf_prog_kallsyms_del_all() with bpf_prog_free() to silence
> WARNING at kernel/bpf/core.c:851, see __bpf_prog_release() in
> net/core/filter.c for why.
> 
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -1987,6 +1987,7 @@ void bpf_prog_free(struct bpf_prog *fp)
>   {
>   	struct bpf_prog_aux *aux = fp->aux;
>   
> +	bpf_prog_kallsyms_del_all(fp);
>   	INIT_WORK(&aux->work, bpf_prog_free_deferred);
>   	schedule_work(&aux->work);
>   }
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1328,7 +1328,6 @@ static void __bpf_prog_put(struct bpf_pr
>   		perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_UNLOAD, 0);
>   		/* bpf_prog_free_id() must be called first */
>   		bpf_prog_free_id(prog, do_idr_lock);
> -		bpf_prog_kallsyms_del_all(prog);
>   		btf_put(prog->aux->btf);
>   		kvfree(prog->aux->func_info);
>   		bpf_prog_free_linfo(prog);
> --
> 


^ permalink raw reply

* Re: [patch net-next v2 2/2] selftests: netdevsim: add devlink params tests
From: Jiri Pirko @ 2019-08-15  8:52 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, mlxsw
In-Reply-To: <20190815084545.GB2273@nanopsycho>

Thu, Aug 15, 2019 at 10:45:45AM CEST, jiri@resnulli.us wrote:
>Thu, Aug 15, 2019 at 03:09:00AM CEST, jakub.kicinski@netronome.com wrote:
>>On Wed, 14 Aug 2019 17:26:04 +0200, Jiri Pirko wrote:
>>> From: Jiri Pirko <jiri@mellanox.com>
>>> 
>>> Test recently added netdevsim devlink param implementation.
>>> 
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>> ---
>>> v1->v2:
>>> -using cmd_jq helper
>>
>>Still failing here :(
>>
>># ./devlink.sh 
>>TEST: fw flash test                                                 [ OK ]
>>TEST: params test                                                   [FAIL]
>>	Failed to get test1 param value
>>TEST: regions test                                                  [ OK ]
>>
>># jq --version
>>jq-1.5-1-a5b5cbe
>># echo '{ "a" : false }' | jq -e -r '.[]'
>>false
>># echo $?
>>1
>
>Odd, could you please try:
>$ jq --version
>jq-1.5
>$ echo '{"param":{"netdevsim/netdevsim11":[{"name":"test1","type":"driver-specific","values":[{"cmode":"driverinit","value":"false"}]}]}}' | jq -e -r '.[][][].values[] | select(.cmode == "driverinit").value'
>false
>$ echo $?
>0

Ah, it is not the jq version, it is the iproute2 version:
8257e6c49cca9847e01262f6e749c6e88e2ddb72

I'll think about how to fix this.


>
>
>>
>>On another machine:
>>
>>$ echo '{ "a" : false }' | jq -e -r '.[]'
>>false
>>$ echo $?
>>1
>>
>>Did you mean to drop the -e ?

^ permalink raw reply

* Re: [patch net-next v2 2/2] selftests: netdevsim: add devlink params tests
From: Jiri Pirko @ 2019-08-15  8:45 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, mlxsw
In-Reply-To: <20190814180900.71712d88@cakuba.netronome.com>

Thu, Aug 15, 2019 at 03:09:00AM CEST, jakub.kicinski@netronome.com wrote:
>On Wed, 14 Aug 2019 17:26:04 +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> Test recently added netdevsim devlink param implementation.
>> 
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>> ---
>> v1->v2:
>> -using cmd_jq helper
>
>Still failing here :(
>
># ./devlink.sh 
>TEST: fw flash test                                                 [ OK ]
>TEST: params test                                                   [FAIL]
>	Failed to get test1 param value
>TEST: regions test                                                  [ OK ]
>
># jq --version
>jq-1.5-1-a5b5cbe
># echo '{ "a" : false }' | jq -e -r '.[]'
>false
># echo $?
>1

Odd, could you please try:
$ jq --version
jq-1.5
$ echo '{"param":{"netdevsim/netdevsim11":[{"name":"test1","type":"driver-specific","values":[{"cmode":"driverinit","value":"false"}]}]}}' | jq -e -r '.[][][].values[] | select(.cmode == "driverinit").value'
false
$ echo $?
0


>
>On another machine:
>
>$ echo '{ "a" : false }' | jq -e -r '.[]'
>false
>$ echo $?
>1
>
>Did you mean to drop the -e ?

^ permalink raw reply

* Re: can: tcan4x5x: spi bits_per_word issue on Raspberry PI
From: Marc Kleine-Budde @ 2019-08-15  8:34 UTC (permalink / raw)
  To: FIXED-TERM Buecheler Konstantin (ETAS-SEC/ECT-Mu),
	linux-can@vger.kernel.org, netdev@vger.kernel.org
  Cc: Dan Murphy
In-Reply-To: <3f71bdff8f4f4fe19ad9a09be89bc73d@escrypt.com>


[-- Attachment #1.1: Type: text/plain, Size: 1871 bytes --]

On 8/14/19 5:01 PM, FIXED-TERM Buecheler Konstantin (ETAS-SEC/ECT-Mu) wrote:
>> Hi all,
>>
>> I am trying to use a tcan4550 together with a Raspberry PI 3 B. I am using the
>> tcan4x5x driver from net-next.
>> I always get the following error during startup.
>> 	tcan4x5x spi0.0: Probe failed, err=-22
>> 	tcan4x5x: probe of spi0.0 failed with error -22
>>
>> I realized that this happens because the Raspberry PI does only support 8/9 bit
>> words. https://elinux.org/index.php?title=RPi_SPI#Supported_bits_per_word
>> In the driver it is set to 32.
>> 	spi->bits_per_word = 32;
>>
>> Setting this to 8 does not help of course since the tcan chip expects a multiple of
>> 32 per spi transaction.
>> I don't know if this is a Raspberry Pi specific problem or if there are more devices
>> with this hardware limitation.
>>
>> Does anyone have a workaround for that?
> 
> It seems to be enough to just change the bits_per_word value to 8
> 
>>
>> If this a common issue it might be a good idea to patch the driver. I will check if I
>> can find proper a way to do so.
>>
>> Regards,
>> Konstantin
> 
> Now I have another really confusing problem. Anything I write to SPI is written little endian. The tcan chip expects big endian. 
> Anything I read from SPI is treated as little endian but is big endian. Does anyone know why this happens? 
> Is there a flag or something I can set for the SPI device/wire to fix this?

Have you changed the bits_per_word to 8? Then you read just a stream of
bytes. If you tread them as an u32 they are in host order.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released
From: Björn Töpel @ 2019-08-15  8:28 UTC (permalink / raw)
  To: Daniel Borkmann, Björn Töpel
  Cc: Alexei Starovoitov, Netdev, Karlsson, Magnus, Bruce Richardson,
	Song Liu, bpf
In-Reply-To: <5ce25e5b-a07a-31d8-4141-c6bd250bba0e@iogearbox.net>

On 2019-08-15 01:17, Daniel Borkmann wrote:
> Seems reasonable to me, and inc as opposed to inc_not_zero is also fine
> here since at this point in time we're still holding one reference to
> the map.

Ok, good.

> But I think there's a catch with the current code that still
> needs fixing:
> 
> Imagine you do a xsk_map_update_elem() where we have a situation where
> xs == old_xs. There, we first do the xsk_map_sock_add() to add the new
> xsk map node at the tail of the socket's xs->map_list. We do the xchg()
> and then xsk_map_sock_delete() for old_xs which then walks xs->map_list
> again and purges all entries including the just newly created one. This
> means we'll end up with an xs socket at the given map slot, but the xs
> socket has empty xs->map_list. This means we could release the xs sock
> and the xsk_delete_from_maps() won't need to clean up anything anymore
> but yet the xs is still in the map slot, so if you redirect to that
> socket, it would be use-after-free, no?

Ah, correct. Checking against self-assignment, or doing the delete prior 
add. I'll spin a v5.

...and again, thanks for detailed review, Daniel!


Björn

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox