Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface
From: Jesper Dangaard Brouer @ 2019-07-31 10:08 UTC (permalink / raw)
  To: Daniel T. Lee
  Cc: brouer, Daniel Borkmann, Alexei Starovoitov, netdev,
	Quentin Monnet
In-Reply-To: <20190730184821.10833-2-danieltimlee@gmail.com>

On Wed, 31 Jul 2019 03:48:20 +0900
"Daniel T. Lee" <danieltimlee@gmail.com> wrote:

> By this commit, using `bpftool net load`, user can load XDP prog on
> interface. New type of enum 'net_load_type' has been made, as stated at
> cover-letter, the meaning of 'load' is, prog will be loaded on interface.

Why the keyword "load" ?
Why not "attach" (and "detach")?

For BPF there is a clear distinction between the "load" and "attach"
steps.  I know this is under subcommand "net", but to follow the
conversion of other subcommands e.g. "prog" there are both "load" and
"attach" commands.


> BPF prog will be loaded through libbpf 'bpf_set_link_xdp_fd'.

Again this is a "set" operation, not a "load" operation.

> Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>

[...]
>  static int do_show(int argc, char **argv)
>  {
>  	struct bpf_attach_info attach_info = {};
> @@ -305,13 +405,17 @@ static int do_help(int argc, char **argv)
>  
>  	fprintf(stderr,
>  		"Usage: %s %s { show | list } [dev <devname>]\n"
> +		"       %s %s load PROG LOAD_TYPE <devname>\n"

The "PROG" here does it correspond to the 'bpftool prog' syntax?:

 PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }

>  		"       %s %s help\n"
> +		"\n"
> +		"       " HELP_SPEC_PROGRAM "\n"
> +		"       LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n"
>  		"Note: Only xdp and tc attachments are supported now.\n"
>  		"      For progs attached to cgroups, use \"bpftool cgroup\"\n"
>  		"      to dump program attachments. For program types\n"
>  		"      sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
>  		"      consult iproute2.\n",


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early
From: Jason Wang @ 2019-07-31 10:05 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm, jgg
In-Reply-To: <20190731095950.d6zr472megt7rgkt@steredhat>


On 2019/7/31 下午5:59, Stefano Garzarella wrote:
> A little typo in the title: s/EAGIAN/EAGAIN
>
> Thanks,
> Stefano


Right, will fix if need respin or Michael can help to fix.

Thanks


>
> On Wed, Jul 31, 2019 at 04:46:55AM -0400, Jason Wang wrote:
>> Instead of returning -EAGAIN unconditionally, we'd better do that only
>> we're sure the range is overlapped with the metadata area.
>>
>> Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   drivers/vhost/vhost.c | 32 +++++++++++++++++++-------------
>>   1 file changed, 19 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index fc2da8a0c671..96c6aeb1871f 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -399,16 +399,19 @@ static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>>   	smp_mb();
>>   }
>>   
>> -static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>> -				      int index,
>> -				      unsigned long start,
>> -				      unsigned long end)
>> +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>> +				     int index,
>> +				     unsigned long start,
>> +				     unsigned long end,
>> +				     bool blockable)
>>   {
>>   	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
>>   	struct vhost_map *map;
>>   
>>   	if (!vhost_map_range_overlap(uaddr, start, end))
>> -		return;
>> +		return 0;
>> +	else if (!blockable)
>> +		return -EAGAIN;
>>   
>>   	spin_lock(&vq->mmu_lock);
>>   	++vq->invalidate_count;
>> @@ -423,6 +426,8 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>>   		vhost_set_map_dirty(vq, map, index);
>>   		vhost_map_unprefetch(map);
>>   	}
>> +
>> +	return 0;
>>   }
>>   
>>   static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
>> @@ -443,18 +448,19 @@ static int vhost_invalidate_range_start(struct mmu_notifier *mn,
>>   {
>>   	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
>>   					     mmu_notifier);
>> -	int i, j;
>> -
>> -	if (!mmu_notifier_range_blockable(range))
>> -		return -EAGAIN;
>> +	bool blockable = mmu_notifier_range_blockable(range);
>> +	int i, j, ret;
>>   
>>   	for (i = 0; i < dev->nvqs; i++) {
>>   		struct vhost_virtqueue *vq = dev->vqs[i];
>>   
>> -		for (j = 0; j < VHOST_NUM_ADDRS; j++)
>> -			vhost_invalidate_vq_start(vq, j,
>> -						  range->start,
>> -						  range->end);
>> +		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
>> +			ret = vhost_invalidate_vq_start(vq, j,
>> +							range->start,
>> +							range->end, blockable);
>> +			if (ret)
>> +				return ret;
>> +		}
>>   	}
>>   
>>   	return 0;
>> -- 
>> 2.18.1
>>

^ permalink raw reply

* Re: [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early
From: Stefano Garzarella @ 2019-07-31  9:59 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-10-jasowang@redhat.com>

A little typo in the title: s/EAGIAN/EAGAIN

Thanks,
Stefano

On Wed, Jul 31, 2019 at 04:46:55AM -0400, Jason Wang wrote:
> Instead of returning -EAGAIN unconditionally, we'd better do that only
> we're sure the range is overlapped with the metadata area.
> 
> Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c | 32 +++++++++++++++++++-------------
>  1 file changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index fc2da8a0c671..96c6aeb1871f 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -399,16 +399,19 @@ static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>  	smp_mb();
>  }
>  
> -static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> -				      int index,
> -				      unsigned long start,
> -				      unsigned long end)
> +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> +				     int index,
> +				     unsigned long start,
> +				     unsigned long end,
> +				     bool blockable)
>  {
>  	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
>  	struct vhost_map *map;
>  
>  	if (!vhost_map_range_overlap(uaddr, start, end))
> -		return;
> +		return 0;
> +	else if (!blockable)
> +		return -EAGAIN;
>  
>  	spin_lock(&vq->mmu_lock);
>  	++vq->invalidate_count;
> @@ -423,6 +426,8 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>  		vhost_set_map_dirty(vq, map, index);
>  		vhost_map_unprefetch(map);
>  	}
> +
> +	return 0;
>  }
>  
>  static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
> @@ -443,18 +448,19 @@ static int vhost_invalidate_range_start(struct mmu_notifier *mn,
>  {
>  	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
>  					     mmu_notifier);
> -	int i, j;
> -
> -	if (!mmu_notifier_range_blockable(range))
> -		return -EAGAIN;
> +	bool blockable = mmu_notifier_range_blockable(range);
> +	int i, j, ret;
>  
>  	for (i = 0; i < dev->nvqs; i++) {
>  		struct vhost_virtqueue *vq = dev->vqs[i];
>  
> -		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> -			vhost_invalidate_vq_start(vq, j,
> -						  range->start,
> -						  range->end);
> +		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
> +			ret = vhost_invalidate_vq_start(vq, j,
> +							range->start,
> +							range->end, blockable);
> +			if (ret)
> +				return ret;
> +		}
>  	}
>  
>  	return 0;
> -- 
> 2.18.1
> 

-- 

^ permalink raw reply

* [RFC PATCH 1/2] dt-bindings: net: macb: Add new property for PS SGMII only
From: Harini Katakam @ 2019-07-31  9:40 UTC (permalink / raw)
  To: nicolas.ferre, davem, claudiu.beznea, robh+dt, mark.rutland
  Cc: netdev, linux-kernel, michal.simek, harinikatakamlinux,
	harini.katakam, devicetree
In-Reply-To: <1564566033-676-1-git-send-email-harini.katakam@xilinx.com>

Add a new property to indicate when PS SGMII is used with NO
external PHY on board.

Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
---
 Documentation/devicetree/bindings/net/macb.txt | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/macb.txt b/Documentation/devicetree/bindings/net/macb.txt
index 63c73fa..ae1b109 100644
--- a/Documentation/devicetree/bindings/net/macb.txt
+++ b/Documentation/devicetree/bindings/net/macb.txt
@@ -38,6 +38,10 @@ Optional properties for PHY child node:
   up via magic packet.
 - phy-handle : see ethernet.txt file in the same directory
 
+Optional properties:
+- is-internal-pcspma: Add when GEM's internal SGMII support is used without
+  any external SGMII PHY.
+
 Examples:
 
 	macb0: ethernet@fffc4000 {
-- 
2.7.4


^ permalink raw reply related

* [RFC PATCH 2/2] net: macb: Add SGMII poll thread
From: Harini Katakam @ 2019-07-31  9:40 UTC (permalink / raw)
  To: nicolas.ferre, davem, claudiu.beznea, robh+dt, mark.rutland
  Cc: netdev, linux-kernel, michal.simek, harinikatakamlinux,
	harini.katakam, devicetree
In-Reply-To: <1564566033-676-1-git-send-email-harini.katakam@xilinx.com>

The internal SGMII mode in PS GEM on ZynqMP can be used without any
external PHY on board. In this case, the phy framework doesn't kick
in to monitor the link status. Hence do the same in macb driver.

Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Signed-off-by: Kester Aernoudt <kester.aernoudt@xilinx.com>
---
 drivers/net/ethernet/cadence/macb.h      |  8 ++++
 drivers/net/ethernet/cadence/macb_main.c | 65 ++++++++++++++++++++++++++++++--
 2 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 03983bd..b07284a 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -155,6 +155,7 @@
 #define GEM_PEFTN		0x01f4 /* PTP Peer Event Frame Tx Ns */
 #define GEM_PEFRSL		0x01f8 /* PTP Peer Event Frame Rx Sec Low */
 #define GEM_PEFRN		0x01fc /* PTP Peer Event Frame Rx Ns */
+#define GEM_PCSSTATUS		0x0204 /* PCS Status */
 #define GEM_DCFG1		0x0280 /* Design Config 1 */
 #define GEM_DCFG2		0x0284 /* Design Config 2 */
 #define GEM_DCFG3		0x0288 /* Design Config 3 */
@@ -455,6 +456,10 @@
 #define MACB_REV_OFFSET				0
 #define MACB_REV_SIZE				16
 
+/* Bitfields in PCSSTATUS */
+#define GEM_PCSLINK_OFFSET			2
+#define GEM_PCSLINK_SIZE			1
+
 /* Bitfields in DCFG1. */
 #define GEM_IRQCOR_OFFSET			23
 #define GEM_IRQCOR_SIZE				1
@@ -1232,6 +1237,9 @@ struct macb {
 	u32	rx_intr_mask;
 
 	struct macb_pm_data pm_data;
+
+	int internal_pcspma;
+	struct task_struct *sgmii_poll_task;
 };
 
 #ifdef CONFIG_MACB_USE_HWSTAMP
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
index 5ca17e6..ae1f18d 100644
--- a/drivers/net/ethernet/cadence/macb_main.c
+++ b/drivers/net/ethernet/cadence/macb_main.c
@@ -36,6 +36,7 @@
 #include <linux/tcp.h>
 #include <linux/iopoll.h>
 #include <linux/pm_runtime.h>
+#include <linux/kthread.h>
 #include "macb.h"
 
 /* This structure is only used for MACB on SiFive FU540 devices */
@@ -2418,8 +2419,10 @@ static int macb_open(struct net_device *dev)
 	/* carrier starts down */
 	netif_carrier_off(dev);
 
-	/* if the phy is not yet register, retry later*/
-	if (!dev->phydev) {
+	/* if the phy is not yet registered and internal SGMII is not used,
+	 * retry later
+	 */
+	if (!bp->internal_pcspma && !dev->phydev) {
 		err = -EAGAIN;
 		goto pm_exit;
 	}
@@ -2441,7 +2444,8 @@ static int macb_open(struct net_device *dev)
 	macb_init_hw(bp);
 
 	/* schedule a link state check */
-	phy_start(dev->phydev);
+	if (!bp->internal_pcspma)
+		phy_start(dev->phydev);
 
 	netif_tx_start_all_queues(dev);
 
@@ -2468,7 +2472,7 @@ static int macb_close(struct net_device *dev)
 	for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue)
 		napi_disable(&queue->napi);
 
-	if (dev->phydev)
+	if (!bp->internal_pcspma && dev->phydev)
 		phy_stop(dev->phydev);
 
 	spin_lock_irqsave(&bp->lock, flags);
@@ -3187,6 +3191,49 @@ static const struct ethtool_ops gem_ethtool_ops = {
 	.set_rxnfc			= gem_set_rxnfc,
 };
 
+int gem_sgmii_status_poll(void *data)
+{
+	struct net_device *dev = data;
+	struct macb *bp = netdev_priv(dev);
+	int status, prev_status = 0;
+	u32 reg;
+
+	while (!kthread_should_stop()) {
+		status = gem_readl(bp, PCSSTATUS) & GEM_BIT(PCSLINK);
+		reg = macb_readl(bp, NCR);
+		if (!(reg & MACB_BIT(RE)) || !(reg & MACB_BIT(TE)) ||
+		    (!netif_carrier_ok(dev) && prev_status))
+			status = 0;
+
+		if (status != prev_status) {
+			if (status) {
+				reg = macb_readl(bp, NCFGR);
+				reg |= MACB_BIT(FD);
+				reg |= GEM_BIT(GBE);
+
+				macb_or_gem_writel(bp, NCFGR, reg);
+
+				bp->speed = SPEED_1000;
+				bp->duplex = DUPLEX_FULL;
+				bp->link = 1;
+				macb_set_tx_clk(bp->tx_clk, SPEED_1000, dev);
+
+				netif_carrier_on(dev);
+				netdev_info(dev, "link up (%d/%s)\n",
+					    SPEED_1000, "Full");
+			} else {
+				netif_carrier_off(dev);
+				netdev_info(dev, "link down\n");
+			}
+
+			prev_status = status;
+		}
+		schedule_timeout_uninterruptible(HZ);
+	}
+
+	return 0;
+}
+
 static int macb_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 {
 	struct phy_device *phydev = dev->phydev;
@@ -4344,6 +4391,12 @@ static int macb_probe(struct platform_device *pdev)
 		    macb_is_gem(bp) ? "GEM" : "MACB", macb_readl(bp, MID),
 		    dev->base_addr, dev->irq, dev->dev_addr);
 
+	bp->internal_pcspma = of_property_read_bool(np, "is-internal-pcspma");
+
+	if (bp->internal_pcspma)
+		bp->sgmii_poll_task = kthread_run(gem_sgmii_status_poll, dev,
+						  "gem_sgmii_status_poll");
+
 	pm_runtime_mark_last_busy(&bp->pdev->dev);
 	pm_runtime_put_autosuspend(&bp->pdev->dev);
 
@@ -4384,6 +4437,10 @@ static int macb_remove(struct platform_device *pdev)
 
 	if (dev) {
 		bp = netdev_priv(dev);
+
+		if (bp->internal_pcspma)
+			kthread_stop(bp->sgmii_poll_task);
+
 		if (dev->phydev)
 			phy_disconnect(dev->phydev);
 		mdiobus_unregister(bp->mii_bus);
-- 
2.7.4


^ permalink raw reply related

* [RFC PATCH 0/2] Macb SGMII status poll thread
From: Harini Katakam @ 2019-07-31  9:40 UTC (permalink / raw)
  To: nicolas.ferre, davem, claudiu.beznea, robh+dt, mark.rutland
  Cc: netdev, linux-kernel, michal.simek, harinikatakamlinux,
	harini.katakam, devicetree

When PS GEM is used with SGMII mode without an external PHY
on board, add a link status reporting mechanism.

Harini Katakam (2):
  dt-bindings: net: macb: Add new property for PS SGMII only
  net: macb: Add SGMII poll thread

 Documentation/devicetree/bindings/net/macb.txt |  4 ++
 drivers/net/ethernet/cadence/macb.h            |  8 ++++
 drivers/net/ethernet/cadence/macb_main.c       | 65 ++++++++++++++++++++++++--
 3 files changed, 73 insertions(+), 4 deletions(-)

-- 
2.7.4


^ permalink raw reply

* Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface
From: Jesper Dangaard Brouer @ 2019-07-31  9:38 UTC (permalink / raw)
  To: Daniel T. Lee; +Cc: brouer, Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190730184821.10833-2-danieltimlee@gmail.com>

On Wed, 31 Jul 2019 03:48:20 +0900
"Daniel T. Lee" <danieltimlee@gmail.com> wrote:

> diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
> index 67e99c56bc88..d3a4f18b5b95 100644
> --- a/tools/bpf/bpftool/net.c
> +++ b/tools/bpf/bpftool/net.c
> @@ -55,6 +55,35 @@ struct bpf_attach_info {
>  	__u32 flow_dissector_id;
>  };
>  
> +enum net_load_type {
> +	NET_LOAD_TYPE_XDP,
> +	NET_LOAD_TYPE_XDP_GENERIC,
> +	NET_LOAD_TYPE_XDP_DRIVE,

Why "DRIVE" ?
Why not "DRIVER" ?

> +	NET_LOAD_TYPE_XDP_OFFLOAD,
> +	__MAX_NET_LOAD_TYPE
> +};
> +
> +static const char * const load_type_strings[] = {
> +	[NET_LOAD_TYPE_XDP] = "xdp",
> +	[NET_LOAD_TYPE_XDP_GENERIC] = "xdpgeneric",
> +	[NET_LOAD_TYPE_XDP_DRIVE] = "xdpdrv",
> +	[NET_LOAD_TYPE_XDP_OFFLOAD] = "xdpoffload",
> +	[__MAX_NET_LOAD_TYPE] = NULL,
> +};



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH 1/2] include: mdio: Add private field to mdio structure
From: Harini Katakam @ 2019-07-31  9:36 UTC (permalink / raw)
  To: andrew, f.fainelli, hkallweit1, davem
  Cc: michal.simek, netdev, linux-arm-kernel, linux-kernel,
	harinikatakamlinux, harini.katakam, radhey.shyam.pandey
In-Reply-To: <1564565779-29537-1-git-send-email-harini.katakam@xilinx.com>

Add a priv pointer to mdio structure to be used by mdio devices if
required.

This priv field will be used by gmii2rgmii driver. As this IP has
no capability to read status on the MDIO bus, the driver currently
snoops the same and needs the instance information is some private
field. Since phy device "priv" can be used by external phy drivers,
it is not appropriate. Hence this addition to mdio device. This is
a temporary solution before the IP can be improved. The need for
this priv field can be re-evaluated later based on other mdio devices.

Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Reviewed-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
---
 include/linux/mdio.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/mdio.h b/include/linux/mdio.h
index e8242ad8..3399de7 100644
--- a/include/linux/mdio.h
+++ b/include/linux/mdio.h
@@ -40,6 +40,9 @@ struct mdio_device {
 	struct reset_control *reset_ctrl;
 	unsigned int reset_assert_delay;
 	unsigned int reset_deassert_delay;
+
+	/* private data pointer for use by MDIO devices */
+	void *priv;
 };
 #define to_mdio_device(d) container_of(d, struct mdio_device, dev)

-- 
2.7.4

^ permalink raw reply related

* [PATCH 2/2] net: gmii2rgmii: Switch priv field in mdio device structure
From: Harini Katakam @ 2019-07-31  9:36 UTC (permalink / raw)
  To: andrew, f.fainelli, hkallweit1, davem
  Cc: michal.simek, netdev, linux-arm-kernel, linux-kernel,
	harinikatakamlinux, harini.katakam, radhey.shyam.pandey
In-Reply-To: <1564565779-29537-1-git-send-email-harini.katakam@xilinx.com>

Use the priv field in mdio device structure instead of the one in
phy device structure. The phy device priv field may be used by the
external phy driver and should not be overwritten.

Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
Reviewed-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
---
 drivers/net/phy/xilinx_gmii2rgmii.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/phy/xilinx_gmii2rgmii.c b/drivers/net/phy/xilinx_gmii2rgmii.c
index 2d14493..ba31b5c3 100644
--- a/drivers/net/phy/xilinx_gmii2rgmii.c
+++ b/drivers/net/phy/xilinx_gmii2rgmii.c
@@ -29,7 +29,7 @@ struct gmii2rgmii {
 
 static int xgmiitorgmii_read_status(struct phy_device *phydev)
 {
-	struct gmii2rgmii *priv = phydev->priv;
+	struct gmii2rgmii *priv = phydev->mdio.priv;
 	struct mii_bus *bus = priv->mdio->bus;
 	int addr = priv->mdio->addr;
 	u16 val = 0;
@@ -90,7 +90,7 @@ static int xgmiitorgmii_probe(struct mdio_device *mdiodev)
 	memcpy(&priv->conv_phy_drv, priv->phy_dev->drv,
 	       sizeof(struct phy_driver));
 	priv->conv_phy_drv.read_status = xgmiitorgmii_read_status;
-	priv->phy_dev->priv = priv;
+	priv->phy_dev->mdio.priv = priv;
 	priv->phy_dev->drv = &priv->conv_phy_drv;
 
 	return 0;
-- 
2.7.4


^ permalink raw reply related

* [PATCH 0/2] Fix GMII2RGMII private field
From: Harini Katakam @ 2019-07-31  9:36 UTC (permalink / raw)
  To: andrew, f.fainelli, hkallweit1, davem
  Cc: michal.simek, netdev, linux-arm-kernel, linux-kernel,
	harinikatakamlinux, harini.katakam, radhey.shyam.pandey

Fix the usage of external phy's priv field by gmii2rgmii driver.

Based on net-next.

Harini Katakam (2):
  include: mdio: Add private field to mdio structure
  net: gmii2rgmii: Switch priv field in mdio device structure

 drivers/net/phy/xilinx_gmii2rgmii.c | 4 ++--
 include/linux/mdio.h                | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

-- 
2.7.4


^ permalink raw reply

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Jesper Dangaard Brouer @ 2019-07-31  9:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jakub Kicinski, Daniel T. Lee, Stephen Hemminger, David Ahern,
	John Fastabend, Daniel Borkmann, Alexei Starovoitov, David Miller,
	netdev, brouer
In-Reply-To: <20190731015238.3kq3r7rlascv7tzs@ast-mbp>

On Tue, 30 Jul 2019 18:52:40 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Jul 30, 2019 at 06:21:44PM -0700, Jakub Kicinski wrote:
> > > > Duplicating the same features in bpftool will only diminish the
> > > > incentive for moving iproute2 to libbpf.     
> > > 
> > > not at all. why do you think so?  
> > 
> > Because iproute2's BPF has fallen behind so the simplest thing is to
> > just contribute to bpftool. But iproute2 is the tool set for Linux
> > networking, we can't let it bit rot :(  
> 
> where were you when a lot of libbpf was copy pasted into iproute2 ?!
> Now it diverged a lot and it's difficult to move iproute2 back to the main
> train which is libbpf.

I hope we all agree that libbpf it the way forward.  I really hope we
can convert iproute2 into using libbpf.  It is challenging because
iproute2 ELF files use another map-layout, and I guess we need to stay
backward compatible.

> Same thing with at least 5 copy-pastes of samples/bpf/bpf_load.c
> that spawned a bunch of different bpf loaders.

I'm also to blame here... as I ran with bpf_load.c and based my
prototype-kernel github project on it.  And it seems that it have
gotten enough Google entropy, that people find that first. E.g. I have
pull requests that add new tools, which I have not merged, as I want to
switch to libbpf first.

Recently I've added a README[1] that warn about this issue, and point
people to xdp-tutorial[2] instead.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/README.rst
[2] https://github.com/xdp-project/xdp-tutorial

 Quote[1]:
   "WARNING: This directory contains its own out-of-date BPF
    ELF-loader (in bpf_load.c). Instead people should use libbpf."

> > IMHO vaguely competent users of Linux networking will know ip link.
> > If they are not vaguely competent, they should not attach XDP programs
> > to interfaces by hand...  
> 
> I'm a prime example of moderately competent linux user who
> doesn't know iproute2. I'm not joking.
> I don't know tc syntax and always use my cheat sheet of ip/tc commands
> to do anything.

I agree... I also need to lookup the TC commands used for attaching BPF
programs every time.  It is a pain, and I've often made mistakes that
cause several tc filter's to get attached, which leads to strange
behavior.  (AFAIK you have to remember to use both prio and handle when
removing the old, before adding the new).

> > > 
> > > bpftool must be able to introspect every aspect of bpf programming.
> > > That includes detaching and attaching anywhere.
> > > Anyone doing 'bpftool p s' should be able to switch off particular
> > > prog id without learning ten different other tools.  
> > 
> > I think the fact that we already have an implementation in iproute2,
> > which is at the risk of bit rot is more important to me that the
> > hypothetical scenario where everyone knows to just use bpftool (for
> > XDP, for TC it's still iproute2 unless there's someone crazy enough 
> > to reimplement the TC functionality :))  
> 
> I think you're missing the point that iproute2 is still necessary
> to configure it.
> bpftool should be able to attach/detach from anything.
> But configuring that thing (whether it's cgroup or tc/xdp) is
> a job of corresponding apis and tools.
> 
> > I'm not sure we can settle our differences over email :)
> > I have tremendous respect for all the maintainers I CCed here, 
> > if nobody steps up to agree with me I'll concede the bpftool net
> > battle entirely :)  
> 
> we can keep arguing forever. Respect to ease-of-use only comes
> from the pain of operational experience. I don't think I can
> convey that pain in the email either.

Let me try to convey some pain...

Regarding operational experience.  I have a customer using this:
 https://github.com/xdp-project/xdp-cpumap-tc

The project combines TC and XDP (with CPUMAP redirect).  It has been a
*HUGE* pain that we needed to use iproute2 tc commands to attach and
load TC BPF-programs.

(Issue#1)
The iproute2 tc BPF loader uses another ELF-layout for maps.  This was
worked around, by keeping XDP and TC BPF-progs in different C-file
using different struct's for maps.  And avoiding to use map features,
that iproute2 doesn't know about... although we found a workaround for
using more advanced map features via load-order and pinning see issue#3
sub-problem(2).

(Issue#2)
As this is a C-prog I would really prefer using a library, instead of
having to call cmdline utilities, see C-code doing this[3].  This lead
to several issues.  E.g. they had installed too old version of iproute2
on their systems, after installing newer version (in /usr/local) it was
not first in $PATH for root.  Load order also matters see issue#3.

[3] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/src/common_user.c#L381-L493

(Issue#3)
Next to get TC and XDP programs to cooperate, pinned maps via bpffs is
used.  And iproute2 dictates that pinned maps are located in directory
/sys/fs/bpf/tc/globals/.  What could go wrong, it's only a static dir path.

Sub-problem(1): If XDP loads _before_ any tc-bpf cmd, then the subdirs
are not created, leading to replicating mkdir creation in C[4].  Else
the XDP load will fail.  (Troubleshooting this was complicated by 

[4] https://github.com/xdp-project/xdp-cpumap-tc/commit/25e7e56699cd75a4a

Sub-problem(2): We really want to load XDP first, because libbpf
creates maps in a better way, e.g. with "name" (and BTF info).
The "name" part was needed by libbpf to find a given map (to avoid
depending on the order maps are defined in C/ELF file) via helper
bpf_object__find_map_fd_by_name().  To handle TC noname case, I had to
code up this workaround[5], which depend on extracting the name of the
pinned file name.

[5] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/src/xdp_iphash_to_cpu_user.c#L117-L131

Experience/conclusion: Getting XDP and TC to cooperate suck, primarily
because iproute2 tc is based on a separate BPF-ELF loader, which is
features are not in sync / lacking behind.  Calling cmdline binaries
from C also sucks, and I would prefer some libbpf TC attach function,
but most pain comes from the slight differences between ELF-loaders.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH] myri10ge: remove unneeded variable
From: Ding Xiang @ 2019-07-31  8:53 UTC (permalink / raw)
  To: christopher.lee, davem; +Cc: netdev, linux-kernel

"error" is unneeded,just return 0

Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
---
 drivers/net/ethernet/myricom/myri10ge/myri10ge.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
index d8b7fba..a4165e1 100644
--- a/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
+++ b/drivers/net/ethernet/myricom/myri10ge/myri10ge.c
@@ -3037,7 +3037,6 @@ static int myri10ge_set_mac_address(struct net_device *dev, void *addr)
 static int myri10ge_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct myri10ge_priv *mgp = netdev_priv(dev);
-	int error = 0;
 
 	netdev_info(dev, "changing mtu from %d to %d\n", dev->mtu, new_mtu);
 	if (mgp->running) {
@@ -3049,7 +3048,7 @@ static int myri10ge_change_mtu(struct net_device *dev, int new_mtu)
 	} else
 		dev->mtu = new_mtu;
 
-	return error;
+	return 0;
 }
 
 /*
-- 
1.9.1




^ permalink raw reply related

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-07-31  8:50 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-8-jasowang@redhat.com>


On 2019/7/31 下午4:46, Jason Wang wrote:
> We used to use RCU to synchronize MMU notifier with worker. This leads
> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> system, there would be many factors that may slow down the
> synchronize_rcu() which makes it unsuitable to be called in MMU
> notifier.
>
> A solution is SRCU but its overhead is obvious with the expensive full
> memory barrier. Another choice is to use seqlock, but it doesn't
> provide a synchronization method between readers and writers. The last
> choice is to use vq mutex, but it need to deal with the worst case
> that MMU notifier must be blocked and wait for the finish of swap in.
>
> So this patch switches use a counter to track whether or not the map
> was used. The counter was increased when vq try to start or finish
> uses the map. This means, when it was even, we're sure there's no
> readers and MMU notifier is synchronized. When it was odd, it means
> there's a reader we need to wait it to be even again then we are
> synchronized. To avoid full memory barrier, store_release +
> load_acquire on the counter is used.


For reviewers, I try hard to avoid e.g smp_mb(), please double check 
whether or not this trick work.

Thanks


>
> Consider the read critical section is pretty small the synchronization
> should be done very fast.
>
> Note the patch lead about 3% PPS dropping.
>
> Reported-by: Michael S. Tsirkin <mst@redhat.com>
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>   drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
>   drivers/vhost/vhost.h |   7 +-
>   2 files changed, 94 insertions(+), 58 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index cfc11f9ed9c9..db2c81cb1e90 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>   
>   	spin_lock(&vq->mmu_lock);
>   	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> -		map[i] = rcu_dereference_protected(vq->maps[i],
> -				  lockdep_is_held(&vq->mmu_lock));
> +		map[i] = vq->maps[i];
>   		if (map[i]) {
>   			vhost_set_map_dirty(vq, map[i], i);
> -			rcu_assign_pointer(vq->maps[i], NULL);
> +			vq->maps[i] = NULL;
>   		}
>   	}
>   	spin_unlock(&vq->mmu_lock);
>   
> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
> -	 * serialized with memory accessors (e.g vq mutex held).
> +	/* No need for synchronization since we are serialized with
> +	 * memory accessors (e.g vq mutex held).
>   	 */
>   
>   	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>   	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>   }
>   
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);
> +
> +	smp_store_release(&vq->ref, ref + 1);
> +	/* Make sure ref counter is visible before accessing the map */
> +	smp_load_acquire(&vq->ref);
> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);
> +
> +	/* Make sure vq access is done before increasing ref counter */
> +	smp_store_release(&vq->ref, ref + 1);
> +}
> +
> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> +{
> +	int ref;
> +
> +	/* Make sure map change was done before checking ref counter */
> +	smp_mb();
> +
> +	ref = READ_ONCE(vq->ref);
> +	if (ref & 0x1) {
> +		/* When ref change, we are sure no reader can see
> +		 * previous map */
> +		while (READ_ONCE(vq->ref) == ref) {
> +			set_current_state(TASK_RUNNING);
> +			schedule();
> +		}
> +	}
> +	/* Make sure ref counter was checked before any other
> +	 * operations that was dene on map. */
> +	smp_mb();
> +}
> +
>   static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>   				      int index,
>   				      unsigned long start,
> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>   	spin_lock(&vq->mmu_lock);
>   	++vq->invalidate_count;
>   
> -	map = rcu_dereference_protected(vq->maps[index],
> -					lockdep_is_held(&vq->mmu_lock));
> +	map = vq->maps[index];
>   	if (map) {
>   		vhost_set_map_dirty(vq, map, index);
> -		rcu_assign_pointer(vq->maps[index], NULL);
> +		vq->maps[index] = NULL;
>   	}
>   	spin_unlock(&vq->mmu_lock);
>   
>   	if (map) {
> -		synchronize_rcu();
> +		vhost_vq_sync_access(vq);
>   		vhost_map_unprefetch(map);
>   	}
>   }
> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
>   	for (i = 0; i < dev->nvqs; ++i) {
>   		vq = dev->vqs[i];
>   		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> -			RCU_INIT_POINTER(vq->maps[j], NULL);
> +			vq->maps[j] = NULL;
>   	}
>   }
>   #endif
> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>   		vq->indirect = NULL;
>   		vq->heads = NULL;
>   		vq->dev = dev;
> +		vq->ref = 0;
>   		mutex_init(&vq->mutex);
>   		spin_lock_init(&vq->mmu_lock);
>   		vhost_vq_reset(dev, vq);
> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
>   	map->npages = npages;
>   	map->pages = pages;
>   
> -	rcu_assign_pointer(vq->maps[index], map);
> +	vq->maps[index] = map;
>   	/* No need for a synchronize_rcu(). This function should be
>   	 * called by dev->worker so we are serialized with all
>   	 * readers.
> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>   	struct vring_used *used;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>   
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>   		if (likely(map)) {
>   			used = map->addr;
>   			*((__virtio16 *)&used->ring[vq->num]) =
>   				cpu_to_vhost16(vq, vq->avail_idx);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
>   
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>   	size_t size;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>   
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>   		if (likely(map)) {
>   			used = map->addr;
>   			size = count * sizeof(*head);
>   			memcpy(used->ring + idx, head, size);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
>   
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>   	struct vring_used *used;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>   
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>   		if (likely(map)) {
>   			used = map->addr;
>   			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
>   
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>   	struct vring_used *used;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>   
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>   		if (likely(map)) {
>   			used = map->addr;
>   			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
>   
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>   	struct vring_avail *avail;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>   
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>   		if (likely(map)) {
>   			avail = map->addr;
>   			*idx = avail->idx;
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
>   
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>   	struct vring_avail *avail;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>   
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>   		if (likely(map)) {
>   			avail = map->addr;
>   			*head = avail->ring[idx & (vq->num - 1)];
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
>   
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>   	struct vring_avail *avail;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>   
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>   		if (likely(map)) {
>   			avail = map->addr;
>   			*flags = avail->flags;
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
>   
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>   	struct vring_avail *avail;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		vhost_vq_access_map_begin(vq);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>   		if (likely(map)) {
>   			avail = map->addr;
>   			*event = (__virtio16)avail->ring[vq->num];
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>   	struct vring_used *used;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>   
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>   		if (likely(map)) {
>   			used = map->addr;
>   			*idx = used->idx;
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
>   
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>   	struct vring_desc *d;
>   
>   	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>   
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
> +		map = vq->maps[VHOST_ADDR_DESC];
>   		if (likely(map)) {
>   			d = map->addr;
>   			*desc = *(d + idx);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>   			return 0;
>   		}
>   
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>   	}
>   #endif
>   
> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
>   #if VHOST_ARCH_CAN_ACCEL_UACCESS
>   static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
>   {
> -	struct vhost_map __rcu *map;
> +	struct vhost_map *map;
>   	int i;
>   
>   	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> -		rcu_read_lock();
> -		map = rcu_dereference(vq->maps[i]);
> -		rcu_read_unlock();
> +		map = vq->maps[i];
>   		if (unlikely(!map))
>   			vhost_map_prefetch(vq, i);
>   	}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index a9a2a93857d2..f9e9558a529d 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
>   #if VHOST_ARCH_CAN_ACCEL_UACCESS
>   	/* Read by memory accessors, modified by meta data
>   	 * prefetching, MMU notifier and vring ioctl().
> -	 * Synchonrized through mmu_lock (writers) and RCU (writers
> -	 * and readers).
> +	 * Synchonrized through mmu_lock (writers) and ref counters,
> +	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
>   	 */
> -	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
> +	struct vhost_map *maps[VHOST_NUM_ADDRS];
>   	/* Read by MMU notifier, modified by vring ioctl(),
>   	 * synchronized through MMU notifier
>   	 * registering/unregistering.
>   	 */
>   	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>   #endif
> +	int ref;
>   	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>   
>   	struct file *kick;

^ permalink raw reply

* [PATCH V2 9/9] vhost: do not return -EAGIAN for non blocking invalidation too early
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-1-jasowang@redhat.com>

Instead of returning -EAGAIN unconditionally, we'd better do that only
we're sure the range is overlapped with the metadata area.

Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index fc2da8a0c671..96c6aeb1871f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -399,16 +399,19 @@ static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
 	smp_mb();
 }
 
-static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
-				      int index,
-				      unsigned long start,
-				      unsigned long end)
+static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
+				     int index,
+				     unsigned long start,
+				     unsigned long end,
+				     bool blockable)
 {
 	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
 	struct vhost_map *map;
 
 	if (!vhost_map_range_overlap(uaddr, start, end))
-		return;
+		return 0;
+	else if (!blockable)
+		return -EAGAIN;
 
 	spin_lock(&vq->mmu_lock);
 	++vq->invalidate_count;
@@ -423,6 +426,8 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 		vhost_set_map_dirty(vq, map, index);
 		vhost_map_unprefetch(map);
 	}
+
+	return 0;
 }
 
 static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
@@ -443,18 +448,19 @@ static int vhost_invalidate_range_start(struct mmu_notifier *mn,
 {
 	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
 					     mmu_notifier);
-	int i, j;
-
-	if (!mmu_notifier_range_blockable(range))
-		return -EAGAIN;
+	bool blockable = mmu_notifier_range_blockable(range);
+	int i, j, ret;
 
 	for (i = 0; i < dev->nvqs; i++) {
 		struct vhost_virtqueue *vq = dev->vqs[i];
 
-		for (j = 0; j < VHOST_NUM_ADDRS; j++)
-			vhost_invalidate_vq_start(vq, j,
-						  range->start,
-						  range->end);
+		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
+			ret = vhost_invalidate_vq_start(vq, j,
+							range->start,
+							range->end, blockable);
+			if (ret)
+				return ret;
+		}
 	}
 
 	return 0;
-- 
2.18.1


^ permalink raw reply related

* [PATCH V2 8/9] vhost: correctly set dirty pages in MMU notifiers callback
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-1-jasowang@redhat.com>

We need make sure there's no reference on the map before trying to
mark set dirty pages.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index db2c81cb1e90..fc2da8a0c671 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -414,14 +414,13 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 	++vq->invalidate_count;
 
 	map = vq->maps[index];
-	if (map) {
-		vhost_set_map_dirty(vq, map, index);
+	if (map)
 		vq->maps[index] = NULL;
-	}
 	spin_unlock(&vq->mmu_lock);
 
 	if (map) {
 		vhost_vq_sync_access(vq);
+		vhost_set_map_dirty(vq, map, index);
 		vhost_map_unprefetch(map);
 	}
 }
-- 
2.18.1


^ permalink raw reply related

* [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-1-jasowang@redhat.com>

We used to use RCU to synchronize MMU notifier with worker. This leads
calling synchronize_rcu() in invalidate_range_start(). But on a busy
system, there would be many factors that may slow down the
synchronize_rcu() which makes it unsuitable to be called in MMU
notifier.

A solution is SRCU but its overhead is obvious with the expensive full
memory barrier. Another choice is to use seqlock, but it doesn't
provide a synchronization method between readers and writers. The last
choice is to use vq mutex, but it need to deal with the worst case
that MMU notifier must be blocked and wait for the finish of swap in.

So this patch switches use a counter to track whether or not the map
was used. The counter was increased when vq try to start or finish
uses the map. This means, when it was even, we're sure there's no
readers and MMU notifier is synchronized. When it was odd, it means
there's a reader we need to wait it to be even again then we are
synchronized. To avoid full memory barrier, store_release +
load_acquire on the counter is used.

Consider the read critical section is pretty small the synchronization
should be done very fast.

Note the patch lead about 3% PPS dropping.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
 drivers/vhost/vhost.h |   7 +-
 2 files changed, 94 insertions(+), 58 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cfc11f9ed9c9..db2c81cb1e90 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 
 	spin_lock(&vq->mmu_lock);
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
-		map[i] = rcu_dereference_protected(vq->maps[i],
-				  lockdep_is_held(&vq->mmu_lock));
+		map[i] = vq->maps[i];
 		if (map[i]) {
 			vhost_set_map_dirty(vq, map[i], i);
-			rcu_assign_pointer(vq->maps[i], NULL);
+			vq->maps[i] = NULL;
 		}
 	}
 	spin_unlock(&vq->mmu_lock);
 
-	/* No need for synchronize_rcu() or kfree_rcu() since we are
-	 * serialized with memory accessors (e.g vq mutex held).
+	/* No need for synchronization since we are serialized with
+	 * memory accessors (e.g vq mutex held).
 	 */
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++)
@@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
 	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
 }
 
+static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
+{
+	int ref = READ_ONCE(vq->ref);
+
+	smp_store_release(&vq->ref, ref + 1);
+	/* Make sure ref counter is visible before accessing the map */
+	smp_load_acquire(&vq->ref);
+}
+
+static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
+{
+	int ref = READ_ONCE(vq->ref);
+
+	/* Make sure vq access is done before increasing ref counter */
+	smp_store_release(&vq->ref, ref + 1);
+}
+
+static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
+{
+	int ref;
+
+	/* Make sure map change was done before checking ref counter */
+	smp_mb();
+
+	ref = READ_ONCE(vq->ref);
+	if (ref & 0x1) {
+		/* When ref change, we are sure no reader can see
+		 * previous map */
+		while (READ_ONCE(vq->ref) == ref) {
+			set_current_state(TASK_RUNNING);
+			schedule();
+		}
+	}
+	/* Make sure ref counter was checked before any other
+	 * operations that was dene on map. */
+	smp_mb();
+}
+
 static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 				      int index,
 				      unsigned long start,
@@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 	spin_lock(&vq->mmu_lock);
 	++vq->invalidate_count;
 
-	map = rcu_dereference_protected(vq->maps[index],
-					lockdep_is_held(&vq->mmu_lock));
+	map = vq->maps[index];
 	if (map) {
 		vhost_set_map_dirty(vq, map, index);
-		rcu_assign_pointer(vq->maps[index], NULL);
+		vq->maps[index] = NULL;
 	}
 	spin_unlock(&vq->mmu_lock);
 
 	if (map) {
-		synchronize_rcu();
+		vhost_vq_sync_access(vq);
 		vhost_map_unprefetch(map);
 	}
 }
@@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
 	for (i = 0; i < dev->nvqs; ++i) {
 		vq = dev->vqs[i];
 		for (j = 0; j < VHOST_NUM_ADDRS; j++)
-			RCU_INIT_POINTER(vq->maps[j], NULL);
+			vq->maps[j] = NULL;
 	}
 }
 #endif
@@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vq->indirect = NULL;
 		vq->heads = NULL;
 		vq->dev = dev;
+		vq->ref = 0;
 		mutex_init(&vq->mutex);
 		spin_lock_init(&vq->mmu_lock);
 		vhost_vq_reset(dev, vq);
@@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
 	map->npages = npages;
 	map->pages = pages;
 
-	rcu_assign_pointer(vq->maps[index], map);
+	vq->maps[index] = map;
 	/* No need for a synchronize_rcu(). This function should be
 	 * called by dev->worker so we are serialized with all
 	 * readers.
@@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			*((__virtio16 *)&used->ring[vq->num]) =
 				cpu_to_vhost16(vq, vq->avail_idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
 	size_t size;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			size = count * sizeof(*head);
 			memcpy(used->ring + idx, head, size);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			used->flags = cpu_to_vhost16(vq, vq->used_flags);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*idx = avail->idx;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*head = avail->ring[idx & (vq->num - 1)];
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*flags = avail->flags;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
 	struct vring_avail *avail;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
-		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
+		vhost_vq_access_map_begin(vq);
+		map = vq->maps[VHOST_ADDR_AVAIL];
 		if (likely(map)) {
 			avail = map->addr;
 			*event = (__virtio16)avail->ring[vq->num];
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
 	struct vring_used *used;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
+		map = vq->maps[VHOST_ADDR_USED];
 		if (likely(map)) {
 			used = map->addr;
 			*idx = used->idx;
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
 	struct vring_desc *d;
 
 	if (!vq->iotlb) {
-		rcu_read_lock();
+		vhost_vq_access_map_begin(vq);
 
-		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
+		map = vq->maps[VHOST_ADDR_DESC];
 		if (likely(map)) {
 			d = map->addr;
 			*desc = *(d + idx);
-			rcu_read_unlock();
+			vhost_vq_access_map_end(vq);
 			return 0;
 		}
 
-		rcu_read_unlock();
+		vhost_vq_access_map_end(vq);
 	}
 #endif
 
@@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
 static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
 {
-	struct vhost_map __rcu *map;
+	struct vhost_map *map;
 	int i;
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
-		rcu_read_lock();
-		map = rcu_dereference(vq->maps[i]);
-		rcu_read_unlock();
+		map = vq->maps[i];
 		if (unlikely(!map))
 			vhost_map_prefetch(vq, i);
 	}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a9a2a93857d2..f9e9558a529d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -115,16 +115,17 @@ struct vhost_virtqueue {
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
 	/* Read by memory accessors, modified by meta data
 	 * prefetching, MMU notifier and vring ioctl().
-	 * Synchonrized through mmu_lock (writers) and RCU (writers
-	 * and readers).
+	 * Synchonrized through mmu_lock (writers) and ref counters,
+	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
 	 */
-	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
+	struct vhost_map *maps[VHOST_NUM_ADDRS];
 	/* Read by MMU notifier, modified by vring ioctl(),
 	 * synchronized through MMU notifier
 	 * registering/unregistering.
 	 */
 	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
 #endif
+	int ref;
 	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
 
 	struct file *kick;
-- 
2.18.1


^ permalink raw reply related

* [PATCH V2 6/9] vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps()
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-1-jasowang@redhat.com>

There's no need for RCU synchronization in vhost_uninit_vq_maps()
since we've already serialized with readers (memory accessors). This
also avoid the possible userspace DOS through ioctl() because of the
possible high latency caused by synchronize_rcu().

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c12cdadb0855..cfc11f9ed9c9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -333,7 +333,9 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 	}
 	spin_unlock(&vq->mmu_lock);
 
-	synchronize_rcu();
+	/* No need for synchronize_rcu() or kfree_rcu() since we are
+	 * serialized with memory accessors (e.g vq mutex held).
+	 */
 
 	for (i = 0; i < VHOST_NUM_ADDRS; i++)
 		if (map[i])
-- 
2.18.1


^ permalink raw reply related

* [PATCH V2 5/9] vhost: mark dirty pages during map uninit
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-1-jasowang@redhat.com>

We don't mark dirty pages if the map was teared down outside MMU
notifier. This will lead untracked dirty pages. Fixing by marking
dirty pages during map uninit.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2a7217c33668..c12cdadb0855 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -305,6 +305,18 @@ static void vhost_map_unprefetch(struct vhost_map *map)
 	kfree(map);
 }
 
+static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
+				struct vhost_map *map, int index)
+{
+	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
+	int i;
+
+	if (uaddr->write) {
+		for (i = 0; i < map->npages; i++)
+			set_page_dirty(map->pages[i]);
+	}
+}
+
 static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 {
 	struct vhost_map *map[VHOST_NUM_ADDRS];
@@ -314,8 +326,10 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
 	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
 		map[i] = rcu_dereference_protected(vq->maps[i],
 				  lockdep_is_held(&vq->mmu_lock));
-		if (map[i])
+		if (map[i]) {
+			vhost_set_map_dirty(vq, map[i], i);
 			rcu_assign_pointer(vq->maps[i], NULL);
+		}
 	}
 	spin_unlock(&vq->mmu_lock);
 
@@ -353,7 +367,6 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 {
 	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
 	struct vhost_map *map;
-	int i;
 
 	if (!vhost_map_range_overlap(uaddr, start, end))
 		return;
@@ -364,10 +377,7 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
 	map = rcu_dereference_protected(vq->maps[index],
 					lockdep_is_held(&vq->mmu_lock));
 	if (map) {
-		if (uaddr->write) {
-			for (i = 0; i < map->npages; i++)
-				set_page_dirty(map->pages[i]);
-		}
+		vhost_set_map_dirty(vq, map, index);
 		rcu_assign_pointer(vq->maps[index], NULL);
 	}
 	spin_unlock(&vq->mmu_lock);
-- 
2.18.1


^ permalink raw reply related

* [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr()
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-1-jasowang@redhat.com>

The vhost_set_vring_num_addr() could be called in the middle of
invalidate_range_start() and invalidate_range_end(). If we don't reset
invalidate_count after the un-registering of MMU notifier, the
invalidate_cont will run out of sync (e.g never reach zero). This will
in fact disable the fast accessor path. Fixing by reset the count to
zero.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2a3154976277..2a7217c33668 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2073,6 +2073,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 		d->has_notifier = false;
 	}
 
+	/* reset invalidate_count in case we are in the middle of
+	 * invalidate_start() and invalidate_end().
+	 */
+	vq->invalidate_count = 0;
 	vhost_uninit_vq_maps(vq);
 #endif
 
-- 
2.18.1


^ permalink raw reply related

* [PATCH V2 3/9] vhost: fix vhost map leak
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-1-jasowang@redhat.com>

We don't free map during vhost_map_unprefetch(). This means it could
be leaked. Fixing by free the map.

Reported-by: Michael S. Tsirkin <mst@redhat.com>
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 17f6abea192e..2a3154976277 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -302,9 +302,7 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 static void vhost_map_unprefetch(struct vhost_map *map)
 {
 	kfree(map->pages);
-	map->pages = NULL;
-	map->npages = 0;
-	map->addr = NULL;
+	kfree(map);
 }
 
 static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
-- 
2.18.1


^ permalink raw reply related

* [PATCH V2 2/9] vhost: validate MMU notifier registration
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-1-jasowang@redhat.com>

The return value of mmu_notifier_register() is not checked in
vhost_vring_set_num_addr(). This will cause an out of sync between mm
and MMU notifier thus a double free. To solve this, introduce a
boolean flag to track whether MMU notifier is registered and only do
unregistering when it was true.

Reported-and-tested-by:
syzbot+e58112d71f77113ddb7b@syzkaller.appspotmail.com
Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 19 +++++++++++++++----
 drivers/vhost/vhost.h |  1 +
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 488380a581dc..17f6abea192e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -629,6 +629,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 	dev->iov_limit = iov_limit;
 	dev->weight = weight;
 	dev->byte_weight = byte_weight;
+	dev->has_notifier = false;
 	init_llist_head(&dev->work_list);
 	init_waitqueue_head(&dev->wait);
 	INIT_LIST_HEAD(&dev->read_list);
@@ -730,6 +731,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	if (err)
 		goto err_mmu_notifier;
 #endif
+	dev->has_notifier = true;
 
 	return 0;
 
@@ -959,7 +961,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 	}
 	if (dev->mm) {
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
-		mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
+		if (dev->has_notifier) {
+			mmu_notifier_unregister(&dev->mmu_notifier,
+						dev->mm);
+			dev->has_notifier = false;
+		}
 #endif
 		mmput(dev->mm);
 	}
@@ -2064,8 +2070,10 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 	/* Unregister MMU notifer to allow invalidation callback
 	 * can access vq->uaddrs[] without holding a lock.
 	 */
-	if (d->mm)
+	if (d->has_notifier) {
 		mmu_notifier_unregister(&d->mmu_notifier, d->mm);
+		d->has_notifier = false;
+	}
 
 	vhost_uninit_vq_maps(vq);
 #endif
@@ -2085,8 +2093,11 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 	if (r == 0)
 		vhost_setup_vq_uaddr(vq);
 
-	if (d->mm)
-		mmu_notifier_register(&d->mmu_notifier, d->mm);
+	if (d->mm) {
+		r = mmu_notifier_register(&d->mmu_notifier, d->mm);
+		if (!r)
+			d->has_notifier = true;
+	}
 #endif
 
 	mutex_unlock(&vq->mutex);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 42a8c2a13ab1..a9a2a93857d2 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -214,6 +214,7 @@ struct vhost_dev {
 	int iov_limit;
 	int weight;
 	int byte_weight;
+	bool has_notifier;
 };
 
 bool vhost_exceeds_weight(struct vhost_virtqueue *vq, int pkts, int total_len);
-- 
2.18.1


^ permalink raw reply related

* [PATCH V2 1/9] vhost: don't set uaddr for invalid address
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg
In-Reply-To: <20190731084655.7024-1-jasowang@redhat.com>

We should not setup uaddr for the invalid address, otherwise we may
try to pin or prefetch mapping of wrong pages.

Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0536f8526359..488380a581dc 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2082,7 +2082,8 @@ static long vhost_vring_set_num_addr(struct vhost_dev *d,
 	}
 
 #if VHOST_ARCH_CAN_ACCEL_UACCESS
-	vhost_setup_vq_uaddr(vq);
+	if (r == 0)
+		vhost_setup_vq_uaddr(vq);
 
 	if (d->mm)
 		mmu_notifier_register(&d->mmu_notifier, d->mm);
-- 
2.18.1


^ permalink raw reply related

* [PATCH V2 0/9] Fixes for metadata accelreation
From: Jason Wang @ 2019-07-31  8:46 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev; +Cc: linux-kernel, linux-mm, jgg

Hi all:

This series try to fix several issues introduced by meta data
accelreation series. Please review.

Changes from V1:

- Try not use RCU to syncrhonize MMU notifier with vhost worker
- set dirty pages after no readers
- return -EAGAIN only when we find the range is overlapped with
  metadata

Jason Wang (9):
  vhost: don't set uaddr for invalid address
  vhost: validate MMU notifier registration
  vhost: fix vhost map leak
  vhost: reset invalidate_count in vhost_set_vring_num_addr()
  vhost: mark dirty pages during map uninit
  vhost: don't do synchronize_rcu() in vhost_uninit_vq_maps()
  vhost: do not use RCU to synchronize MMU notifier with worker
  vhost: correctly set dirty pages in MMU notifiers callback
  vhost: do not return -EAGIAN for non blocking invalidation too early

 drivers/vhost/vhost.c | 232 +++++++++++++++++++++++++++---------------
 drivers/vhost/vhost.h |   8 +-
 2 files changed, 154 insertions(+), 86 deletions(-)

-- 
2.18.1


^ permalink raw reply

* Re: [PATCH v5 12/29] compat_ioctl: move drivers to compat_ptr_ioctl
From: Cornelia Huck @ 2019-07-31  8:37 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Alexander Viro, devel, linux-input, kvm, Michael S . Tsirkin,
	Greg Kroah-Hartman, linux-usb, netdev, linux-kernel,
	Jarkko Sakkinen, virtualization, Jason Gunthorpe, linux-mtd,
	Stefan Hajnoczi, Jiri Kosina, linux-fsdevel, ceph-devel,
	linux-integrity, linux1394-devel, linux-stm32, linux-arm-kernel
In-Reply-To: <20190730195227.742215-1-arnd@arndb.de>

On Tue, 30 Jul 2019 21:50:28 +0200
Arnd Bergmann <arnd@arndb.de> wrote:

> Each of these drivers has a copy of the same trivial helper function to
> convert the pointer argument and then call the native ioctl handler.
> 
> We now have a generic implementation of that, so use it.
> 
> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jiri Kosina <jkosina@suse.cz>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---

>  drivers/vfio/vfio.c               | 39 +++----------------------------

vfio changes:

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-31  8:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <CAEf4Bza3cAoZJE+24_MBiv-8yYtAaTkAez5xq1v12cLW1-RGcw@mail.gmail.com>



> On Jul 30, 2019, at 11:52 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Tue, Jul 30, 2019 at 10:19 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 6:00 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>> 
>>> On Tue, Jul 30, 2019 at 5:39 PM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>>>>> 
>>>>> This patch implements the core logic for BPF CO-RE offsets relocations.
>>>>> Every instruction that needs to be relocated has corresponding
>>>>> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
>>>>> to match recorded "local" relocation spec against potentially many
>>>>> compatible "target" types, creating corresponding spec. Details of the
>>>>> algorithm are noted in corresponding comments in the code.
>>>>> 
>>>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

[...]

>>>> 
>>> 
>>> I just picked the most succinct and non-repetitive form. It's
>>> immediately apparent which type it's implicitly converted to, so I
>>> felt there is no need to repeat it. Also, just (void *) is much
>>> shorter. :)
>> 
>> _All_ other code in btf.c converts the pointer to the target type.
> 
> Most in libbpf.c doesn't, though. Also, I try to preserve pointer
> constness for uses that don't modify BTF types (pretty much all of
> them in libbpf), so it becomes really verbose, despite extremely short
> variable names:
> 
> const struct btf_member *m = (const struct btf_member *)(t + 1);

I don't think being verbose is a big problem here. Overusing 
(void *) feels like a bigger problem. 

> 
> Add one or two levels of nestedness and you are wrapping this line.
> 
>> In some cases, it is not apparent which type it is converted to,
>> for example:
>> 
>> +       m = (void *)(targ_type + 1);
>> 
>> I would suggest we do implicit conversion whenever possible.
> 
> Implicit conversion (`m = targ_type + 1;`) is a compilation error,
> that won't work.

I misused "implicit" here. I actually meant to say

	m = ((const struct btf_member *)(t + 1);




^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox