Netdev List
 help / color / mirror / Atom feed
* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2017-12-21  0:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Sridhar Samudrala, stephen, netdev, virtualization,
	alexander.duyck
In-Reply-To: <20171220143140.0a3dc7f1@cakuba.netronome.com>

On Wed, Dec 20, 2017 at 02:33:34PM -0800, Jakub Kicinski wrote:
> On Mon, 18 Dec 2017 16:40:36 -0800, Sridhar Samudrala wrote:
> > +static int virtio_netdev_event(struct notifier_block *this,
> > +			       unsigned long event, void *ptr)
> > +{
> > +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> > +
> > +	/* Skip our own events */
> > +	if (event_dev->netdev_ops == &virtnet_netdev)
> > +		return NOTIFY_DONE;
> 
> I wonder how does this work WRT loop prevention.  What if I have two
> virtio devices with the same MAC, what is preventing them from claiming
> each other?  Is it only the check above (not sure what is "own" in
> comment referring to)?

I expect we will add a feature bit (VIRTIO_NET_F_MASTER) and it will
be host's responsibility not to add more than 1 such device.

> I'm worried the check above will not stop virtio from enslaving hyperv
> interfaces and vice versa, potentially leading to a loop, no?  There is
> also the fact that it would be preferable to share the code between
> paravirt drivers, to avoid duplicated bugs.
> 
> My suggestion during the previous discussion was to create a paravirt
> bond device, which will explicitly tell the OS which interfaces to
> bond, regardless of which driver they're using.  Could be some form of
> ACPI/FW driver too, I don't know enough about x86 FW to suggest
> something fully fleshed :(

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2017-12-21  0:14 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: stephen, netdev, virtualization, alexander.duyck
In-Reply-To: <1513644036-45230-1-git-send-email-sridhar.samudrala@intel.com>

On Mon, Dec 18, 2017 at 04:40:36PM -0800, Sridhar Samudrala wrote:
> This patch enables virtio to switch over to a VF datapath when a VF netdev
> is present with the same MAC address.

I prefer saying "a passthrough device" here. Does not have to be a VF at
all.

>  It allows live migration of a VM
> with a direct attached VF without the need to setup a bond/team between a
> VF and virtio net device in the guest.
> 
> The hypervisor needs to unplug the VF device from the guest on the source
> host and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed, the
> destination hypervisor sets the MAC filter on the VF and plugs it back to
> the guest to switch over to VF datapath.
> 
> It is entirely based on netvsc implementation and it should be possible to
> make this code generic and move it to a common location that can be shared
> by netvsc and virtio.
> 
> Also, i think we should make this a negotiated feature that is off by
> default via a new feature bit.

So please include this. A copy needs to go to virtio TC
to reserve the bit. Enabling this by default risks breaking
too many configurations.

> 
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
>  drivers/net/virtio_net.c | 341 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 339 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 559b215c0169..a34c717bb15b 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -31,6 +31,8 @@
>  #include <linux/average.h>
>  #include <linux/filter.h>
>  #include <net/route.h>
> +#include <linux/netdevice.h>
> +#include <linux/netpoll.h>
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -56,6 +58,8 @@ module_param(napi_tx, bool, 0644);
>   */
>  DECLARE_EWMA(pkt_len, 0, 64)
>  
> +#define VF_TAKEOVER_INT	(HZ / 10)
> +
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
>  static const unsigned long guest_offloads[] = {
> @@ -117,6 +121,15 @@ struct receive_queue {
>  	char name[40];
>  };
>  
> +struct virtnet_vf_pcpu_stats {
> +	u64	rx_packets;
> +	u64	rx_bytes;
> +	u64	tx_packets;
> +	u64	tx_bytes;
> +	struct u64_stats_sync   syncp;
> +	u32	tx_dropped;
> +};
> +
>  struct virtnet_info {
>  	struct virtio_device *vdev;
>  	struct virtqueue *cvq;
> @@ -179,6 +192,11 @@ struct virtnet_info {
>  	u32 speed;
>  
>  	unsigned long guest_offloads;
> +
> +	/* State to manage the associated VF interface. */
> +	struct net_device __rcu *vf_netdev;
> +	struct virtnet_vf_pcpu_stats __percpu *vf_stats;
> +	struct delayed_work vf_takeover;
>  };
>  
>  struct padded_vnet_hdr {
> @@ -1300,16 +1318,51 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
>  }
>  
> +/* Send skb on the slave VF device. */
> +static int virtnet_vf_xmit(struct net_device *dev, struct net_device *vf_netdev,
> +			   struct sk_buff *skb)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	unsigned int len = skb->len;
> +	int rc;
> +
> +	skb->dev = vf_netdev;
> +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> +	rc = dev_queue_xmit(skb);
> +	if (likely(rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN)) {
> +		struct virtnet_vf_pcpu_stats *pcpu_stats
> +			= this_cpu_ptr(vi->vf_stats);
> +
> +		u64_stats_update_begin(&pcpu_stats->syncp);
> +		pcpu_stats->tx_packets++;
> +		pcpu_stats->tx_bytes += len;
> +		u64_stats_update_end(&pcpu_stats->syncp);
> +	} else {
> +		this_cpu_inc(vi->vf_stats->tx_dropped);
> +	}
> +
> +	return rc;
> +}
> +
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int qnum = skb_get_queue_mapping(skb);
>  	struct send_queue *sq = &vi->sq[qnum];
> +	struct net_device *vf_netdev;
>  	int err;
>  	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>  	bool kick = !skb->xmit_more;
>  	bool use_napi = sq->napi.weight;
>  
> +	/* if VF is present and up then redirect packets
> +	 * called with rcu_read_lock_bh
> +	 */
> +	vf_netdev = rcu_dereference_bh(vi->vf_netdev);
> +	if (vf_netdev && netif_running(vf_netdev) && !netpoll_tx_running(dev))
> +		return virtnet_vf_xmit(dev, vf_netdev, skb);
> +
>  	/* Free up any pending old buffers before queueing new ones. */
>  	free_old_xmit_skbs(sq);
>  
> @@ -1456,10 +1509,41 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
>  	return ret;
>  }
>  
> +static void virtnet_get_vf_stats(struct net_device *dev,
> +				 struct virtnet_vf_pcpu_stats *tot)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
> +
> +	memset(tot, 0, sizeof(*tot));
> +
> +	for_each_possible_cpu(i) {
> +		const struct virtnet_vf_pcpu_stats *stats
> +				= per_cpu_ptr(vi->vf_stats, i);
> +		u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
> +		unsigned int start;
> +
> +		do {
> +			start = u64_stats_fetch_begin_irq(&stats->syncp);
> +			rx_packets = stats->rx_packets;
> +			tx_packets = stats->tx_packets;
> +			rx_bytes = stats->rx_bytes;
> +			tx_bytes = stats->tx_bytes;
> +		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
> +
> +		tot->rx_packets += rx_packets;
> +		tot->tx_packets += tx_packets;
> +		tot->rx_bytes   += rx_bytes;
> +		tot->tx_bytes   += tx_bytes;
> +		tot->tx_dropped += stats->tx_dropped;
> +	}
> +}
> +
>  static void virtnet_stats(struct net_device *dev,
>  			  struct rtnl_link_stats64 *tot)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	struct virtnet_vf_pcpu_stats vf_stats;
>  	int cpu;
>  	unsigned int start;
>  
> @@ -1490,6 +1574,13 @@ static void virtnet_stats(struct net_device *dev,
>  	tot->rx_dropped = dev->stats.rx_dropped;
>  	tot->rx_length_errors = dev->stats.rx_length_errors;
>  	tot->rx_frame_errors = dev->stats.rx_frame_errors;
> +
> +	virtnet_get_vf_stats(dev, &vf_stats);
> +	tot->rx_packets += vf_stats.rx_packets;
> +	tot->tx_packets += vf_stats.tx_packets;
> +	tot->rx_bytes += vf_stats.rx_bytes;
> +	tot->tx_bytes += vf_stats.tx_bytes;
> +	tot->tx_dropped += vf_stats.tx_dropped;
>  }
>  
>  #ifdef CONFIG_NET_POLL_CONTROLLER
> @@ -2508,6 +2599,47 @@ static int virtnet_validate(struct virtio_device *vdev)
>  	return 0;
>  }
>  
> +static void __virtnet_vf_setup(struct net_device *ndev,
> +			       struct net_device *vf_netdev)
> +{
> +	int ret;
> +
> +	/* Align MTU of VF with master */
> +	ret = dev_set_mtu(vf_netdev, ndev->mtu);
> +	if (ret)
> +		netdev_warn(vf_netdev,
> +			    "unable to change mtu to %u\n", ndev->mtu);
> +
> +	if (netif_running(ndev)) {
> +		ret = dev_open(vf_netdev);
> +		if (ret)
> +			netdev_warn(vf_netdev,
> +				    "unable to open: %d\n", ret);
> +	}
> +}
> +
> +/* Setup VF as slave of the virtio device.
> + * Runs in workqueue to avoid recursion in netlink callbacks.
> + */
> +static void virtnet_vf_setup(struct work_struct *w)
> +{
> +	struct virtnet_info *vi
> +		= container_of(w, struct virtnet_info, vf_takeover.work);
> +	struct net_device *ndev = vi->dev;
> +	struct net_device *vf_netdev;
> +
> +	if (!rtnl_trylock()) {
> +		schedule_delayed_work(&vi->vf_takeover, 0);
> +		return;
> +	}
> +
> +	vf_netdev = rtnl_dereference(vi->vf_netdev);
> +	if (vf_netdev)
> +		__virtnet_vf_setup(ndev, vf_netdev);
> +
> +	rtnl_unlock();
> +}
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>  	int i, err;
> @@ -2600,6 +2732,11 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	}
>  
>  	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
> +	INIT_DELAYED_WORK(&vi->vf_takeover, virtnet_vf_setup);
> +
> +	vi->vf_stats = netdev_alloc_pcpu_stats(struct virtnet_vf_pcpu_stats);
> +	if (!vi->vf_stats)
> +		goto free_stats;
>  
>  	/* If we can receive ANY GSO packets, we must allocate large ones. */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> @@ -2634,7 +2771,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  			 */
>  			dev_err(&vdev->dev, "device MTU appears to have changed "
>  				"it is now %d < %d", mtu, dev->min_mtu);
> -			goto free_stats;
> +			goto free_vf_stats;
>  		}
>  
>  		dev->mtu = mtu;
> @@ -2658,7 +2795,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>  	err = init_vqs(vi);
>  	if (err)
> -		goto free_stats;
> +		goto free_vf_stats;
>  
>  #ifdef CONFIG_SYSFS
>  	if (vi->mergeable_rx_bufs)
> @@ -2712,6 +2849,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	cancel_delayed_work_sync(&vi->refill);
>  	free_receive_page_frags(vi);
>  	virtnet_del_vqs(vi);
> +free_vf_stats:
> +	free_percpu(vi->vf_stats);
>  free_stats:
>  	free_percpu(vi->stats);
>  free:
> @@ -2733,19 +2872,178 @@ static void remove_vq_common(struct virtnet_info *vi)
>  	virtnet_del_vqs(vi);
>  }
>  
> +static struct net_device *get_virtio_bymac(const u8 *mac)
> +{
> +	struct net_device *dev;
> +
> +	ASSERT_RTNL();
> +
> +	for_each_netdev(&init_net, dev) {
> +		if (dev->netdev_ops != &virtnet_netdev)
> +			continue;       /* not a virtio_net device */
> +
> +		if (ether_addr_equal(mac, dev->perm_addr))
> +			return dev;
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct net_device *get_virtio_byref(struct net_device *vf_netdev)
> +{
> +	struct net_device *dev;
> +
> +	ASSERT_RTNL();
> +
> +	for_each_netdev(&init_net, dev) {
> +		struct virtnet_info *vi;
> +
> +		if (dev->netdev_ops != &virtnet_netdev)
> +			continue;	/* not a virtio_net device */
> +
> +		vi = netdev_priv(dev);
> +		if (rtnl_dereference(vi->vf_netdev) == vf_netdev)
> +			return dev;	/* a match */
> +	}
> +
> +	return NULL;
> +}
> +
> +/* Called when VF is injecting data into network stack.
> + * Change the associated network device from VF to virtio.
> + * note: already called with rcu_read_lock
> + */
> +static rx_handler_result_t virtnet_vf_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
> +	struct virtnet_info *vi = netdev_priv(ndev);
> +	struct virtnet_vf_pcpu_stats *pcpu_stats =
> +				this_cpu_ptr(vi->vf_stats);
> +
> +	skb->dev = ndev;
> +
> +	u64_stats_update_begin(&pcpu_stats->syncp);
> +	pcpu_stats->rx_packets++;
> +	pcpu_stats->rx_bytes += skb->len;
> +	u64_stats_update_end(&pcpu_stats->syncp);
> +
> +	return RX_HANDLER_ANOTHER;
> +}
> +
> +static int virtnet_vf_join(struct net_device *vf_netdev,
> +			   struct net_device *ndev)
> +{
> +	struct virtnet_info *vi = netdev_priv(ndev);
> +	int ret;
> +
> +	ret = netdev_rx_handler_register(vf_netdev,
> +					 virtnet_vf_handle_frame, ndev);
> +	if (ret != 0) {
> +		netdev_err(vf_netdev,
> +			   "can not register virtio VF receive handler (err = %d)\n",
> +			   ret);
> +		goto rx_handler_failed;
> +	}
> +
> +	ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
> +	if (ret != 0) {
> +		netdev_err(vf_netdev,
> +			   "can not set master device %s (err = %d)\n",
> +			   ndev->name, ret);
> +		goto upper_link_failed;
> +	}
> +
> +	/* set slave flag before open to prevent IPv6 addrconf */
> +	vf_netdev->flags |= IFF_SLAVE;
> +
> +	schedule_delayed_work(&vi->vf_takeover, VF_TAKEOVER_INT);
> +
> +	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
> +
> +	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
> +	return 0;
> +
> +upper_link_failed:
> +	netdev_rx_handler_unregister(vf_netdev);
> +rx_handler_failed:
> +	return ret;
> +}
> +
> +static int virtnet_register_vf(struct net_device *vf_netdev)
> +{
> +	struct net_device *ndev;
> +	struct virtnet_info *vi;
> +
> +	if (vf_netdev->addr_len != ETH_ALEN)
> +		return NOTIFY_DONE;
> +
> +	/* We will use the MAC address to locate the virtio_net interface to
> +	 * associate with the VF interface. If we don't find a matching
> +	 * virtio interface, move on.
> +	 */
> +	ndev = get_virtio_bymac(vf_netdev->perm_addr);
> +	if (!ndev)
> +		return NOTIFY_DONE;
> +
> +	vi = netdev_priv(ndev);
> +	if (rtnl_dereference(vi->vf_netdev))
> +		return NOTIFY_DONE;
> +
> +	if (virtnet_vf_join(vf_netdev, ndev) != 0)
> +		return NOTIFY_DONE;
> +
> +	netdev_info(ndev, "VF registering %s\n", vf_netdev->name);
> +
> +	dev_hold(vf_netdev);
> +	rcu_assign_pointer(vi->vf_netdev, vf_netdev);
> +
> +	return NOTIFY_OK;
> +}
> +
> +static int virtnet_unregister_vf(struct net_device *vf_netdev)
> +{
> +	struct net_device *ndev;
> +	struct virtnet_info *vi;
> +
> +	ndev = get_virtio_byref(vf_netdev);
> +	if (!ndev)
> +		return NOTIFY_DONE;
> +
> +	vi = netdev_priv(ndev);
> +	cancel_delayed_work_sync(&vi->vf_takeover);
> +
> +	netdev_info(ndev, "VF unregistering %s\n", vf_netdev->name);
> +
> +	netdev_rx_handler_unregister(vf_netdev);
> +	netdev_upper_dev_unlink(vf_netdev, ndev);
> +	RCU_INIT_POINTER(vi->vf_netdev, NULL);
> +	dev_put(vf_netdev);
> +
> +	return NOTIFY_OK;
> +}
> +
>  static void virtnet_remove(struct virtio_device *vdev)
>  {
>  	struct virtnet_info *vi = vdev->priv;
> +	struct net_device *vf_netdev;
>  
>  	virtnet_cpu_notif_remove(vi);
>  
>  	/* Make sure no work handler is accessing the device. */
>  	flush_work(&vi->config_work);
>  
> +	rtnl_lock();
> +	vf_netdev = rtnl_dereference(vi->vf_netdev);
> +	if (vf_netdev)
> +		virtnet_unregister_vf(vf_netdev);
> +	rtnl_unlock();
> +
>  	unregister_netdev(vi->dev);
>  
>  	remove_vq_common(vi);
>  
> +	free_percpu(vi->vf_stats);
>  	free_percpu(vi->stats);
>  	free_netdev(vi->dev);
>  }
> @@ -2823,6 +3121,42 @@ static struct virtio_driver virtio_net_driver = {
>  #endif
>  };
>  
> +static int virtio_netdev_event(struct notifier_block *this,
> +			       unsigned long event, void *ptr)
> +{
> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +	/* Skip our own events */
> +	if (event_dev->netdev_ops == &virtnet_netdev)
> +		return NOTIFY_DONE;
> +
> +	/* Avoid non-Ethernet type devices */
> +	if (event_dev->type != ARPHRD_ETHER)
> +		return NOTIFY_DONE;
> +
> +	/* Avoid Vlan dev with same MAC registering as VF */
> +	if (is_vlan_dev(event_dev))
> +		return NOTIFY_DONE;
> +
> +	/* Avoid Bonding master dev with same MAC registering as VF */
> +	if ((event_dev->priv_flags & IFF_BONDING) &&
> +	    (event_dev->flags & IFF_MASTER))
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		return virtnet_register_vf(event_dev);
> +	case NETDEV_UNREGISTER:
> +		return virtnet_unregister_vf(event_dev);
> +	default:
> +		return NOTIFY_DONE;
> +	}
> +}
> +
> +static struct notifier_block virtio_netdev_notifier = {
> +	.notifier_call = virtio_netdev_event,
> +};
> +
>  static __init int virtio_net_driver_init(void)
>  {
>  	int ret;
> @@ -2841,6 +3175,8 @@ static __init int virtio_net_driver_init(void)
>          ret = register_virtio_driver(&virtio_net_driver);
>  	if (ret)
>  		goto err_virtio;
> +
> +	register_netdevice_notifier(&virtio_netdev_notifier);
>  	return 0;
>  err_virtio:
>  	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> @@ -2853,6 +3189,7 @@ module_init(virtio_net_driver_init);
>  
>  static __exit void virtio_net_driver_exit(void)
>  {
> +	unregister_netdevice_notifier(&virtio_netdev_notifier);
>  	unregister_virtio_driver(&virtio_net_driver);
>  	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
>  	cpuhp_remove_multi_state(virtionet_online);
> -- 
> 2.14.3

^ permalink raw reply

* Re: [patch iproute2] tc: add -bs option for batch mode
From: David Ahern @ 2017-12-21  0:13 UTC (permalink / raw)
  To: Stephen Hemminger, Chris Mi; +Cc: netdev@vger.kernel.org, gerlitz.or@gmail.com
In-Reply-To: <20171220071744.25f9dd41@xeon-e3>

On 12/20/17 8:17 AM, Stephen Hemminger wrote:
> On Wed, 20 Dec 2017 09:23:34 +0000
> Chris Mi <chrism@mellanox.com> wrote:
> 
>>> Your real performance win is just not asking for ACK for every rule.  
>> No. Even if batch_size > 1, we ack every rule. The real performance win is
>> to send multiple rules in one system call. If we are not asking for ACK for every rule,
>> the performance will be improved further.
> 
> Try the no ACK method.
> 
> When we were optimizing routing daemons like Quagga, it was discovered
> that an ACK for every route insert was the main bottleneck. Doing asynchronous
> error handling got a bigger win than your batching.
> 
> Please try that, doing multiple messages using iov is not necessary.
> 

FWIW, I plan to look at batching routes in a similar fashion.

^ permalink raw reply

* Re: [PATCH net-next v4 0/6] net: tcp: sctp: dccp: Replace jprobe usage with trace events
From: Masami Hiramatsu @ 2017-12-21  0:10 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, ian.mcdonald, vyasevich, stephen, rostedt, peterz, tglx,
	linux-kernel, hpa, gerrit, nhorman, dccp, netdev, linux-sctp, sfr
In-Reply-To: <20171220.142424.1214953044518214833.davem@davemloft.net>

On Wed, 20 Dec 2017 14:24:24 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: David Miller <davem@davemloft.net>
> Date: Wed, 20 Dec 2017 14:20:40 -0500 (EST)
> 
> > From: Masami Hiramatsu <mhiramat@kernel.org>
> > Date: Wed, 20 Dec 2017 13:14:11 +0900
> > 
> >> This series is v4 of the replacement of jprobe usage with trace
> >> events. This version is rebased on net-next, fixes a build warning
> >> and moves a temporal variable definition in a block.
> >> 
> >> Previous version is here;
> >> https://lkml.org/lkml/2017/12/19/153
> >> 
> >> Changes from v3:
> >>   All: Rebased on net-next
> >>   [3/6]: fixes a build warning for i386 by casting pointer unsigned
> >>         long instead of __u64, and moves a temporal variable
> >>          definition in a block.
> > 
> > Looks good, series applied to net-next, thanks.
> 
> Actually, this doesn't even compile, so I've reverted:
> 
> [davem@dhcp-10-15-49-227 net-next]$ make -s -j16
> In file included from net/dccp/trace.h:105:0,
>                  from net/dccp/proto.c:42:
> ./include/trace/define_trace.h:89:42: fatal error: ./trace.h: No such file or directory
>  #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
>                                           ^
> compilation terminated.

Hmm, strange.
I could compile it on x86-64 and i386. Let me check what was wrong.

Thanks,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: RCU callback crashes
From: Jakub Kicinski @ 2017-12-21  0:08 UTC (permalink / raw)
  To: Cong Wang; +Cc: Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <CAM_iQpUngX+oSDiforfZceqMZrg=jDJnNf3QFF9WFQdHrU9o-g@mail.gmail.com>

On Wed, 20 Dec 2017 16:03:49 -0800, Cong Wang wrote:
> On Wed, Dec 20, 2017 at 10:31 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:  
> >>
> >> I guess it is q->miniqp which is freed in qdisc_graft() without properly
> >> waiting for rcu readers?  
> >
> > It is probably so, the call_rcu_bh(&miniq_old->rcu, mini_qdisc_rcu_func)
> > in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
> > but miniq is being freed, no rcu barrier waits for it...
> >
> > You can try to add a rcu_barrier_bh() at the end to see if this crash
> > is gone, but I don't think people like adding yet another rcu barrier...  
> 
> Hi, Jakub
> 
> Can you test the following fix? I am not a fan of rcu barrier but we
> already have one so...
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 876fab2604b8..1b68fedea124 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1240,6 +1240,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
> 
>         if (!tp_head) {
>                 RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
> +               /* Wait for existing flying RCU callback before being freed. */
> +               rcu_barrier_bh();
>                 return;
>         }

Mm.. I was running with this hack for the last two hours and it was OK:

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 876fab2604b8..d7e0c3ad0a1c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1260,6 +1260,7 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
                 * are not seeing it.
                 */
                call_rcu_bh(&miniq_old->rcu, mini_qdisc_rcu_func);
+       rcu_barrier_bh();
 }
 EXPORT_SYMBOL(mini_qdisc_pair_swap);

Let me try to move the barrier...

^ permalink raw reply related

* Re: RCU callback crashes
From: Cong Wang @ 2017-12-21  0:03 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <CAM_iQpVPUifm3rcXu8SP9ShSmm7z9z+8UjppdY_AxMYQwHE9YQ@mail.gmail.com>

On Wed, Dec 20, 2017 at 10:31 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>
>> I guess it is q->miniqp which is freed in qdisc_graft() without properly
>> waiting for rcu readers?
>
> It is probably so, the call_rcu_bh(&miniq_old->rcu, mini_qdisc_rcu_func)
> in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
> but miniq is being freed, no rcu barrier waits for it...
>
> You can try to add a rcu_barrier_bh() at the end to see if this crash
> is gone, but I don't think people like adding yet another rcu barrier...

Hi, Jakub

Can you test the following fix? I am not a fan of rcu barrier but we
already have one so...

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 876fab2604b8..1b68fedea124 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1240,6 +1240,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,

        if (!tp_head) {
                RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
+               /* Wait for existing flying RCU callback before being freed. */
+               rcu_barrier_bh();
                return;
        }

^ permalink raw reply related

* Re: linux-next: build failure after merge of the net-next tree
From: Jakub Kicinski @ 2017-12-20 23:59 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: David Miller, Networking, Linux-Next Mailing List,
	Linux Kernel Mailing List, Daniel Borkmann
In-Reply-To: <20171221104304.38f94a00@canb.auug.org.au>

On Thu, 21 Dec 2017 10:43:04 +1100, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the net-next tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> drivers/net/netdevsim/bpf.c: In function 'nsim_bpf_setup_tc_block_cb':
> drivers/net/netdevsim/bpf.c:127:7: error: 'TC_CLSBPF_REPLACE' undeclared (first use in this function)
>   case TC_CLSBPF_REPLACE:
>        ^
> drivers/net/netdevsim/bpf.c:129:7: error: 'TC_CLSBPF_ADD' undeclared (first use in this function)
>   case TC_CLSBPF_ADD:
>        ^
> drivers/net/netdevsim/bpf.c:131:7: error: 'TC_CLSBPF_DESTROY' undeclared (first use in this function)
>   case TC_CLSBPF_DESTROY:
>        ^
> 
> Caused by commit
> 
>   31d3ad832948 ("netdevsim: add bpf offload support")
> 
> interacting with commit
> 
>   102740bd9436 ("cls_bpf: fix offload assumptions after callback conversion")
> 
> from the net tree.
> 
> I applied the following merge fix patch:
> 
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Thu, 21 Dec 2017 10:18:46 +1100
> Subject: [PATCH] netdevsim: fix up for "cls_bpf: fix offload assumptions after
>  callback conversion"
> 
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

Hi Stephen, sorry about those merges today.  The proper fix is queued
up in patchwork for when net-next is merged in to net:

http://patchwork.ozlabs.org/patch/851063/

I will CC you on patches which may cause/fix merge trouble in the
future, often the information about how to resolve the conflict is
not part of the commit message.

Sorry I didn't think about CCing you earlier!

^ permalink raw reply

* linux-next: build failure after merge of the net-next tree
From: Stephen Rothwell @ 2017-12-20 23:43 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Jakub Kicinski, Daniel Borkmann

Hi all,

After merging the net-next tree, today's linux-next build (x86_64
allmodconfig) failed like this:

drivers/net/netdevsim/bpf.c: In function 'nsim_bpf_setup_tc_block_cb':
drivers/net/netdevsim/bpf.c:127:7: error: 'TC_CLSBPF_REPLACE' undeclared (first use in this function)
  case TC_CLSBPF_REPLACE:
       ^
drivers/net/netdevsim/bpf.c:129:7: error: 'TC_CLSBPF_ADD' undeclared (first use in this function)
  case TC_CLSBPF_ADD:
       ^
drivers/net/netdevsim/bpf.c:131:7: error: 'TC_CLSBPF_DESTROY' undeclared (first use in this function)
  case TC_CLSBPF_DESTROY:
       ^

Caused by commit

  31d3ad832948 ("netdevsim: add bpf offload support")

interacting with commit

  102740bd9436 ("cls_bpf: fix offload assumptions after callback conversion")

from the net tree.

I applied the following merge fix patch:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Thu, 21 Dec 2017 10:18:46 +1100
Subject: [PATCH] netdevsim: fix up for "cls_bpf: fix offload assumptions after
 callback conversion"

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 drivers/net/netdevsim/bpf.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
index 7da814686ad9..afaf980bbbe7 100644
--- a/drivers/net/netdevsim/bpf.c
+++ b/drivers/net/netdevsim/bpf.c
@@ -123,16 +123,10 @@ int nsim_bpf_setup_tc_block_cb(enum tc_setup_type type,
 	if (prog && !prog->aux->offload && !ns->bpf_tc_non_bound_accept)
 		return -EOPNOTSUPP;
 
-	switch (cls_bpf->command) {
-	case TC_CLSBPF_REPLACE:
-		return nsim_bpf_offload(ns, prog, true);
-	case TC_CLSBPF_ADD:
-		return nsim_bpf_offload(ns, prog, false);
-	case TC_CLSBPF_DESTROY:
-		return nsim_bpf_offload(ns, NULL, true);
-	default:
+	if (cls_bpf->command != TC_CLSBPF_OFFLOAD)
 		return -EOPNOTSUPP;
-	}
+
+	return nsim_bpf_offload(ns, prog, cls_bpf->oldprog);
 }
 
 int nsim_bpf_disable_tc(struct netdevsim *ns)
-- 
2.15.0

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply related

* Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"
From: John Fastabend @ 2017-12-20 23:40 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: xiyou.wangcong, jiri, davem, netdev, eric.dumazet
In-Reply-To: <20171220135959.3ff075ac@cakuba.netronome.com>

On 12/20/2017 01:59 PM, Jakub Kicinski wrote:
> On Wed, 20 Dec 2017 12:09:19 -0800, John Fastabend wrote:
>> RCU grace period is needed for lockless qdiscs added in the commit
>> c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array").
>>
>> It is needed now that qdiscs may be lockless otherwise we risk
>> free'ing a qdisc that is still in use from datapath. Additionally,
>> push list cleanup into RCU callback. Otherwise we risk the datapath
>> adding skbs during removal.
>>
>> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> 
> Seems like this revert may be too heavy handed:
> 
> # ./tools/testing/selftests/bpf/test_offload.py --log /tmp/log
> Test destruction of generic XDP...
> Test TC non-offloaded...
> Test TC non-offloaded isn't getting bound...
> Test TC offloads are off by default...
> Test TC offload by default...
> Test TC cBPF bytcode tries offload by default...
> Test TC cBPF unbound bytecode doesn't offload...
> Test TC offloads work...
> FAIL: TC filter did not load with TC offloads enabled
> 
> And it's triggering:
> 
> WARNING: CPU: 15 PID: 1853 at ../drivers/net/netdevsim/bpf.c:372 nsim_bpf_uninit+0x2e/0x41 [netdevsim]
> 
> Which is:
> 
>    368	void nsim_bpf_uninit(struct netdevsim *ns)
>    369	{
>    370		WARN_ON(!list_empty(&ns->bpf_bound_progs));
>    371		WARN_ON(ns->xdp_prog);
>>> 372		WARN_ON(ns->bpf_offloaded);
>    373	}
> 
> (Meaning the offload was not stopped by the stack before ndo_uninit.)
> 

Dang. So offload code depends on destroy being called on a qdisc
to in turn destroy the filters and unbind any offloads.

I was hoping I could get away with tearing down live qdiscs without
too much work. Looks like not.

Note the fixes tag was bogus nothing is actually broken in current
code until a lockless qdisc with classes shows up.

.John

^ permalink raw reply

* Re: [net] Revert "net: core: maybe return -EEXIST in __dev_alloc_name"
From: Rasmus Villemoes @ 2017-12-20 23:37 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Johannes Berg, netdev@vger.kernel.org, Jouni Malinen,
	Johannes Berg, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <87vah29a1m.fsf@concordia.ellerman.id.au>

On Tue, Dec 19 2017, Michael Ellerman <michael@concordia.ellerman.id.au> wrote:

> Hi Johannes,
>
>> From: Johannes Berg <johannes.berg@intel.com>
>> 
>> This reverts commit d6f295e9def0; some userspace (in the case
>
> This revert seems to have broken networking on one of my powerpc
> machines, according to git bisect.
>
> The symptom is DHCP fails and I don't get a link, I didn't dig any
> further than that. I can if it's helpful.
>
> I think the problem is that 87c320e51519 ("net: core: dev_get_valid_name
> is now the same as dev_alloc_name_ns") only makes sense while
> d6f295e9def0 remains in the tree.

I'm sorry about all of this, I really didn't think there would be such
consequences of changing an errno return. Indeed, d6f29 was preparation
for unifying the two functions that do the exact same thing (and how we
ever got into that situation is somewhat unclear), except for
their behaviour in the case the requested name already exists. So one of
the two interfaces had to change its return value, and as I wrote, I
thought EEXIST was the saner choice when an explicit name (no %d) had
been requested.

> ie. before the entire series, dev_get_valid_name() would return EEXIST,
> and that was retained when 87c320e51519 was merged, but now that
> d6f295e9def0 has been reverted dev_get_valid_name() is returning ENFILE.
>
> I can get the network up again if I also revert 87c320e51519 ("net:
> core: dev_get_valid_name is now the same as dev_alloc_name_ns"), or with
> the gross patch below.

I don't think changing -ENFILE to -EEXIST would be right either, since
dev_get_valid_name() used to be able to return both (-EEXIST in the case
where there's no %d, -ENFILE in the case where we end up calling
dev_alloc_name_ns()). If anything, we could do the check for the old
-EEXIST condition first, and then call dev_alloc_name_ns(). But I'm also
fine with reverting.

Again, sorry :(

Rasmus

^ permalink raw reply

* Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"
From: John Fastabend @ 2017-12-20 23:34 UTC (permalink / raw)
  To: Cong Wang
  Cc: Jiri Pirko, David Miller, Jakub Kicinski,
	Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <CAM_iQpVQF5MosmXqhfxtnH9kh96MEHd0-kO8SO3TbCpywOzv+g@mail.gmail.com>

On 12/20/2017 03:23 PM, Cong Wang wrote:
> On Wed, Dec 20, 2017 at 3:05 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 12/20/2017 02:41 PM, Cong Wang wrote:
>>> On Wed, Dec 20, 2017 at 12:09 PM, John Fastabend
>>> <john.fastabend@gmail.com> wrote:
>>>> RCU grace period is needed for lockless qdiscs added in the commit
>>>> c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array").
>>>>
>>>> It is needed now that qdiscs may be lockless otherwise we risk
>>>> free'ing a qdisc that is still in use from datapath. Additionally,
>>>> push list cleanup into RCU callback. Otherwise we risk the datapath
>>>> adding skbs during removal.
>>>
>>> What about qdisc_graft() -> dev_deactivate() -> synchronize_net() ?
>>> It doesn't work with your "lockless" patches?
>>>
>>
>> Well this is only in the 'parent == NULL' case otherwise we call
>> cops->graft(). Most sch_* seem to use qdisc_replace and this uses
>> sch_tree_lock().
>>
>> The only converted qdisc mq and mqprio at this point don't care
>> though and do their own dev_deactivate/activate. So its not fixing
>> anything in the above mentioned commit.
> 
> Sure, removing a class does not impact the whole device,
> but removing the root qdisc does.
> 
> After your "lockless", skb_array_consume_bh() is called in
> pfifo_fast_reset() and ptr_ring_cleanup() is called in
> pfifo_fast_destroy(), assuming skb_array is not buggy, what race
> do we have here with datapath?
> 

None at the moment.

> 
>>
>> I still think it will need to be done eventually. If it resolves
>> the miniq case it seems like a good idea. Although per Jakub's comment
>> perhaps I pulled too much into the RCU handler.
> 
> The case Jakub reported is a RCU callback missing a rcu
> barrier. I don't understand why you keep believing it is RCU
> readers on datapath.> 
> Not even to mention ingress is not affected by your "lockless"
> thing.
> 

I was thinking about the case where we want a lockless qdisc
with classes. Doing the qdisc destroy after a grace period would
solve this. Also we could start to cleanup a lot of the locking
and extra bits around 'running' qdisc and such by doing a clean
xchg on the qdisc layer. It seems that a dev_activate/deactivate
just to install a new qdisc is not needed.

Anyways future work. However if it resolves the miniq issue, as
Jiri indicated, seems like a clean fix. Although Jakub's issue
with the patch would need to be addressed. Seems he gets a WARN_ON
if the offload is not disabled but the device is unitialized.

.John

^ permalink raw reply

* Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"
From: Cong Wang @ 2017-12-20 23:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jiri Pirko, David Miller, Jakub Kicinski,
	Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <d6892431-21bc-254f-cc29-acf28d00af20@gmail.com>

On Wed, Dec 20, 2017 at 3:05 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 12/20/2017 02:41 PM, Cong Wang wrote:
>> On Wed, Dec 20, 2017 at 12:09 PM, John Fastabend
>> <john.fastabend@gmail.com> wrote:
>>> RCU grace period is needed for lockless qdiscs added in the commit
>>> c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array").
>>>
>>> It is needed now that qdiscs may be lockless otherwise we risk
>>> free'ing a qdisc that is still in use from datapath. Additionally,
>>> push list cleanup into RCU callback. Otherwise we risk the datapath
>>> adding skbs during removal.
>>
>> What about qdisc_graft() -> dev_deactivate() -> synchronize_net() ?
>> It doesn't work with your "lockless" patches?
>>
>
> Well this is only in the 'parent == NULL' case otherwise we call
> cops->graft(). Most sch_* seem to use qdisc_replace and this uses
> sch_tree_lock().
>
> The only converted qdisc mq and mqprio at this point don't care
> though and do their own dev_deactivate/activate. So its not fixing
> anything in the above mentioned commit.

Sure, removing a class does not impact the whole device,
but removing the root qdisc does.

After your "lockless", skb_array_consume_bh() is called in
pfifo_fast_reset() and ptr_ring_cleanup() is called in
pfifo_fast_destroy(), assuming skb_array is not buggy, what race
do we have here with datapath?


>
> I still think it will need to be done eventually. If it resolves
> the miniq case it seems like a good idea. Although per Jakub's comment
> perhaps I pulled too much into the RCU handler.

The case Jakub reported is a RCU callback missing a rcu
barrier. I don't understand why you keep believing it is RCU
readers on datapath.

Not even to mention ingress is not affected by your "lockless"
thing.

^ permalink raw reply

* [PATCH net-next] phylink: avoid attaching more than one PHY
From: Russell King @ 2017-12-20 23:23 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli; +Cc: netdev

Attaching more than one PHY to phylink is bad news, as we store a
pointer to the PHY in a single location. Error out if more than one
PHY is attempted to be attached.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phylink.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index db5d5726ced9..82166a26f5c6 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -726,6 +726,9 @@ int phylink_connect_phy(struct phylink *pl, struct phy_device *phy)
 		     phy_interface_mode_is_8023z(pl->link_interface))))
 		return -EINVAL;
 
+	if (pl->phydev)
+		return -EBUSY;
+
 	/* Use PHY device/driver interface */
 	if (pl->link_interface == PHY_INTERFACE_MODE_NA) {
 		pl->link_interface = phy->interface;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net] phylink: ensure AN is enabled
From: Russell King @ 2017-12-20 23:21 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli; +Cc: netdev

Ensure that we mark AN as enabled at boot time, rather than leaving
it disabled.  This is noticable if your SFP module is fiber, and you
it supports faster speeds than 1G with 2.5G support in place.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phylink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index e30339fca5cf..db5d5726ced9 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -567,6 +567,7 @@ struct phylink *phylink_create(struct net_device *ndev,
 	pl->link_config.pause = MLO_PAUSE_AN;
 	pl->link_config.speed = SPEED_UNKNOWN;
 	pl->link_config.duplex = DUPLEX_UNKNOWN;
+	pl->link_config.an_enabled = true;
 	pl->ops = ops;
 	__set_bit(PHYLINK_DISABLE_STOPPED, &pl->phylink_disable_state);
 
-- 
2.7.4

^ permalink raw reply related

* [PATCH net] phylink: ensure the PHY interface mode is appropriately set
From: Russell King @ 2017-12-20 23:21 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli; +Cc: netdev

When setting the ethtool settings, ensure that the validated PHY
interface mode is propagated to the current link settings, so that
2500BaseX can be selected.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
 drivers/net/phy/phylink.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index f7a777475762..e30339fca5cf 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -1136,6 +1136,7 @@ int phylink_ethtool_ksettings_set(struct phylink *pl,
 	mutex_lock(&pl->state_mutex);
 	/* Configure the MAC to match the new settings */
 	linkmode_copy(pl->link_config.advertising, our_kset.link_modes.advertising);
+	pl->link_config.interface = config.interface;
 	pl->link_config.speed = our_kset.base.speed;
 	pl->link_config.duplex = our_kset.base.duplex;
 	pl->link_config.an_enabled = our_kset.base.autoneg != AUTONEG_DISABLE;
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH v2] net: ibm: emac: support RGMII-[RX|TX]ID phymode
From: Benjamin Herrenschmidt @ 2017-12-20 22:04 UTC (permalink / raw)
  To: Christian Lamparter, David Miller; +Cc: netdev, andrew, christophe.jaillet
In-Reply-To: <5154461.jTBeEfsTQW@debian64>

On Wed, 2017-12-20 at 22:07 +0100, Christian Lamparter wrote:
> 
> > will read this and say "Oh the function tests against these weird
> > PHY_MODE_* aliases, but the helper function phy_interface_mode_is_rgmii()
> > tests against PHY_INTERFACE_MODE_*, what is going on?"
> > 
> > I hate to do this to you, but please get rid of these confusing and
> > completely unnecessary PHY_MODE_* CPP aliases first, and just use the
> > proper PHY_INTERFACE_MODE_* values consistently.
> 
> Yeah, I can do that. no problem. 
> 
> Question is, should I also replace the rgmii_mode_name() with phy_modes() too?
> 
> The only user of rgmii_mode_name() is this notice printk in rgmii_attach():
> <http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/ibm/emac/rgmii.c#L117>

Yup, this is all pre-hisorical gunk, feel free to replace it.

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"
From: John Fastabend @ 2017-12-20 23:05 UTC (permalink / raw)
  To: Cong Wang
  Cc: Jiri Pirko, David Miller, Jakub Kicinski,
	Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <CAM_iQpVuLzmiEkQGJwZcrZQsAziTAbzyaszbVjLe_YeQxvYpBg@mail.gmail.com>

On 12/20/2017 02:41 PM, Cong Wang wrote:
> On Wed, Dec 20, 2017 at 12:09 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> RCU grace period is needed for lockless qdiscs added in the commit
>> c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array").
>>
>> It is needed now that qdiscs may be lockless otherwise we risk
>> free'ing a qdisc that is still in use from datapath. Additionally,
>> push list cleanup into RCU callback. Otherwise we risk the datapath
>> adding skbs during removal.
> 
> What about qdisc_graft() -> dev_deactivate() -> synchronize_net() ?
> It doesn't work with your "lockless" patches?
> 

Well this is only in the 'parent == NULL' case otherwise we call
cops->graft(). Most sch_* seem to use qdisc_replace and this uses
sch_tree_lock().

The only converted qdisc mq and mqprio at this point don't care
though and do their own dev_deactivate/activate. So its not fixing
anything in the above mentioned commit. 

I still think it will need to be done eventually. If it resolves
the miniq case it seems like a good idea. Although per Jakub's comment
perhaps I pulled too much into the RCU handler.

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2017-12-20 22:59 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Jakub Kicinski

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/ethernet/netronome/nfp/bpf/main.c

between commit:

  d3f89b98e391 ("nfp: bpf: keep track of the offloaded program")

from the net tree and commit:

  bd0b2e7fe611 ("net: xdp: make the stack take care of the tear down")

from the net-next tree.

I fixed it up (the latter seems to be a fix for the same problem as the
former, so I just reverted the former by hand) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply

* [PATCH v5 iproute2 net-next] erspan: add erspan version II support
From: William Tu @ 2017-12-20 22:49 UTC (permalink / raw)
  To: netdev

The patch adds support for configuring the erspan v2, for both
ipv4 and ipv6 erspan implementation.  Three additional fields
are added: 'erspan_ver' for distinguishing v1 or v2, 'erspan_dir'
for specifying direction of the mirrored traffic, and 'erspan_hwid'
for users to set ERSPAN engine ID within a system.

As for manpage, the ERSPAN descriptions used to be under GRE, IPIP,
SIT Type paragraph.  Since IP6GRE/IP6GRETAP also supports ERSPAN,
the patch removes the old one, creates a separate ERSPAN paragrah,
and adds an example.

Signed-off-by: William Tu <u9012063@gmail.com>
---
change in v5:
  - update the "Usage: " in c file, so
    # ip link help erspan
    shows the new options
change in v4:                                                                   
  - use matches instead of strcmp on ingress/egress                             
change in v3:                                                                   
  - change erspan_dir 0/1 to "in[gress]/e[gress]"                               
  - update manpage                                                              
change in v2:                                                                   
  - fix typo ETH_P_ERSPAN2                                                      
  - fix space and indent 
---
 include/uapi/linux/if_ether.h  |  1 +
 include/uapi/linux/if_tunnel.h |  3 ++
 ip/link_gre.c                  | 70 ++++++++++++++++++++++++++++++--
 ip/link_gre6.c                 | 71 ++++++++++++++++++++++++++++++--
 man/man8/ip-link.8.in          | 92 ++++++++++++++++++++++++++++++++++++------
 5 files changed, 218 insertions(+), 19 deletions(-)

diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index 2eb529a90250..133567bf2e04 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -47,6 +47,7 @@
 #define ETH_P_PUP	0x0200		/* Xerox PUP packet		*/
 #define ETH_P_PUPAT	0x0201		/* Xerox PUP Addr Trans packet	*/
 #define ETH_P_TSN	0x22F0		/* TSN (IEEE 1722) packet	*/
+#define ETH_P_ERSPAN2	0x22EB		/* ERSPAN version 2 (type III)	*/
 #define ETH_P_IP	0x0800		/* Internet Protocol packet	*/
 #define ETH_P_X25	0x0805		/* CCITT X.25			*/
 #define ETH_P_ARP	0x0806		/* Address Resolution packet	*/
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 38cdf90692f8..ecdc76669cfd 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -137,6 +137,9 @@ enum {
 	IFLA_GRE_IGNORE_DF,
 	IFLA_GRE_FWMARK,
 	IFLA_GRE_ERSPAN_INDEX,
+	IFLA_GRE_ERSPAN_VER,
+	IFLA_GRE_ERSPAN_DIR,
+	IFLA_GRE_ERSPAN_HWID,
 	__IFLA_GRE_MAX,
 };
 
diff --git a/ip/link_gre.c b/ip/link_gre.c
index 43cb1af6196a..65ad8bad4b82 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -44,7 +44,11 @@ static void print_usage(FILE *f)
 		"                            [ [no]encap-csum6 ]\n"
 		"                            [ [no]encap-remcsum ]\n"
 		"                            [ fwmark MARK ]\n"
+		"                            [ erspan_ver version ]\n"
 		"                            [ erspan IDX ]\n"
+		"                            [ erspan_dir { ingress | egress } ]\n"
+		"                            [ erspan_hwid hwid ]\n"
+		"                            [ external ]\n"
 		"\n"
 		"Where: ADDR := { IP_ADDRESS | any }\n"
 		"       TOS  := { NUMBER | inherit }\n"
@@ -98,6 +102,9 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 	__u8 ignore_df = 0;
 	__u32 fwmark = 0;
 	__u32 erspan_idx = 0;
+	__u8 erspan_ver = 0;
+	__u8 erspan_dir = 0;
+	__u16 erspan_hwid = 0;
 
 	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
 		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
@@ -179,6 +186,15 @@ get_failed:
 		if (greinfo[IFLA_GRE_ERSPAN_INDEX])
 			erspan_idx = rta_getattr_u32(greinfo[IFLA_GRE_ERSPAN_INDEX]);
 
+		if (greinfo[IFLA_GRE_ERSPAN_VER])
+			erspan_ver = rta_getattr_u8(greinfo[IFLA_GRE_ERSPAN_VER]);
+
+		if (greinfo[IFLA_GRE_ERSPAN_DIR])
+			erspan_dir = rta_getattr_u8(greinfo[IFLA_GRE_ERSPAN_DIR]);
+
+		if (greinfo[IFLA_GRE_ERSPAN_HWID])
+			erspan_hwid = rta_getattr_u16(greinfo[IFLA_GRE_ERSPAN_HWID]);
+
 		free(answer);
 	}
 
@@ -343,6 +359,24 @@ get_failed:
 				invarg("invalid erspan index\n", *argv);
 			if (erspan_idx & ~((1<<20) - 1) || erspan_idx == 0)
 				invarg("erspan index must be > 0 and <= 20-bit\n", *argv);
+		} else if (strcmp(*argv, "erspan_ver") == 0) {
+			NEXT_ARG();
+			if (get_u8(&erspan_ver, *argv, 0))
+				invarg("invalid erspan version\n", *argv);
+			if (erspan_ver != 1 && erspan_ver != 2)
+				invarg("erspan version must be 1 or 2\n", *argv);
+		} else if (strcmp(*argv, "erspan_dir") == 0) {
+			NEXT_ARG();
+			if (matches(*argv, "ingress") == 0)
+				erspan_dir = 0;
+			else if (matches(*argv, "egress") == 0)
+				erspan_dir = 1;
+			else
+				invarg("Invalid erspan direction.", *argv);
+		} else if (strcmp(*argv, "erspan_hwid") == 0) {
+			NEXT_ARG();
+			if (get_u16(&erspan_hwid, *argv, 0))
+				invarg("invalid erspan hwid\n", *argv);
 		} else
 			usage();
 		argc--; argv++;
@@ -374,8 +408,15 @@ get_failed:
 		addattr_l(n, 1024, IFLA_GRE_TTL, &ttl, 1);
 		addattr_l(n, 1024, IFLA_GRE_TOS, &tos, 1);
 		addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark);
-		if (erspan_idx != 0)
-			addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
+		if (erspan_ver) {
+			addattr8(n, 1024, IFLA_GRE_ERSPAN_VER, erspan_ver);
+			if (erspan_ver == 1 && erspan_idx != 0) {
+				addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
+			} else if (erspan_ver == 2) {
+				addattr8(n, 1024, IFLA_GRE_ERSPAN_DIR, erspan_dir);
+				addattr16(n, 1024, IFLA_GRE_ERSPAN_HWID, erspan_hwid);
+			}
+		}
 	} else {
 		addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0);
 	}
@@ -514,7 +555,30 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 	if (tb[IFLA_GRE_ERSPAN_INDEX]) {
 		__u32 erspan_idx = rta_getattr_u32(tb[IFLA_GRE_ERSPAN_INDEX]);
 
-		fprintf(f, "erspan_index %u ", erspan_idx);
+		print_uint(PRINT_ANY, "erspan_index", "erspan_index %u ", erspan_idx);
+	}
+
+	if (tb[IFLA_GRE_ERSPAN_VER]) {
+		__u8 erspan_ver = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_VER]);
+
+		print_uint(PRINT_ANY, "erspan_ver", "erspan_ver %u ", erspan_ver);
+	}
+
+	if (tb[IFLA_GRE_ERSPAN_DIR]) {
+		__u8 erspan_dir = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_DIR]);
+
+		if (erspan_dir == 0)
+			print_string(PRINT_ANY, "erspan_dir",
+				     "erspan_dir ingress ", NULL);
+		else
+			print_string(PRINT_ANY, "erspan_dir",
+				     "erspan_dir egress ", NULL);
+	}
+
+	if (tb[IFLA_GRE_ERSPAN_HWID]) {
+		__u16 erspan_hwid = rta_getattr_u16(tb[IFLA_GRE_ERSPAN_HWID]);
+
+		print_hex(PRINT_ANY, "erspan_hwid", "erspan_hwid 0x%x ", erspan_hwid);
 	}
 
 	if (tb[IFLA_GRE_ENCAP_TYPE] &&
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 2cb46ca116d0..cb621806261c 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -52,7 +52,11 @@ static void print_usage(FILE *f)
 		"                                  [ [no]encap-csum ]\n"
 		"                                  [ [no]encap-csum6 ]\n"
 		"                                  [ [no]encap-remcsum ]\n"
+		"                                  [ erspan_ver version ]\n"
 		"                                  [ erspan IDX ]\n"
+		"                                  [ erspan_dir { ingress | egress } ]\n"
+		"                                  [ erspan_hwid hwid ]\n"
+		"                                  [ external ]\n"
 		"\n"
 		"Where: ADDR      := IPV6_ADDRESS\n"
 		"       TTL       := { 0..255 } (default=%d)\n"
@@ -109,6 +113,9 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 	int len;
 	__u32 fwmark = 0;
 	__u32 erspan_idx = 0;
+	__u8 erspan_ver = 0;
+	__u8 erspan_dir = 0;
+	__u16 erspan_hwid = 0;
 
 	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
 		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
@@ -191,6 +198,15 @@ get_failed:
 		if (greinfo[IFLA_GRE_ERSPAN_INDEX])
 			erspan_idx = rta_getattr_u32(greinfo[IFLA_GRE_ERSPAN_INDEX]);
 
+		if (greinfo[IFLA_GRE_ERSPAN_VER])
+			erspan_ver = rta_getattr_u8(greinfo[IFLA_GRE_ERSPAN_VER]);
+
+		if (greinfo[IFLA_GRE_ERSPAN_DIR])
+			erspan_dir = rta_getattr_u8(greinfo[IFLA_GRE_ERSPAN_DIR]);
+
+		if (greinfo[IFLA_GRE_ERSPAN_HWID])
+			erspan_hwid = rta_getattr_u16(greinfo[IFLA_GRE_ERSPAN_HWID]);
+
 		free(answer);
 	}
 
@@ -389,6 +405,24 @@ get_failed:
 				invarg("invalid erspan index\n", *argv);
 			if (erspan_idx & ~((1<<20) - 1) || erspan_idx == 0)
 				invarg("erspan index must be > 0 and <= 20-bit\n", *argv);
+		} else if (strcmp(*argv, "erspan_ver") == 0) {
+			NEXT_ARG();
+			if (get_u8(&erspan_ver, *argv, 0))
+				invarg("invalid erspan version\n", *argv);
+			if (erspan_ver != 1 && erspan_ver != 2)
+				invarg("erspan version must be 1 or 2\n", *argv);
+		} else if (strcmp(*argv, "erspan_dir") == 0) {
+			NEXT_ARG();
+			if (matches(*argv, "ingress") == 0)
+				erspan_dir = 0;
+			else if (matches(*argv, "egress") == 0)
+				erspan_dir = 1;
+			else
+				invarg("Invalid erspan direction.", *argv);
+		} else if (strcmp(*argv, "erspan_hwid") == 0) {
+			NEXT_ARG();
+			if (get_u16(&erspan_hwid, *argv, 0))
+				invarg("invalid erspan hwid\n", *argv);
 		} else
 			usage();
 		argc--; argv++;
@@ -408,9 +442,15 @@ get_failed:
 		addattr_l(n, 1024, IFLA_GRE_FLOWINFO, &flowinfo, 4);
 		addattr32(n, 1024, IFLA_GRE_FLAGS, flags);
 		addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark);
-		if (erspan_idx != 0)
-			addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
-
+		if (erspan_ver) {
+			addattr8(n, 1024, IFLA_GRE_ERSPAN_VER, erspan_ver);
+			if (erspan_ver == 1 && erspan_idx != 0) {
+				addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
+			} else {
+				addattr8(n, 1024, IFLA_GRE_ERSPAN_DIR, erspan_dir);
+				addattr16(n, 1024, IFLA_GRE_ERSPAN_HWID, erspan_hwid);
+			}
+		}
 		addattr16(n, 1024, IFLA_GRE_ENCAP_TYPE, encaptype);
 		addattr16(n, 1024, IFLA_GRE_ENCAP_FLAGS, encapflags);
 		addattr16(n, 1024, IFLA_GRE_ENCAP_SPORT, htons(encapsport));
@@ -587,7 +627,30 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 
 	if (tb[IFLA_GRE_ERSPAN_INDEX]) {
 		__u32 erspan_idx = rta_getattr_u32(tb[IFLA_GRE_ERSPAN_INDEX]);
-		fprintf(f, "erspan_index %u ", erspan_idx);
+		print_uint(PRINT_ANY, "erspan_index", "erspan_index %u ", erspan_idx);
+	}
+
+	if (tb[IFLA_GRE_ERSPAN_VER]) {
+		__u8 erspan_ver = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_VER]);
+
+		print_uint(PRINT_ANY, "erspan_ver", "erspan_ver %u ", erspan_ver);
+	}
+
+	if (tb[IFLA_GRE_ERSPAN_DIR]) {
+		__u8 erspan_dir = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_DIR]);
+
+		if (erspan_dir == 0)
+			print_string(PRINT_ANY, "erspan_dir",
+				     "erspan_dir ingress ", NULL);
+		else
+			print_string(PRINT_ANY, "erspan_dir",
+				     "erspan_dir egress ", NULL);
+	}
+
+	if (tb[IFLA_GRE_ERSPAN_HWID]) {
+		__u16 erspan_hwid = rta_getattr_u16(tb[IFLA_GRE_ERSPAN_HWID]);
+
+		print_hex(PRINT_ANY, "erspan_hwid", "erspan_hwid 0x%x ", erspan_hwid);
 	}
 
 	if (tb[IFLA_GRE_ENCAP_TYPE] &&
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 9e9a5f0d2cef..0086b3dfa09d 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -665,13 +665,13 @@ keyword.
 .in -8
 
 .TP
-GRE, IPIP, SIT, ERSPAN Type Support
+GRE, IPIP, SIT Type Support
 For a link of types
-.I GRE/IPIP/SIT/ERSPAN
+.I GRE/IPIP/SIT
 the following additional arguments are supported:
 
 .BI "ip link add " DEVICE
-.BR type " { " gre " | " ipip " | " sit " | " erspan " }"
+.BR type " { " gre " | " ipip " | " sit " }"
 .BI " remote " ADDR " local " ADDR
 [
 .BR encap " { " fou " | " gue " | " none " }"
@@ -685,8 +685,6 @@ the following additional arguments are supported:
 .I " [no]encap-remcsum "
 ] [
 .I " mode " { ip6ip | ipip | mplsip | any } "
-] [
-.BR erspan " \fIIDX "
 ]
 
 .in +8
@@ -731,13 +729,6 @@ MPLS-Over-IPv4, "any" indicates IPv6, IPv4 or MPLS Over IPv4. Supported for
 SIT where the default is "ip6ip" and IPIP where the default is "ipip".
 IPv6-Over-IPv4 is not supported for IPIP.
 
-.sp
-.BR erspan " \fIIDX "
-- specifies the ERSPAN index field.
-.IR IDX
-indicates a 20 bit index/port number associated with the ERSPAN
-traffic's source port and direction.
-
 .in -8
 
 .TP
@@ -883,6 +874,76 @@ the following additional arguments are supported:
 - specifies the mode (datagram or connected) to use.
 
 .TP
+ERSPAN Type Support
+For a link of type
+.I ERSPAN/IP6ERSPAN
+the following additional arguments are supported:
+
+.BI "ip link add " DEVICE
+.BR type " { " erspan " | " ip6erspan " }"
+.BI remote " ADDR " local " ADDR " seq
+.RB key
+.I KEY
+.BR erspan_ver " \fIversion "
+[
+.BR erspan " \fIIDX "
+] [
+.BR erspan_dir " { " \fIingress " | " \fIegress " }"
+] [
+.BR erspan_hwid " \fIhwid "
+] [
+.RB external
+]
+
+.in +8
+.sp
+.BI  remote " ADDR "
+- specifies the remote address of the tunnel.
+
+.sp
+.BI  local " ADDR "
+- specifies the fixed local address for tunneled packets.
+It must be an address on another interface on this host.
+
+.sp
+.BR erspan_ver " \fIversion "
+- specifies the ERSPAN version number.
+.IR version
+indicates the ERSPAN version to be created: 1 for version 1 (type II)
+or 2 for version 2 (type III).
+
+.sp
+.BR erspan " \fIIDX "
+- specifies the ERSPAN v1 index field.
+.IR IDX
+indicates a 20 bit index/port number associated with the ERSPAN
+traffic's source port and direction.
+
+.sp
+.BR erspan_dir " { " \fIingress " | " \fIegress " }"
+- specifies the ERSPAN v2 mirrored traffic's direction.
+
+.sp
+.BR erspan_hwid " \fIhwid "
+- an unique identifier of an ERSPAN v2 engine within a system.
+.IR hwid
+is a 6-bit value for users to configure.
+
+.sp
+.BR external
+- make this tunnel externally controlled (or not, which is the default).
+In the kernel, this is referred to as collect metadata mode.  This flag is
+mutually exclusive with the
+.BR remote ,
+.BR local ,
+.BR erspan_ver ,
+.BR erspan ,
+.BR erspan_dir " and " erspan_hwid
+options.
+
+.in -8
+
+.TP
 GENEVE Type Support
 For a link of type
 .I GENEVE
@@ -2062,6 +2123,13 @@ ip link add link wpan0 lowpan0 type lowpan
 Creates a 6LoWPAN interface named lowpan0 on the underlying
 IEEE 802.15.4 device wpan0.
 .RE
+.PP
+ip link add dev ip6erspan11 type ip6erspan seq key 102
+local fc00:100::2 remote fc00:100::1
+erspan_ver 2 erspan_dir ingress erspan_hwid 17
+.RS 4
+Creates a IP6ERSPAN version 2 interface named ip6erspan00.
+.RE
 
 .SH SEE ALSO
 .br
-- 
2.7.4

^ permalink raw reply related

* Re: Linux 4.14 - regression: broken tun/tap / bridge network with virtio - bisected
From: Willem de Bruijn @ 2017-12-20 22:44 UTC (permalink / raw)
  To: Andreas Hartmann
  Cc: Michal Kubecek, Jason Wang, David Miller, Network Development
In-Reply-To: <f0f959dc-9c6f-c82b-b245-4aedf057e992@01019freenet.de>

On Wed, Dec 20, 2017 at 10:56 AM, Andreas Hartmann
<andihartmann@01019freenet.de> wrote:
> On 12/18/2017 at 06:11 PM Andreas Hartmann wrote:
>> On 12/17/2017 at 11:33 PM Willem de Bruijn wrote:
> [...]
>>> I have been able to reproduce the hang by sending a UFO packet
>>> between two guests running v4.13 on a host running v4.15-rc1.
>>>
>>> The vhost_net_ubuf_ref refcount indeed hits overflow (-1) from
>>> vhost_zerocopy_callback being called for each segment of a
>>> segmented UFO skb. This refcount is decremented then on each
>>> segment, but incremented only once for the entire UFO skb.
>>>
>>> Before v4.14, these packets would be converted in skb_segment to
>>> regular copy packets with skb_orphan_frags and the callback function
>>> called once at this point. v4.14 added support for reference counted
>>> zerocopy skb that can pass through skb_orphan_frags unmodified and
>>> have their zerocopy state safely cloned with skb_zerocopy_clone.
>>>
>>> The call to skb_zerocopy_clone must come after skb_orphan_frags
>>> to limit cloning of this state to those skbs that can do so safely.
>>>
>>> Please try a host with the following patch. This fixes it for me. I intend to
>>> send it to net.
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index a592ca025fc4..d2d985418819 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -3654,8 +3654,6 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>
>>>                 skb_shinfo(nskb)->tx_flags |= skb_shinfo(head_skb)->tx_flags &
>>>                                               SKBTX_SHARED_FRAG;
>>> -               if (skb_zerocopy_clone(nskb, head_skb, GFP_ATOMIC))
>>> -                       goto err;
>>>
>>>                 while (pos < offset + len) {
>>>                         if (i >= nfrags) {
>>> @@ -3681,6 +3679,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>
>>>                         if (unlikely(skb_orphan_frags(frag_skb, GFP_ATOMIC)))
>>>                                 goto err;
>>> +                       if (skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
>>> +                               goto err;
>>>
>>>                         *nskb_frag = *frag;
>>>                         __skb_frag_ref(nskb_frag);
>>>
>>>
>>> This is relatively inefficient, as it calls skb_zerocopy_clone for each frag
>>> in the frags[] array. I will follow-up with a patch to net-next that only
>>> checks once per skb:
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index 466581cf4cdc..a293a33604ec 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -3662,7 +3662,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>
>>>                 skb_shinfo(nskb)->tx_flags |= skb_shinfo(head_skb)->tx_flags &
>>>                                               SKBTX_SHARED_FRAG;
>>> -               if (skb_zerocopy_clone(nskb, head_skb, GFP_ATOMIC))
>>> +               if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>>> +                   skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
>>>                         goto err;
>>>
>>>                 while (pos < offset + len) {
>>> @@ -3676,6 +3677,11 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>
>>>                                 BUG_ON(!nfrags);
>>>
>>> +                               if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>>> +                                   skb_zerocopy_clone(nskb, frag_skb,
>>> +                                                      GFP_ATOMIC))
>>> +                                       goto err;
>>> +
>>>                                 list_skb = list_skb->next;
>>>                         }
>>>
>>> @@ -3687,9 +3693,6 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>>                                 goto err;
>>>                         }
>>>
>>> -                       if (unlikely(skb_orphan_frags(frag_skb, GFP_ATOMIC)))
>>> -                               goto err;
>>> -
>>
>> I'm currently testing this one.
>>
>
> Test is in progress. I'm testing w/ 4.14.7, which already contains "net:
> accept UFO datagrams from tuntap and packet".
>
> At first, I tested an unpatched 4.14.7 - the problem (no more killable
> qemu-process) did occur promptly on shutdown of the machine. This was
> expected.
>
> Next, I applied the above patch (the second one). Until now, I didn't
> face any problem any more on shutdown of VMs. Looks promising.

Thanks for testing.

I sent the first, simpler, one to net together with another fix.

  http://patchwork.ozlabs.org/patch/851715/

^ permalink raw reply

* Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"
From: Cong Wang @ 2017-12-20 22:41 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jiri Pirko, David Miller, Jakub Kicinski,
	Linux Kernel Network Developers, Eric Dumazet
In-Reply-To: <20171220200919.6233.48192.stgit@john-Precision-Tower-5810>

On Wed, Dec 20, 2017 at 12:09 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> RCU grace period is needed for lockless qdiscs added in the commit
> c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array").
>
> It is needed now that qdiscs may be lockless otherwise we risk
> free'ing a qdisc that is still in use from datapath. Additionally,
> push list cleanup into RCU callback. Otherwise we risk the datapath
> adding skbs during removal.

What about qdisc_graft() -> dev_deactivate() -> synchronize_net() ?
It doesn't work with your "lockless" patches?

^ permalink raw reply

* Re: RCU callback crashes
From: Cong Wang @ 2017-12-20 22:38 UTC (permalink / raw)
  To: John Fastabend; +Cc: Jakub Kicinski, Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <92220030-e9f1-1963-00b6-05f37abb82ee@gmail.com>

On Wed, Dec 20, 2017 at 12:23 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> I'm trying to see how removing that rcu grace period was safe in the
> first place. The datapath is using rcu_read critical section to protect
> the qdisc but the control path (a) doesn't use rcu grace period and (b)
> doesn't use the qidisc lock. Going to go get a coffee and I'll think
> about it a bit more. Any ideas Cong?

qdisc_graft() -> dev_deactivate() -> synchronize_net() is the reason
you want to find?

Also, you can try `git log` to see why it was introduced in the beginning,
here it is:

commit 5d944c640b4ae5f37c537acf491c2f0eb89fa0d6
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed Mar 31 07:06:04 2010 +0000

    gen_estimator: deadlock fix

^ permalink raw reply

* [PATCH net 2/2] skbuff: skb_copy_ubufs must release uarg even without user frags
From: Willem de Bruijn @ 2017-12-20 22:37 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn
In-Reply-To: <20171220223750.27795-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

skb_copy_ubufs creates a private copy of frags[] to release its hold
on user frags, then calls uarg->callback to notify the owner.

Call uarg->callback even when no frags exist. This edge case can
happen when zerocopy_sg_from_iter finds enough room in skb_headlen
to copy all the data.

Fixes: 3ece782693c4 ("sock: skb_copy_ubufs support for compound pages")
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/core/skbuff.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index edf40ac0cd07..a3cb0be4c6f3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1178,7 +1178,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	u32 d_off;
 
 	if (!num_frags)
-		return 0;
+		goto release;
 
 	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
 		return -EINVAL;
@@ -1238,6 +1238,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
 	skb_shinfo(skb)->nr_frags = new_frags;
 
+release:
 	skb_zcopy_clear(skb, false);
 	return 0;
 }
-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply related

* [PATCH net 1/2] skbuff: orphan frags before zerocopy clone
From: Willem de Bruijn @ 2017-12-20 22:37 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn
In-Reply-To: <20171220223750.27795-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemb@google.com>

Call skb_zerocopy_clone after skb_orphan_frags, to avoid duplicate
calls to skb_uarg(skb)->callback for the same data.

skb_zerocopy_clone associates skb_shinfo(skb)->uarg from frag_skb
with each segment. This is only safe for uargs that do refcounting,
which is those that pass skb_orphan_frags without dropping their
shared frags. For others, skb_orphan_frags drops the user frags and
sets the uarg to NULL, after which sock_zerocopy_clone has no effect.

Qemu hangs were reported due to duplicate vhost_net_zerocopy_callback
calls for the same data causing the vhost_net_ubuf_ref_>refcount to
drop below zero.

Link: http://lkml.kernel.org/r/<CAF=yD-LWyCD4Y0aJ9O0e_CHLR+3JOeKicRRTEVCPxgw4XOcqGQ@mail.gmail.com>
Fixes: 1f8b977ab32d ("sock: enable MSG_ZEROCOPY")
Reported-by: Andreas Hartmann <andihartmann@01019freenet.de>
Reported-by: David Hill <dhill@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>

---

This fix causes skb_zerocopy_clone to be called for each frag in the
array. I will follow-up with a patch to net-next that will call both
skb_orphan_frags and skb_zerocopy_clone once per skb only.
---
 net/core/skbuff.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a592ca025fc4..edf40ac0cd07 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3654,8 +3654,6 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		skb_shinfo(nskb)->tx_flags |= skb_shinfo(head_skb)->tx_flags &
 					      SKBTX_SHARED_FRAG;
-		if (skb_zerocopy_clone(nskb, head_skb, GFP_ATOMIC))
-			goto err;
 
 		while (pos < offset + len) {
 			if (i >= nfrags) {
@@ -3681,6 +3679,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 			if (unlikely(skb_orphan_frags(frag_skb, GFP_ATOMIC)))
 				goto err;
+			if (skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
+				goto err;
 
 			*nskb_frag = *frag;
 			__skb_frag_ref(nskb_frag);
-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply related

* [PATCH net 0/2] zerocopy fixes
From: Willem de Bruijn @ 2017-12-20 22:37 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

The removal of UFO hardware offload support exposed a bug in
segmentation of zerocopy skbs.

The reference counting mechanism for msg_zerocopy was incorrectly
applied to vhost_net zerocopy packets.

The other issue observed through analysis. We do not cook skbs
with skb_zcopy(skb) but no frags, but this is possible in principle.
Correctly call skb_zcopy_clear on those.

Willem de Bruijn (2):
  skbuff: orphan frags before zerocopy clone
  skbuff: skb_copy_ubufs must release uarg even without user frags

 net/core/skbuff.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox