Netdev List
 help / color / mirror / Atom feed
* Re: RCU callback crashes
From: Cong Wang @ 2017-12-21  0:03 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <CAM_iQpVPUifm3rcXu8SP9ShSmm7z9z+8UjppdY_AxMYQwHE9YQ@mail.gmail.com>

On Wed, Dec 20, 2017 at 10:31 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>>
>> I guess it is q->miniqp which is freed in qdisc_graft() without properly
>> waiting for rcu readers?
>
> It is probably so, the call_rcu_bh(&miniq_old->rcu, mini_qdisc_rcu_func)
> in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
> but miniq is being freed, no rcu barrier waits for it...
>
> You can try to add a rcu_barrier_bh() at the end to see if this crash
> is gone, but I don't think people like adding yet another rcu barrier...

Hi, Jakub

Can you test the following fix? I am not a fan of rcu barrier but we
already have one so...

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 876fab2604b8..1b68fedea124 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1240,6 +1240,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,

        if (!tp_head) {
                RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
+               /* Wait for existing flying RCU callback before being freed. */
+               rcu_barrier_bh();
                return;
        }

^ permalink raw reply related

* Re: RCU callback crashes
From: Jakub Kicinski @ 2017-12-21  0:08 UTC (permalink / raw)
  To: Cong Wang; +Cc: Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <CAM_iQpUngX+oSDiforfZceqMZrg=jDJnNf3QFF9WFQdHrU9o-g@mail.gmail.com>

On Wed, 20 Dec 2017 16:03:49 -0800, Cong Wang wrote:
> On Wed, Dec 20, 2017 at 10:31 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:  
> >>
> >> I guess it is q->miniqp which is freed in qdisc_graft() without properly
> >> waiting for rcu readers?  
> >
> > It is probably so, the call_rcu_bh(&miniq_old->rcu, mini_qdisc_rcu_func)
> > in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
> > but miniq is being freed, no rcu barrier waits for it...
> >
> > You can try to add a rcu_barrier_bh() at the end to see if this crash
> > is gone, but I don't think people like adding yet another rcu barrier...  
> 
> Hi, Jakub
> 
> Can you test the following fix? I am not a fan of rcu barrier but we
> already have one so...
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 876fab2604b8..1b68fedea124 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1240,6 +1240,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
> 
>         if (!tp_head) {
>                 RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
> +               /* Wait for existing flying RCU callback before being freed. */
> +               rcu_barrier_bh();
>                 return;
>         }

Mm.. I was running with this hack for the last two hours and it was OK:

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 876fab2604b8..d7e0c3ad0a1c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1260,6 +1260,7 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
                 * are not seeing it.
                 */
                call_rcu_bh(&miniq_old->rcu, mini_qdisc_rcu_func);
+       rcu_barrier_bh();
 }
 EXPORT_SYMBOL(mini_qdisc_pair_swap);

Let me try to move the barrier...

^ permalink raw reply related

* Re: [PATCH net-next v4 0/6] net: tcp: sctp: dccp: Replace jprobe usage with trace events
From: Masami Hiramatsu @ 2017-12-21  0:10 UTC (permalink / raw)
  To: David Miller
  Cc: mingo, ian.mcdonald, vyasevich, stephen, rostedt, peterz, tglx,
	linux-kernel, hpa, gerrit, nhorman, dccp, netdev, linux-sctp, sfr
In-Reply-To: <20171220.142424.1214953044518214833.davem@davemloft.net>

On Wed, 20 Dec 2017 14:24:24 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: David Miller <davem@davemloft.net>
> Date: Wed, 20 Dec 2017 14:20:40 -0500 (EST)
> 
> > From: Masami Hiramatsu <mhiramat@kernel.org>
> > Date: Wed, 20 Dec 2017 13:14:11 +0900
> > 
> >> This series is v4 of the replacement of jprobe usage with trace
> >> events. This version is rebased on net-next, fixes a build warning
> >> and moves a temporal variable definition in a block.
> >> 
> >> Previous version is here;
> >> https://lkml.org/lkml/2017/12/19/153
> >> 
> >> Changes from v3:
> >>   All: Rebased on net-next
> >>   [3/6]: fixes a build warning for i386 by casting pointer unsigned
> >>         long instead of __u64, and moves a temporal variable
> >>          definition in a block.
> > 
> > Looks good, series applied to net-next, thanks.
> 
> Actually, this doesn't even compile, so I've reverted:
> 
> [davem@dhcp-10-15-49-227 net-next]$ make -s -j16
> In file included from net/dccp/trace.h:105:0,
>                  from net/dccp/proto.c:42:
> ./include/trace/define_trace.h:89:42: fatal error: ./trace.h: No such file or directory
>  #include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
>                                           ^
> compilation terminated.

Hmm, strange.
I could compile it on x86-64 and i386. Let me check what was wrong.

Thanks,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: [patch iproute2] tc: add -bs option for batch mode
From: David Ahern @ 2017-12-21  0:13 UTC (permalink / raw)
  To: Stephen Hemminger, Chris Mi; +Cc: netdev@vger.kernel.org, gerlitz.or@gmail.com
In-Reply-To: <20171220071744.25f9dd41@xeon-e3>

On 12/20/17 8:17 AM, Stephen Hemminger wrote:
> On Wed, 20 Dec 2017 09:23:34 +0000
> Chris Mi <chrism@mellanox.com> wrote:
> 
>>> Your real performance win is just not asking for ACK for every rule.  
>> No. Even if batch_size > 1, we ack every rule. The real performance win is
>> to send multiple rules in one system call. If we are not asking for ACK for every rule,
>> the performance will be improved further.
> 
> Try the no ACK method.
> 
> When we were optimizing routing daemons like Quagga, it was discovered
> that an ACK for every route insert was the main bottleneck. Doing asynchronous
> error handling got a bigger win than your batching.
> 
> Please try that, doing multiple messages using iov is not necessary.
> 

FWIW, I plan to look at batching routes in a similar fashion.

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2017-12-21  0:14 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: stephen, netdev, virtualization, alexander.duyck
In-Reply-To: <1513644036-45230-1-git-send-email-sridhar.samudrala@intel.com>

On Mon, Dec 18, 2017 at 04:40:36PM -0800, Sridhar Samudrala wrote:
> This patch enables virtio to switch over to a VF datapath when a VF netdev
> is present with the same MAC address.

I prefer saying "a passthrough device" here. Does not have to be a VF at
all.

>  It allows live migration of a VM
> with a direct attached VF without the need to setup a bond/team between a
> VF and virtio net device in the guest.
> 
> The hypervisor needs to unplug the VF device from the guest on the source
> host and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed, the
> destination hypervisor sets the MAC filter on the VF and plugs it back to
> the guest to switch over to VF datapath.
> 
> It is entirely based on netvsc implementation and it should be possible to
> make this code generic and move it to a common location that can be shared
> by netvsc and virtio.
> 
> Also, i think we should make this a negotiated feature that is off by
> default via a new feature bit.

So please include this. A copy needs to go to virtio TC
to reserve the bit. Enabling this by default risks breaking
too many configurations.

> 
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
>  drivers/net/virtio_net.c | 341 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 339 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 559b215c0169..a34c717bb15b 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -31,6 +31,8 @@
>  #include <linux/average.h>
>  #include <linux/filter.h>
>  #include <net/route.h>
> +#include <linux/netdevice.h>
> +#include <linux/netpoll.h>
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -56,6 +58,8 @@ module_param(napi_tx, bool, 0644);
>   */
>  DECLARE_EWMA(pkt_len, 0, 64)
>  
> +#define VF_TAKEOVER_INT	(HZ / 10)
> +
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
>  static const unsigned long guest_offloads[] = {
> @@ -117,6 +121,15 @@ struct receive_queue {
>  	char name[40];
>  };
>  
> +struct virtnet_vf_pcpu_stats {
> +	u64	rx_packets;
> +	u64	rx_bytes;
> +	u64	tx_packets;
> +	u64	tx_bytes;
> +	struct u64_stats_sync   syncp;
> +	u32	tx_dropped;
> +};
> +
>  struct virtnet_info {
>  	struct virtio_device *vdev;
>  	struct virtqueue *cvq;
> @@ -179,6 +192,11 @@ struct virtnet_info {
>  	u32 speed;
>  
>  	unsigned long guest_offloads;
> +
> +	/* State to manage the associated VF interface. */
> +	struct net_device __rcu *vf_netdev;
> +	struct virtnet_vf_pcpu_stats __percpu *vf_stats;
> +	struct delayed_work vf_takeover;
>  };
>  
>  struct padded_vnet_hdr {
> @@ -1300,16 +1318,51 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>  	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
>  }
>  
> +/* Send skb on the slave VF device. */
> +static int virtnet_vf_xmit(struct net_device *dev, struct net_device *vf_netdev,
> +			   struct sk_buff *skb)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	unsigned int len = skb->len;
> +	int rc;
> +
> +	skb->dev = vf_netdev;
> +	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> +	rc = dev_queue_xmit(skb);
> +	if (likely(rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN)) {
> +		struct virtnet_vf_pcpu_stats *pcpu_stats
> +			= this_cpu_ptr(vi->vf_stats);
> +
> +		u64_stats_update_begin(&pcpu_stats->syncp);
> +		pcpu_stats->tx_packets++;
> +		pcpu_stats->tx_bytes += len;
> +		u64_stats_update_end(&pcpu_stats->syncp);
> +	} else {
> +		this_cpu_inc(vi->vf_stats->tx_dropped);
> +	}
> +
> +	return rc;
> +}
> +
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int qnum = skb_get_queue_mapping(skb);
>  	struct send_queue *sq = &vi->sq[qnum];
> +	struct net_device *vf_netdev;
>  	int err;
>  	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>  	bool kick = !skb->xmit_more;
>  	bool use_napi = sq->napi.weight;
>  
> +	/* if VF is present and up then redirect packets
> +	 * called with rcu_read_lock_bh
> +	 */
> +	vf_netdev = rcu_dereference_bh(vi->vf_netdev);
> +	if (vf_netdev && netif_running(vf_netdev) && !netpoll_tx_running(dev))
> +		return virtnet_vf_xmit(dev, vf_netdev, skb);
> +
>  	/* Free up any pending old buffers before queueing new ones. */
>  	free_old_xmit_skbs(sq);
>  
> @@ -1456,10 +1509,41 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
>  	return ret;
>  }
>  
> +static void virtnet_get_vf_stats(struct net_device *dev,
> +				 struct virtnet_vf_pcpu_stats *tot)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
> +
> +	memset(tot, 0, sizeof(*tot));
> +
> +	for_each_possible_cpu(i) {
> +		const struct virtnet_vf_pcpu_stats *stats
> +				= per_cpu_ptr(vi->vf_stats, i);
> +		u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
> +		unsigned int start;
> +
> +		do {
> +			start = u64_stats_fetch_begin_irq(&stats->syncp);
> +			rx_packets = stats->rx_packets;
> +			tx_packets = stats->tx_packets;
> +			rx_bytes = stats->rx_bytes;
> +			tx_bytes = stats->tx_bytes;
> +		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
> +
> +		tot->rx_packets += rx_packets;
> +		tot->tx_packets += tx_packets;
> +		tot->rx_bytes   += rx_bytes;
> +		tot->tx_bytes   += tx_bytes;
> +		tot->tx_dropped += stats->tx_dropped;
> +	}
> +}
> +
>  static void virtnet_stats(struct net_device *dev,
>  			  struct rtnl_link_stats64 *tot)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	struct virtnet_vf_pcpu_stats vf_stats;
>  	int cpu;
>  	unsigned int start;
>  
> @@ -1490,6 +1574,13 @@ static void virtnet_stats(struct net_device *dev,
>  	tot->rx_dropped = dev->stats.rx_dropped;
>  	tot->rx_length_errors = dev->stats.rx_length_errors;
>  	tot->rx_frame_errors = dev->stats.rx_frame_errors;
> +
> +	virtnet_get_vf_stats(dev, &vf_stats);
> +	tot->rx_packets += vf_stats.rx_packets;
> +	tot->tx_packets += vf_stats.tx_packets;
> +	tot->rx_bytes += vf_stats.rx_bytes;
> +	tot->tx_bytes += vf_stats.tx_bytes;
> +	tot->tx_dropped += vf_stats.tx_dropped;
>  }
>  
>  #ifdef CONFIG_NET_POLL_CONTROLLER
> @@ -2508,6 +2599,47 @@ static int virtnet_validate(struct virtio_device *vdev)
>  	return 0;
>  }
>  
> +static void __virtnet_vf_setup(struct net_device *ndev,
> +			       struct net_device *vf_netdev)
> +{
> +	int ret;
> +
> +	/* Align MTU of VF with master */
> +	ret = dev_set_mtu(vf_netdev, ndev->mtu);
> +	if (ret)
> +		netdev_warn(vf_netdev,
> +			    "unable to change mtu to %u\n", ndev->mtu);
> +
> +	if (netif_running(ndev)) {
> +		ret = dev_open(vf_netdev);
> +		if (ret)
> +			netdev_warn(vf_netdev,
> +				    "unable to open: %d\n", ret);
> +	}
> +}
> +
> +/* Setup VF as slave of the virtio device.
> + * Runs in workqueue to avoid recursion in netlink callbacks.
> + */
> +static void virtnet_vf_setup(struct work_struct *w)
> +{
> +	struct virtnet_info *vi
> +		= container_of(w, struct virtnet_info, vf_takeover.work);
> +	struct net_device *ndev = vi->dev;
> +	struct net_device *vf_netdev;
> +
> +	if (!rtnl_trylock()) {
> +		schedule_delayed_work(&vi->vf_takeover, 0);
> +		return;
> +	}
> +
> +	vf_netdev = rtnl_dereference(vi->vf_netdev);
> +	if (vf_netdev)
> +		__virtnet_vf_setup(ndev, vf_netdev);
> +
> +	rtnl_unlock();
> +}
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>  	int i, err;
> @@ -2600,6 +2732,11 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	}
>  
>  	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
> +	INIT_DELAYED_WORK(&vi->vf_takeover, virtnet_vf_setup);
> +
> +	vi->vf_stats = netdev_alloc_pcpu_stats(struct virtnet_vf_pcpu_stats);
> +	if (!vi->vf_stats)
> +		goto free_stats;
>  
>  	/* If we can receive ANY GSO packets, we must allocate large ones. */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> @@ -2634,7 +2771,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  			 */
>  			dev_err(&vdev->dev, "device MTU appears to have changed "
>  				"it is now %d < %d", mtu, dev->min_mtu);
> -			goto free_stats;
> +			goto free_vf_stats;
>  		}
>  
>  		dev->mtu = mtu;
> @@ -2658,7 +2795,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>  	err = init_vqs(vi);
>  	if (err)
> -		goto free_stats;
> +		goto free_vf_stats;
>  
>  #ifdef CONFIG_SYSFS
>  	if (vi->mergeable_rx_bufs)
> @@ -2712,6 +2849,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	cancel_delayed_work_sync(&vi->refill);
>  	free_receive_page_frags(vi);
>  	virtnet_del_vqs(vi);
> +free_vf_stats:
> +	free_percpu(vi->vf_stats);
>  free_stats:
>  	free_percpu(vi->stats);
>  free:
> @@ -2733,19 +2872,178 @@ static void remove_vq_common(struct virtnet_info *vi)
>  	virtnet_del_vqs(vi);
>  }
>  
> +static struct net_device *get_virtio_bymac(const u8 *mac)
> +{
> +	struct net_device *dev;
> +
> +	ASSERT_RTNL();
> +
> +	for_each_netdev(&init_net, dev) {
> +		if (dev->netdev_ops != &virtnet_netdev)
> +			continue;       /* not a virtio_net device */
> +
> +		if (ether_addr_equal(mac, dev->perm_addr))
> +			return dev;
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct net_device *get_virtio_byref(struct net_device *vf_netdev)
> +{
> +	struct net_device *dev;
> +
> +	ASSERT_RTNL();
> +
> +	for_each_netdev(&init_net, dev) {
> +		struct virtnet_info *vi;
> +
> +		if (dev->netdev_ops != &virtnet_netdev)
> +			continue;	/* not a virtio_net device */
> +
> +		vi = netdev_priv(dev);
> +		if (rtnl_dereference(vi->vf_netdev) == vf_netdev)
> +			return dev;	/* a match */
> +	}
> +
> +	return NULL;
> +}
> +
> +/* Called when VF is injecting data into network stack.
> + * Change the associated network device from VF to virtio.
> + * note: already called with rcu_read_lock
> + */
> +static rx_handler_result_t virtnet_vf_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
> +	struct virtnet_info *vi = netdev_priv(ndev);
> +	struct virtnet_vf_pcpu_stats *pcpu_stats =
> +				this_cpu_ptr(vi->vf_stats);
> +
> +	skb->dev = ndev;
> +
> +	u64_stats_update_begin(&pcpu_stats->syncp);
> +	pcpu_stats->rx_packets++;
> +	pcpu_stats->rx_bytes += skb->len;
> +	u64_stats_update_end(&pcpu_stats->syncp);
> +
> +	return RX_HANDLER_ANOTHER;
> +}
> +
> +static int virtnet_vf_join(struct net_device *vf_netdev,
> +			   struct net_device *ndev)
> +{
> +	struct virtnet_info *vi = netdev_priv(ndev);
> +	int ret;
> +
> +	ret = netdev_rx_handler_register(vf_netdev,
> +					 virtnet_vf_handle_frame, ndev);
> +	if (ret != 0) {
> +		netdev_err(vf_netdev,
> +			   "can not register virtio VF receive handler (err = %d)\n",
> +			   ret);
> +		goto rx_handler_failed;
> +	}
> +
> +	ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
> +	if (ret != 0) {
> +		netdev_err(vf_netdev,
> +			   "can not set master device %s (err = %d)\n",
> +			   ndev->name, ret);
> +		goto upper_link_failed;
> +	}
> +
> +	/* set slave flag before open to prevent IPv6 addrconf */
> +	vf_netdev->flags |= IFF_SLAVE;
> +
> +	schedule_delayed_work(&vi->vf_takeover, VF_TAKEOVER_INT);
> +
> +	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
> +
> +	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
> +	return 0;
> +
> +upper_link_failed:
> +	netdev_rx_handler_unregister(vf_netdev);
> +rx_handler_failed:
> +	return ret;
> +}
> +
> +static int virtnet_register_vf(struct net_device *vf_netdev)
> +{
> +	struct net_device *ndev;
> +	struct virtnet_info *vi;
> +
> +	if (vf_netdev->addr_len != ETH_ALEN)
> +		return NOTIFY_DONE;
> +
> +	/* We will use the MAC address to locate the virtio_net interface to
> +	 * associate with the VF interface. If we don't find a matching
> +	 * virtio interface, move on.
> +	 */
> +	ndev = get_virtio_bymac(vf_netdev->perm_addr);
> +	if (!ndev)
> +		return NOTIFY_DONE;
> +
> +	vi = netdev_priv(ndev);
> +	if (rtnl_dereference(vi->vf_netdev))
> +		return NOTIFY_DONE;
> +
> +	if (virtnet_vf_join(vf_netdev, ndev) != 0)
> +		return NOTIFY_DONE;
> +
> +	netdev_info(ndev, "VF registering %s\n", vf_netdev->name);
> +
> +	dev_hold(vf_netdev);
> +	rcu_assign_pointer(vi->vf_netdev, vf_netdev);
> +
> +	return NOTIFY_OK;
> +}
> +
> +static int virtnet_unregister_vf(struct net_device *vf_netdev)
> +{
> +	struct net_device *ndev;
> +	struct virtnet_info *vi;
> +
> +	ndev = get_virtio_byref(vf_netdev);
> +	if (!ndev)
> +		return NOTIFY_DONE;
> +
> +	vi = netdev_priv(ndev);
> +	cancel_delayed_work_sync(&vi->vf_takeover);
> +
> +	netdev_info(ndev, "VF unregistering %s\n", vf_netdev->name);
> +
> +	netdev_rx_handler_unregister(vf_netdev);
> +	netdev_upper_dev_unlink(vf_netdev, ndev);
> +	RCU_INIT_POINTER(vi->vf_netdev, NULL);
> +	dev_put(vf_netdev);
> +
> +	return NOTIFY_OK;
> +}
> +
>  static void virtnet_remove(struct virtio_device *vdev)
>  {
>  	struct virtnet_info *vi = vdev->priv;
> +	struct net_device *vf_netdev;
>  
>  	virtnet_cpu_notif_remove(vi);
>  
>  	/* Make sure no work handler is accessing the device. */
>  	flush_work(&vi->config_work);
>  
> +	rtnl_lock();
> +	vf_netdev = rtnl_dereference(vi->vf_netdev);
> +	if (vf_netdev)
> +		virtnet_unregister_vf(vf_netdev);
> +	rtnl_unlock();
> +
>  	unregister_netdev(vi->dev);
>  
>  	remove_vq_common(vi);
>  
> +	free_percpu(vi->vf_stats);
>  	free_percpu(vi->stats);
>  	free_netdev(vi->dev);
>  }
> @@ -2823,6 +3121,42 @@ static struct virtio_driver virtio_net_driver = {
>  #endif
>  };
>  
> +static int virtio_netdev_event(struct notifier_block *this,
> +			       unsigned long event, void *ptr)
> +{
> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +	/* Skip our own events */
> +	if (event_dev->netdev_ops == &virtnet_netdev)
> +		return NOTIFY_DONE;
> +
> +	/* Avoid non-Ethernet type devices */
> +	if (event_dev->type != ARPHRD_ETHER)
> +		return NOTIFY_DONE;
> +
> +	/* Avoid Vlan dev with same MAC registering as VF */
> +	if (is_vlan_dev(event_dev))
> +		return NOTIFY_DONE;
> +
> +	/* Avoid Bonding master dev with same MAC registering as VF */
> +	if ((event_dev->priv_flags & IFF_BONDING) &&
> +	    (event_dev->flags & IFF_MASTER))
> +		return NOTIFY_DONE;
> +
> +	switch (event) {
> +	case NETDEV_REGISTER:
> +		return virtnet_register_vf(event_dev);
> +	case NETDEV_UNREGISTER:
> +		return virtnet_unregister_vf(event_dev);
> +	default:
> +		return NOTIFY_DONE;
> +	}
> +}
> +
> +static struct notifier_block virtio_netdev_notifier = {
> +	.notifier_call = virtio_netdev_event,
> +};
> +
>  static __init int virtio_net_driver_init(void)
>  {
>  	int ret;
> @@ -2841,6 +3175,8 @@ static __init int virtio_net_driver_init(void)
>          ret = register_virtio_driver(&virtio_net_driver);
>  	if (ret)
>  		goto err_virtio;
> +
> +	register_netdevice_notifier(&virtio_netdev_notifier);
>  	return 0;
>  err_virtio:
>  	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> @@ -2853,6 +3189,7 @@ module_init(virtio_net_driver_init);
>  
>  static __exit void virtio_net_driver_exit(void)
>  {
> +	unregister_netdevice_notifier(&virtio_netdev_notifier);
>  	unregister_virtio_driver(&virtio_net_driver);
>  	cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
>  	cpuhp_remove_multi_state(virtionet_online);
> -- 
> 2.14.3

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2017-12-21  0:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Sridhar Samudrala, stephen, netdev, virtualization,
	alexander.duyck
In-Reply-To: <20171220143140.0a3dc7f1@cakuba.netronome.com>

On Wed, Dec 20, 2017 at 02:33:34PM -0800, Jakub Kicinski wrote:
> On Mon, 18 Dec 2017 16:40:36 -0800, Sridhar Samudrala wrote:
> > +static int virtio_netdev_event(struct notifier_block *this,
> > +			       unsigned long event, void *ptr)
> > +{
> > +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> > +
> > +	/* Skip our own events */
> > +	if (event_dev->netdev_ops == &virtnet_netdev)
> > +		return NOTIFY_DONE;
> 
> I wonder how does this work WRT loop prevention.  What if I have two
> virtio devices with the same MAC, what is preventing them from claiming
> each other?  Is it only the check above (not sure what is "own" in
> comment referring to)?

I expect we will add a feature bit (VIRTIO_NET_F_MASTER) and it will
be host's responsibility not to add more than 1 such device.

> I'm worried the check above will not stop virtio from enslaving hyperv
> interfaces and vice versa, potentially leading to a loop, no?  There is
> also the fact that it would be preferable to share the code between
> paravirt drivers, to avoid duplicated bugs.
> 
> My suggestion during the previous discussion was to create a paravirt
> bond device, which will explicitly tell the OS which interfaces to
> bond, regardless of which driver they're using.  Could be some form of
> ACPI/FW driver too, I don't know enough about x86 FW to suggest
> something fully fleshed :(

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Jakub Kicinski @ 2017-12-21  0:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, stephen, netdev, virtualization,
	alexander.duyck
In-Reply-To: <20171221021450-mutt-send-email-mst@kernel.org>

On Thu, 21 Dec 2017 02:15:31 +0200, Michael S. Tsirkin wrote:
> On Wed, Dec 20, 2017 at 02:33:34PM -0800, Jakub Kicinski wrote:
> > On Mon, 18 Dec 2017 16:40:36 -0800, Sridhar Samudrala wrote:  
> > > +static int virtio_netdev_event(struct notifier_block *this,
> > > +			       unsigned long event, void *ptr)
> > > +{
> > > +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> > > +
> > > +	/* Skip our own events */
> > > +	if (event_dev->netdev_ops == &virtnet_netdev)
> > > +		return NOTIFY_DONE;  
> > 
> > I wonder how does this work WRT loop prevention.  What if I have two
> > virtio devices with the same MAC, what is preventing them from claiming
> > each other?  Is it only the check above (not sure what is "own" in
> > comment referring to)?  
> 
> I expect we will add a feature bit (VIRTIO_NET_F_MASTER) and it will
> be host's responsibility not to add more than 1 such device.

Do you mean a virtio_net feature bit?  That won't stop the loop with
a hyperv device.  Unless we want every paravirt device to have such
bit and enforce there can be only one magic bond enslavement active in
the system.

Making the bonding information separate from the slaves seems so much
cleaner...  No limitation on device types, or how many bonds user
chooses to create.  No MAC matching...

> > I'm worried the check above will not stop virtio from enslaving hyperv
> > interfaces and vice versa, potentially leading to a loop, no?  There is
> > also the fact that it would be preferable to share the code between
> > paravirt drivers, to avoid duplicated bugs.
> > 
> > My suggestion during the previous discussion was to create a paravirt
> > bond device, which will explicitly tell the OS which interfaces to
> > bond, regardless of which driver they're using.  Could be some form of
> > ACPI/FW driver too, I don't know enough about x86 FW to suggest
> > something fully fleshed :(  

^ permalink raw reply

* Re: RCU callback crashes
From: Jakub Kicinski @ 2017-12-21  0:37 UTC (permalink / raw)
  To: Cong Wang; +Cc: Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <CAM_iQpUngX+oSDiforfZceqMZrg=jDJnNf3QFF9WFQdHrU9o-g@mail.gmail.com>

On Wed, 20 Dec 2017 16:03:49 -0800, Cong Wang wrote:
> On Wed, Dec 20, 2017 at 10:31 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:  
> >>
> >> I guess it is q->miniqp which is freed in qdisc_graft() without properly
> >> waiting for rcu readers?  
> >
> > It is probably so, the call_rcu_bh(&miniq_old->rcu, mini_qdisc_rcu_func)
> > in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
> > but miniq is being freed, no rcu barrier waits for it...
> >
> > You can try to add a rcu_barrier_bh() at the end to see if this crash
> > is gone, but I don't think people like adding yet another rcu barrier...  
> 
> Hi, Jakub
> 
> Can you test the following fix? I am not a fan of rcu barrier but we
> already have one so...
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 876fab2604b8..1b68fedea124 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1240,6 +1240,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
> 
>         if (!tp_head) {
>                 RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
> +               /* Wait for existing flying RCU callback before being freed. */
> +               rcu_barrier_bh();
>                 return;
>         }

Looks good after 30 minutes, feel free to add if you post officially:

Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

* Re: RCU callback crashes
From: Jakub Kicinski @ 2017-12-21  0:41 UTC (permalink / raw)
  To: Cong Wang, John Fastabend; +Cc: Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <20171220163710.7a5f06e5@cakuba.netronome.com>

On Wed, 20 Dec 2017 16:37:10 -0800, Jakub Kicinski wrote:
> On Wed, 20 Dec 2017 16:03:49 -0800, Cong Wang wrote:
> > On Wed, Dec 20, 2017 at 10:31 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:  
> > > On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:    
> > >>
> > >> I guess it is q->miniqp which is freed in qdisc_graft() without properly
> > >> waiting for rcu readers?    
> > >
> > > It is probably so, the call_rcu_bh(&miniq_old->rcu, mini_qdisc_rcu_func)
> > > in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
> > > but miniq is being freed, no rcu barrier waits for it...
> > >
> > > You can try to add a rcu_barrier_bh() at the end to see if this crash
> > > is gone, but I don't think people like adding yet another rcu barrier...    
> > 
> > Hi, Jakub
> > 
> > Can you test the following fix? I am not a fan of rcu barrier but we
> > already have one so...
> > 
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 876fab2604b8..1b68fedea124 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -1240,6 +1240,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
> > 
> >         if (!tp_head) {
> >                 RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
> > +               /* Wait for existing flying RCU callback before being freed. */
> > +               rcu_barrier_bh();
> >                 return;
> >         }  
> 
> Looks good after 30 minutes, feel free to add if you post officially:
> 
> Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Just as I hit send... :)  but this looks unrelated, "Comm: sshd" -
so probably from the management interface.

[  154.604041] ==================================================================
[  154.612245] BUG: KASAN: slab-out-of-bounds in pfifo_fast_dequeue+0x140/0x2d0
[  154.620219] Read of size 8 at addr ffff88086bb64040 by task sshd/983
[  154.627403] 
[  154.629161] CPU: 10 PID: 983 Comm: sshd Not tainted 4.15.0-rc3-perf-00984-g82d3fc87a4aa-dirty #13
[  154.639190] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[  154.647665] Call Trace:
[  154.650494]  dump_stack+0xa6/0x118
[  154.654387]  ? _atomic_dec_and_lock+0xe8/0xe8
[  154.659355]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[  154.666263]  ? rcu_segcblist_enqueue+0xe9/0x120
[  154.671422]  ? _raw_spin_unlock_bh+0x91/0xc0
[  154.676286]  ? pfifo_fast_dequeue+0x140/0x2d0
[  154.681251]  print_address_description+0x6a/0x270
[  154.686601]  ? pfifo_fast_dequeue+0x140/0x2d0
[  154.691565]  kasan_report+0x23f/0x350
[  154.695752]  pfifo_fast_dequeue+0x140/0x2d0
[  154.700523]  __qdisc_run+0x264/0xa20
[  154.704613]  ? sch_direct_xmit+0x3d0/0x3d0
[  154.709287]  ? _raw_spin_unlock+0x73/0xc0
[  154.713860]  ? is_bpf_text_address+0x1e/0x30
[  154.718724]  ? kernel_text_address+0xec/0x100
[  154.723687]  ? __kernel_text_address+0xe/0x30
[  154.728650]  ? unwind_get_return_address+0x2f/0x50
[  154.734099]  ? pfifo_fast_enqueue+0x154/0x180
[  154.739065]  __dev_queue_xmit+0x5ae/0x1110
[  154.743738]  ? dst_alloc+0x8c/0xd0
[  154.747633]  ? netdev_pick_tx+0x150/0x150
[  154.752206]  ? ip_route_output_key_hash+0xee/0x130
[  154.757654]  ? ip_queue_xmit+0x7d0/0x830
[  154.762131]  ? tcp_transmit_skb+0xc52/0x15b0
[  154.766994]  ? tcp_write_xmit+0x425/0x2060
[  154.771665]  ? __tcp_push_pending_frames+0x56/0x110
[  154.777209]  ? tcp_push+0x2cf/0x360
[  154.781200]  ? tcp_sendmsg_locked+0xdb3/0x1cb0
[  154.786259]  ? tcp_sendmsg+0x27/0x40
[  154.790347]  ? inet_sendmsg+0xb3/0x1f0
[  154.794629]  ? sock_sendmsg+0x64/0x80
[  154.798814]  ? sock_write_iter+0x148/0x1f0
[  154.803486]  ? __vfs_write+0x26e/0x370
[  154.807767]  ? vfs_write+0xe9/0x240
[  154.811747]  ? SyS_write+0xa7/0x130
[  154.815739]  ? entry_SYSCALL_64_fastpath+0x1e/0x81
[  154.821190]  ? __alias_free_mem+0x20/0x20
[  154.825766]  ? rt_cache_route+0x143/0x170
[  154.830342]  ? find_busiest_group+0x12eb/0x1630
[  154.835500]  ? inet_lookup_ifaddr_rcu+0x126/0x170
[  154.840852]  ? percpu_counter_add_batch+0x24/0xa0
[  154.846207]  ? rt_cpu_seq_stop+0x10/0x10
[  154.850684]  ? dst_alloc+0xac/0xd0
[  154.854579]  ? rt_dst_alloc+0x1f0/0x250
[  154.858958]  ? ipv4_neigh_lookup+0x3a0/0x3a0
[  154.863824]  ? __rcu_read_unlock+0x6e/0x120
[  154.868594]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[  154.875502]  ? ip_finish_output2+0x68d/0x7c0
[  154.880366]  ip_finish_output2+0x68d/0x7c0
[  154.885040]  ? ip_send_check+0x60/0x60
[  154.889322]  ? ip_route_input_noref+0xd0/0xd0
[  154.894287]  ? xfrm_lookup+0x888/0x10f0
[  154.898668]  ? ipv4_mtu+0x163/0x200
[  154.902662]  ? load_balance+0x108d/0x14a0
[  154.907238]  ? ip_finish_output+0x39a/0x4c0
[  154.912004]  ip_finish_output+0x39a/0x4c0
[  154.916578]  ? ip_fragment.constprop.5+0xf0/0xf0
[  154.921832]  ? find_busiest_group+0x1630/0x1630
[  154.926991]  ? check_cfs_rq_runtime+0x70/0x70
[  154.931954]  ? __rcu_read_unlock+0x6e/0x120
[  154.936723]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[  154.943630]  ? unwind_get_return_address+0x2f/0x50
[  154.949077]  ? ip_send_check+0x20/0x60
[  154.953360]  ip_output+0x106/0x280
[  154.957253]  ? ip_mc_output+0x750/0x750
[  154.961631]  ? ip_route_output_key_hash_rcu+0x1240/0x1240
[  154.967757]  ? sk_setup_caps+0x180/0x180
[  154.972236]  ? __skb_clone+0x2f8/0x370
[  154.976520]  ip_queue_xmit+0x381/0x830
[  154.980805]  ? ip_build_and_send_pkt+0x420/0x420
[  154.986060]  ? trace_event_raw_event_bpf_obj_map+0x200/0x200
[  154.992481]  ? tcp_options_write+0xc3/0x360
[  154.997248]  ? tcp_established_options+0x122/0x190
[  155.002697]  tcp_transmit_skb+0xc52/0x15b0
[  155.007374]  ? __tcp_select_window+0x3c0/0x3c0
[  155.012433]  ? is_bpf_text_address+0x1e/0x30
[  155.017296]  ? kernel_text_address+0xec/0x100
[  155.022259]  ? __kernel_text_address+0xe/0x30
[  155.027221]  ? unwind_get_return_address+0x2f/0x50
[  155.032670]  ? __save_stack_trace+0x83/0xd0
[  155.037437]  ? memcmp+0x45/0x70
[  155.041041]  ? depot_save_stack+0x12d/0x470
[  155.045811]  ? tcp_small_queue_check.isra.4+0x10a/0x1f0
[  155.051745]  ? tcp_tso_segs+0xe0/0xe0
[  155.055932]  ? native_sched_clock+0xcc/0x130
[  155.060799]  ? cyc2ns_read_end+0x20/0x20
[  155.065275]  ? sock_sendmsg+0x64/0x80
[  155.069460]  ? vfs_write+0xe9/0x240
[  155.073483]  ? entry_SYSCALL_64_fastpath+0x1e/0x81
[  155.078931]  ? sock_sendmsg+0x64/0x80
[  155.083116]  ? sock_write_iter+0x148/0x1f0
[  155.087790]  ? sched_clock+0x5/0x10
[  155.091780]  ? deref_stack_reg+0x98/0xd0
[  155.096257]  ? sched_clock+0x5/0x10
[  155.100248]  ? sched_clock_cpu+0x14/0xf0
[  155.104726]  tcp_write_xmit+0x425/0x2060
[  155.109209]  ? memcg_kmem_get_cache+0x4e0/0x4e0
[  155.114356]  ? tcp_transmit_skb+0x15b0/0x15b0
[  155.119318]  ? memcg_kmem_put_cache+0x63/0x120
[  155.124376]  ? memcg_kmem_get_cache+0x4e0/0x4e0
[  155.129536]  ? __kmalloc_node_track_caller+0x1fe/0x2a0
[  155.135371]  ? __alloc_skb+0xed/0x390
[  155.139558]  ? __kmalloc_reserve.isra.7+0x43/0x80
[  155.144908]  ? memset+0x1f/0x40
[  155.148510]  ? __alloc_skb+0x302/0x390
[  155.152792]  ? __kmalloc_reserve.isra.7+0x80/0x80
[  155.158142]  ? ipv4_mtu+0x90/0x200
[  155.162036]  ? tcp_mtu_to_mss+0x155/0x1a0
[  155.166610]  ? ipv4_negative_advice+0x60/0x60
[  155.171572]  ? tcp_trim_head+0x260/0x260
[  155.176048]  ? SyS_read+0xa7/0x130
[  155.179941]  ? iov_iter_advance+0x16a/0x780
[  155.184709]  ? copyout+0x4f/0x60
[  155.188410]  ? tcp_established_options+0x122/0x190
[  155.193858]  ? import_single_range+0x110/0x110
[  155.198918]  __tcp_push_pending_frames+0x56/0x110
[  155.204269]  tcp_push+0x2cf/0x360
[  155.208068]  ? tcp_splice_data_recv+0xb0/0xb0
[  155.213032]  ? skb_entail+0x2e5/0x300
[  155.217217]  ? _copy_from_iter+0x680/0x680
[  155.221890]  ? _raw_spin_unlock_bh+0x91/0xc0
[  155.226757]  tcp_sendmsg_locked+0xdb3/0x1cb0
[  155.231628]  ? tcp_recvmsg+0x790/0x1420
[  155.236001]  ? tcp_sendpage+0x60/0x60
[  155.240190]  ? tcp_recv_timestamp+0x240/0x240
[  155.245158]  ? compat_poll_select_copy_remaining+0x310/0x310
[  155.251582]  ? compat_poll_select_copy_remaining+0x310/0x310
[  155.258004]  ? compat_poll_select_copy_remaining+0x310/0x310
[  155.264427]  ? __rcu_read_unlock+0xf8/0x120
[  155.269197]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[  155.276104]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
[  155.283000]  ? _raw_spin_unlock+0x73/0xc0
[  155.287573]  ? _raw_spin_trylock+0xe0/0xe0
[  155.292246]  ? __release_sock+0xc0/0x140
[  155.296727]  tcp_sendmsg+0x27/0x40
[  155.300621]  inet_sendmsg+0xb3/0x1f0
[  155.304709]  ? aa_path_link+0x260/0x260
[  155.309088]  ? inet_recvmsg+0x210/0x210
[  155.313469]  ? fsnotify+0xae8/0xb30
[  155.317462]  ? inet_recvmsg+0x210/0x210
[  155.332037]  sock_sendmsg+0x64/0x80
[  155.336029]  sock_write_iter+0x148/0x1f0
[  155.340506]  ? sock_sendmsg+0x80/0x80
[  155.344691]  ? sock_recvmsg+0x90/0x90
[  155.348881]  ? tty_ldisc_deref+0x12/0x20
[  155.353356]  ? iov_iter_init+0x77/0xb0
[  155.357639]  __vfs_write+0x26e/0x370
[  155.361727]  ? kernel_read+0xa0/0xa0
[  155.365816]  ? _raw_spin_unlock_irq+0x73/0xc0
[  155.370783]  ? __fsnotify_update_child_dentry_flags.part.0+0x150/0x150
[  155.378174]  ? __fsnotify_parent+0x84/0x220
[  155.382944]  ? __fsnotify_update_child_dentry_flags.part.0+0x150/0x150
[  155.390340]  vfs_write+0xe9/0x240
[  155.394137]  SyS_write+0xa7/0x130
[  155.397934]  ? SyS_read+0x130/0x130
[  155.401925]  ? SyS_clock_settime+0x110/0x110
[  155.406791]  ? SyS_fcntl+0x82/0xb0
[  155.410685]  entry_SYSCALL_64_fastpath+0x1e/0x81
[  155.415939] RIP: 0033:0x7fed3bbf4290
[  155.420024] RSP: 002b:00007ffcffb69468 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  155.428600] RAX: ffffffffffffffda RBX: 00007fed3bec1b20 RCX: 00007fed3bbf4290
[  155.436668] RDX: 0000000000000024 RSI: 0000559b52ddd308 RDI: 0000000000000003
[  155.444735] RBP: 0000000000000021 R08: 0000559b52ddaf60 R09: 0000000000000014
[  155.452803] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fed3bec1b78
[  155.460870] R13: 0000559b52ddb030 R14: 0000559b52ddaf50 R15: 0000559b52ddaf50
[  155.468942] 
[  155.470697] Allocated by task 780:
[  155.474589]  __kmalloc+0xfa/0x230
[  155.478377]  pfifo_fast_init+0x69/0x160
[  155.482757]  qdisc_create_dflt+0x69/0xb0
[  155.487232]  mq_init+0x195/0x1e0
[  155.490931]  qdisc_create_dflt+0x69/0xb0
[  155.495407]  dev_activate+0x48a/0x4e0
[  155.499593]  __dev_open+0x19e/0x210
[  155.503583]  __dev_change_flags+0x3b5/0x3f0
[  155.508351]  dev_change_flags+0x50/0xa0
[  155.512729]  do_setlink+0x5eb/0x1cf0
[  155.516817]  rtnl_newlink+0x9d5/0xe40
[  155.521002]  rtnetlink_rcv_msg+0x37c/0x7e0
[  155.525673]  netlink_rcv_skb+0x122/0x230
[  155.530149]  netlink_unicast+0x2ae/0x360
[  155.534624]  netlink_sendmsg+0x5d5/0x620
[  155.539100]  sock_sendmsg+0x64/0x80
[  155.543090]  ___sys_sendmsg+0x4a8/0x500
[  155.547467]  __sys_sendmsg+0xa9/0x140
[  155.551643]  entry_SYSCALL_64_fastpath+0x1e/0x81
[  155.556893] 
[  155.558646] Freed by task 0:
[  155.561953] (stack is not available)
[  155.566035] 
[  155.567791] The buggy address belongs to the object at ffff88086bb62100
[  155.567791]  which belongs to the cache kmalloc-8192 of size 8192
[  155.582099] The buggy address is located 8000 bytes inside of
[  155.582099]  8192-byte region [ffff88086bb62100, ffff88086bb64100)
[  155.595529] The buggy address belongs to the page:
[  155.600977] page:00000000a9a82c52 count:1 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
[  155.612081] flags: 0x6ffff0000008100(slab|head)
[  155.617240] raw: 06ffff0000008100 0000000000000000 0000000000000000 0000000100030003
[  155.626010] raw: dead000000000100 dead000000000200 ffff8803afc0e680 0000000000000000
[  155.634776] page dumped because: kasan: bad access detected
[  155.641094] 
[  155.642935] Memory state around the buggy address:
[  155.648382]  ffff88086bb63f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  155.656568]  ffff88086bb63f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  155.664756] >ffff88086bb64000: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[  155.672943]                                            ^
[  155.678972]  ffff88086bb64080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  155.687160]  ffff88086bb64100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  155.695346] ==================================================================

^ permalink raw reply

* [PATCH] net: phy: micrel: ksz9031: reconfigure autoneg after phy autoneg workaround
From: Grygorii Strashko @ 2017-12-21  0:45 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn, netdev
  Cc: linux-kernel, Sekhar Nori, Grygorii Strashko

Under some circumstances driver will perform PHY reset in
ksz9031_read_status() to fix autoneg failure case (idle error count =
0xFF). When this happens ksz9031 will not detect link status change any
more when connecting to Netgear 1G switch (link can be recovered sometimes by
restarting netdevice "ifconfig down up"). Reproduced with TI am572x board
equipped with ksz9031 PHY while connecting to Netgear 1G switch.

Fix the issue by reconfiguring autonegotiation after PHY reset in
ksz9031_read_status().

Fixes: d2fd719bcb0e ("net/phy: micrel: Add workaround for bad autoneg")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
---
 drivers/net/phy/micrel.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/phy/micrel.c b/drivers/net/phy/micrel.c
index ab46141..422ff63 100644
--- a/drivers/net/phy/micrel.c
+++ b/drivers/net/phy/micrel.c
@@ -624,6 +624,7 @@ static int ksz9031_read_status(struct phy_device *phydev)
 		phydev->link = 0;
 		if (phydev->drv->config_intr && phy_interrupt_is_valid(phydev))
 			phydev->drv->config_intr(phydev);
+		return genphy_config_aneg(phydev);
 	}
 
 	return 0;
-- 
2.10.5

^ permalink raw reply related

* Re: [-next PATCH 4/4] treewide: Use DEVICE_ATTR_WO
From: Zhang Rui @ 2017-12-21  0:50 UTC (permalink / raw)
  To: Joe Perches, Borislav Petkov, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Thomas Falcon, John Allen,
	Inaky Perez-Gonzalez, linux-wimax, James Smart, Dick Kennedy,
	Eduardo Valentin
  Cc: Martin Schwidefsky, Heiko Carstens, Thomas Gleixner,
	H. Peter Anvin, x86, Dmitry Torokhov, James E.J. Bottomley,
	Martin K. Petersen, linux-s390, linux-kernel, linux-input, netdev,
	linuxppc-dev, linux-scsi, linux-pm
In-Reply-To: <fa30f1ad73f76dafff816df40cacffe613aa2f48.1513706702.git.joe@perches.com>

On Tue, 2017-12-19 at 10:15 -0800, Joe Perches wrote:
> Convert DEVICE_ATTR uses to DEVICE_ATTR_WO where possible.
> 
> Done with perl script:
> 
> $ git grep -w --name-only DEVICE_ATTR | \
>   xargs perl -i -e 'local $/; while (<>) {
> s/\bDEVICE_ATTR\s*\(\s*(\w+)\s*,\s*\(?(?:\s*S_IWUSR\s*|\s*0200\s*)\)?
> \s*,\s*NULL\s*,\s*\s_store\s*\)/DEVICE_ATTR_WO(\1)/g; print;}'
> 
> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>  arch/s390/kernel/smp.c                 | 2 +-
>  arch/x86/kernel/cpu/microcode/core.c   | 2 +-
>  drivers/input/touchscreen/elants_i2c.c | 2 +-
>  drivers/net/ethernet/ibm/ibmvnic.c     | 2 +-
>  drivers/net/wimax/i2400m/sysfs.c       | 3 +--
>  drivers/scsi/lpfc/lpfc_attr.c          | 3 +--
>  drivers/thermal/thermal_sysfs.c        | 2 +-

For the thermal part,
Acked-by: Zhang Rui <rui.zhang@intel.com>

thanks,
rui

>  7 files changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
> index b8c1a85bcf2d..a919b2f0141d 100644
> --- a/arch/s390/kernel/smp.c
> +++ b/arch/s390/kernel/smp.c
> @@ -1151,7 +1151,7 @@ static ssize_t __ref rescan_store(struct device
> *dev,
>  	rc = smp_rescan_cpus();
>  	return rc ? rc : count;
>  }
> -static DEVICE_ATTR(rescan, 0200, NULL, rescan_store);
> +static DEVICE_ATTR_WO(rescan);
>  #endif /* CONFIG_HOTPLUG_CPU */
>  
>  static int __init s390_smp_init(void)
> diff --git a/arch/x86/kernel/cpu/microcode/core.c
> b/arch/x86/kernel/cpu/microcode/core.c
> index c4fa4a85d4cb..09c74b0560dd 100644
> --- a/arch/x86/kernel/cpu/microcode/core.c
> +++ b/arch/x86/kernel/cpu/microcode/core.c
> @@ -560,7 +560,7 @@ static ssize_t pf_show(struct device *dev,
>  	return sprintf(buf, "0x%x\n", uci->cpu_sig.pf);
>  }
>  
> -static DEVICE_ATTR(reload, 0200, NULL, reload_store);
> +static DEVICE_ATTR_WO(reload);
>  static DEVICE_ATTR(version, 0400, version_show, NULL);
>  static DEVICE_ATTR(processor_flags, 0400, pf_show, NULL);
>  
> diff --git a/drivers/input/touchscreen/elants_i2c.c
> b/drivers/input/touchscreen/elants_i2c.c
> index a458e5ec9e41..819213e88f32 100644
> --- a/drivers/input/touchscreen/elants_i2c.c
> +++ b/drivers/input/touchscreen/elants_i2c.c
> @@ -1000,7 +1000,7 @@ static ssize_t show_iap_mode(struct device
> *dev,
>  				"Normal" : "Recovery");
>  }
>  
> -static DEVICE_ATTR(calibrate, S_IWUSR, NULL, calibrate_store);
> +static DEVICE_ATTR_WO(calibrate);
>  static DEVICE_ATTR(iap_mode, S_IRUGO, show_iap_mode, NULL);
>  static DEVICE_ATTR(update_fw, S_IWUSR, NULL, write_update_fw);
>  
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c
> b/drivers/net/ethernet/ibm/ibmvnic.c
> index 1dc4aef37d3a..42b96e1a1b13 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -4411,7 +4411,7 @@ static ssize_t failover_store(struct device
> *dev, struct device_attribute *attr,
>  	return count;
>  }
>  
> -static DEVICE_ATTR(failover, 0200, NULL, failover_store);
> +static DEVICE_ATTR_WO(failover);
>  
>  static unsigned long ibmvnic_get_desired_dma(struct vio_dev *vdev)
>  {
> diff --git a/drivers/net/wimax/i2400m/sysfs.c
> b/drivers/net/wimax/i2400m/sysfs.c
> index 1237109f251a..8c67df11105c 100644
> --- a/drivers/net/wimax/i2400m/sysfs.c
> +++ b/drivers/net/wimax/i2400m/sysfs.c
> @@ -65,8 +65,7 @@ ssize_t i2400m_idle_timeout_store(struct device
> *dev,
>  }
>  
>  static
> -DEVICE_ATTR(i2400m_idle_timeout, S_IWUSR,
> -	    NULL, i2400m_idle_timeout_store);
> +DEVICE_ATTR_WO(i2400m_idle_timeout);
>  
>  static
>  struct attribute *i2400m_dev_attrs[] = {
> diff --git a/drivers/scsi/lpfc/lpfc_attr.c
> b/drivers/scsi/lpfc/lpfc_attr.c
> index 517ff203cfde..6ddaf51a23f6 100644
> --- a/drivers/scsi/lpfc/lpfc_attr.c
> +++ b/drivers/scsi/lpfc/lpfc_attr.c
> @@ -2418,8 +2418,7 @@ lpfc_soft_wwn_enable_store(struct device *dev,
> struct device_attribute *attr,
>  
>  	return count;
>  }
> -static DEVICE_ATTR(lpfc_soft_wwn_enable, S_IWUSR, NULL,
> -		   lpfc_soft_wwn_enable_store);
> +static DEVICE_ATTR_WO(lpfc_soft_wwn_enable);
>  
>  /**
>   * lpfc_soft_wwpn_show - Return the cfg soft ww port name of the
> adapter
> diff --git a/drivers/thermal/thermal_sysfs.c
> b/drivers/thermal/thermal_sysfs.c
> index 2bc964392924..ba81c9080f6e 100644
> --- a/drivers/thermal/thermal_sysfs.c
> +++ b/drivers/thermal/thermal_sysfs.c
> @@ -317,7 +317,7 @@ emul_temp_store(struct device *dev, struct
> device_attribute *attr,
>  
>  	return ret ? ret : count;
>  }
> -static DEVICE_ATTR(emul_temp, S_IWUSR, NULL, emul_temp_store);
> +static DEVICE_ATTR_WO(emul_temp);
>  #endif
>  
>  static ssize_t

^ permalink raw reply

* Re: RCU callback crashes
From: Jakub Kicinski @ 2017-12-21  0:50 UTC (permalink / raw)
  To: John Fastabend; +Cc: Cong Wang, Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <20171220164058.2a862e27@cakuba.netronome.com>

On Wed, 20 Dec 2017 16:41:14 -0800, Jakub Kicinski wrote:
> Just as I hit send... :)  but this looks unrelated, "Comm: sshd" -
> so probably from the management interface.
> 
> [  154.604041] ==================================================================
> [  154.612245] BUG: KASAN: slab-out-of-bounds in pfifo_fast_dequeue+0x140/0x2d0
> [  154.620219] Read of size 8 at addr ffff88086bb64040 by task sshd/983
> [  154.627403] 
> [  154.629161] CPU: 10 PID: 983 Comm: sshd Not tainted 4.15.0-rc3-perf-00984-g82d3fc87a4aa-dirty #13
> [  154.639190] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
> [  154.647665] Call Trace:
> [  154.650494]  dump_stack+0xa6/0x118
> [  154.654387]  ? _atomic_dec_and_lock+0xe8/0xe8
> [  154.659355]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
> [  154.666263]  ? rcu_segcblist_enqueue+0xe9/0x120
> [  154.671422]  ? _raw_spin_unlock_bh+0x91/0xc0
> [  154.676286]  ? pfifo_fast_dequeue+0x140/0x2d0
> [  154.681251]  print_address_description+0x6a/0x270
> [  154.686601]  ? pfifo_fast_dequeue+0x140/0x2d0
> [  154.691565]  kasan_report+0x23f/0x350
> [  154.695752]  pfifo_fast_dequeue+0x140/0x2d0

If we trust stack decode it's:

   615  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
   616  {
   617          struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
   618          struct sk_buff *skb = NULL;
   619          int band;
   620  
   621          for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
   622                  struct skb_array *q = band2list(priv, band);
   623  
>> 624                  if (__skb_array_empty(q))
   625                          continue;
   626  
   627                  skb = skb_array_consume_bh(q);
   628          }
   629          if (likely(skb)) {
   630                  qdisc_qstats_cpu_backlog_dec(qdisc, skb);
   631                  qdisc_bstats_cpu_update(qdisc, skb);
   632                  qdisc_qstats_cpu_qlen_dec(qdisc);
   633          }
   634  
   635          return skb;
   636  }

^ permalink raw reply

* Re: [PATCH v3 next-queue 05/10] ixgbe: add ipsec offload add and remove SA
From: Marcelo Ricardo Leitner @ 2017-12-21  1:17 UTC (permalink / raw)
  To: Shannon Nelson
  Cc: intel-wired-lan, jeffrey.t.kirsher, steffen.klassert,
	sowmini.varadhan, netdev
In-Reply-To: <1513728002-7643-6-git-send-email-shannon.nelson@oracle.com>

Hi,

On Tue, Dec 19, 2017 at 03:59:57PM -0800, Shannon Nelson wrote:
> +}
> +
> +static const struct xfrmdev_ops ixgbe_xfrmdev_ops = {
> +	.xdo_dev_state_add = ixgbe_ipsec_add_sa,
> +	.xdo_dev_state_delete = ixgbe_ipsec_del_sa,
> +};
> +

This struct is only declared if XFRM_OFFLOAD is selected. What is
selecting it for ixgbe driver?
mlx5 driver has an extra option for ipsec offload and it then does
'depends on XFRM_OFFLOAD'

  Marcelo

^ permalink raw reply

* Re: [PATCH V3 net-next 00/17] add some features and fix some bugs for HNS3 driver
From: lipeng (Y) @ 2017-12-21  1:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, linuxarm, salil.mehta
In-Reply-To: <20171220.142843.568343410203899321.davem@davemloft.net>



On 2017/12/21 3:28, David Miller wrote:
> From: Lipeng <lipeng321@huawei.com>
> Date: Wed, 20 Dec 2017 16:43:02 +0800
>
>> This patchset adds some new feature support and fixes some bugs:
>> [Patch 1/17 - 5/17] add the support to modify/query the tqp number
>> through ethtool -L/l command, and also fix some related bugs for
>> change tqp number.
>> [Patch 6/17 - 9-17] add support vlan tag offload on tx&&rx direction
>> for pf, and fix some related bugs.
>> [patch 10/17 - 11/17] fix bugs for auto negotiation.
>> [patch 12/17] adds support for ethtool command set_pauseparam.
>> [patch 13/17 - 14/17] add support to update flow control settings after
>> autoneg.
>> [patch 15/17 - 17/17] fix some other bugs in net-next.
> In your From: email field, as well as you and your colleagues signoffs
> you are not using your full, real, name:
> ====================
> From: Lipeng <lipeng321@huawei.com>

My first name is 'Peng' and Surname is 'Li'.
But it is usually spelled as 'Lipeng' in one go when referring.
This is quite usual convention with full names originating from
China. Surnames comes first, followed by the first name and they
both are inseparable while they are written as well.
"Lipeng" is my full real name,  and my sign-off's appear like above.


Thanks,
Lipeng

> ====================
> Signed-off-by: qumingguang <qumingguang@huawei.com>
> Signed-off-by: Lipeng <lipeng321@huawei.com>
> ====================
>
> Please fix this.
>
> .
>

^ permalink raw reply

* RE: [patch iproute2] tc: add -bs option for batch mode
From: Chris Mi @ 2017-12-21  1:30 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev@vger.kernel.org, gerlitz.or@gmail.com
In-Reply-To: <20171220071744.25f9dd41@xeon-e3>

> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, December 20, 2017 11:18 PM
> To: Chris Mi <chrism@mellanox.com>
> Cc: netdev@vger.kernel.org; gerlitz.or@gmail.com
> Subject: Re: [patch iproute2] tc: add -bs option for batch mode
> 
> On Wed, 20 Dec 2017 09:23:34 +0000
> Chris Mi <chrism@mellanox.com> wrote:
> 
> > > Your real performance win is just not asking for ACK for every rule.
> > No. Even if batch_size > 1, we ack every rule. The real performance
> > win is to send multiple rules in one system call. If we are not asking
> > for ACK for every rule, the performance will be improved further.
> 
> Try the no ACK method.
> 
> When we were optimizing routing daemons like Quagga, it was discovered
> that an ACK for every route insert was the main bottleneck. Doing
> asynchronous error handling got a bigger win than your batching.
> 
> Please try that, doing multiple messages using iov is not necessary.

This is my testing result to insert 1,000,000 rules:

1. batch_size = 1, acking

real    0m15.125s
user    0m6.982s
sys     0m8.080s

2. batch_size = 10, acking

real    0m12.772s
user    0m5.984s
sys     0m6.723s

3. batch_size = 1, no acking

real    0m14.484s
user    0m6.919s
sys     0m7.498s

4. batch_size = 10, no acking

real    0m11.664s
user    0m6.017s
sys     0m5.578s

As we can see from above test result, the bottleneck is not in acking.
Without acking or with asynchronous error handling, we can improve the performance further.
It is worth to do that. But I think that should be in another patch. This patch only adds
the -bs option.

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Siwei Liu @ 2017-12-21  1:31 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: mst, sridhar.samudrala, alexander.duyck, virtualization, netdev,
	David Miller
In-Reply-To: <20171219104159.30d9caaf@xeon-e3>


[-- Attachment #1.1: Type: text/plain, Size: 1504 bytes --]

On Tue, Dec 19, 2017 at 10:41 AM, Stephen Hemminger <
stephen@networkplumber.org> wrote:

> On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
> David Miller <davem@davemloft.net> wrote:
>
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Date: Tue, 19 Dec 2017 09:55:48 -0800
> >
> > > could be 10ms, just enough to let udev do its renaming
> >
> > Please, move to some kind of notification or event based handling of
> > this problem.
> >
> > No delay is safe, what if userspace gets swapped out or whatever
> > else might make userspace stall unexpectedly?
> >
>
> The plan is to remove the delay and do the naming in the kernel.
> This was suggested by Lennart since udev is only doing naming policy
> because kernel names were not repeatable.
>
> This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
>
> Patch is pending.
>

While it's good to show VF with specific naming to indicate enslavement, I
wonder wouldn't it be better to hide this netdev at all from the user
space? IMHO this extra device is useless when being enslaved and we may
delegate controls (e.g. ethtool) over to the para-virtual device instead?
That way it's possible to eliminate the possibility of additional udev
setup or modification?

I'm not sure if this  is consistent with Windows guest or not, but I don't
find it _Linux_ user friendly that ethtool doesn't work on the composite
interface any more, and I have to end up with finding out the correct
enslaved VF I must operate on.

Regards,
-Siwei

[-- Attachment #1.2: Type: text/html, Size: 2322 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v3 next-queue 05/10] ixgbe: add ipsec offload add and remove SA
From: Shannon Nelson @ 2017-12-21  1:39 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: intel-wired-lan, jeffrey.t.kirsher, steffen.klassert,
	sowmini.varadhan, netdev
In-Reply-To: <20171221010139.GM6122@localhost.localdomain>

On 12/20/2017 5:17 PM, Marcelo Ricardo Leitner wrote:
> Hi,
> 
> On Tue, Dec 19, 2017 at 03:59:57PM -0800, Shannon Nelson wrote:
>> +}
>> +
>> +static const struct xfrmdev_ops ixgbe_xfrmdev_ops = {
>> +	.xdo_dev_state_add = ixgbe_ipsec_add_sa,
>> +	.xdo_dev_state_delete = ixgbe_ipsec_del_sa,
>> +};
>> +
> 
> This struct is only declared if XFRM_OFFLOAD is selected. What is
> selecting it for ixgbe driver?
> mlx5 driver has an extra option for ipsec offload and it then does
> 'depends on XFRM_OFFLOAD'
> 
>    Marcelo
> 

I didn't bother putting a 'depends' item in the ixgbe's Kconfig entry, 
and I didn't create an extra CONFIG variable to enable ixgbe's support 
of the offload.  If CONFIG_XFRM_OFFLOAD is set, then ixgbe will support it.

sln

^ permalink raw reply

* tracepoint_probe_register and bpf. Was: [PATCH tip 0/3] Improvements of scheduler related Tracepoints
From: Alexei Starovoitov @ 2017-12-21  2:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Teng Qin, mingo, bgregg, daniel, linux-kernel, kernel-team,
	netdev, David S. Miller, brouer
In-Reply-To: <20171218091157.rgogahbflgwvwcdw@hirez.programming.kicks-ass.net>

On Mon, Dec 18, 2017 at 10:11:57AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 09:09:51AM -0800, Alexei Starovoitov wrote:
> 
> > yeah. Currently bpf progs are called at the end of
> > perf_trace_##call()
> > {
> >   .. regular tracepoint copy craft
> >   perf_trace_run_bpf_submit( &copied args )
> > }
> > 
> > from bpf pov we'd rather get access to raw args passed into
> > perf_trace_##call.
> > Sounds like you're suggesting to let bpf side register its
> > progs directly via tracepoint_probe_register() ?
> > That would solve the whole thing really nicely indeed.
> 
> Not sure I thought that far; but if you want the probe arguments either
> grab them from the perf probe you're running in or indeed register your
> own, don't play silly buggers and copy them in the trace data.

So I've tried both approaches:

1. grab arguments from:
perf_trace_##call(void *__data, proto)

it works ok, but it adds 50-70 bytes of .text to every perf_trace_*
functions which multiplied by the number of tracepoints (~500) which in
total adds ~30k of .text to vmlinux
Not too bad, but it also adds extra branch to critical
execution path, so not ideal

2. register your own
This approach I think is much better.
The trick I'm doing is the following:
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)  \
/* no 'static' here. The bpf probe functions are global */              \
notrace void                                                            \
__bpf_trace_##call(void *__data, proto)                                 \
{                                                                       \
        struct trace_event_call *event_call = __data;                   \
        \
        __FN_COUNT(bpf_trace_run, args)(event_call, CAST_TO_U64(args)); \
}

Where CAST_TO_U64 is a macro to cast all args to u64 one by one.
bpf_trace_run*() functions are defined as:
void bpf_trace_run1(struct trace_event_call *call, u64 arg1);
void bpf_trace_run2(struct trace_event_call *call, u64 arg1, u64 arg2);
void bpf_trace_run3(struct trace_event_call *call, u64 arg1, u64 arg2, u64 arg3);

so in assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
   0xffffffff81132080 <+0>:	mov    %ecx,%ecx
   0xffffffff81132082 <+2>:	jmpq   0xffffffff811231f0 <bpf_trace_run3>

where

TRACE_EVENT(xdp_exception,
        TP_PROTO(const struct net_device *dev,
                 const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3() and all other registers are passed as-is.
So all of ~500 of __bpf_trace_*() functions are only 5-10 _byte_ long
and in total this approach adds 7k bytes to .text and 8k bytes
to .rodata since I want them to appear in kallsyms, so I don't
have to add another 8-byte fields to 'struct trace_event_class'
and 'struct trace_event_call'

Such approach, I think, is the lowest overhead we can possibly have
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since many folks are starting to use tracepoint+bpf at speeds of
1M+ events per second this would be very valuable optimization.

Diff stat so far:
 drivers/gpu/drm/i915/i915_trace.h | 13 ++++++++++---
 fs/btrfs/super.c                  |  4 ++++
 include/linux/trace_events.h      | 27 +++++++++++++++++++++++++++
 include/trace/bpf_probe.h         | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/trace/define_trace.h      |  1 +
 include/uapi/linux/bpf.h          |  4 ++++
 kernel/trace/bpf_trace.c          | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 191 insertions(+), 3 deletions(-)

Since I'm not touching anything on ftrace or perf side,
I'm thinking to extend BPF_PROG_ATTACH command to specify
which tracepoint to attach to.
The user space will look like:

prog_fd = bpf_prog_load(...);
bpf_prog_attach(prog_fd, "xdp_exception");

On the kernel side I will walk kallsyms to
find __tracepoint_xdp_exception record, check that 
tp->name == "xdp_exception",
then find __bpf_trace_xdp_exception() address (also in kallsyms),
and finally use tracepoint_probe_register() to connect the two.

Thoughts?

^ permalink raw reply

* Re: [PATCH net-next] virtio_net: Add ethtool stats
From: Toshiaki Makita @ 2017-12-21  2:12 UTC (permalink / raw)
  To: Jason Wang, David S . Miller, Michael S . Tsirkin; +Cc: netdev, virtualization
In-Reply-To: <ff4eb1b4-e49f-2c5d-8e99-810093643c5c@redhat.com>

On 2017/12/20 17:13, Jason Wang wrote:
> On 2017年12月20日 12:40, Toshiaki Makita wrote:
>> The main purpose of this patch is adding a way of checking per-queue
>> stats.
>> It's useful to debug performance problems on multiqueue environment.
>>
>> $ ethtool -S ens10
>> NIC statistics:
>>       rx_packets: 4172939
>>       tx_packets: 5855538
>>       rx_bytes: 6317757408
>>       tx_bytes: 8865151846
>>       rx_dropped: 0
>>       rx_length_errors: 0
>>       rx_frame_errors: 0
>>       tx_dropped: 0
>>       tx_fifo_errors: 0
>>       rx_queue_0_packets: 2090408
>>       rx_queue_0_bytes: 3164825094
>>       rx_queue_1_packets: 2082531
>>       rx_queue_1_bytes: 3152932314
>>       tx_queue_0_packets: 2770841
>>       tx_queue_0_bytes: 4194955474
>>       tx_queue_1_packets: 3084697
>>       tx_queue_1_bytes: 4670196372
>>
>> Signed-off-by: Toshiaki Makita<makita.toshiaki@lab.ntt.co.jp>
> 
> Thanks for the patch. This is not the first patch that wants to do this.
> Maybe you can go through the previous comments (E.g there's one in
> https://patchwork.ozlabs.org/patch/244413/).

Thanks for pointing it. I was not aware of it.
I checked the previous comments but not sure if there is anything to do.
The main logic is coincidentally the same - splitting percpu stats data
structure into rx and tx ones in receive_queue and send_queue, which has
been acked.
Comments from Ben Hutchings is about stats ordering but it seems no
consensus has been formed since then.
Feedback from Michael is kind of nit-picking ones and does not apply to
mine. He also requested other stats items like vm-exit, but not in the
same patch.
You requested performance numbers. Would you like to have it for this
patch as well?

> For me, I'd expect to add some XDP counters like other device did.

Nice to have it. I guess it would be another patch.

-- 
Toshiaki Makita

^ permalink raw reply

* Re: [PATCH bpf-next 0/2] bpftool improvements for xlated dump
From: Alexei Starovoitov @ 2017-12-21  2:16 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, jakub.kicinski, quentin.monnet
In-Reply-To: <20171220124257.4512-1-daniel@iogearbox.net>

On Wed, Dec 20, 2017 at 01:42:55PM +0100, Daniel Borkmann wrote:
> This work adds correlation of maps and calls into the bpftool
> xlated dump in order to help debugging and introspection of
> loaded BPF progs. First patch makes kallsyms work on subprogs
> with bpf calls, and second implements the actual correlation.
> Details and example output can be found in the 2nd patch.

Applied, thanks Daniel!

^ permalink raw reply

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Siwei Liu @ 2017-12-21  2:16 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, sridhar.samudrala, mst, netdev, virtualization,
	alexander.duyck, jesse.brandeburg
In-Reply-To: <20171219104159.30d9caaf@xeon-e3>

On Tue, Dec 19, 2017 at 10:41 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 19 Dec 2017 13:21:17 -0500 (EST)
> David Miller <davem@davemloft.net> wrote:
>
>> From: Stephen Hemminger <stephen@networkplumber.org>
>> Date: Tue, 19 Dec 2017 09:55:48 -0800
>>
>> > could be 10ms, just enough to let udev do its renaming
>>
>> Please, move to some kind of notification or event based handling of
>> this problem.
>>
>> No delay is safe, what if userspace gets swapped out or whatever
>> else might make userspace stall unexpectedly?
>>
>
> The plan is to remove the delay and do the naming in the kernel.
> This was suggested by Lennart since udev is only doing naming policy
> because kernel names were not repeatable.
>
> This makes the VF show up as "ethN_vf" on Hyper-V which is user friendly.
>
> Patch is pending.

While it's good to show VF with specific naming to indicate
enslavement, I wonder wouldn't it be better to hide this netdev at all
from the user space? IMHO this extra device is useless when being
enslaved and we may delegate controls (e.g. ethtool) over to the
para-virtual device instead? That way it's possible to eliminate the
possibility of additional udev setup or modification?

I'm not sure if this  is consistent with Windows guest or not, but I
don't find it _Linux_ user friendly that ethtool doesn't work on the
composite interface any more, and I have to end up with finding out
the correct enslaved VF I must operate on.

Regards,
-Siwei

^ permalink raw reply

* Re: [PATCH bpf 0/9] bpf: verifier security fixes
From: Daniel Borkmann @ 2017-12-21  2:20 UTC (permalink / raw)
  To: Alexei Starovoitov, David S . Miller
  Cc: Jann Horn, Edward Cree, netdev, kernel-team
In-Reply-To: <20171219041201.1979983-1-ast@kernel.org>

On 12/19/2017 05:11 AM, Alexei Starovoitov wrote:
> This patch set addresses a set of security vulnerabilities
> in bpf verifier logic discovered by Jann Horn.
> All of the patches are candidates for 4.14 stable.
> 
> Alexei Starovoitov (1):
>   bpf: fix integer overflows
> 
> Edward Cree (1):
>   bpf/verifier: fix bounds calculation on BPF_RSH
> 
> Jann Horn (7):
>   bpf: fix incorrect sign extension in check_alu_op()
>   bpf: fix incorrect tracking of register size truncation
>   bpf: fix 32-bit ALU op verification
>   bpf: fix missing error return in check_stack_boundary()
>   bpf: force strict alignment checks for stack pointers
>   bpf: don't prune branches when a scalar is replaced with a pointer
>   selftests/bpf: add tests for recent bugfixes
> 
>  include/linux/bpf_verifier.h                |   4 +-
>  kernel/bpf/verifier.c                       | 175 ++++++---
>  tools/testing/selftests/bpf/test_verifier.c | 549 +++++++++++++++++++++++++++-
>  3 files changed, 661 insertions(+), 67 deletions(-)

Series applied to bpf tree and queued for stable, thanks everyone!

^ permalink raw reply

* Re: [PATCH bpf] bpf: do not allow root to mangle valid pointers
From: Daniel Borkmann @ 2017-12-21  2:20 UTC (permalink / raw)
  To: Alexei Starovoitov, David S . Miller
  Cc: Jann Horn, Edward Cree, netdev, kernel-team
In-Reply-To: <20171219041520.2009449-1-ast@kernel.org>

On 12/19/2017 05:15 AM, Alexei Starovoitov wrote:
> Do not allow root to convert valid pointers into unknown scalars.
> In particular disallow:
>  ptr &= reg
>  ptr <<= reg
>  ptr += ptr
> and explicitly allow:
>  ptr -= ptr
> since pkt_end - pkt == length
> 
> 1.
> This minimizes amount of address leaks root can do.
> In the future may need to further tighten the leaks with kptr_restrict.
> 
> 2.
> If program has such pointer math it's likely a user mistake and
> when verifier complains about it right away instead of many instructions
> later on invalid memory access it's easier for users to fix their progs.
> 
> 3.
> when register holding a pointer cannot change to scalar it allows JITs to
> optimize better. Like 32-bit archs could use single register for pointers
> instead of a pair required to hold 64-bit scalars.
> 
> 4.
> reduces architecture dependent behavior. Since code:
> r1 = r10;
> r1 &= 0xff;
> if (r1 ...)
> will behave differently arm64 vs x64 and offloaded vs native.
> 
> A significant chunk of ptr mangling was allowed by
> commit f1174f77b50c ("bpf/verifier: rework value tracking")
> yet some of it was allowed even earlier.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Series applied to bpf tree, thanks Alexei!

^ permalink raw reply

* Re: [PATCH v3 next-queue 05/10] ixgbe: add ipsec offload add and remove SA
From: Marcelo Ricardo Leitner @ 2017-12-21  2:21 UTC (permalink / raw)
  To: Shannon Nelson
  Cc: intel-wired-lan, jeffrey.t.kirsher, steffen.klassert,
	sowmini.varadhan, netdev, saeedm, borisp, ilant
In-Reply-To: <85fe1cd2-e935-52dc-e36b-0973e4c03653@oracle.com>

On Wed, Dec 20, 2017 at 05:39:13PM -0800, Shannon Nelson wrote:
> On 12/20/2017 5:17 PM, Marcelo Ricardo Leitner wrote:
> > Hi,
> > 
> > On Tue, Dec 19, 2017 at 03:59:57PM -0800, Shannon Nelson wrote:
> > > +}
> > > +
> > > +static const struct xfrmdev_ops ixgbe_xfrmdev_ops = {
> > > +	.xdo_dev_state_add = ixgbe_ipsec_add_sa,
> > > +	.xdo_dev_state_delete = ixgbe_ipsec_del_sa,
> > > +};
> > > +
> > 
> > This struct is only declared if XFRM_OFFLOAD is selected. What is
> > selecting it for ixgbe driver?
> > mlx5 driver has an extra option for ipsec offload and it then does
> > 'depends on XFRM_OFFLOAD'
> > 
> >    Marcelo
> > 
> 
> I didn't bother putting a 'depends' item in the ixgbe's Kconfig entry, and I
> didn't create an extra CONFIG variable to enable ixgbe's support of the
> offload.  If CONFIG_XFRM_OFFLOAD is set, then ixgbe will support it.

You handled it via Makefile, okay. Missed it on patch 2, my bad.

Anyhow, we probably could use some standard here across the vendors
here.  With this patchset, we have 2 drivers supporting it, and 2
different ways to configure it.

  Marcelo

^ permalink raw reply

* Re: [PATCH V3 net-next 00/17] add some features and fix some bugs for HNS3 driver
From: lipeng (Y) @ 2017-12-21  2:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, linuxarm, salil.mehta
In-Reply-To: <20171220.142843.568343410203899321.davem@davemloft.net>



On 2017/12/21 3:28, David Miller wrote:
> From: Lipeng <lipeng321@huawei.com>
> Date: Wed, 20 Dec 2017 16:43:02 +0800
>
>> This patchset adds some new feature support and fixes some bugs:
>> [Patch 1/17 - 5/17] add the support to modify/query the tqp number
>> through ethtool -L/l command, and also fix some related bugs for
>> change tqp number.
>> [Patch 6/17 - 9-17] add support vlan tag offload on tx&&rx direction
>> for pf, and fix some related bugs.
>> [patch 10/17 - 11/17] fix bugs for auto negotiation.
>> [patch 12/17] adds support for ethtool command set_pauseparam.
>> [patch 13/17 - 14/17] add support to update flow control settings after
>> autoneg.
>> [patch 15/17 - 17/17] fix some other bugs in net-next.
> In your From: email field, as well as you and your colleagues signoffs
> you are not using your full, real, name:
>
> ====================
> From: Lipeng <lipeng321@huawei.com>

My first name is 'Peng' and Surname is 'Li'.
But it is usually spelled as 'Lipeng' in one go when referring.
This is quite usual convention with full names originating from
China. Surnames comes first, followed by the first name and they
both are inseparable while they are written as well.
"Lipeng" is my full real name,  and my sign-off's appear like above.


The name "lipeng"  is very common in china, many guys in HuaWei have the
same name "lipeng",
So my email is lipeng321@huawei.com,  to Distinguish from others.

other guys who named "lipeng" in huawei ,  their email have an id too,
such as lipengxxx@huawei.com.

Wish the explanation is clear.

Thanks,
Lipeng


> ====================
> Signed-off-by: qumingguang <qumingguang@huawei.com>
> Signed-off-by: Lipeng <lipeng321@huawei.com>
> ====================
>
> Please fix this.
>
> .
>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox