Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 net-next 3/4] net: make listified RX functions return number of good packets
From: Eric Dumazet @ 2018-09-07  2:27 UTC (permalink / raw)
  To: Edward Cree, davem; +Cc: linux-net-drivers, netdev
In-Reply-To: <d601bec7-0eaa-133d-f021-a340e38c45ed@solarflare.com>



On 09/06/2018 07:26 AM, Edward Cree wrote:
> Signed-off-by: Edward Cree <ecree@solarflare.com>


Lack of changelog here ?

I do not know what is a good packet.

You are adding a lot of conditional expressions, that cpu
will mispredict quite often.

Typical micro benchmarks wont really notice.

^ permalink raw reply

* Re: [PATCH v2 net-next 0/4] net: batched receive in GRO path
From: Eric Dumazet @ 2018-09-07  2:32 UTC (permalink / raw)
  To: Edward Cree, davem; +Cc: linux-net-drivers, netdev
In-Reply-To: <c1e79c86-56ae-98c6-8dc0-c227f91ee9bc@solarflare.com>



On 09/06/2018 07:24 AM, Edward Cree wrote:
> This series listifies part of GRO processing, in a manner which allows those
>  packets which are not GROed (i.e. for which dev_gro_receive returns
>  GRO_NORMAL) to be passed on to the listified regular receive path.
> I have not listified dev_gro_receive() itself, or the per-protocol GRO
>  callback, since GRO's need to hold packets on lists under napi->gro_hash
>  makes keeping the packets on other lists awkward, and since the GRO control
>  block state of held skbs can refer only to one 'new' skb at a time.
>  Nonetheless the batching of the calling code yields some performance gains
>  in the GRO case as well.
> 
> Herewith the performance figures obtained in a NetPerf TCP stream test (with
>  four streams, and irqs bound to a single core):
> net-next: 7.166 Gbit/s (sigma 0.435)
> after #2: 7.715 Gbit/s (sigma 0.145) = datum + 7.7%
> after #4: 7.890 Gbit/s (sigma 0.217) = datum + 10.1%
> (Note that the 'net-next' results were distinctly bimodal, with two results
>  of about 8 Gbit/s and the remaining ten around 7 Gbit/s.  I don't have a
>  good explanation for this.)
> And with GRO disabled through ethtool -K (thus simulating traffic which is
>  not GRO-able but, being TCP, is still passed to the GRO entry point):
> net-next: 4.756 Gbit/s (sigma 0.240)
> after #4: 5.355 Gbit/s (sigma 0.232) = datum + 12.6%
> 
> v2: Rebased on latest net-next.  Removed RFC tags.  Otherwise unchanged
>  owing to lack of comments on v1.
>

Your performance numbers are not convincing, since TCP stream test should
get nominal GRO gains.

Adding this complexity and icache pressure needs more experimental results.

What about RPC workloads  (eg 100 concurrent netperf -t TCP_RR -- -r 8000,8000 )

Thanks.

^ permalink raw reply

* Re: [PATCH] [RFC v2] Drop all 00-INDEX files from Documentation/
From: Greg Kroah-Hartman @ 2018-09-07  7:21 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Mark Rutland, Linux MIPS Mailing List,
	Linux Fbdev development list, Jan Kandziora,
	Radim Krčmář, kvm, Linux Doc Mailing List,
	Peter Zijlstra, James Hogan, Mark Brown, Henrik Austad,
	Will Deacon, dri-devel, Masahiro Yamada, devicetree,
	Paul Mackerras, Henrik Austad, Pavel Machek, H. Peter Anvin,
	Evgeniy Polyakov, Ian Kent, linux-s390, Paul Moore
In-Reply-To: <CAKMK7uHoeB89-VVS8qVaoNiP_0waHHJ=dFCUgXkRDTnRkXz69g@mail.gmail.com>

On Thu, Sep 06, 2018 at 11:39:42PM +0200, Daniel Vetter wrote:
> On Thu, Sep 6, 2018 at 6:01 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Thu, 6 Sep 2018 09:58:04 -0600
> > Jonathan Corbet <corbet@lwn.net> wrote:
> >
> >> Thanks,
> >>
> >> jon  (who is increasingly inclined to apply this patch)
> >
> > As Colin Kaepernick now says... "Just do it!"
> >
> > ;-)
> 
> +1
> 
> But I'm biased, I'm part of the party that is responsible for the new
> shiny documentation system ...

I am not responsible for any of the new shiny documentation system, and
I think this is a good idea:

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply

* Re: [PATCH v2 2/2] net: ethernet: i40evf: fix potential build error
From: kbuild test robot @ 2018-09-07  7:25 UTC (permalink / raw)
  To: Wang Dongsheng
  Cc: kbuild-all, jeffrey.t.kirsher, sergei.shtylyov, jacob.e.keller,
	davem, intel-wired-lan, netdev, linux-kernel, Wang Dongsheng
In-Reply-To: <1536114430-21356-2-git-send-email-dongsheng.wang@hxt-semitech.com>

[-- Attachment #1: Type: text/plain, Size: 1122 bytes --]

Hi Wang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on jkirsher-next-queue/dev-queue]
[cannot apply to v4.19-rc2 next-20180906]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Wang-Dongsheng/net-ethernet-i40e-fix-build-error/20180907-063122
base:   https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git dev-queue
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/intel/i40evf/i40evf_ethtool.o: In function `__i40e_add_stat_strings':
>> i40evf_ethtool.c:(.text+0xeb0): multiple definition of `__i40e_add_stat_strings'
   drivers/net/ethernet/intel/i40e/i40e_ethtool.o:i40e_ethtool.c:(.text+0x5e20): first defined here

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 64499 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 09/11] tuntap: accept an array of XDP buffs through sendmsg()
From: Jason Wang @ 2018-09-07  7:33 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180906134834-mutt-send-email-mst@kernel.org>



On 2018年09月07日 01:51, Michael S. Tsirkin wrote:
> On Thu, Sep 06, 2018 at 12:05:24PM +0800, Jason Wang wrote:
>> This patch implement TUN_MSG_PTR msg_control type. This type allows
>> the caller to pass an array of XDP buffs to tuntap through ptr field
>> of the tun_msg_control. If an XDP program is attached, tuntap can run
>> XDP program directly. If not, tuntap will build skb and do a fast
>> receiving since part of the work has been done by vhost_net.
>>
>> This will avoid lots of indirect calls thus improves the icache
>> utilization and allows to do XDP batched flushing when doing XDP
>> redirection.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Is most of the benefit in batched flushing or skipping
> indirect calls? Because if it's flushing we can gain
> most of it easily by adding an analog of xmit_more.
>

Should be both. XDP_DROP doesn't flush but it gives obvious improvement.

Thanks

^ permalink raw reply

* Re: [PATCH 2/7] mark root hnode explicitly
From: Cong Wang @ 2018-09-07  2:57 UTC (permalink / raw)
  To: Al Viro; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <20180906105933.GU19965@ZenIV.linux.org.uk>

On Thu, Sep 6, 2018 at 3:59 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> index 3f985f29ef30..d14048e38b5c 100644
> --- a/net/sched/cls_u32.c
> +++ b/net/sched/cls_u32.c
> @@ -84,6 +84,7 @@ struct tc_u_hnode {
>         int                     refcnt;
>         unsigned int            divisor;
>         struct idr              handle_idr;
> +       bool                    is_root;
>         struct rcu_head         rcu;
>         u32                     flags;
>         /* The 'ht' field MUST be the last field in structure to allow for
> @@ -377,6 +378,7 @@ static int u32_init(struct tcf_proto *tp)
>         root_ht->refcnt++;
>         root_ht->handle = tp_c ? gen_new_htid(tp_c, root_ht) : 0x80000000;
>         root_ht->prio = tp->prio;
> +       root_ht->is_root = true;
>         idr_init(&root_ht->handle_idr);
>
>         if (tp_c == NULL) {
> @@ -693,7 +695,7 @@ static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
>                 goto out;
>         }
>
> -       if (root_ht == ht) {
> +       if (ht->is_root) {


What's wrong with comparing pointers with root ht?


>                 NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
>                 return -EINVAL;
>         }
> @@ -795,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
>                                 NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
>                                 return -EINVAL;
>                         }
> +                       if (ht_down->is_root) {

root ht is saved in tp->root, so you can compare ht_down with it too,
if you want.

If this check is all what you need, you don't need an extra flag.

^ permalink raw reply

* Re: [PATCH net-next 11/11] vhost_net: batch submitting XDP buffers to underlayer sockets
From: Jason Wang @ 2018-09-07  7:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180906122857-mutt-send-email-mst@kernel.org>



On 2018年09月07日 00:46, Michael S. Tsirkin wrote:
> On Thu, Sep 06, 2018 at 12:05:26PM +0800, Jason Wang wrote:
>> This patch implements XDP batching for vhost_net. The idea is first to
>> try to do userspace copy and build XDP buff directly in vhost. Instead
>> of submitting the packet immediately, vhost_net will batch them in an
>> array and submit every 64 (VHOST_NET_BATCH) packets to the under layer
>> sockets through msg_control of sendmsg().
>>
>> When XDP is enabled on the TUN/TAP, TUN/TAP can process XDP inside a
>> loop without caring GUP thus it can do batch map flushing. When XDP is
>> not enabled or not supported, the underlayer socket need to build skb
>> and pass it to network core. The batched packet submission allows us
>> to do batching like netif_receive_skb_list() in the future.
>>
>> This saves lots of indirect calls for better cache utilization. For
>> the case that we can't so batching e.g when sndbuf is limited or
>> packet size is too large, we will go for usual one packet per
>> sendmsg() way.
>>
>> Doing testpmd on various setups gives us:
>>
>> Test                /+pps%
>> XDP_DROP on TAP     /+44.8%
>> XDP_REDIRECT on TAP /+29%
>> macvtap (skb)       /+26%
>>
>> Netperf tests shows obvious improvements for small packet transmission:
>>
>> size/session/+thu%/+normalize%
>>     64/     1/   +2%/    0%
>>     64/     2/   +3%/   +1%
>>     64/     4/   +7%/   +5%
>>     64/     8/   +8%/   +6%
>>    256/     1/   +3%/    0%
>>    256/     2/  +10%/   +7%
>>    256/     4/  +26%/  +22%
>>    256/     8/  +27%/  +23%
>>    512/     1/   +3%/   +2%
>>    512/     2/  +19%/  +14%
>>    512/     4/  +43%/  +40%
>>    512/     8/  +45%/  +41%
>>   1024/     1/   +4%/    0%
>>   1024/     2/  +27%/  +21%
>>   1024/     4/  +38%/  +73%
>>   1024/     8/  +15%/  +24%
>>   2048/     1/  +10%/   +7%
>>   2048/     2/  +16%/  +12%
>>   2048/     4/    0%/   +2%
>>   2048/     8/    0%/   +2%
>>   4096/     1/  +36%/  +60%
>>   4096/     2/  -11%/  -26%
>>   4096/     4/    0%/  +14%
>>   4096/     8/    0%/   +4%
>> 16384/     1/   -1%/   +5%
>> 16384/     2/    0%/   +2%
>> 16384/     4/    0%/   -3%
>> 16384/     8/    0%/   +4%
>> 65535/     1/    0%/  +10%
>> 65535/     2/    0%/   +8%
>> 65535/     4/    0%/   +1%
>> 65535/     8/    0%/   +3%
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   drivers/vhost/net.c | 164 ++++++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 151 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index fb01ce6d981c..1dd4239cbff8 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -116,6 +116,7 @@ struct vhost_net_virtqueue {
>>   	 * For RX, number of batched heads
>>   	 */
>>   	int done_idx;
>> +	int batched_xdp;
> Pls add a comment documenting what does this new field do.

Ok.

>
>>   	/* an array of userspace buffers info */
>>   	struct ubuf_info *ubuf_info;
>>   	/* Reference counting for outstanding ubufs.
>> @@ -123,6 +124,7 @@ struct vhost_net_virtqueue {
>>   	struct vhost_net_ubuf_ref *ubufs;
>>   	struct ptr_ring *rx_ring;
>>   	struct vhost_net_buf rxq;
>> +	struct xdp_buff xdp[VHOST_NET_BATCH];
> This is a bit too much to have inline. 64 entries 48 bytes each, and we
> have 2 of these structures, that's already >4K.

Let me allocate it elsewhere.

>
>>   };
>>   
>>   struct vhost_net {
>> @@ -338,6 +340,11 @@ static bool vhost_sock_zcopy(struct socket *sock)
>>   		sock_flag(sock->sk, SOCK_ZEROCOPY);
>>   }
>>   
>> +static bool vhost_sock_xdp(struct socket *sock)
>> +{
>> +	return sock_flag(sock->sk, SOCK_XDP);
> what if an xdp program is attached while this processing
> is going on? Flag value will be wrong - is this safe
> and why? Pls add a comment.

Ok, let me add comment to explain. It's safe and may just lead few 
packets to be dropped if XDP program want to extend packet's header.

>
>> +}
>> +
>>   /* In case of DMA done not in order in lower device driver for some reason.
>>    * upend_idx is used to track end of used idx, done_idx is used to track head
>>    * of used idx. Once lower device DMA done contiguously, we will signal KVM
>> @@ -444,10 +451,36 @@ static void vhost_net_signal_used(struct vhost_net_virtqueue *nvq)
>>   	nvq->done_idx = 0;
>>   }
>>   
>> +static void vhost_tx_batch(struct vhost_net *net,
>> +			   struct vhost_net_virtqueue *nvq,
>> +			   struct socket *sock,
>> +			   struct msghdr *msghdr)
>> +{
>> +	struct tun_msg_ctl ctl = {
>> +		.type = nvq->batched_xdp << 16 | TUN_MSG_PTR,
>> +		.ptr = nvq->xdp,
>> +	};
>> +	int err;
>> +
>> +	if (nvq->batched_xdp == 0)
>> +		goto signal_used;
>> +
>> +	msghdr->msg_control = &ctl;
>> +	err = sock->ops->sendmsg(sock, msghdr, 0);
>> +	if (unlikely(err < 0)) {
>> +		vq_err(&nvq->vq, "Fail to batch sending packets\n");
>> +		return;
>> +	}
>> +
>> +signal_used:
>> +	vhost_net_signal_used(nvq);
>> +	nvq->batched_xdp = 0;
>> +}
>> +
>>   static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>>   				    struct vhost_net_virtqueue *nvq,
>>   				    unsigned int *out_num, unsigned int *in_num,
>> -				    bool *busyloop_intr)
>> +				    struct msghdr *msghdr, bool *busyloop_intr)
>>   {
>>   	struct vhost_virtqueue *vq = &nvq->vq;
>>   	unsigned long uninitialized_var(endtime);
>> @@ -455,8 +488,9 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>>   				  out_num, in_num, NULL, NULL);
>>   
>>   	if (r == vq->num && vq->busyloop_timeout) {
>> +		/* Flush batched packets first */
>>   		if (!vhost_sock_zcopy(vq->private_data))
>> -			vhost_net_signal_used(nvq);
>> +			vhost_tx_batch(net, nvq, vq->private_data, msghdr);
>>   		preempt_disable();
>>   		endtime = busy_clock() + vq->busyloop_timeout;
>>   		while (vhost_can_busy_poll(endtime)) {
>> @@ -512,7 +546,7 @@ static int get_tx_bufs(struct vhost_net *net,
>>   	struct vhost_virtqueue *vq = &nvq->vq;
>>   	int ret;
>>   
>> -	ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, busyloop_intr);
>> +	ret = vhost_net_tx_get_vq_desc(net, nvq, out, in, msg, busyloop_intr);
>>   
>>   	if (ret < 0 || ret == vq->num)
>>   		return ret;
>> @@ -540,6 +574,83 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len)
>>   	       !vhost_vq_avail_empty(vq->dev, vq);
>>   }
>>   
>> +#define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
> I wonder whether NET_IP_ALIGN make sense for XDP.

XDP is not the only consumer, socket may build skb based on this.

>
>> +
>> +static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>> +			       struct iov_iter *from)
>> +{
>> +	struct vhost_virtqueue *vq = &nvq->vq;
>> +	struct socket *sock = vq->private_data;
>> +	struct page_frag *alloc_frag = &current->task_frag;
>> +	struct virtio_net_hdr *gso;
>> +	struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
>> +	size_t len = iov_iter_count(from);
>> +	int headroom = vhost_sock_xdp(sock) ? XDP_PACKET_HEADROOM : 0;
>> +	int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +	int pad = SKB_DATA_ALIGN(VHOST_NET_RX_PAD + headroom + nvq->sock_hlen);
>> +	int sock_hlen = nvq->sock_hlen;
>> +	void *buf;
>> +	int copied;
>> +
>> +	if (unlikely(len < nvq->sock_hlen))
>> +		return -EFAULT;
>> +
>> +	if (SKB_DATA_ALIGN(len + pad) +
>> +	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) > PAGE_SIZE)
>> +		return -ENOSPC;
>> +
>> +	buflen += SKB_DATA_ALIGN(len + pad);
>> +	alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
>> +	if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
>> +		return -ENOMEM;
>> +
>> +	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
>> +
>> +	/* We store two kinds of metadata in the header which will be
>> +	 * used for XDP_PASS to do build_skb():
>> +	 * offset 0: buflen
>> +	 * offset sizeof(int): vnet header
> Define a struct for the metadata then?

Ok.

>
>
>> +	 */
>> +	copied = copy_page_from_iter(alloc_frag->page,
>> +				     alloc_frag->offset + sizeof(int),
>> +				     sock_hlen, from);
>> +	if (copied != sock_hlen)
>> +		return -EFAULT;
>> +
>> +	gso = (struct virtio_net_hdr *)(buf + sizeof(int));
>> +
>> +	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
>> +	    vhost16_to_cpu(vq, gso->csum_start) +
>> +	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
>> +	    vhost16_to_cpu(vq, gso->hdr_len)) {
>> +		gso->hdr_len = cpu_to_vhost16(vq,
>> +			       vhost16_to_cpu(vq, gso->csum_start) +
>> +			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
>> +
>> +		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
>> +			return -EINVAL;
>> +	}
>> +
>> +	len -= sock_hlen;
>> +	copied = copy_page_from_iter(alloc_frag->page,
>> +				     alloc_frag->offset + pad,
>> +				     len, from);
>> +	if (copied != len)
>> +		return -EFAULT;
>> +
>> +	xdp->data_hard_start = buf;
>> +	xdp->data = buf + pad;
>> +	xdp->data_end = xdp->data + len;
>> +	*(int *)(xdp->data_hard_start) = buflen;
>> +
>> +	get_page(alloc_frag->page);
>> +	alloc_frag->offset += buflen;
>> +
>> +	++nvq->batched_xdp;
>> +
>> +	return 0;
>> +}
>> +
>>   static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>   {
>>   	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
>> @@ -556,10 +667,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>   	size_t len, total_len = 0;
>>   	int err;
>>   	int sent_pkts = 0;
>> +	bool bulking = (sock->sk->sk_sndbuf == INT_MAX);
> What does bulking mean?

The name is misleading, it means whether we can do batching. For 
simplicity, I disable batching is sndbuf is not INT_MAX.

>>   
>>   	for (;;) {
>>   		bool busyloop_intr = false;
>>   
>> +		if (nvq->done_idx == VHOST_NET_BATCH)
>> +			vhost_tx_batch(net, nvq, sock, &msg);
>> +
>>   		head = get_tx_bufs(net, nvq, &msg, &out, &in, &len,
>>   				   &busyloop_intr);
>>   		/* On error, stop handling until the next kick. */
>> @@ -577,14 +692,34 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>   			break;
>>   		}
>>   
>> -		vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
>> -		vq->heads[nvq->done_idx].len = 0;
>> -
>>   		total_len += len;
>> -		if (tx_can_batch(vq, total_len))
>> -			msg.msg_flags |= MSG_MORE;
>> -		else
>> -			msg.msg_flags &= ~MSG_MORE;
>> +
>> +		/* For simplicity, TX batching is only enabled if
>> +		 * sndbuf is unlimited.
> What if sndbuf changes while this processing is going on?

We will get the correct sndbuf in the next run of handle_tx(). I think 
this is safe.

Thanks

>
>> +		 */
>> +		if (bulking) {
>> +			err = vhost_net_build_xdp(nvq, &msg.msg_iter);
>> +			if (!err) {
>> +				goto done;
>> +			} else if (unlikely(err != -ENOSPC)) {
>> +				vhost_tx_batch(net, nvq, sock, &msg);
>> +				vhost_discard_vq_desc(vq, 1);
>> +				vhost_net_enable_vq(net, vq);
>> +				break;
>> +			}
>> +
>> +			/* We can't build XDP buff, go for single
>> +			 * packet path but let's flush batched
>> +			 * packets.
>> +			 */
>> +			vhost_tx_batch(net, nvq, sock, &msg);
>> +			msg.msg_control = NULL;
>> +		} else {
>> +			if (tx_can_batch(vq, total_len))
>> +				msg.msg_flags |= MSG_MORE;
>> +			else
>> +				msg.msg_flags &= ~MSG_MORE;
>> +		}
>>   
>>   		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>>   		err = sock->ops->sendmsg(sock, &msg, len);
>> @@ -596,15 +731,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>>   		if (err != len)
>>   			pr_debug("Truncated TX packet: len %d != %zd\n",
>>   				 err, len);
>> -		if (++nvq->done_idx >= VHOST_NET_BATCH)
>> -			vhost_net_signal_used(nvq);
>> +done:
>> +		vq->heads[nvq->done_idx].id = cpu_to_vhost32(vq, head);
>> +		vq->heads[nvq->done_idx].len = 0;
>> +		++nvq->done_idx;
>>   		if (vhost_exceeds_weight(++sent_pkts, total_len)) {
>>   			vhost_poll_queue(&vq->poll);
>>   			break;
>>   		}
>>   	}
>>   
>> -	vhost_net_signal_used(nvq);
>> +	vhost_tx_batch(net, nvq, sock, &msg);
>>   }
>>   
>>   static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>> @@ -1111,6 +1248,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
>>   		n->vqs[i].ubuf_info = NULL;
>>   		n->vqs[i].upend_idx = 0;
>>   		n->vqs[i].done_idx = 0;
>> +		n->vqs[i].batched_xdp = 0;
>>   		n->vqs[i].vhost_hlen = 0;
>>   		n->vqs[i].sock_hlen = 0;
>>   		n->vqs[i].rx_ring = NULL;
>> -- 
>> 2.17.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH 2/7] mark root hnode explicitly
From: Al Viro @ 2018-09-07  3:04 UTC (permalink / raw)
  To: Cong Wang; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <CAM_iQpUDpKB2y0Yy5YJjj7MALGsDLJKpdXYZkydmCo1N_uo=FA@mail.gmail.com>

On Thu, Sep 06, 2018 at 07:57:25PM -0700, Cong Wang wrote:

> > -       if (root_ht == ht) {
> > +       if (ht->is_root) {
> 
> 
> What's wrong with comparing pointers with root ht?

The fact that there may be more than one tcf_proto sharing tp->data.

> >                 NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
> >                 return -EINVAL;
> >         }
> > @@ -795,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
> >                                 NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
> >                                 return -EINVAL;
> >                         }
> > +                       if (ht_down->is_root) {
> 
> root ht is saved in tp->root, so you can compare ht_down with it too,
> if you want.
> 
> If this check is all what you need, you don't need an extra flag.

Again, *which* tp?  We can trivially check that we are not linking to/deleting
our own root, sure.  But there's nothing to stop doing the same via another
tcf_proto...

^ permalink raw reply

* Re: [PATCH 2/7] mark root hnode explicitly
From: Cong Wang @ 2018-09-07  3:23 UTC (permalink / raw)
  To: Al Viro; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <20180907030438.GX19965@ZenIV.linux.org.uk>

On Thu, Sep 6, 2018 at 8:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Sep 06, 2018 at 07:57:25PM -0700, Cong Wang wrote:
>
> > > -       if (root_ht == ht) {
> > > +       if (ht->is_root) {
> >
> >
> > What's wrong with comparing pointers with root ht?
>
> The fact that there may be more than one tcf_proto sharing tp->data.

Hmm? root ht is from tp->root, not tp->data.

Also, this very important information is missing in your one-line changelog...



>
> > >                 NL_SET_ERR_MSG_MOD(extack, "Not allowed to delete root node");
> > >                 return -EINVAL;
> > >         }
> > > @@ -795,6 +797,10 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
> > >                                 NL_SET_ERR_MSG_MOD(extack, "Link hash table not found");
> > >                                 return -EINVAL;
> > >                         }
> > > +                       if (ht_down->is_root) {
> >
> > root ht is saved in tp->root, so you can compare ht_down with it too,
> > if you want.
> >
> > If this check is all what you need, you don't need an extra flag.
>
> Again, *which* tp?  We can trivially check that we are not linking to/deleting


Pretty sure there is a 'tp' in u32_set_parms() parameter list.

Are you saying it is not what you want? If so, why?

More importantly, why this information is again missing in your
changelog? This patch is definitely not trivial, it deserves a detailed
changelog.


> our own root, sure.  But there's nothing to stop doing the same via another
> tcf_proto...

To my best knowledge, the place where you set ->is_root=true
is precisely same with where we set tp->root=root_ht, and it doesn't
change after set. What am I missing here?

^ permalink raw reply

* Re: [PATCH] PCI: Reprogram bridge prefetch registers on resume
From: Daniel Drake @ 2018-09-07  8:06 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Mika Westerberg, Linux PM, linux-pci-u79uwXL29TY76Z2rM5mHXA,
	Wysocki, Rafael J, nic_swsd-Rasf1IRRPZFBDgjK7y7TUQ, Keith Busch,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Bjorn Helgaas,
	Andy Shevchenko, Linux Upstreaming Team,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, Jon Derrick,
	hkallweit1-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <3b37e4fd-c793-bd52-7d70-950f846a7d5a-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

On Fri, Sep 7, 2018 at 2:40 PM, Sinan Kaya <okaya@kernel.org> wrote:
> On 9/6/2018 10:36 PM, Daniel Drake wrote:
>>
>> +       if (pci_dev->class == PCI_CLASS_BRIDGE_PCI << 8)
>> +               pci_setup_bridge_mmio_pref(pci_dev);
>
>
> This should probably some kind of a quirk rather than default
> for the listed card as it sounds like you are dealing with
> broken hardware.

With that approach there's a sizeable list that your quirk list is
incomplete or out of date.

And when the bug bites, it's extremely cryptic. We've spent months
working on this issue and only found this magic register write mostly
through a stroke of good luck. Separately there's been a flurry of
mails around the r8169 MSI-X problem but as far as I can see nobody
suggested even looking at the values of the parent bridge prefetch
registers. And even if they did, they'd probably have said "values are
fine, nothing to see here" (exactly as we did 4 months ago when Nvidia
mentioned these registers as a possible cause - oops!).

So here I'm instead following a suggestion from Bjorn, after also
having confirmed the windows behaviour:

https://marc.info/?l=linux-pci&m=153574276126484&w=2
> Can we tell whether Windows rewrites this register unconditionally at
> resume-time?  If so, it may be more robust for Linux to do the same.
> The whole thing is black magic, which I hate, but if it's our only
> choice, it may be better to have this applied everywhere so we don't
> keep stubbing our toes on new systems that require the quirk.

Also, we just spoke to Asus BIOS engineers who told us that the BIOS
does already restore the PCI bridge prefetch registers on resume. I
guess this is why the other registers like the (non-zero) prefetch
base address lower 32 bits do have the right value on resume even
before my patch. It sounds like a more subtle bug related to register
write timing or sequence, in that case it will be harder to define who
is responsible for the breakage and hence under which conditions the
quirk should apply.

Daniel
_______________________________________________
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply

* Re: [PATCH net-next 07/11] tuntap: move XDP flushing out of tun_do_xdp()
From: Jason Wang @ 2018-09-07  3:31 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20180906132209-mutt-send-email-mst@kernel.org>



On 2018年09月07日 01:48, Michael S. Tsirkin wrote:
> On Thu, Sep 06, 2018 at 12:05:22PM +0800, Jason Wang wrote:
>> This will allow adding batch flushing on top.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   drivers/net/tun.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 21b125020b3b..ff1cbf3ebd50 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1646,7 +1646,6 @@ static u32 tun_do_xdp(struct tun_struct *tun,
>>   	switch (act) {
>>   	case XDP_REDIRECT:
>>   		*err = xdp_do_redirect(tun->dev, xdp, xdp_prog);
>> -		xdp_do_flush_map();
>>   		if (*err)
>>   			break;
>>   		goto out;
>> @@ -1735,6 +1734,9 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
>>   		act = tun_do_xdp(tun, tfile, xdp_prog, &xdp, &err);
>>   		if (err)
>>   			goto err_xdp;
>> +
>> +		if (act == XDP_REDIRECT)
>> +			xdp_do_flush_map();
>>   		if (act != XDP_PASS)
>>   			goto out;
> At this point the switch statement which used to contain all XDP things
> seems to be gone completely. Just rewrite with a bunch of if statements
> and all xdp handling spread out to where it makes sense?

But tun_do_xdp() will be reused in the batching path.

Thanks

>
>> -- 
>> 2.17.1

^ permalink raw reply

* Re: [PATCH net-next v3 1/2] netlink: ipv4 igmp join notifications
From: Roopa Prabhu @ 2018-09-07  3:40 UTC (permalink / raw)
  To: Patrick Ruddy; +Cc: netdev, Jiří Pírko, Stephen Hemminger
In-Reply-To: <20180906091056.21109-1-pruddy@vyatta.att-mail.com>

On Thu, Sep 6, 2018 at 2:10 AM, Patrick Ruddy
<pruddy@vyatta.att-mail.com> wrote:
> Some userspace applications need to know about IGMP joins from the
> kernel for 2 reasons:
> 1. To allow the programming of multicast MAC filters in hardware
> 2. To form a multicast FORUS list for non link-local multicast
>    groups to be sent to the kernel and from there to the interested
>    party.
> (1) can be fulfilled but simply sending the hardware multicast MAC
> address to be programmed but (2) requires the L3 address to be sent
> since this cannot be constructed from the MAC address whereas the
> reverse translation is a standard library function.
>
> This commit provides addition and deletion of multicast addresses
> using the RTM_NEWMDB and RTM_DELMDB messages with AF_INET. It also
> provides the RTM_GETMDB extension to allow multicast join state to
> be read from the kernel.
>
> Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
> ---
> v3 rework to use RTM_***MDB messages as per review comments.

Patrick, this version seems to be using RTM_***MDB msgs with the
RTM_*ADDR format.
We cant do that...because existing RTM_MDB users will be confused.

My request was to evaluate RTM_***MDB msg format. see
nlmsg_populate_mdb_fill for details.

If you can wait a day or two I can share some experimental code that
moves high level RTM_*MDB msg handling into net/core/rtnetlink.c
similar to RTM_*FDB



>
>  net/ipv4/igmp.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 139 insertions(+)
>
> diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
> index 4da39446da2d..aed819e2ea93 100644
> --- a/net/ipv4/igmp.c
> +++ b/net/ipv4/igmp.c
> @@ -86,6 +86,7 @@
>  #include <linux/inetdevice.h>
>  #include <linux/igmp.h>
>  #include <linux/if_arp.h>
> +#include <net/netlink.h>
>  #include <linux/rtnetlink.h>
>  #include <linux/times.h>
>  #include <linux/pkt_sched.h>
> @@ -1385,6 +1386,91 @@ static void ip_mc_hash_remove(struct in_device *in_dev,
>  }
>
>
> +static int fill_addr(struct sk_buff *skb, struct net_device *dev, __be32 addr,
> +                    int type, unsigned int flags)
> +{
> +       struct nlmsghdr *nlh;
> +       struct ifaddrmsg *ifm;
> +
> +       nlh = nlmsg_put(skb, 0, 0, type, sizeof(*ifm), flags);
> +       if (!nlh)
> +               return -EMSGSIZE;
> +
> +       ifm = nlmsg_data(nlh);
> +       ifm->ifa_family = AF_INET;
> +       ifm->ifa_prefixlen = 32;
> +       ifm->ifa_flags = IFA_F_PERMANENT;
> +       ifm->ifa_scope = RT_SCOPE_LINK;
> +       ifm->ifa_index = dev->ifindex;
> +
> +       if (nla_put_in_addr(skb, IFA_ADDRESS, addr))
> +               goto nla_put_failure;
> +       nlmsg_end(skb, nlh);
> +       return 0;
> +
> +nla_put_failure:
> +       nlmsg_cancel(skb, nlh);
> +       return -EMSGSIZE;
> +}
> +
> +static inline size_t addr_nlmsg_size(void)
> +{
> +       return NLMSG_ALIGN(sizeof(struct ifaddrmsg))
> +               + nla_total_size(sizeof(__be32));
> +}
> +
> +static void ip_mc_addr_notify(struct net_device *dev, __be32 addr, int type)
> +{
> +       struct net *net = dev_net(dev);
> +       struct sk_buff *skb;
> +       int err = -ENOBUFS;
> +
> +       skb = nlmsg_new(addr_nlmsg_size(), GFP_ATOMIC);
> +       if (!skb)
> +               goto errout;
> +
> +       err = fill_addr(skb, dev, addr, type, 0);
> +       if (err < 0) {
> +               WARN_ON(err == -EMSGSIZE);
> +               kfree_skb(skb);
> +               goto errout;
> +       }
> +       rtnl_notify(skb, net, 0, RTNLGRP_MDB, NULL, GFP_ATOMIC);
> +       return;
> +errout:
> +       if (err < 0)
> +               rtnl_set_sk_err(net, RTNLGRP_MDB, err);
> +}
> +
> +int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb,
> +                     struct net_device *dev)
> +{
> +       int s_idx;
> +       int idx = 0;
> +       struct ip_mc_list *im;
> +       struct in_device *in_dev;
> +
> +       ASSERT_RTNL();
> +
> +       s_idx = cb->args[2];
> +       in_dev = __in_dev_get_rtnl(dev);
> +
> +       for_each_pmc_rtnl(in_dev, im) {
> +               if (idx < s_idx)
> +                       continue;
> +               if (fill_addr(skb, dev, im->multiaddr, RTM_NEWMDB,
> +                             NLM_F_MULTI) < 0)
> +                       goto done;
> +               nl_dump_check_consistent(cb, nlmsg_hdr(skb));
> +               idx++;
> +       }
> +
> + done:
> +       cb->args[2] = idx;
> +
> +       return skb->len;
> +}
> +
>  /*
>   *     A socket has joined a multicast group on device dev.
>   */
> @@ -1430,6 +1516,8 @@ static void __ip_mc_inc_group(struct in_device *in_dev, __be32 addr,
>         igmpv3_del_delrec(in_dev, im);
>  #endif
>         igmp_group_added(im);
> +
> +       ip_mc_addr_notify(in_dev->dev, addr, RTM_NEWMDB);
>         if (!in_dev->dead)
>                 ip_rt_multicast_event(in_dev);
>  out:
> @@ -1661,6 +1749,8 @@ void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
>                                 in_dev->mc_count--;
>                                 igmp_group_dropped(i);
>                                 ip_mc_clear_src(i);
> +                               ip_mc_addr_notify(in_dev->dev, addr,
> +                                                 RTM_DELMDB);
>
>                                 if (!in_dev->dead)
>                                         ip_rt_multicast_event(in_dev);
> @@ -3051,6 +3141,53 @@ static struct notifier_block igmp_notifier = {
>         .notifier_call = igmp_netdev_event,
>  };
>
> +static int igmp_mc_dump_ifaddrs(struct sk_buff *skb,
> +                               struct netlink_callback *cb)
> +{
> +       struct net *net = sock_net(skb->sk);
> +       int h, s_h;
> +       int idx, s_idx;
> +       struct net_device *dev;
> +       struct in_device *in_dev;
> +       struct hlist_head *head;
> +
> +       s_h = cb->args[0];
> +       idx = cb->args[1];
> +       s_idx = idx;
> +
> +       for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
> +               idx = 0;
> +               head = &net->dev_index_head[h];
> +               rcu_read_lock();
> +               cb->seq = atomic_read(&net->ipv4.dev_addr_genid) ^
> +                         net->dev_base_seq;
> +               hlist_for_each_entry_rcu(dev, head, index_hlist) {
> +                       if (idx < s_idx)
> +                               goto cont;
> +                       if (h > s_h || idx > s_idx)
> +                               cb->args[2] = 0;
> +                       in_dev = __in_dev_get_rcu(dev);
> +                       if (!in_dev)
> +                               goto cont;
> +
> +                       /* loop over multicast addresses */
> +                       if (ip_mc_dump_ifaddr(skb, cb, dev) < 0) {
> +                               rcu_read_unlock();
> +                               goto done;
> +                       }
> +cont:
> +                       idx++;
> +               }
> +               rcu_read_unlock();
> +       }
> +
> +done:
> +       cb->args[0] = h;
> +       cb->args[1] = idx;
> +
> +       return skb->len;
> +}
> +
>  int __init igmp_mc_init(void)
>  {
>  #if defined(CONFIG_PROC_FS)
> @@ -3064,6 +3201,8 @@ int __init igmp_mc_init(void)
>                 goto reg_notif_fail;
>         return 0;
>
> +       rtnl_register(PF_INET, RTM_GETMDB, NULL, igmp_mc_dump_ifaddrs, 0);
> +
>  reg_notif_fail:
>         unregister_pernet_subsys(&igmp_net_ops);
>         return err;
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH 2/7] mark root hnode explicitly
From: Al Viro @ 2018-09-07  3:49 UTC (permalink / raw)
  To: Cong Wang; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <CAM_iQpV7H8v3Oebftd0_aHzwwU_Phr3-7a=Un2-TRmuuZr7VOw@mail.gmail.com>

On Thu, Sep 06, 2018 at 08:23:36PM -0700, Cong Wang wrote:

> Pretty sure there is a 'tp' in u32_set_parms() parameter list.
> 
> Are you saying it is not what you want? If so, why?
> 
> More importantly, why this information is again missing in your
> changelog? This patch is definitely not trivial, it deserves a detailed
> changelog.
> 
> 
> > our own root, sure.  But there's nothing to stop doing the same via another
> > tcf_proto...
> 
> To my best knowledge, the place where you set ->is_root=true
> is precisely same with where we set tp->root=root_ht, and it doesn't
> change after set. What am I missing here?

The fact that there can be two (or more) different tcf_proto instances sharing
->data, but not ->root.  And since ->data is shared, u32_get() on one tp
will be able to return you ->root of *ANOTHER* one.  So comparison with
tp->root doesn't protect you.  Try this on mainline:

tc qdisc add dev eth0 ingress
tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: u32 divisor 1
tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: u32 divisor 1
tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 801: u32

and watch the fun as soon as you get an incoming packet on eth0.  That panic
is fixed by 1/7, but you get "Not allowed to delete root node" for removing
_your_ root, with "Can not delete in-use filter" for other's root (as in the
last line of the reproducer).

^ permalink raw reply

* [PATCH] netfilter: masquerade: don't flush all conntracks if only one address deleted on device
From: Tan Hu @ 2018-09-07  8:33 UTC (permalink / raw)
  To: pablo, kadlec, fw, davem, kuznet, yoshfuji
  Cc: netfilter-devel, coreteam, netdev, linux-kernel, zhong.weidong,
	jiang.biao2

We configured iptables as below, which only allowed incoming data on
established connections:

iptables -t mangle -A PREROUTING -m state --state ESTABLISHED -j ACCEPT
iptables -t mangle -P PREROUTING DROP

When deleting a secondary address, current masquerade implements would
flush all conntracks on this device. All the established connections on
primary address also be deleted, then subsequent incoming data on the
connections would be dropped wrongly because it was identified as NEW
connection.

So when an address was delete, it should only flush connections related
with the address.

Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
---
 net/ipv4/netfilter/nf_nat_masquerade_ipv4.c | 22 +++++++++++++++++++---
 net/ipv6/netfilter/nf_nat_masquerade_ipv6.c | 19 ++++++++++++++++---
 2 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c b/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c
index ad3aeff..a9d5e01 100644
--- a/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c
+++ b/net/ipv4/netfilter/nf_nat_masquerade_ipv4.c
@@ -104,12 +104,26 @@ static int masq_device_event(struct notifier_block *this,
 	return NOTIFY_DONE;
 }
 
+static int inet_cmp(struct nf_conn *ct, void *ptr)
+{
+	struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
+	struct net_device *dev = ifa->ifa_dev->dev;
+	struct nf_conntrack_tuple *tuple;
+
+	if (!device_cmp(ct, (void *)(long)dev->ifindex))
+		return 0;
+
+	tuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+
+	return ifa->ifa_address == tuple->dst.u3.ip;
+}
+
 static int masq_inet_event(struct notifier_block *this,
 			   unsigned long event,
 			   void *ptr)
 {
 	struct in_device *idev = ((struct in_ifaddr *)ptr)->ifa_dev;
-	struct netdev_notifier_info info;
+	struct net *net = dev_net(idev->dev);
 
 	/* The masq_dev_notifier will catch the case of the device going
 	 * down.  So if the inetdev is dead and being destroyed we have
@@ -119,8 +133,10 @@ static int masq_inet_event(struct notifier_block *this,
 	if (idev->dead)
 		return NOTIFY_DONE;
 
-	netdev_notifier_info_init(&info, idev->dev);
-	return masq_device_event(this, event, &info);
+	if (event == NETDEV_DOWN)
+		nf_ct_iterate_cleanup_net(net, inet_cmp, ptr, 0, 0);
+
+	return NOTIFY_DONE;
 }
 
 static struct notifier_block masq_dev_notifier = {
diff --git a/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c b/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c
index e6eb7cf..3e4bf22 100644
--- a/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c
+++ b/net/ipv6/netfilter/nf_nat_masquerade_ipv6.c
@@ -87,18 +87,30 @@ static struct notifier_block masq_dev_notifier = {
 struct masq_dev_work {
 	struct work_struct work;
 	struct net *net;
+	struct in6_addr addr;
 	int ifindex;
 };
 
+static int inet_cmp(struct nf_conn *ct, void *work)
+{
+	struct masq_dev_work *w = (struct masq_dev_work *)work;
+	struct nf_conntrack_tuple *tuple;
+
+	if (!device_cmp(ct, (void *)(long)w->ifindex))
+		return 0;
+
+	tuple = &ct->tuplehash[IP_CT_DIR_REPLY].tuple;
+
+	return ipv6_addr_equal(&w->addr, &tuple->dst.u3.in6);
+}
+
 static void iterate_cleanup_work(struct work_struct *work)
 {
 	struct masq_dev_work *w;
-	long index;
 
 	w = container_of(work, struct masq_dev_work, work);
 
-	index = w->ifindex;
-	nf_ct_iterate_cleanup_net(w->net, device_cmp, (void *)index, 0, 0);
+	nf_ct_iterate_cleanup_net(w->net, inet_cmp, (void *)w, 0, 0);
 
 	put_net(w->net);
 	kfree(w);
@@ -147,6 +159,7 @@ static int masq_inet_event(struct notifier_block *this,
 		INIT_WORK(&w->work, iterate_cleanup_work);
 		w->ifindex = dev->ifindex;
 		w->net = net;
+		w->addr = ifa->addr;
 		schedule_work(&w->work);
 
 		return NOTIFY_DONE;
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH 2/7] mark root hnode explicitly
From: Cong Wang @ 2018-09-07  4:14 UTC (permalink / raw)
  To: Al Viro; +Cc: Jamal Hadi Salim, Linux Kernel Network Developers, Jiri Pirko
In-Reply-To: <20180907034958.GZ19965@ZenIV.linux.org.uk>

On Thu, Sep 6, 2018 at 8:50 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Sep 06, 2018 at 08:23:36PM -0700, Cong Wang wrote:
>
> > Pretty sure there is a 'tp' in u32_set_parms() parameter list.
> >
> > Are you saying it is not what you want? If so, why?
> >
> > More importantly, why this information is again missing in your
> > changelog? This patch is definitely not trivial, it deserves a detailed
> > changelog.
> >
> >
> > > our own root, sure.  But there's nothing to stop doing the same via another
> > > tcf_proto...
> >
> > To my best knowledge, the place where you set ->is_root=true
> > is precisely same with where we set tp->root=root_ht, and it doesn't
> > change after set. What am I missing here?
>
> The fact that there can be two (or more) different tcf_proto instances sharing
> ->data, but not ->root.  And since ->data is shared, u32_get() on one tp

They have to share tp->data, that is how we link hashtables together.


> will be able to return you ->root of *ANOTHER* one.  So comparison with
> tp->root doesn't protect you.  Try this on mainline:

Hmm, it is not u32_get(), it is u32_lookup_ht() which could get another
root ht... I see.


>
> tc qdisc add dev eth0 ingress
> tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: u32 divisor 1
> tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: u32 divisor 1
> tc filter delete dev eth0 parent ffff: protocol ip prio 100 handle 801: u32
>
> and watch the fun as soon as you get an incoming packet on eth0.  That panic
> is fixed by 1/7, but you get "Not allowed to delete root node" for removing
> _your_ root, with "Can not delete in-use filter" for other's root (as in the
> last line of the reproducer).

Sure, please consider:

1. adding such a test case to tools/testing/selftests/tc-testing/
2. adding it in your changelog

This would save a lot of time for both of us. I don't need to ask you for
this if it is in the changelog, you don't have to explain it again.

This is a win-win.

^ permalink raw reply

* Re: [PATCH 6/7] get rid of tc_u_common ->rcu
From: Cong Wang @ 2018-09-07  4:18 UTC (permalink / raw)
  To: Al Viro; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <20180905190414.5477-6-viro@ZenIV.linux.org.uk>

On Wed, Sep 5, 2018 at 12:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> From: Al Viro <viro@zeniv.linux.org.uk>
>
> unused
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  net/sched/cls_u32.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> index 8a1a573487bd..be9240ae1417 100644
> --- a/net/sched/cls_u32.c
> +++ b/net/sched/cls_u32.c
> @@ -98,7 +98,6 @@ struct tc_u_common {
>         int                     refcnt;
>         struct idr              handle_idr;
>         struct hlist_node       hnode;
> -       struct rcu_head         rcu;
>  };

Just FYI:

This was added when RCU was introduced to u32 fast path,
it looks like on fast path we never touch tc_u_common, we
only use tp->root, all the rest are slow paths with RTNL lock,
so it is probably fine to just remove it rather than converting
that kfree() to kfree_rcu().

^ permalink raw reply

* Re: [PATCH net-next v2] openvswitch: Derive IP protocol number for IPv6 later frags
From: Pravin Shelar @ 2018-09-07  4:22 UTC (permalink / raw)
  To: Yi-Hung Wei; +Cc: Linux Kernel Network Developers, William Tu
In-Reply-To: <1536100421-16360-1-git-send-email-yihung.wei@gmail.com>

On Tue, Sep 4, 2018 at 3:37 PM Yi-Hung Wei <yihung.wei@gmail.com> wrote:
>
> Currently, OVS only parses the IP protocol number for the first
> IPv6 fragment, but sets the IP protocol number for the later fragments
> to be NEXTHDF_FRAGMENT.  This patch tries to derive the IP protocol
> number for the IPV6 later frags so that we can match that.
>
> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>

Thanks.

^ permalink raw reply

* Re: [PATCH 6/7] get rid of tc_u_common ->rcu
From: Al Viro @ 2018-09-07  4:28 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko
In-Reply-To: <CAM_iQpWnydTkLYwmZq2gQPnNBTS6E0iNNgy1zsmYFcXkKJAnTw@mail.gmail.com>

On Thu, Sep 06, 2018 at 09:18:47PM -0700, Cong Wang wrote:
> On Wed, Sep 5, 2018 at 12:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > From: Al Viro <viro@zeniv.linux.org.uk>
> >
> > unused
> >
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > ---
> >  net/sched/cls_u32.c | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> > index 8a1a573487bd..be9240ae1417 100644
> > --- a/net/sched/cls_u32.c
> > +++ b/net/sched/cls_u32.c
> > @@ -98,7 +98,6 @@ struct tc_u_common {
> >         int                     refcnt;
> >         struct idr              handle_idr;
> >         struct hlist_node       hnode;
> > -       struct rcu_head         rcu;
> >  };
> 
> Just FYI:
> 
> This was added when RCU was introduced to u32 fast path,
> it looks like on fast path we never touch tc_u_common, we
> only use tp->root, all the rest are slow paths with RTNL lock,
> so it is probably fine to just remove it rather than converting
> that kfree() to kfree_rcu().

*nod*

In any case, if u32_classify() grows that dereference, we can always
re-add ->rcu at the same time.

^ permalink raw reply

* Re: [PATCH net-next v2] openvswitch: Derive IP protocol number for IPv6 later frags
From: David Miller @ 2018-09-07  4:48 UTC (permalink / raw)
  To: yihung.wei; +Cc: netdev, pshelar, u9012063
In-Reply-To: <1536100421-16360-1-git-send-email-yihung.wei@gmail.com>

From: Yi-Hung Wei <yihung.wei@gmail.com>
Date: Tue,  4 Sep 2018 15:33:41 -0700

> Currently, OVS only parses the IP protocol number for the first
> IPv6 fragment, but sets the IP protocol number for the later fragments
> to be NEXTHDF_FRAGMENT.  This patch tries to derive the IP protocol
> number for the IPV6 later frags so that we can match that.
> 
> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>

Applied.

^ permalink raw reply

* Re: [Patch net v3] tipc: call start and done ops directly in __tipc_nl_compat_dumpit()
From: David Miller @ 2018-09-07  4:50 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, tipc-discussion, jon.maloy, ying.xue
In-Reply-To: <20180904215455.3985-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Tue,  4 Sep 2018 14:54:55 -0700

> __tipc_nl_compat_dumpit() uses a netlink_callback on stack,
> so the only way to align it with other ->dumpit() call path
> is calling tipc_dump_start() and tipc_dump_done() directly
> inside it. Otherwise ->dumpit() would always get NULL from
> cb->args[].
> 
> But tipc_dump_start() uses sock_net(cb->skb->sk) to retrieve
> net pointer, the cb->skb here doesn't set skb->sk, the net pointer
> is saved in msg->net instead, so introduce a helper function
> __tipc_dump_start() to pass in msg->net.
> 
> Ying pointed out cb->args[0...3] are already used by other
> callbacks on this call path, so we can't use cb->args[0] any
> more, use cb->args[4] instead.
> 
> Fixes: 9a07efa9aea2 ("tipc: switch to rhashtable iterator")
> Reported-and-tested-by: syzbot+e93a2c41f91b8e2c7d9b@syzkaller.appspotmail.com
> Cc: Jon Maloy <jon.maloy@ericsson.com>
> Cc: Ying Xue <ying.xue@windriver.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thanks Cong.

^ permalink raw reply

* Re: [PATCH net-next] bnxt_en: remove set but not used variable 'addr_type'
From: David Miller @ 2018-09-07  4:55 UTC (permalink / raw)
  To: yuehaibing; +Cc: michael.chan, netdev, kernel-janitors
In-Reply-To: <1536147850-186781-1-git-send-email-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Wed, 5 Sep 2018 11:44:10 +0000

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c: In function 'bnxt_tc_parse_flow':
> drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c:186:6: warning:
>  variable 'addr_type' set but not used [-Wunused-but-set-variable]
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Looks correct to me.

Applied, thanks.

^ permalink raw reply

* Re: [Nouveau] [PATCH] PCI: Reprogram bridge prefetch registers on resume
From: Lukas Wunner @ 2018-09-07  5:49 UTC (permalink / raw)
  To: Daniel Drake
  Cc: bhelgaas, mika.westerberg, linux-pm, linux-pci, rafael.j.wysocki,
	nic_swsd, keith.busch, netdev, nouveau, andriy.shevchenko, linux,
	davem, jonathan.derrick, hkallweit1
In-Reply-To: <20180907053614.6540-1-drake@endlessm.com>

On Fri, Sep 07, 2018 at 01:36:14PM +0800, Daniel Drake wrote:
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -934,6 +934,7 @@ struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn);
>  void pci_device_add(struct pci_dev *dev, struct pci_bus *bus);
>  unsigned int pci_scan_child_bus(struct pci_bus *bus);
>  void pci_bus_add_device(struct pci_dev *dev);
> +void pci_setup_bridge_mmio_pref(struct pci_dev *bridge);
>  void pci_read_bridge_bases(struct pci_bus *child);
>  struct resource *pci_find_parent_resource(const struct pci_dev *dev,
>  					  struct resource *res);

Since this is only used internally in the PCI core, the declaration
can live in drivers/pci/pci.h.

Thanks,

Lukas

^ permalink raw reply

* Re: [PATCH 09/12] iwlwifi: Remove local iwl_bcast_addr and use ether_broadcast_addr
From: Luciano Coelho @ 2018-09-07 10:46 UTC (permalink / raw)
  To: Joe Perches, linux-kernel
  Cc: Johannes Berg, Emmanuel Grumbach, Intel Linux Wireless,
	Kalle Valo, linux-wireless, netdev
In-Reply-To: <ebe5bbb03ca9c39e468d7f40d67d58733062a1f9.camel@intel.com>

On Sat, 2018-07-07 at 10:40 +0300, Luciano Coelho wrote:
> On Sat, 2018-03-31 at 00:05 -0700, Joe Perches wrote:
> > Use the new global to save a little bit of object code.
> > 
> > Signed-off-by: Joe Perches <joe@perches.com>
> > ---
> 
> I took this one to our internal tree and it will be applied as part
> of our normal upstreaming process.

It seems that the patch to etherdevice.h never got applied upstream, so
I'm actually dropping this patch.

Please resend if it becomes relevant again.

--
Cheers,
Luca.

^ permalink raw reply

* Re: [Xen-devel] [PATCH] xen-netfront: wait xenbus state change when load module manually
From: Jiri Slaby @ 2018-09-07 11:06 UTC (permalink / raw)
  To: Boris Ostrovsky, Juergen Gross, Xiao Liang, David Miller
  Cc: netdev, linux-kernel, xen-devel
In-Reply-To: <1a6b2326-78b3-f7fa-fb3b-08c54f4f9761@oracle.com>

On 08/24/2018, 04:26 PM, Boris Ostrovsky wrote:
> On 08/24/2018 07:26 AM, Juergen Gross wrote:
>> On 24/08/18 13:12, Jiri Slaby wrote:
>>> On 07/30/2018, 10:18 AM, Xiao Liang wrote:
>>>> On 07/29/2018 11:30 PM, David Miller wrote:
>>>>> From: Xiao Liang <xiliang@redhat.com>
>>>>> Date: Fri, 27 Jul 2018 17:56:08 +0800
>>>>>
>>>>>> @@ -1330,6 +1331,11 @@ static struct net_device
>>>>>> *xennet_create_dev(struct xenbus_device *dev)
>>>>>>       netif_carrier_off(netdev);
>>>>>>         xenbus_switch_state(dev, XenbusStateInitialising);
>>>>>> +    wait_event(module_load_q,
>>>>>> +               xenbus_read_driver_state(dev->otherend) !=
>>>>>> +               XenbusStateClosed &&
>>>>>> +               xenbus_read_driver_state(dev->otherend) !=
>>>>>> +               XenbusStateUnknown);
>>>>>>       return netdev;
>>>>>>      exit:
>>>>> What performs the wakeups that will trigger for this sleep site?
>>>> In my understanding, backend leaving closed/unknow state can trigger the
>>>> wakeups. I mean to make sure both sides are ready for creating connection.
>>> While backporting this to 4.12, I was surprised by the commit the same
>>> as Boris and David.
>>>
>>> So I assume the explanation is that wake_up_all of module_unload_q in
>>> netback_changed wakes also all the processes waiting on module_load_q?
>>> If so, what makes sure that module_unload_q is queued and the process is
>>> the same as for module_load_q?
>> How could it? Either the thread is waiting on module_unload_q _or_ on
>> module_load_q. It can't wait on two queues at the same time.
>>
>>> To me, it looks rather error-prone. Unless it is erroneous now, at least
>>> for future changes. Wouldn't it make sense to wake up module_load_q
>>> along with module_unload_q in netback_changed? Or drop module_load_q
>>> completely and use only module_unload_q (i.e. in xennet_create_dev too)?
>> To me this looks just wrong. A thread waiting on module_load_q won't be
>> woken up again.
>>
>> I'd drop module_load_q in favor of module_unload_q.
> 
> 
> Yes, use single queue, but rename it to something more neutral. module_wq?

Can somebody who is actually using the module fix this, please?

I could fix it, but untested changes are "a bit" worse than tested changes.

thanks,
-- 
js
suse labs

^ permalink raw reply

* Re: [RFC PATCH bpf-next v2 0/4] Implement bpf queue/stack maps
From: Joe Stringer @ 2018-09-07  6:27 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: mauricio.vasquez, ast, daniel, netdev, Joe Stringer
In-Reply-To: <20180907001317.fj7f6fg6ihljompp@ast-mbp.dhcp.thefacebook.com>

On Thu, 6 Sep 2018 at 17:13, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> bpf_map_pop_elem() is trying to do lookup_and_delete and preserve
> validity of value without races.
> With pcpu_freelist I don't think there is a solution.
> We can have this queue/stack map without prealloc and use kmalloc/kfree
> back and forth. Performance will not be as great, but for your use case,
> I suspect, it will be good enough.
> The key issue with kmalloc/kfree is unbounded time of rcu callbacks.
> If somebody starts doing push/pop for every packet, the rcu subsystem
> will struggle and nothing we can do about it.
>
> The only way I could think of to resolve this problem is to reuse
> the logic that Joe is working on for socket lookups inside the program.
> Joe,
> how is that going? Could you repost the latest patches?

I can rebase & send them out. Was just wanting to get a little more testing in.

Cheers,
Joe

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox