Netdev List
* Re: [RFC PATCH 1/3] riscv: set HAVE_EFFICIENT_UNALIGNED_ACCESS
From: Palmer Dabbelt @ 2019-01-29  2:43 UTC
  To: Jim Wilson
  Cc: bjorn.topel, Christoph Hellwig, linux-riscv, davidlee, daniel,
	netdev
In-Reply-To: <CAFyWVaaAg9TuNVPbNmP0ongf6y2hmqFMiXJZHo8K84O4-BV0yw@mail.gmail.com>

On Fri, 25 Jan 2019 17:33:50 PST (-0800), Jim Wilson wrote:
> On Fri, Jan 25, 2019 at 12:21 PM Palmer Dabbelt <palmer@sifive.com> wrote:
>> Jim, would you be opposed to something like this?
>
> This looks OK to me.

OK, thanks.  I'll send some patches around :)

>
>>     +    builtin_define_with_int_value ("__riscv_tune_misaligned_load_cost",
>>     +                                   riscv_tune_info->slow_unaligned_access ? 1024 : 1);
>>     +    builtin_define_with_int_value ("__riscv_tune_misaligned_store_cost",
>>     +                                   riscv_tune_info->slow_unaligned_access ? 1024 : 1);
>
> It would be nice to have a better way to compute these values, maybe
> an extra field in the tune structure, but we can always worry about
> that later when we need it.

I agree.  I just went and designed the external interface first and hid the 
ugliness here.  The internal interfaces are easier to change :)
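
For illustration only, here is a minimal sketch of how code outside the
compiler might consume the proposed define. The macro name is taken from the
patch quoted above and is only a proposal in this thread; the fallback value,
the MISALIGNED_LOAD_COST wrapper, the "cost <= 1 means cheap" threshold and
the bytes_equal() helper are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the compiler may or may not provide the proposed macro.
 * When it does not, be conservative and treat misaligned loads as slow. */
#ifdef __riscv_tune_misaligned_load_cost
#define MISALIGNED_LOAD_COST __riscv_tune_misaligned_load_cost
#else
#define MISALIGNED_LOAD_COST 1024
#endif

/* Hypothetical threshold: a cost of 1 means "as cheap as an aligned load". */
#define CHEAP_MISALIGNED_LOADS (MISALIGNED_LOAD_COST <= 1)

/* Compare two buffers of n bytes; returns 1 if equal, 0 otherwise. */
static int bytes_equal(const unsigned char *a, const unsigned char *b, size_t n)
{
	size_t i = 0;

#if CHEAP_MISALIGNED_LOADS
	/* Word-at-a-time path: only worthwhile if misaligned loads are cheap. */
	for (; i + sizeof(uint64_t) <= n; i += sizeof(uint64_t)) {
		uint64_t x, y;

		memcpy(&x, a + i, sizeof(x));
		memcpy(&y, b + i, sizeof(y));
		if (x != y)
			return 0;
	}
#endif
	/* Byte-wise path: handles the tail, or everything on slow targets. */
	for (; i < n; i++)
		if (a[i] != b[i])
			return 0;
	return 1;
}
```

Both paths return the same result; the define only selects which one is
compiled in.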


* Re: [PATCH v2 net 7/7] virtio_net: Differentiate sk_buff and xdp_frame on freeing
From: Toshiaki Makita @ 2019-01-29  2:35 UTC
  To: Jason Wang, David S. Miller, Michael S. Tsirkin; +Cc: netdev, virtualization
In-Reply-To: <6267c42f-ce1e-1b63-2af9-84a76090f686@redhat.com>

On 2019/01/29 11:23, Jason Wang wrote:
> On 2019/1/29 8:45 AM, Toshiaki Makita wrote:
...
>> @@ -2666,10 +2696,10 @@ static void free_unused_bufs(struct virtnet_info *vi)
>>       for (i = 0; i < vi->max_queue_pairs; i++) {
>>           struct virtqueue *vq = vi->sq[i].vq;
>>           while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
>> -            if (!is_xdp_raw_buffer_queue(vi, i))
>> +            if (!is_xdp_frame(buf))
> 
> 
> I believe this is the last user of is_xdp_raw_buffer_queue(), maybe you
> can send a patch on top to remove it.

Actually patch2 added new users of it ;)

> 
> 
>>                   dev_kfree_skb(buf);
>>               else
>> -                xdp_return_frame(buf);
>> +                xdp_return_frame(ptr_to_xdp(buf));
>>           }
>>       }
>>   
> 
> 
> Acked-by: Jason Wang <jasowang@redhat.com>
> 

Thanks!

-- 
Toshiaki Makita



* Re: [PATCH net-next V4 0/5] vhost: accelerate metadata access through vmap()
From: Jason Wang @ 2019-01-29  2:34 UTC
  To: Michael S. Tsirkin, David Miller
  Cc: virtualization, netdev, linux-kernel, kvm
In-Reply-To: <20190126193126-mutt-send-email-mst@kernel.org>


On 2019/1/27 8:31 AM, Michael S. Tsirkin wrote:
> On Sat, Jan 26, 2019 at 02:37:08PM -0800, David Miller wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Date: Wed, 23 Jan 2019 17:55:52 +0800
>>
>>> This series tries to access virtqueue metadata through kernel virtual
>>> address instead of copy_user() friends since they have too much
>>> overhead, such as checks, spec barriers or even hardware feature
>>> toggling.
>>>
>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>> cases as well.
>> I've read over the discussion of patch #5 a few times.
>>
>> And it seems to me that, at a minimum, a few things still need to
>> be resolved:
>>
>> 1) More perf data added to commit message.


Ok.


>>
>> 2) Whether invalidate_range_start() and invalidate_range_end() must
>>     be paired.


vhost doesn't need an invalidate_range_end() because we have a fallback
to copy_to_user() and friends. So there's no requirement to set up the
mapping in range_end() or to lock the vq between range_start() and
range_end(). We delay setting up the vmap until it is really used in
vhost_meta_prefetch(), and we hold mmap_sem while setting it up, which
guarantees there's no intermediate state at that time.


>
> Add dirty tracking.


I think this could be solved by introducing e.g.
vhost_meta_prefetch_done() at the end of handle_tx()/handle_rx() and
calling set_page_dirty() for used pages instead of the tricks of
classifying VMAs. (As I saw, hugetlbfs has its own set-dirty method.)

Thanks


>
>> Etc.  So I am marking this series "Changes Requested".


* Re: [PATCH v2 net 7/7] virtio_net: Differentiate sk_buff and xdp_frame on freeing
From: Jason Wang @ 2019-01-29  2:23 UTC
  To: Toshiaki Makita, David S. Miller, Michael S. Tsirkin
  Cc: netdev, virtualization
In-Reply-To: <1548722759-2470-8-git-send-email-makita.toshiaki@lab.ntt.co.jp>


On 2019/1/29 8:45 AM, Toshiaki Makita wrote:
> We do not reset or free up unused buffers when enabling/disabling XDP,
> so it can happen that xdp_frames are freed after disabling XDP or
> sk_buffs are freed after enabling XDP on xdp tx queues.
> Thus we need to handle both forms (xdp_frames and sk_buffs) regardless
> of XDP setting.
> One way to trigger this problem is to disable XDP when napi_tx is
> enabled. In that case, virtnet_xdp_set() calls virtnet_napi_enable()
> which kicks NAPI. The NAPI handler will call virtnet_poll_cleantx()
> which invokes free_old_xmit_skbs() for queues which have been used by
> XDP.
>
> Note that even with this change we need to keep skipping
> free_old_xmit_skbs() from NAPI handlers when XDP is enabled, because XDP
> tx queues do not acquire queue locks.
>
> - v2: Use napi_consume_skb() instead of dev_consume_skb_any()
>
> Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
> NOTE: Dropped Acked-by because of the v2 change.
>
>   drivers/net/virtio_net.c | 54 +++++++++++++++++++++++++++++++++++++-----------
>   1 file changed, 42 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 1d454ce..2594481 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -57,6 +57,8 @@
>   #define VIRTIO_XDP_TX		BIT(0)
>   #define VIRTIO_XDP_REDIR	BIT(1)
>   
> +#define VIRTIO_XDP_FLAG	BIT(0)
> +
>   /* RX packet size EWMA. The average packet size is used to determine the packet
>    * buffer size when refilling RX rings. As the entire RX ring may be refilled
>    * at once, the weight is chosen so that the EWMA will be insensitive to short-
> @@ -252,6 +254,21 @@ struct padded_vnet_hdr {
>   	char padding[4];
>   };
>   
> +static bool is_xdp_frame(void *ptr)
> +{
> +	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> +}
> +
> +static void *xdp_to_ptr(struct xdp_frame *ptr)
> +{
> +	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
> +}
> +
> +static struct xdp_frame *ptr_to_xdp(void *ptr)
> +{
> +	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
> +}
> +
>   /* Converting between virtqueue no. and kernel tx/rx queue no.
>    * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
>    */
> @@ -462,7 +479,8 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
>   
>   	sg_init_one(sq->sg, xdpf->data, xdpf->len);
>   
> -	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
> +	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdpf),
> +				   GFP_ATOMIC);
>   	if (unlikely(err))
>   		return -ENOSPC; /* Caller handle free/refcnt */
>   
> @@ -482,13 +500,13 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>   {
>   	struct virtnet_info *vi = netdev_priv(dev);
>   	struct receive_queue *rq = vi->rq;
> -	struct xdp_frame *xdpf_sent;
>   	struct bpf_prog *xdp_prog;
>   	struct send_queue *sq;
>   	unsigned int len;
>   	int drops = 0;
>   	int kicks = 0;
>   	int ret, err;
> +	void *ptr;
>   	int i;
>   
>   	/* Only allow ndo_xdp_xmit if XDP is loaded on dev, as this
> @@ -507,8 +525,12 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>   	}
>   
>   	/* Free up any pending old buffers before queueing new ones. */
> -	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
> -		xdp_return_frame(xdpf_sent);
> +	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		if (likely(is_xdp_frame(ptr)))
> +			xdp_return_frame(ptr_to_xdp(ptr));
> +		else
> +			napi_consume_skb(ptr, false);
> +	}
>   
>   	for (i = 0; i < n; i++) {
>   		struct xdp_frame *xdpf = frames[i];
> @@ -1329,18 +1351,26 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
>   
>   static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
>   {
> -	struct sk_buff *skb;
>   	unsigned int len;
>   	unsigned int packets = 0;
>   	unsigned int bytes = 0;
> +	void *ptr;
>   
> -	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		pr_debug("Sent skb %p\n", skb);
> +	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> +		if (likely(!is_xdp_frame(ptr))) {
> +			struct sk_buff *skb = ptr;
>   
> -		bytes += skb->len;
> -		packets++;
> +			pr_debug("Sent skb %p\n", skb);
>   
> -		napi_consume_skb(skb, in_napi);
> +			bytes += skb->len;
> +			napi_consume_skb(skb, in_napi);
> +		} else {
> +			struct xdp_frame *frame = ptr_to_xdp(ptr);
> +
> +			bytes += frame->len;
> +			xdp_return_frame(frame);
> +		}
> +		packets++;
>   	}
>   
>   	/* Avoid overhead when no packets have been processed
> @@ -2666,10 +2696,10 @@ static void free_unused_bufs(struct virtnet_info *vi)
>   	for (i = 0; i < vi->max_queue_pairs; i++) {
>   		struct virtqueue *vq = vi->sq[i].vq;
>   		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> -			if (!is_xdp_raw_buffer_queue(vi, i))
> +			if (!is_xdp_frame(buf))


I believe this is the last user of is_xdp_raw_buffer_queue(), maybe you
can send a patch on top to remove it.


>   				dev_kfree_skb(buf);
>   			else
> -				xdp_return_frame(buf);
> +				xdp_return_frame(ptr_to_xdp(buf));
>   		}
>   	}
>   


Acked-by: Jason Wang <jasowang@redhat.com>

Thanks
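
For readers following along, the low-bit tagging that is_xdp_frame(),
xdp_to_ptr() and ptr_to_xdp() perform in the quoted patch can be sketched in
user space roughly as below. This is a minimal sketch, not the driver code:
struct frame and struct packet are hypothetical stand-ins for xdp_frame and
sk_buff, release_buf() is an invented helper, and malloc()/free() stand in
for the real allocation and release paths. It relies on the assumption that
both object types are at least 2-byte aligned, so bit 0 of a valid pointer is
always free.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define XDP_FLAG ((uintptr_t)1)	/* bit 0 marks "this is a frame" */

struct frame  { unsigned int len; };	/* hypothetical stand-in for xdp_frame */
struct packet { unsigned int len; };	/* hypothetical stand-in for sk_buff */

static int is_frame(const void *ptr)
{
	return ((uintptr_t)ptr & XDP_FLAG) != 0;
}

static void *frame_to_ptr(struct frame *f)
{
	/* Tagging only works if the low bit of the real pointer is clear. */
	assert(((uintptr_t)f & XDP_FLAG) == 0);
	return (void *)((uintptr_t)f | XDP_FLAG);
}

static struct frame *ptr_to_frame(void *ptr)
{
	return (struct frame *)((uintptr_t)ptr & ~XDP_FLAG);
}

/* Release one completed buffer of either kind and return its length,
 * mirroring the shape of the loop in free_old_xmit_skbs(). */
static unsigned int release_buf(void *ptr)
{
	unsigned int len;

	if (is_frame(ptr)) {
		struct frame *f = ptr_to_frame(ptr);

		len = f->len;
		free(f);
	} else {
		struct packet *p = ptr;

		len = p->len;
		free(p);
	}
	return len;
}
```

The point of the design is that the queue stores a single void * per buffer,
so the type has to travel inside the pointer itself rather than be inferred
from the current XDP setting.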




* Re: [PATCH v2 net 5/7] virtio_net: Don't process redirected XDP frames when XDP is disabled
From: Jason Wang @ 2019-01-29  2:20 UTC
  To: Toshiaki Makita, David S. Miller, Michael S. Tsirkin
  Cc: netdev, virtualization, Jesper Dangaard Brouer
In-Reply-To: <1548722759-2470-6-git-send-email-makita.toshiaki@lab.ntt.co.jp>


On 2019/1/29 8:45 AM, Toshiaki Makita wrote:
> Commit 8dcc5b0ab0ec ("virtio_net: fix ndo_xdp_xmit crash towards dev not
> ready for XDP") tried to avoid access to unexpected sq while XDP is
> disabled, but was not complete.
>
> There was a small window which causes out of bounds sq access in
> virtnet_xdp_xmit() while disabling XDP.
>
> An example case of
>   - curr_queue_pairs = 6 (2 for SKB and 4 for XDP)
>   - online_cpu_num = xdp_queue_pairs = 4
> when XDP is enabled:
>
> CPU 0                         CPU 1
> (Disabling XDP)               (Processing redirected XDP frames)
>
>                                virtnet_xdp_xmit()
> virtnet_xdp_set()
>   _virtnet_set_queues()
>    set curr_queue_pairs (2)
>                                 check if rq->xdp_prog is not NULL
>                                 virtnet_xdp_sq(vi)
>                                  qp = curr_queue_pairs -
>                                       xdp_queue_pairs +
>                                       smp_processor_id()
>                                     = 2 - 4 + 1 = -1
>                                  sq = &vi->sq[qp] // out of bounds access
>    set xdp_queue_pairs (0)
>    rq->xdp_prog = NULL
>
> Basically we should not change curr_queue_pairs and xdp_queue_pairs
> while someone can read the values. Thus, when disabling XDP, assign NULL
> to rq->xdp_prog first, and wait for RCU grace period, then change
> xxx_queue_pairs.
> Note that we need to keep the current order when enabling XDP though.
>
> - v2: Make rcu_assign_pointer/synchronize_net conditional instead of
>        _virtnet_set_queues.
>
> Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
>   drivers/net/virtio_net.c | 33 ++++++++++++++++++++++++++-------
>   1 file changed, 26 insertions(+), 7 deletions(-)


Acked-by: Jason Wang <jasowang@redhat.com>

Thanks.
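
The index arithmetic from the commit message can be checked with a small
stand-alone sketch. The function names here are hypothetical and only mirror
the expression shown for virtnet_xdp_sq(); the numbers are the ones from the
example above (6 queue pairs, 4 of them for XDP, running on CPU 1).

```c
/* Mirrors: qp = curr_queue_pairs - xdp_queue_pairs + smp_processor_id() */
static int xdp_sq_index(int curr_queue_pairs, int xdp_queue_pairs, int cpu)
{
	return curr_queue_pairs - xdp_queue_pairs + cpu;
}

/* A valid index must fall inside the allocated sq array. */
static int sq_index_in_bounds(int qp, int nr_queues)
{
	return qp >= 0 && qp < nr_queues;
}
```

With XDP fully enabled, xdp_sq_index(6, 4, 1) gives 3. In the window where
curr_queue_pairs has already dropped to 2 but xdp_queue_pairs is still 4,
xdp_sq_index(2, 4, 1) gives -1, which is the out of bounds access described
in the commit message.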



* Re: [PATCH net 1/1] bonding: fix PACKET_ORIGDEV regression on bonding masters
From: Michal Soltys @ 2019-01-29  1:47 UTC
  To: Maciej Żenczykowski
  Cc: David Miller, Linux NetDev, Jay Vosburgh, Vincent Bernat,
	Mahesh Bandewar, Chonggang Li
In-Reply-To: <CAHo-OoxGHAcQZR-arPBaU5EC1bq3VYfvnxyzdu2a_RxuCf0rrg@mail.gmail.com>

On 19/01/18 07:58, Maciej Żenczykowski wrote:
> I'm not sure there's a truly good answer.
> 
> Also note that there are subtle differences between af_packet sockets
> tied to ALL protocols vs tied to specific protocols (vs none/0 of
> course).
> However ETH_P_ALL protocol raw sockets need to be avoided if at all
> possible due to perf impact.
> Certainly we don't want any of these created outside of debugging.
> 
> I believe we only cared about the utterly unbound to any device case
> working (ie. delivering all LLDP packets all nics are receiving).
> So that a single simple daemon could collect and export all the link
> information.
> [btw. now that I know about the PACKET_ORIGDEV option, I do -
> unsurprisingly - see our daemon sets it]
> We really didn't want the complexity of having to bind to individual
> interfaces and having to try to dynamically adjust the set of raw
> sockets.
> (not to mention that less raw sockets is always good, also it's never
> actually entirely clear which interfaces would need to be monitored
> due to the preponderance of tunnels/macvlan/ipvlan/veth/dummy and
> other virtual interface types)
> 
> I think from a logic perspective:
> 
> - if you bind to slave (physical interface) you should see *all*
> packets on that interface,
> regardless of whether the slave is active or not, and whether the
> packet is link-local or not.
> ie. this should show you exactly what nic is receiving (& sending:
> PACKET_OUTGOING)
> I should see *exactly* one copy of any packet.
> This mode has to be useful for debugging.
> This includes seeing packets which aren't even destined for me if nic
> receives them.
> (ie. destined to unicast/multicast mac we'd filter out: PACKET_OTHERHOST)
> Although by itself this socket's existence shouldn't affect nic mac
> filters and promiscuous mode
> (I believe there are all sorts of various additional socket options to
> change those).
> 
> - if you don't bind to an interface then I think I'd expect to see
> packets delivered to stack + link local packets
>    - ie. should not see non-link-local packets discarded due to being
> on inactive slaves
>      similarly to how I should not see packets filtered out by virtue
> of mac not matching interface mac filters
>    - IMHO you should see *all* link local packets arriving at system
> (ie. the original change's purpose)
>      [and I think, but am not certain, I shouldn't need to use any
> socket options to register the link local macs,
>       although glancing at code I do think the daemon uses
> PACKET_ADD_MEMBERSHIP to register lldp mac,
>       so perhaps filtering should apply as normal?]
>    - with PACKET_ORIGDEV they should always show up as from the
> physical interfaces (ie. slaves)
>    - without PACKET_ORIGDEV:
>       - non link local packets should show up as from bonding/team master

>       - link local packets on active slaves could show up on master or
> slave (probably master is required iirc some earlier fix???)

As far as I can see, they (link-local packets arriving on an active slave)
will show up via the master as from the master (w/o the socket option) or
as from the slave (with it). I think Vincent's tests confirmed it.

>       - link local packets on inactive slaves should show up on slave -
> and definitely not on master
>    - I don't think I should ever see any PACKET_OTHERHOST packets
> 
> - if you bind to master you should see packets from active slaves (ie.
> those that will get delivered to stack)
>    - clearly you should not see non-link-local inactive slave packets
> (they'll be dropped)

>    - behaviour wrt. link local packets is more dicey... (I believe
> somewhere in these threads patches there was some description of what
> standard requires, but I don't off the top of my head remember)

They must be seen, otherwise bonds attached to a linux bridge (I assume
enslaving an interface to a bridge essentially counts as a bind) will be
blind to them (among those, e.g., spanning tree information, which
was what originally caused issues for me last year). Even the bridge
code goes to extra lengths not to carelessly consume stp packets if it's
not actively participating in stp (of any kind). Aside from that, certain
[recent] bridge features like group_fwd_mask would be non-functional on
bond ports as well.

As for the standard (the one I quoted in the oldest thread), it expects
link-local packets to be readable via both master and slaves, depending on
need (though it doesn't go into the exact gory details). I think your
current patch covers all possible cases quite gracefully.

>    - for an active-backup bond it would seem logical to see only active
> links link local
>    - for a multiple-active aggregate bond I'd be fine with seeing none,
> or all active slaves link local packets - I guess even though the
> later is confusing it makes sense
>    - it might be okay to see link-local inactive slave packets, but I
> think I'd prefer not to (could be configurable though) - this would
> seem confusing/wrong to me.
>    - and again I don't think I should see any PACKET_OTHERHOST packets here...
>    - PACKET_ORIGDEV as above...
> 
> I gave the above a fair amount of thought... but I'm not guaranteeing
> I didn't make typos, or write something utterly stupid or
> unimplementable or not how stuff should work for other reasons...
> Comments welcome.
> 
> I think this continues to be in line with my proposal from earlier in
> the thread?

(sorry for late reply)

Yes, pretty much spot on. With some confirmation comments above.

I did bridging tests yesterday (2 LACP bonds attached via separate 
switches treating linux bridge as a shared segment in mstp scenario) and 
all is working fine on my side as well.

If everyone agrees with the proposed code, I will submit a v2 patch with
an added comment explaining the basic logic (or you could submit it if you
prefer).
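
To make the decision order of the reordered check easier to follow, here is a
minimal user-space sketch. It is not the bonding code: the verdict names are
hypothetical stand-ins for the RX_HANDLER_* results, exact_match_only stands
in for what bond_should_deliver_exact_match() reports, VERDICT_TO_MASTER
models the "re-parent to master" path, and link_local_dest() is written here
on the assumption that link-local means the 01:80:C2:00:00:0X range.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the rx handler results discussed above. */
enum verdict {
	VERDICT_PASS,		/* leave skb->dev alone (seen on the slave) */
	VERDICT_EXACT,		/* deliver only to exact-match listeners */
	VERDICT_TO_MASTER,	/* re-parent to the bond master */
};

/* Assumption: link-local destinations are 01:80:C2:00:00:00 .. :0F. */
static bool link_local_dest(const uint8_t mac[6])
{
	static const uint8_t base[5] = { 0x01, 0x80, 0xc2, 0x00, 0x00 };

	return memcmp(mac, base, sizeof(base)) == 0 && (mac[5] & 0xf0) == 0;
}

/* Suppression check first, link-local check second. */
static enum verdict classify(const uint8_t dest[6], bool exact_match_only)
{
	if (exact_match_only) {
		/* Link-local traffic on an inactive slave stays on the slave. */
		if (link_local_dest(dest))
			return VERDICT_PASS;
		return VERDICT_EXACT;
	}
	/* Everything on an active slave, link-local or not, goes to master. */
	return VERDICT_TO_MASTER;
}
```

With this ordering, an LLDP frame (destination 01:80:C2:00:00:0E) on an
inactive slave yields VERDICT_PASS, while the same frame on an active slave
yields VERDICT_TO_MASTER.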

> 
> On Thu, Jan 17, 2019 at 4:27 PM Michal Soltys <soltys@ziu.info> wrote:
>>
>> On 19/01/14 03:01, Maciej Żenczykowski wrote:
>> > So I don't remember the specifics...
>> >
>> > (note I'm writing this all from memory without looking it up/testing
>> > it - I may be utterly wrong or dreaming)
>> >
>> > But I seem to recall that the core problem we were trying to solve was
>> > that a daemon listening
>> > on an AF_PACKET ethertype 88CC [LLDP] socket not bound to any device
>> > would not receive LLDP packets
>> > arriving on inactive bond slaves (either active-backup or lag).
>> >
>> > [inactive = link/carrier up, but not part of active aggregator]
>> >
>> > This made monitoring for miscabling harder (IIRC the only non kernel
>> > fix was to get the daemon to create
>> > a separate AF_PACKET/88CC socket bound to every physical interface in
>> > the system, or monitor for
>> > inactive slaves and add extra packet sockets as needed).
>> >
>> > They would get re-parented to the master and then since the slave was
>> > inactive they would be considered RX_HANDLER_EXACT match only and not
>> > match the * interface.
>> >
>> > Honestly I wasn't aware of PACKET_ORIGDEV, although I don't think it
>> > helps in this case - AFAICR the packets never made it to the packet
>> > socket.
>> >
>> > Perhaps going from:
>> >    /* don't change skb->dev for link-local packets */
>> >    if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>> >    if (bond_should_deliver_exact_match(skb, slave, bond)) return RX_HANDLER_EXACT;
>> >
>> > to something more like:
>> >    if (bond_should_deliver_exact_match(skb, slave, bond)) {
>> >      /* don't change skb->dev for link-local packets on inactive slaves */
>> >      if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) return RX_HANDLER_PASS;
>> >      return RX_HANDLER_EXACT;
>> >    }
>>
>> Having checked the code (if I get the flow correctly), one
>> thing/question - currently with Mahesh's fixes, not bound LLDP listener
>> will receive all packets - both from active and inactive slaves directly
>> (as the check for suppression is done after the link-local check).
>>
>> The version above will do the suppression check first - so all inactive
>> slaves - excluding non-multi/non-broad ALB - will pass it and return
>> RX_HANDLER_PASS if the packet is link-local. So those will be available
>> w/o binding, but active slaves' packets will be available via master
>> device (but with working PACKET_ORIGDEV now - so slave device can be
>> retrieved easily). This is fine in your scenario I presume ?
>>
> 



* pull-request: bpf-next 2019-01-29
From: Daniel Borkmann @ 2019-01-29  1:46 UTC
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Teach verifier dead code removal, this also allows for optimizing /
   removing conditional branches around dead code and to shrink the
   resulting image. Code store constrained architectures like nfp would
   have a hard time doing this at JIT level, from Jakub.

2) Add JMP32 instructions to BPF ISA in order to allow for optimizing
   code generation for 32-bit sub-registers. Evaluation shows that this
   can result in code reduction of ~5-20% compared to 64 bit-only code
   generation. Also add implementation for most JITs, from Jiong.

3) Add support for __int128 types in BTF which is also needed for
   vmlinux's BTF conversion to work, from Yonghong.

4) Add a new command to bpftool in order to dump a list of BPF-related
   parameters from the system or for a specific network device e.g. in
   terms of available prog/map types or helper functions, from Quentin.

5) Add AF_XDP sock_diag interface for querying sockets from user
   space which provides information about the RX/TX/fill/completion
   rings, umem, memory usage etc, from Björn.

6) Add skb context access for skb_shared_info->gso_segs field, from Eric.

7) Add support for testing flow dissector BPF programs by extending
   existing BPF_PROG_TEST_RUN infrastructure, from Stanislav.

8) Split BPF kselftest's test_verifier into various subgroups of tests
   in order to better deal with merge conflicts in this area, from Jakub.

9) Add support for queue/stack manipulations in bpftool, from Stanislav.

10) Document BTF, from Yonghong.

11) Dump supported ELF section names in libbpf on program load
    failure, from Taeung.

12) Silence a false positive compiler warning in verifier's BTF
    handling, from Peter.

13) Fix help string in bpftool's feature probing, from Prashant.

14) Remove duplicate includes in BPF kselftests, from Yue.
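
As a rough illustration of item 2, the sketch below models the difference
between a 64-bit conditional jump and a JMP32 one: the JMP32 form looks only
at the low 32 bits of each operand. The helper names are hypothetical and
this is plain C, not BPF bytecode or verifier code.

```c
#include <stdbool.h>
#include <stdint.h>

/* 64-bit unsigned "jump if greater": compares the full registers. */
static bool jgt64(uint64_t dst, uint64_t src)
{
	return dst > src;
}

/* 32-bit unsigned variant: compares only the low 32-bit sub-registers. */
static bool jgt32(uint64_t dst, uint64_t src)
{
	return (uint32_t)dst > (uint32_t)src;
}

/* 32-bit signed variant: low 32 bits reinterpreted as two's complement
 * (the narrowing conversion is implementation-defined in C, but behaves
 * this way on the usual compilers). */
static bool jsgt32(uint64_t dst, uint64_t src)
{
	return (int32_t)(uint32_t)dst > (int32_t)(uint32_t)src;
}
```

For example, with dst = 0x100000001 and src = 2, the 64-bit comparison is
true while the 32-bit one is false, because only the low word (1) takes part.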

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit ae5220c672180765615458ae54dbcff9abe6a01d:

  networking: Documentation: fix snmp_counters.rst Sphinx warnings (2019-01-16 13:29:54 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 3d2af27a84a8474e510f5d8362303bfbee946308:

  Merge branch 'bpf-flow-dissector-tests' (2019-01-29 01:08:30 +0100)

----------------------------------------------------------------
Alexei Starovoitov (4):
      Merge branch 'bpftool-probes'
      Merge branch 'dead-code-elimination'
      Merge branch 'jmp32-insns'
      Merge branch 'split-test_verifier'

Björn Töpel (3):
      net: xsk: track AF_XDP sockets on a per-netns list
      xsk: add id to umem
      xsk: add sock_diag interface for AF_XDP

Daniel Borkmann (4):
      Merge branch 'bpf-int128-btf'
      Merge branch 'bpf-bpftool-queue-stack'
      Merge branch 'af-xdp-sock-diag'
      Merge branch 'bpf-flow-dissector-tests'

Eric Dumazet (1):
      bpf: allow BPF programs access skb_shared_info->gso_segs field

Jakub Kicinski (16):
      bpf: change parameters of call/branch offset adjustment
      bpf: verifier: hard wire branches to dead code
      bpf: verifier: remove dead code
      bpf: verifier: remove unconditional branches by 0
      selftests: bpf: add tests for dead code removal
      bpf: verifier: record original instruction index
      bpf: notify offload JITs about optimizations
      nfp: bpf: don't use instruction number for jump target
      nfp: bpf: split up the skip flag
      nfp: bpf: save original program length
      nfp: bpf: support optimizing dead branches
      nfp: bpf: support removing dead code
      selftests: bpf: prepare for break up of verifier tests
      selftests: bpf: break up test_verifier
      selftests: bpf: break up the rest of test_verifier
      tools: bpftool: warn about risky prog array updates

Jiong Wang (16):
      bpf: allocate 0x06 to new eBPF instruction class JMP32
      bpf: refactor verifier min/max code for condition jump
      bpf: verifier support JMP32
      bpf: disassembler support JMP32
      tools: bpftool: teach cfg code about JMP32
      bpf: interpreter support for JMP32
      bpf: JIT blinds support JMP32
      x86_64: bpf: implement jitting of JMP32
      x32: bpf: implement jitting of JMP32
      arm64: bpf: implement jitting of JMP32
      arm: bpf: implement jitting of JMP32
      ppc: bpf: implement jitting of JMP32
      s390: bpf: implement jitting of JMP32
      nfp: bpf: implement jitting of JMP32
      selftests: bpf: functional and min/max reasoning unit tests for JMP32
      selftests: bpf: makefile support sub-register code-gen test mode

Peter Oskolkov (1):
      bpf: fix a (false) compiler warning

Prashant Bhole (1):
      bpftool: feature probing, change default action

Quentin Monnet (9):
      tools: bpftool: add basic probe capability, probe syscall availability
      tools: bpftool: add probes for /proc/ eBPF parameters
      tools: bpftool: add probes for kernel configuration options
      tools: bpftool: add probes for eBPF program types
      tools: bpftool: add probes for eBPF map types
      tools: bpftool: add probes for eBPF helper functions
      tools: bpftool: add C-style "#define" output for probes
      tools: bpftool: add probes for a network device
      tools: bpftool: add bash completion for bpftool probes

Stanislav Fomichev (13):
      libbpf: don't define CC and AR
      bpftool: make key and value optional in update command
      bpftool: make key optional in lookup command
      bpftool: don't print empty key/value for maps
      bpftool: add peek command
      bpftool: add push and enqueue commands
      bpftool: add pop and dequeue commands
      bpftool: add bash completion for peek/push/enqueue/pop/dequeue
      selftests/bpf: don't hardcode iptables/nc path in test_tcpnotify_user
      selftests/bpf: suppress readelf stderr when probing for BTF support
      net/flow_dissector: move bpf case into __skb_flow_bpf_dissect
      bpf: add BPF_PROG_TEST_RUN support for flow dissector
      selftests/bpf: add simple BPF_PROG_TEST_RUN examples for flow dissector

Taeung Song (1):
      libbpf: Show supported ELF section names when failing to guess prog/attach type

Yonghong Song (6):
      bpf: btf: support 128 bit integer type
      tools/bpf: add int128 raw test in test_btf
      tools/bpf: refactor test_btf pretty printing for multiple map value formats
      tools/bpf: add bpffs pretty print test for int128
      tools/bpf: support __int128 in bpftool map pretty dumper
      bpf: btf: add btf documentation

YueHaibing (1):
      selftests: bpf: remove duplicated include

 Documentation/bpf/btf.rst                          |   870 ++
 Documentation/bpf/index.rst                        |     7 +
 Documentation/networking/filter.txt                |    15 +-
 arch/arm/net/bpf_jit_32.c                          |    53 +-
 arch/arm/net/bpf_jit_32.h                          |     2 +
 arch/arm64/net/bpf_jit_comp.c                      |    37 +-
 arch/powerpc/include/asm/ppc-opcode.h              |     1 +
 arch/powerpc/net/bpf_jit.h                         |     4 +
 arch/powerpc/net/bpf_jit_comp64.c                  |   120 +-
 arch/s390/net/bpf_jit_comp.c                       |    66 +-
 arch/x86/net/bpf_jit_comp.c                        |    46 +-
 arch/x86/net/bpf_jit_comp32.c                      |   121 +-
 drivers/net/ethernet/netronome/nfp/bpf/jit.c       |   139 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.h      |    51 +-
 drivers/net/ethernet/netronome/nfp/bpf/offload.c   |     9 +-
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c  |    74 +-
 include/linux/bpf.h                                |    10 +
 include/linux/bpf_verifier.h                       |     6 +
 include/linux/filter.h                             |    21 +
 include/linux/skbuff.h                             |     5 +
 include/net/net_namespace.h                        |     4 +
 include/net/netns/xdp.h                            |    13 +
 include/net/xdp_sock.h                             |     1 +
 include/uapi/linux/bpf.h                           |     2 +
 include/uapi/linux/xdp_diag.h                      |    72 +
 kernel/bpf/btf.c                                   |   104 +-
 kernel/bpf/core.c                                  |   273 +-
 kernel/bpf/disasm.c                                |    34 +-
 kernel/bpf/offload.c                               |    35 +
 kernel/bpf/verifier.c                              |   624 +-
 net/bpf/test_run.c                                 |    82 +
 net/core/filter.c                                  |    22 +
 net/core/flow_dissector.c                          |    92 +-
 net/xdp/Kconfig                                    |     8 +
 net/xdp/Makefile                                   |     1 +
 net/xdp/xdp_umem.c                                 |    13 +
 net/xdp/xsk.c                                      |    36 +-
 net/xdp/xsk.h                                      |    12 +
 net/xdp/xsk_diag.c                                 |   191 +
 samples/bpf/bpf_insn.h                             |    20 +
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst |     1 +
 .../bpf/bpftool/Documentation/bpftool-feature.rst  |    85 +
 tools/bpf/bpftool/Documentation/bpftool-map.rst    |    29 +-
 tools/bpf/bpftool/Documentation/bpftool-net.rst    |     1 +
 tools/bpf/bpftool/Documentation/bpftool-perf.rst   |     1 +
 tools/bpf/bpftool/Documentation/bpftool-prog.rst   |     1 +
 tools/bpf/bpftool/Documentation/bpftool.rst        |     1 +
 tools/bpf/bpftool/bash-completion/bpftool          |   110 +-
 tools/bpf/bpftool/btf_dumper.c                     |    98 +-
 tools/bpf/bpftool/cfg.c                            |     9 +-
 tools/bpf/bpftool/feature.c                        |   764 +
 tools/bpf/bpftool/main.c                           |     3 +-
 tools/bpf/bpftool/main.h                           |     4 +
 tools/bpf/bpftool/map.c                            |   232 +-
 tools/bpf/bpftool/prog.c                           |    10 +-
 tools/include/linux/filter.h                       |    20 +
 tools/include/uapi/linux/bpf.h                     |     2 +
 tools/lib/bpf/Build                                |     2 +-
 tools/lib/bpf/Makefile                             |    17 +-
 tools/lib/bpf/libbpf.c                             |    46 +-
 tools/lib/bpf/libbpf.h                             |    14 +
 tools/lib/bpf/libbpf.map                           |     7 +
 tools/lib/bpf/libbpf_probes.c                      |   242 +
 tools/testing/selftests/bpf/Makefile               |   112 +-
 tools/testing/selftests/bpf/flow_dissector_load.c  |    43 +-
 tools/testing/selftests/bpf/flow_dissector_load.h  |    55 +
 tools/testing/selftests/bpf/test_btf.c             |   680 +-
 tools/testing/selftests/bpf/test_flow_dissector.c  |     2 -
 tools/testing/selftests/bpf/test_maps.c            |     1 -
 tools/testing/selftests/bpf/test_progs.c           |    78 +-
 tools/testing/selftests/bpf/test_socket_cookie.c   |     4 +-
 tools/testing/selftests/bpf/test_sockmap.c         |     1 -
 tools/testing/selftests/bpf/test_tcpnotify_user.c  |     6 +-
 tools/testing/selftests/bpf/test_verifier.c        | 15426 +------------------
 tools/testing/selftests/bpf/verifier/.gitignore    |     1 +
 tools/testing/selftests/bpf/verifier/and.c         |    50 +
 .../testing/selftests/bpf/verifier/array_access.c  |   219 +
 tools/testing/selftests/bpf/verifier/basic.c       |    23 +
 tools/testing/selftests/bpf/verifier/basic_call.c  |    50 +
 tools/testing/selftests/bpf/verifier/basic_instr.c |   134 +
 tools/testing/selftests/bpf/verifier/basic_stack.c |    64 +
 .../testing/selftests/bpf/verifier/basic_stx_ldx.c |    45 +
 tools/testing/selftests/bpf/verifier/bounds.c      |   508 +
 .../selftests/bpf/verifier/bounds_deduction.c      |   124 +
 .../bpf/verifier/bounds_mix_sign_unsign.c          |   406 +
 .../testing/selftests/bpf/verifier/bpf_get_stack.c |    44 +
 tools/testing/selftests/bpf/verifier/calls.c       |  1942 +++
 tools/testing/selftests/bpf/verifier/cfg.c         |    70 +
 .../selftests/bpf/verifier/cgroup_inv_retcode.c    |    72 +
 tools/testing/selftests/bpf/verifier/cgroup_skb.c  |   197 +
 .../selftests/bpf/verifier/cgroup_storage.c        |   220 +
 tools/testing/selftests/bpf/verifier/const_or.c    |    60 +
 tools/testing/selftests/bpf/verifier/ctx.c         |    93 +
 tools/testing/selftests/bpf/verifier/ctx_sk_msg.c  |   180 +
 tools/testing/selftests/bpf/verifier/ctx_skb.c     |  1033 ++
 tools/testing/selftests/bpf/verifier/dead_code.c   |   159 +
 .../selftests/bpf/verifier/direct_packet_access.c  |   633 +
 .../bpf/verifier/direct_stack_access_wraparound.c  |    40 +
 tools/testing/selftests/bpf/verifier/div0.c        |   184 +
 .../testing/selftests/bpf/verifier/div_overflow.c  |   104 +
 .../selftests/bpf/verifier/helper_access_var_len.c |   614 +
 .../selftests/bpf/verifier/helper_packet_access.c  |   460 +
 .../selftests/bpf/verifier/helper_value_access.c   |   953 ++
 tools/testing/selftests/bpf/verifier/jit.c         |    88 +
 tools/testing/selftests/bpf/verifier/jmp32.c       |   724 +
 tools/testing/selftests/bpf/verifier/jset.c        |   165 +
 tools/testing/selftests/bpf/verifier/jump.c        |   180 +
 tools/testing/selftests/bpf/verifier/junk_insn.c   |    45 +
 tools/testing/selftests/bpf/verifier/ld_abs.c      |   286 +
 tools/testing/selftests/bpf/verifier/ld_dw.c       |    36 +
 tools/testing/selftests/bpf/verifier/ld_imm64.c    |   141 +
 tools/testing/selftests/bpf/verifier/ld_ind.c      |    72 +
 tools/testing/selftests/bpf/verifier/leak_ptr.c    |    67 +
 tools/testing/selftests/bpf/verifier/lwt.c         |   189 +
 tools/testing/selftests/bpf/verifier/map_in_map.c  |    62 +
 .../selftests/bpf/verifier/map_ptr_mixing.c        |   100 +
 tools/testing/selftests/bpf/verifier/map_ret_val.c |    65 +
 tools/testing/selftests/bpf/verifier/masking.c     |   322 +
 tools/testing/selftests/bpf/verifier/meta_access.c |   235 +
 .../bpf/verifier/perf_event_sample_period.c        |    59 +
 .../selftests/bpf/verifier/prevent_map_lookup.c    |    74 +
 tools/testing/selftests/bpf/verifier/raw_stack.c   |   305 +
 .../testing/selftests/bpf/verifier/ref_tracking.c  |   607 +
 tools/testing/selftests/bpf/verifier/runtime_jit.c |    80 +
 .../selftests/bpf/verifier/search_pruning.c        |   156 +
 tools/testing/selftests/bpf/verifier/spill_fill.c  |    75 +
 tools/testing/selftests/bpf/verifier/stack_ptr.c   |   317 +
 tools/testing/selftests/bpf/verifier/uninit.c      |    39 +
 tools/testing/selftests/bpf/verifier/unpriv.c      |   521 +
 tools/testing/selftests/bpf/verifier/value.c       |   104 +
 .../selftests/bpf/verifier/value_adj_spill.c       |    43 +
 .../selftests/bpf/verifier/value_illegal_alu.c     |    94 +
 .../testing/selftests/bpf/verifier/value_or_null.c |   152 +
 .../selftests/bpf/verifier/value_ptr_arith.c       |   792 +
 tools/testing/selftests/bpf/verifier/var_off.c     |    66 +
 tools/testing/selftests/bpf/verifier/xadd.c        |    97 +
 tools/testing/selftests/bpf/verifier/xdp.c         |    14 +
 .../bpf/verifier/xdp_direct_packet_access.c        |   900 ++
 138 files changed, 21229 insertions(+), 16128 deletions(-)
 create mode 100644 Documentation/bpf/btf.rst
 create mode 100644 include/net/netns/xdp.h
 create mode 100644 include/uapi/linux/xdp_diag.h
 create mode 100644 net/xdp/xsk.h
 create mode 100644 net/xdp/xsk_diag.c
 create mode 100644 tools/bpf/bpftool/Documentation/bpftool-feature.rst
 create mode 100644 tools/bpf/bpftool/feature.c
 create mode 100644 tools/lib/bpf/libbpf_probes.c
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.h
 create mode 100644 tools/testing/selftests/bpf/verifier/.gitignore
 create mode 100644 tools/testing/selftests/bpf/verifier/and.c
 create mode 100644 tools/testing/selftests/bpf/verifier/array_access.c
 create mode 100644 tools/testing/selftests/bpf/verifier/basic.c
 create mode 100644 tools/testing/selftests/bpf/verifier/basic_call.c
 create mode 100644 tools/testing/selftests/bpf/verifier/basic_instr.c
 create mode 100644 tools/testing/selftests/bpf/verifier/basic_stack.c
 create mode 100644 tools/testing/selftests/bpf/verifier/basic_stx_ldx.c
 create mode 100644 tools/testing/selftests/bpf/verifier/bounds.c
 create mode 100644 tools/testing/selftests/bpf/verifier/bounds_deduction.c
 create mode 100644 tools/testing/selftests/bpf/verifier/bounds_mix_sign_unsign.c
 create mode 100644 tools/testing/selftests/bpf/verifier/bpf_get_stack.c
 create mode 100644 tools/testing/selftests/bpf/verifier/calls.c
 create mode 100644 tools/testing/selftests/bpf/verifier/cfg.c
 create mode 100644 tools/testing/selftests/bpf/verifier/cgroup_inv_retcode.c
 create mode 100644 tools/testing/selftests/bpf/verifier/cgroup_skb.c
 create mode 100644 tools/testing/selftests/bpf/verifier/cgroup_storage.c
 create mode 100644 tools/testing/selftests/bpf/verifier/const_or.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx_sk_msg.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx_skb.c
 create mode 100644 tools/testing/selftests/bpf/verifier/dead_code.c
 create mode 100644 tools/testing/selftests/bpf/verifier/direct_packet_access.c
 create mode 100644 tools/testing/selftests/bpf/verifier/direct_stack_access_wraparound.c
 create mode 100644 tools/testing/selftests/bpf/verifier/div0.c
 create mode 100644 tools/testing/selftests/bpf/verifier/div_overflow.c
 create mode 100644 tools/testing/selftests/bpf/verifier/helper_access_var_len.c
 create mode 100644 tools/testing/selftests/bpf/verifier/helper_packet_access.c
 create mode 100644 tools/testing/selftests/bpf/verifier/helper_value_access.c
 create mode 100644 tools/testing/selftests/bpf/verifier/jit.c
 create mode 100644 tools/testing/selftests/bpf/verifier/jmp32.c
 create mode 100644 tools/testing/selftests/bpf/verifier/jset.c
 create mode 100644 tools/testing/selftests/bpf/verifier/jump.c
 create mode 100644 tools/testing/selftests/bpf/verifier/junk_insn.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ld_abs.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ld_dw.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ld_imm64.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ld_ind.c
 create mode 100644 tools/testing/selftests/bpf/verifier/leak_ptr.c
 create mode 100644 tools/testing/selftests/bpf/verifier/lwt.c
 create mode 100644 tools/testing/selftests/bpf/verifier/map_in_map.c
 create mode 100644 tools/testing/selftests/bpf/verifier/map_ptr_mixing.c
 create mode 100644 tools/testing/selftests/bpf/verifier/map_ret_val.c
 create mode 100644 tools/testing/selftests/bpf/verifier/masking.c
 create mode 100644 tools/testing/selftests/bpf/verifier/meta_access.c
 create mode 100644 tools/testing/selftests/bpf/verifier/perf_event_sample_period.c
 create mode 100644 tools/testing/selftests/bpf/verifier/prevent_map_lookup.c
 create mode 100644 tools/testing/selftests/bpf/verifier/raw_stack.c
 create mode 100644 tools/testing/selftests/bpf/verifier/ref_tracking.c
 create mode 100644 tools/testing/selftests/bpf/verifier/runtime_jit.c
 create mode 100644 tools/testing/selftests/bpf/verifier/search_pruning.c
 create mode 100644 tools/testing/selftests/bpf/verifier/spill_fill.c
 create mode 100644 tools/testing/selftests/bpf/verifier/stack_ptr.c
 create mode 100644 tools/testing/selftests/bpf/verifier/uninit.c
 create mode 100644 tools/testing/selftests/bpf/verifier/unpriv.c
 create mode 100644 tools/testing/selftests/bpf/verifier/value.c
 create mode 100644 tools/testing/selftests/bpf/verifier/value_adj_spill.c
 create mode 100644 tools/testing/selftests/bpf/verifier/value_illegal_alu.c
 create mode 100644 tools/testing/selftests/bpf/verifier/value_or_null.c
 create mode 100644 tools/testing/selftests/bpf/verifier/value_ptr_arith.c
 create mode 100644 tools/testing/selftests/bpf/verifier/var_off.c
 create mode 100644 tools/testing/selftests/bpf/verifier/xadd.c
 create mode 100644 tools/testing/selftests/bpf/verifier/xdp.c
 create mode 100644 tools/testing/selftests/bpf/verifier/xdp_direct_packet_access.c


* Re: [PATCH 00/33] Netfilter/IPVS updates for net-next
From: David Miller @ 2019-01-29  1:38 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <20190128235750.18412-1-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Tue, 29 Jan 2019 00:57:17 +0100

> The following patchset contains Netfilter/IPVS updates for your net-next tree:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Pulled, thanks Pablo.


* Re: [PATCH bpf] bpf, doc: add reviewers to maintainers entry
From: Martin Lau @ 2019-01-29  1:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: ast@kernel.org, netdev@vger.kernel.org, Song Liu, Yonghong Song
In-Reply-To: <20190128225526.5340-1-daniel@iogearbox.net>

On Mon, Jan 28, 2019 at 11:55:26PM +0100, Daniel Borkmann wrote:
> In order to better scale BPF development on netdev, we've adopted a
> reviewer rotation for all BPF patches among the five of us for some
> time now. Let's give credit where credit is due, and add Martin, Song
> and Yonghong as official BPF reviewers to MAINTAINERS file. Also
> while at it, add regex matching for BPF such that we get properly
> Cc'ed for files not listed here.
Thanks!

Acked-by: Martin KaFai Lau <kafai@fb.com>


* [PATCH bpf-next] bpf: check that BPF programs run with preemption disabled
From: Alexei Starovoitov @ 2019-01-29  1:21 UTC (permalink / raw)
  To: davem
  Cc: daniel, peterz, jannh, paulmck, will.deacon, mingo, netdev,
	kernel-team

From: Peter Zijlstra <peterz@infradead.org>

Introduce cant_sleep() macro for annotation of functions that cannot sleep.

Use it in BPF_PROG_RUN to catch execution of BPF programs
in preemptible context.

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/filter.h |  2 +-
 include/linux/kernel.h | 14 ++++++++++++--
 kernel/sched/core.c    | 28 ++++++++++++++++++++++++++++
 3 files changed, 41 insertions(+), 3 deletions(-)
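
The logic of __cant_sleep() in the patch below can be sketched in plain
userspace C, which may help when reasoning about when the warning fires.
This is only an illustration under stated assumptions: struct sim_ctx,
sim_cant_sleep() and SIM_HZ are hypothetical stand-ins for the kernel's
preempt_count(), irqs_disabled(), jiffies and HZ, and none of them exist
in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel state; not kernel APIs. */
#define SIM_HZ 100UL		/* simulated ticks per second */

struct sim_ctx {
	int preempt_count;	/* > 0 means preemption is disabled */
	bool irqs_disabled;	/* true means interrupts are off */
	unsigned long jiffies;	/* simulated clock, in ticks */
	unsigned long prev_jiffy; /* tick of the last report, 0 = never */
};

/*
 * Mirrors the order of checks in the patch: stay silent when the caller
 * really is atomic (IRQs off or preemption disabled), rate-limit to one
 * report per simulated second, otherwise complain.
 * Returns true if a report was printed.
 */
static bool sim_cant_sleep(struct sim_ctx *c, const char *file, int line,
			   int preempt_offset)
{
	assert(c != NULL);

	if (c->irqs_disabled)
		return false;

	if (c->preempt_count > preempt_offset)
		return false;

	if (c->prev_jiffy && c->jiffies - c->prev_jiffy < SIM_HZ)
		return false;
	c->prev_jiffy = c->jiffies;

	fprintf(stderr, "BUG: assuming atomic context at %s:%d\n", file, line);
	return true;
}
```

A caller would fill a struct sim_ctx and call
sim_cant_sleep(&ctx, __FILE__, __LINE__, 0). The report appears only when
preempt_count is 0 and irqs_disabled is false, which is the situation the
BPF_PROG_RUN annotation is meant to catch.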

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e4b473f85b46..7e87863617b3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -533,7 +533,7 @@ struct sk_filter {
 	struct bpf_prog	*prog;
 };
 
-#define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
+#define BPF_PROG_RUN(filter, ctx)  ({ cant_sleep(); (*(filter)->bpf_func)(ctx, (filter)->insnsi); })
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 8f0e68e250a7..a8868a32098c 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -245,8 +245,10 @@ extern int _cond_resched(void);
 #endif
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
-  void ___might_sleep(const char *file, int line, int preempt_offset);
-  void __might_sleep(const char *file, int line, int preempt_offset);
+extern void ___might_sleep(const char *file, int line, int preempt_offset);
+extern void __might_sleep(const char *file, int line, int preempt_offset);
+extern void __cant_sleep(const char *file, int line, int preempt_offset);
+
 /**
  * might_sleep - annotation for functions that can sleep
  *
@@ -259,6 +261,13 @@ extern int _cond_resched(void);
  */
 # define might_sleep() \
 	do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
+/**
+ * cant_sleep - annotation for functions that cannot sleep
+ *
+ * this macro will print a stack trace if it is executed with preemption enabled
+ */
+# define cant_sleep() \
+	do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
 # define sched_annotate_sleep()	(current->task_state_change = 0)
 #else
   static inline void ___might_sleep(const char *file, int line,
@@ -266,6 +275,7 @@ extern int _cond_resched(void);
   static inline void __might_sleep(const char *file, int line,
 				   int preempt_offset) { }
 # define might_sleep() do { might_resched(); } while (0)
+# define cant_sleep() do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
 #endif
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a674c7db2f29..1dcbff62f973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6149,6 +6149,34 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
 	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
 }
 EXPORT_SYMBOL(___might_sleep);
+
+void __cant_sleep(const char *file, int line, int preempt_offset)
+{
+	static unsigned long prev_jiffy;
+
+	if (irqs_disabled())
+		return;
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_COUNT))
+		return;
+
+	if (preempt_count() > preempt_offset)
+		return;
+
+	if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
+		return;
+	prev_jiffy = jiffies;
+
+	printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line);
+	printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
+			in_atomic(), irqs_disabled(),
+			current->pid, current->comm);
+
+	debug_show_held_locks(current);
+	dump_stack();
+	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+}
+EXPORT_SYMBOL_GPL(__cant_sleep);
 #endif
 
 #ifdef CONFIG_MAGIC_SYSRQ
-- 
2.20.0



* [PATCH bpf-next v3 4/4] selftests: bpf: add test_lwt_ip_encap selftest
From: Peter Oskolkov @ 2019-01-29  1:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190129011217.192510-1-posk@google.com>

This patch adds a bpf self-test to cover BPF_LWT_ENCAP_IP mode
in bpf_lwt_push_encap.

Covered:
- encapping in LWT_IN and LWT_XMIT
- IPv4 and IPv6

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  84 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 3 files changed, 398 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh
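
The test program in this patch hard-codes its outer IPv4 addresses once
per byte order and fills in two different length fields. The userspace
sketch below shows the same arithmetic in an endian-neutral form. The
helper names (ipv4_net(), outer_v4_tot_len(), outer_v6_payload_len()) and
the SKETCH_* constants are hypothetical and are not part of the selftest.
The lengths are returned in host order, where the test would still apply
bpf_htons().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(uint32_t) == 4, "IPv4 addresses are four bytes");

#define SKETCH_IPV4_HDR_LEN 20	/* IPv4 header without options (ihl = 5) */
#define SKETCH_GRE_HDR_LEN   4	/* flags + protocol, as in struct grehdr */

/*
 * Build an IPv4 address whose in-memory bytes are a.b.c.d, i.e. network
 * byte order, regardless of host endianness.
 */
static uint32_t ipv4_net(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
	uint8_t bytes[4] = { a, b, c, d };
	uint32_t addr;

	memcpy(&addr, bytes, sizeof(addr));
	return addr;
}

/* IPv4 tot_len counts the outer IPv4 header itself plus everything after. */
static uint16_t outer_v4_tot_len(uint16_t inner_len)
{
	return (uint16_t)(inner_len + SKETCH_IPV4_HDR_LEN + SKETCH_GRE_HDR_LEN);
}

/* IPv6 payload_len excludes the fixed 40-byte IPv6 header. */
static uint16_t outer_v6_payload_len(uint16_t inner_len)
{
	return (uint16_t)(inner_len + SKETCH_GRE_HDR_LEN);
}
```

On a little-endian host ipv4_net(172, 16, 1, 100) reads back as
0x640110ac and on a big-endian host as 0xac100164, matching the two
constants in the #if branches of the test.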

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 89b0d1799ff3..277ca5ea23ac 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ BPF_OBJ_FILES = \
 	sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
-	xdp_dummy.o test_map_in_map.o
+	xdp_dummy.o test_map_in_map.o test_lwt_ip_encap.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
@@ -73,7 +73,8 @@ TEST_PROGS := test_kmod.sh \
 	test_lirc_mode2.sh \
 	test_skb_cgroup_id.sh \
 	test_flow_dissector.sh \
-	test_xdp_vlan.sh
+	test_xdp_vlan.sh \
+	test_lwt_ip_encap.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
 	tcp_client.py \
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.c b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
new file mode 100644
index 000000000000..2cd6bf9dd7e8
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+struct grehdr {
+	__be16 flags;
+	__be16 protocol;
+};
+
+SEC("encap_gre")
+int bpf_lwt_encap_gre(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct iphdr iph;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.iph.ihl = 5;
+	hdr.iph.version = 4;
+	hdr.iph.ttl = 0x40;
+	hdr.iph.protocol = 47;  /* IPPROTO_GRE */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	hdr.iph.saddr = 0x640110ac;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0x641010ac;  /* 172.16.16.100 */
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+	hdr.iph.saddr = 0xac100164;  /* 172.16.1.100 */
+	hdr.iph.daddr = 0xac101064;  /* 172.16.16.100 */
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+	hdr.iph.tot_len = bpf_htons(skb->len + sizeof(struct encap_hdr));
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+	return BPF_LWT_REROUTE;
+}
+
+SEC("encap_gre6")
+int bpf_lwt_encap_gre6(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct ipv6hdr ip6hdr;
+		struct grehdr greh;
+	} hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(struct encap_hdr));
+
+	hdr.ip6hdr.version = 6;
+	hdr.ip6hdr.payload_len = bpf_htons(skb->len + sizeof(struct grehdr));
+	hdr.ip6hdr.nexthdr = 47;  /* IPPROTO_GRE */
+	hdr.ip6hdr.hop_limit = 0x40;
+	/* fb01::1 */
+	hdr.ip6hdr.saddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.saddr.s6_addr[1] = 1;
+	hdr.ip6hdr.saddr.s6_addr[15] = 1;
+	/* fb10::1 */
+	hdr.ip6hdr.daddr.s6_addr[0] = 0xfb;
+	hdr.ip6hdr.daddr.s6_addr[1] = 0x10;
+	hdr.ip6hdr.daddr.s6_addr[15] = 1;
+
+	hdr.greh.protocol = skb->protocol;
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr,
+				 sizeof(struct encap_hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
new file mode 100755
index 000000000000..4ca714e23ab0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -0,0 +1,311 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Setup/topology:
+#
+#    NS1             NS2             NS3
+#   veth1 <---> veth2   veth3 <---> veth4 (the top route)
+#   veth5 <---> veth6   veth7 <---> veth8 (the bottom route)
+#
+#   each vethN gets IPv[4|6]_N address
+#
+#   IPv*_SRC = IPv*_1
+#   IPv*_DST = IPv*_4
+#
+#   all tests test pings from IPv*_SRC to IPv*_DST
+#
+#   by default, routes are configured to allow packets to go
+#   IP*_1 <=> IP*_2 <=> IP*_3 <=> IP*_4 (the top route)
+#
+#   a GRE device is installed in NS3 with IPv*_GRE, and
+#   NS1/NS2 are configured to route packets to IPv*_GRE via IP*_8
+#   (the bottom route)
+#
+# Tests:
+#
+#   1. routes NS2->IPv*_DST are brought down, so the only way a ping
+#      from IP*_SRC to IP*_DST can work is via IPv*_GRE
+#
+#   2a. in an egress test, a bpf LWT_XMIT program is installed on veth1
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth1:egress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+#
+#   2b. in an ingress test, a bpf LWT_IN program is installed on veth2
+#       that encaps the packets with an IP/GRE header to route to IPv*_GRE
+#
+#       ping: SRC->[encap at veth2:ingress]->GRE:decap->DST
+#       ping replies go DST->SRC directly
+
+set -e  # exit on error
+
+if [[ $EUID -ne 0 ]]; then
+	echo "This script must be run as root"
+	echo "FAIL"
+	exit 1
+fi
+
+readonly NS1="ns1-$(mktemp -u XXXXXX)"
+readonly NS2="ns2-$(mktemp -u XXXXXX)"
+readonly NS3="ns3-$(mktemp -u XXXXXX)"
+
+readonly IPv4_1="172.16.1.100"
+readonly IPv4_2="172.16.2.100"
+readonly IPv4_3="172.16.3.100"
+readonly IPv4_4="172.16.4.100"
+readonly IPv4_5="172.16.5.100"
+readonly IPv4_6="172.16.6.100"
+readonly IPv4_7="172.16.7.100"
+readonly IPv4_8="172.16.8.100"
+readonly IPv4_GRE="172.16.16.100"
+
+readonly IPv4_SRC=$IPv4_1
+readonly IPv4_DST=$IPv4_4
+
+readonly IPv6_1="fb01::1"
+readonly IPv6_2="fb02::1"
+readonly IPv6_3="fb03::1"
+readonly IPv6_4="fb04::1"
+readonly IPv6_5="fb05::1"
+readonly IPv6_6="fb06::1"
+readonly IPv6_7="fb07::1"
+readonly IPv6_8="fb08::1"
+readonly IPv6_GRE="fb10::1"
+
+readonly IPv6_SRC=$IPv6_1
+readonly IPv6_DST=$IPv6_4
+
+setup() {
+	set -e  # exit on error
+	# create devices and namespaces
+	ip netns add "${NS1}"
+	ip netns add "${NS2}"
+	ip netns add "${NS3}"
+
+	ip link add veth1 type veth peer name veth2
+	ip link add veth3 type veth peer name veth4
+	ip link add veth5 type veth peer name veth6
+	ip link add veth7 type veth peer name veth8
+
+	ip netns exec ${NS2} sysctl -wq net.ipv4.ip_forward=1
+	ip netns exec ${NS2} sysctl -wq net.ipv6.conf.all.forwarding=1
+
+	ip link set veth1 netns ${NS1}
+	ip link set veth2 netns ${NS2}
+	ip link set veth3 netns ${NS2}
+	ip link set veth4 netns ${NS3}
+	ip link set veth5 netns ${NS1}
+	ip link set veth6 netns ${NS2}
+	ip link set veth7 netns ${NS2}
+	ip link set veth8 netns ${NS3}
+
+	# configure addresses: the top route (1-2-3-4)
+	ip -netns ${NS1}    addr add ${IPv4_1}/24  dev veth1
+	ip -netns ${NS2}    addr add ${IPv4_2}/24  dev veth2
+	ip -netns ${NS2}    addr add ${IPv4_3}/24  dev veth3
+	ip -netns ${NS3}    addr add ${IPv4_4}/24  dev veth4
+	ip -netns ${NS1} -6 addr add ${IPv6_1}/128 nodad dev veth1
+	ip -netns ${NS2} -6 addr add ${IPv6_2}/128 nodad dev veth2
+	ip -netns ${NS2} -6 addr add ${IPv6_3}/128 nodad dev veth3
+	ip -netns ${NS3} -6 addr add ${IPv6_4}/128 nodad dev veth4
+
+	# configure addresses: the bottom route (5-6-7-8)
+	ip -netns ${NS1}    addr add ${IPv4_5}/24  dev veth5
+	ip -netns ${NS2}    addr add ${IPv4_6}/24  dev veth6
+	ip -netns ${NS2}    addr add ${IPv4_7}/24  dev veth7
+	ip -netns ${NS3}    addr add ${IPv4_8}/24  dev veth8
+	ip -netns ${NS1} -6 addr add ${IPv6_5}/128 nodad dev veth5
+	ip -netns ${NS2} -6 addr add ${IPv6_6}/128 nodad dev veth6
+	ip -netns ${NS2} -6 addr add ${IPv6_7}/128 nodad dev veth7
+	ip -netns ${NS3} -6 addr add ${IPv6_8}/128 nodad dev veth8
+
+
+	ip -netns ${NS1} link set dev veth1 up
+	ip -netns ${NS2} link set dev veth2 up
+	ip -netns ${NS2} link set dev veth3 up
+	ip -netns ${NS3} link set dev veth4 up
+	ip -netns ${NS1} link set dev veth5 up
+	ip -netns ${NS2} link set dev veth6 up
+	ip -netns ${NS2} link set dev veth7 up
+	ip -netns ${NS3} link set dev veth8 up
+
+	# configure routes: IP*_SRC -> veth1/IP*_2 (= top route) default;
+	# the bottom route to specific bottom addresses
+
+	# NS1
+	# top route
+	ip -netns ${NS1}    route add ${IPv4_2}/32  dev veth1
+	ip -netns ${NS1}    route add default dev veth1 via ${IPv4_2}  # go top by default
+	ip -netns ${NS1} -6 route add ${IPv6_2}/128 dev veth1
+	ip -netns ${NS1} -6 route add default dev veth1 via ${IPv6_2}  # go top by default
+	# bottom route
+	ip -netns ${NS1}    route add ${IPv4_6}/32  dev veth5
+	ip -netns ${NS1}    route add ${IPv4_7}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1}    route add ${IPv4_8}/32  dev veth5 via ${IPv4_6}
+	ip -netns ${NS1} -6 route add ${IPv6_6}/128 dev veth5
+	ip -netns ${NS1} -6 route add ${IPv6_7}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS1} -6 route add ${IPv6_8}/128 dev veth5 via ${IPv6_6}
+
+	# NS2
+	# top route
+	ip -netns ${NS2}    route add ${IPv4_1}/32  dev veth2
+	ip -netns ${NS2}    route add ${IPv4_4}/32  dev veth3
+	ip -netns ${NS2} -6 route add ${IPv6_1}/128 dev veth2
+	ip -netns ${NS2} -6 route add ${IPv6_4}/128 dev veth3
+	# bottom route
+	ip -netns ${NS2}    route add ${IPv4_5}/32  dev veth6
+	ip -netns ${NS2}    route add ${IPv4_8}/32  dev veth7
+	ip -netns ${NS2} -6 route add ${IPv6_5}/128 dev veth6
+	ip -netns ${NS2} -6 route add ${IPv6_8}/128 dev veth7
+
+	# NS3
+	# top route
+	ip -netns ${NS3}    route add ${IPv4_3}/32  dev veth4
+	ip -netns ${NS3}    route add ${IPv4_1}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3}    route add ${IPv4_2}/32  dev veth4 via ${IPv4_3}
+	ip -netns ${NS3} -6 route add ${IPv6_3}/128 dev veth4
+	ip -netns ${NS3} -6 route add ${IPv6_1}/128 dev veth4 via ${IPv6_3}
+	ip -netns ${NS3} -6 route add ${IPv6_2}/128 dev veth4 via ${IPv6_3}
+	# bottom route
+	ip -netns ${NS3}    route add ${IPv4_7}/32  dev veth8
+	ip -netns ${NS3}    route add ${IPv4_5}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3}    route add ${IPv4_6}/32  dev veth8 via ${IPv4_7}
+	ip -netns ${NS3} -6 route add ${IPv6_7}/128 dev veth8
+	ip -netns ${NS3} -6 route add ${IPv6_5}/128 dev veth8 via ${IPv6_7}
+	ip -netns ${NS3} -6 route add ${IPv6_6}/128 dev veth8 via ${IPv6_7}
+
+	# configure IPv4 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255
+	ip -netns ${NS3} link set gre_dev up
+	ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
+	ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
+	ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
+
+
+	# configure IPv6 GRE device in NS3, and a route to it via the "bottom" route
+	ip -netns ${NS3} -6 tunnel add name gre6_dev mode ip6gre remote ${IPv6_1} local ${IPv6_GRE} ttl 255
+	ip -netns ${NS3} link set gre6_dev up
+	ip -netns ${NS3} -6 addr add ${IPv6_GRE} nodad dev gre6_dev
+	ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6}
+	ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8}
+
+	# rp_filter gets confused by what these tests are doing, so disable it
+	ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS2} sysctl -wq net.ipv4.conf.all.rp_filter=0
+	ip netns exec ${NS3} sysctl -wq net.ipv4.conf.all.rp_filter=0
+}
+
+cleanup() {
+	ip netns del ${NS1} 2> /dev/null
+	ip netns del ${NS2} 2> /dev/null
+	ip netns del ${NS3} 2> /dev/null
+}
+
+trap cleanup EXIT
+
+test_ping() {
+	local -r PROTO=$1
+	local -r EXPECTED=$2
+	local RET=0
+
+	set +e
+	if [ "${PROTO}" == "IPv4" ] ; then
+		ip netns exec ${NS1} ping  -c 1 -W 1 -I ${IPv4_SRC} ${IPv4_DST} > /dev/null 2>&1
+		RET=$?
+	elif [ "${PROTO}" == "IPv6" ] ; then
+		ip netns exec ${NS1} ping6 -c 1 -W 6 -I ${IPv6_SRC} ${IPv6_DST} > /dev/null 2>&1
+		RET=$?
+	else
+		echo "test_ping: unknown PROTO: ${PROTO}"
+		exit 1
+	fi
+	set -e
+
+	if [ "0" != "${RET}" ]; then
+		RET=1
+	fi
+
+	if [ "${EXPECTED}" != "${RET}" ] ; then
+		echo "FAIL: test_ping: ${RET}"
+		exit 1
+	fi
+}
+
+test_egress() {
+	local -r ENCAP=$1
+	echo "starting egress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, pings fail
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre dev veth1
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS1} route add ${IPv4_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+		ip -netns ${NS1} -6 route add ${IPv6_DST} encap bpf xmit obj test_lwt_ip_encap.o sec encap_gre6 dev veth1
+	else
+		echo "FAIL: unknown encap ${ENCAP}"; exit 1
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_ingress() {
+	local -r ENCAP=$1
+	echo "starting ingress ${ENCAP} encap test"
+	setup
+
+	# need to wait a bit for IPv6 to autoconf, otherwise
+	# ping6 sometimes fails with "unable to bind to address"
+
+	# by default, pings work
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	# remove NS2->DST routes, pings fail
+	ip -netns ${NS2}    route del ${IPv4_DST}/32  dev veth3
+	ip -netns ${NS2} -6 route del ${IPv6_DST}/128 dev veth3
+	test_ping IPv4 1
+	test_ping IPv6 1
+
+	# install replacement routes (LWT/eBPF), pings succeed
+	if [ "${ENCAP}" == "IPv4" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre dev veth2
+	elif [ "${ENCAP}" == "IPv6" ] ; then
+		ip -netns ${NS2} route add ${IPv4_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+		ip -netns ${NS2} -6 route add ${IPv6_DST} encap bpf in obj test_lwt_ip_encap.o sec encap_gre6 dev veth2
+	else
+		echo "FAIL: unknown encap ${ENCAP}"; exit 1
+	fi
+	test_ping IPv4 0
+	test_ping IPv6 0
+
+	cleanup
+	echo "PASS"
+}
+
+test_egress IPv4
+test_egress IPv6
+
+test_ingress IPv4
+test_ingress IPv6
+
+echo "all tests passed"
-- 
2.20.1.495.gaa96b0ce6b-goog



* [PATCH bpf-next v3 3/4] bpf: sync <kdir>/<uapi>/bpf.h with tools/<uapi>/bpf.h
From: Peter Oskolkov @ 2019-01-29  1:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190129011217.192510-1-posk@google.com>

This patch copies the bpf.h changes made by an earlier patch in this
patchset from the kernel uapi include dir into the tools uapi include
dir.

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 tools/include/uapi/linux/bpf.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 60b99b730a41..c4fee8b45762 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2015,6 +2015,16 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2495,7 +2505,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2583,7 +2594,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb's dst
+	 *    has changed and appropriate dst_input() or dst_output()
+	 *    action has to be taken (this is an L3 redirect, as
+	 *    opposed to L2 redirect represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
-- 
2.20.1.495.gaa96b0ce6b-goog



* [PATCH bpf-next v3 2/4] bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-01-29  1:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190129011217.192510-1-posk@google.com>

This patch implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with a different, dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).
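
As a rough illustration of what "IP encapsulation headers" have to look
like to be accepted, the up-front checks in bpf_lwt_push_ip_encap() can
be sketched in user space as below. This is only a sketch: all sketch_*
names are hypothetical, the 256-byte limit is an assumed stand-in for
LWT_BPF_MAX_HEADROOM, and the header sizes are hard-coded instead of
taken from struct iphdr / struct ipv6hdr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel definitions are authoritative. */
#define SKETCH_MAX_HEADROOM 256u	/* stand-in for LWT_BPF_MAX_HEADROOM */
#define SKETCH_IPV4_MIN_LEN 20u		/* IPv4 header without options */
#define SKETCH_IPV6_HDR_LEN 40u		/* fixed IPv6 header */

/* Returns 4 or 6 for an acceptable outer header, -1 otherwise. */
static int sketch_check_encap_hdr(const uint8_t *hdr, size_t len)
{
	unsigned int version, ihl;

	assert(hdr != NULL);

	/* All prepended headers together must fit into the headroom limit. */
	if (len < SKETCH_IPV4_MIN_LEN || len > SKETCH_MAX_HEADROOM)
		return -1;

	version = hdr[0] >> 4;		/* high nibble of the first byte */
	ihl = hdr[0] & 0x0f;		/* IPv4 header length in 32-bit words */

	if (version == 4)
		return (ihl * 4 > len) ? -1 : 4;
	if (version == 6)
		return (len < SKETCH_IPV6_HDR_LEN) ? -1 : 6;
	return -1;
}
```

Anything after the outer IPv4/IPv6 header (GRE, UDP for GUE, etc.) is
not inspected by these checks; only the total length is bounded.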

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/net/lwtunnel.h |   3 +
 net/core/filter.c      |   3 +-
 net/core/lwt_bpf.c     | 148 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 153 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 33fd9ba7e0e5..f0973eca8036 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -126,6 +126,8 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
 int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb);
 int lwtunnel_input(struct sk_buff *skb);
 int lwtunnel_xmit(struct sk_buff *skb);
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			  bool ingress);
 
 static inline void lwtunnel_set_redirect(struct dst_entry *dst)
 {
@@ -138,6 +140,7 @@ static inline void lwtunnel_set_redirect(struct dst_entry *dst)
 		dst->input = lwtunnel_input;
 	}
 }
+
 #else
 
 static inline void lwtstate_free(struct lwtunnel_state *lws)
diff --git a/net/core/filter.c b/net/core/filter.c
index fd3ae092d3d7..81d18660c38b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -73,6 +73,7 @@
 #include <linux/seg6_local.h>
 #include <net/seg6.h>
 #include <net/seg6_local.h>
+#include <net/lwtunnel.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -4796,7 +4797,7 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
 			     bool ingress)
 {
-	return -EINVAL;  /* Implemented in the next patch. */
+	return bpf_lwt_push_ip_encap(skb, hdr, len, ingress);
 }
 
 BPF_CALL_4(bpf_lwt_in_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 3e85437f7106..a9ff71aa4566 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -16,6 +16,7 @@
 #include <linux/types.h>
 #include <linux/bpf.h>
 #include <net/lwtunnel.h>
+#include <net/ip6_route.h>
 
 struct bpf_lwt_prog {
 	struct bpf_prog *prog;
@@ -55,6 +56,7 @@ static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
 
 	switch (ret) {
 	case BPF_OK:
+	case BPF_LWT_REROUTE:
 		break;
 
 	case BPF_REDIRECT:
@@ -97,6 +99,8 @@ static int bpf_input(struct sk_buff *skb)
 		ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
 		if (ret < 0)
 			return ret;
+		if (ret == BPF_LWT_REROUTE)
+			return dst_input(skb);
 	}
 
 	if (unlikely(!dst->lwtstate->orig_input)) {
@@ -168,6 +172,13 @@ static int bpf_xmit(struct sk_buff *skb)
 			return LWTUNNEL_XMIT_CONTINUE;
 		case BPF_REDIRECT:
 			return LWTUNNEL_XMIT_DONE;
+		case BPF_LWT_REROUTE:
+			ret = dst_output(dev_net(skb_dst(skb)->dev),
+					 skb->sk, skb);
+			if (unlikely(ret))
+				return ret;
+			/* ip[6]_finish_output2 understand LWTUNNEL_XMIT_DONE */
+			return LWTUNNEL_XMIT_DONE;
 		default:
 			return ret;
 		}
@@ -389,6 +400,143 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
 	.owner		= THIS_MODULE,
 };
 
+int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
+{
+	struct dst_entry *dst = NULL;
+	struct iphdr *iph;
+	bool ipv4;
+	int err;
+
+	if (unlikely(len < sizeof(struct iphdr) || len > LWT_BPF_MAX_HEADROOM))
+		return -EINVAL;
+
+	/* validate protocol and length */
+	iph = (struct iphdr *)hdr;
+	if (iph->version == 4) {
+		ipv4 = true;
+		if (iph->ihl * 4 > len)
+			return -EINVAL;
+	} else if (iph->version == 6) {
+		ipv4 = false;
+		if (unlikely(len < sizeof(struct ipv6hdr)))
+			return -EINVAL;
+	} else {
+		return -EINVAL;
+	}
+
+	/* allocate enough space for the encap headers + L2 hdr */
+	if (ingress) {
+		err = skb_cow_head(skb, len + skb->mac_len);
+		if (unlikely(err))
+			return err;
+	} else {
+		/* ip_route_input_noref below does route lookup and dst
+		 * drop/set for ingress. There is no similar function for
+		 * egress, so we need to do route lookup and replace skb's
+		 * dst in this function.
+		 */
+		struct net_device *l3mdev =
+			l3mdev_master_dev_rcu(skb_dst(skb)->dev);
+		int oif = l3mdev ? l3mdev->ifindex : 0;
+		struct sock *sk;
+		struct net *net;
+
+		sk = sk_to_full_sk(skb->sk);
+		if (sk) {
+			if (sk->sk_bound_dev_if)
+				oif = sk->sk_bound_dev_if;
+			net = sock_net(sk);
+		} else {
+			net = dev_net(skb_dst(skb)->dev);
+		}
+
+		if (ipv4) {
+			struct flowi4 fl4 = {0};
+			struct rtable *rt;
+
+			fl4.flowi4_oif = oif;
+			fl4.flowi4_mark = skb->mark;
+			fl4.flowi4_uid = sock_net_uid(net, sk);
+			fl4.flowi4_tos = RT_TOS(iph->tos);
+			fl4.flowi4_flags = FLOWI_FLAG_ANYSRC;
+			fl4.flowi4_proto = iph->protocol;
+			fl4.daddr = iph->daddr;
+			fl4.saddr = iph->saddr;
+
+			rt = ip_route_output_key(net, &fl4);
+			if (IS_ERR(rt) || rt->dst.error)
+				return -EINVAL;
+			dst = &rt->dst;
+		} else {
+			struct ipv6hdr *iph6 = (struct ipv6hdr *)hdr;
+			struct flowi6 fl6 = {0};
+
+			fl6.flowi6_oif = oif;
+			fl6.flowi6_mark = skb->mark;
+			fl6.flowi6_uid = sock_net_uid(net, sk);
+			fl6.flowlabel = ip6_flowinfo(iph6);
+			fl6.flowi6_proto = iph6->nexthdr;
+			fl6.daddr = iph6->daddr;
+			fl6.saddr = iph6->saddr;
+
+			dst = ip6_route_output(net, skb->sk, &fl6);
+			if (IS_ERR(dst) || dst->error)
+				return -EINVAL;
+		}
+
+		err = skb_cow_head(skb, len + LL_RESERVED_SPACE(dst->dev));
+		if (unlikely(err))
+			return err;
+	}
+
+	/* push the encap headers and fix pointers */
+	skb_reset_inner_headers(skb);
+	skb->encapsulation = 1;
+	skb_push(skb, len);
+	if (ingress)
+		skb_postpush_rcsum(skb, iph, len);
+	skb_reset_network_header(skb);
+	memcpy(skb_network_header(skb), hdr, len);
+	bpf_compute_data_pointers(skb);
+
+	/* final skb touches + routing */
+	if (ipv4) {
+		skb->protocol = htons(ETH_P_IP);
+		iph = ip_hdr(skb);
+		if (iph->ihl * 4 < len)
+			skb_set_transport_header(skb, iph->ihl * 4);
+
+		if (!iph->check)
+			iph->check = ip_fast_csum((unsigned char *)iph,
+						  iph->ihl);
+
+		if (ingress) {
+			err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
+						   iph->tos, skb_dst(skb)->dev);
+			if (err)
+				return err;
+		} else {
+			skb_dst_drop(skb);
+			skb_dst_set(skb, dst);
+		}
+	} else {
+		skb->protocol = htons(ETH_P_IPV6);
+		if (sizeof(struct ipv6hdr) < len)
+			skb_set_transport_header(skb, sizeof(struct ipv6hdr));
+
+		if (ingress) {
+			ip6_route_input(skb);
+			if (skb_dst(skb)->error)
+				return skb_dst(skb)->error;
+		} else {
+			skb_dst_drop(skb);
+			skb_dst_set(skb, dst);
+		}
+	}
+
+	return 0;
+}
+
 static int __init bpf_lwt_init(void)
 {
 	return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
-- 
2.20.1.495.gaa96b0ce6b-goog



* [PATCH bpf-next v3 1/4] bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-01-29  1:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov
In-Reply-To: <20190129011217.192510-1-posk@google.com>

This patch adds all needed plumbing in preparation for allowing
bpf programs to do IP encapping via bpf_lwt_push_encap. The actual
implementation is added in the next patch in the patchset.

Of note:
- bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
  prog types in addition to BPF_PROG_TYPE_LWT_IN;
- as route lookups are different for ingress vs egress, the single
  external bpf_lwt_push_encap BPF helper is routed internally to
  either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs,
  depending on prog type.
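
The resulting split can be summarized with a small user-space sketch
of which encap mode is accepted on which hook. The sketch_* names are
hypothetical, the enum is a local mirror of enum bpf_lwt_encap_mode,
and the CONFIG_IPV6_SEG6_BPF / CONFIG_LWTUNNEL_BPF guards are ignored
(i.e. both are assumed to be enabled).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Local mirror of enum bpf_lwt_encap_mode as extended by this patch. */
enum sketch_encap_mode {
	SKETCH_ENCAP_SEG6,
	SKETCH_ENCAP_SEG6_INLINE,
	SKETCH_ENCAP_IP,
};

/* Returns 0 if the mode is accepted on the given hook, -EINVAL otherwise. */
static int sketch_mode_allowed(enum sketch_encap_mode mode, bool ingress)
{
	switch (mode) {
	case SKETCH_ENCAP_SEG6:
	case SKETCH_ENCAP_SEG6_INLINE:
		return ingress ? 0 : -EINVAL;	/* LWT_IN only */
	case SKETCH_ENCAP_IP:
		return 0;			/* LWT_IN and LWT_XMIT */
	default:
		return -EINVAL;
	}
}

/* Usage: the xmit hook rejects SEG6 modes but accepts IP encap. */
static void sketch_demo(void)
{
	assert(sketch_mode_allowed(SKETCH_ENCAP_IP, false) == 0);
	assert(sketch_mode_allowed(SKETCH_ENCAP_SEG6, false) == -EINVAL);
}
```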

Signed-off-by: Peter Oskolkov <posk@google.com>
---
 include/uapi/linux/bpf.h | 23 ++++++++++++++++++--
 net/core/filter.c        | 46 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 62 insertions(+), 7 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60b99b730a41..c4fee8b45762 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2015,6 +2015,16 @@ union bpf_attr {
  *			Only works if *skb* contains an IPv6 packet. Insert a
  *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
  *			the IPv6 header.
+ *		**BPF_LWT_ENCAP_IP**
+ *			IP encapsulation (GRE/GUE/IPIP/etc). The outer header
+ *			must be IPv4 or IPv6, followed by zero or more
+ *			additional headers, up to LWT_BPF_MAX_HEADROOM total
+ *			bytes in all prepended headers.
+ *
+ *		BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of
+ *		type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called
+ *		by bpf programs of types BPF_PROG_TYPE_LWT_IN and
+ *		BPF_PROG_TYPE_LWT_XMIT.
  *
  * 		A call to this helper is susceptible to change the underlaying
  * 		packet buffer. Therefore, at load time, all checks on pointers
@@ -2495,7 +2505,8 @@ enum bpf_hdr_start_off {
 /* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
 enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_SEG6,
-	BPF_LWT_ENCAP_SEG6_INLINE
+	BPF_LWT_ENCAP_SEG6_INLINE,
+	BPF_LWT_ENCAP_IP,
 };
 
 #define __bpf_md_ptr(type, name)	\
@@ -2583,7 +2594,15 @@ enum bpf_ret_code {
 	BPF_DROP = 2,
 	/* 3-6 reserved */
 	BPF_REDIRECT = 7,
-	/* >127 are reserved for prog type specific return codes */
+	/* >127 are reserved for prog type specific return codes.
+	 *
+	 * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and
+	 *    BPF_PROG_TYPE_LWT_XMIT to indicate that skb's dst
+	 *    has changed and appropriate dst_input() or dst_output()
+	 *    action has to be taken (this is an L3 redirect, as
+	 *    opposed to L2 redirect represented by BPF_REDIRECT above).
+	 */
+	BPF_LWT_REROUTE = 128,
 };
 
 struct bpf_sock {
diff --git a/net/core/filter.c b/net/core/filter.c
index 8e587dd1da20..fd3ae092d3d7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4793,7 +4793,13 @@ static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len
 }
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
-BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
+static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len,
+			     bool ingress)
+{
+	return -EINVAL;  /* Implemented in the next patch. */
+}
+
+BPF_CALL_4(bpf_lwt_in_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	   u32, len)
 {
 	switch (type) {
@@ -4801,14 +4807,41 @@ BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
 	case BPF_LWT_ENCAP_SEG6:
 	case BPF_LWT_ENCAP_SEG6_INLINE:
 		return bpf_push_seg6_encap(skb, type, hdr, len);
+#endif
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, true /* ingress */);
 #endif
 	default:
 		return -EINVAL;
 	}
 }
 
-static const struct bpf_func_proto bpf_lwt_push_encap_proto = {
-	.func		= bpf_lwt_push_encap,
+BPF_CALL_4(bpf_lwt_xmit_push_encap, struct sk_buff *, skb, u32, type,
+	   void *, hdr, u32, len)
+{
+	switch (type) {
+#if IS_ENABLED(CONFIG_LWTUNNEL_BPF)
+	case BPF_LWT_ENCAP_IP:
+		return bpf_push_ip_encap(skb, hdr, len, false /* egress */);
+#endif
+	default:
+		return -EINVAL;
+	}
+}
+
+static const struct bpf_func_proto bpf_lwt_in_push_encap_proto = {
+	.func		= bpf_lwt_in_push_encap,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+static const struct bpf_func_proto bpf_lwt_xmit_push_encap_proto = {
+	.func		= bpf_lwt_xmit_push_encap,
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
@@ -5274,7 +5307,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_lwt_seg6_adjust_srh ||
 	    func == bpf_lwt_seg6_action ||
 #endif
-	    func == bpf_lwt_push_encap)
+	    func == bpf_lwt_in_push_encap ||
+	    func == bpf_lwt_xmit_push_encap)
 		return true;
 
 	return false;
@@ -5652,7 +5686,7 @@ lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_lwt_push_encap:
-		return &bpf_lwt_push_encap_proto;
+		return &bpf_lwt_in_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
@@ -5688,6 +5722,8 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_l4_csum_replace_proto;
 	case BPF_FUNC_set_hash_invalid:
 		return &bpf_set_hash_invalid_proto;
+	case BPF_FUNC_lwt_push_encap:
+		return &bpf_lwt_xmit_push_encap_proto;
 	default:
 		return lwt_out_func_proto(func_id, prog);
 	}
-- 
2.20.1.495.gaa96b0ce6b-goog



* [PATCH bpf-next v3 0/4] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: Peter Oskolkov @ 2019-01-29  1:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, David Ahern, Willem de Bruijn, Peter Oskolkov

This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with a different, dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

V2 changes: Added flowi-based route lookup, IPv6 encapping, and
encapping on ingress.

V3 changes: incorporated David Ahern's suggestions:
  - added l3mdev check/oif (patch 2)
  - sync bpf.h from include/uapi into tools/include/uapi
  - selftest tweaks


Peter Oskolkov (4):
  bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap
  bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
  bpf: sync <kdir>/<uapi>/bpf.h with tools/<uapi>/bpf.h
  selftests: bpf: add test_lwt_ip_encap selftest

 include/net/lwtunnel.h                        |   3 +
 include/uapi/linux/bpf.h                      |  23 +-
 net/core/filter.c                             |  47 ++-
 net/core/lwt_bpf.c                            | 147 +++++++++
 tools/include/uapi/linux/bpf.h                |  23 +-
 tools/testing/selftests/bpf/Makefile          |   5 +-
 .../testing/selftests/bpf/test_lwt_ip_encap.c |  84 +++++
 .../selftests/bpf/test_lwt_ip_encap.sh        | 311 ++++++++++++++++++
 8 files changed, 632 insertions(+), 11 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_ip_encap.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_ip_encap.sh

-- 
2.20.1.495.gaa96b0ce6b-goog



* [PATCH v2 net 7/7] virtio_net: Differentiate sk_buff and xdp_frame on freeing
From: Toshiaki Makita @ 2019-01-29  0:45 UTC (permalink / raw)
  To: David S. Miller, Michael S. Tsirkin, Jason Wang
  Cc: Toshiaki Makita, netdev, virtualization
In-Reply-To: <1548722759-2470-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

We do not reset or free up unused buffers when enabling/disabling XDP,
so it can happen that xdp_frames are freed after disabling XDP or
sk_buffs are freed after enabling XDP on xdp tx queues.
Thus we need to handle both forms (xdp_frames and sk_buffs) regardless
of the XDP setting.
One way to trigger this problem is to disable XDP when napi_tx is
enabled. In that case, virtnet_xdp_set() calls virtnet_napi_enable()
which kicks NAPI. The NAPI handler will call virtnet_poll_cleantx()
which invokes free_old_xmit_skbs() for queues which have been used by
XDP.

Note that even with this change we need to keep skipping
free_old_xmit_skbs() from NAPI handlers when XDP is enabled, because XDP
tx queues do not acquire queue locks.
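
The way both forms are told apart is low-bit pointer tagging. A
minimal user-space sketch of the idea follows; the sketch_* names and
the two stand-in structs are hypothetical, and it assumes (as the patch
does for xdp_frame and sk_buff) that the objects are at least 2-byte
aligned so bit 0 of a real pointer is always zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Lowest pointer bit, free when objects are at least 2-byte aligned. */
#define SKETCH_XDP_FLAG ((uintptr_t)1)

struct sketch_frame { int len; };	/* stand-in for struct xdp_frame */
struct sketch_skb { int len; };		/* stand-in for struct sk_buff */

static bool sketch_is_frame(void *ptr)
{
	return ((uintptr_t)ptr & SKETCH_XDP_FLAG) != 0;
}

static void *sketch_frame_to_ptr(struct sketch_frame *f)
{
	/* Tagging only works if bit 0 of the real address is zero. */
	assert(((uintptr_t)f & SKETCH_XDP_FLAG) == 0);
	return (void *)((uintptr_t)f | SKETCH_XDP_FLAG);
}

static struct sketch_frame *sketch_ptr_to_frame(void *ptr)
{
	return (struct sketch_frame *)((uintptr_t)ptr & ~SKETCH_XDP_FLAG);
}

/* Returns the byte count of one completed buffer, whichever type it is. */
static int sketch_completed_len(void *ptr)
{
	if (sketch_is_frame(ptr))
		return sketch_ptr_to_frame(ptr)->len;
	return ((struct sketch_skb *)ptr)->len;
}
```

Only the xdp_frame side is tagged when it is queued, so untagged
pointers can be treated as sk_buffs without touching the skb tx path.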

- v2: Use napi_consume_skb() instead of dev_consume_skb_any()

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
NOTE: Dropped Acked-by because of the v2 change.

 drivers/net/virtio_net.c | 54 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 42 insertions(+), 12 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1d454ce..2594481 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -57,6 +57,8 @@
 #define VIRTIO_XDP_TX		BIT(0)
 #define VIRTIO_XDP_REDIR	BIT(1)
 
+#define VIRTIO_XDP_FLAG	BIT(0)
+
 /* RX packet size EWMA. The average packet size is used to determine the packet
  * buffer size when refilling RX rings. As the entire RX ring may be refilled
  * at once, the weight is chosen so that the EWMA will be insensitive to short-
@@ -252,6 +254,21 @@ struct padded_vnet_hdr {
 	char padding[4];
 };
 
+static bool is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & VIRTIO_XDP_FLAG;
+}
+
+static void *xdp_to_ptr(struct xdp_frame *ptr)
+{
+	return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
+}
+
+static struct xdp_frame *ptr_to_xdp(void *ptr)
+{
+	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
+}
+
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -462,7 +479,8 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
 
 	sg_init_one(sq->sg, xdpf->data, xdpf->len);
 
-	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
+	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdpf),
+				   GFP_ATOMIC);
 	if (unlikely(err))
 		return -ENOSPC; /* Caller handle free/refcnt */
 
@@ -482,13 +500,13 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct receive_queue *rq = vi->rq;
-	struct xdp_frame *xdpf_sent;
 	struct bpf_prog *xdp_prog;
 	struct send_queue *sq;
 	unsigned int len;
 	int drops = 0;
 	int kicks = 0;
 	int ret, err;
+	void *ptr;
 	int i;
 
 	/* Only allow ndo_xdp_xmit if XDP is loaded on dev, as this
@@ -507,8 +525,12 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 	}
 
 	/* Free up any pending old buffers before queueing new ones. */
-	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
-		xdp_return_frame(xdpf_sent);
+	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		if (likely(is_xdp_frame(ptr)))
+			xdp_return_frame(ptr_to_xdp(ptr));
+		else
+			napi_consume_skb(ptr, false);
+	}
 
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *xdpf = frames[i];
@@ -1329,18 +1351,26 @@ static int virtnet_receive(struct receive_queue *rq, int budget,
 
 static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
 {
-	struct sk_buff *skb;
 	unsigned int len;
 	unsigned int packets = 0;
 	unsigned int bytes = 0;
+	void *ptr;
 
-	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		pr_debug("Sent skb %p\n", skb);
+	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		if (likely(!is_xdp_frame(ptr))) {
+			struct sk_buff *skb = ptr;
 
-		bytes += skb->len;
-		packets++;
+			pr_debug("Sent skb %p\n", skb);
 
-		napi_consume_skb(skb, in_napi);
+			bytes += skb->len;
+			napi_consume_skb(skb, in_napi);
+		} else {
+			struct xdp_frame *frame = ptr_to_xdp(ptr);
+
+			bytes += frame->len;
+			xdp_return_frame(frame);
+		}
+		packets++;
 	}
 
 	/* Avoid overhead when no packets have been processed
@@ -2666,10 +2696,10 @@ static void free_unused_bufs(struct virtnet_info *vi)
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		struct virtqueue *vq = vi->sq[i].vq;
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (!is_xdp_raw_buffer_queue(vi, i))
+			if (!is_xdp_frame(buf))
 				dev_kfree_skb(buf);
 			else
-				xdp_return_frame(buf);
+				xdp_return_frame(ptr_to_xdp(buf));
 		}
 	}
 
-- 
1.8.3.1




* [PATCH v2 net 6/7] virtio_net: Use xdp_return_frame to free xdp_frames on destroying vqs
From: Toshiaki Makita @ 2019-01-29  0:45 UTC (permalink / raw)
  To: David S. Miller, Michael S. Tsirkin, Jason Wang
  Cc: Toshiaki Makita, netdev, virtualization, Jesper Dangaard Brouer
In-Reply-To: <1548722759-2470-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

put_page() can work as a fallback for freeing xdp_frames, but the
appropriate way is to use xdp_return_frame().

Fixes: cac320c850ef ("virtio_net: convert to use generic xdp_frame and xdp_return_frame API")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index cea52e4..1d454ce 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2669,7 +2669,7 @@ static void free_unused_bufs(struct virtnet_info *vi)
 			if (!is_xdp_raw_buffer_queue(vi, i))
 				dev_kfree_skb(buf);
 			else
-				put_page(virt_to_head_page(buf));
+				xdp_return_frame(buf);
 		}
 	}
 
-- 
1.8.3.1




* [PATCH v2 net 5/7] virtio_net: Don't process redirected XDP frames when XDP is disabled
From: Toshiaki Makita @ 2019-01-29  0:45 UTC (permalink / raw)
  To: David S. Miller, Michael S. Tsirkin, Jason Wang
  Cc: Toshiaki Makita, netdev, virtualization, Jesper Dangaard Brouer
In-Reply-To: <1548722759-2470-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

Commit 8dcc5b0ab0ec ("virtio_net: fix ndo_xdp_xmit crash towards dev not
ready for XDP") tried to avoid accessing an unexpected sq while XDP is
disabled, but the fix was not complete.

There was a small window that could cause an out-of-bounds sq access in
virtnet_xdp_xmit() while disabling XDP.

An example case with
 - curr_queue_pairs = 6 (2 for SKB and 4 for XDP)
 - online_cpu_num = xdp_queue_pairs = 4
while XDP is enabled:

CPU 0                         CPU 1
(Disabling XDP)               (Processing redirected XDP frames)

                              virtnet_xdp_xmit()
virtnet_xdp_set()
 _virtnet_set_queues()
  set curr_queue_pairs (2)
                               check if rq->xdp_prog is not NULL
                               virtnet_xdp_sq(vi)
                                qp = curr_queue_pairs -
                                     xdp_queue_pairs +
                                     smp_processor_id()
                                   = 2 - 4 + 1 = -1
                                sq = &vi->sq[qp] // out of bounds access
  set xdp_queue_pairs (0)
  rq->xdp_prog = NULL

Basically we should not change curr_queue_pairs and xdp_queue_pairs
while someone can read the values. Thus, when disabling XDP, assign NULL
to rq->xdp_prog first, wait for an RCU grace period, and then change
xxx_queue_pairs.
Note that we need to keep the current order when enabling XDP though.
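
The arithmetic of the window can be reproduced in isolation. The sketch
below is user-space only; the sketch_* names are hypothetical and the
max_queue_pairs value of 6 is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the index computation quoted in the timeline above. */
static int sketch_xdp_sq_index(int curr_queue_pairs, int xdp_queue_pairs,
			       int cpu)
{
	return curr_queue_pairs - xdp_queue_pairs + cpu;
}

static bool sketch_index_valid(int qp, int max_queue_pairs)
{
	return qp >= 0 && qp < max_queue_pairs;
}

static void sketch_demo_window(void)
{
	/* Consistent state with XDP enabled: 6 - 4 + 1 = 3, in bounds. */
	assert(sketch_index_valid(sketch_xdp_sq_index(6, 4, 1), 6));

	/* Torn state: curr_queue_pairs already 2, xdp_queue_pairs still 4. */
	assert(sketch_xdp_sq_index(2, 4, 1) == -1);
	assert(!sketch_index_valid(sketch_xdp_sq_index(2, 4, 1), 6));
}
```

With the reordering, a reader that still sees a non-NULL xdp_prog is
guaranteed to see the old, consistent pair of values.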

- v2: Make rcu_assign_pointer/synchronize_net conditional instead of
      _virtnet_set_queues.

Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/virtio_net.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 669b65c..cea52e4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2410,6 +2410,10 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 		return -ENOMEM;
 	}
 
+	old_prog = rtnl_dereference(vi->rq[0].xdp_prog);
+	if (!prog && !old_prog)
+		return 0;
+
 	if (prog) {
 		prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
 		if (IS_ERR(prog))
@@ -2424,21 +2428,30 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 		}
 	}
 
+	if (!prog) {
+		for (i = 0; i < vi->max_queue_pairs; i++) {
+			rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
+			if (i == 0)
+				virtnet_restore_guest_offloads(vi);
+		}
+		synchronize_net();
+	}
+
 	err = _virtnet_set_queues(vi, curr_qp + xdp_qp);
 	if (err)
 		goto err;
 	netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
 	vi->xdp_queue_pairs = xdp_qp;
 
-	for (i = 0; i < vi->max_queue_pairs; i++) {
-		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
-		rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
-		if (i == 0) {
-			if (!old_prog)
+	if (prog) {
+		for (i = 0; i < vi->max_queue_pairs; i++) {
+			rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
+			if (i == 0 && !old_prog)
 				virtnet_clear_guest_offloads(vi);
-			if (!prog)
-				virtnet_restore_guest_offloads(vi);
 		}
+	}
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		if (old_prog)
 			bpf_prog_put(old_prog);
 		if (netif_running(dev)) {
@@ -2451,6 +2464,12 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	return 0;
 
 err:
+	if (!prog) {
+		virtnet_clear_guest_offloads(vi);
+		for (i = 0; i < vi->max_queue_pairs; i++)
+			rcu_assign_pointer(vi->rq[i].xdp_prog, old_prog);
+	}
+
 	if (netif_running(dev)) {
 		for (i = 0; i < vi->max_queue_pairs; i++) {
 			virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
-- 
1.8.3.1




* [PATCH v2 net 4/7] virtio_net: Fix out of bounds access of sq
From: Toshiaki Makita @ 2019-01-29  0:45 UTC (permalink / raw)
  To: David S. Miller, Michael S. Tsirkin, Jason Wang
  Cc: Toshiaki Makita, netdev, virtualization
In-Reply-To: <1548722759-2470-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

When XDP is disabled, curr_queue_pairs + smp_processor_id() can be
larger than max_queue_pairs.
There is no guarantee that we have enough XDP send queues dedicated to
each cpu when XDP is disabled, so do not count drops on sq in that case.

Fixes: 5b8f3c8d30a6 ("virtio_net: Add XDP related stats")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0e1a369..669b65c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -491,20 +491,17 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 	int ret, err;
 	int i;
 
-	sq = virtnet_xdp_sq(vi);
-
-	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK)) {
-		ret = -EINVAL;
-		drops = n;
-		goto out;
-	}
-
 	/* Only allow ndo_xdp_xmit if XDP is loaded on dev, as this
 	 * indicate XDP resources have been successfully allocated.
 	 */
 	xdp_prog = rcu_dereference(rq->xdp_prog);
-	if (!xdp_prog) {
-		ret = -ENXIO;
+	if (!xdp_prog)
+		return -ENXIO;
+
+	sq = virtnet_xdp_sq(vi);
+
+	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK)) {
+		ret = -EINVAL;
 		drops = n;
 		goto out;
 	}
-- 
1.8.3.1




* [PATCH v2 net 3/7] virtio_net: Fix not restoring real_num_rx_queues
From: Toshiaki Makita @ 2019-01-29  0:45 UTC (permalink / raw)
  To: David S. Miller, Michael S. Tsirkin, Jason Wang
  Cc: Toshiaki Makita, netdev, virtualization
In-Reply-To: <1548722759-2470-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

When _virtnet_set_queues() failed we did not restore real_num_rx_queues.
Fix this by placing the change of real_num_rx_queues after
_virtnet_set_queues().
This order is also in line with virtnet_set_channels().
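
The ordering rule is the usual "commit software state only after the
fallible step succeeded". A user-space sketch, in which struct
sketch_dev, sketch_set_hw_queues() (standing in for
_virtnet_set_queues()) and the hw_limit parameter are all hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_dev {
	int real_num_rx_queues;	/* software view */
	int hw_queues;		/* what the device was told */
};

/* Hypothetical fallible step standing in for _virtnet_set_queues(). */
static int sketch_set_hw_queues(struct sketch_dev *d, int n, int hw_limit)
{
	if (n > hw_limit)
		return -EINVAL;
	d->hw_queues = n;
	return 0;
}

static int sketch_resize(struct sketch_dev *d, int n, int hw_limit)
{
	int err;

	assert(d != NULL);

	err = sketch_set_hw_queues(d, n, hw_limit);
	if (err)
		return err;	/* software state untouched, nothing to restore */

	d->real_num_rx_queues = n;
	return 0;
}
```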

Fixes: 4941d472bf95 ("virtio-net: do not reset during XDP set")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 046f955..0e1a369 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2427,10 +2427,10 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 		}
 	}
 
-	netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
 	err = _virtnet_set_queues(vi, curr_qp + xdp_qp);
 	if (err)
 		goto err;
+	netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
 	vi->xdp_queue_pairs = xdp_qp;
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
-- 
1.8.3.1




* [PATCH v2 net 2/7] virtio_net: Don't call free_old_xmit_skbs for xdp_frames
From: Toshiaki Makita @ 2019-01-29  0:45 UTC (permalink / raw)
  To: David S. Miller, Michael S. Tsirkin, Jason Wang
  Cc: Toshiaki Makita, netdev, virtualization, Willem de Bruijn
In-Reply-To: <1548722759-2470-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

When napi_tx is enabled, virtnet_poll_cleantx() called
free_old_xmit_skbs() even for xdp send queues.
This is bogus since those queues hold xdp_frames, not sk_buffs. It
mangled the device tx bytes counters, because skb->len is a meaningless
value there, and even triggered an oops due to a general protection
fault on freeing them.

Since xdp send queues do not acquire locks, old xdp_frames should be
freed only in virtnet_xdp_xmit(), so just skip free_old_xmit_skbs() for
xdp send queues.

Similarly virtnet_poll_tx() called free_old_xmit_skbs(). This NAPI
handler is called even without calling start_xmit() because the cb for
tx is enabled by default. Once the handler is called, it enables the cb
again, and then the handler is called again. We don't need this handler
for XDP, so skip enabling the cb as well as calling
free_old_xmit_skbs().

Also, we need to disable tx NAPI when disabling XDP, so
virtnet_poll_tx() can safely access curr_queue_pairs and
xdp_queue_pairs, which are not atomically updated while disabling XDP.
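
Which send queues count as "xdp send queues" follows from the two
counters alone. The range test used by is_xdp_raw_buffer_queue() can be
sketched in user space as below; the sketch_* names are hypothetical
and the counters are passed in as plain ints instead of being read from
struct virtnet_info.

```c
#include <assert.h>
#include <stdbool.h>

/* XDP send queues are the last xdp_queue_pairs of the current pairs. */
static bool sketch_is_xdp_queue(int q, int curr_queue_pairs,
				int xdp_queue_pairs)
{
	return q >= curr_queue_pairs - xdp_queue_pairs &&
	       q < curr_queue_pairs;
}

static void sketch_demo_queues(void)
{
	/* 6 pairs in use, 4 of them for XDP: queues 0-1 skb, 2-5 XDP. */
	assert(!sketch_is_xdp_queue(1, 6, 4));
	assert(sketch_is_xdp_queue(2, 6, 4));
	assert(sketch_is_xdp_queue(5, 6, 4));
	assert(!sketch_is_xdp_queue(6, 6, 4));
}
```

Because the result depends on both counters, it is only meaningful
while they are consistent, which is why tx NAPI is disabled around the
update.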

Fixes: b92f1e6751a6 ("virtio-net: transmit napi")
Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 49 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8e4c5d4..046f955 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1358,6 +1358,16 @@ static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
 	u64_stats_update_end(&sq->stats.syncp);
 }
 
+static bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
+{
+	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
+		return false;
+	else if (q < vi->curr_queue_pairs)
+		return true;
+	else
+		return false;
+}
+
 static void virtnet_poll_cleantx(struct receive_queue *rq)
 {
 	struct virtnet_info *vi = rq->vq->vdev->priv;
@@ -1365,7 +1375,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
 	struct send_queue *sq = &vi->sq[index];
 	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, index);
 
-	if (!sq->napi.weight)
+	if (!sq->napi.weight || is_xdp_raw_buffer_queue(vi, index))
 		return;
 
 	if (__netif_tx_trylock(txq)) {
@@ -1442,8 +1452,16 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
 {
 	struct send_queue *sq = container_of(napi, struct send_queue, napi);
 	struct virtnet_info *vi = sq->vq->vdev->priv;
-	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+	unsigned int index = vq2txq(sq->vq);
+	struct netdev_queue *txq;
 
+	if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {
+		/* We don't need to enable cb for XDP */
+		napi_complete_done(napi, 0);
+		return 0;
+	}
+
+	txq = netdev_get_tx_queue(vi->dev, index);
 	__netif_tx_lock(txq, raw_smp_processor_id());
 	free_old_xmit_skbs(sq, true);
 	__netif_tx_unlock(txq);
@@ -2402,9 +2420,12 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	}
 
 	/* Make sure NAPI is not using any XDP TX queues for RX. */
-	if (netif_running(dev))
-		for (i = 0; i < vi->max_queue_pairs; i++)
+	if (netif_running(dev)) {
+		for (i = 0; i < vi->max_queue_pairs; i++) {
 			napi_disable(&vi->rq[i].napi);
+			virtnet_napi_tx_disable(&vi->sq[i].napi);
+		}
+	}
 
 	netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
 	err = _virtnet_set_queues(vi, curr_qp + xdp_qp);
@@ -2423,16 +2444,22 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 		}
 		if (old_prog)
 			bpf_prog_put(old_prog);
-		if (netif_running(dev))
+		if (netif_running(dev)) {
 			virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+			virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+					       &vi->sq[i].napi);
+		}
 	}
 
 	return 0;
 
 err:
 	if (netif_running(dev)) {
-		for (i = 0; i < vi->max_queue_pairs; i++)
+		for (i = 0; i < vi->max_queue_pairs; i++) {
 			virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+			virtnet_napi_tx_enable(vi, vi->sq[i].vq,
+					       &vi->sq[i].napi);
+		}
 	}
 	if (prog)
 		bpf_prog_sub(prog, vi->max_queue_pairs - 1);
@@ -2615,16 +2642,6 @@ static void free_receive_page_frags(struct virtnet_info *vi)
 			put_page(vi->rq[i].alloc_frag.page);
 }
 
-static bool is_xdp_raw_buffer_queue(struct virtnet_info *vi, int q)
-{
-	if (q < (vi->curr_queue_pairs - vi->xdp_queue_pairs))
-		return false;
-	else if (q < vi->curr_queue_pairs)
-		return true;
-	else
-		return false;
-}
-
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
-- 
1.8.3.1
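
The queue-index check that this patch moves up in the file can be read as
a simple range test. The following is a minimal user-space sketch of that
test, not kernel code: struct queue_layout and is_xdp_queue() are
hypothetical stand-ins for struct virtnet_info and
is_xdp_raw_buffer_queue(), and it assumes curr_queue_pairs already counts
the XDP pairs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two struct virtnet_info fields used here. */
struct queue_layout {
	int curr_queue_pairs; /* pairs currently in use, XDP ones included (assumed) */
	int xdp_queue_pairs;  /* trailing pairs of those that carry xdp_frames */
};

/*
 * Same three-way test as is_xdp_raw_buffer_queue(), folded into one
 * expression: a queue is an XDP queue only if its index falls in
 * [curr_queue_pairs - xdp_queue_pairs, curr_queue_pairs).
 */
static bool is_xdp_queue(const struct queue_layout *l, int q)
{
	/* Sanity check added for the sketch; the driver has no such assert. */
	assert(l->xdp_queue_pairs <= l->curr_queue_pairs);

	return q >= l->curr_queue_pairs - l->xdp_queue_pairs &&
	       q < l->curr_queue_pairs;
}
```

With curr_queue_pairs = 6 and xdp_queue_pairs = 2, queues 0-3 are normal
send queues, queues 4-5 are XDP send queues, and anything from 6 up is
unused and therefore not an XDP queue either.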



^ permalink raw reply related

* [PATCH v2 net 1/7] virtio_net: Don't enable NAPI when interface is down
From: Toshiaki Makita @ 2019-01-29  0:45 UTC (permalink / raw)
  To: David S. Miller, Michael S. Tsirkin, Jason Wang
  Cc: Toshiaki Makita, netdev, virtualization
In-Reply-To: <1548722759-2470-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>

Commit 4e09ff536284 ("virtio-net: disable NAPI only when enabled during
XDP set") tried to fix inappropriate NAPI enabling/disabling when
!netif_running(), but was not complete.

On the error path, virtio_net could enable NAPI even when
!netif_running(). This can cause NAPI to be enabled twice on
virtnet_open(), which would trigger BUG_ON() in napi_enable().

Fixes: 4941d472bf95b ("virtio-net: do not reset during XDP set")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8fadd8e..8e4c5d4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2430,8 +2430,10 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	return 0;
 
 err:
-	for (i = 0; i < vi->max_queue_pairs; i++)
-		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+	if (netif_running(dev)) {
+		for (i = 0; i < vi->max_queue_pairs; i++)
+			virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
+	}
 	if (prog)
 		bpf_prog_sub(prog, vi->max_queue_pairs - 1);
 	return err;
-- 
1.8.3.1
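
The double-enable described in the commit message can be modelled in user
space with a single flag. This is only a sketch under stated assumptions:
struct mock_napi and every function below are hypothetical, and
mock_napi_enable() returns false at the point where the real
napi_enable() would hit BUG_ON().

```c
#include <stdbool.h>

/* Hypothetical model: one flag instead of the kernel's NAPI state bits. */
struct mock_napi {
	bool enabled;
};

/* Stand-in for napi_enable(); false marks where the kernel would BUG_ON(). */
static bool mock_napi_enable(struct mock_napi *n)
{
	if (n->enabled)
		return false; /* second enable with no disable in between */
	n->enabled = true;
	return true;
}

/* Error path before the fix: re-enables NAPI unconditionally. */
static void xdp_set_error_path_unfixed(struct mock_napi *n)
{
	(void)mock_napi_enable(n);
}

/* Error path after the fix: touches NAPI only if the interface is running. */
static void xdp_set_error_path_fixed(struct mock_napi *n, bool running)
{
	if (running)
		(void)mock_napi_enable(n);
}

/* Stand-in for virtnet_open(): reports whether its enable was legal. */
static bool mock_open(struct mock_napi *n)
{
	return mock_napi_enable(n);
}
```

Starting from a down interface (enabled == false), the unfixed error path
followed by mock_open() yields false, which corresponds to the BUG_ON()
case; the fixed path with running == false leaves the flag alone and
mock_open() succeeds.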



^ permalink raw reply related

* [PATCH v2 net 0/7] virtio_net: Fix problems around XDP tx and napi_tx
From: Toshiaki Makita @ 2019-01-29  0:45 UTC (permalink / raw)
  To: David S. Miller, Michael S. Tsirkin, Jason Wang
  Cc: Toshiaki Makita, netdev, virtualization, Willem de Bruijn,
	Jesper Dangaard Brouer

While looking into how to account standard tx counters in XDP tx
processing, I found several bugs around XDP tx and napi_tx.

Patch1: Fix oops on error path. Patch2 depends on this.
Patch2: Fix memory corruption on freeing xdp_frames with napi_tx enabled.
Patch3: Minor fix that patch5 depends on.
Patch4: Fix memory corruption on processing xdp_frames when XDP is disabled.
  Patch5 also depends on this.
Patch5: Fix memory corruption on processing xdp_frames while XDP is being
  disabled.
Patch6: Minor fix that patch7 depends on.
Patch7: Fix memory corruption on freeing sk_buff or xdp_frames when a normal
  queue is reused for XDP and vice versa.

v2:
- patch5: Make rcu_assign_pointer/synchronize_net conditional instead of
          _virtnet_set_queues.
- patch7: Use napi_consume_skb() instead of dev_consume_skb_any()

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Toshiaki Makita (7):
  virtio_net: Don't enable NAPI when interface is down
  virtio_net: Don't call free_old_xmit_skbs for xdp_frames
  virtio_net: Fix not restoring real_num_rx_queues
  virtio_net: Fix out of bounds access of sq
  virtio_net: Don't process redirected XDP frames when XDP is disabled
  virtio_net: Use xdp_return_frame to free xdp_frames on destroying vqs
  virtio_net: Differentiate sk_buff and xdp_frame on freeing

 drivers/net/virtio_net.c | 159 +++++++++++++++++++++++++++++++++--------------
 1 file changed, 112 insertions(+), 47 deletions(-)

-- 
1.8.3.1
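
Patch7 has to tell sk_buffs and xdp_frames apart when both can sit in the
same send queue. This cover letter does not say how that is done, so the
following is only a sketch of one common technique, tagging the low bit
of the stored pointer. All names here (XDP_TAG, mock_skb, mock_xdp_frame,
free_one and so on) are hypothetical, and the sketch assumes both object
types are at least 2-byte aligned so that bit 0 of a real pointer is
always clear.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tag bit; valid only while bit 0 of real pointers is zero. */
#define XDP_TAG ((uintptr_t)1)

struct mock_skb { int len; };
struct mock_xdp_frame { int len; };

/* Mark a frame pointer before handing it to the shared queue. */
static void *tag_xdp(struct mock_xdp_frame *f)
{
	return (void *)((uintptr_t)f | XDP_TAG);
}

/* True if the queued token was produced by tag_xdp(). */
static bool is_tagged_xdp(const void *p)
{
	return ((uintptr_t)p & XDP_TAG) != 0;
}

/* Recover the original frame pointer from a tagged token. */
static struct mock_xdp_frame *untag_xdp(void *p)
{
	return (struct mock_xdp_frame *)((uintptr_t)p & ~XDP_TAG);
}

/* Counters stand in for the two different free routines. */
struct free_stats { int skbs; int frames; };

/* Free-path dispatch: pick the release routine from the tag. */
static void free_one(void *p, struct free_stats *st)
{
	if (is_tagged_xdp(p)) {
		(void)untag_xdp(p); /* would be returned as an xdp_frame */
		st->frames++;
	} else {
		st->skbs++;         /* would be freed as an sk_buff */
	}
}
```

The point of such a scheme is that the decision is made per buffer rather
than per queue, so it stays correct even if a queue changes role between
the time a buffer is queued and the time it is freed.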



^ permalink raw reply

* Re: Packets being dropped somewhere in the kernel, between iptables and packet capture layers
From: Niklas Hambüchen @ 2019-01-29  0:45 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal
In-Reply-To: <20190128062121.fbksn2vdzpthwfkh@breakpoint.cc>

> Some DNS requests on my machine take 5 seconds when a specific network interface is used.

My big apologies for the confusion. I have found the reason.

There was a scripted

     tc qdisc add dev enp3s0 root netem loss 5%

in effect that I wasn't aware of while investigating.
This dropped 5% of the packets, and also explains why it was on that interface only.

Well, at least I know very well now at which level `tc` operates.

My bad, and sorry for the noise.

^ permalink raw reply

* Re: [PATCH bpf-next v3 0/3] support flow dissector in BPF_PROG_TEST_RUN
From: Daniel Borkmann @ 2019-01-29  0:24 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev; +Cc: davem, ast
In-Reply-To: <20190128165355.229403-1-sdf@google.com>

On 01/28/2019 05:53 PM, Stanislav Fomichev wrote:
> This patch series adds support for testing flow dissector BPF programs by
> extending already existing BPF_PROG_TEST_RUN. The goal is to have a
> packet as an input and `struct bpf_flow_key' as an output. That way
> we can easily test flow dissector programs' behavior.
> I've also modified existing test_progs.c test to do a simple flow
> dissector run as well.
> 
> * first patch introduces new __skb_flow_bpf_dissect to simplify
>   sharing between __skb_flow_bpf_dissect and BPF_PROG_TEST_RUN
> * second patch adds actual BPF_PROG_TEST_RUN support
> * third patch adds example usage to the selftests
> 
> v3:
> * rebased on top of latest bpf-next
> 
> v2:
> * loop over 'kattr->test.repeat' inside of
>   bpf_prog_test_run_flow_dissector, don't reuse
>   bpf_test_run/bpf_test_run_one
> 
> Stanislav Fomichev (3):
>   net/flow_dissector: move bpf case into __skb_flow_bpf_dissect
>   bpf: add BPF_PROG_TEST_RUN support for flow dissector
>   selftests/bpf: add simple BPF_PROG_TEST_RUN examples for flow
>     dissector
> 
>  include/linux/bpf.h                           |  3 +
>  include/linux/skbuff.h                        |  5 +
>  net/bpf/test_run.c                            | 82 +++++++++++++++++
>  net/core/filter.c                             |  1 +
>  net/core/flow_dissector.c                     | 92 +++++++++++--------
>  tools/testing/selftests/bpf/Makefile          |  3 +
>  .../selftests/bpf/flow_dissector_load.c       | 43 +--------
>  .../selftests/bpf/flow_dissector_load.h       | 55 +++++++++++
>  tools/testing/selftests/bpf/test_progs.c      | 78 +++++++++++++++-
>  9 files changed, 284 insertions(+), 78 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.h
> 

Applied, thanks!

^ permalink raw reply

