Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next v2 0/7] bpf/flow_dissector: support input flags
From: Petar Penkov @ 2019-07-25 17:05 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Networking, bpf, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann, Song Liu, Willem de Bruijn
In-Reply-To: <20190725153342.3571-1-sdf@google.com>

Thanks! For the series:

Acked-by: Petar Penkov <ppenkov@google.com>

On Thu, Jul 25, 2019 at 8:33 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> C flow dissector supports input flags that tell it to customize parsing
> by either stopping early or trying to parse as deep as possible.
> BPF flow dissector always parses as deep as possible which is sub-optimal.
> Pass input flags to the BPF flow dissector as well so it can make the same
> decisions.
>
> Series outline:
> * remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
> * export FLOW_DISSECTOR_F_XXX flags as uapi and pass them to BPF
>   flow dissector
> * add documentation for the export flags
> * support input flags in BPF_PROG_TEST_RUN via ctx_{in,out}
> * sync uapi to tools
> * support FLOW_DISSECTOR_F_PARSE_1ST_FRAG in selftest
> * support FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL in kernel and selftest
> * support FLOW_DISSECTOR_F_STOP_AT_ENCAP in selftest
>
> Pros:
> * makes BPF flow dissector faster by avoiding burning extra cycles
> * existing BPF progs continue to work by ignoring the flags and always
>   parsing as deep as possible
>
> Cons:
> * new UAPI which we need to support (OTOH, if we need to deprecate some
>   flags, we can just stop setting them upon calling BPF programs)
>
> Some numbers (with .repeat = 4000000 in test_flow_dissector):
>         test_flow_dissector:PASS:ipv4-frag 35 nsec
>         test_flow_dissector:PASS:ipv4-frag 35 nsec
>         test_flow_dissector:PASS:ipv4-no-frag 32 nsec
>         test_flow_dissector:PASS:ipv4-no-frag 32 nsec
>
>         test_flow_dissector:PASS:ipv6-frag 39 nsec
>         test_flow_dissector:PASS:ipv6-frag 39 nsec
>         test_flow_dissector:PASS:ipv6-no-frag 36 nsec
>         test_flow_dissector:PASS:ipv6-no-frag 36 nsec
>
>         test_flow_dissector:PASS:ipv6-flow-label 36 nsec
>         test_flow_dissector:PASS:ipv6-flow-label 36 nsec
>         test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec
>         test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec
>
>         test_flow_dissector:PASS:ipip-encap 38 nsec
>         test_flow_dissector:PASS:ipip-encap 38 nsec
>         test_flow_dissector:PASS:ipip-no-encap 32 nsec
>         test_flow_dissector:PASS:ipip-no-encap 32 nsec
>
> The improvement is around 10%, but it's in a tight cache-hot
> BPF_PROG_TEST_RUN loop.
>
> Cc: Song Liu <songliubraving@fb.com>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Petar Penkov <ppenkov@google.com>
>
> Stanislav Fomichev (7):
>   bpf/flow_dissector: pass input flags to BPF flow dissector program
>   bpf/flow_dissector: document flags
>   bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN
>   tools/bpf: sync bpf_flow_keys flags
>   selftests/bpf: support FLOW_DISSECTOR_F_PARSE_1ST_FRAG
>   bpf/flow_dissector: support ipv6 flow_label and
>     FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL
>   selftests/bpf: support FLOW_DISSECTOR_F_STOP_AT_ENCAP
>
>  Documentation/bpf/prog_flow_dissector.rst     |  18 ++
>  include/linux/skbuff.h                        |   2 +-
>  include/net/flow_dissector.h                  |   4 -
>  include/uapi/linux/bpf.h                      |   6 +
>  net/bpf/test_run.c                            |  39 ++-
>  net/core/flow_dissector.c                     |  14 +-
>  tools/include/uapi/linux/bpf.h                |   6 +
>  .../selftests/bpf/prog_tests/flow_dissector.c | 242 +++++++++++++++++-
>  tools/testing/selftests/bpf/progs/bpf_flow.c  |  46 +++-
>  9 files changed, 359 insertions(+), 18 deletions(-)
>
> --
> 2.22.0.657.g960e92d24f-goog

^ permalink raw reply

* Re: [PATCH bpf-next v3 03/11] xsk: add support to allow unaligned chunk placement
From: Laatz, Kevin @ 2019-07-25 17:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	Topel, Bjorn, Karlsson, Magnus, jonathan.lemon@gmail.com,
	saeedm@mellanox.com, maximmi@mellanox.com,
	stephen@networkplumber.org, Richardson, Bruce, Loftus, Ciara,
	bpf@vger.kernel.org, intel-wired-lan@lists.osuosl.org
In-Reply-To: <20190724192253.00ac07bd@cakuba.netronome.com>

On 25/07/2019 03:22, Jakub Kicinski wrote:
> On Wed, 24 Jul 2019 05:10:35 +0000, Kevin Laatz wrote:
>> Currently, addresses are chunk size aligned. This means, we are very
>> restricted in terms of where we can place chunk within the umem. For
>> example, if we have a chunk size of 2k, then our chunks can only be placed
>> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
>>
>> This patch introduces the ability to use unaligned chunks. With these
>> changes, we are no longer bound to having to place chunks at a 2k (or
>> whatever your chunk size is) interval. Since we are no longer dealing with
>> aligned chunks, they can now cross page boundaries. Checks for page
>> contiguity have been added in order to keep track of which pages are
>> followed by a physically contiguous page.
>>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>>
>> ---
>> v2:
>>    - Add checks for the flags coming from userspace
>>    - Fix how we get chunk_size in xsk_diag.c
>>    - Add defines for masking the new descriptor format
>>    - Modified the rx functions to use new descriptor format
>>    - Modified the tx functions to use new descriptor format
>>
>> v3:
>>    - Add helper function to do address/offset masking/addition
>> ---
>>   include/net/xdp_sock.h      | 17 ++++++++
>>   include/uapi/linux/if_xdp.h |  9 ++++
>>   net/xdp/xdp_umem.c          | 18 +++++---
>>   net/xdp/xsk.c               | 86 ++++++++++++++++++++++++++++++-------
>>   net/xdp/xsk_diag.c          |  2 +-
>>   net/xdp/xsk_queue.h         | 68 +++++++++++++++++++++++++----
>>   6 files changed, 170 insertions(+), 30 deletions(-)
>>
>> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
>> index 69796d264f06..738996c0f995 100644
>> --- a/include/net/xdp_sock.h
>> +++ b/include/net/xdp_sock.h
>> @@ -19,6 +19,7 @@ struct xsk_queue;
>>   struct xdp_umem_page {
>>   	void *addr;
>>   	dma_addr_t dma;
>> +	bool next_pg_contig;
> IIRC accesses to xdp_umem_page case a lot of cache misses, so having
> this structure grow from 16 to 24B is a little unfortunate :(
> Can we try to steal lower bits of addr or dma? Or perhaps not pre
> compute this info at all?

Pre-computing is saving us from re-computing the same information 
multiple times for the same chunk in this case. Keeping that in mind, I 
would be more in favor of stealing a bit.

Will look into a nicer solution for the v4 :)

>>   };
>>   
>>   struct xdp_umem_fq_reuse {
>> @@ -48,6 +49,7 @@ struct xdp_umem {
>>   	bool zc;
>>   	spinlock_t xsk_list_lock;
>>   	struct list_head xsk_list;
>> +	u32 flags;
>>   };
>>   
>>   struct xdp_sock {
>> @@ -144,6 +146,15 @@ static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
>>   
>>   	rq->handles[rq->length++] = addr;
>>   }
>> +
>> +static inline u64 xsk_umem_handle_offset(struct xdp_umem *umem, u64 handle,
>> +					 u64 offset)
>> +{
>> +	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
>> +		return handle |= (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
>> +	else
>> +		return handle += offset;
>> +}
>>   #else
>>   static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>>   {
>> @@ -241,6 +252,12 @@ static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
>>   {
>>   }
>>   
>> +static inline u64 xsk_umem_handle_offset(struct xdp_umem *umem, u64 handle,
>> +					 u64 offset)
>> +{
>> +	return NULL;
> 	return 0?
Will change for the v4, thanks.
>
>> +}
>> +
>>   #endif /* CONFIG_XDP_SOCKETS */
>>   
>>   #endif /* _LINUX_XDP_SOCK_H */
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index faaa5ca2a117..f8dc68fcdf78 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -17,6 +17,9 @@
>>   #define XDP_COPY	(1 << 1) /* Force copy-mode */
>>   #define XDP_ZEROCOPY	(1 << 2) /* Force zero-copy mode */
>>   
>> +/* Flags for xsk_umem_config flags */
>> +#define XDP_UMEM_UNALIGNED_CHUNKS (1 << 0)
>> +
>>   struct sockaddr_xdp {
>>   	__u16 sxdp_family;
>>   	__u16 sxdp_flags;
>> @@ -53,6 +56,7 @@ struct xdp_umem_reg {
>>   	__u64 len; /* Length of packet data area */
>>   	__u32 chunk_size;
>>   	__u32 headroom;
>> +	__u32 flags;
>>   };
>>   
>>   struct xdp_statistics {
>> @@ -74,6 +78,11 @@ struct xdp_options {
>>   #define XDP_UMEM_PGOFF_FILL_RING	0x100000000ULL
>>   #define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000ULL
>>   
>> +/* Masks for unaligned chunks mode */
>> +#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
>> +#define XSK_UNALIGNED_BUF_ADDR_MASK \
>> +	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
>> +
>>   /* Rx/Tx descriptor */
>>   struct xdp_desc {
>>   	__u64 addr;
>> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
>> index 83de74ca729a..952ca22103e9 100644
>> --- a/net/xdp/xdp_umem.c
>> +++ b/net/xdp/xdp_umem.c
>> @@ -299,6 +299,7 @@ static int xdp_umem_account_pages(struct xdp_umem *umem)
>>   
>>   static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>>   {
>> +	bool unaligned_chunks = mr->flags & XDP_UMEM_UNALIGNED_CHUNKS;
>>   	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
>>   	unsigned int chunks, chunks_per_page;
>>   	u64 addr = mr->addr, size = mr->len;
>> @@ -314,7 +315,10 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>>   		return -EINVAL;
>>   	}
>>   
>> -	if (!is_power_of_2(chunk_size))
>> +	if (mr->flags & ~(XDP_UMEM_UNALIGNED_CHUNKS))
> parens unnecessary, consider adding a define for known flags.
Will fix in the v4.
>
>> +		return -EINVAL;
>> +
>> +	if (!unaligned_chunks && !is_power_of_2(chunk_size))
>>   		return -EINVAL;
>>   
>>   	if (!PAGE_ALIGNED(addr)) {
>> @@ -331,9 +335,11 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>>   	if (chunks == 0)
>>   		return -EINVAL;
>>   
>> -	chunks_per_page = PAGE_SIZE / chunk_size;
>> -	if (chunks < chunks_per_page || chunks % chunks_per_page)
>> -		return -EINVAL;
>> +	if (!unaligned_chunks) {
>> +		chunks_per_page = PAGE_SIZE / chunk_size;
>> +		if (chunks < chunks_per_page || chunks % chunks_per_page)
>> +			return -EINVAL;
>> +	}
>>   
>>   	headroom = ALIGN(headroom, 64);
>>   
>> @@ -342,13 +348,15 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
>>   		return -EINVAL;
>>   
>>   	umem->address = (unsigned long)addr;
>> -	umem->chunk_mask = ~((u64)chunk_size - 1);
>> +	umem->chunk_mask = unaligned_chunks ? XSK_UNALIGNED_BUF_ADDR_MASK
>> +					    : ~((u64)chunk_size - 1);
>>   	umem->size = size;
>>   	umem->headroom = headroom;
>>   	umem->chunk_size_nohr = chunk_size - headroom;
>>   	umem->npgs = size / PAGE_SIZE;
>>   	umem->pgs = NULL;
>>   	umem->user = NULL;
>> +	umem->flags = mr->flags;
>>   	INIT_LIST_HEAD(&umem->xsk_list);
>>   	spin_lock_init(&umem->xsk_list_lock);
>>   
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index 59b57d708697..b3ab653091c4 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -45,7 +45,7 @@ EXPORT_SYMBOL(xsk_umem_has_addrs);
>>   
>>   u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
>>   {
>> -	return xskq_peek_addr(umem->fq, addr);
>> +	return xskq_peek_addr(umem->fq, addr, umem);
>>   }
>>   EXPORT_SYMBOL(xsk_umem_peek_addr);
>>   
>> @@ -55,21 +55,42 @@ void xsk_umem_discard_addr(struct xdp_umem *umem)
>>   }
>>   EXPORT_SYMBOL(xsk_umem_discard_addr);
>>   
>> +/* If a buffer crosses a page boundary, we need to do 2 memcpy's, one for
>> + * each page. This is only required in copy mode.
>> + */
>> +static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void *from_buf,
>> +			     u32 len, u32 metalen)
>> +{
>> +	void *to_buf = xdp_umem_get_data(umem, addr);
>> +
>> +	if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
>> +		void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;
>> +		u64 page_start = addr & (PAGE_SIZE - 1);
>> +		u64 first_len = PAGE_SIZE - (addr - page_start);
>> +
>> +		memcpy(to_buf, from_buf, first_len + metalen);
>> +		memcpy(next_pg_addr, from_buf + first_len, len - first_len);
>> +
>> +		return;
>> +	}
>> +
>> +	memcpy(to_buf, from_buf, len + metalen);
>> +}
> Why handle this case gracefully? Real XSK use is the zero copy mode,
> having extra code to make copy mode more permissive seems a little
> counter productive IMHO.
Agree that zero copy mode is the main use and that this is somewhat 
unnecessary. However, since we are now allowing for unaligned chunks 
which can cross page boundaries, there's no harm in adding this extra 
check and handling it gracefully.
>
>>   static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>>   {
>> -	void *to_buf, *from_buf;
>> +	u64 offset = xs->umem->headroom;
>> +	void *from_buf;
>>   	u32 metalen;
>>   	u64 addr;
>>   	int err;
>>   
>> -	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
>> +	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>>   	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>>   		xs->rx_dropped++;
>>   		return -ENOSPC;
>>   	}
>>   
>> -	addr += xs->umem->headroom;
>> -
>>   	if (unlikely(xdp_data_meta_unsupported(xdp))) {
>>   		from_buf = xdp->data;
>>   		metalen = 0;
>> @@ -78,9 +99,13 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>>   		metalen = xdp->data - xdp->data_meta;
>>   	}
>>   
>> -	to_buf = xdp_umem_get_data(xs->umem, addr);
>> -	memcpy(to_buf, from_buf, len + metalen);
>> -	addr += metalen;
>> +	__xsk_rcv_memcpy(xs->umem, addr + offset, from_buf, len, metalen);
>> +
>> +	offset += metalen;
>> +	if (xs->umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
>> +		addr |= offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT;
>> +	else
>> +		addr += offset;
>>   	err = xskq_produce_batch_desc(xs->rx, addr, len);
>>   	if (!err) {
>>   		xskq_discard_addr(xs->umem->fq);
>> @@ -127,6 +152,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>>   	u32 len = xdp->data_end - xdp->data;
>>   	void *buffer;
>>   	u64 addr;
>> +	u64 offset = xs->umem->headroom;
> reverse xmas tree, please

Will fix in the v4.

Thanks for reviewing!


^ permalink raw reply

* Re: [PATCH bpf-next v3 06/11] mlx5e: modify driver for handling offsets
From: Laatz, Kevin @ 2019-07-25 17:00 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	Topel, Bjorn, Karlsson, Magnus, jakub.kicinski@netronome.com,
	jonathan.lemon@gmail.com, Saeed Mahameed,
	stephen@networkplumber.org, Richardson, Bruce, Loftus, Ciara,
	bpf@vger.kernel.org, intel-wired-lan@lists.osuosl.org
In-Reply-To: <c5704b74-8efe-af2a-68e6-716fa89a5665@mellanox.com>

On 25/07/2019 11:15, Maxim Mikityanskiy wrote:
> On 2019-07-24 08:10, Kevin Laatz wrote:
>> With the addition of the unaligned chunks option, we need to make sure we
>> handle the offsets accordingly based on the mode we are currently running
>> in. This patch modifies the driver to appropriately mask the address for
>> each case.
>>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>>
>> ---
>> v3:
>>     - Use new helper function to handle offset
>> ---
>>    drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c    | 8 ++++++--
>>    drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c | 9 +++++++--
>>    2 files changed, 13 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> index b0b982cf69bb..d5245893d2c8 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>> @@ -122,6 +122,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
>>    		      void *va, u16 *rx_headroom, u32 *len, bool xsk)
>>    {
>>    	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
>> +	struct xdp_umem *umem = rq->umem;
>>    	struct xdp_buff xdp;
>>    	u32 act;
>>    	int err;
>> @@ -138,8 +139,11 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
>>    	xdp.rxq = &rq->xdp_rxq;
>>    
>>    	act = bpf_prog_run_xdp(prog, &xdp);
>> -	if (xsk)
>> -		xdp.handle += xdp.data - xdp.data_hard_start;
>> +	if (xsk) {
>> +		u64 off = xdp.data - xdp.data_hard_start;
>> +
>> +		xdp.handle = xsk_umem_handle_offset(umem, xdp.handle, off);
>> +	}
> What's missed is that umem_headroom is added to handle directly in
> mlx5e_xsk_page_alloc_umem. In my understanding umem_headroom should go
> to the offset part (high 16 bits) of the handle, to
> xsk_umem_handle_offset has to support increasing the offset.

Will look into it and make the changes for the v4

>>    	switch (act) {
>>    	case XDP_PASS:
>>    		*rx_headroom = xdp.data - xdp.data_hard_start;
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
>> index 35e188cf4ea4..f596e63cba00 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
>> @@ -61,6 +61,7 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
>>    	struct mlx5e_xdp_xmit_data xdptxd;
>>    	bool work_done = true;
>>    	bool flush = false;
>> +	u64 addr, offset;
>>    
>>    	xdpi.mode = MLX5E_XDP_XMIT_MODE_XSK;
>>    
>> @@ -82,8 +83,12 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
>>    			break;
>>    		}
>>    
>> -		xdptxd.dma_addr = xdp_umem_get_dma(umem, desc.addr);
>> -		xdptxd.data = xdp_umem_get_data(umem, desc.addr);
>> +		/* for unaligned chunks need to take offset from upper bits */
>> +		offset = (desc.addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
>> +		addr = (desc.addr & XSK_UNALIGNED_BUF_ADDR_MASK);
>> +
>> +		xdptxd.dma_addr = xdp_umem_get_dma(umem, addr + offset);
>> +		xdptxd.data = xdp_umem_get_data(umem, addr + offset);
> Why can't these calculations be encapsulated into
> xdp_umem_get_{dma,data}? I think they are common for all drivers, aren't
> they?
>
> Even if there is some reason not to put this bitshifting stuff into
> xdp_umem_get_* functions, I suggest to encapsulate it into a function
> anyway, because it's a good idea to keep those calculations in a single
> place.
Nice suggestion! I will move it to the xdp_umem_get_* functions in the 
v4. Thanks


^ permalink raw reply

* Re: [PATCH bpf-next v3 08/11] samples/bpf: add unaligned chunks mode support to xdpsock
From: Laatz, Kevin @ 2019-07-25 17:00 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	Topel, Bjorn, Karlsson, Magnus, jakub.kicinski@netronome.com,
	jonathan.lemon@gmail.com, Saeed Mahameed,
	stephen@networkplumber.org, Richardson, Bruce, Loftus, Ciara,
	bpf@vger.kernel.org, intel-wired-lan@lists.osuosl.org
In-Reply-To: <fd7b6b71-5bfd-986a-b288-b9e3478acebb@mellanox.com>


On 25/07/2019 10:43, Maxim Mikityanskiy wrote:
> On 2019-07-24 08:10, Kevin Laatz wrote:
>> This patch adds support for the unaligned chunks mode. The addition of the
>> unaligned chunks option will allow users to run the application with more
>> relaxed chunk placement in the XDP umem.
>>
>> Unaligned chunks mode can be used with the '-u' or '--unaligned' command
>> line options.
>>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>> ---
>>    samples/bpf/xdpsock_user.c | 17 +++++++++++++++--
>>    1 file changed, 15 insertions(+), 2 deletions(-)
> <...>
>
>> @@ -372,6 +378,7 @@ static void usage(const char *prog)
>>    		"  -z, --zero-copy      Force zero-copy mode.\n"
>>    		"  -c, --copy           Force copy mode.\n"
>>    		"  -f, --frame-size=n   Set the frame size (must be a power of two, default is %d).\n"
> Help text for -f has to be updated, it doesn't have to be a power of two
> if -u is specified.

Will fix in the v4, thanks!




^ permalink raw reply

* Re: [PATCH bpf-next v3 03/11] xsk: add support to allow unaligned chunk placement
From: Laatz, Kevin @ 2019-07-25 17:00 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	Topel, Bjorn, Karlsson, Magnus, jakub.kicinski@netronome.com,
	jonathan.lemon@gmail.com, Saeed Mahameed,
	stephen@networkplumber.org, Richardson, Bruce, Loftus, Ciara,
	bpf@vger.kernel.org, intel-wired-lan@lists.osuosl.org
In-Reply-To: <3af74e26-8899-cf1e-6fd4-5ea0bd349fc3@mellanox.com>


On 25/07/2019 10:27, Maxim Mikityanskiy wrote:
> On 2019-07-24 08:10, Kevin Laatz wrote:
>> Currently, addresses are chunk size aligned. This means, we are very
>> restricted in terms of where we can place chunk within the umem. For
>> example, if we have a chunk size of 2k, then our chunks can only be placed
>> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
>>
>> This patch introduces the ability to use unaligned chunks. With these
>> changes, we are no longer bound to having to place chunks at a 2k (or
>> whatever your chunk size is) interval. Since we are no longer dealing with
>> aligned chunks, they can now cross page boundaries. Checks for page
>> contiguity have been added in order to keep track of which pages are
>> followed by a physically contiguous page.
>>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
>> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>>
>> ---
>> v2:
>>     - Add checks for the flags coming from userspace
>>     - Fix how we get chunk_size in xsk_diag.c
>>     - Add defines for masking the new descriptor format
>>     - Modified the rx functions to use new descriptor format
>>     - Modified the tx functions to use new descriptor format
>>
>> v3:
>>     - Add helper function to do address/offset masking/addition
>> ---
>>    include/net/xdp_sock.h      | 17 ++++++++
>>    include/uapi/linux/if_xdp.h |  9 ++++
>>    net/xdp/xdp_umem.c          | 18 +++++---
>>    net/xdp/xsk.c               | 86 ++++++++++++++++++++++++++++++-------
>>    net/xdp/xsk_diag.c          |  2 +-
>>    net/xdp/xsk_queue.h         | 68 +++++++++++++++++++++++++----
>>    6 files changed, 170 insertions(+), 30 deletions(-)
>>
> <...>
>
>> +/* If a buffer crosses a page boundary, we need to do 2 memcpy's, one for
>> + * each page. This is only required in copy mode.
>> + */
>> +static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void *from_buf,
>> +			     u32 len, u32 metalen)
>> +{
>> +	void *to_buf = xdp_umem_get_data(umem, addr);
>> +
>> +	if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
>> +		void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;
>> +		u64 page_start = addr & (PAGE_SIZE - 1);
>> +		u64 first_len = PAGE_SIZE - (addr - page_start);
> Let addr = 0x12345, PAGE_SIZE = 0x1000, len = 0x1000. Your calculations
> lead to page_start = 0x345, first_len = 0x1000 - 0x12000, which is
> negative. I think page_start is calculated incorrectly (is ~ missing?).

Correct, the ~ is missing in the page_start calculation. Nice spot, thanks!

>
>> +
>> +		memcpy(to_buf, from_buf, first_len + metalen);
>> +		memcpy(next_pg_addr, from_buf + first_len, len - first_len);
>> +
>> +		return;
>> +	}
>> +
>> +	memcpy(to_buf, from_buf, len + metalen);
>> +}
>> +
> <...>
>
>> +static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q, u64 addr,
>> +						u64 length,
>> +						struct xdp_umem *umem)
>> +{
>> +	addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
>> +	addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
>> +	if (addr >= q->size ||
> Addresses like 0x00aaffffffffffff will pass the validation (0xaa +
> 0xffffffffffff will overflow mod 2^48 and become a small number),
> whereas such addresses don't look valid for me.

If you are referring to the addr >= q->size check... that check was 
already in xskq_is_valid_addr (which I based this function on). If this 
doesn't make sense, then it should be removed/fixed for both.

>
>> +	    xskq_crosses_non_contig_pg(umem, addr, length)) {
> If the region is not contiguous, we cant RX into it - that's clear.
> However, how can the userspace determine whether these two pages of UMEM
> are mapped contiguously in the DMA space? Are we going to silently drop
> descriptors to non-contiguous frames and leak them? Please explain how
> to use it correctly from the application side.

Correct, if it is not contiguous then we cannot Rx into it.

Userspace apps should be aware of the page contiguity when using 
zero-copy and if not, then should not be using the unaligned mode. 
Existing frameworks that have their own memory management subsystem, 
such as DPDK, are aware of page contiguity and manage this themselves 
while simpler apps that don't have this kind of visibility can use 
hugepages, as is shown in the xdpsock sample changes in this set.




^ permalink raw reply

* Re: [PATCH] sis900: add support for ethtool --eeprom-dump
From: Sergej Benilov @ 2019-07-25 16:41 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: venza, netdev
In-Reply-To: <20190725162543.GJ21952@lunn.ch>

On Thu, 25 Jul 2019 at 18:25, Andrew Lunn <andrew@lunn.ch> wrote:
>
> > +static int sis900_read_eeprom(struct net_device *net_dev, u8 *buf)
> > +{
> > +     struct sis900_private *sis_priv = netdev_priv(net_dev);
> > +     void __iomem *ioaddr = sis_priv->ioaddr;
> > +     int wait, ret = -EAGAIN;
> > +     u16 signature;
> > +     u16 *ebuf = (u16 *)buf;
> > +     int i;
> > +
> > +     if (sis_priv->chipset_rev == SIS96x_900_REV) {
> > +             sw32(mear, EEREQ);
> > +             for (wait = 0; wait < 2000; wait++) {
> > +                     if (sr32(mear) & EEGNT) {
> > +                             /* read 16 bits, and index by 16 bits */
> > +                             for (i = 0; i < sis_priv->eeprom_size / 2; i++)
> > +                                     ebuf[i] = (u16)read_eeprom(ioaddr, i);
> > +                     ret = 0;
> > +                     break;
> > +                     }
> > +             udelay(1);
> > +             }
> > +     sw32(mear, EEDONE);
>
> The indentation looks all messed up here.

This has passed ./scripts/checkpatch.pl, as you had suggested for the
previous patch.

>
> > +     } else {
> > +             signature = (u16)read_eeprom(ioaddr, EEPROMSignature);
> > +             if (signature != 0xffff && signature != 0x0000) {
> > +                     /* read 16 bits, and index by 16 bits */
> > +                     for (i = 0; i < sis_priv->eeprom_size / 2; i++)
> > +                             ebuf[i] = (u16)read_eeprom(ioaddr, i);
> > +                     ret = 0;
> > +             }
> > +     }
> > +     return ret;
> > +}
> > +
> > +#define SIS900_EEPROM_MAGIC  0xBABE
> > +static int sis900_get_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom, u8 *data)
> > +{
> > +     struct sis900_private *sis_priv = netdev_priv(dev);
> > +     u8 *eebuf;
> > +     int res;
> > +
> > +     eebuf = kmalloc(sis_priv->eeprom_size, GFP_KERNEL);
> > +     if (!eebuf)
> > +             return -ENOMEM;
> > +
> > +     eeprom->magic = SIS900_EEPROM_MAGIC;
> > +     spin_lock_irq(&sis_priv->lock);
> > +     res = sis900_read_eeprom(dev, eebuf);
> > +     spin_unlock_irq(&sis_priv->lock);
> > +     if (!res)
> > +             memcpy(data, eebuf + eeprom->offset, eeprom->len);
> > +     kfree(eebuf);
>
> Why do you not put the data directly into data and avoid this memory
> allocation, and memcpy?

Because EEPROM data from 'eeprom->offset' offset and of 'eeprom->len'
length only is expected to be returned in 'data'.

>
>             Andrew

^ permalink raw reply

* [PATCH] ip6_tunnel: fix possible use-after-free on xmit
From: Haishuang Yan @ 2019-07-25 16:40 UTC (permalink / raw)
  To: David S. Miller, Alexey Kuznetsov; +Cc: netdev, linux-kernel, Haishuang Yan

ip4ip6/ip6ip6 tunnels run iptunnel_handle_offloads on xmit which
can cause a possible use-after-free accessing iph/ipv6h pointer
since the packet will be 'uncloned' running pskb_expand_head if
it is a cloned gso skb.

Fixes: 0e9a709560db ("ip6_tunnel, ip6_gre: fix setting of DSCP on encapsulated packets")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
---
 net/ipv6/ip6_tunnel.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 3134fbb..754a484 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1278,12 +1278,11 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 	}
 
 	fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL);
+	dsfield = INET_ECN_encapsulate(dsfield, ipv4_get_dsfield(iph));
 
 	if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6))
 		return -1;
 
-	dsfield = INET_ECN_encapsulate(dsfield, ipv4_get_dsfield(iph));
-
 	skb_set_inner_ipproto(skb, IPPROTO_IPIP);
 
 	err = ip6_tnl_xmit(skb, dev, dsfield, &fl6, encap_limit, &mtu,
@@ -1367,12 +1366,11 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
 	}
 
 	fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL);
+	dsfield = INET_ECN_encapsulate(dsfield, ipv6_get_dsfield(ipv6h));
 
 	if (iptunnel_handle_offloads(skb, SKB_GSO_IPXIP6))
 		return -1;
 
-	dsfield = INET_ECN_encapsulate(dsfield, ipv6_get_dsfield(ipv6h));
-
 	skb_set_inner_ipproto(skb, IPPROTO_IPV6);
 
 	err = ip6_tnl_xmit(skb, dev, dsfield, &fl6, encap_limit, &mtu,
-- 
1.8.3.1




^ permalink raw reply related

* Re: [PATCH v3 0/7] Convert skb_frag_t to bio_vec
From: Jonathan Lemon @ 2019-07-25 16:26 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: davem, hch, netdev
In-Reply-To: <20190723030831.11879-1-willy@infradead.org>



On 22 Jul 2019, at 20:08, Matthew Wilcox wrote:

> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> The skb_frag_t and bio_vec are fundamentally the same (page, offset,
> length) tuple.  This patch series unifies the two, leaving the
> skb_frag_t typedef in place.  This has the immediate advantage that
> we already have iov_iter support for bvecs and don't need to add
> support for iterating skbuffs.  It enables a long-term plan to use
> bvecs more broadly within the kernel and should make network-storage
> drivers able to do less work converting between skbuffs and biovecs.
>
> It will consume more memory on 32-bit kernels.  If that proves
> problematic, we can look at ways of addressing it.
>
> v3: Rebase on latest Linus with net-next merged.
>   - Reorder the uncontroversial 'Use skb accessors' patches first so you
>     can apply just those two if you want to hold off on the full
>     conversion.
>   - Convert all the users of 'struct skb_frag_struct' to skb_frag_t.

Thanks for doing this!  One step closer to simplifying code.

For the series:
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>



>
> Matthew Wilcox (Oracle) (7):
>   net: Use skb accessors in network drivers
>   net: Use skb accessors in network core
>   net: Increase the size of skb_frag_t
>   net: Reorder the contents of skb_frag_t
>   net: Rename skb_frag page to bv_page
>   net: Rename skb_frag_t size to bv_len
>   net: Convert skb_frag_t to bio_vec
>
>  drivers/crypto/chelsio/chtls/chtls_io.c       |  6 ++--
>  drivers/hsi/clients/ssi_protocol.c            |  3 +-
>  drivers/infiniband/hw/hfi1/vnic_sdma.c        |  2 +-
>  drivers/net/ethernet/3com/3c59x.c             |  2 +-
>  drivers/net/ethernet/agere/et131x.c           |  6 ++--
>  drivers/net/ethernet/amd/xgbe/xgbe-desc.c     |  2 +-
>  drivers/net/ethernet/amd/xgbe/xgbe-drv.c      |  2 +-
>  .../net/ethernet/apm/xgene/xgene_enet_main.c  |  3 +-
>  drivers/net/ethernet/atheros/alx/main.c       |  4 +--
>  .../net/ethernet/atheros/atl1c/atl1c_main.c   |  4 +--
>  .../net/ethernet/atheros/atl1e/atl1e_main.c   |  3 +-
>  drivers/net/ethernet/atheros/atlx/atl1.c      |  3 +-
>  drivers/net/ethernet/broadcom/bgmac.c         |  2 +-
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  2 +-
>  drivers/net/ethernet/brocade/bna/bnad.c       |  2 +-
>  drivers/net/ethernet/calxeda/xgmac.c          |  2 +-
>  .../net/ethernet/cavium/liquidio/lio_main.c   | 23 ++++++-------
>  .../ethernet/cavium/liquidio/lio_vf_main.c    | 23 ++++++-------
>  .../ethernet/cavium/thunder/nicvf_queues.c    |  4 +--
>  drivers/net/ethernet/chelsio/cxgb3/sge.c      |  2 +-
>  drivers/net/ethernet/cortina/gemini.c         |  5 ++-
>  drivers/net/ethernet/emulex/benet/be_main.c   |  2 +-
>  drivers/net/ethernet/freescale/enetc/enetc.c  |  2 +-
>  drivers/net/ethernet/freescale/fec_main.c     |  4 +--
>  drivers/net/ethernet/hisilicon/hix5hd2_gmac.c |  2 +-
>  drivers/net/ethernet/hisilicon/hns/hns_enet.c |  4 +--
>  .../net/ethernet/hisilicon/hns3/hns3_enet.c   |  8 ++---
>  drivers/net/ethernet/huawei/hinic/hinic_tx.c  |  2 +-
>  drivers/net/ethernet/ibm/emac/core.c          |  2 +-
>  drivers/net/ethernet/intel/e1000/e1000_main.c |  3 +-
>  drivers/net/ethernet/intel/e1000e/netdev.c    |  3 +-
>  drivers/net/ethernet/intel/fm10k/fm10k_main.c |  5 +--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  4 +--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  2 +-
>  drivers/net/ethernet/intel/iavf/iavf_txrx.c   |  4 +--
>  drivers/net/ethernet/intel/iavf/iavf_txrx.h   |  2 +-
>  drivers/net/ethernet/intel/ice/ice_txrx.c     |  6 ++--
>  drivers/net/ethernet/intel/igb/igb_main.c     |  5 +--
>  drivers/net/ethernet/intel/igbvf/netdev.c     |  2 +-
>  drivers/net/ethernet/intel/igc/igc_main.c     |  5 +--
>  drivers/net/ethernet/intel/ixgb/ixgb_main.c   |  4 +--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  9 ++---
>  .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 +-
>  drivers/net/ethernet/jme.c                    |  5 ++-
>  drivers/net/ethernet/marvell/mvneta.c         |  4 +--
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |  7 ++--
>  drivers/net/ethernet/mediatek/mtk_eth_soc.c   |  4 +--
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c    |  4 +--
>  .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  2 +-
>  drivers/net/ethernet/microchip/lan743x_main.c |  5 ++-
>  .../net/ethernet/myricom/myri10ge/myri10ge.c  | 10 +++---
>  .../ethernet/netronome/nfp/nfp_net_common.c   |  6 ++--
>  .../ethernet/qlogic/netxen/netxen_nic_main.c  |  4 +--
>  .../net/ethernet/qlogic/qlcnic/qlcnic_io.c    |  2 +-
>  drivers/net/ethernet/qualcomm/emac/emac-mac.c | 12 +++----
>  .../net/ethernet/synopsys/dwc-xlgmac-desc.c   |  2 +-
>  .../net/ethernet/synopsys/dwc-xlgmac-net.c    |  2 +-
>  drivers/net/ethernet/tehuti/tehuti.c          |  2 +-
>  drivers/net/usb/usbnet.c                      |  4 +--
>  drivers/net/vmxnet3/vmxnet3_drv.c             |  7 ++--
>  drivers/net/wireless/ath/wil6210/debugfs.c    |  3 +-
>  drivers/net/wireless/ath/wil6210/txrx.c       |  9 +++--
>  drivers/net/wireless/ath/wil6210/txrx_edma.c  |  2 +-
>  drivers/net/xen-netback/netback.c             |  4 +--
>  drivers/s390/net/qeth_core_main.c             |  2 +-
>  drivers/scsi/fcoe/fcoe_transport.c            |  2 +-
>  drivers/staging/octeon/ethernet-tx.c          |  5 ++-
>  .../staging/unisys/visornic/visornic_main.c   |  4 +--
>  drivers/target/iscsi/cxgbit/cxgbit_target.c   | 13 +++----
>  include/linux/bvec.h                          |  5 ++-
>  include/linux/skbuff.h                        | 34 ++++++-------------
>  net/core/skbuff.c                             | 26 ++++++++------
>  net/core/tso.c                                |  8 ++---
>  net/ipv4/tcp.c                                | 14 ++++----
>  net/kcm/kcmsock.c                             |  8 ++---
>  net/tls/tls_device.c                          | 14 ++++----
>  76 files changed, 202 insertions(+), 220 deletions(-)
>
> -- 
> 2.20.1

^ permalink raw reply

* Re: [PATCH] sis900: add support for ethtool --eeprom-dump
From: Andrew Lunn @ 2019-07-25 16:25 UTC (permalink / raw)
  To: Sergej Benilov; +Cc: venza, netdev
In-Reply-To: <20190725161809.14650-1-sergej.benilov@googlemail.com>

> +static int sis900_read_eeprom(struct net_device *net_dev, u8 *buf)
> +{
> +	struct sis900_private *sis_priv = netdev_priv(net_dev);
> +	void __iomem *ioaddr = sis_priv->ioaddr;
> +	int wait, ret = -EAGAIN;
> +	u16 signature;
> +	u16 *ebuf = (u16 *)buf;
> +	int i;
> +
> +	if (sis_priv->chipset_rev == SIS96x_900_REV) {
> +		sw32(mear, EEREQ);
> +		for (wait = 0; wait < 2000; wait++) {
> +			if (sr32(mear) & EEGNT) {
> +				/* read 16 bits, and index by 16 bits */
> +				for (i = 0; i < sis_priv->eeprom_size / 2; i++)
> +					ebuf[i] = (u16)read_eeprom(ioaddr, i);
> +			ret = 0;
> +			break;
> +			}
> +		udelay(1);
> +		}
> +	sw32(mear, EEDONE);

The indentation looks all messed up here.

> +	} else {
> +		signature = (u16)read_eeprom(ioaddr, EEPROMSignature);
> +		if (signature != 0xffff && signature != 0x0000) {
> +			/* read 16 bits, and index by 16 bits */
> +			for (i = 0; i < sis_priv->eeprom_size / 2; i++)
> +				ebuf[i] = (u16)read_eeprom(ioaddr, i);
> +			ret = 0;
> +		}
> +	}
> +	return ret;
> +}
> +
> +#define SIS900_EEPROM_MAGIC	0xBABE
> +static int sis900_get_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom, u8 *data)
> +{
> +	struct sis900_private *sis_priv = netdev_priv(dev);
> +	u8 *eebuf;
> +	int res;
> +
> +	eebuf = kmalloc(sis_priv->eeprom_size, GFP_KERNEL);
> +	if (!eebuf)
> +		return -ENOMEM;
> +
> +	eeprom->magic = SIS900_EEPROM_MAGIC;
> +	spin_lock_irq(&sis_priv->lock);
> +	res = sis900_read_eeprom(dev, eebuf);
> +	spin_unlock_irq(&sis_priv->lock);
> +	if (!res)
> +		memcpy(data, eebuf + eeprom->offset, eeprom->len);
> +	kfree(eebuf);

Why do you not put the data directly into data and avoid this memory
allocation, and memcpy?

	    Andrew

^ permalink raw reply

* Re: [PATCH V2 3/3] net: dsa: ksz: Add Microchip KSZ8795 DSA driver
From: Andrew Lunn @ 2019-07-25 16:18 UTC (permalink / raw)
  To: Marek Vasut
  Cc: netdev, Tristram Ha, David S . Miller, Florian Fainelli,
	Vivien Didelot, Woojung Huh
In-Reply-To: <20190725150552.6901-4-marex@denx.de>

On Thu, Jul 25, 2019 at 05:05:52PM +0200, Marek Vasut wrote:

Hi Marek

> +static void ksz8795_phy_setup(struct ksz_device *dev, int port,
> +			      struct phy_device *phy)
> +{
> +	if (port < dev->phy_port_cnt) {
> +		/*
> +		 * SUPPORTED_Asym_Pause and SUPPORTED_Pause can be removed to
> +		 * disable flow control when rate limiting is used.
> +		 */
> +		linkmode_copy(phy->advertising, phy->supported);
> +	}
> +}

Please could you drop this, since your testing seems to indicate it is
not needed.

    Thanks
	Andrew

^ permalink raw reply

* [PATCH] sis900: add support for ethtool --eeprom-dump
From: Sergej Benilov @ 2019-07-25 16:18 UTC (permalink / raw)
  To: venza, netdev, andrew; +Cc: Sergej Benilov

Implement ethtool's EEPROM dump command (ethtool -e|--eeprom-dump).

Signed-off-by: Sergej Benilov <sergej.benilov@googlemail.com>
---
 drivers/net/ethernet/sis/sis900.c | 68 +++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/drivers/net/ethernet/sis/sis900.c b/drivers/net/ethernet/sis/sis900.c
index 6e07f5ebacfc..cf9d75ffefc9 100644
--- a/drivers/net/ethernet/sis/sis900.c
+++ b/drivers/net/ethernet/sis/sis900.c
@@ -191,6 +191,8 @@ struct sis900_private {
 	unsigned int tx_full; /* The Tx queue is full. */
 	u8 host_bridge_rev;
 	u8 chipset_rev;
+	/* EEPROM data */
+	int eeprom_size;
 };
 
 MODULE_AUTHOR("Jim Huang <cmhuang@sis.com.tw>, Ollie Lho <ollie@sis.com.tw>");
@@ -475,6 +477,8 @@ static int sis900_probe(struct pci_dev *pci_dev,
 	sis_priv->pci_dev = pci_dev;
 	spin_lock_init(&sis_priv->lock);
 
+	sis_priv->eeprom_size = 24;
+
 	pci_set_drvdata(pci_dev, net_dev);
 
 	ring_space = pci_alloc_consistent(pci_dev, TX_TOTAL_SIZE, &ring_dma);
@@ -2122,6 +2126,68 @@ static void sis900_get_wol(struct net_device *net_dev, struct ethtool_wolinfo *w
 	wol->supported = (WAKE_PHY | WAKE_MAGIC);
 }
 
+static int sis900_get_eeprom_len(struct net_device *dev)
+{
+	struct sis900_private *sis_priv = netdev_priv(dev);
+
+	return sis_priv->eeprom_size;
+}
+
+static int sis900_read_eeprom(struct net_device *net_dev, u8 *buf)
+{
+	struct sis900_private *sis_priv = netdev_priv(net_dev);
+	void __iomem *ioaddr = sis_priv->ioaddr;
+	int wait, ret = -EAGAIN;
+	u16 signature;
+	u16 *ebuf = (u16 *)buf;
+	int i;
+
+	if (sis_priv->chipset_rev == SIS96x_900_REV) {
+		sw32(mear, EEREQ);
+		for (wait = 0; wait < 2000; wait++) {
+			if (sr32(mear) & EEGNT) {
+				/* read 16 bits, and index by 16 bits */
+				for (i = 0; i < sis_priv->eeprom_size / 2; i++)
+					ebuf[i] = (u16)read_eeprom(ioaddr, i);
+			ret = 0;
+			break;
+			}
+		udelay(1);
+		}
+	sw32(mear, EEDONE);
+	} else {
+		signature = (u16)read_eeprom(ioaddr, EEPROMSignature);
+		if (signature != 0xffff && signature != 0x0000) {
+			/* read 16 bits, and index by 16 bits */
+			for (i = 0; i < sis_priv->eeprom_size / 2; i++)
+				ebuf[i] = (u16)read_eeprom(ioaddr, i);
+			ret = 0;
+		}
+	}
+	return ret;
+}
+
+#define SIS900_EEPROM_MAGIC	0xBABE
+static int sis900_get_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom, u8 *data)
+{
+	struct sis900_private *sis_priv = netdev_priv(dev);
+	u8 *eebuf;
+	int res;
+
+	eebuf = kmalloc(sis_priv->eeprom_size, GFP_KERNEL);
+	if (!eebuf)
+		return -ENOMEM;
+
+	eeprom->magic = SIS900_EEPROM_MAGIC;
+	spin_lock_irq(&sis_priv->lock);
+	res = sis900_read_eeprom(dev, eebuf);
+	spin_unlock_irq(&sis_priv->lock);
+	if (!res)
+		memcpy(data, eebuf + eeprom->offset, eeprom->len);
+	kfree(eebuf);
+	return res;
+}
+
 static const struct ethtool_ops sis900_ethtool_ops = {
 	.get_drvinfo 	= sis900_get_drvinfo,
 	.get_msglevel	= sis900_get_msglevel,
@@ -2132,6 +2198,8 @@ static const struct ethtool_ops sis900_ethtool_ops = {
 	.set_wol	= sis900_set_wol,
 	.get_link_ksettings = sis900_get_link_ksettings,
 	.set_link_ksettings = sis900_set_link_ksettings,
+	.get_eeprom_len = sis900_get_eeprom_len,
+	.get_eeprom = sis900_get_eeprom,
 };
 
 /**
-- 
2.17.1


^ permalink raw reply related

* RE: [PATCH bpf-next v3 00/11] XDP unaligned chunk placement support
From: Richardson, Bruce @ 2019-07-25 15:56 UTC (permalink / raw)
  To: Jonathan Lemon, Laatz, Kevin
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	Topel, Bjorn, Karlsson, Magnus, jakub.kicinski@netronome.com,
	saeedm@mellanox.com, maximmi@mellanox.com,
	stephen@networkplumber.org, Loftus, Ciara, bpf@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org
In-Reply-To: <94EAD717-F632-499F-8BBD-FFF5A5333CBF@gmail.com>



> -----Original Message-----
> From: Jonathan Lemon [mailto:jonathan.lemon@gmail.com]
> Sent: Thursday, July 25, 2019 4:39 PM
> To: Laatz, Kevin <kevin.laatz@intel.com>
> Cc: netdev@vger.kernel.org; ast@kernel.org; daniel@iogearbox.net; Topel,
> Bjorn <bjorn.topel@intel.com>; Karlsson, Magnus
> <magnus.karlsson@intel.com>; jakub.kicinski@netronome.com;
> saeedm@mellanox.com; maximmi@mellanox.com; stephen@networkplumber.org;
> Richardson, Bruce <bruce.richardson@intel.com>; Loftus, Ciara
> <ciara.loftus@intel.com>; bpf@vger.kernel.org; intel-wired-
> lan@lists.osuosl.org
> Subject: Re: [PATCH bpf-next v3 00/11] XDP unaligned chunk placement
> support
> 
> 
> 
> On 23 Jul 2019, at 22:10, Kevin Laatz wrote:
> 
> > This patch set adds the ability to use unaligned chunks in the XDP umem.
> >
> > Currently, all chunk addresses passed to the umem are masked to be
> > chunk size aligned (max is PAGE_SIZE). This limits where we can place
> > chunks within the umem as well as limiting the packet sizes that are
> supported.
> >
> > The changes in this patch set removes these restrictions, allowing XDP
> > to be more flexible in where it can place a chunk within a umem. By
> > relaxing where the chunks can be placed, it allows us to use an
> > arbitrary buffer size and place that wherever we have a free address
> > in the umem. These changes add the ability to support arbitrary frame
> > sizes up to 4k
> > (PAGE_SIZE) and make it easy to integrate with other existing
> > frameworks that have their own memory management systems, such as DPDK.
> > In DPDK, for example, there is already support for AF_XDP with zero-
> copy.
> > However, with this patch set the integration will be much more seamless.
> > You can find the DPDK AF_XDP driver at:
> > https://git.dpdk.org/dpdk/tree/drivers/net/af_xdp
> >
> > Since we are now dealing with arbitrary frame sizes, we need also need
> > to update how we pass around addresses. Currently, the addresses can
> > simply be masked to 2k to get back to the original address. This
> > becomes less trivial when using frame sizes that are not a 'power of
> > 2' size. This patch set modifies the Rx/Tx descriptor format to use
> > the upper 16-bits of the addr field for an offset value, leaving the
> > lower 48-bits for the address (this leaves us with 256 Terabytes,
> > which should be enough!). We only need to use the upper 16-bits to store
> the offset when running in unaligned mode.
> > Rather than adding the offset (headroom etc) to the address, we will
> > store it in the upper 16-bits of the address field. This way, we can
> > easily add the offset to the address where we need it, using some bit
> > manipulation and addition, and we can also easily get the original
> > address wherever we need it (for example in i40e_zca_fr-- ee) by
> > simply masking to get the lower 48-bits of the address field.
> 
> I wonder if it would be better to break backwards compatibility here and
> say that a handle is going to change from [addr] to [base | offset], or
> even [index | offset], where address = (index * chunk size) + offset, and
> then use accessor macros to manipulate the queue entries.
> 
> This way, the XDP hotpath can adjust the handle with simple arithmetic,
> bypassing the "if (unaligned)", check, as it changes the offset directly.
> 
> Using a chunk index instead of a base address is safer, otherwise it is
> too easy to corrupt things.
> --

The trouble with using a chunk index is that it assumes that all chunks are
contiguous, which is not always going to be the case. For example, for
userspace apps the easiest way to get memory that is IOVA/physically 
contiguous is to use hugepages, but even then we still need to skip space
when crossing a 2MB barrier.

Specifically in this example case, with a 3k buffer size and 2MB hugepages,
we'd get 666 buffers on a single page, but then waste a few KB before
starting the 667th buffer at byte 0 on the second 2MB page.
This is the setup used in DPDK, for instance, when we allocate memory for
use in buffer pools. 

Therefore, I think it's better to just have the kernel sanity checking the
request for safety and leave userspace the freedom to decide where in its
memory area it wants to place the buffers.

Regards,
/Bruce

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH v2 00/10] XDP unaligned chunk placement support
From: Jonathan Lemon @ 2019-07-25 15:43 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Alexei Starovoitov, Kevin Laatz, Jakub Kicinski, Daniel Borkmann,
	Network Development, ciara.loftus, Alexei Starovoitov,
	intel-wired-lan, bruce.richardson, bpf, Björn Töpel,
	Karlsson, Magnus
In-Reply-To: <CAJ8uoz26Byzhc4My70h3nmfd3pQB1ahFf=ZzN2heN4asrJ-NQg@mail.gmail.com>



On 24 Jul 2019, at 6:25, Magnus Karlsson wrote:

> On Tue, Jul 23, 2019 at 11:08 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> Johnathan, Bjorn, Jakub,
>> Please review!
>> The patch set has been pending for a week.
>
> There is a v3 coming out shortly so I suggest we wait for that. It
> will have Mellanox support for this feature too and some clean ups. I
> refrained from posting a review on the mailing list due to the merge
> window being closed last week, but maybe that was not correct. Should
> I still post reviews for new features submitted during the closed
> merge window period? I am happy to do it since I do not have any other
> tasks during that time. It is a quite period for me. Just let me know.

Same here - last time I posted a patch when the merge window was closed,
it was signaled to me (Hi Jakub!) to wait until it reopened.
-- 
Jonathan



>
> /Magnus
>
>> On Tue, Jul 16, 2019 at 4:21 AM Kevin Laatz <kevin.laatz@intel.com> 
>> wrote:
>>>
>>> This patch set adds the ability to use unaligned chunks in the XDP 
>>> umem.
>>>
>>> Currently, all chunk addresses passed to the umem are masked to be 
>>> chunk
>>> size aligned (default is 2k, max is PAGE_SIZE). This limits where we 
>>> can
>>> place chunks within the umem as well as limiting the packet sizes 
>>> that are
>>> supported.
>>>
>>> The changes in this patch set removes these restrictions, allowing 
>>> XDP to
>>> be more flexible in where it can place a chunk within a umem. By 
>>> relaxing
>>> where the chunks can be placed, it allows us to use an arbitrary 
>>> buffer
>>> size and place that wherever we have a free address in the umem. 
>>> These
>>> changes add the ability to support arbitrary frame sizes up to 4k
>>> (PAGE_SIZE) and make it easy to integrate with other existing 
>>> frameworks
>>> that have their own memory management systems, such as DPDK.
>>>
>>> Since we are now dealing with arbitrary frame sizes, we need also 
>>> need to
>>> update how we pass around addresses. Currently, the addresses can 
>>> simply be
>>> masked to 2k to get back to the original address. This becomes less 
>>> trivial
>>> when using frame sizes that are not a 'power of 2' size. This patch 
>>> set
>>> modifies the Rx/Tx descriptor format to use the upper 16-bits of the 
>>> addr
>>> field for an offset value, leaving the lower 48-bits for the address 
>>> (this
>>> leaves us with 256 Terabytes, which should be enough!). We only need 
>>> to use
>>> the upper 16-bits to store the offset when running in unaligned 
>>> mode.
>>> Rather than adding the offset (headroom etc) to the address, we will 
>>> store
>>> it in the upper 16-bits of the address field. This way, we can 
>>> easily add
>>> the offset to the address where we need it, using some bit 
>>> manipulation and
>>> addition, and we can also easily get the original address wherever 
>>> we need
>>> it (for example in i40e_zca_free) by simply masking to get the lower
>>> 48-bits of the address field.
>>>
>>> The numbers below were recorded with the following set up:
>>>   - Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
>>>   - Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 
>>> (rev 02)
>>>   - Driver: i40e
>>>   - Application: xdpsock with l2fwd (single interface)
>>>
>>> These are solely for comparing performance with and without the 
>>> patches.
>>> The largest drop was ~1% (in zero-copy mode).
>>>
>>> +-------------------------+------------+-----------------+-------------+
>>> | Buffer size: 2048       | SKB mode   | Zero-copy       | Copy      
>>>   |
>>> +-------------------------+------------+-----------------+-------------+
>>> | Aligned (baseline)      | 1.7 Mpps   | 15.3 Mpps       | 2.08 Mpps 
>>>   |
>>> +-------------------------+------------+-----------------+-------------+
>>> | Aligned (with patches)  | 1.7 Mpps   | 15.1 Mpps       | 2.08 Mpps 
>>>   |
>>> +-------------------------+------------+-----------------+-------------+
>>> | Unaligned               | 1.7 Mpps   | 14.5 Mpps       | 2.08 Mpps 
>>>   |
>>> +-------------------------+------------+-----------------+-------------+
>>>
>>> NOTE: We are currently working on the changes required in the 
>>> Mellanox
>>> driver. We will include these in the v3.
>>>
>>> Structure of the patchset:
>>> Patch 1:
>>>   - Remove unnecessary masking and headroom addition during 
>>> zero-copy Rx
>>>     buffer recycling in i40e. This change is required in order for 
>>> the
>>>     buffer recycling to work in the unaligned chunk mode.
>>>
>>> Patch 2:
>>>   - Remove unnecessary masking and headroom addition during
>>>     zero-copy Rx buffer recycling in ixgbe. This change is required 
>>> in
>>>     order for the  buffer recycling to work in the unaligned chunk 
>>> mode.
>>>
>>> Patch 3:
>>>   - Add infrastructure for unaligned chunks. Since we are dealing 
>>> with
>>>     unaligned chunks that could potentially cross a physical page 
>>> boundary,
>>>     we add checks to keep track of that information. We can later 
>>> use this
>>>     information to correctly handle buffers that are placed at an 
>>> address
>>>     where they cross a page boundary.  This patch also modifies the
>>>     existing Rx and Tx functions to use the new descriptor format. 
>>> To
>>>     handle addresses correctly, we need to mask appropriately based 
>>> on
>>>     whether we are in aligned or unaligned mode.
>>>
>>> Patch 4:
>>>   - This patch updates the i40e driver to make use of the new 
>>> descriptor
>>>     format. The new format is particularly useful here since we can 
>>> now
>>>     retrieve the original address in places like i40e_zca_free with 
>>> ease.
>>>     This saves us doing various calculations to get the original 
>>> address
>>>     back.
>>>
>>> Patch 5:
>>>   - This patch updates the ixgbe driver to make use of the new 
>>> descriptor
>>>     format. The new format is particularly useful here since we can 
>>> now
>>>     retrieve the original address in places like ixgbe_zca_free with 
>>> ease.
>>>     This saves us doing various calculations to get the original 
>>> address
>>>     back.
>>>
>>> Patch 6:
>>>   - Add flags for umem configuration to libbpf
>>>
>>> Patch 7:
>>>   - Modify xdpsock application to add a command line option for
>>>     unaligned chunks
>>>
>>> Patch 8:
>>>   - Since we can now run the application in unaligned chunk mode, we 
>>> need
>>>     to make sure we recycle the buffers appropriately.
>>>
>>> Patch 9:
>>>   - Adds hugepage support to the xdpsock application
>>>
>>> Patch 10:
>>>   - Documentation update to include the unaligned chunk scenario. We 
>>> need
>>>     to explicitly state that the incoming addresses are only masked 
>>> in the
>>>     aligned chunk mode and not the unaligned chunk mode.
>>>
>>> ---
>>> v2:
>>>   - fixed checkpatch issues
>>>   - fixed Rx buffer recycling for unaligned chunks in xdpsock
>>>   - removed unused defines
>>>   - fixed how chunk_size is calculated in xsk_diag.c
>>>   - added some performance numbers to cover letter
>>>   - modified descriptor format to make it easier to retrieve 
>>> original
>>>     address
>>>   - removed patch adding off_t off to the zero copy allocator. This 
>>> is no
>>>     longer needed with the new descriptor format.
>>>
>>> Kevin Laatz (10):
>>>   i40e: simplify Rx buffer recycle
>>>   ixgbe: simplify Rx buffer recycle
>>>   xsk: add support to allow unaligned chunk placement
>>>   i40e: modify driver for handling offsets
>>>   ixgbe: modify driver for handling offsets
>>>   libbpf: add flags to umem config
>>>   samples/bpf: add unaligned chunks mode support to xdpsock
>>>   samples/bpf: add buffer recycling for unaligned chunks to xdpsock
>>>   samples/bpf: use hugepages in xdpsock app
>>>   doc/af_xdp: include unaligned chunk case
>>>
>>>  Documentation/networking/af_xdp.rst          | 10 ++-
>>>  drivers/net/ethernet/intel/i40e/i40e_xsk.c   | 39 +++++----
>>>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 39 +++++----
>>>  include/net/xdp_sock.h                       |  2 +
>>>  include/uapi/linux/if_xdp.h                  |  9 ++
>>>  net/xdp/xdp_umem.c                           | 17 ++--
>>>  net/xdp/xsk.c                                | 89 
>>> ++++++++++++++++----
>>>  net/xdp/xsk_diag.c                           |  2 +-
>>>  net/xdp/xsk_queue.h                          | 70 +++++++++++++--
>>>  samples/bpf/xdpsock_user.c                   | 61 ++++++++++----
>>>  tools/include/uapi/linux/if_xdp.h            |  4 +
>>>  tools/lib/bpf/xsk.c                          |  3 +
>>>  tools/lib/bpf/xsk.h                          |  2 +
>>>  13 files changed, 266 insertions(+), 81 deletions(-)
>>>
>>> --
>>> 2.17.1
>>>
>> _______________________________________________
>> Intel-wired-lan mailing list
>> Intel-wired-lan@osuosl.org
>> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply

* Re: [PATCH 3/3] net: dsa: ksz: Add Microchip KSZ8795 DSA driver
From: Marek Vasut @ 2019-07-25 15:36 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Tristram Ha, David S . Miller, Florian Fainelli,
	Vivien Didelot, Woojung Huh
In-Reply-To: <20190725145924.GB25700@lunn.ch>

On 7/25/19 4:59 PM, Andrew Lunn wrote:
> On Thu, Jul 25, 2019 at 04:56:37PM +0200, Marek Vasut wrote:
>> On 7/25/19 4:03 PM, Andrew Lunn wrote:
>>> On Wed, Jul 24, 2019 at 03:40:48PM +0200, Marek Vasut wrote:
>>>> From: Tristram Ha <Tristram.Ha@microchip.com>
>>>> +static void ksz8795_phy_setup(struct ksz_device *dev, int port,
>>>> +			      struct phy_device *phy)
>>>> +{
>>>> +	if (port < dev->phy_port_cnt) {
>>>> +		/*
>>>> +		 * SUPPORTED_Asym_Pause and SUPPORTED_Pause can be removed to
>>>> +		 * disable flow control when rate limiting is used.
>>>> +		 */
>>>> +		linkmode_copy(phy->advertising, phy->supported);
>>>> +	}
>>>> +}
>>>
>>> Hi Marek
>>>
>>> Do you know why this is needed?
>>
>> Unfortunately, no.
>>
>> It seems it copies supported features of the PHY to advertised features
>> of the PHY for ports which are downstream (i.e. not the CPU port).
> 
> Hi Marek
> 
> Could you test it without this copy? Do you get sensible values from
> ethtool? Does the pause configuration look sensible?

They do look OK even without the code.

^ permalink raw reply

* Re: [PATCH bpf-next v3 00/11] XDP unaligned chunk placement support
From: Jonathan Lemon @ 2019-07-25 15:39 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190724051043.14348-1-kevin.laatz@intel.com>



On 23 Jul 2019, at 22:10, Kevin Laatz wrote:

> This patch set adds the ability to use unaligned chunks in the XDP umem.
>
> Currently, all chunk addresses passed to the umem are masked to be chunk
> size aligned (max is PAGE_SIZE). This limits where we can place chunks
> within the umem as well as limiting the packet sizes that are supported.
>
> The changes in this patch set removes these restrictions, allowing XDP to
> be more flexible in where it can place a chunk within a umem. By relaxing
> where the chunks can be placed, it allows us to use an arbitrary buffer
> size and place that wherever we have a free address in the umem. These
> changes add the ability to support arbitrary frame sizes up to 4k
> (PAGE_SIZE) and make it easy to integrate with other existing frameworks
> that have their own memory management systems, such as DPDK.
> In DPDK, for example, there is already support for AF_XDP with zero-copy.
> However, with this patch set the integration will be much more seamless.
> You can find the DPDK AF_XDP driver at:
> https://git.dpdk.org/dpdk/tree/drivers/net/af_xdp
>
> Since we are now dealing with arbitrary frame sizes, we need also need to
> update how we pass around addresses. Currently, the addresses can simply be
> masked to 2k to get back to the original address. This becomes less trivial
> when using frame sizes that are not a 'power of 2' size. This patch set
> modifies the Rx/Tx descriptor format to use the upper 16-bits of the addr
> field for an offset value, leaving the lower 48-bits for the address (this
> leaves us with 256 Terabytes, which should be enough!). We only need to use
> the upper 16-bits to store the offset when running in unaligned mode.
> Rather than adding the offset (headroom etc) to the address, we will store
> it in the upper 16-bits of the address field. This way, we can easily add
> the offset to the address where we need it, using some bit manipulation and
> addition, and we can also easily get the original address wherever we need
> it (for example in i40e_zca_fr-- ee) by simply masking to get the lower
> 48-bits of the address field.

I wonder if it would be better to break backwards compatibility here and
say that a handle is going to change from [addr] to [base | offset], or
even [index | offset], where address = (index * chunk size) + offset, and
then use accessor macros to manipulate the queue entries.

This way, the XDP hotpath can adjust the handle with simple arithmetic,
bypassing the "if (unaligned)", check, as it changes the offset directly.

Using a chunk index instead of a base address is safer, otherwise it is
too easy to corrupt things.
-- 
Jonathan

^ permalink raw reply

* Re: [PATCH bpf-next v3 03/11] xsk: add support to allow unaligned chunk placement
From: Jonathan Lemon @ 2019-07-25 15:39 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190724051043.14348-4-kevin.laatz@intel.com>



On 23 Jul 2019, at 22:10, Kevin Laatz wrote:

> Currently, addresses are chunk size aligned. This means, we are very
> restricted in terms of where we can place chunk within the umem. For
> example, if we have a chunk size of 2k, then our chunks can only be 
> placed
> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
>
> This patch introduces the ability to use unaligned chunks. With these
> changes, we are no longer bound to having to place chunks at a 2k (or
> whatever your chunk size is) interval. Since we are no longer dealing 
> with
> aligned chunks, they can now cross page boundaries. Checks for page
> contiguity have been added in order to keep track of which pages are
> followed by a physically contiguous page.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>
> ---
> v2:
>   - Add checks for the flags coming from userspace
>   - Fix how we get chunk_size in xsk_diag.c
>   - Add defines for masking the new descriptor format
>   - Modified the rx functions to use new descriptor format
>   - Modified the tx functions to use new descriptor format
>
> v3:
>   - Add helper function to do address/offset masking/addition
> ---
>  include/net/xdp_sock.h      | 17 ++++++++
>  include/uapi/linux/if_xdp.h |  9 ++++
>  net/xdp/xdp_umem.c          | 18 +++++---
>  net/xdp/xsk.c               | 86 
> ++++++++++++++++++++++++++++++-------
>  net/xdp/xsk_diag.c          |  2 +-
>  net/xdp/xsk_queue.h         | 68 +++++++++++++++++++++++++----
>  6 files changed, 170 insertions(+), 30 deletions(-)
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 69796d264f06..738996c0f995 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -19,6 +19,7 @@ struct xsk_queue;
>  struct xdp_umem_page {
>  	void *addr;
>  	dma_addr_t dma;
> +	bool next_pg_contig;
>  };
>
>  struct xdp_umem_fq_reuse {
> @@ -48,6 +49,7 @@ struct xdp_umem {
>  	bool zc;
>  	spinlock_t xsk_list_lock;
>  	struct list_head xsk_list;
> +	u32 flags;
>  };
>
>  struct xdp_sock {
> @@ -144,6 +146,15 @@ static inline void xsk_umem_fq_reuse(struct 
> xdp_umem *umem, u64 addr)
>
>  	rq->handles[rq->length++] = addr;
>  }
> +
> +static inline u64 xsk_umem_handle_offset(struct xdp_umem *umem, u64 
> handle,
> +					 u64 offset)
> +{
> +	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
> +		return handle |= (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
> +	else
> +		return handle += offset;
> +}

This should be something like 'xsk_umem_adjust_offset()', and use "+=" 
for both cases.


>  #else
>  static inline int xsk_generic_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp)
>  {
> @@ -241,6 +252,12 @@ static inline void xsk_umem_fq_reuse(struct 
> xdp_umem *umem, u64 addr)
>  {
>  }
>
> +static inline u64 xsk_umem_handle_offset(struct xdp_umem *umem, u64 
> handle,
> +					 u64 offset)
> +{
> +	return NULL;
> +}
> +
>  #endif /* CONFIG_XDP_SOCKETS */
>
>  #endif /* _LINUX_XDP_SOCK_H */
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index faaa5ca2a117..f8dc68fcdf78 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -17,6 +17,9 @@
>  #define XDP_COPY	(1 << 1) /* Force copy-mode */
>  #define XDP_ZEROCOPY	(1 << 2) /* Force zero-copy mode */
>
> +/* Flags for xsk_umem_config flags */
> +#define XDP_UMEM_UNALIGNED_CHUNKS (1 << 0)
> +
>  struct sockaddr_xdp {
>  	__u16 sxdp_family;
>  	__u16 sxdp_flags;
> @@ -53,6 +56,7 @@ struct xdp_umem_reg {
>  	__u64 len; /* Length of packet data area */
>  	__u32 chunk_size;
>  	__u32 headroom;
> +	__u32 flags;
>  };
>
>  struct xdp_statistics {
> @@ -74,6 +78,11 @@ struct xdp_options {
>  #define XDP_UMEM_PGOFF_FILL_RING	0x100000000ULL
>  #define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000ULL
>
> +/* Masks for unaligned chunks mode */
> +#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
> +#define XSK_UNALIGNED_BUF_ADDR_MASK \
> +	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> +
>  /* Rx/Tx descriptor */
>  struct xdp_desc {
>  	__u64 addr;
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 83de74ca729a..952ca22103e9 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -299,6 +299,7 @@ static int xdp_umem_account_pages(struct xdp_umem 
> *umem)
>
>  static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg 
> *mr)
>  {
> +	bool unaligned_chunks = mr->flags & XDP_UMEM_UNALIGNED_CHUNKS;
>  	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
>  	unsigned int chunks, chunks_per_page;
>  	u64 addr = mr->addr, size = mr->len;
> @@ -314,7 +315,10 @@ static int xdp_umem_reg(struct xdp_umem *umem, 
> struct xdp_umem_reg *mr)
>  		return -EINVAL;
>  	}
>
> -	if (!is_power_of_2(chunk_size))
> +	if (mr->flags & ~(XDP_UMEM_UNALIGNED_CHUNKS))
> +		return -EINVAL;
> +
> +	if (!unaligned_chunks && !is_power_of_2(chunk_size))
>  		return -EINVAL;
>
>  	if (!PAGE_ALIGNED(addr)) {
> @@ -331,9 +335,11 @@ static int xdp_umem_reg(struct xdp_umem *umem, 
> struct xdp_umem_reg *mr)
>  	if (chunks == 0)
>  		return -EINVAL;
>
> -	chunks_per_page = PAGE_SIZE / chunk_size;
> -	if (chunks < chunks_per_page || chunks % chunks_per_page)
> -		return -EINVAL;
> +	if (!unaligned_chunks) {
> +		chunks_per_page = PAGE_SIZE / chunk_size;
> +		if (chunks < chunks_per_page || chunks % chunks_per_page)
> +			return -EINVAL;
> +	}
>
>  	headroom = ALIGN(headroom, 64);
>
> @@ -342,13 +348,15 @@ static int xdp_umem_reg(struct xdp_umem *umem, 
> struct xdp_umem_reg *mr)
>  		return -EINVAL;
>
>  	umem->address = (unsigned long)addr;
> -	umem->chunk_mask = ~((u64)chunk_size - 1);
> +	umem->chunk_mask = unaligned_chunks ? XSK_UNALIGNED_BUF_ADDR_MASK
> +					    : ~((u64)chunk_size - 1);

The handle needs to be cleaned (reset to base address) when removed
from the fill queue or recycle stack.  This will not provide the correct
semantics for unaligned mode.


>  	umem->size = size;
>  	umem->headroom = headroom;
>  	umem->chunk_size_nohr = chunk_size - headroom;
>  	umem->npgs = size / PAGE_SIZE;
>  	umem->pgs = NULL;
>  	umem->user = NULL;
> +	umem->flags = mr->flags;
>  	INIT_LIST_HEAD(&umem->xsk_list);
>  	spin_lock_init(&umem->xsk_list_lock);
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 59b57d708697..b3ab653091c4 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -45,7 +45,7 @@ EXPORT_SYMBOL(xsk_umem_has_addrs);
>
>  u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
>  {
> -	return xskq_peek_addr(umem->fq, addr);
> +	return xskq_peek_addr(umem->fq, addr, umem);
>  }
>  EXPORT_SYMBOL(xsk_umem_peek_addr);
>
> @@ -55,21 +55,42 @@ void xsk_umem_discard_addr(struct xdp_umem *umem)
>  }
>  EXPORT_SYMBOL(xsk_umem_discard_addr);
>
> +/* If a buffer crosses a page boundary, we need to do 2 memcpy's, one 
> for
> + * each page. This is only required in copy mode.
> + */
> +static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void 
> *from_buf,
> +			     u32 len, u32 metalen)
> +{
> +	void *to_buf = xdp_umem_get_data(umem, addr);
> +
> +	if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
> +		void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;
> +		u64 page_start = addr & (PAGE_SIZE - 1);
> +		u64 first_len = PAGE_SIZE - (addr - page_start);
> +
> +		memcpy(to_buf, from_buf, first_len + metalen);
> +		memcpy(next_pg_addr, from_buf + first_len, len - first_len);
> +
> +		return;
> +	}
> +
> +	memcpy(to_buf, from_buf, len + metalen);
> +}
> +
>  static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 
> len)
>  {
> -	void *to_buf, *from_buf;
> +	u64 offset = xs->umem->headroom;
> +	void *from_buf;
>  	u32 metalen;
>  	u64 addr;
>  	int err;
>
> -	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> +	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>  	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>  		xs->rx_dropped++;
>  		return -ENOSPC;
>  	}
>
> -	addr += xs->umem->headroom;
> -
>  	if (unlikely(xdp_data_meta_unsupported(xdp))) {
>  		from_buf = xdp->data;
>  		metalen = 0;
> @@ -78,9 +99,13 @@ static int __xsk_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp, u32 len)
>  		metalen = xdp->data - xdp->data_meta;
>  	}
>
> -	to_buf = xdp_umem_get_data(xs->umem, addr);
> -	memcpy(to_buf, from_buf, len + metalen);
> -	addr += metalen;
> +	__xsk_rcv_memcpy(xs->umem, addr + offset, from_buf, len, metalen);
> +
> +	offset += metalen;
> +	if (xs->umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
> +		addr |= offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +	else
> +		addr += offset;
>  	err = xskq_produce_batch_desc(xs->rx, addr, len);
>  	if (!err) {
>  		xskq_discard_addr(xs->umem->fq);
> @@ -127,6 +152,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp)
>  	u32 len = xdp->data_end - xdp->data;
>  	void *buffer;
>  	u64 addr;
> +	u64 offset = xs->umem->headroom;
>  	int err;
>
>  	spin_lock_bh(&xs->rx_lock);
> @@ -136,17 +162,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp)
>  		goto out_unlock;
>  	}
>
> -	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> +	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>  	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>  		err = -ENOSPC;
>  		goto out_drop;
>  	}
>
> -	addr += xs->umem->headroom;
> -
> -	buffer = xdp_umem_get_data(xs->umem, addr);
> +	buffer = xdp_umem_get_data(xs->umem, addr + offset);
>  	memcpy(buffer, xdp->data_meta, len + metalen);
> -	addr += metalen;
> +	offset += metalen;
> +
> +	addr = xsk_umem_handle_offset(xs->umem, addr, offset);
>  	err = xskq_produce_batch_desc(xs->rx, addr, len);
>  	if (err)
>  		goto out_drop;
> @@ -190,7 +216,7 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem, 
> struct xdp_desc *desc)
>
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
> -		if (!xskq_peek_desc(xs->tx, desc))
> +		if (!xskq_peek_desc(xs->tx, desc, umem))
>  			continue;
>
>  		if (xskq_produce_addr_lazy(umem->cq, desc->addr))
> @@ -243,7 +269,7 @@ static int xsk_generic_xmit(struct sock *sk, 
> struct msghdr *m,
>  	if (xs->queue_id >= xs->dev->real_num_tx_queues)
>  		goto out;
>
> -	while (xskq_peek_desc(xs->tx, &desc)) {
> +	while (xskq_peek_desc(xs->tx, &desc, xs->umem)) {
>  		char *buffer;
>  		u64 addr;
>  		u32 len;
> @@ -262,6 +288,10 @@ static int xsk_generic_xmit(struct sock *sk, 
> struct msghdr *m,
>
>  		skb_put(skb, len);
>  		addr = desc.addr;
> +		if (xs->umem->flags & XDP_UMEM_UNALIGNED_CHUNKS)
> +			addr = (addr & XSK_UNALIGNED_BUF_ADDR_MASK) |
> +				(addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);

This doesn't look right to me.  Shouldn't it be "(addr & mask) + (addr 
 >> shift)"?
I'd also prefer to see this type of logic in an inline/macro

> +
>  		buffer = xdp_umem_get_data(xs->umem, addr);
>  		err = skb_store_bits(skb, 0, buffer, len);
>  		if (unlikely(err) || xskq_reserve_addr(xs->umem->cq)) {
> @@ -272,7 +302,7 @@ static int xsk_generic_xmit(struct sock *sk, 
> struct msghdr *m,
>  		skb->dev = xs->dev;
>  		skb->priority = sk->sk_priority;
>  		skb->mark = sk->sk_mark;
> -		skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
> +		skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
>  		skb->destructor = xsk_destruct_skb;
>
>  		err = dev_direct_xmit(skb, xs->queue_id);
> @@ -412,6 +442,28 @@ static struct socket *xsk_lookup_xsk_from_fd(int 
> fd)
>  	return sock;
>  }
>
> +/* Check if umem pages are contiguous.
> + * If zero-copy mode, use the DMA address to do the page contiguity 
> check
> + * For all other modes we use addr (kernel virtual address)
> + */
> +static void xsk_check_page_contiguity(struct xdp_umem *umem, u32 
> flags)
> +{
> +	int i;
> +
> +	if (flags & XDP_ZEROCOPY) {
> +		for (i = 0; i < umem->npgs - 1; i++)
> +			umem->pages[i].next_pg_contig =
> +					(umem->pages[i].dma + PAGE_SIZE ==
> +						umem->pages[i + 1].dma);
> +		return;
> +	}
> +
> +	for (i = 0; i < umem->npgs - 1; i++)
> +		umem->pages[i].next_pg_contig =
> +				(umem->pages[i].addr + PAGE_SIZE ==
> +					umem->pages[i + 1].addr);
> +}
> +
>  static int xsk_bind(struct socket *sock, struct sockaddr *addr, int 
> addr_len)
>  {
>  	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
> @@ -500,6 +552,8 @@ static int xsk_bind(struct socket *sock, struct 
> sockaddr *addr, int addr_len)
>  		err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
>  		if (err)
>  			goto out_unlock;
> +
> +		xsk_check_page_contiguity(xs->umem, flags);
>  	}
>
>  	xs->dev = dev;
> diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
> index d5e06c8e0cbf..9986a759fe06 100644
> --- a/net/xdp/xsk_diag.c
> +++ b/net/xdp/xsk_diag.c
> @@ -56,7 +56,7 @@ static int xsk_diag_put_umem(const struct xdp_sock 
> *xs, struct sk_buff *nlskb)
>  	du.id = umem->id;
>  	du.size = umem->size;
>  	du.num_pages = umem->npgs;
> -	du.chunk_size = (__u32)(~umem->chunk_mask + 1);
> +	du.chunk_size = umem->chunk_size_nohr + umem->headroom;
>  	du.headroom = umem->headroom;
>  	du.ifindex = umem->dev ? umem->dev->ifindex : 0;
>  	du.queue_id = umem->queue_id;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index 909c5168ed0f..0d77212367f0 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -133,6 +133,16 @@ static inline bool xskq_has_addrs(struct 
> xsk_queue *q, u32 cnt)
>
>  /* UMEM queue */
>
> +static inline bool xskq_crosses_non_contig_pg(struct xdp_umem *umem, 
> u64 addr,
> +					      u64 length)
> +{
> +	bool cross_pg = (addr & (PAGE_SIZE - 1)) + length > PAGE_SIZE;
> +	bool next_pg_contig =
> +		umem->pages[(addr >> PAGE_SHIFT) + 1].next_pg_contig;
> +
> +	return cross_pg && !next_pg_contig;
> +}
> +
>  static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
>  {
>  	if (addr >= q->size) {
> @@ -143,23 +153,50 @@ static inline bool xskq_is_valid_addr(struct 
> xsk_queue *q, u64 addr)
>  	return true;
>  }
>
> -static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr)
> +static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q, 
> u64 addr,
> +						u64 length,
> +						struct xdp_umem *umem)
> +{
> +	addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +	addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
> +	if (addr >= q->size ||
> +	    xskq_crosses_non_contig_pg(umem, addr, length)) {
> +		q->invalid_descs++;
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr,
> +				      struct xdp_umem *umem)
>  {
>  	while (q->cons_tail != q->cons_head) {
>  		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>  		unsigned int idx = q->cons_tail & q->ring_mask;
>
>  		*addr = READ_ONCE(ring->desc[idx]) & q->chunk_mask> +
> +		if (umem->flags & XDP_UMEM_UNALIGNED_CHUNKS) {
> +			if (xskq_is_valid_addr_unaligned(q, *addr,
> +							 umem->chunk_size_nohr,
> +							 umem))
> +				return addr;
> +			goto out;
> +		}
> +
>  		if (xskq_is_valid_addr(q, *addr))
>  			return addr;
>
> +out:
>  		q->cons_tail++;
>  	}
>
>  	return NULL;
>  }
>
> -static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
> +static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr,
> +				  struct xdp_umem *umem)
>  {
>  	if (q->cons_tail == q->cons_head) {
>  		smp_mb(); /* D, matches A */
> @@ -170,7 +207,7 @@ static inline u64 *xskq_peek_addr(struct xsk_queue 
> *q, u64 *addr)
>  		smp_rmb();
>  	}
>
> -	return xskq_validate_addr(q, addr);
> +	return xskq_validate_addr(q, addr, umem);
>  }
>
>  static inline void xskq_discard_addr(struct xsk_queue *q)
> @@ -229,8 +266,21 @@ static inline int xskq_reserve_addr(struct 
> xsk_queue *q)
>
>  /* Rx/Tx queue */
>
> -static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct 
> xdp_desc *d)
> +static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct 
> xdp_desc *d,
> +				      struct xdp_umem *umem)
>  {
> +	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNKS) {
> +		if (!xskq_is_valid_addr_unaligned(q, d->addr, d->len, umem))
> +			return false;
> +
> +		if (d->len > umem->chunk_size_nohr || d->options) {
> +			q->invalid_descs++;
> +			return false;
> +		}
> +
> +		return true;
> +	}
> +
>  	if (!xskq_is_valid_addr(q, d->addr))
>  		return false;
>
> @@ -244,14 +294,15 @@ static inline bool xskq_is_valid_desc(struct 
> xsk_queue *q, struct xdp_desc *d)
>  }
>
>  static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue 
> *q,
> -						  struct xdp_desc *desc)
> +						  struct xdp_desc *desc,
> +						  struct xdp_umem *umem)
>  {
>  	while (q->cons_tail != q->cons_head) {
>  		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
>  		unsigned int idx = q->cons_tail & q->ring_mask;
>
>  		*desc = READ_ONCE(ring->desc[idx]);
> -		if (xskq_is_valid_desc(q, desc))
> +		if (xskq_is_valid_desc(q, desc, umem))
>  			return desc;
>
>  		q->cons_tail++;
> @@ -261,7 +312,8 @@ static inline struct xdp_desc 
> *xskq_validate_desc(struct xsk_queue *q,
>  }
>
>  static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
> -					      struct xdp_desc *desc)
> +					      struct xdp_desc *desc,
> +					      struct xdp_umem *umem)
>  {
>  	if (q->cons_tail == q->cons_head) {
>  		smp_mb(); /* D, matches A */
> @@ -272,7 +324,7 @@ static inline struct xdp_desc 
> *xskq_peek_desc(struct xsk_queue *q,
>  		smp_rmb(); /* C, matches B */
>  	}
>
> -	return xskq_validate_desc(q, desc);
> +	return xskq_validate_desc(q, desc, umem);
>  }
>
>  static inline void xskq_discard_desc(struct xsk_queue *q)
> -- 
> 2.17.1

^ permalink raw reply

* [PATCH bpf-next v2 7/7] selftests/bpf: support FLOW_DISSECTOR_F_STOP_AT_ENCAP
From: Stanislav Fomichev @ 2019-07-25 15:33 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Song Liu, Petar Penkov
In-Reply-To: <20190725153342.3571-1-sdf@google.com>

Exit as soon as we found that packet is encapped when
FLOW_DISSECTOR_F_STOP_AT_ENCAP is passed.
Add appropriate selftest cases.

v2:
* Subtract sizeof(struct iphdr) from .iph_inner.tot_len (Willem de Bruijn)

Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/flow_dissector.c | 64 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  8 +++
 2 files changed, 72 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index ada032be6199..15265c7a90a3 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -41,6 +41,13 @@ struct ipv4_pkt {
 	struct tcphdr tcp;
 } __packed;
 
+struct ipip_pkt {
+	struct ethhdr eth;
+	struct iphdr iph;
+	struct iphdr iph_inner;
+	struct tcphdr tcp;
+} __packed;
+
 struct svlan_ipv4_pkt {
 	struct ethhdr eth;
 	__u16 vlan_tci;
@@ -82,6 +89,7 @@ struct test {
 	union {
 		struct ipv4_pkt ipv4;
 		struct svlan_ipv4_pkt svlan_ipv4;
+		struct ipip_pkt ipip;
 		struct ipv6_pkt ipv6;
 		struct ipv6_frag_pkt ipv6_frag;
 		struct dvlan_ipv6_pkt dvlan_ipv6;
@@ -303,6 +311,62 @@ struct test tests[] = {
 		},
 		.flags = FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
 	},
+	{
+		.name = "ipip-encap",
+		.pkt.ipip = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_IPIP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph_inner.ihl = 5,
+			.iph_inner.protocol = IPPROTO_TCP,
+			.iph_inner.tot_len =
+				__bpf_constant_htons(MAGIC_BYTES) -
+				sizeof(struct iphdr),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = 0,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr) +
+				sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_encap = true,
+			.sport = 80,
+			.dport = 8080,
+		},
+	},
+	{
+		.name = "ipip-no-encap",
+		.pkt.ipip = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_IPIP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph_inner.ihl = 5,
+			.iph_inner.protocol = IPPROTO_TCP,
+			.iph_inner.tot_len =
+				__bpf_constant_htons(MAGIC_BYTES) -
+				sizeof(struct iphdr),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = FLOW_DISSECTOR_F_STOP_AT_ENCAP,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_IPIP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_encap = true,
+		},
+		.flags = FLOW_DISSECTOR_F_STOP_AT_ENCAP,
+	},
 };
 
 static int create_tap(const char *ifname)
diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index 7d73b7bfe609..b6236cdf8564 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -167,9 +167,15 @@ static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
 		return export_flow_keys(keys, BPF_OK);
 	case IPPROTO_IPIP:
 		keys->is_encap = true;
+		if (keys->flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP)
+			return export_flow_keys(keys, BPF_OK);
+
 		return parse_eth_proto(skb, bpf_htons(ETH_P_IP));
 	case IPPROTO_IPV6:
 		keys->is_encap = true;
+		if (keys->flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP)
+			return export_flow_keys(keys, BPF_OK);
+
 		return parse_eth_proto(skb, bpf_htons(ETH_P_IPV6));
 	case IPPROTO_GRE:
 		gre = bpf_flow_dissect_get_header(skb, sizeof(*gre), &_gre);
@@ -189,6 +195,8 @@ static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
 			keys->thoff += 4; /* Step over sequence number */
 
 		keys->is_encap = true;
+		if (keys->flags & FLOW_DISSECTOR_F_STOP_AT_ENCAP)
+			return export_flow_keys(keys, BPF_OK);
 
 		if (gre->proto == bpf_htons(ETH_P_TEB)) {
 			eth = bpf_flow_dissect_get_header(skb, sizeof(*eth),
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next v2 6/7] bpf/flow_dissector: support ipv6 flow_label and FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL
From: Stanislav Fomichev @ 2019-07-25 15:33 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Song Liu, Petar Penkov
In-Reply-To: <20190725153342.3571-1-sdf@google.com>

Add support for exporting ipv6 flow label via bpf_flow_keys.
Export flow label from bpf_flow.c and also return early when
FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL is passed.

Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/uapi/linux/bpf.h                      |  1 +
 net/core/flow_dissector.c                     |  9 ++++
 tools/include/uapi/linux/bpf.h                |  1 +
 .../selftests/bpf/prog_tests/flow_dissector.c | 46 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/bpf_flow.c  | 10 ++++
 5 files changed, 67 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b4ad19bd6aa8..83b4150466af 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3533,6 +3533,7 @@ struct bpf_flow_keys {
 		};
 	};
 	__u32	flags;
+	__be32	flow_label;
 };
 
 struct bpf_func_info {
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index a74c4ed1b30d..bcdb863cad28 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -737,6 +737,7 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 	struct flow_dissector_key_basic *key_basic;
 	struct flow_dissector_key_addrs *key_addrs;
 	struct flow_dissector_key_ports *key_ports;
+	struct flow_dissector_key_tags *key_tags;
 
 	key_control = skb_flow_dissector_target(flow_dissector,
 						FLOW_DISSECTOR_KEY_CONTROL,
@@ -781,6 +782,14 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 		key_ports->src = flow_keys->sport;
 		key_ports->dst = flow_keys->dport;
 	}
+
+	if (dissector_uses_key(flow_dissector,
+			       FLOW_DISSECTOR_KEY_FLOW_LABEL)) {
+		key_tags = skb_flow_dissector_target(flow_dissector,
+						     FLOW_DISSECTOR_KEY_FLOW_LABEL,
+						     target_container);
+		key_tags->flow_label = ntohl(flow_keys->flow_label);
+	}
 }
 
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a0e1c891b56f..c26ca432b1b3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3530,6 +3530,7 @@ struct bpf_flow_keys {
 		};
 	};
 	__u32	flags;
+	__be32	flow_label;
 };
 
 struct bpf_func_info {
diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index f93a115db650..ada032be6199 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -20,6 +20,7 @@
 	      "is_encap=%u/%u "						\
 	      "ip_proto=0x%x/0x%x "					\
 	      "n_proto=0x%x/0x%x "					\
+	      "flow_label=0x%x/0x%x "					\
 	      "sport=%u/%u "						\
 	      "dport=%u/%u\n",						\
 	      got.nhoff, expected.nhoff,				\
@@ -30,6 +31,7 @@
 	      got.is_encap, expected.is_encap,				\
 	      got.ip_proto, expected.ip_proto,				\
 	      got.n_proto, expected.n_proto,				\
+	      got.flow_label, expected.flow_label,			\
 	      got.sport, expected.sport,				\
 	      got.dport, expected.dport)
 
@@ -257,6 +259,50 @@ struct test tests[] = {
 			.is_first_frag = true,
 		},
 	},
+	{
+		.name = "ipv6-flow-label",
+		.pkt.ipv6 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_TCP,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.flow_lbl = { 0xb, 0xee, 0xef },
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.sport = 80,
+			.dport = 8080,
+			.flow_label = __bpf_constant_htonl(0xbeeef),
+		},
+	},
+	{
+		.name = "ipv6-no-flow-label",
+		.pkt.ipv6 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_TCP,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.flow_lbl = { 0xb, 0xee, 0xef },
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.flow_label = __bpf_constant_htonl(0xbeeef),
+		},
+		.flags = FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL,
+	},
 };
 
 static int create_tap(const char *ifname)
diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index 0eabe5e57944..7d73b7bfe609 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -83,6 +83,12 @@ static __always_inline int export_flow_keys(struct bpf_flow_keys *keys,
 	return ret;
 }
 
+#define IPV6_FLOWLABEL_MASK		__bpf_constant_htonl(0x000FFFFF)
+static inline __be32 ip6_flowlabel(const struct ipv6hdr *hdr)
+{
+	return *(__be32 *)hdr & IPV6_FLOWLABEL_MASK;
+}
+
 static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb,
 							 __u16 hdr_size,
 							 void *buffer)
@@ -307,6 +313,10 @@ PROG(IPV6)(struct __sk_buff *skb)
 
 	keys->thoff += sizeof(struct ipv6hdr);
 	keys->ip_proto = ip6h->nexthdr;
+	keys->flow_label = ip6_flowlabel(ip6h);
+
+	if (keys->flags & FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL)
+		return export_flow_keys(keys, BPF_OK);
 
 	return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next v2 5/7] selftests/bpf: support FLOW_DISSECTOR_F_PARSE_1ST_FRAG
From: Stanislav Fomichev @ 2019-07-25 15:33 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Song Liu, Petar Penkov
In-Reply-To: <20190725153342.3571-1-sdf@google.com>

bpf_flow.c: exit early unless FLOW_DISSECTOR_F_PARSE_1ST_FRAG is passed
in flags. Also, set ip_proto earlier, this makes sure we have correct
value with fragmented packets.

Add selftest cases to test ipv4/ipv6 fragments and skip eth_get_headlen
tests that don't have FLOW_DISSECTOR_F_PARSE_1ST_FRAG flag.

eth_get_headlen calls flow dissector with
FLOW_DISSECTOR_F_PARSE_1ST_FRAG flag so we can't run tests that
have different set of input flags against it.

v2:
 * sefltests -> selftests (Willem de Bruijn)
 * Reword a comment about eth_get_headlen flags (Song Liu)

Acked-by: Willem de Bruijn <willemb@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/flow_dissector.c | 132 +++++++++++++++++-
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  28 +++-
 2 files changed, 153 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
index c938283ac232..f93a115db650 100644
--- a/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector.c
@@ -5,6 +5,10 @@
 #include <linux/if_tun.h>
 #include <sys/uio.h>
 
+#ifndef IP_MF
+#define IP_MF 0x2000
+#endif
+
 #define CHECK_FLOW_KEYS(desc, got, expected)				\
 	CHECK_ATTR(memcmp(&got, &expected, sizeof(got)) != 0,		\
 	      desc,							\
@@ -49,6 +53,18 @@ struct ipv6_pkt {
 	struct tcphdr tcp;
 } __packed;
 
+struct ipv6_frag_pkt {
+	struct ethhdr eth;
+	struct ipv6hdr iph;
+	struct frag_hdr {
+		__u8 nexthdr;
+		__u8 reserved;
+		__be16 frag_off;
+		__be32 identification;
+	} ipf;
+	struct tcphdr tcp;
+} __packed;
+
 struct dvlan_ipv6_pkt {
 	struct ethhdr eth;
 	__u16 vlan_tci;
@@ -65,9 +81,11 @@ struct test {
 		struct ipv4_pkt ipv4;
 		struct svlan_ipv4_pkt svlan_ipv4;
 		struct ipv6_pkt ipv6;
+		struct ipv6_frag_pkt ipv6_frag;
 		struct dvlan_ipv6_pkt dvlan_ipv6;
 	} pkt;
 	struct bpf_flow_keys keys;
+	__u32 flags;
 };
 
 #define VLAN_HLEN	4
@@ -143,6 +161,102 @@ struct test tests[] = {
 			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
 		},
 	},
+	{
+		.name = "ipv4-frag",
+		.pkt.ipv4 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_TCP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.frag_off = __bpf_constant_htons(IP_MF),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_frag = true,
+			.is_first_frag = true,
+			.sport = 80,
+			.dport = 8080,
+		},
+		.flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+	},
+	{
+		.name = "ipv4-no-frag",
+		.pkt.ipv4 = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
+			.iph.ihl = 5,
+			.iph.protocol = IPPROTO_TCP,
+			.iph.tot_len = __bpf_constant_htons(MAGIC_BYTES),
+			.iph.frag_off = __bpf_constant_htons(IP_MF),
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct iphdr),
+			.addr_proto = ETH_P_IP,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IP),
+			.is_frag = true,
+			.is_first_frag = true,
+		},
+	},
+	{
+		.name = "ipv6-frag",
+		.pkt.ipv6_frag = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_FRAGMENT,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.ipf.nexthdr = IPPROTO_TCP,
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr) +
+				sizeof(struct frag_hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.is_frag = true,
+			.is_first_frag = true,
+			.sport = 80,
+			.dport = 8080,
+		},
+		.flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG,
+	},
+	{
+		.name = "ipv6-no-frag",
+		.pkt.ipv6_frag = {
+			.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.iph.nexthdr = IPPROTO_FRAGMENT,
+			.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+			.ipf.nexthdr = IPPROTO_TCP,
+			.tcp.doff = 5,
+			.tcp.source = 80,
+			.tcp.dest = 8080,
+		},
+		.keys = {
+			.nhoff = ETH_HLEN,
+			.thoff = ETH_HLEN + sizeof(struct ipv6hdr) +
+				sizeof(struct frag_hdr),
+			.addr_proto = ETH_P_IPV6,
+			.ip_proto = IPPROTO_TCP,
+			.n_proto = __bpf_constant_htons(ETH_P_IPV6),
+			.is_frag = true,
+			.is_first_frag = true,
+		},
+	},
 };
 
 static int create_tap(const char *ifname)
@@ -225,6 +339,13 @@ void test_flow_dissector(void)
 			.data_size_in = sizeof(tests[i].pkt),
 			.data_out = &flow_keys,
 		};
+		static struct bpf_flow_keys ctx = {};
+
+		if (tests[i].flags) {
+			tattr.ctx_in = &ctx;
+			tattr.ctx_size_in = sizeof(ctx);
+			ctx.flags = tests[i].flags;
+		}
 
 		err = bpf_prog_test_run_xattr(&tattr);
 		CHECK_ATTR(tattr.data_size_out != sizeof(flow_keys) ||
@@ -251,10 +372,19 @@ void test_flow_dissector(void)
 	CHECK(err, "ifup", "err %d errno %d\n", err, errno);
 
 	for (i = 0; i < ARRAY_SIZE(tests); i++) {
-		struct bpf_flow_keys flow_keys = {};
+		/* Keep in sync with 'flags' from eth_get_headlen. */
+		__u32 eth_get_headlen_flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG;
 		struct bpf_prog_test_run_attr tattr = {};
+		struct bpf_flow_keys flow_keys = {};
 		__u32 key = 0;
 
+		/* For skb-less case we can't pass input flags; run
+		 * only the tests that have a matching set of flags.
+		 */
+
+		if (tests[i].flags != eth_get_headlen_flags)
+			continue;
+
 		err = tx_tap(tap_fd, &tests[i].pkt, sizeof(tests[i].pkt));
 		CHECK(err < 0, "tx_tap", "err %d errno %d\n", err, errno);
 
diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index 5ae485a6af3f..0eabe5e57944 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -153,7 +153,6 @@ static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
 	struct tcphdr *tcp, _tcp;
 	struct udphdr *udp, _udp;
 
-	keys->ip_proto = proto;
 	switch (proto) {
 	case IPPROTO_ICMP:
 		icmp = bpf_flow_dissect_get_header(skb, sizeof(*icmp), &_icmp);
@@ -231,7 +230,6 @@ static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr)
 {
 	struct bpf_flow_keys *keys = skb->flow_keys;
 
-	keys->ip_proto = nexthdr;
 	switch (nexthdr) {
 	case IPPROTO_HOPOPTS:
 	case IPPROTO_DSTOPTS:
@@ -266,6 +264,7 @@ PROG(IP)(struct __sk_buff *skb)
 	keys->addr_proto = ETH_P_IP;
 	keys->ipv4_src = iph->saddr;
 	keys->ipv4_dst = iph->daddr;
+	keys->ip_proto = iph->protocol;
 
 	keys->thoff += iph->ihl << 2;
 	if (data + keys->thoff > data_end)
@@ -273,13 +272,19 @@ PROG(IP)(struct __sk_buff *skb)
 
 	if (iph->frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
 		keys->is_frag = true;
-		if (iph->frag_off & bpf_htons(IP_OFFSET))
+		if (iph->frag_off & bpf_htons(IP_OFFSET)) {
 			/* From second fragment on, packets do not have headers
 			 * we can parse.
 			 */
 			done = true;
-		else
+		} else {
 			keys->is_first_frag = true;
+			/* No need to parse fragmented packet unless
+			 * explicitly asked for.
+			 */
+			if (!(keys->flags & FLOW_DISSECTOR_F_PARSE_1ST_FRAG))
+				done = true;
+		}
 	}
 
 	if (done)
@@ -301,6 +306,7 @@ PROG(IPV6)(struct __sk_buff *skb)
 	memcpy(&keys->ipv6_src, &ip6h->saddr, 2*sizeof(ip6h->saddr));
 
 	keys->thoff += sizeof(struct ipv6hdr);
+	keys->ip_proto = ip6h->nexthdr;
 
 	return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
@@ -317,7 +323,8 @@ PROG(IPV6OP)(struct __sk_buff *skb)
 	/* hlen is in 8-octets and does not include the first 8 bytes
 	 * of the header
 	 */
-	skb->flow_keys->thoff += (1 + ip6h->hdrlen) << 3;
+	keys->thoff += (1 + ip6h->hdrlen) << 3;
+	keys->ip_proto = ip6h->nexthdr;
 
 	return parse_ipv6_proto(skb, ip6h->nexthdr);
 }
@@ -333,9 +340,18 @@ PROG(IPV6FR)(struct __sk_buff *skb)
 
 	keys->thoff += sizeof(*fragh);
 	keys->is_frag = true;
-	if (!(fragh->frag_off & bpf_htons(IP6_OFFSET)))
+	keys->ip_proto = fragh->nexthdr;
+
+	if (!(fragh->frag_off & bpf_htons(IP6_OFFSET))) {
 		keys->is_first_frag = true;
 
+		/* No need to parse fragmented packet unless
+		 * explicitly asked for.
+		 */
+		if (!(keys->flags & FLOW_DISSECTOR_F_PARSE_1ST_FRAG))
+			return export_flow_keys(keys, BPF_OK);
+	}
+
 	return parse_ipv6_proto(skb, fragh->nexthdr);
 }
 
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next v2 4/7] tools/bpf: sync bpf_flow_keys flags
From: Stanislav Fomichev @ 2019-07-25 15:33 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Song Liu, Petar Penkov
In-Reply-To: <20190725153342.3571-1-sdf@google.com>

Export bpf_flow_keys flags to tools/libbpf/selftests.

Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/include/uapi/linux/bpf.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e455018da65..a0e1c891b56f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3504,6 +3504,10 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG		(1U << 0)
+#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	(1U << 1)
+#define FLOW_DISSECTOR_F_STOP_AT_ENCAP		(1U << 2)
+
 struct bpf_flow_keys {
 	__u16	nhoff;
 	__u16	thoff;
@@ -3525,6 +3529,7 @@ struct bpf_flow_keys {
 			__u32	ipv6_dst[4];	/* in6_addr; network order */
 		};
 	};
+	__u32	flags;
 };
 
 struct bpf_func_info {
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next v2 3/7] bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN
From: Stanislav Fomichev @ 2019-07-25 15:33 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Song Liu, Petar Penkov
In-Reply-To: <20190725153342.3571-1-sdf@google.com>

This will allow us to write tests for those flags.

v2:
* Swap kfree(data) and kfree(user_ctx) (Song Liu)

Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/bpf/test_run.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 4e41d15a1098..1153bbcdff72 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -377,6 +377,22 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	return ret;
 }
 
+static int verify_user_bpf_flow_keys(struct bpf_flow_keys *ctx)
+{
+	/* make sure the fields we don't use are zeroed */
+	if (!range_is_zero(ctx, 0, offsetof(struct bpf_flow_keys, flags)))
+		return -EINVAL;
+
+	/* flags is allowed */
+
+	if (!range_is_zero(ctx, offsetof(struct bpf_flow_keys, flags) +
+			   FIELD_SIZEOF(struct bpf_flow_keys, flags),
+			   sizeof(struct bpf_flow_keys)))
+		return -EINVAL;
+
+	return 0;
+}
+
 int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 				     const union bpf_attr *kattr,
 				     union bpf_attr __user *uattr)
@@ -384,9 +400,11 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	u32 size = kattr->test.data_size_in;
 	struct bpf_flow_dissector ctx = {};
 	u32 repeat = kattr->test.repeat;
+	struct bpf_flow_keys *user_ctx;
 	struct bpf_flow_keys flow_keys;
 	u64 time_start, time_spent = 0;
 	const struct ethhdr *eth;
+	unsigned int flags = 0;
 	u32 retval, duration;
 	void *data;
 	int ret;
@@ -395,9 +413,6 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
 		return -EINVAL;
 
-	if (kattr->test.ctx_in || kattr->test.ctx_out)
-		return -EINVAL;
-
 	if (size < ETH_HLEN)
 		return -EINVAL;
 
@@ -410,6 +425,18 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	if (!repeat)
 		repeat = 1;
 
+	user_ctx = bpf_ctx_init(kattr, sizeof(struct bpf_flow_keys));
+	if (IS_ERR(user_ctx)) {
+		kfree(data);
+		return PTR_ERR(user_ctx);
+	}
+	if (user_ctx) {
+		ret = verify_user_bpf_flow_keys(user_ctx);
+		if (ret)
+			goto out;
+		flags = user_ctx->flags;
+	}
+
 	ctx.flow_keys = &flow_keys;
 	ctx.data = data;
 	ctx.data_end = (__u8 *)data + size;
@@ -419,7 +446,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	time_start = ktime_get_ns();
 	for (i = 0; i < repeat; i++) {
 		retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
-					  size, 0);
+					  size, flags);
 
 		if (signal_pending(current)) {
 			preempt_enable();
@@ -450,8 +477,12 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 
 	ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
 			      retval, duration);
+	if (!ret)
+		ret = bpf_ctx_finish(kattr, uattr, user_ctx,
+				     sizeof(struct bpf_flow_keys));
 
 out:
+	kfree(user_ctx);
 	kfree(data);
 	return ret;
 }
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next v2 2/7] bpf/flow_dissector: document flags
From: Stanislav Fomichev @ 2019-07-25 15:33 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Song Liu, Petar Penkov
In-Reply-To: <20190725153342.3571-1-sdf@google.com>

Describe what each input flag does and who uses it.

Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/bpf/prog_flow_dissector.rst | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/Documentation/bpf/prog_flow_dissector.rst b/Documentation/bpf/prog_flow_dissector.rst
index ed343abe541e..0f3f380b2ce4 100644
--- a/Documentation/bpf/prog_flow_dissector.rst
+++ b/Documentation/bpf/prog_flow_dissector.rst
@@ -26,6 +26,7 @@ and output arguments.
   * ``nhoff`` - initial offset of the networking header
   * ``thoff`` - initial offset of the transport header, initialized to nhoff
   * ``n_proto`` - L3 protocol type, parsed out of L2 header
+  * ``flags`` - optional flags
 
 Flow dissector BPF program should fill out the rest of the ``struct
 bpf_flow_keys`` fields. Input arguments ``nhoff/thoff/n_proto`` should be
@@ -101,6 +102,23 @@ can be called for both cases and would have to be written carefully to
 handle both cases.
 
 
+Flags
+=====
+
+``flow_keys->flags`` might contain optional input flags that work as follows:
+
+* ``FLOW_DISSECTOR_F_PARSE_1ST_FRAG`` - tells BPF flow dissector to continue
+  parsing first fragment; the default expected behavior is that flow dissector
+  returns as soon as it finds out that the packet is fragmented;
+  used by ``eth_get_headlen`` to estimate length of all headers for GRO.
+* ``FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL`` - tells BPF flow dissector to stop
+  parsing as soon as it reaches IPv6 flow label; used by ``___skb_get_hash``
+  and ``__skb_get_hash_symmetric`` to get flow hash.
+* ``FLOW_DISSECTOR_F_STOP_AT_ENCAP`` - tells BPF flow dissector to stop
+  parsing as soon as it reaches encapsulated headers; used by routing
+  infrastructure.
+
+
 Reference Implementation
 ========================
 
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next v2 1/7] bpf/flow_dissector: pass input flags to BPF flow dissector program
From: Stanislav Fomichev @ 2019-07-25 15:33 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Willem de Bruijn,
	Song Liu, Petar Penkov
In-Reply-To: <20190725153342.3571-1-sdf@google.com>

C flow dissector supports input flags that tell it to customize parsing
by either stopping early or trying to parse as deep as possible. Pass
those flags to the BPF flow dissector so it can make the same
decisions. In the next commits I'll add support for those flags to
our reference bpf_flow.c

Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/skbuff.h       | 2 +-
 include/net/flow_dissector.h | 4 ----
 include/uapi/linux/bpf.h     | 5 +++++
 net/bpf/test_run.c           | 2 +-
 net/core/flow_dissector.c    | 5 +++--
 5 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 718742b1c505..9b7a8038beec 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1271,7 +1271,7 @@ static inline int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
 
 struct bpf_flow_dissector;
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
-		      __be16 proto, int nhoff, int hlen);
+		      __be16 proto, int nhoff, int hlen, unsigned int flags);
 
 bool __skb_flow_dissect(const struct net *net,
 			const struct sk_buff *skb,
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 90bd210be060..3e2642587b76 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -253,10 +253,6 @@ enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_MAX,
 };
 
-#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG		BIT(0)
-#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	BIT(1)
-#define FLOW_DISSECTOR_F_STOP_AT_ENCAP		BIT(2)
-
 struct flow_dissector_key {
 	enum flow_dissector_key_id key_id;
 	size_t offset; /* offset of struct flow_dissector_key_*
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc..b4ad19bd6aa8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3507,6 +3507,10 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+#define FLOW_DISSECTOR_F_PARSE_1ST_FRAG		(1U << 0)
+#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	(1U << 1)
+#define FLOW_DISSECTOR_F_STOP_AT_ENCAP		(1U << 2)
+
 struct bpf_flow_keys {
 	__u16	nhoff;
 	__u16	thoff;
@@ -3528,6 +3532,7 @@ struct bpf_flow_keys {
 			__u32	ipv6_dst[4];	/* in6_addr; network order */
 		};
 	};
+	__u32	flags;
 };
 
 struct bpf_func_info {
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 80e6f3a6864d..4e41d15a1098 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -419,7 +419,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	time_start = ktime_get_ns();
 	for (i = 0; i < repeat; i++) {
 		retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN,
-					  size);
+					  size, 0);
 
 		if (signal_pending(current)) {
 			preempt_enable();
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 3e6fedb57bc1..a74c4ed1b30d 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -784,7 +784,7 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 }
 
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
-		      __be16 proto, int nhoff, int hlen)
+		      __be16 proto, int nhoff, int hlen, unsigned int flags)
 {
 	struct bpf_flow_keys *flow_keys = ctx->flow_keys;
 	u32 result;
@@ -794,6 +794,7 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
 	flow_keys->n_proto = proto;
 	flow_keys->nhoff = nhoff;
 	flow_keys->thoff = flow_keys->nhoff;
+	flow_keys->flags = flags;
 
 	preempt_disable();
 	result = BPF_PROG_RUN(prog, ctx);
@@ -914,7 +915,7 @@ bool __skb_flow_dissect(const struct net *net,
 			}
 
 			ret = bpf_flow_dissect(attached, &ctx, n_proto, nhoff,
-					       hlen);
+					       hlen, flags);
 			__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
 						 target_container);
 			rcu_read_unlock();
-- 
2.22.0.657.g960e92d24f-goog


^ permalink raw reply related

* [PATCH bpf-next v2 0/7] bpf/flow_dissector: support input flags
From: Stanislav Fomichev @ 2019-07-25 15:33 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, Stanislav Fomichev, Song Liu,
	Willem de Bruijn, Petar Penkov

C flow dissector supports input flags that tell it to customize parsing
by either stopping early or trying to parse as deep as possible.
BPF flow dissector always parses as deep as possible which is sub-optimal.
Pass input flags to the BPF flow dissector as well so it can make the same
decisions.

Series outline:
* remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
* export FLOW_DISSECTOR_F_XXX flags as uapi and pass them to BPF
  flow dissector
* add documentation for the export flags
* support input flags in BPF_PROG_TEST_RUN via ctx_{in,out}
* sync uapi to tools
* support FLOW_DISSECTOR_F_PARSE_1ST_FRAG in selftest
* support FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL in kernel and selftest
* support FLOW_DISSECTOR_F_STOP_AT_ENCAP in selftest

Pros:
* makes BPF flow dissector faster by avoiding burning extra cycles
* existing BPF progs continue to work by ignoring the flags and always
  parsing as deep as possible

Cons:
* new UAPI which we need to support (OTOH, if we need to deprecate some
  flags, we can just stop setting them upon calling BPF programs)

Some numbers (with .repeat = 4000000 in test_flow_dissector):
        test_flow_dissector:PASS:ipv4-frag 35 nsec
        test_flow_dissector:PASS:ipv4-frag 35 nsec
        test_flow_dissector:PASS:ipv4-no-frag 32 nsec
        test_flow_dissector:PASS:ipv4-no-frag 32 nsec

        test_flow_dissector:PASS:ipv6-frag 39 nsec
        test_flow_dissector:PASS:ipv6-frag 39 nsec
        test_flow_dissector:PASS:ipv6-no-frag 36 nsec
        test_flow_dissector:PASS:ipv6-no-frag 36 nsec

        test_flow_dissector:PASS:ipv6-flow-label 36 nsec
        test_flow_dissector:PASS:ipv6-flow-label 36 nsec
        test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec
        test_flow_dissector:PASS:ipv6-no-flow-label 33 nsec

        test_flow_dissector:PASS:ipip-encap 38 nsec
        test_flow_dissector:PASS:ipip-encap 38 nsec
        test_flow_dissector:PASS:ipip-no-encap 32 nsec
        test_flow_dissector:PASS:ipip-no-encap 32 nsec

The improvement is around 10%, but it's in a tight cache-hot
BPF_PROG_TEST_RUN loop.

Cc: Song Liu <songliubraving@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Petar Penkov <ppenkov@google.com>

Stanislav Fomichev (7):
  bpf/flow_dissector: pass input flags to BPF flow dissector program
  bpf/flow_dissector: document flags
  bpf/flow_dissector: support flags in BPF_PROG_TEST_RUN
  tools/bpf: sync bpf_flow_keys flags
  selftests/bpf: support FLOW_DISSECTOR_F_PARSE_1ST_FRAG
  bpf/flow_dissector: support ipv6 flow_label and
    FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL
  selftests/bpf: support FLOW_DISSECTOR_F_STOP_AT_ENCAP

 Documentation/bpf/prog_flow_dissector.rst     |  18 ++
 include/linux/skbuff.h                        |   2 +-
 include/net/flow_dissector.h                  |   4 -
 include/uapi/linux/bpf.h                      |   6 +
 net/bpf/test_run.c                            |  39 ++-
 net/core/flow_dissector.c                     |  14 +-
 tools/include/uapi/linux/bpf.h                |   6 +
 .../selftests/bpf/prog_tests/flow_dissector.c | 242 +++++++++++++++++-
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  46 +++-
 9 files changed, 359 insertions(+), 18 deletions(-)

-- 
2.22.0.657.g960e92d24f-goog

^ permalink raw reply

* RE: [PATCH net-next 3/3] net: stmmac: Introducing support for Page Pool
From: Jose Abreu @ 2019-07-25 15:12 UTC (permalink / raw)
  To: Jon Hunter, Jose Abreu, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com,
	linux-arm-kernel@lists.infradead.org
  Cc: Joao Pinto, David S . Miller, Giuseppe Cavallaro,
	Alexandre Torgue, Maxime Coquelin, Maxime Ripard, Chen-Yu Tsai,
	Robin Murphy, linux-tegra
In-Reply-To: <9e695f33-fd9f-a910-0891-2b63bd75e082@nvidia.com>

From: Jon Hunter <jonathanh@nvidia.com>
Date: Jul/25/2019, 15:25:59 (UTC+00:00)

> 
> On 25/07/2019 14:26, Jose Abreu wrote:
> 
> ...
> 
> > Well, I wasn't expecting that :/
> > 
> > Per documentation of barriers I think we should set descriptor fields 
> > and then barrier and finally ownership to HW so that remaining fields 
> > are coherent before owner is set.
> > 
> > Anyway, can you also add a dma_rmb() after the call to 
> > stmmac_rx_status() ?
> 
> Yes. I removed the debug print added the barrier, but that did not help.

So, I was finally able to setup NFS using your replicated setup and I 
can't see the issue :(

The only difference I have from yours is that I'm using TCP in NFS 
whilst you (I believe from the logs), use UDP.

You do have flow control active right ? And your HW FIFO size is >= 4k ?

---
Thanks,
Jose Miguel Abreu

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox