Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] bnxt_en: Use refcount_t for refcount
From: Michael Chan @ 2019-07-31 17:48 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Chuhong Yuan, David S . Miller, Network Development, linux-kernel
In-Reply-To: <CA+FuTScqD4bMpm6n13ETFVEvSKnk_rRUzspzs9HB6B5Un101Dw@mail.gmail.com>

On Wed, Jul 31, 2019 at 9:06 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Jul 31, 2019 at 8:22 AM Chuhong Yuan <hslester96@gmail.com> wrote:
> >
> > refcount_t is better for reference counters since its
> > implementation can prevent overflows.
> > So convert atomic_t ref counters to refcount_t.
> >
> > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > ---
> >  drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c | 8 ++++----
> >  drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.h | 2 +-
> >  2 files changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
> > index fc77caf0a076..eb7ed34639e2 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
> > @@ -49,7 +49,7 @@ static int bnxt_register_dev(struct bnxt_en_dev *edev, int ulp_id,
> >                         return -ENOMEM;
> >         }
> >
> > -       atomic_set(&ulp->ref_count, 0);
> > +       refcount_set(&ulp->ref_count, 0);
>
> One feature of refcount_t is that it warns on refcount_inc from 0 to
> detect possible use-after_free. It appears that that can trigger here?
>

I think that's right.  We need to change the driver to start counting
from 1 instead of 0 if we convert to refcount.

^ permalink raw reply

* Re: [PATCH net v3] ipvs: Improve robustness to the ipvs sysctl
From: Julian Anastasov @ 2019-07-31 17:53 UTC (permalink / raw)
  To: hujunwei
  Cc: wensong, horms, pablo, kadlec, Florian Westphal, davem,
	Florian Westphal, netdev@vger.kernel.org, lvs-devel,
	netfilter-devel, coreteam, Mingfangsen, wangxiaogang3, xuhanbing
In-Reply-To: <5fd55d18-f4e2-a6b4-5c54-db76c05be5df@huawei.com>


	Hello,

On Thu, 1 Aug 2019, hujunwei wrote:

> From: Junwei Hu <hujunwei4@huawei.com>
> 
> The ipvs module parse the user buffer and save it to sysctl,
> then check if the value is valid. invalid value occurs
> over a period of time.
> Here, I add a variable, struct ctl_table tmp, used to read
> the value from the user buffer, and save only when it is valid.
> I delete proc_do_sync_mode and use extra1/2 in table for the
> proc_dointvec_minmax call.
> 
> Fixes: f73181c8288f ("ipvs: add support for sync threads")
> Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
> Acked-by: Julian Anastasov <ja@ssi.bg>

	Yep, Acked-by: Julian Anastasov <ja@ssi.bg>

> ---
> V1->V2:
> - delete proc_do_sync_mode and use proc_dointvec_minmax call.
> V2->V3:
> - update git version
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 69 +++++++++++++++++-----------------
>  1 file changed, 35 insertions(+), 34 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 060565e7d227..72189559a1cd 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -1737,12 +1737,18 @@ proc_do_defense_mode(struct ctl_table *table, int write,
>  	int val = *valp;
>  	int rc;
> 
> -	rc = proc_dointvec(table, write, buffer, lenp, ppos);
> +	struct ctl_table tmp = {
> +		.data = &val,
> +		.maxlen = sizeof(int),
> +		.mode = table->mode,
> +	};
> +
> +	rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
>  	if (write && (*valp != val)) {
> -		if ((*valp < 0) || (*valp > 3)) {
> -			/* Restore the correct value */
> -			*valp = val;
> +		if (val < 0 || val > 3) {
> +			rc = -EINVAL;
>  		} else {
> +			*valp = val;
>  			update_defense_level(ipvs);
>  		}
>  	}
> @@ -1756,33 +1762,20 @@ proc_do_sync_threshold(struct ctl_table *table, int write,
>  	int *valp = table->data;
>  	int val[2];
>  	int rc;
> +	struct ctl_table tmp = {
> +		.data = &val,
> +		.maxlen = table->maxlen,
> +		.mode = table->mode,
> +	};
> 
> -	/* backup the value first */
>  	memcpy(val, valp, sizeof(val));
> -
> -	rc = proc_dointvec(table, write, buffer, lenp, ppos);
> -	if (write && (valp[0] < 0 || valp[1] < 0 ||
> -	    (valp[0] >= valp[1] && valp[1]))) {
> -		/* Restore the correct value */
> -		memcpy(valp, val, sizeof(val));
> -	}
> -	return rc;
> -}
> -
> -static int
> -proc_do_sync_mode(struct ctl_table *table, int write,
> -		     void __user *buffer, size_t *lenp, loff_t *ppos)
> -{
> -	int *valp = table->data;
> -	int val = *valp;
> -	int rc;
> -
> -	rc = proc_dointvec(table, write, buffer, lenp, ppos);
> -	if (write && (*valp != val)) {
> -		if ((*valp < 0) || (*valp > 1)) {
> -			/* Restore the correct value */
> -			*valp = val;
> -		}
> +	rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +	if (write) {
> +		if (val[0] < 0 || val[1] < 0 ||
> +		    (val[0] >= val[1] && val[1]))
> +			rc = -EINVAL;
> +		else
> +			memcpy(valp, val, sizeof(val));
>  	}
>  	return rc;
>  }
> @@ -1795,12 +1788,18 @@ proc_do_sync_ports(struct ctl_table *table, int write,
>  	int val = *valp;
>  	int rc;
> 
> -	rc = proc_dointvec(table, write, buffer, lenp, ppos);
> +	struct ctl_table tmp = {
> +		.data = &val,
> +		.maxlen = sizeof(int),
> +		.mode = table->mode,
> +	};
> +
> +	rc = proc_dointvec(&tmp, write, buffer, lenp, ppos);
>  	if (write && (*valp != val)) {
> -		if (*valp < 1 || !is_power_of_2(*valp)) {
> -			/* Restore the correct value */
> +		if (val < 1 || !is_power_of_2(val))
> +			rc = -EINVAL;
> +		else
>  			*valp = val;
> -		}
>  	}
>  	return rc;
>  }
> @@ -1860,7 +1859,9 @@ static struct ctl_table vs_vars[] = {
>  		.procname	= "sync_version",
>  		.maxlen		= sizeof(int),
>  		.mode		= 0644,
> -		.proc_handler	= proc_do_sync_mode,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
>  	},
>  	{
>  		.procname	= "sync_ports",
> -- 
> 2.21.GIT

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH 2/2] cnic: Use refcount_t for refcount
From: Michael Chan @ 2019-07-31 17:58 UTC (permalink / raw)
  To: Chuhong Yuan; +Cc: David S . Miller, Netdev, open list
In-Reply-To: <20190731122224.1003-1-hslester96@gmail.com>

On Wed, Jul 31, 2019 at 5:22 AM Chuhong Yuan <hslester96@gmail.com> wrote:

>  static void cnic_ctx_wr(struct cnic_dev *dev, u32 cid_addr, u32 off, u32 val)
> @@ -494,7 +494,7 @@ int cnic_register_driver(int ulp_type, struct cnic_ulp_ops *ulp_ops)
>         }
>         read_unlock(&cnic_dev_lock);
>
> -       atomic_set(&ulp_ops->ref_count, 0);
> +       refcount_set(&ulp_ops->ref_count, 0);
>         rcu_assign_pointer(cnic_ulp_tbl[ulp_type], ulp_ops);
>         mutex_unlock(&cnic_lock);
>

Willem's comment applies here too.  The driver needs to be modified to
count from 1 to use refcount_t.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next v2 1/4] dt-bindings: net: Add aspeed,ast2600-mdio binding
From: Rob Herring @ 2019-07-31 18:00 UTC (permalink / raw)
  To: Andrew Jeffery
  Cc: netdev, David Miller, Mark Rutland, Joel Stanley, Andrew Lunn,
	Florian Fainelli, Heiner Kallweit, devicetree,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	linux-aspeed, linux-kernel@vger.kernel.org
In-Reply-To: <20190731053959.16293-2-andrew@aj.id.au>

On Tue, Jul 30, 2019 at 11:39 PM Andrew Jeffery <andrew@aj.id.au> wrote:
>
> The AST2600 splits out the MDIO bus controller from the MAC into its own
> IP block and rearranges the register layout. Add a new binding to
> describe the new hardware.
>
> Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
>
> ---
> v2:
> * aspeed: Utilise mdio.yaml
> * aspeed: Drop status from example
> ---
>  .../bindings/net/aspeed,ast2600-mdio.yaml     | 45 +++++++++++++++++++
>  1 file changed, 45 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/aspeed,ast2600-mdio.yaml

Reviewed-by: Rob Herring <robh@kernel.org>

^ permalink raw reply

* Re: [PATCH bpf-next v4 07/11] mlx5e: modify driver for handling offsets
From: Jonathan Lemon @ 2019-07-31 18:10 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190730085400.10376-8-kevin.laatz@intel.com>



On 30 Jul 2019, at 1:53, Kevin Laatz wrote:

> With the addition of the unaligned chunks option, we need to make sure 
> we
> handle the offsets accordingly based on the mode we are currently 
> running
> in. This patch modifies the driver to appropriately mask the address 
> for
> each case.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>
> ---
> v3:
>   - Use new helper function to handle offset
>
> v4:
>   - fixed headroom addition to handle. Using 
> xsk_umem_adjust_headroom()
>     now.
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c    | 8 ++++++--
>  drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 3 ++-
>  2 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index b0b982cf69bb..d5245893d2c8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -122,6 +122,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct 
> mlx5e_dma_info *di,
>  		      void *va, u16 *rx_headroom, u32 *len, bool xsk)
>  {
>  	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
> +	struct xdp_umem *umem = rq->umem;
>  	struct xdp_buff xdp;
>  	u32 act;
>  	int err;
> @@ -138,8 +139,11 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct 
> mlx5e_dma_info *di,
>  	xdp.rxq = &rq->xdp_rxq;
>
>  	act = bpf_prog_run_xdp(prog, &xdp);
> -	if (xsk)
> -		xdp.handle += xdp.data - xdp.data_hard_start;
> +	if (xsk) {
> +		u64 off = xdp.data - xdp.data_hard_start;
> +
> +		xdp.handle = xsk_umem_handle_offset(umem, xdp.handle, off);

Shouldn't this be xdp_umem_adjust_offset()?


> +	}
>  	switch (act) {
>  	case XDP_PASS:
>  		*rx_headroom = xdp.data - xdp.data_hard_start;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> index 6a55573ec8f2..7c49a66d28c9 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> @@ -24,7 +24,8 @@ int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
>  	if (!xsk_umem_peek_addr_rq(umem, &handle))
>  		return -ENOMEM;
>
> -	dma_info->xsk.handle = handle + rq->buff.umem_headroom;
> +	dma_info->xsk.handle = xsk_umem_adjust_offset(umem, handle,
> +						      rq->buff.umem_headroom);
>  	dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);
>
>  	/* No need to add headroom to the DMA address. In striding RQ case, 
> we
> -- 
> 2.17.1

^ permalink raw reply

* Re: [PATCH bpf-next v4 04/11] xsk: add support to allow unaligned chunk placement
From: Jonathan Lemon @ 2019-07-31 18:11 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190730085400.10376-5-kevin.laatz@intel.com>



On 30 Jul 2019, at 1:53, Kevin Laatz wrote:

> Currently, addresses are chunk size aligned. This means, we are very
> restricted in terms of where we can place chunk within the umem. For
> example, if we have a chunk size of 2k, then our chunks can only be 
> placed
> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
>
> This patch introduces the ability to use unaligned chunks. With these
> changes, we are no longer bound to having to place chunks at a 2k (or
> whatever your chunk size is) interval. Since we are no longer dealing 
> with
> aligned chunks, they can now cross page boundaries. Checks for page
> contiguity have been added in order to keep track of which pages are
> followed by a physically contiguous page.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>
> ---
> v2:
>   - Add checks for the flags coming from userspace
>   - Fix how we get chunk_size in xsk_diag.c
>   - Add defines for masking the new descriptor format
>   - Modified the rx functions to use new descriptor format
>   - Modified the tx functions to use new descriptor format
>
> v3:
>   - Add helper function to do address/offset masking/addition
>
> v4:
>   - fixed page_start calculation in __xsk_rcv_memcpy().
>   - move offset handling to the xdp_umem_get_* functions
>   - modified the len field in xdp_umem_reg struct. We now use 16 bits 
> from
>     this for the flags field.
>   - removed next_pg_contig field from xdp_umem_page struct. Using low 
> 12
>     bits of addr to store flags instead.
>   - other minor changes based on review comments
> ---
>  include/net/xdp_sock.h      | 40 ++++++++++++++++++-
>  include/uapi/linux/if_xdp.h | 14 ++++++-
>  net/xdp/xdp_umem.c          | 18 ++++++---
>  net/xdp/xsk.c               | 79 
> +++++++++++++++++++++++++++++--------
>  net/xdp/xsk_diag.c          |  2 +-
>  net/xdp/xsk_queue.h         | 69 ++++++++++++++++++++++++++++----
>  6 files changed, 188 insertions(+), 34 deletions(-)
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 69796d264f06..a755e8ab6cac 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -16,6 +16,13 @@
>  struct net_device;
>  struct xsk_queue;
>
> +/* Masks for xdp_umem_page flags.
> + * The low 12-bits of the addr will be 0 since this is the page 
> address, so we
> + * can use them for flags.
> + */
> +#define XSK_NEXT_PG_CONTIG_SHIFT 0
> +#define XSK_NEXT_PG_CONTIG_MASK (1ULL << XSK_NEXT_PG_CONTIG_SHIFT)
> +
>  struct xdp_umem_page {
>  	void *addr;
>  	dma_addr_t dma;
> @@ -48,6 +55,7 @@ struct xdp_umem {
>  	bool zc;
>  	spinlock_t xsk_list_lock;
>  	struct list_head xsk_list;
> +	u16 flags;
>  };
>
>  struct xdp_sock {
> @@ -98,12 +106,21 @@ struct xdp_umem *xdp_get_umem_from_qid(struct 
> net_device *dev, u16 queue_id);
>
>  static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 
> addr)
>  {
> -	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 
> 1));
> +	unsigned long page_addr;
> +
> +	addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +	addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
> +	page_addr = (unsigned long)umem->pages[addr >> PAGE_SHIFT].addr;
> +
> +	return (char *)(page_addr & PAGE_MASK) + (addr & ~PAGE_MASK);
>  }
>
>  static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 
> addr)
>  {
> -	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 
> 1));
> +	addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +	addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
> +
> +	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & ~PAGE_MASK);
>  }
>
>  /* Reuse-queue aware version of FILL queue helpers */
> @@ -144,6 +161,19 @@ static inline void xsk_umem_fq_reuse(struct 
> xdp_umem *umem, u64 addr)
>
>  	rq->handles[rq->length++] = addr;
>  }
> +
> +/* Handle the offset appropriately depending on aligned or unaligned 
> mode.
> + * For unaligned mode, we store the offset in the upper 16-bits of 
> the address.
> + * For aligned mode, we simply add the offset to the address.
> + */
> +static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 
> address,
> +					 u64 offset)
> +{
> +	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
> +		return address + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
> +	else
> +		return address + offset;
> +}
>  #else
>  static inline int xsk_generic_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp)
>  {
> @@ -241,6 +271,12 @@ static inline void xsk_umem_fq_reuse(struct 
> xdp_umem *umem, u64 addr)
>  {
>  }
>
> +static inline u64 xsk_umem_handle_offset(struct xdp_umem *umem, u64 
> handle,
> +					 u64 offset)
> +{
> +	return 0;
> +}
> +
>  #endif /* CONFIG_XDP_SOCKETS */
>
>  #endif /* _LINUX_XDP_SOCK_H */
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index faaa5ca2a117..4a5490651b22 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -17,6 +17,10 @@
>  #define XDP_COPY	(1 << 1) /* Force copy-mode */
>  #define XDP_ZEROCOPY	(1 << 2) /* Force zero-copy mode */
>
> +/* Flags for xsk_umem_config flags */
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT 15
> +#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 
> XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT)
> +
>  struct sockaddr_xdp {
>  	__u16 sxdp_family;
>  	__u16 sxdp_flags;
> @@ -49,8 +53,9 @@ struct xdp_mmap_offsets {
>  #define XDP_OPTIONS			8
>
>  struct xdp_umem_reg {
> -	__u64 addr; /* Start of packet data area */
> -	__u64 len; /* Length of packet data area */
> +	__u64 addr;    /* Start of packet data area */
> +	__u64 len:48;  /* Length of packet data area */
> +	__u64 flags:16; /*Flags for umem */
>  	__u32 chunk_size;
>  	__u32 headroom;
>  };
> @@ -74,6 +79,11 @@ struct xdp_options {
>  #define XDP_UMEM_PGOFF_FILL_RING	0x100000000ULL
>  #define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000ULL
>
> +/* Masks for unaligned chunks mode */
> +#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
> +#define XSK_UNALIGNED_BUF_ADDR_MASK \
> +	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> +
>  /* Rx/Tx descriptor */
>  struct xdp_desc {
>  	__u64 addr;
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 83de74ca729a..5590ca7bbe15 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -299,6 +299,7 @@ static int xdp_umem_account_pages(struct xdp_umem 
> *umem)
>
>  static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg 
> *mr)
>  {
> +	bool unaligned_chunks = mr->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG;
>  	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
>  	unsigned int chunks, chunks_per_page;
>  	u64 addr = mr->addr, size = mr->len;
> @@ -314,7 +315,10 @@ static int xdp_umem_reg(struct xdp_umem *umem, 
> struct xdp_umem_reg *mr)
>  		return -EINVAL;
>  	}
>
> -	if (!is_power_of_2(chunk_size))
> +	if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG)
> +		return -EINVAL;
> +
> +	if (!unaligned_chunks && !is_power_of_2(chunk_size))
>  		return -EINVAL;
>
>  	if (!PAGE_ALIGNED(addr)) {
> @@ -331,9 +335,11 @@ static int xdp_umem_reg(struct xdp_umem *umem, 
> struct xdp_umem_reg *mr)
>  	if (chunks == 0)
>  		return -EINVAL;
>
> -	chunks_per_page = PAGE_SIZE / chunk_size;
> -	if (chunks < chunks_per_page || chunks % chunks_per_page)
> -		return -EINVAL;
> +	if (!unaligned_chunks) {
> +		chunks_per_page = PAGE_SIZE / chunk_size;
> +		if (chunks < chunks_per_page || chunks % chunks_per_page)
> +			return -EINVAL;
> +	}
>
>  	headroom = ALIGN(headroom, 64);
>
> @@ -342,13 +348,15 @@ static int xdp_umem_reg(struct xdp_umem *umem, 
> struct xdp_umem_reg *mr)
>  		return -EINVAL;
>
>  	umem->address = (unsigned long)addr;
> -	umem->chunk_mask = ~((u64)chunk_size - 1);
> +	umem->chunk_mask = unaligned_chunks ? XSK_UNALIGNED_BUF_ADDR_MASK
> +					    : ~((u64)chunk_size - 1);
>  	umem->size = size;
>  	umem->headroom = headroom;
>  	umem->chunk_size_nohr = chunk_size - headroom;
>  	umem->npgs = size / PAGE_SIZE;
>  	umem->pgs = NULL;
>  	umem->user = NULL;
> +	umem->flags = mr->flags;
>  	INIT_LIST_HEAD(&umem->xsk_list);
>  	spin_lock_init(&umem->xsk_list_lock);
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 59b57d708697..9b834d54549e 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -45,7 +45,7 @@ EXPORT_SYMBOL(xsk_umem_has_addrs);
>
>  u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
>  {
> -	return xskq_peek_addr(umem->fq, addr);
> +	return xskq_peek_addr(umem->fq, addr, umem);
>  }
>  EXPORT_SYMBOL(xsk_umem_peek_addr);
>
> @@ -55,21 +55,42 @@ void xsk_umem_discard_addr(struct xdp_umem *umem)
>  }
>  EXPORT_SYMBOL(xsk_umem_discard_addr);
>
> +/* If a buffer crosses a page boundary, we need to do 2 memcpy's, one 
> for
> + * each page. This is only required in copy mode.
> + */
> +static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void 
> *from_buf,
> +			     u32 len, u32 metalen)
> +{
> +	void *to_buf = xdp_umem_get_data(umem, addr);
> +
> +	if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
> +		void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;

If 'addr' (really handle) is v2 format and has an offset, this will
give incorrect results.  Please use accessors which correctly extract
page and offset from a handle.


> +		u64 page_start = addr & ~(PAGE_SIZE - 1);
> +		u64 first_len = PAGE_SIZE - (addr - page_start);
> +
> +		memcpy(to_buf, from_buf, first_len + metalen);
> +		memcpy(next_pg_addr, from_buf + first_len, len - first_len);
> +
> +		return;
> +	}
> +
> +	memcpy(to_buf, from_buf, len + metalen);
> +}
> +
>  static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 
> len)
>  {
> -	void *to_buf, *from_buf;
> +	u64 offset = xs->umem->headroom;
> +	void *from_buf;
>  	u32 metalen;
>  	u64 addr;
>  	int err;
>
> -	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> +	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>  	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>  		xs->rx_dropped++;
>  		return -ENOSPC;
>  	}
>
> -	addr += xs->umem->headroom;
> -
>  	if (unlikely(xdp_data_meta_unsupported(xdp))) {
>  		from_buf = xdp->data;
>  		metalen = 0;
> @@ -78,9 +99,10 @@ static int __xsk_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp, u32 len)
>  		metalen = xdp->data - xdp->data_meta;
>  	}
>
> -	to_buf = xdp_umem_get_data(xs->umem, addr);
> -	memcpy(to_buf, from_buf, len + metalen);
> -	addr += metalen;
> +	__xsk_rcv_memcpy(xs->umem, addr + offset, from_buf, len, metalen);

'addr' is actually a handle here.  Can't add offset, this needs 'adjust 
offset'.


> +
> +	offset += metalen;
> +	addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>  	err = xskq_produce_batch_desc(xs->rx, addr, len);
>  	if (!err) {
>  		xskq_discard_addr(xs->umem->fq);
> @@ -125,6 +147,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp)
>  {
>  	u32 metalen = xdp->data - xdp->data_meta;
>  	u32 len = xdp->data_end - xdp->data;
> +	u64 offset = xs->umem->headroom;
>  	void *buffer;
>  	u64 addr;
>  	int err;
> @@ -136,17 +159,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp)
>  		goto out_unlock;
>  	}
>
> -	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> +	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
>  	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>  		err = -ENOSPC;
>  		goto out_drop;
>  	}
>
> -	addr += xs->umem->headroom;
> -
> -	buffer = xdp_umem_get_data(xs->umem, addr);
> +	buffer = xdp_umem_get_data(xs->umem, addr + offset);
>  	memcpy(buffer, xdp->data_meta, len + metalen);
> -	addr += metalen;
> +	offset += metalen;
> +
> +	addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
>  	err = xskq_produce_batch_desc(xs->rx, addr, len);
>  	if (err)
>  		goto out_drop;
> @@ -190,7 +213,7 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem, 
> struct xdp_desc *desc)
>
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
> -		if (!xskq_peek_desc(xs->tx, desc))
> +		if (!xskq_peek_desc(xs->tx, desc, umem))
>  			continue;
>
>  		if (xskq_produce_addr_lazy(umem->cq, desc->addr))
> @@ -243,7 +266,7 @@ static int xsk_generic_xmit(struct sock *sk, 
> struct msghdr *m,
>  	if (xs->queue_id >= xs->dev->real_num_tx_queues)
>  		goto out;
>
> -	while (xskq_peek_desc(xs->tx, &desc)) {
> +	while (xskq_peek_desc(xs->tx, &desc, xs->umem)) {
>  		char *buffer;
>  		u64 addr;
>  		u32 len;
> @@ -262,6 +285,10 @@ static int xsk_generic_xmit(struct sock *sk, 
> struct msghdr *m,
>
>  		skb_put(skb, len);
>  		addr = desc.addr;
> +		if (xs->umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
> +			addr = (addr & XSK_UNALIGNED_BUF_ADDR_MASK) +
> +				(addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);

This is repeating the computations done in xdp_umem_get_data().  Remove.


>  		buffer = xdp_umem_get_data(xs->umem, addr);
>  		err = skb_store_bits(skb, 0, buffer, len);
>  		if (unlikely(err) || xskq_reserve_addr(xs->umem->cq)) {
> @@ -272,7 +299,7 @@ static int xsk_generic_xmit(struct sock *sk, 
> struct msghdr *m,
>  		skb->dev = xs->dev;
>  		skb->priority = sk->sk_priority;
>  		skb->mark = sk->sk_mark;
> -		skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
> +		skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
>  		skb->destructor = xsk_destruct_skb;
>
>  		err = dev_direct_xmit(skb, xs->queue_id);
> @@ -412,6 +439,24 @@ static struct socket *xsk_lookup_xsk_from_fd(int 
> fd)
>  	return sock;
>  }
>
> +/* Check if umem pages are contiguous.
> + * If zero-copy mode, use the DMA address to do the page contiguity 
> check
> + * For all other modes we use addr (kernel virtual address)
> + * Store the result in the low bits of addr.
> + */
> +static void xsk_check_page_contiguity(struct xdp_umem *umem, u32 
> flags)
> +{
> +	struct xdp_umem_page *pgs = umem->pages;
> +	int i, is_contig;
> +
> +	for (i = 0; i < umem->npgs - 1; i++) {
> +		is_contig = (flags & XDP_ZEROCOPY) ?
> +			(pgs[i].dma + PAGE_SIZE == pgs[i + 1].dma) :
> +			(pgs[i].addr + PAGE_SIZE == pgs[i + 1].addr);
> +		pgs[i].addr += is_contig << XSK_NEXT_PG_CONTIG_SHIFT;
> +	}
> +}
> +
>  static int xsk_bind(struct socket *sock, struct sockaddr *addr, int 
> addr_len)
>  {
>  	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
> @@ -500,6 +545,8 @@ static int xsk_bind(struct socket *sock, struct 
> sockaddr *addr, int addr_len)
>  		err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
>  		if (err)
>  			goto out_unlock;
> +
> +		xsk_check_page_contiguity(xs->umem, flags);
>  	}
>
>  	xs->dev = dev;
> diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
> index d5e06c8e0cbf..9986a759fe06 100644
> --- a/net/xdp/xsk_diag.c
> +++ b/net/xdp/xsk_diag.c
> @@ -56,7 +56,7 @@ static int xsk_diag_put_umem(const struct xdp_sock 
> *xs, struct sk_buff *nlskb)
>  	du.id = umem->id;
>  	du.size = umem->size;
>  	du.num_pages = umem->npgs;
> -	du.chunk_size = (__u32)(~umem->chunk_mask + 1);
> +	du.chunk_size = umem->chunk_size_nohr + umem->headroom;
>  	du.headroom = umem->headroom;
>  	du.ifindex = umem->dev ? umem->dev->ifindex : 0;
>  	du.queue_id = umem->queue_id;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index 909c5168ed0f..3d045c1c94b1 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -133,6 +133,17 @@ static inline bool xskq_has_addrs(struct 
> xsk_queue *q, u32 cnt)
>
>  /* UMEM queue */
>
> +static inline bool xskq_crosses_non_contig_pg(struct xdp_umem *umem, 
> u64 addr,
> +					      u64 length)
> +{
> +	bool cross_pg = (addr & (PAGE_SIZE - 1)) + length > PAGE_SIZE;
> +	bool next_pg_contig =
> +		(unsigned long)umem->pages[(addr >> PAGE_SHIFT)].addr &
> +			XSK_NEXT_PG_CONTIG_MASK;
> +
> +	return cross_pg && !next_pg_contig;
> +}
> +
>  static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
>  {
>  	if (addr >= q->size) {
> @@ -143,23 +154,50 @@ static inline bool xskq_is_valid_addr(struct 
> xsk_queue *q, u64 addr)
>  	return true;
>  }
>
> -static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr)
> +static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q, 
> u64 addr,
> +						u64 length,
> +						struct xdp_umem *umem)
> +{
> +	addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +	addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
> +	if (addr >= q->size ||
> +	    xskq_crosses_non_contig_pg(umem, addr, length)) {
> +		q->invalid_descs++;
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr,
> +				      struct xdp_umem *umem)
>  {
>  	while (q->cons_tail != q->cons_head) {
>  		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>  		unsigned int idx = q->cons_tail & q->ring_mask;
>
>  		*addr = READ_ONCE(ring->desc[idx]) & q->chunk_mask;
> +
> +		if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
> +			if (xskq_is_valid_addr_unaligned(q, *addr,
> +							 umem->chunk_size_nohr,
> +							 umem))
> +				return addr;
> +			goto out;
> +		}
> +
>  		if (xskq_is_valid_addr(q, *addr))
>  			return addr;
>
> +out:
>  		q->cons_tail++;
>  	}
>
>  	return NULL;
>  }
>
> -static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
> +static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr,
> +				  struct xdp_umem *umem)
>  {
>  	if (q->cons_tail == q->cons_head) {
>  		smp_mb(); /* D, matches A */
> @@ -170,7 +208,7 @@ static inline u64 *xskq_peek_addr(struct xsk_queue 
> *q, u64 *addr)
>  		smp_rmb();
>  	}
>
> -	return xskq_validate_addr(q, addr);
> +	return xskq_validate_addr(q, addr, umem);
>  }
>
>  static inline void xskq_discard_addr(struct xsk_queue *q)
> @@ -229,8 +267,21 @@ static inline int xskq_reserve_addr(struct 
> xsk_queue *q)
>
>  /* Rx/Tx queue */
>
> -static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct 
> xdp_desc *d)
> +static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct 
> xdp_desc *d,
> +				      struct xdp_umem *umem)
>  {
> +	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
> +		if (!xskq_is_valid_addr_unaligned(q, d->addr, d->len, umem))
> +			return false;
> +
> +		if (d->len > umem->chunk_size_nohr || d->options) {
> +			q->invalid_descs++;
> +			return false;
> +		}
> +
> +		return true;
> +	}
> +
>  	if (!xskq_is_valid_addr(q, d->addr))
>  		return false;
>
> @@ -244,14 +295,15 @@ static inline bool xskq_is_valid_desc(struct 
> xsk_queue *q, struct xdp_desc *d)
>  }
>
>  static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue 
> *q,
> -						  struct xdp_desc *desc)
> +						  struct xdp_desc *desc,
> +						  struct xdp_umem *umem)
>  {
>  	while (q->cons_tail != q->cons_head) {
>  		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
>  		unsigned int idx = q->cons_tail & q->ring_mask;
>
>  		*desc = READ_ONCE(ring->desc[idx]);
> -		if (xskq_is_valid_desc(q, desc))
> +		if (xskq_is_valid_desc(q, desc, umem))
>  			return desc;
>
>  		q->cons_tail++;
> @@ -261,7 +313,8 @@ static inline struct xdp_desc 
> *xskq_validate_desc(struct xsk_queue *q,
>  }
>
>  static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
> -					      struct xdp_desc *desc)
> +					      struct xdp_desc *desc,
> +					      struct xdp_umem *umem)
>  {
>  	if (q->cons_tail == q->cons_head) {
>  		smp_mb(); /* D, matches A */
> @@ -272,7 +325,7 @@ static inline struct xdp_desc 
> *xskq_peek_desc(struct xsk_queue *q,
>  		smp_rmb(); /* C, matches B */
>  	}
>
> -	return xskq_validate_desc(q, desc);
> +	return xskq_validate_desc(q, desc, umem);
>  }
>
>  static inline void xskq_discard_desc(struct xsk_queue *q)
> -- 
> 2.17.1

^ permalink raw reply

* Re: [oss-drivers] Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-07-31 18:12 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Network Development, oss-drivers, Eric Dumazet,
	borisp, aviadye, davejwatson, John Fastabend, Daniel Borkmann,
	Cong Wang, Florian Westphal, Alexei Starovoitov
In-Reply-To: <CA+FuTSdN41z5PVfyT5Z-ApnKQ9CYcDSnr4VGZnsgA-iAEK12Ow@mail.gmail.com>

On Wed, 31 Jul 2019 11:57:10 -0400, Willem de Bruijn wrote:
> On Tue, Jul 30, 2019 at 5:13 PM Jakub Kicinski wrote:
> > sk_validate_xmit_skb() and drivers depend on the sk member of
> > struct sk_buff to identify segments requiring encryption.
> > Any operation which removes or does not preserve the original TLS
> > socket such as skb_orphan() or skb_clone() will cause clear text
> > leaks.
> >
> > Make the TCP socket underlying an offloaded TLS connection
> > mark all skbs as decrypted, if TLS TX is in offload mode.
> > Then in sk_validate_xmit_skb() catch skbs which have no socket
> > (or a socket with no validation) and decrypted flag set.
> >
> > Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> > sk->sk_validate_xmit_skb are slightly interchangeable right now,
> > they all imply TLS offload. The new checks are guarded by
> > CONFIG_TLS_DEVICE because that's the option guarding the
> > sk_buff->decrypted member.
> >
> > Second, smaller issue with orphaning is that it breaks
> > the guarantee that packets will be delivered to device
> > queues in-order. All TLS offload drivers depend on that
> > scheduling property. This means skb_orphan_partial()'s
> > trick of preserving partial socket references will cause
> > issues in the drivers. We need a full orphan, and as a
> > result netem delay/throttling will cause all TLS offload
> > skbs to be dropped.
> >
> > Reusing the sk_buff->decrypted flag also protects from
> > leaking clear text when incoming, decrypted skb is redirected
> > (e.g. by TC).
> >
> > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index d57b0cc995a0..b0c10b518e65 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
> >  }
> >  EXPORT_SYMBOL(skb_set_owner_w);
> >
> > +static bool can_skb_orphan_partial(const struct sk_buff *skb)
> > +{
> > +#ifdef CONFIG_TLS_DEVICE
> > +       /* Drivers depend on in-order delivery for crypto offload,
> > +        * partial orphan breaks out-of-order-OK logic.
> > +        */
> > +       if (skb->decrypted)
> > +               return false;
> > +#endif
> > +#ifdef CONFIG_INET
> > +       if (skb->destructor == tcp_wfree)
> > +               return true;
> > +#endif
> > +       return skb->destructor == sock_wfree;
> > +}
> > +  
> 
> Just insert the skb->decrypted check into skb_orphan_partial for less
> code churn?

Okie.. skb_orphan_partial() is a little ugly but will do.

> I also think that this is an independent concern from leaking plain
> text, so perhaps could be a separate patch.

Do you mean the out-of-order stuff is a separate concern?

It is, I had them separate at the first try, but GSO code looks at
the destructor and IIRC only copies the socket if its still tcp_wfree.
If we let partial orphan be we have to do temporary hairy stuff in
tcp_gso_segment(). It's easier to just deal with partial orphan here.

^ permalink raw reply

* Re: [PATCH 1/2] tools: bpftool: add net load command to load XDP on interface
From: Daniel T. Lee @ 2019-07-31 18:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, netdev, Quentin Monnet
In-Reply-To: <20190731120805.1091d5c9@carbon>

On Wed, Jul 31, 2019 at 7:08 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Wed, 31 Jul 2019 03:48:20 +0900
> "Daniel T. Lee" <danieltimlee@gmail.com> wrote:
>
> > By this commit, using `bpftool net load`, user can load XDP prog on
> > interface. New type of enum 'net_load_type' has been made, as stated at
> > cover-letter, the meaning of 'load' is, prog will be loaded on interface.
>
> Why the keyword "load" ?
> Why not "attach" (and "detach")?
>
> For BPF there is a clear distinction between the "load" and "attach"
> steps.  I know this is under subcommand "net", but to follow the
> conversion of other subcommands e.g. "prog" there are both "load" and
> "attach" commands.
>
>
> > BPF prog will be loaded through libbpf 'bpf_set_link_xdp_fd'.
>
> Again this is a "set" operation, not a "load" operation.

From earlier at cover-letter, I thought using the same word 'load' might give
confusion since XDP program is not considered as 'bpf_attach_type' and can't
be attached with 'BPF_PROG_ATTACH'.

But, according to the feedback from you and Andrii Nakryiko, replacing
the word 'load' as 'attach' would be more clear and more consistent.

> > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
>
> [...]
> >  static int do_show(int argc, char **argv)
> >  {
> >       struct bpf_attach_info attach_info = {};
> > @@ -305,13 +405,17 @@ static int do_help(int argc, char **argv)
> >
> >       fprintf(stderr,
> >               "Usage: %s %s { show | list } [dev <devname>]\n"
> > +             "       %s %s load PROG LOAD_TYPE <devname>\n"
>
> The "PROG" here does it correspond to the 'bpftool prog' syntax?:
>
>  PROG := { id PROG_ID | pinned FILE | tag PROG_TAG }
>

Yes. By using the same 'prog_parse_fd' from 'bpftool prog',
user can 'attach' XDP prog with id, pinned file or tag.

> >               "       %s %s help\n"
> > +             "\n"
> > +             "       " HELP_SPEC_PROGRAM "\n"
> > +             "       LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n"
> >               "Note: Only xdp and tc attachments are supported now.\n"
> >               "      For progs attached to cgroups, use \"bpftool cgroup\"\n"
> >               "      to dump program attachments. For program types\n"
> >               "      sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
> >               "      consult iproute2.\n",
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

And about the enum 'NET_LOAD_TYPE_XDP_DRIVE',
'DRIVER' looks more clear to understand.

Will change to it right away.

Thanks for the review.

^ permalink raw reply

* Re: [PATCH bpf-next v4 09/11] samples/bpf: add buffer recycling for unaligned chunks to xdpsock
From: Jonathan Lemon @ 2019-07-31 18:26 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190730085400.10376-10-kevin.laatz@intel.com>



On 30 Jul 2019, at 1:53, Kevin Laatz wrote:

> This patch adds buffer recycling support for unaligned buffers. Since 
> we
> don't mask the addr to 2k at umem_reg in unaligned mode, we need to 
> make
> sure we give back the correct (original) addr to the fill queue. We 
> achieve
> this using the new descriptor format and associated masks. The new 
> format
> uses the upper 16-bits for the offset and the lower 48-bits for the 
> addr.
> Since we have a field for the offset, we no longer need to modify the
> actual address. As such, all we have to do to get back the original 
> address
> is mask for the lower 48 bits (i.e. strip the offset and we get the 
> address
> on it's own).
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>
> ---
> v2:
>   - Removed unused defines
>   - Fix buffer recycling for unaligned case
>   - Remove --buf-size (--frame-size merged before this)
>   - Modifications to use the new descriptor format for buffer 
> recycling
> ---
>  samples/bpf/xdpsock_user.c | 24 +++++++++++++++---------
>  1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> index 756b00eb1afe..62b2059cd0e3 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -475,6 +475,7 @@ static void kick_tx(struct xsk_socket_info *xsk)
>
>  static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
>  {
> +	struct xsk_umem_info *umem = xsk->umem;
>  	u32 idx_cq = 0, idx_fq = 0;
>  	unsigned int rcvd;
>  	size_t ndescs;
> @@ -487,22 +488,21 @@ static inline void complete_tx_l2fwd(struct 
> xsk_socket_info *xsk)
>  		xsk->outstanding_tx;
>
>  	/* re-add completed Tx buffers */
> -	rcvd = xsk_ring_cons__peek(&xsk->umem->cq, ndescs, &idx_cq);
> +	rcvd = xsk_ring_cons__peek(&umem->cq, ndescs, &idx_cq);
>  	if (rcvd > 0) {
>  		unsigned int i;
>  		int ret;
>
> -		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
> +		ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
>  		while (ret != rcvd) {
>  			if (ret < 0)
>  				exit_with_error(-ret);
> -			ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
> -						     &idx_fq);
> +			ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
>  		}
> +
>  		for (i = 0; i < rcvd; i++)
> -			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
> -				*xsk_ring_cons__comp_addr(&xsk->umem->cq,
> -							  idx_cq++);
> +			*xsk_ring_prod__fill_addr(&umem->fq, idx_fq++) =
> +				*xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
>
>  		xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
>  		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
> @@ -549,7 +549,11 @@ static void rx_drop(struct xsk_socket_info *xsk)
>  	for (i = 0; i < rcvd; i++) {
>  		u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
>  		u32 len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len;
> -		char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +		u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +
> +		addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
> +		char *pkt = xsk_umem__get_data(xsk->umem->buffer,
> +				addr + offset);

The mask constants should not be part of the api - this should be
hidden behind an accessor.

Something like:
   u64 addr = xsk_umem__get_addr(xsk->umem, handle);




>  		hex_dump(pkt, len, addr);
>  		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = addr;
> @@ -655,7 +659,9 @@ static void l2fwd(struct xsk_socket_info *xsk)
>  							  idx_rx)->addr;
>  			u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
>  							 idx_rx++)->len;
> -			char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> +			u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
> +			char *pkt = xsk_umem__get_data(xsk->umem->buffer,
> +				(addr & XSK_UNALIGNED_BUF_ADDR_MASK) + offset);
>
>  			swap_mac_addresses(pkt);
>
> -- 
> 2.17.1

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-07-31 18:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg,
	Paul E. McKenney
In-Reply-To: <20190731084655.7024-8-jasowang@redhat.com>

On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> We used to use RCU to synchronize MMU notifier with worker. This leads
> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> system, there would be many factors that may slow down the
> synchronize_rcu() which makes it unsuitable to be called in MMU
> notifier.
> 
> A solution is SRCU but its overhead is obvious with the expensive full
> memory barrier. Another choice is to use seqlock, but it doesn't
> provide a synchronization method between readers and writers. The last
> choice is to use vq mutex, but it need to deal with the worst case
> that MMU notifier must be blocked and wait for the finish of swap in.
> 
> So this patch switches use a counter to track whether or not the map
> was used. The counter was increased when vq try to start or finish
> uses the map. This means, when it was even, we're sure there's no
> readers and MMU notifier is synchronized. When it was odd, it means
> there's a reader we need to wait it to be even again then we are
> synchronized. To avoid full memory barrier, store_release +
> load_acquire on the counter is used.

Unfortunately this needs a lot of review and testing, so this can't make
rc2, and I don't think this is the kind of patch I can merge after rc3.
Subtle memory barrier tricks like this can introduce new bugs while they
are fixing old ones.





> 
> Consider the read critical section is pretty small the synchronization
> should be done very fast.
> 
> Note the patch lead about 3% PPS dropping.

Sorry what do you mean by this last sentence? This degrades performance
compared to what?

> 
> Reported-by: Michael S. Tsirkin <mst@redhat.com>
> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
>  drivers/vhost/vhost.h |   7 +-
>  2 files changed, 94 insertions(+), 58 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index cfc11f9ed9c9..db2c81cb1e90 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>  
>  	spin_lock(&vq->mmu_lock);
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> -		map[i] = rcu_dereference_protected(vq->maps[i],
> -				  lockdep_is_held(&vq->mmu_lock));
> +		map[i] = vq->maps[i];
>  		if (map[i]) {
>  			vhost_set_map_dirty(vq, map[i], i);
> -			rcu_assign_pointer(vq->maps[i], NULL);
> +			vq->maps[i] = NULL;
>  		}
>  	}
>  	spin_unlock(&vq->mmu_lock);
>  
> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
> -	 * serialized with memory accessors (e.g vq mutex held).
> +	/* No need for synchronization since we are serialized with
> +	 * memory accessors (e.g vq mutex held).
>  	 */
>  
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>  	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>  }
>  
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);
> +
> +	smp_store_release(&vq->ref, ref + 1);
> +	/* Make sure ref counter is visible before accessing the map */
> +	smp_load_acquire(&vq->ref);

The map access is after this sequence, correct?

Just going by the rules in Documentation/memory-barriers.txt,
I think that this pair will not order following accesses with ref store.

Documentation/memory-barriers.txt says:


+     In addition, a RELEASE+ACQUIRE
+     pair is -not- guaranteed to act as a full memory barrier.



The guarantee that is made is this:
	after
     an ACQUIRE on a given variable, all memory accesses preceding any prior
     RELEASE on that same variable are guaranteed to be visible. 


And if we also had the reverse rule we'd end up with a full barrier,
won't we?

Cc Paul in case I missed something here. And if I'm right,
maybe we should call this out, adding

	"The opposite is not true: a prior RELEASE is not
	 guaranteed to be visible before memory accesses following
	 the subsequent ACQUIRE".



> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> +	int ref = READ_ONCE(vq->ref);
> +
> +	/* Make sure vq access is done before increasing ref counter */
> +	smp_store_release(&vq->ref, ref + 1);
> +}
> +
> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> +{
> +	int ref;
> +
> +	/* Make sure map change was done before checking ref counter */
> +	smp_mb();
> +
> +	ref = READ_ONCE(vq->ref);
> +	if (ref & 0x1) {

Please document the even/odd trick here too, not just in the commit log.

> +		/* When ref change,

changes

> we are sure no reader can see
> +		 * previous map */
> +		while (READ_ONCE(vq->ref) == ref) {


what is the below line in aid of?

> +			set_current_state(TASK_RUNNING);
> +			schedule();

                        if (need_resched())
                                schedule();

?

> +		}

On an interruptible kernel, there's a risk here is that
a task got preempted with an odd ref.
So I suspect we'll have to disable preemption when we
make ref odd.


> +	}
> +	/* Make sure ref counter was checked before any other
> +	 * operations that was dene on map. */

was dene -> were done?

> +	smp_mb();
> +}
> +
>  static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>  				      int index,
>  				      unsigned long start,
> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>  	spin_lock(&vq->mmu_lock);
>  	++vq->invalidate_count;
>  
> -	map = rcu_dereference_protected(vq->maps[index],
> -					lockdep_is_held(&vq->mmu_lock));
> +	map = vq->maps[index];
>  	if (map) {
>  		vhost_set_map_dirty(vq, map, index);
> -		rcu_assign_pointer(vq->maps[index], NULL);
> +		vq->maps[index] = NULL;
>  	}
>  	spin_unlock(&vq->mmu_lock);
>  
>  	if (map) {
> -		synchronize_rcu();
> +		vhost_vq_sync_access(vq);
>  		vhost_map_unprefetch(map);
>  	}
>  }
> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
>  	for (i = 0; i < dev->nvqs; ++i) {
>  		vq = dev->vqs[i];
>  		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> -			RCU_INIT_POINTER(vq->maps[j], NULL);
> +			vq->maps[j] = NULL;
>  	}
>  }
>  #endif
> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->indirect = NULL;
>  		vq->heads = NULL;
>  		vq->dev = dev;
> +		vq->ref = 0;
>  		mutex_init(&vq->mutex);
>  		spin_lock_init(&vq->mmu_lock);
>  		vhost_vq_reset(dev, vq);
> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
>  	map->npages = npages;
>  	map->pages = pages;
>  
> -	rcu_assign_pointer(vq->maps[index], map);
> +	vq->maps[index] = map;
>  	/* No need for a synchronize_rcu(). This function should be
>  	 * called by dev->worker so we are serialized with all
>  	 * readers.
> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>  	struct vring_used *used;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>  
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			*((__virtio16 *)&used->ring[vq->num]) =
>  				cpu_to_vhost16(vq, vq->avail_idx);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>  
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  	size_t size;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>  
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			size = count * sizeof(*head);
>  			memcpy(used->ring + idx, head, size);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>  
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  	struct vring_used *used;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>  
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>  
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>  	struct vring_used *used;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>  
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>  
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>  	struct vring_avail *avail;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>  
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>  		if (likely(map)) {
>  			avail = map->addr;
>  			*idx = avail->idx;
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>  
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  	struct vring_avail *avail;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>  
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>  		if (likely(map)) {
>  			avail = map->addr;
>  			*head = avail->ring[idx & (vq->num - 1)];
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>  
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>  	struct vring_avail *avail;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>  
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>  		if (likely(map)) {
>  			avail = map->addr;
>  			*flags = avail->flags;
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>  
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>  	struct vring_avail *avail;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> +		vhost_vq_access_map_begin(vq);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
>  		if (likely(map)) {
>  			avail = map->addr;
>  			*event = (__virtio16)avail->ring[vq->num];
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>  	struct vring_used *used;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>  
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> +		map = vq->maps[VHOST_ADDR_USED];
>  		if (likely(map)) {
>  			used = map->addr;
>  			*idx = used->idx;
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>  
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>  	struct vring_desc *d;
>  
>  	if (!vq->iotlb) {
> -		rcu_read_lock();
> +		vhost_vq_access_map_begin(vq);
>  
> -		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
> +		map = vq->maps[VHOST_ADDR_DESC];
>  		if (likely(map)) {
>  			d = map->addr;
>  			*desc = *(d + idx);
> -			rcu_read_unlock();
> +			vhost_vq_access_map_end(vq);
>  			return 0;
>  		}
>  
> -		rcu_read_unlock();
> +		vhost_vq_access_map_end(vq);
>  	}
>  #endif
>  
> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
>  static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
>  {
> -	struct vhost_map __rcu *map;
> +	struct vhost_map *map;
>  	int i;
>  
>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> -		rcu_read_lock();
> -		map = rcu_dereference(vq->maps[i]);
> -		rcu_read_unlock();
> +		map = vq->maps[i];
>  		if (unlikely(!map))
>  			vhost_map_prefetch(vq, i);
>  	}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index a9a2a93857d2..f9e9558a529d 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
>  	/* Read by memory accessors, modified by meta data
>  	 * prefetching, MMU notifier and vring ioctl().
> -	 * Synchonrized through mmu_lock (writers) and RCU (writers
> -	 * and readers).
> +	 * Synchonrized through mmu_lock (writers) and ref counters,
> +	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
>  	 */
> -	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
> +	struct vhost_map *maps[VHOST_NUM_ADDRS];
>  	/* Read by MMU notifier, modified by vring ioctl(),
>  	 * synchronized through MMU notifier
>  	 * registering/unregistering.
>  	 */
>  	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>  #endif
> +	int ref;

Is it important that this is signed? If not I'd do unsigned here:
even though kernel does compile with 2s complement sign overflow,
it seems cleaner not to depend on that.

>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>  
>  	struct file *kick;
> -- 
> 2.18.1

^ permalink raw reply

* [PATCH net] mvpp2: fix panic on module removal
From: Matteo Croce @ 2019-07-31 18:31 UTC (permalink / raw)
  To: netdev
  Cc: Miquel Raynal, linux-kernel, Lorenzo Bianconi, Antoine Tenart,
	Maxime Chevallier, David S. Miller

mvpp2 uses a delayed workqueue to gather traffic statistics.
On module removal the workqueue can be destroyed before calling
cancel_delayed_work_sync() on its works.
Fix it by moving the destroy_workqueue() call after mvpp2_port_remove().

    # rmmod mvpp2
    [ 2743.311722] mvpp2 f4000000.ethernet eth1: phy link down 10gbase-kr/10Gbps/Full
    [ 2743.320063] mvpp2 f4000000.ethernet eth1: Link is Down
    [ 2743.572263] mvpp2 f4000000.ethernet eth2: phy link down sgmii/1Gbps/Full
    [ 2743.580076] mvpp2 f4000000.ethernet eth2: Link is Down
    [ 2744.102169] mvpp2 f2000000.ethernet eth0: phy link down 10gbase-kr/10Gbps/Full
    [ 2744.110441] mvpp2 f2000000.ethernet eth0: Link is Down
    [ 2744.115614] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
    [ 2744.115615] Mem abort info:
    [ 2744.115616]   ESR = 0x96000005
    [ 2744.115617]   Exception class = DABT (current EL), IL = 32 bits
    [ 2744.115618]   SET = 0, FnV = 0
    [ 2744.115619]   EA = 0, S1PTW = 0
    [ 2744.115620] Data abort info:
    [ 2744.115621]   ISV = 0, ISS = 0x00000005
    [ 2744.115622]   CM = 0, WnR = 0
    [ 2744.115624] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000422681000
    [ 2744.115626] [0000000000000000] pgd=0000000000000000, pud=0000000000000000
    [ 2744.115630] Internal error: Oops: 96000005 [#1] SMP
    [ 2744.115632] Modules linked in: mvpp2(-) algif_hash af_alg nls_iso8859_1 nls_cp437 vfat fat xhci_plat_hcd m25p80 spi_nor xhci_hcd mtd usbcore i2c_mv64xxx sfp usb_common marvell10g phy_generic spi_orion mdio_i2c i2c_core mvmdio phylink sbsa_gwdt ip_tables x_tables autofs4 [last unloaded: mvpp2]
    [ 2744.115654] CPU: 3 PID: 8357 Comm: kworker/3:2 Not tainted 5.3.0-rc2 #1
    [ 2744.115655] Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
    [ 2744.115665] Workqueue: events_power_efficient phylink_resolve [phylink]
    [ 2744.115669] pstate: a0000085 (NzCv daIf -PAN -UAO)
    [ 2744.115675] pc : __queue_work+0x9c/0x4d8
    [ 2744.115677] lr : __queue_work+0x170/0x4d8
    [ 2744.115678] sp : ffffff801001bd50
    [ 2744.115680] x29: ffffff801001bd50 x28: ffffffc422597600
    [ 2744.115684] x27: ffffff80109ae6f0 x26: ffffff80108e4018
    [ 2744.115688] x25: 0000000000000003 x24: 0000000000000004
    [ 2744.115691] x23: ffffff80109ae6e0 x22: 0000000000000017
    [ 2744.115694] x21: ffffffc42c030000 x20: ffffffc42209e8f8
    [ 2744.115697] x19: 0000000000000000 x18: 0000000000000000
    [ 2744.115699] x17: 0000000000000000 x16: 0000000000000000
    [ 2744.115701] x15: 0000000000000010 x14: ffffffffffffffff
    [ 2744.115702] x13: ffffff8090e2b95f x12: ffffff8010e2b967
    [ 2744.115704] x11: ffffff8010906000 x10: 0000000000000040
    [ 2744.115706] x9 : ffffff80109223b8 x8 : ffffff80109223b0
    [ 2744.115707] x7 : ffffffc42bc00068 x6 : 0000000000000000
    [ 2744.115709] x5 : ffffffc42bc00000 x4 : 0000000000000000
    [ 2744.115710] x3 : 0000000000000000 x2 : 0000000000000000
    [ 2744.115712] x1 : 0000000000000008 x0 : ffffffc42c030000
    [ 2744.115714] Call trace:
    [ 2744.115716]  __queue_work+0x9c/0x4d8
    [ 2744.115718]  delayed_work_timer_fn+0x28/0x38
    [ 2744.115722]  call_timer_fn+0x3c/0x180
    [ 2744.115723]  expire_timers+0x60/0x168
    [ 2744.115724]  run_timer_softirq+0xbc/0x1e8
    [ 2744.115727]  __do_softirq+0x128/0x320
    [ 2744.115731]  irq_exit+0xa4/0xc0
    [ 2744.115734]  __handle_domain_irq+0x70/0xc0
    [ 2744.115735]  gic_handle_irq+0x58/0xa8
    [ 2744.115737]  el1_irq+0xb8/0x140
    [ 2744.115738]  console_unlock+0x3a0/0x568
    [ 2744.115740]  vprintk_emit+0x200/0x2a0
    [ 2744.115744]  dev_vprintk_emit+0x1c8/0x1e4
    [ 2744.115747]  dev_printk_emit+0x6c/0x7c
    [ 2744.115751]  __netdev_printk+0x104/0x1d8
    [ 2744.115752]  netdev_printk+0x60/0x70
    [ 2744.115756]  phylink_resolve+0x38c/0x3c8 [phylink]
    [ 2744.115758]  process_one_work+0x1f8/0x448
    [ 2744.115760]  worker_thread+0x54/0x500
    [ 2744.115762]  kthread+0x12c/0x130
    [ 2744.115764]  ret_from_fork+0x10/0x1c
    [ 2744.115768] Code: aa1403e0 97fffbbe aa0003f5 b4000700 (f9400261)

Fixes: 118d6298f6f0 ("net: mvpp2: add ethtool GOP statistics")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
---
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index c51f1d5b550b..5002d51fc9d6 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -5760,7 +5760,6 @@ static int mvpp2_remove(struct platform_device *pdev)
 	mvpp2_dbgfs_cleanup(priv);
 
 	flush_workqueue(priv->stats_queue);
-	destroy_workqueue(priv->stats_queue);
 
 	fwnode_for_each_available_child_node(fwnode, port_fwnode) {
 		if (priv->port_list[i]) {
@@ -5770,6 +5769,8 @@ static int mvpp2_remove(struct platform_device *pdev)
 		i++;
 	}
 
+	destroy_workqueue(priv->stats_queue);
+
 	for (i = 0; i < MVPP2_BM_POOLS_NUM; i++) {
 		struct mvpp2_bm_pool *bm_pool = &priv->bm_pools[i];
 
-- 
2.21.0


^ permalink raw reply related

* [PATCH net] net: bridge: move vlan init/deinit to NETDEV_REGISTER/UNREGISTER
From: Nikolay Aleksandrov @ 2019-07-31 18:36 UTC (permalink / raw)
  To: netdev; +Cc: roopa, davem, bridge, Nikolay Aleksandrov, michael-dev

Most of the bridge device's vlan init bugs come from the fact that it's
done in the wrong place, way too early in ndo_init() before the device is
even assigned an ifindex. That makes error handling harder, especially for
older kernels which don't have bridge ndo_uninit callback. It also
introduces another bug when the bridge's dev_addr is added as fdb in the
the initial default pvid on vlan initialization, the fdb notification has
ifindex/NDA_MASTER both equal to 0 (see example below) which really
makes no sense for user-space[0]. Usually user-space software would ignore
such entries, but they are actually valid and will eventually have all
necessary attributes. I chose to change the order because this can be
backported to all kernels even pre-ndo_uninit ones without many changes
and it keeps init/deinit symmetric. As a bonus this allows us to keep
the vlan init/deinit entirely in br_vlan.c and remove those exports.
It makes much more sense to send a notification *after* the device has
registered and has a proper ifindex allocated rather than before when
there's a chance that the registration might still fail.

For the demonstration below a small change to iproute2 for printing all fdb
notifications is added, because it contained a workaround not to show
entries with ifindex == 0.
Command executed while monitoring: $ ip l add br0 type bridge
Before (both ifindex and master == 0):
$ bridge monitor fdb
36:7e:8a:b3:56:ba dev * vlan 1 master * permanent

After (proper br0 ifindex):
$ bridge monitor fdb
e6:2a:ae:7a:b7:48 dev br0 vlan 1 master br0 permanent

[0] https://bugzilla.kernel.org/show_bug.cgi?id=204389

Reported-by: michael-dev <michael-dev@fami-braun.de>
Fixes: 5be5a2df40f0 ("bridge: Add filtering support for default_pvid")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
I tried a few different approaches to resolve this but they were all
unsuitable for some kernels, this approach can go to stables easily
and IMO is the way this had to be done from the start. Alternatively
we could move only the br_vlan_add and pair it with a br_vlan_del of
default_pvid on the same events, but I don't think it hurts to move
the whole init/deinit there as it'd help older stable releases as well.

I also tested the br_vlan_init error handling after the move by always
returning errors from all over it. Since errors at NETDEV_REGISTER cause
NETDEV_UNREGISTER we can deinit vlans properly for all cases regardless
why it happened (e.g. device destruction or init error).

 net/bridge/br.c         |  5 ++++-
 net/bridge/br_device.c  | 10 ----------
 net/bridge/br_private.h | 19 ++++---------------
 net/bridge/br_vlan.c    | 23 ++++++++++++++++-------
 4 files changed, 24 insertions(+), 33 deletions(-)

diff --git a/net/bridge/br.c b/net/bridge/br.c
index d164f63a4345..8a8f9e5f264f 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -37,12 +37,15 @@ static int br_device_event(struct notifier_block *unused, unsigned long event, v
 	int err;
 
 	if (dev->priv_flags & IFF_EBRIDGE) {
+		err = br_vlan_bridge_event(dev, event, ptr);
+		if (err)
+			return notifier_from_errno(err);
+
 		if (event == NETDEV_REGISTER) {
 			/* register of bridge completed, add sysfs entries */
 			br_sysfs_addbr(dev);
 			return NOTIFY_DONE;
 		}
-		br_vlan_bridge_event(dev, event, ptr);
 	}
 
 	/* not a port of a bridge */
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 681b72862c16..b3e3de2ecf95 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -135,18 +135,9 @@ static int br_dev_init(struct net_device *dev)
 		return err;
 	}
 
-	err = br_vlan_init(br);
-	if (err) {
-		free_percpu(br->stats);
-		br_mdb_hash_fini(br);
-		br_fdb_hash_fini(br);
-		return err;
-	}
-
 	err = br_multicast_init_stats(br);
 	if (err) {
 		free_percpu(br->stats);
-		br_vlan_flush(br);
 		br_mdb_hash_fini(br);
 		br_fdb_hash_fini(br);
 	}
@@ -161,7 +152,6 @@ static void br_dev_uninit(struct net_device *dev)
 
 	br_multicast_dev_del(br);
 	br_multicast_uninit_stats(br);
-	br_vlan_flush(br);
 	br_mdb_hash_fini(br);
 	br_fdb_hash_fini(br);
 	free_percpu(br->stats);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index e8cf03b43b7d..96dd1c68d73f 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -872,7 +872,6 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
 int br_vlan_add(struct net_bridge *br, u16 vid, u16 flags,
 		bool *changed, struct netlink_ext_ack *extack);
 int br_vlan_delete(struct net_bridge *br, u16 vid);
-void br_vlan_flush(struct net_bridge *br);
 struct net_bridge_vlan *br_vlan_find(struct net_bridge_vlan_group *vg, u16 vid);
 void br_recalculate_fwd_mask(struct net_bridge *br);
 int __br_vlan_filter_toggle(struct net_bridge *br, unsigned long val);
@@ -881,7 +880,6 @@ int __br_vlan_set_proto(struct net_bridge *br, __be16 proto);
 int br_vlan_set_proto(struct net_bridge *br, unsigned long val);
 int br_vlan_set_stats(struct net_bridge *br, unsigned long val);
 int br_vlan_set_stats_per_port(struct net_bridge *br, unsigned long val);
-int br_vlan_init(struct net_bridge *br);
 int br_vlan_set_default_pvid(struct net_bridge *br, unsigned long val);
 int __br_vlan_set_default_pvid(struct net_bridge *br, u16 pvid,
 			       struct netlink_ext_ack *extack);
@@ -894,8 +892,8 @@ int nbp_get_num_vlan_infos(struct net_bridge_port *p, u32 filter_mask);
 void br_vlan_get_stats(const struct net_bridge_vlan *v,
 		       struct br_vlan_stats *stats);
 void br_vlan_port_event(struct net_bridge_port *p, unsigned long event);
-void br_vlan_bridge_event(struct net_device *dev, unsigned long event,
-			  void *ptr);
+int br_vlan_bridge_event(struct net_device *dev, unsigned long event,
+			 void *ptr);
 
 static inline struct net_bridge_vlan_group *br_vlan_group(
 					const struct net_bridge *br)
@@ -988,19 +986,10 @@ static inline int br_vlan_delete(struct net_bridge *br, u16 vid)
 	return -EOPNOTSUPP;
 }
 
-static inline void br_vlan_flush(struct net_bridge *br)
-{
-}
-
 static inline void br_recalculate_fwd_mask(struct net_bridge *br)
 {
 }
 
-static inline int br_vlan_init(struct net_bridge *br)
-{
-	return 0;
-}
-
 static inline int nbp_vlan_add(struct net_bridge_port *port, u16 vid, u16 flags,
 			       bool *changed, struct netlink_ext_ack *extack)
 {
@@ -1085,8 +1074,8 @@ static inline void br_vlan_port_event(struct net_bridge_port *p,
 {
 }
 
-static inline void br_vlan_bridge_event(struct net_device *dev,
-					unsigned long event, void *ptr)
+static inline int br_vlan_bridge_event(struct net_device *dev,
+				       unsigned long event, void *ptr)
 {
 }
 #endif
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index a544e161c7fa..266c1214b9f9 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -709,7 +709,7 @@ int br_vlan_delete(struct net_bridge *br, u16 vid)
 	return __vlan_del(v);
 }
 
-void br_vlan_flush(struct net_bridge *br)
+static void br_vlan_flush(struct net_bridge *br)
 {
 	struct net_bridge_vlan_group *vg;
 
@@ -721,6 +721,8 @@ void br_vlan_flush(struct net_bridge *br)
 	br_fdb_delete_by_port(br, NULL, 0, 1);
 
 	vg = br_vlan_group(br);
+	if (!vg)
+		return;
 	__vlan_flush(vg);
 	RCU_INIT_POINTER(br->vlgrp, NULL);
 	synchronize_rcu();
@@ -1054,7 +1056,7 @@ int br_vlan_set_default_pvid(struct net_bridge *br, unsigned long val)
 	return err;
 }
 
-int br_vlan_init(struct net_bridge *br)
+static int br_vlan_init(struct net_bridge *br)
 {
 	struct net_bridge_vlan_group *vg;
 	int ret = -ENOMEM;
@@ -1469,13 +1471,19 @@ static void nbp_vlan_set_vlan_dev_state(struct net_bridge_port *p, u16 vid)
 }
 
 /* Must be protected by RTNL. */
-void br_vlan_bridge_event(struct net_device *dev, unsigned long event,
-			  void *ptr)
+int br_vlan_bridge_event(struct net_device *dev, unsigned long event, void *ptr)
 {
 	struct netdev_notifier_changeupper_info *info;
-	struct net_bridge *br;
+	struct net_bridge *br = netdev_priv(dev);
+	int ret = 0;
 
 	switch (event) {
+	case NETDEV_REGISTER:
+		ret = br_vlan_init(br);
+		break;
+	case NETDEV_UNREGISTER:
+		br_vlan_flush(br);
+		break;
 	case NETDEV_CHANGEUPPER:
 		info = ptr;
 		br_vlan_upper_change(dev, info->upper_dev, info->linking);
@@ -1483,12 +1491,13 @@ void br_vlan_bridge_event(struct net_device *dev, unsigned long event,
 
 	case NETDEV_CHANGE:
 	case NETDEV_UP:
-		br = netdev_priv(dev);
 		if (!br_opt_get(br, BROPT_VLAN_BRIDGE_BINDING))
-			return;
+			break;
 		br_vlan_link_state_change(dev, br);
 		break;
 	}
+
+	return ret;
 }
 
 /* Must be protected by RTNL. */
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-07-31 18:43 UTC (permalink / raw)
  To: Boris Pismenny
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	oss-drivers@netronome.com, edumazet@google.com, Aviad Yehezkel,
	davejwatson@fb.com, john.fastabend@gmail.com,
	daniel@iogearbox.net, willemb@google.com,
	xiyou.wangcong@gmail.com, fw@strlen.de,
	alexei.starovoitov@gmail.com
In-Reply-To: <c86943db-f098-3ce8-f0d1-4f04ba676d40@mellanox.com>

On Wed, 31 Jul 2019 13:57:26 +0000, Boris Pismenny wrote:
> > diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
> > index 048e5ca44824..2bc3ab5515d8 100644
> > --- a/Documentation/networking/tls-offload.rst
> > +++ b/Documentation/networking/tls-offload.rst
> > @@ -499,15 +499,6 @@ offloads, old connections will remain active after flags are cleared.
> >   Known bugs
> >   ==========
> >   
> > -skb_orphan() leaks clear text
> > ------------------------------
> > -
> > -Currently drivers depend on the :c:member:`sk` member of
> > -:c:type:`struct sk_buff <sk_buff>` to identify segments requiring
> > -encryption. Any operation which removes or does not preserve the socket
> > -association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
> > -will cause the driver to miss the packets and lead to clear text leaks.
> > -
> >   Redirects leak clear text
> >   -------------------------  
> Doesn't this patch cover the redirect case as well?

Ah, good catch! I thought this entry was for socket redirect, which
will be a separate fix, but it seems we don't have an entry for that
one. 

> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 228db3998e46..90f3f552c789 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -814,6 +814,7 @@ enum sock_flags {
> >   	SOCK_TXTIME,
> >   	SOCK_XDP, /* XDP is attached */
> >   	SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> > +	SOCK_CRYPTO_TX_PLAIN_TEXT, /* Generate skbs with decrypted flag set */
> >   };
> >   
> >   #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> > @@ -2480,8 +2481,26 @@ static inline bool sk_fullsock(const struct sock *sk)
> >   	return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
> >   }
> >   
> > +static inline void sk_set_tx_crypto(const struct sock *sk, struct sk_buff *skb)
> > +{
> > +#ifdef CONFIG_TLS_DEVICE
> > +	skb->decrypted = sock_flag(sk, SOCK_CRYPTO_TX_PLAIN_TEXT);
> > +#endif
> > +}  
> 
> Have you considered the following alternative to calling 
> sk_set_tx_crypto() - Add a new MSG_DECRYPTE flag that will set the 
> skb->decrypted bit in the do_tcp_sendpages function.
> 
> Then, the rest of the TCP code can propagate this bit from the existing skb.

That'd also work marking the socket as crypto seemed easy enough. I was
planning on using sk_rx_crypto_match() for socket redirect also, so
it'd be symmetrical. Is there a preference for using the internal flags?

> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index f62f0e7e3cdd..dee93eab02fe 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -984,6 +984,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> >   			if (!skb)
> >   				goto wait_for_memory;
> >   
> > +			sk_set_tx_crypto(sk, skb);
> >   			skb_entail(sk, skb);
> >   			copy = size_goal;
> >   		}
> > @@ -993,7 +994,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
> >   
> >   		i = skb_shinfo(skb)->nr_frags;
> >   		can_coalesce = skb_can_coalesce(skb, i, page, offset);
> > -		if (!can_coalesce && i >= sysctl_max_skb_frags) {
> > +		if ((!can_coalesce && i >= sysctl_max_skb_frags) ||
> > +		    !sk_tx_crypto_match(sk, skb)) {  
> 
> I see that sk_tx_crypto_match is called only here to handle cases where 
> the socket expected crypto offload, while the skb is not marked 
> decrypted. AFAIU, this should not be possible, because we set the 
> skb->eor bit for the last plaintext skb before sending any traffic that 
> expects crypto offload.

Ack, I missed the tcp_skb_can_collapse_to() above.

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Mickaël Salaün @ 2019-07-31 18:46 UTC (permalink / raw)
  To: Alexei Starovoitov, Mickaël Salaün
  Cc: linux-kernel, Alexander Viro, Alexei Starovoitov, Andrew Morton,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
	Daniel Borkmann, David Drysdale, David S . Miller,
	Eric W . Biederman, James Morris, Jann Horn, John Johansen,
	Jonathan Corbet, Kees Cook, Michael Kerrisk, Paul Moore,
	Sargun Dhillon, Serge E . Hallyn, Shuah Khan, Stephen Smalley,
	Tejun Heo, Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	kernel-hardening, linux-api, linux-fsdevel, linux-security-module,
	netdev
In-Reply-To: <20190727014048.3czy3n2hi6hfdy3m@ast-mbp.dhcp.thefacebook.com>


On 27/07/2019 03:40, Alexei Starovoitov wrote:
> On Sun, Jul 21, 2019 at 11:31:12PM +0200, Mickaël Salaün wrote:
>> FIXME: 64-bits in the doc

FYI, this FIXME was fixed, just not removed from this message. :)

>>
>> This new map store arbitrary values referenced by inode keys.  The map
>> can be updated from user space with file descriptor pointing to inodes
>> tied to a file system.  From an eBPF (Landlock) program point of view,
>> such a map is read-only and can only be used to retrieved a value tied
>> to a given inode.  This is useful to recognize an inode tagged by user
>> space, without access right to this inode (i.e. no need to have a write
>> access to this inode).
>>
>> Add dedicated BPF functions to handle this type of map:
>> * bpf_inode_htab_map_update_elem()
>> * bpf_inode_htab_map_lookup_elem()
>> * bpf_inode_htab_map_delete_elem()
>>
>> This new map require a dedicated helper inode_map_lookup_elem() because
>> of the key which is a pointer to an opaque data (only provided by the
>> kernel).  This act like a (physical or cryptographic) key, which is why
>> it is also not allowed to get the next key.
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>
> there are too many things to comment on.
> Let's do this patch.
>
> imo inode_map concept is interesting, but see below...
>
>> +
>> +    /*
>> +     * Limit number of entries in an inode map to the maximum number of
>> +     * open files for the current process. The maximum number of file
>> +     * references (including all inode maps) for a process is then
>> +     * (RLIMIT_NOFILE - 1) * RLIMIT_NOFILE. If the process' RLIMIT_NOFILE
>> +     * is 0, then any entry update is forbidden.
>> +     *
>> +     * An eBPF program can inherit all the inode map FD. The worse case is
>> +     * to fill a bunch of arraymaps, create an eBPF program, close the
>> +     * inode map FDs, and start again. The maximum number of inode map
>> +     * entries can then be close to RLIMIT_NOFILE^3.
>> +     */
>> +    if (attr->max_entries > rlimit(RLIMIT_NOFILE))
>> +            return -EMFILE;
>
> rlimit is checked, but no fd are consumed.
> Once created such inode map_fd can be passed to a different process.
> map_fd can be pinned into bpffs.
> etc.
> what the value of the check?

I was looking for the most meaningful limit for a process, and rlimit is
the best I found. As the limit of open FD per processes, rlimit is not
perfect, but I think the semantic is close here (e.g. a process can also
pass FD through unix socket).

>
>> +
>> +    /* decorelate UAPI from kernel API */
>> +    attr->key_size = sizeof(struct inode *);
>> +
>> +    return htab_map_alloc_check(attr);
>> +}
>> +
>> +static void inode_htab_put_key(void *key)
>> +{
>> +    struct inode **inode = key;
>> +
>> +    if ((*inode)->i_state & I_FREEING)
>> +            return;
>
> checking the state without take a lock? isn't it racy?

This should only trigger when called from security_inode_free(). I'll
add a comment.

>
>> +    iput(*inode);
>> +}
>> +
>> +/* called from syscall or (never) from eBPF program */
>> +static int map_get_next_no_key(struct bpf_map *map, void *key, void *next_key)
>> +{
>> +    /* do not leak a file descriptor */
>
> what this comment suppose to mean?

Because a key is a reference to an inode, a possible return value for
this function could be a file descriptor pointing to this inode (the
same way a file descriptor is use to add an element). For now, I don't
want to implement a way for a process with such a map to extract such
inode, which I compare to a possible leak (of information, not kernel
memory nor object). This could be implemented in the future if there is
value in it (and probably some additional safeguards), though.

>
>> +    return -ENOTSUPP;
>> +}
>> +
>> +/* must call iput(inode) after this call */
>> +static struct inode *inode_from_fd(int ufd, bool check_access)
>> +{
>> +    struct inode *ret;
>> +    struct fd f;
>> +    int deny;
>> +
>> +    f = fdget(ufd);
>> +    if (unlikely(!f.file))
>> +            return ERR_PTR(-EBADF);
>> +    /* TODO?: add this check when called from an eBPF program too (already
>> +    * checked by the LSM parent hooks anyway) */
>> +    if (unlikely(IS_PRIVATE(file_inode(f.file)))) {
>> +            ret = ERR_PTR(-EINVAL);
>> +            goto put_fd;
>> +    }
>> +    /* check if the FD is tied to a mount point */
>> +    /* TODO?: add this check when called from an eBPF program too */
>> +    if (unlikely(f.file->f_path.mnt->mnt_flags & MNT_INTERNAL)) {
>> +            ret = ERR_PTR(-EINVAL);
>> +            goto put_fd;
>> +    }
>
> a bunch of TODOs do not inspire confidence.

I think the current implement is good, but these TODOs are here to draw
attention on particular points for which I would like external review
and opinion (hence the "?").

>
>> +    if (check_access) {
>> +            /*
>> +            * must be allowed to access attributes from this file to then
>> +            * be able to compare an inode to its map entry
>> +            */
>> +            deny = security_inode_getattr(&f.file->f_path);
>> +            if (deny) {
>> +                    ret = ERR_PTR(deny);
>> +                    goto put_fd;
>> +            }
>> +    }
>> +    ret = file_inode(f.file);
>> +    ihold(ret);
>> +
>> +put_fd:
>> +    fdput(f);
>> +    return ret;
>> +}
>> +
>> +/*
>> + * The key is a FD when called from a syscall, but an inode address when called
>> + * from an eBPF program.
>> + */
>> +
>> +/* called from syscall */
>> +int bpf_inode_fd_htab_map_lookup_elem(struct bpf_map *map, int *key, void *value)
>> +{
>> +    void *ptr;
>> +    struct inode *inode;
>> +    int ret;
>> +
>> +    /* check inode access */
>> +    inode = inode_from_fd(*key, true);
>> +    if (IS_ERR(inode))
>> +            return PTR_ERR(inode);
>> +
>> +    rcu_read_lock();
>> +    ptr = htab_map_lookup_elem(map, &inode);
>> +    iput(inode);
>> +    if (IS_ERR(ptr)) {
>> +            ret = PTR_ERR(ptr);
>> +    } else if (!ptr) {
>> +            ret = -ENOENT;
>> +    } else {
>> +            ret = 0;
>> +            copy_map_value(map, value, ptr);
>> +    }
>> +    rcu_read_unlock();
>> +    return ret;
>> +}
>> +
>> +/* called from kernel */
>
> wrong comment?
> kernel side cannot call it, right?

This is called from bpf_inode_fd_htab_map_delete_elem() (code just
beneath), and from
kernel/bpf/syscall.c:bpf_inode_ptr_unlocked_htab_map_delet_elem() which
can be called by security_inode_free() (hook_inode_free_security).

>
>> +int bpf_inode_ptr_locked_htab_map_delete_elem(struct bpf_map *map,
>> +            struct inode **key, bool remove_in_inode)
>> +{
>> +    if (remove_in_inode)
>> +            landlock_inode_remove_map(*key, map);
>> +    return htab_map_delete_elem(map, key);
>> +}
>> +
>> +/* called from syscall */
>> +int bpf_inode_fd_htab_map_delete_elem(struct bpf_map *map, int *key)
>> +{
>> +    struct inode *inode;
>> +    int ret;
>> +
>> +    /* do not check inode access (similar to directory check) */
>> +    inode = inode_from_fd(*key, false);
>> +    if (IS_ERR(inode))
>> +            return PTR_ERR(inode);
>> +    ret = bpf_inode_ptr_locked_htab_map_delete_elem(map, &inode, true);
>> +    iput(inode);
>> +    return ret;
>> +}
>> +
>> +/* called from syscall */
>> +int bpf_inode_fd_htab_map_update_elem(struct bpf_map *map, int *key, void *value,
>> +            u64 map_flags)
>> +{
>> +    struct inode *inode;
>> +    int ret;
>> +
>> +    WARN_ON_ONCE(!rcu_read_lock_held());
>> +
>> +    /* check inode access */
>> +    inode = inode_from_fd(*key, true);
>> +    if (IS_ERR(inode))
>> +            return PTR_ERR(inode);
>> +    ret = htab_map_update_elem(map, &inode, value, map_flags);
>> +    if (!ret)
>> +            ret = landlock_inode_add_map(inode, map);
>> +    iput(inode);
>> +    return ret;
>> +}
>> +
>> +static void inode_htab_map_free(struct bpf_map *map)
>> +{
>> +    struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
>> +    struct hlist_nulls_node *n;
>> +    struct hlist_nulls_head *head;
>> +    struct htab_elem *l;
>> +    int i;
>> +
>> +    for (i = 0; i < htab->n_buckets; i++) {
>> +            head = select_bucket(htab, i);
>> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
>> +            }
>> +    }
>> +    htab_map_free(map);
>> +}
>
> user space can delete the map.
> that will trigger inode_htab_map_free() which will call
> landlock_inode_remove_map().
> which will simply itereate the list and delete from the list.

landlock_inode_remove_map() removes the reference to the map (being
freed) from the inode (with an RCU lock).

>
> While in parallel inode can be destoyed and hook_inode_free_security()
> will be called.
> I think nothing that protects from this race.

According to security_inode_free(), the inode is effectively freed after
the RCU grace period. However, I forgot to call bpf_map_inc() in
landlock_inode_add_map(), which would prevent the map to be freed
outside of the security_inode_free(). I'll fix that.

>
>> +
>> +/*
>> + * We need a dedicated helper to deal with inode maps because the key is a
>> + * pointer to an opaque data, only provided by the kernel.  This really act
>> + * like a (physical or cryptographic) key, which is why it is also not allowed
>> + * to get the next key with map_get_next_key().
>
> inode pointer is like cryptographic key? :)

I wanted to highlight the fact that, contrary to other map key types,
the value of this one should not be readable, only usable. A "secret
value" is more appropriate but still confusing. I'll rephrase that.

>
>> + */
>> +BPF_CALL_2(bpf_inode_map_lookup_elem, struct bpf_map *, map, void *, key)
>> +{
>> +    WARN_ON_ONCE(!rcu_read_lock_held());
>> +    return (unsigned long)htab_map_lookup_elem(map, &key);
>> +}
>> +
>> +const struct bpf_func_proto bpf_inode_map_lookup_elem_proto = {
>> +    .func           = bpf_inode_map_lookup_elem,
>> +    .gpl_only       = false,
>> +    .pkt_access     = true,
>
> pkt_access ? :)

This slipped in with this rebase, I'll remove it. :)

>
>> +    .ret_type       = RET_PTR_TO_MAP_VALUE_OR_NULL,
>> +    .arg1_type      = ARG_CONST_MAP_PTR,
>> +    .arg2_type      = ARG_PTR_TO_INODE,
>> +};
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index b2a8cb14f28e..e46441c42b68 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -801,6 +801,8 @@ static int map_lookup_elem(union bpf_attr *attr)
>>      } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>                 map->map_type == BPF_MAP_TYPE_STACK) {
>>              err = map->ops->map_peek_elem(map, value);
>> +    } else if (map->map_type == BPF_MAP_TYPE_INODE) {
>> +            err = bpf_inode_fd_htab_map_lookup_elem(map, key, value);
>>      } else {
>>              rcu_read_lock();
>>              if (map->ops->map_lookup_elem_sys_only)
>> @@ -951,6 +953,10 @@ static int map_update_elem(union bpf_attr *attr)
>>      } else if (map->map_type == BPF_MAP_TYPE_QUEUE ||
>>                 map->map_type == BPF_MAP_TYPE_STACK) {
>>              err = map->ops->map_push_elem(map, value, attr->flags);
>> +    } else if (map->map_type == BPF_MAP_TYPE_INODE) {
>> +            rcu_read_lock();
>> +            err = bpf_inode_fd_htab_map_update_elem(map, key, value, attr->flags);
>> +            rcu_read_unlock();
>>      } else {
>>              rcu_read_lock();
>>              err = map->ops->map_update_elem(map, key, value, attr->flags);
>> @@ -1006,7 +1012,10 @@ static int map_delete_elem(union bpf_attr *attr)
>>      preempt_disable();
>>      __this_cpu_inc(bpf_prog_active);
>>      rcu_read_lock();
>> -    err = map->ops->map_delete_elem(map, key);
>> +    if (map->map_type == BPF_MAP_TYPE_INODE)
>> +            err = bpf_inode_fd_htab_map_delete_elem(map, key);
>> +    else
>> +            err = map->ops->map_delete_elem(map, key);
>>      rcu_read_unlock();
>>      __this_cpu_dec(bpf_prog_active);
>>      preempt_enable();
>> @@ -1018,6 +1027,22 @@ static int map_delete_elem(union bpf_attr *attr)
>>      return err;
>>  }
>>
>> +int bpf_inode_ptr_unlocked_htab_map_delete_elem(struct bpf_map *map,
>> +                                            struct inode **key, bool remove_in_inode)
>> +{
>> +    int err;
>> +
>> +    preempt_disable();
>> +    __this_cpu_inc(bpf_prog_active);
>> +    rcu_read_lock();
>> +    err = bpf_inode_ptr_locked_htab_map_delete_elem(map, key, remove_in_inode);
>> +    rcu_read_unlock();
>> +    __this_cpu_dec(bpf_prog_active);
>> +    preempt_enable();
>> +    maybe_wait_bpf_programs(map);
>
> if that function was actually doing synchronize_rcu() the consequences
> would have been unpleasant. Fortunately it's a nop in this case.
> Please read the code carefully before copy-paste.
> Also what do you think the reason of bpf_prog_active above?
> What is the reason of rcu_read_lock above?

The RCU is used as for every map modifications (usually from userspace).
I wasn't sure about the other protections so I kept the same (generic)
checks as in map_delete_elem() (just above) because this function follow
the same semantic. What can I safely remove?

>
> I think the patch set needs to shrink at least in half to be reviewable.
> The way you tie seccomp and lsm is probably the biggest obstacle
> than any of the bugs above.
> Can you drop seccomp ? and do it as normal lsm ?

The seccomp/enforcement part is needed to have a minimum viable product,
i.e. a process able to sandbox itself. Are you suggesting to first merge
a version when it is only possible to create inode maps but not use them
in an useful way (i.e. for sandboxing)? I can do it if it's OK with you,
and I hope it will not be a problem for the security folks if it can
help to move forward.

--
Mickaël Salaün
ANSSI/SDE/ST/LAM

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* [PATCH] net: mdio-octeon: Fix build error and Kconfig warning
From: Nathan Chancellor @ 2019-07-31 18:50 UTC (permalink / raw)
  To: davem
  Cc: andrew, broonie, devel, f.fainelli, gregkh, hkallweit1,
	kernel-build-reports, linux-arm-kernel, linux-next,
	natechancellor, netdev, willy, kbuild test robot, Randy Dunlap
In-Reply-To: <20190731.094150.851749535529247096.davem@davemloft.net>

arm allyesconfig warns:

WARNING: unmet direct dependencies detected for MDIO_OCTEON
  Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=y] && MDIO_BUS [=y]
&& 64BIT && HAS_IOMEM [=y] && OF_MDIO [=y]
  Selected by [y]:
  - OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC &&
NETDEVICES [=y] || COMPILE_TEST [=y])

and errors:

In file included from ../drivers/net/phy/mdio-octeon.c:14:
../drivers/net/phy/mdio-octeon.c: In function 'octeon_mdiobus_probe':
../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of
function 'writeq'; did you mean 'writeb'?
[-Werror=implicit-function-declaration]
  111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
      |                                    ^~~~~~
../drivers/net/phy/mdio-octeon.c:56:2: note: in expansion of macro
'oct_mdio_writeq'
   56 |  oct_mdio_writeq(smi_en.u64, bus->register_base + SMI_EN);
      |  ^~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

This allows MDIO_OCTEON to be built with COMPILE_TEST as well and
includes the proper header for readq/writeq. This does not address
the several -Wint-to-pointer-cast and -Wpointer-to-int-cast warnings
that appeared as a result of commit 171a9bae68c7 ("staging/octeon:
Allow test build on !MIPS") in these files.

Fixes: 171a9bae68c7 ("staging/octeon: Allow test build on !MIPS")
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Mark Brown <broonie@kernel.org>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
---
 drivers/net/phy/Kconfig       | 2 +-
 drivers/net/phy/mdio-cavium.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 20f14c5fbb7e..ed2edf4b5b0e 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -159,7 +159,7 @@ config MDIO_MSCC_MIIM
 
 config MDIO_OCTEON
 	tristate "Octeon and some ThunderX SOCs MDIO buses"
-	depends on 64BIT
+	depends on 64BIT || COMPILE_TEST
 	depends on HAS_IOMEM && OF_MDIO
 	select MDIO_CAVIUM
 	help
diff --git a/drivers/net/phy/mdio-cavium.h b/drivers/net/phy/mdio-cavium.h
index ed5f9bb5448d..b7f89ad27465 100644
--- a/drivers/net/phy/mdio-cavium.h
+++ b/drivers/net/phy/mdio-cavium.h
@@ -108,6 +108,8 @@ static inline u64 oct_mdio_readq(u64 addr)
 	return cvmx_read_csr(addr);
 }
 #else
+#include <linux/io-64-nonatomic-lo-hi.h>
+
 #define oct_mdio_writeq(val, addr)	writeq(val, (void *)addr)
 #define oct_mdio_readq(addr)		readq((void *)addr)
 #endif
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Alexei Starovoitov @ 2019-07-31 18:58 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Mickaël Salaün, LKML, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Stephen Smalley, Tejun Heo,
	Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	Kernel Hardening, Linux API, Linux-Fsdevel, LSM List,
	Network Development
In-Reply-To: <a870c2c9-d2f7-e0fa-c8cc-35dbf8b5b87d@ssi.gouv.fr>

On Wed, Jul 31, 2019 at 11:46 AM Mickaël Salaün
<mickael.salaun@ssi.gouv.fr> wrote:
> >> +    for (i = 0; i < htab->n_buckets; i++) {
> >> +            head = select_bucket(htab, i);
> >> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
> >> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
> >> +            }
> >> +    }
> >> +    htab_map_free(map);
> >> +}
> >
> > user space can delete the map.
> > that will trigger inode_htab_map_free() which will call
> > landlock_inode_remove_map().
> > which will simply itereate the list and delete from the list.
>
> landlock_inode_remove_map() removes the reference to the map (being
> freed) from the inode (with an RCU lock).

I'm going to ignore everything else for now and focus only on this bit,
since it's fundamental issue to address before this discussion can
go any further.
rcu_lock is not a spin_lock. I'm pretty sure you know this.
But you're arguing that it's somehow protecting from the race
I mentioned above?

^ permalink raw reply

* [PATCH] net/mlx5e: always initialize frag->last_in_page
From: Qian Cai @ 2019-07-31 19:02 UTC (permalink / raw)
  To: davem; +Cc: saeedm, tariqt, netdev, linux-kernel, Qian Cai

The commit 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue
memory scheme") introduced an undefined behaviour below due to
"frag->last_in_page" is only initialized in
mlx5e_init_frags_partition() when,

if (next_frag.offset + frag_info[f].frag_stride > PAGE_SIZE)

or after bailed out the loop,

for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++)

As the result, there could be some "frag" have uninitialized
value of "last_in_page".

Later, get_frag() obtains those "frag" and check "rag->last_in_page" in
mlx5e_put_rx_frag() and triggers the error during boot. Fix it by always
initializing "frag->last_in_page" to "false" in
mlx5e_init_frags_partition().

UBSAN: Undefined behaviour in
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:325:12
load of value 170 is not a valid value for type 'bool' (aka '_Bool')
Call trace:
 dump_backtrace+0x0/0x264
 show_stack+0x20/0x2c
 dump_stack+0xb0/0x104
 __ubsan_handle_load_invalid_value+0x104/0x128
 mlx5e_handle_rx_cqe+0x8e8/0x12cc [mlx5_core]
 mlx5e_poll_rx_cq+0xca8/0x1a94 [mlx5_core]
 mlx5e_napi_poll+0x17c/0xa30 [mlx5_core]
 net_rx_action+0x248/0x940
 __do_softirq+0x350/0x7b8
 irq_exit+0x200/0x26c
 __handle_domain_irq+0xc8/0x128
 gic_handle_irq+0x138/0x228
 el1_irq+0xb8/0x140
 arch_cpu_idle+0x1a4/0x348
 do_idle+0x114/0x1b0
 cpu_startup_entry+0x24/0x28
 rest_init+0x1ac/0x1dc
 arch_call_rest_init+0x10/0x18
 start_kernel+0x4d4/0x57c

Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Qian Cai <cai@lca.pw>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 47eea6b3a1c3..96f5110a9b43 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -336,6 +336,7 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
 
 	next_frag.di = &rq->wqe.di[0];
 	next_frag.offset = 0;
+	next_frag.last_in_page = false;
 	prev = NULL;
 
 	for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++) {
-- 
1.8.3.1


^ permalink raw reply related

* Re: [oss-drivers] Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Willem de Bruijn @ 2019-07-31 19:07 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, Network Development, oss-drivers, Eric Dumazet,
	borisp, aviadye, davejwatson, John Fastabend, Daniel Borkmann,
	Cong Wang, Florian Westphal, Alexei Starovoitov
In-Reply-To: <20190731111213.35237adc@cakuba.netronome.com>

On Wed, Jul 31, 2019 at 2:12 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Wed, 31 Jul 2019 11:57:10 -0400, Willem de Bruijn wrote:
> > On Tue, Jul 30, 2019 at 5:13 PM Jakub Kicinski wrote:
> > > sk_validate_xmit_skb() and drivers depend on the sk member of
> > > struct sk_buff to identify segments requiring encryption.
> > > Any operation which removes or does not preserve the original TLS
> > > socket such as skb_orphan() or skb_clone() will cause clear text
> > > leaks.
> > >
> > > Make the TCP socket underlying an offloaded TLS connection
> > > mark all skbs as decrypted, if TLS TX is in offload mode.
> > > Then in sk_validate_xmit_skb() catch skbs which have no socket
> > > (or a socket with no validation) and decrypted flag set.
> > >
> > > Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
> > > sk->sk_validate_xmit_skb are slightly interchangeable right now,
> > > they all imply TLS offload. The new checks are guarded by
> > > CONFIG_TLS_DEVICE because that's the option guarding the
> > > sk_buff->decrypted member.
> > >
> > > Second, smaller issue with orphaning is that it breaks
> > > the guarantee that packets will be delivered to device
> > > queues in-order. All TLS offload drivers depend on that
> > > scheduling property. This means skb_orphan_partial()'s
> > > trick of preserving partial socket references will cause
> > > issues in the drivers. We need a full orphan, and as a
> > > result netem delay/throttling will cause all TLS offload
> > > skbs to be dropped.
> > >
> > > Reusing the sk_buff->decrypted flag also protects from
> > > leaking clear text when incoming, decrypted skb is redirected
> > > (e.g. by TC).
> > >
> > > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>
> > > diff --git a/net/core/sock.c b/net/core/sock.c
> > > index d57b0cc995a0..b0c10b518e65 100644
> > > --- a/net/core/sock.c
> > > +++ b/net/core/sock.c
> > > @@ -1992,6 +1992,22 @@ void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
> > >  }
> > >  EXPORT_SYMBOL(skb_set_owner_w);
> > >
> > > +static bool can_skb_orphan_partial(const struct sk_buff *skb)
> > > +{
> > > +#ifdef CONFIG_TLS_DEVICE
> > > +       /* Drivers depend on in-order delivery for crypto offload,
> > > +        * partial orphan breaks out-of-order-OK logic.
> > > +        */
> > > +       if (skb->decrypted)
> > > +               return false;
> > > +#endif
> > > +#ifdef CONFIG_INET
> > > +       if (skb->destructor == tcp_wfree)
> > > +               return true;
> > > +#endif
> > > +       return skb->destructor == sock_wfree;
> > > +}
> > > +
> >
> > Just insert the skb->decrypted check into skb_orphan_partial for less
> > code churn?
>
> Okie.. skb_orphan_partial() is a little ugly but will do.
>
> > I also think that this is an independent concern from leaking plain
> > text, so perhaps could be a separate patch.

Just a suggestion and very much depending on how much uglier it
becomes otherwise ;)

> Do you mean the out-of-order stuff is a separate concern?
>
> It is, I had them separate at the first try, but GSO code looks at
> the destructor and IIRC only copies the socket if its still tcp_wfree.
> If we let partial orphan be we have to do temporary hairy stuff in
> tcp_gso_segment(). It's easier to just deal with partial orphan here.

Okay, sounds good.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-07-31 19:09 UTC (permalink / raw)
  To: Song Liu
  Cc: Andy Lutomirski, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <D4040C0C-47D6-4852-933C-59EB53C05242@fb.com>

On Wed, Jul 31, 2019 at 1:10 AM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 1:24 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Mon, Jul 29, 2019 at 10:07 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >> Hi Andy,
> >>
> >>> On Jul 27, 2019, at 11:20 AM, Song Liu <songliubraving@fb.com> wrote:
> >>>
> >>> Hi Andy,
> >>>
> >>>
>
> [...]
>
> >>>
> >>
> >> I would like more comments on this.
> >>
> >> Currently, bpf permission is more or less "root or nothing", which we
> >> would like to change.
> >>
> >> The short term goal is to separate bpf from root, in other words, it is
> >> "all or nothing". Special user space utilities, such as systemd, would
> >> benefit from this. Once this is implemented, systemd can call sys_bpf()
> >> when it is not running as root.
> >
> > As generally nasty as Linux capabilities are, this sounds like a good
> > use for CAP_BPF_ADMIN.
>
> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.

It's been done before -- it's not that hard.  IMO the main tricky bit
would be try be somewhat careful about defining exactly what
CAP_BPF_ADMIN does.

> > I don't see why you need to invent a whole new mechanism for this.
> > The entire cgroup ecosystem outside bpf() does just fine using the
> > write permission on files in cgroupfs to control access.  Why can't
> > bpf() do the same thing?
>
> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> we should have target concept for all these commands. But that is a
> much bigger project. OTOH, "all or nothing" model allows all these
> commands at once.

For BPF_PROG_LOAD, I admit I've never understood why permission is
required at all.  I think that CAP_SYS_ADMIN or similar should be
needed to get is_priv in the verifier, but I think that should mainly
be useful for tracing, and that requires lots of privilege anyway.
BPF_MAP_* is probably the trickiest part.  One solution would be some
kind of bpffs, but I'm sure other solutions are possible.

^ permalink raw reply

* Re: [PATCH bpf-next v10 06/10] bpf,landlock: Add a new map type: inode
From: Mickaël Salaün @ 2019-07-31 19:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mickaël Salaün, LKML, Alexander Viro,
	Alexei Starovoitov, Andrew Morton, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, John Johansen, Jonathan Corbet,
	Kees Cook, Michael Kerrisk, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Stephen Smalley, Tejun Heo,
	Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	Kernel Hardening, Linux API, Linux-Fsdevel, LSM List,
	Network Development
In-Reply-To: <CAADnVQLqkfVijWoOM29PxCL_yK6K0fr8B89r4c5EKgddevJhGQ@mail.gmail.com>

On 31/07/2019 20:58, Alexei Starovoitov wrote:
> On Wed, Jul 31, 2019 at 11:46 AM Mickaël Salaün
> <mickael.salaun@ssi.gouv.fr> wrote:
>>>> +    for (i = 0; i < htab->n_buckets; i++) {
>>>> +            head = select_bucket(htab, i);
>>>> +            hlist_nulls_for_each_entry_safe(l, n, head, hash_node) {
>>>> +                    landlock_inode_remove_map(*((struct inode **)l->key), map);
>>>> +            }
>>>> +    }
>>>> +    htab_map_free(map);
>>>> +}
>>>
>>> user space can delete the map.
>>> that will trigger inode_htab_map_free() which will call
>>> landlock_inode_remove_map().
>>> which will simply itereate the list and delete from the list.
>>
>> landlock_inode_remove_map() removes the reference to the map (being
>> freed) from the inode (with an RCU lock).
>
> I'm going to ignore everything else for now and focus only on this bit,
> since it's fundamental issue to address before this discussion can
> go any further.
> rcu_lock is not a spin_lock. I'm pretty sure you know this.
> But you're arguing that it's somehow protecting from the race
> I mentioned above?
>

I was just clarifying your comment to avoid misunderstanding about what
is being removed.

As said in the full response, there is currently a race but, if I add a
bpf_map_inc() call when the map is referenced by inode->security, then I
don't see how a race could occur because such added map could only be
freed in a security_inode_free() (as long as it retains a reference to
this inode).

--
Mickaël Salaün
ANSSI/SDE/ST/LAM

Les données à caractère personnel recueillies et traitées dans le cadre de cet échange, le sont à seule fin d’exécution d’une relation professionnelle et s’opèrent dans cette seule finalité et pour la durée nécessaire à cette relation. Si vous souhaitez faire usage de vos droits de consultation, de rectification et de suppression de vos données, veuillez contacter contact.rgpd@sgdsn.gouv.fr. Si vous avez reçu ce message par erreur, nous vous remercions d’en informer l’expéditeur et de détruire le message. The personal data collected and processed during this exchange aims solely at completing a business relationship and is limited to the necessary duration of that relationship. If you wish to use your rights of consultation, rectification and deletion of your data, please contact: contact.rgpd@sgdsn.gouv.fr. If you have received this message in error, we thank you for informing the sender and destroying the message.

^ permalink raw reply

* Re: [PATCH net-next 0/6] flow_offload: add indr-block in nf_table_offload
From: Jiri Pirko @ 2019-07-31 19:21 UTC (permalink / raw)
  To: wenxu; +Cc: pablo, fw, jakub.kicinski, netfilter-devel, netdev
In-Reply-To: <1564560753-32603-1-git-send-email-wenxu@ucloud.cn>

Wed, Jul 31, 2019 at 10:12:27AM CEST, wenxu@ucloud.cn wrote:
>From: wenxu <wenxu@ucloud.cn>
>
>This series patch make nftables offload support the vlan and
>tunnel device offload through indr-block architecture.
>
>The first four patches mv tc indr block to flow offload and
>rename to flow-indr-block.
>Because the new flow-indr-block can't get the tcf_block
>directly. The fifthe patch provide a callback list to get 
>flow_block of each subsystem immediately when the device
>register and contain a block.
>The last patch make nf_tables_offload support flow-indr-block.
>
>wenxu (6):
>  cls_api: modify the tc_indr_block_ing_cmd parameters.
>  cls_api: replace block with flow_block in tc_indr_block_dev
>  cls_api: add flow_indr_block_call function
>  flow_offload: move tc indirect block to flow offload
>  flow_offload: support get flow_block immediately
>  netfilter: nf_tables_offload: support indr block call

Wenxu, this is V5. Please repost is as V5. Also please provide per-patch
changelog, as you did with the previous versions. Thanks!

^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-31 17:46 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <CAEf4Bzb6swYtf7J_m1bZo6o+aT1AcCXZX5ZBw7Uja=Tne2LCuw@mail.gmail.com>



> On Jul 31, 2019, at 10:18 AM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Wed, Jul 31, 2019 at 1:30 AM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 11:52 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>> 
>>> On Tue, Jul 30, 2019 at 10:19 PM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jul 30, 2019, at 6:00 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>>>> 
>>>>> On Tue, Jul 30, 2019 at 5:39 PM Song Liu <songliubraving@fb.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>>>>>>> 
>>>>>>> This patch implements the core logic for BPF CO-RE offsets relocations.
>>>>>>> Every instruction that needs to be relocated has corresponding
>>>>>>> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
>>>>>>> to match recorded "local" relocation spec against potentially many
>>>>>>> compatible "target" types, creating corresponding spec. Details of the
>>>>>>> algorithm are noted in corresponding comments in the code.
>>>>>>> 
>>>>>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>> 
>> [...]
>> 
>>>>>> 
>>>>> 
>>>>> I just picked the most succinct and non-repetitive form. It's
>>>>> immediately apparent which type it's implicitly converted to, so I
>>>>> felt there is no need to repeat it. Also, just (void *) is much
>>>>> shorter. :)
>>>> 
>>>> _All_ other code in btf.c converts the pointer to the target type.
>>> 
>>> Most in libbpf.c doesn't, though. Also, I try to preserve pointer
>>> constness for uses that don't modify BTF types (pretty much all of
>>> them in libbpf), so it becomes really verbose, despite extremely short
>>> variable names:
>>> 
>>> const struct btf_member *m = (const struct btf_member *)(t + 1);
>> 
>> I don't think being verbose is a big problem here. Overusing
> 
> Problem is too big and strong word to describe this :). It hurts
> readability and will often quite artificially force either wrapping
> the line or unnecessarily splitting declaration and assignment. Void *
> on the other hand is short and usually is in the same line as target
> type declaration, if not, you'll have to find local variable
> declaration to double-check type, if you are unsure.
> 
> Using (void *) + implicit cast to target pointer type is not
> unprecedented in libbpf:
> 
> $ rg ' = \((const )?struct \w+ \*\)' tools/lib/bpf/ | wc -l
> 52
> $ rg ' = \((const )?void \*\)' tools/lib/bpf/  | wc -l
> 35
> 
> 52 vs 35 is majority overall, but not by a landslide.
> 
>> (void *) feels like a bigger problem.
> 
> Why do you feel it's a problem? void * conveys that we have a piece of
> memory that we will need to reinterpret as some concrete pointer type.
> That's what we are doing, skipping btf_type and then interpreting
> memory after common btf_type prefix is some other type, depending on
> actual BTF kind. I don't think void * is misleading in any way.

(void *) hides some problem. For example:

	struct type_a *ptr_a = NULL;
	struct type_b *ptr_b = NULL;

	/* we want this */
	ptr_a = (struct type_a *)data;
	ptr_b = (struct type_b *)(data + offset);

	/* typo, should be ptr_b, compiler will complain */
	ptr_a = (struct type_b *)(data + offset);

	/* typo, should be ptr_b, compiler will ignore */
	ptr_a = (void *)(data + offset);

Such typo is not very rare. And it may be really painful to debug. 

That being said, I think we have spent too much time on this. I will 
let you make the final call. Either way:

Acked-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-07-31 19:26 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, sthemmin, dsahern, mlxsw
In-Reply-To: <20190730153952.73de7f00@cakuba.netronome.com>

Wed, Jul 31, 2019 at 12:39:52AM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 30 Jul 2019 10:57:32 +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>> 
>> All devlink instances are created in init_net and stay there for a
>> lifetime. Allow user to be able to move devlink instances into
>> namespaces.
>> 
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>
>I'm no namespace expert, but seems reasonable, so FWIW:
>
>Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>
>If I read things right we will only send the devlink instance
>notification to other namespaces when it moves, but not
>notifications for sub-objects like ports. Is the expectation 
>that the user space dumps the objects it cares about or will
>the other notifications be added as needed (or option 3 - I 
>misread the code).

You are correct. I plan to take care of the notifications of all objects
in the follow-up patchset. But I can do it in this one if needed. Up to
you.


>
>I was also wondering it moving the devlink instance during a 
>multi-part transaction could be an issue. But AFAIU it should 
>be fine - the flashing doesn't release the instance lock, and 
>health stuff should complete correctly even if devlink is moved
>half way through?

Yes, I don't see any issue there as well.


^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Gunthorpe @ 2019-07-31 19:30 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <7555c949-ae6f-f105-6e1d-df21ddae9e4e@redhat.com>

On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
> 
> On 2019/7/31 下午8:39, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> > > We used to use RCU to synchronize MMU notifier with worker. This leads
> > > calling synchronize_rcu() in invalidate_range_start(). But on a busy
> > > system, there would be many factors that may slow down the
> > > synchronize_rcu() which makes it unsuitable to be called in MMU
> > > notifier.
> > > 
> > > A solution is SRCU but its overhead is obvious with the expensive full
> > > memory barrier. Another choice is to use seqlock, but it doesn't
> > > provide a synchronization method between readers and writers. The last
> > > choice is to use vq mutex, but it need to deal with the worst case
> > > that MMU notifier must be blocked and wait for the finish of swap in.
> > > 
> > > So this patch switches use a counter to track whether or not the map
> > > was used. The counter was increased when vq try to start or finish
> > > uses the map. This means, when it was even, we're sure there's no
> > > readers and MMU notifier is synchronized. When it was odd, it means
> > > there's a reader we need to wait it to be even again then we are
> > > synchronized.
> > You just described a seqlock.
> 
> 
> Kind of, see my explanation below.
> 
> 
> > 
> > We've been talking about providing this as some core service from mmu
> > notifiers because nearly every use of this API needs it.
> 
> 
> That would be very helpful.
> 
> 
> > 
> > IMHO this gets the whole thing backwards, the common pattern is to
> > protect the 'shadow pte' data with a seqlock (usually open coded),
> > such that the mmu notififer side has the write side of that lock and
> > the read side is consumed by the thread accessing or updating the SPTE.
> 
> 
> Yes, I've considered something like that. But the problem is, mmu notifier
> (writer) need to wait for the vhost worker to finish the read before it can
> do things like setting dirty pages and unmapping page.  It looks to me
> seqlock doesn't provide things like this.  

The seqlock is usually used to prevent a 2nd thread from accessing the
VA while it is being changed by the mm. ie you use something seqlocky
instead of the ugly mmu_notifier_unregister/register cycle.

You are supposed to use something simple like a spinlock or mutex
inside the invalidate_range_start to serialized tear down of the SPTEs
with their accessors.

> write_seqcount_begin()
> 
> map = vq->map[X]
> 
> write or read through map->addr directly
> 
> write_seqcount_end()
> 
> 
> There's no rmb() in write_seqcount_begin(), so map could be read before
> write_seqcount_begin(), but it looks to me now that this doesn't harm at
> all, maybe we can try this way.

That is because it is a write side lock, not a read lock. IIRC
seqlocks have weaker barriers because the write side needs to be
serialized in some other way.

The requirement I see is you need invalidate_range_start to block
until another thread exits its critical section (ie stops accessing
the SPTEs). 

That is a spinlock/mutex.

You just can't invent a faster spinlock by open coding something with
barriers, it doesn't work.

Jason

^ permalink raw reply

* Re: [PATCH V2 4/9] vhost: reset invalidate_count in vhost_set_vring_num_addr()
From: Jason Gunthorpe @ 2019-07-31 19:32 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <31ef9ed4-d74a-3454-a57d-fa843a3a802b@redhat.com>

On Wed, Jul 31, 2019 at 09:29:28PM +0800, Jason Wang wrote:
> 
> On 2019/7/31 下午8:41, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 04:46:50AM -0400, Jason Wang wrote:
> > > The vhost_set_vring_num_addr() could be called in the middle of
> > > invalidate_range_start() and invalidate_range_end(). If we don't reset
> > > invalidate_count after the un-registering of MMU notifier, the
> > > invalidate_cont will run out of sync (e.g never reach zero). This will
> > > in fact disable the fast accessor path. Fixing by reset the count to
> > > zero.
> > > 
> > > Reported-by: Michael S. Tsirkin <mst@redhat.com>
> > Did Michael report this as well?
> 
> 
> Correct me if I was wrong. I think it's point 4 described in
> https://lkml.org/lkml/2019/7/21/25.

I'm not sure what that is talking about

But this fixes what I described:

https://lkml.org/lkml/2019/7/22/554

Jason

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox