Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next v6 08/12] libbpf: add flags to umem config
From: Jonathan Lemon @ 2019-08-30 15:46 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190827022531.15060-9-kevin.laatz@intel.com>



On 26 Aug 2019, at 19:25, Kevin Laatz wrote:

> This patch adds a 'flags' field to the umem_config and umem_reg structs.
> This will allow for more options to be added for configuring umems.
>
> The first use for the flags field is to add a flag for unaligned chunks
> mode. These flags can either be user-provided or filled with a default.
>
> Since we change the size of the xsk_umem_config struct, we need to version
> the ABI. This patch includes the ABI versioning for xsk_umem__create. The
> Makefile was also updated to handle multiple function versions in
> check-abi.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* RE: [PATCH v2 1/6] mdev: Introduce sha1 based mdev alias
From: Parav Pandit @ 2019-08-30 15:45 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
	davem@davemloft.net, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190830160223.332fd81f.cohuck@redhat.com>



> -----Original Message-----
> From: Cornelia Huck <cohuck@redhat.com>
> Sent: Friday, August 30, 2019 7:32 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [PATCH v2 1/6] mdev: Introduce sha1 based mdev alias
> 
> On Fri, 30 Aug 2019 12:58:04 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Cornelia Huck <cohuck@redhat.com>
> > > Sent: Friday, August 30, 2019 6:09 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > > kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org;
> > > linux- kernel@vger.kernel.org; netdev@vger.kernel.org
> > > Subject: Re: [PATCH v2 1/6] mdev: Introduce sha1 based mdev alias
> > >
> > > On Fri, 30 Aug 2019 12:33:22 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Cornelia Huck <cohuck@redhat.com>
> > > > > Sent: Friday, August 30, 2019 2:47 PM
> > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > > > > kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org;
> > > > > linux- kernel@vger.kernel.org; netdev@vger.kernel.org
> > > > > Subject: Re: [PATCH v2 1/6] mdev: Introduce sha1 based mdev
> > > > > alias
> > > > >
> > > > > On Thu, 29 Aug 2019 06:18:59 -0500 Parav Pandit
> > > > > <parav@mellanox.com> wrote:
> > > > >
> > > > > > Some vendor drivers want an identifier for an mdev device that
> > > > > > is shorter than the UUID, due to length restrictions in the
> > > > > > consumers of that identifier.
> > > > > >
> > > > > > Add a callback that allows a vendor driver to request an alias
> > > > > > of a specified length to be generated for an mdev device. If
> > > > > > generated, that alias is checked for collisions.
> > > > > >
> > > > > > It is an optional attribute.
> > > > > > mdev alias is generated using sha1 from the mdev name.
> > > > > >
> > > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > >
> > > > > > ---
> > > > > > Changelog:
> > > > > > v1->v2:
> > > > > >  - Kept mdev_device naturally aligned
> > > > > >  - Added error checking for crypt_*() calls
> > > > > >  - Corrected a typo from 'and' to 'an'
> > > > > >  - Changed return type of generate_alias() from int to char*
> > > > > > v0->v1:
> > > > > >  - Moved alias length check outside of the parent lock
> > > > > >  - Moved alias and digest allocation from kvzalloc to kzalloc
> > > > > >  - &alias[0] changed to alias
> > > > > >  - alias_length check is nested under get_alias_length
> > > > > > callback check
> > > > > >  - Changed comments to start with an empty line
> > > > > >  - Fixed cleaunup of hash if mdev_bus_register() fails
> > > > > >  - Added comment where alias memory ownership is handed over
> > > > > > to mdev device
> > > > > >  - Updated commit log to indicate motivation for this feature
> > > > > > ---
> > > > > >  drivers/vfio/mdev/mdev_core.c    | 123
> > > > > ++++++++++++++++++++++++++++++-
> > > > > >  drivers/vfio/mdev/mdev_private.h |   5 +-
> > > > > >  drivers/vfio/mdev/mdev_sysfs.c   |  13 ++--
> > > > > >  include/linux/mdev.h             |   4 +
> > > > > >  4 files changed, 135 insertions(+), 10 deletions(-)
> > >
> > > > > ...and detached from the local variable here. Who is freeing it?
> > > > > The comment states that it is done by the mdev, but I don't see it?
> > > > >
> > > > mdev_device_free() frees it.
> > >
> > > Ah yes, I overlooked the kfree().
> > >
> > > > once its assigned to mdev, mdev is the owner of it.
> > > >
> > > > > This detour via the local variable looks weird to me. Can you
> > > > > either create the alias directly in the mdev (would need to
> > > > > happen later in the function, but I'm not sure why you generate
> > > > > the alias before checking for duplicates anyway), or do an explicit copy?
> > > > Alias duplicate check is done after generating it, because
> > > > duplicate alias are
> > > not allowed.
> > > > The probability of collision is rare.
> > > > So it is speculatively generated without hold the lock, because
> > > > there is no
> > > need to hold the lock.
> > > > It is compared along with guid while mutex lock is held in single loop.
> > > > And if it is duplicate, there is no need to allocate mdev.
> > > >
> > > > It will be sub optimal to run through the mdev list 2nd time after
> > > > mdev
> > > creation and after generating alias for duplicate check.
> > >
> > > Ok, but what about copying it? I find this "set local variable to
> > > NULL after ownership is transferred" pattern a bit unintuitive.
> > > Copying it to the mdev (and then unconditionally freeing it) looks more
> obvious to me.
> > Its not unconditionally freed.
> 
> That's not what I have been saying :(
> 
Ah I see. You want to allocate alias memory twice; once inside mdev device and another one in _create() function.
_create() one you want to free unconditionally.

Well, passing pointer is fine.
mdev_register_device() has similar little tricky pattern that makes parent = NULL on __find_parent_device() finds duplicate one.

Ownership transfer is more straight forward code.

It is similar to device_initialize(), device init sequence code, where once device_initialize is done, freeing the device memory will be left to the put_device(), we don't call kfree() on mdev device.

> > Its freed in the error unwinding path.
> > I think its ok along with the comment that describes this error path area.
> 
> It is not wrong, but I'm not sure I like it.
Ok.

^ permalink raw reply

* Re: [PATCH bpf-next v6 07/12] net/mlx5e: Allow XSK frames smaller than a page
From: Jonathan Lemon @ 2019-08-30 15:45 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190827022531.15060-8-kevin.laatz@intel.com>



On 26 Aug 2019, at 19:25, Kevin Laatz wrote:

> From: Maxim Mikityanskiy <maximmi@mellanox.com>
>
> Relax the requirements to the XSK frame size to allow it to be smaller
> than a page and even not a power of two. The current implementation can
> work in this mode, both with Striding RQ and without it.
>
> The code that checks `mtu + headroom <= XSK frame size` is modified
> accordingly. Any frame size between 2048 and PAGE_SIZE is accepted.
>
> Functions that worked with pages only now work with XSK frames, even if
> their size is different from PAGE_SIZE.
>
> With XSK queues, regardless of the frame size, Striding RQ uses the
> stride size of PAGE_SIZE, and UMR MTTs are posted using starting
> addresses of frames, but PAGE_SIZE as page size. MTU guarantees that no
> packet data will overlap with other frames. UMR MTT size is made equal
> to the stride size of the RQ, because UMEM frames may come in random
> order, and we need to handle them one by one. PAGE_SIZE is just a power
> of two that is bigger than any allowed XSK frame size, and also it
> doesn't require making additional changes to the code.
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next v6 06/12] mlx5e: modify driver for handling offsets
From: Jonathan Lemon @ 2019-08-30 15:43 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190827022531.15060-7-kevin.laatz@intel.com>



On 26 Aug 2019, at 19:25, Kevin Laatz wrote:

> With the addition of the unaligned chunks option, we need to make sure we
> handle the offsets accordingly based on the mode we are currently running
> in. This patch modifies the driver to appropriately mask the address for
> each case.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next v6 05/12] ixgbe: modify driver for handling offsets
From: Jonathan Lemon @ 2019-08-30 15:42 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190827022531.15060-6-kevin.laatz@intel.com>



On 26 Aug 2019, at 19:25, Kevin Laatz wrote:

> With the addition of the unaligned chunks option, we need to make sure we
> handle the offsets accordingly based on the mode we are currently running
> in. This patch modifies the driver to appropriately mask the address for
> each case.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next v6 04/12] i40e: modify driver for handling offsets
From: Jonathan Lemon @ 2019-08-30 15:42 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190827022531.15060-5-kevin.laatz@intel.com>



On 26 Aug 2019, at 19:25, Kevin Laatz wrote:

> With the addition of the unaligned chunks option, we need to make sure we
> handle the offsets accordingly based on the mode we are currently running
> in. This patch modifies the driver to appropriately mask the address for
> each case.
>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next v6 03/12] xsk: add support to allow unaligned chunk placement
From: Jonathan Lemon @ 2019-08-30 15:41 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190827022531.15060-4-kevin.laatz@intel.com>



On 26 Aug 2019, at 19:25, Kevin Laatz wrote:

> Currently, addresses are chunk size aligned. This means, we are very
> restricted in terms of where we can place chunk within the umem. For
> example, if we have a chunk size of 2k, then our chunks can only be placed
> at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).
>
> This patch introduces the ability to use unaligned chunks. With these
> changes, we are no longer bound to having to place chunks at a 2k (or
> whatever your chunk size is) interval. Since we are no longer dealing with
> aligned chunks, they can now cross page boundaries. Checks for page
> contiguity have been added in order to keep track of which pages are
> followed by a physically contiguous page.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next v6 02/12] ixgbe: simplify Rx buffer recycle
From: Jonathan Lemon @ 2019-08-30 15:39 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190827022531.15060-3-kevin.laatz@intel.com>

On 26 Aug 2019, at 19:25, Kevin Laatz wrote:

> Currently, the dma, addr and handle are modified when we reuse Rx buffers
> in zero-copy mode. However, this is not required as the inputs to the
> function are copies, not the original values themselves. As we use the
> copies within the function, we can use the original 'obi' values
> directly without having to mask and add the headroom.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH bpf-next v6 01/12] i40e: simplify Rx buffer recycle
From: Jonathan Lemon @ 2019-08-30 15:37 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	saeedm, maximmi, stephen, bruce.richardson, ciara.loftus, bpf,
	intel-wired-lan
In-Reply-To: <20190827022531.15060-2-kevin.laatz@intel.com>



On 26 Aug 2019, at 19:25, Kevin Laatz wrote:

> Currently, the dma, addr and handle are modified when we reuse Rx buffers
> in zero-copy mode. However, this is not required as the inputs to the
> function are copies, not the original values themselves. As we use the
> copies within the function, we can use the original 'old_bi' values
> directly without having to mask and add the headroom.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next 3/4] net: flow_offload: mangle action at byte level
From: Vlad Buslov @ 2019-08-30 15:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel@vger.kernel.org, davem@davemloft.net,
	netdev@vger.kernel.org, vishal@chelsio.com,
	jakub.kicinski@netronome.com, Saeed Mahameed, jiri@resnulli.us
In-Reply-To: <20190830005336.23604-4-pablo@netfilter.org>


On Fri 30 Aug 2019 at 03:53, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> The flow mangle action is originally modeled after the tc pedit action,
> this has a number of shortcomings:
>
> 1) The tc pedit offset must be set on the 32-bits boundaries. Many
>    protocol header field offsets are not aligned to 32-bits, eg. port
>    destination, port source and ethernet destination. This patch adjusts
>    the offset accordingly and trim off length in these case, so drivers get
>    an exact offset and length to the header fields.
>
> 2) The maximum mangle length is one word of 32-bits, hence you need to
>    up to four actions to mangle an IPv6 address. This patch coalesces
>    consecutive tc pedit actions into one single action so drivers can
>    configure the IPv6 mangling in one go. Ethernet address fields now
>    require one single action instead of two too.
>
> The following drivers have been updated accordingly to use this new
> mangle action layout:
>
> 1) The cxgb4 driver does not need to split protocol field matching
>    larger than one 32-bit words into multiple definitions. Instead one
>    single definition per protocol field is enough. Checking for
>    transport protocol ports is also simplified.
>
> 2) The mlx5 driver logic to disallow IPv4 ttl and IPv6 hoplimit fields
>    becomes more simple too.
>
> 3) The nfp driver uses the nfp_fl_set_helper() function to configure the
>    payload mangling. The memchr_inv() function is used to check for
>    proper initialization of the value and mask. The driver has been
>    updated to refer to the exact protocol header offsets too.
>
> As a result, this patch reduces code complexity on the driver side at
> the cost of adding ~100 LOC at the core to perform offset and length
> adjustment; and to coalesce consecutive actions.
>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---

[...]

> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index e30a151d8527..e8827fa8263a 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -3289,11 +3289,128 @@ void tc_cleanup_flow_action(struct flow_action *flow_action)
>  }
>  EXPORT_SYMBOL(tc_cleanup_flow_action);
>
> +static unsigned int flow_action_mangle_type(enum pedit_cmd cmd)
> +{
> +	switch (cmd) {
> +	case TCA_PEDIT_KEY_EX_CMD_SET:
> +		return FLOW_ACTION_MANGLE;
> +	case TCA_PEDIT_KEY_EX_CMD_ADD:
> +		return FLOW_ACTION_ADD;
> +	default:
> +		WARN_ON_ONCE(1);
> +	}
> +	return 0;
> +}
> +
> +struct flow_action_mangle_ctx {
> +	u8	cmd;
> +	u8	offset;
> +	u8	htype;
> +	u8	idx;
> +	u8	val[FLOW_ACTION_MANGLE_MAXLEN];
> +	u8	mask[FLOW_ACTION_MANGLE_MAXLEN];
> +	u32	first_word;
> +	u32	last_word;
> +};
> +
> +static void flow_action_mangle_entry(struct flow_action_entry *entry,
> +				     const struct flow_action_mangle_ctx *ctx)
> +{
> +	u32 delta;
> +
> +	entry->id = ctx->cmd;
> +	entry->mangle.htype = ctx->htype;
> +	entry->mangle.offset = ctx->offset;
> +	entry->mangle.len = ctx->idx;
> +
> +	/* Adjust offset. */
> +	delta = sizeof(u32) - (fls(ctx->first_word) / BITS_PER_BYTE);
> +	entry->mangle.offset += delta;
> +
> +	/* Adjust length. */
> +	entry->mangle.len -= ((ffs(ctx->last_word) / BITS_PER_BYTE) + delta);
> +
> +	/* Copy value and mask from offset using the adjusted length. */
> +	memcpy(entry->mangle.val, &ctx->val[delta], entry->mangle.len);
> +	memcpy(entry->mangle.mask, &ctx->mask[delta], entry->mangle.len);
> +}
> +
> +static void flow_action_mangle_ctx_update(struct flow_action_mangle_ctx *ctx,
> +					  const struct tc_action *act, int k)
> +{
> +	u32 val, mask;
> +
> +	val = tcf_pedit_val(act, k);
> +	mask = ~tcf_pedit_mask(act, k);
> +
> +	memcpy(&ctx->val[ctx->idx], &val, sizeof(u32));
> +	memcpy(&ctx->mask[ctx->idx], &mask, sizeof(u32));
> +	ctx->idx += sizeof(u32);
> +}
> +
> +static void flow_action_mangle_ctx_init(struct flow_action_mangle_ctx *ctx,
> +                                        const struct tc_action *act, int k)
> +{
> +	ctx->cmd = flow_action_mangle_type(tcf_pedit_cmd(act, k));
> +	ctx->offset = tcf_pedit_offset(act, k);
> +	ctx->htype = tcf_pedit_htype(act, k);
> +	ctx->idx = 0;
> +
> +	ctx->first_word = ntohl(~tcf_pedit_mask(act, k));
> +	ctx->last_word = ctx->first_word;
> +
> +	flow_action_mangle_ctx_update(ctx, act, k);
> +}
> +
> +static int flow_action_mangle(struct flow_action *flow_action,
> +			      struct flow_action_entry *entry,
> +			      const struct tc_action *act, int j)
> +{
> +	struct flow_action_mangle_ctx ctx;
> +	int k;
> +
> +	if (tcf_pedit_cmd(act, 0) > TCA_PEDIT_KEY_EX_CMD_ADD)
> +		return -1;
> +
> +	flow_action_mangle_ctx_init(&ctx, act, 0);
> +
> +	/* Special case: one single 32-bits word. */
> +	if (tcf_pedit_nkeys(act) == 1) {
> +		flow_action_mangle_entry(entry, &ctx);
> +		return j;
> +	}
> +
> +	for (k = 1; k < tcf_pedit_nkeys(act); k++) {
> +		if (tcf_pedit_cmd(act, k) > TCA_PEDIT_KEY_EX_CMD_ADD)
> +			return -1;
> +
> +		/* Offset is contiguous and type is the same, coalesce. */
> +		if (ctx.idx < FLOW_ACTION_MANGLE_MAXLEN &&
> +		    ctx.offset + ctx.idx == tcf_pedit_offset(act, k) &&
> +		    ctx.cmd == flow_action_mangle_type(tcf_pedit_cmd(act, k))) {
> +			flow_action_mangle_ctx_update(&ctx, act, k);
> +			continue;

Hi Pablo,

With this change you coalesce multiple pedits into single
flow_action_entry, which means that resulting rule->action.num_entries
is incorrect because number of filled flow actions entries will be less
than num_actions. With this, I get mlx5 rejecting such flow_rule with
"mlx5_core: The offload action is not supported." due to trailing
unfilled flow action(s) with id==0.

> +		}
> +		ctx.last_word = ntohl(~tcf_pedit_mask(act, k - 1));
> +
> +		/* Cannot coalesce, set up this entry. */
> +		flow_action_mangle_entry(entry, &ctx);
> +
> +		flow_action_mangle_ctx_init(&ctx, act, k);
> +		entry = &flow_action->entries[++j];
> +	}
> +
> +	ctx.last_word = ntohl(~tcf_pedit_mask(act, k - 1));
> +	flow_action_mangle_entry(entry, &ctx);
> +
> +	return j;
> +}
> +
>  int tc_setup_flow_action(struct flow_action *flow_action,
>  			 const struct tcf_exts *exts, bool rtnl_held)
>  {
>  	const struct tc_action *act;
> -	int i, j, k, err = 0;
> +	int i, j, err = 0;
>
>  	if (!exts)
>  		return 0;
> @@ -3366,25 +3483,9 @@ int tc_setup_flow_action(struct flow_action *flow_action,
>  		} else if (is_tcf_tunnel_release(act)) {
>  			entry->id = FLOW_ACTION_TUNNEL_DECAP;
>  		} else if (is_tcf_pedit(act)) {
> -			for (k = 0; k < tcf_pedit_nkeys(act); k++) {
> -				switch (tcf_pedit_cmd(act, k)) {
> -				case TCA_PEDIT_KEY_EX_CMD_SET:
> -					entry->id = FLOW_ACTION_MANGLE;
> -					break;
> -				case TCA_PEDIT_KEY_EX_CMD_ADD:
> -					entry->id = FLOW_ACTION_ADD;
> -					break;
> -				default:
> -					err = -EOPNOTSUPP;
> -					goto err_out;
> -				}
> -				entry->mangle.htype = tcf_pedit_htype(act, k);
> -				entry->mangle.mask = ~tcf_pedit_mask(act, k);
> -				entry->mangle.val = tcf_pedit_val(act, k) &
> -							entry->mangle.mask;
> -				entry->mangle.offset = tcf_pedit_offset(act, k);
> -				entry = &flow_action->entries[++j];
> -			}
> +			j = flow_action_mangle(flow_action, entry, act, j);
> +			if (j < 0)
> +				goto err_out;
>  		} else if (is_tcf_csum(act)) {
>  			entry->id = FLOW_ACTION_CSUM;
>  			entry->csum_flags = tcf_csum_update_flags(act);
> @@ -3439,8 +3540,7 @@ int tc_setup_flow_action(struct flow_action *flow_action,
>  			goto err_out;
>  		}
>
> -		if (!is_tcf_pedit(act))
> -			j++;
> +		j++;
>  	}
>
>  err_out:

^ permalink raw reply

* Re: [PATCH 2/3] samples: pktgen: add helper functions for IP(v4/v6) CIDR parsing
From: Daniel T. Lee @ 2019-08-30 15:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David S . Miller, netdev
In-Reply-To: <20190830152758.41b38c24@carbon>

On Fri, Aug 30, 2019 at 10:28 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Thu, 29 Aug 2019 05:42:42 +0900
> "Daniel T. Lee" <danieltimlee@gmail.com> wrote:
>
> > This commit adds CIDR parsing and IP validate helper function to parse
> > single IP or range of IP with CIDR. (e.g. 198.18.0.0/15)
> >
> > Helpers will be used in prior to set target address in samples/pktgen.
> >
> > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> > ---
> >  samples/pktgen/functions.sh | 134 ++++++++++++++++++++++++++++++++++++
> >  1 file changed, 134 insertions(+)
> >
> > diff --git a/samples/pktgen/functions.sh b/samples/pktgen/functions.sh
> > index 4af4046d71be..eb1c52e25018 100644
> > --- a/samples/pktgen/functions.sh
> > +++ b/samples/pktgen/functions.sh
> > @@ -163,6 +163,140 @@ function get_node_cpus()
> >       echo $node_cpu_list
> >  }
> >
> > +# Extend shrunken IPv6 address.
> > +# fe80::42:bcff:fe84:e10a => fe80:0:0:0:42:bcff:fe84:e10a
> > +function extend_addr6()
> > +{
> > +    local addr=$1
> > +    local sep=:
> > +    local sep2=::
> > +    local sep_cnt=$(tr -cd $sep <<< $1 | wc -c)
> > +    local shrink
> > +
> > +    # separator count : should be between 2, 7.
> > +    if [[ $sep_cnt -lt 2 || $sep_cnt -gt 7 ]]; then
> > +        err 5 "Invalid IP6 address sep: $1"
> > +    fi
> > +
> > +    # if shrink '::' occurs multiple, it's malformed.
> > +    shrink=( $(egrep -o "$sep{2,}" <<< $addr) )
> > +    if [[ ${#shrink[@]} -ne 0 ]]; then
> > +        if [[ ${#shrink[@]} -gt 1 || ( ${shrink[0]} != $sep2 ) ]]; then
> > +            err 5 "Invalid IP$IP6 address shr: $1"
> > +        fi
> > +    fi
> > +
> > +    # add 0 at begin & end, and extend addr by adding :0
> > +    [[ ${addr:0:1} == $sep ]] && addr=0${addr}
> > +    [[ ${addr: -1} == $sep ]] && addr=${addr}0
> > +    echo "${addr/$sep2/$(printf ':0%.s' $(seq $[8-sep_cnt])):}"
> > +}
> > +
> > +
> > +# Given a single IP(v4/v6) address, whether it is valid.
> > +function validate_addr()
> > +{
> > +    # check function is called with (funcname)6
> > +    [[ ${FUNCNAME[1]: -1} == 6 ]] && local IP6=6
> > +    local len=$[ IP6 ? 8 : 4 ]
> > +    local max=$[ 2**(len*2)-1 ]
> > +    local addr
> > +    local sep
> > +
> > +    # set separator for each IP(v4/v6)
> > +    [[ $IP6 ]] && sep=: || sep=.
> > +    IFS=$sep read -a addr <<< $1
> > +
> > +    # array length
> > +    if [[ ${#addr[@]} != $len ]]; then
> > +        err 5 "Invalid IP$IP6 address: $1"
> > +    fi
> > +
> > +    # check each digit between 0, $max
> > +    for digit in "${addr[@]}"; do
> > +        [[ $IP6 ]] && digit=$[ 16#$digit ]
> > +        if [[ $digit -lt 0 || $digit -gt $max ]]; then
> > +            err 5 "Invalid IP$IP6 address: $1"
> > +        fi
> > +    done
> > +
> > +    return 0
> > +}
> > +
> > +function validate_addr6() { validate_addr $@ ; }
> > +
> > +# Given a single IP(v4/v6) or CIDR, return minimum and maximum IP addr.
> > +function parse_addr()
>
> I must say that I'm impressed by your bash-shell skills, BUT below
> function does look too complicated for doing this... I were expecting
> that you would use the regular & (AND) operation to do the prefix
> masking.
>
>

Thank you very much for your compliment!

I'll switch to a more intuitive code as soon as possible.
(It might take some time to send next version of patch,
but i'll try to send it asap.)

And also, thank you for taking your time for the review.

> > +{
> > +    # check function is called with (funcname)6
> > +    [[ ${FUNCNAME[1]: -1} == 6 ]] && local IP6=6
> > +    local bitlen=$[ IP6 ? 128 : 32 ]
> > +
> > +    local addr=$1
> > +    local net
> > +    local prefix
> > +    local min_ip
> > +    local max_ip
> > +
> > +    IFS='/' read net prefix <<< $addr
> > +    [[ $IP6 ]] && net=$(extend_addr6 $net)
> > +    validate_addr$IP6 $net
> > +
> > +    if [[ $prefix -gt $bitlen ]]; then
> > +        err 5 "Invalid prefix: $prefix"
> > +    elif [[ -z $prefix ]]; then
> > +        min_ip=$net
> > +        max_ip=$net
> > +    else
> > +        # defining array for converting Decimal 2 Binary
> > +        # 00000000 00000001 00000010 00000011 00000100 ...
> > +        local d2b='{0..1}{0..1}{0..1}{0..1}{0..1}{0..1}{0..1}{0..1}'
> > +        [[ $IP6 ]] && d2b+=$d2b
> > +        eval local D2B=($d2b)
> > +
> > +        local shift=$[ bitlen-prefix ]
> > +        local ip_bit
> > +        local ip
> > +        local sep
> > +
> > +        # set separator for each IP(v4/v6)
> > +        [[ $IP6 ]] && sep=: || sep=.
> > +        IFS=$sep read -ra ip <<< $net
> > +
> > +        # build full size bit
> > +        for digit in "${ip[@]}"; do
> > +            [[ $IP6 ]] && digit=$[ 16#$digit ]
> > +            ip_bit+=${D2B[$digit]}
> > +        done
> > +
> > +        # fill 0 or 1 by $shift
> > +        base_bit=${ip_bit::$prefix}
> > +        min_bit="$base_bit$(printf '0%.s' $(seq $shift))"
> > +        max_bit="$base_bit$(printf '1%.s' $(seq $shift))"
> > +
> > +        bit2addr() {
> > +            local step=$[ IP6 ? 16 : 8 ]
> > +            local max=$[ bitlen-step ]
> > +            local result
> > +            local fmt
> > +            [[ $IP6 ]] && fmt='%X' || fmt='%d'
> > +
> > +            for i in $(seq 0 $step $max); do
> > +                result+=$(printf $fmt $[ 2#${1:$i:$step} ])
> > +                [[ $i != $max ]] && result+=$sep
> > +            done
> > +            echo $result
> > +        }
> > +
> > +        min_ip=$(bit2addr $min_bit)
> > +        max_ip=$(bit2addr $max_bit)
> > +    fi
> > +
> > +    echo $min_ip $max_ip
> > +}
> > +
> > +function parse_addr6() { parse_addr $@ ; }
> > +
> >  # Given a single or range of port(s), return minimum and maximum port number.
> >  function parse_ports()
> >  {
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] net/skbuff: silence warnings under memory pressure
From: Qian Cai @ 2019-08-30 15:25 UTC (permalink / raw)
  To: Eric Dumazet, davem; +Cc: netdev, linux-mm, linux-kernel
In-Reply-To: <6109dab4-4061-8fee-96ac-320adf94e130@gmail.com>

On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
> 
> On 8/30/19 4:57 PM, Qian Cai wrote:
> > When running heavy memory pressure workloads, the system is throwing
> > endless warnings below due to the allocation could fail from
> > __build_skb(), and the volume of this call could be huge which may
> > generate a lot of serial console output and cosumes all CPUs as
> > warn_alloc() could be expensive by calling dump_stack() and then
> > show_mem().
> > 
> > Fix it by silencing the warning in this call site. Also, it seems
> > unnecessary to even print a warning at all if the allocation failed in
> > __build_skb(), as it may just retransmit the packet and retry.
> > 
> 
> Same patches are showing up there and there from time to time.
> 
> Why is this particular spot interesting, against all others not adding
> __GFP_NOWARN ?
> 
> Are we going to have hundred of patches adding __GFP_NOWARN at various points,
> or should we get something generic to not flood the syslog in case of memory
> pressure ?
> 

From my testing which uses LTP oom* tests. There are only 3 places need to be
patched. The other two are in IOMMU code for both Intel and AMD. The place is
particular interesting because it could cause the system with floating serial
console output for days without making progress in OOM. I suppose it ends up in
a looping condition that warn_alloc() would end up generating more calls into
__build_skb() via ksoftirqd.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
From: Nicolas Dichtel @ 2019-08-30 15:19 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	kernel-team, linux-api
In-Reply-To: <20190829173034.up5g74onaekp53zd@ast-mbp.dhcp.thefacebook.com>

Le 29/08/2019 à 19:30, Alexei Starovoitov a écrit :
[snip]
> These are the links that showing that k8 can delegates caps.
> Are you saying that you know of folks who specifically
> delegate cap_sys_admin and cap_net_admin _only_ to a container to run bpf in there?
> 
Yes, we need cap_sys_admin only to load bpf:
tc filter add dev eth0 ingress matchall action bpf obj ./tc_test_kern.o sec test

I'm not sure to understand why cap_net_admin is not enough to run the previous
command (ie why load is forbidden).

I want to avoid sys_admin, thus cap_bpf will be ok. But we need to manage the
backward compatibility.

Regards,
Nicolas

^ permalink raw reply

* Re: [PATCH] net/skbuff: silence warnings under memory pressure
From: Eric Dumazet @ 2019-08-30 15:11 UTC (permalink / raw)
  To: Qian Cai, davem; +Cc: netdev, linux-mm, linux-kernel
In-Reply-To: <1567177025-11016-1-git-send-email-cai@lca.pw>



On 8/30/19 4:57 PM, Qian Cai wrote:
> When running heavy memory pressure workloads, the system is throwing
> endless warnings below due to the allocation could fail from
> __build_skb(), and the volume of this call could be huge which may
> generate a lot of serial console output and cosumes all CPUs as
> warn_alloc() could be expensive by calling dump_stack() and then
> show_mem().
> 
> Fix it by silencing the warning in this call site. Also, it seems
> unnecessary to even print a warning at all if the allocation failed in
> __build_skb(), as it may just retransmit the packet and retry.
> 

Same patches are showing up there and there from time to time.

Why is this particular spot interesting, against all others not adding __GFP_NOWARN ?

Are we going to have hundred of patches adding __GFP_NOWARN at various points,
or should we get something generic to not flood the syslog in case of memory pressure ?


^ permalink raw reply

* Re: [PATCH] seccomp: fix compilation errors in seccomp-bpf kselftest
From: shuah @ 2019-08-30 14:59 UTC (permalink / raw)
  To: Alakesh Haloi, Kees Cook, Andy Lutomirski, Will Drewry,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song
  Cc: linux-kselftest, netdev, bpf, linux-kernel, shuah
In-Reply-To: <30e993fe-de76-9831-7ecc-61fcbcd51ae0@kernel.org>

On 8/30/19 8:09 AM, shuah wrote:
> On 8/22/19 3:58 PM, Alakesh Haloi wrote:
>> Without this patch we see following error while building and kselftest
>> for secccomp_bpf fails.
>>
>> seccomp_bpf.c:1787:20: error: ‘PTRACE_EVENTMSG_SYSCALL_ENTRY’ 
>> undeclared (first use in this function);
>> seccomp_bpf.c:1788:6: error: ‘PTRACE_EVENTMSG_SYSCALL_EXIT’ undeclared 
>> (first use in this function);
>>
>> Signed-off-by: Alakesh Haloi <alakesh.haloi@gmail.com>
>> ---
>>   tools/testing/selftests/seccomp/seccomp_bpf.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
>> b/tools/testing/selftests/seccomp/seccomp_bpf.c
>> index 6ef7f16c4cf5..2e619760fc3e 100644
>> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
>> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
>> @@ -1353,6 +1353,14 @@ TEST_F(precedence, log_is_fifth_in_any_order)
>>   #define PTRACE_EVENT_SECCOMP 7
>>   #endif
>> +#ifndef PTRACE_EVENTMSG_SYSCALL_ENTRY
>> +#define PTRACE_EVENTMSG_SYSCALL_ENTRY 1
>> +#endif
>> +
>> +#ifndef PTRACE_EVENTMSG_SYSCALL_EXIT
>> +#define PTRACE_EVENTMSG_SYSCALL_EXIT 2
>> +#endif
>> +
>>   #define IS_SECCOMP_EVENT(status) ((status >> 16) == 
>> PTRACE_EVENT_SECCOMP)
>>   bool tracer_running;
>>   void tracer_stop(int sig)
>>
> 
> Hi Kees,
> 
> Okay to apply this one for 5.4-rc1. Or is this going through bpf tree?
> If it is going through bpf tree:
> 
> Acked-by: Shuah Khan <skhan@linuxfoundation.org>
> 
> thanks,
> -- Shuah
> 

I saw your mail about Tycho's solution to be your preferred. Ignore this
message. I am applying Tycho's patch.

thanks,
-- Shuah

^ permalink raw reply

* [PATCH] net/skbuff: silence warnings under memory pressure
From: Qian Cai @ 2019-08-30 14:57 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-mm, linux-kernel, Qian Cai

When running heavy memory pressure workloads, the system is throwing
endless warnings below due to the allocation could fail from
__build_skb(), and the volume of this call could be huge which may
generate a lot of serial console output and cosumes all CPUs as
warn_alloc() could be expensive by calling dump_stack() and then
show_mem().

Fix it by silencing the warning in this call site. Also, it seems
unnecessary to even print a warning at all if the allocation failed in
__build_skb(), as it may just retransmit the packet and retry.

NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
Hardware name: HP ProLiant XL420 Gen9/ProLiant XL420 Gen9, BIOS U19
12/27/2015
RIP: 0010:dump_stack+0xd/0x9a
Code: 5d c3 48 c7 c2 c0 ce aa bb 4c 89 ee 48 c7 c7 60 32 e3 ae e8 b3 3e
81 ff e9 ab ff ff ff 55 48 89 e5 41 55 41 83 cd ff 41 54 53 <9c> 41 5c
fa be 04 00 00 00 48 c7 c7 80 42 63 af 65 8b 1d f6 7c 8c
RSP: 0018:ffff888452389670 EFLAGS: 00000086
RAX: 000000000000000b RBX: 0000000000000007 RCX: ffffffffae75143f
RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffffaf634280
RBP: ffff888452389688 R08: 0000000000000004 R09: fffffbfff5ec6850
R10: fffffbfff5ec6850 R11: 0000000000000003 R12: 0000000000000086
R13: 00000000ffffffff R14: ffff88820547c040 R15: ffffffffaecc86a0
FS:  0000000000000000(0000) GS:ffff888452380000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f94f0537000 CR3: 00000005c8012006 CR4: 00000000001606a0
Call Trace:
 <IRQ>
 warn_alloc.cold.43+0x8a/0x148
 __alloc_pages_nodemask+0x1a5c/0x1bb0
 alloc_pages_current+0x9c/0x110
 allocate_slab+0x34a/0x11f0
 new_slab+0x46/0x70
 ___slab_alloc+0x604/0x950
 __slab_alloc+0x12/0x20
 kmem_cache_alloc+0x32a/0x400
 __build_skb+0x23/0x60
 build_skb+0x1a/0xb0
 igb_clean_rx_irq+0xafc/0x1010 [igb]
 igb_poll+0x4bb/0xe30 [igb]
 net_rx_action+0x244/0x7a0
 __do_softirq+0x1a0/0x60a
 irq_exit+0xb5/0xd0
 do_IRQ+0x81/0x170
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0010:cpuidle_enter_state+0x151/0x8d0
Code: 64 af e8 62 22 c2 ff 8b 05 04 f6 0b 01 85 c0 0f 8f 18 04 00 00 31
ff e8 9d 9e 97 ff 80 7d d0 00 0f 85 06 02 00 00 fb 45 85 ed <0f> 88 2d
02 00 00 4d 63 fd 49 83 ff 09 0f 87 8c 06 00 00 4b 8d 04
RSP: 0018:ffff888205487cf8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdc
RAX: 0000000000000000 RBX: ffffe8fc09596f98 RCX: ffffffffadf093da
RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff8884523c2128
RBP: ffff888205487d48 R08: fffffbfff5ec9d62 R09: fffffbfff5ec9d62
R10: fffffbfff5ec9d61 R11: ffffffffaf64eb0b R12: ffffffffaf459580
R13: 0000000000000004 R14: 0000101fb3d64a9b R15: ffffe8fc09596f9c
 cpuidle_enter+0x41/0x70
 call_cpuidle+0x5e/0x90
 do_idle+0x313/0x340
 cpu_startup_entry+0x1d/0x1f
 start_secondary+0x28b/0x330
 secondary_startup_64+0xb6/0xc0

Signed-off-by: Qian Cai <cai@lca.pw>
---
 net/core/skbuff.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0338820ee0ec..9cc4148deddd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -307,7 +307,9 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
 {
 	struct sk_buff *skb;
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	skb = kmem_cache_alloc(skbuff_head_cache,
+			       GFP_ATOMIC | __GFP_NOWARN);
+
 	if (unlikely(!skb))
 		return NULL;
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Michal Kubecek @ 2019-08-30 14:49 UTC (permalink / raw)
  To: netdev
  Cc: Roopa Prabhu, Jiri Pirko, David Miller, Jakub Kicinski,
	David Ahern, Stephen Hemminger, dcbw, Andrew Lunn, parav,
	Saeed Mahameed, mlxsw
In-Reply-To: <CAJieiUgGY4amm_z1VGgBF-3WZceah+R5OVLEi=O2RS8RGpC9dg@mail.gmail.com>

On Fri, Aug 30, 2019 at 07:35:23AM -0700, Roopa Prabhu wrote:
> On Wed, Aug 28, 2019 at 10:26 PM Michal Kubecek <mkubecek@suse.cz> wrote:
> >
> > On Wed, Aug 28, 2019 at 09:36:41PM -0700, Roopa Prabhu wrote:
> > >
> > > yes,  correct. I mentioned that because I was wondering if we can
> > > think along the same lines for this API.
> > > eg
> > > (a) RTM_NEWLINK always replaces the list attribute
> > > (b) RTM_SETLINK with NLM_F_APPEND always appends to the list attribute
> > > (c) RTM_DELLINK with NLM_F_APPEND updates the list attribute
> > >
> > > (It could be NLM_F_UPDATE if NLM_F_APPEND sounds weird in the del
> > > case. I have not looked at the full dellink path if it will work
> > > neatly..its been a busy day )
> >
> > AFAICS rtnl_dellink() calls nlmsg_parse_deprecated() so that even
> > current code would ignore any future attribute in RTM_DELLINK message
> > (any kernel before the strict validation was introduced definitely will)
> > and it does not seem to check NLM_F_APPEND or NLM_F_UPDATE either. So
> > unless I missed something, such message would result in deleting the
> > network device (if possible) with any kernel not implementing the
> > feature.
> 
> ok, ack. yes today it does. I was hinting if that can be changed to
> support list update with a flag like the RTM_DELLINK AF_BRIDGE does
> for vlan list del.

What I wanted to say is that it would have to be done in a way that
would make current kernel (or even older, e.g. one with no strict
attribute checking at all) reject or at least ignore such request,
not delete the device.

Michal

^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: David Ahern @ 2019-08-30 14:47 UTC (permalink / raw)
  To: Roopa Prabhu, Michal Kubecek
  Cc: netdev, Jiri Pirko, David Miller, Jakub Kicinski,
	Stephen Hemminger, dcbw, Andrew Lunn, parav, Saeed Mahameed,
	mlxsw
In-Reply-To: <CAJieiUgGY4amm_z1VGgBF-3WZceah+R5OVLEi=O2RS8RGpC9dg@mail.gmail.com>

On 8/30/19 8:35 AM, Roopa Prabhu wrote:
> On Wed, Aug 28, 2019 at 10:26 PM Michal Kubecek <mkubecek@suse.cz> wrote:
>>
>> On Wed, Aug 28, 2019 at 09:36:41PM -0700, Roopa Prabhu wrote:
>>>
>>> yes,  correct. I mentioned that because I was wondering if we can
>>> think along the same lines for this API.
>>> eg
>>> (a) RTM_NEWLINK always replaces the list attribute
>>> (b) RTM_SETLINK with NLM_F_APPEND always appends to the list attribute
>>> (c) RTM_DELLINK with NLM_F_APPEND updates the list attribute
>>>
>>> (It could be NLM_F_UPDATE if NLM_F_APPEND sounds weird in the del
>>> case. I have not looked at the full dellink path if it will work
>>> neatly..its been a busy day )
>>
>> AFAICS rtnl_dellink() calls nlmsg_parse_deprecated() so that even
>> current code would ignore any future attribute in RTM_DELLINK message
>> (any kernel before the strict validation was introduced definitely will)
>> and it does not seem to check NLM_F_APPEND or NLM_F_UPDATE either. So
>> unless I missed something, such message would result in deleting the
>> network device (if possible) with any kernel not implementing the
>> feature.
> 
> ok, ack. yes today it does. I was hinting if that can be changed to
> support list update with a flag like the RTM_DELLINK AF_BRIDGE does
> for vlan list del.
> 
> so to summarize, i think we have discussed the following options to
> update a netlink list attribute so far:
> (a) encode an optional attribute/flag in the list attribute in
> RTM_SETLINK to indicate if it is a add or del

The ALT_IFNAME attribute could also be a struct that has both the string
and a flag.

> (b) Use a flag in RTM_SETLINK and RTM_DELINK to indicate add/del
> (close to bridge vlan add/del)
> (c) introduce a separate generic msg type to add/del to a list
> attribute (IIUC this does need a separate msg type per subsystem or
> netlink API)
> 


^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Roopa Prabhu @ 2019-08-30 14:35 UTC (permalink / raw)
  To: Michal Kubecek
  Cc: netdev, Jiri Pirko, David Miller, Jakub Kicinski, David Ahern,
	Stephen Hemminger, dcbw, Andrew Lunn, parav, Saeed Mahameed,
	mlxsw
In-Reply-To: <20190829052620.GK29594@unicorn.suse.cz>

On Wed, Aug 28, 2019 at 10:26 PM Michal Kubecek <mkubecek@suse.cz> wrote:
>
> On Wed, Aug 28, 2019 at 09:36:41PM -0700, Roopa Prabhu wrote:
> >
> > yes,  correct. I mentioned that because I was wondering if we can
> > think along the same lines for this API.
> > eg
> > (a) RTM_NEWLINK always replaces the list attribute
> > (b) RTM_SETLINK with NLM_F_APPEND always appends to the list attribute
> > (c) RTM_DELLINK with NLM_F_APPEND updates the list attribute
> >
> > (It could be NLM_F_UPDATE if NLM_F_APPEND sounds weird in the del
> > case. I have not looked at the full dellink path if it will work
> > neatly..its been a busy day )
>
> AFAICS rtnl_dellink() calls nlmsg_parse_deprecated() so that even
> current code would ignore any future attribute in RTM_DELLINK message
> (any kernel before the strict validation was introduced definitely will)
> and it does not seem to check NLM_F_APPEND or NLM_F_UPDATE either. So
> unless I missed something, such message would result in deleting the
> network device (if possible) with any kernel not implementing the
> feature.

ok, ack. yes today it does. I was hinting if that can be changed to
support list update with a flag like the RTM_DELLINK AF_BRIDGE does
for vlan list del.

so to summarize, i think we have discussed the following options to
update a netlink list attribute so far:
(a) encode an optional attribute/flag in the list attribute in
RTM_SETLINK to indicate if it is a add or del
(b) Use a flag in RTM_SETLINK and RTM_DELINK to indicate add/del
(close to bridge vlan add/del)
(c) introduce a separate generic msg type to add/del to a list
attribute (IIUC this does need a separate msg type per subsystem or
netlink API)

^ permalink raw reply

* Re: bug report: libceph: follow redirect replies from osds
From: Ilya Dryomov @ 2019-08-30 14:26 UTC (permalink / raw)
  To: Colin Ian King
  Cc: Jeff Layton, Sage Weil, David S. Miller, Ceph Development,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <3a4ff829-7302-7201-81c2-a557fe35afc8@canonical.com>

On Fri, Aug 30, 2019 at 4:05 PM Colin Ian King <colin.king@canonical.com> wrote:
>
> Hi,
>
> Static analysis with Coverity has picked up an issue with commit:
>
> commit 205ee1187a671c3b067d7f1e974903b44036f270
> Author: Ilya Dryomov <ilya.dryomov@inktank.com>
> Date:   Mon Jan 27 17:40:20 2014 +0200
>
>     libceph: follow redirect replies from osds
>
> Specifically in function ceph_redirect_decode in net/ceph/osd_client.c:
>
> 3485
> 3486        len = ceph_decode_32(p);
>
> CID 17904: Unused value (UNUSED_VALUE)
>
> 3487        *p += len; /* skip osd_instructions */
> 3488
> 3489        /* skip the rest */
> 3490        *p = struct_end;
>
> The double write to *p looks wrong, I suspect the *p += len; should be
> just incrementing pointer p as in: p += len.  Am I correct to assume
> this is the correct fix?

Hi Colin,

No, the double write to *p is correct.  It skips over len bytes and
then skips to the end of the redirect reply.  There is no bug here but
we could drop

  len = ceph_decode_32(p);
  *p += len; /* skip osd_instructions */

and skip to the end directly to make Coverity happier.

Thanks,

                Ilya

^ permalink raw reply

* Re: [PATCH v2 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Leonardo Bras @ 2019-08-30 14:15 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Pablo Neira Ayuso, David S. Miller, netfilter-devel, coreteam,
	netdev, linux-kernel, Jozsef Kadlecsik, Alexey Kuznetsov,
	Hideaki YOSHIFUJI
In-Reply-To: <20190829205832.GM20113@breakpoint.cc>

[-- Attachment #1: Type: text/plain, Size: 1186 bytes --]

On Thu, 2019-08-29 at 22:58 +0200, Florian Westphal wrote:
> In any case your patch looks ok to me.

Great! Please give your feedback on v3: 
http://patchwork.ozlabs.org/patch/1154040/

[...]
> 
> Even if we disable call-ip6tables in br_netfilter we will at least
> in addition need a patch for nft_fib_netdev.c.
> 
> From a "avoid calls to ipv6 stack when its disabled" standpoint,
> the safest fix is to disable call-ip6tables functionality if ipv6
> module is off *and* fix nft_fib_netdev.c to BREAK in ipv6 is off case.
> 
> I started to place a list of suspicous modules here, but that got out
> of hand quickly.
> 
> So, given I don't want to plaster ipv6_mod_enabled() everywhere, I
> would suggest this course of action:
> 
> 1. add a patch to BREAK in nft_fib_netdev.c for !ipv6_mod_enabled()
> 2. change net/bridge/br_netfilter_hooks.c, br_nf_pre_routing() to
>    make sure ipv6_mod_enabled() is true before doing the ipv6 stack
>    "emulation".
> 
> Makes sense?
IMHO sure.

Shortly, I will send a couple patches proposing the above changes.
(Or my best understanding about them :) ) 

> 
> Thanks,
> Florian

Thank you,
Leonardo Bras

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] seccomp: fix compilation errors in seccomp-bpf kselftest
From: shuah @ 2019-08-30 14:09 UTC (permalink / raw)
  To: Alakesh Haloi, Kees Cook, Andy Lutomirski, Will Drewry,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song
  Cc: linux-kselftest, netdev, bpf, linux-kernel, shuah
In-Reply-To: <20190822215823.GA11292@ip-172-31-44-144.us-west-2.compute.internal>

On 8/22/19 3:58 PM, Alakesh Haloi wrote:
> Without this patch we see following error while building and kselftest
> for secccomp_bpf fails.
> 
> seccomp_bpf.c:1787:20: error: ‘PTRACE_EVENTMSG_SYSCALL_ENTRY’ undeclared (first use in this function);
> seccomp_bpf.c:1788:6: error: ‘PTRACE_EVENTMSG_SYSCALL_EXIT’ undeclared (first use in this function);
> 
> Signed-off-by: Alakesh Haloi <alakesh.haloi@gmail.com>
> ---
>   tools/testing/selftests/seccomp/seccomp_bpf.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 6ef7f16c4cf5..2e619760fc3e 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1353,6 +1353,14 @@ TEST_F(precedence, log_is_fifth_in_any_order)
>   #define PTRACE_EVENT_SECCOMP 7
>   #endif
>   
> +#ifndef PTRACE_EVENTMSG_SYSCALL_ENTRY
> +#define PTRACE_EVENTMSG_SYSCALL_ENTRY 1
> +#endif
> +
> +#ifndef PTRACE_EVENTMSG_SYSCALL_EXIT
> +#define PTRACE_EVENTMSG_SYSCALL_EXIT 2
> +#endif
> +
>   #define IS_SECCOMP_EVENT(status) ((status >> 16) == PTRACE_EVENT_SECCOMP)
>   bool tracer_running;
>   void tracer_stop(int sig)
> 

Hi Kees,

Okay to apply this one for 5.4-rc1. Or is this going through bpf tree?
If it is going through bpf tree:

Acked-by: Shuah Khan <skhan@linuxfoundation.org>

thanks,
-- Shuah

^ permalink raw reply

* bug report: libceph: follow redirect replies from osds
From: Colin Ian King @ 2019-08-30 14:05 UTC (permalink / raw)
  To: Ilya Dryomov, Jeff Layton, Sage Weil, David S. Miller, ceph-devel,
	netdev@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org

Hi,

Static analysis with Coverity has picked up an issue with commit:

commit 205ee1187a671c3b067d7f1e974903b44036f270
Author: Ilya Dryomov <ilya.dryomov@inktank.com>
Date:   Mon Jan 27 17:40:20 2014 +0200

    libceph: follow redirect replies from osds

Specifically in function ceph_redirect_decode in net/ceph/osd_client.c:

3485
3486        len = ceph_decode_32(p);

CID 17904: Unused value (UNUSED_VALUE)

3487        *p += len; /* skip osd_instructions */
3488
3489        /* skip the rest */
3490        *p = struct_end;

The double write to *p looks wrong, I suspect the *p += len; should be
just incrementing pointer p as in: p += len.  Am I correct to assume
this is the correct fix?

Colin

^ permalink raw reply

* Re: [PATCH v2 1/6] mdev: Introduce sha1 based mdev alias
From: Cornelia Huck @ 2019-08-30 14:02 UTC (permalink / raw)
  To: Parav Pandit
  Cc: alex.williamson@redhat.com, Jiri Pirko, kwankhede@nvidia.com,
	davem@davemloft.net, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <AM0PR05MB486621283F935B673455DA63D1BD0@AM0PR05MB4866.eurprd05.prod.outlook.com>

On Fri, 30 Aug 2019 12:58:04 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Cornelia Huck <cohuck@redhat.com>
> > Sent: Friday, August 30, 2019 6:09 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; netdev@vger.kernel.org
> > Subject: Re: [PATCH v2 1/6] mdev: Introduce sha1 based mdev alias
> > 
> > On Fri, 30 Aug 2019 12:33:22 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > > -----Original Message-----
> > > > From: Cornelia Huck <cohuck@redhat.com>
> > > > Sent: Friday, August 30, 2019 2:47 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: alex.williamson@redhat.com; Jiri Pirko <jiri@mellanox.com>;
> > > > kwankhede@nvidia.com; davem@davemloft.net; kvm@vger.kernel.org;
> > > > linux- kernel@vger.kernel.org; netdev@vger.kernel.org
> > > > Subject: Re: [PATCH v2 1/6] mdev: Introduce sha1 based mdev alias
> > > >
> > > > On Thu, 29 Aug 2019 06:18:59 -0500
> > > > Parav Pandit <parav@mellanox.com> wrote:
> > > >  
> > > > > Some vendor drivers want an identifier for an mdev device that is
> > > > > shorter than the UUID, due to length restrictions in the consumers
> > > > > of that identifier.
> > > > >
> > > > > Add a callback that allows a vendor driver to request an alias of
> > > > > a specified length to be generated for an mdev device. If
> > > > > generated, that alias is checked for collisions.
> > > > >
> > > > > It is an optional attribute.
> > > > > mdev alias is generated using sha1 from the mdev name.
> > > > >
> > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > >
> > > > > ---
> > > > > Changelog:
> > > > > v1->v2:
> > > > >  - Kept mdev_device naturally aligned
> > > > >  - Added error checking for crypt_*() calls
> > > > >  - Corrected a typo from 'and' to 'an'
> > > > >  - Changed return type of generate_alias() from int to char*
> > > > > v0->v1:
> > > > >  - Moved alias length check outside of the parent lock
> > > > >  - Moved alias and digest allocation from kvzalloc to kzalloc
> > > > >  - &alias[0] changed to alias
> > > > >  - alias_length check is nested under get_alias_length callback
> > > > > check
> > > > >  - Changed comments to start with an empty line
> > > > >  - Fixed cleaunup of hash if mdev_bus_register() fails
> > > > >  - Added comment where alias memory ownership is handed over to
> > > > > mdev device
> > > > >  - Updated commit log to indicate motivation for this feature
> > > > > ---
> > > > >  drivers/vfio/mdev/mdev_core.c    | 123  
> > > > ++++++++++++++++++++++++++++++-  
> > > > >  drivers/vfio/mdev/mdev_private.h |   5 +-
> > > > >  drivers/vfio/mdev/mdev_sysfs.c   |  13 ++--
> > > > >  include/linux/mdev.h             |   4 +
> > > > >  4 files changed, 135 insertions(+), 10 deletions(-)  
> >   
> > > > ...and detached from the local variable here. Who is freeing it? The
> > > > comment states that it is done by the mdev, but I don't see it?
> > > >  
> > > mdev_device_free() frees it.  
> > 
> > Ah yes, I overlooked the kfree().
> >   
> > > once its assigned to mdev, mdev is the owner of it.
> > >  
> > > > This detour via the local variable looks weird to me. Can you either
> > > > create the alias directly in the mdev (would need to happen later in
> > > > the function, but I'm not sure why you generate the alias before
> > > > checking for duplicates anyway), or do an explicit copy?  
> > > Alias duplicate check is done after generating it, because duplicate alias are  
> > not allowed.  
> > > The probability of collision is rare.
> > > So it is speculatively generated without hold the lock, because there is no  
> > need to hold the lock.  
> > > It is compared along with guid while mutex lock is held in single loop.
> > > And if it is duplicate, there is no need to allocate mdev.
> > >
> > > It will be sub optimal to run through the mdev list 2nd time after mdev  
> > creation and after generating alias for duplicate check.
> > 
> > Ok, but what about copying it? I find this "set local variable to NULL after
> > ownership is transferred" pattern a bit unintuitive. Copying it to the mdev (and
> > then unconditionally freeing it) looks more obvious to me.  
> Its not unconditionally freed. 

That's not what I have been saying :(

> Its freed in the error unwinding path.
> I think its ok along with the comment that describes this error path area.

It is not wrong, but I'm not sure I like it.

^ permalink raw reply

* Re: [PATCH 2/3] samples: pktgen: add helper functions for IP(v4/v6) CIDR parsing
From: Jesper Dangaard Brouer @ 2019-08-30 13:27 UTC (permalink / raw)
  To: Daniel T. Lee; +Cc: David S . Miller, netdev, brouer
In-Reply-To: <20190828204243.16666-2-danieltimlee@gmail.com>

On Thu, 29 Aug 2019 05:42:42 +0900
"Daniel T. Lee" <danieltimlee@gmail.com> wrote:

> This commit adds CIDR parsing and IP validate helper function to parse
> single IP or range of IP with CIDR. (e.g. 198.18.0.0/15)
> 
> Helpers will be used in prior to set target address in samples/pktgen.
> 
> Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> ---
>  samples/pktgen/functions.sh | 134 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 134 insertions(+)
> 
> diff --git a/samples/pktgen/functions.sh b/samples/pktgen/functions.sh
> index 4af4046d71be..eb1c52e25018 100644
> --- a/samples/pktgen/functions.sh
> +++ b/samples/pktgen/functions.sh
> @@ -163,6 +163,140 @@ function get_node_cpus()
>  	echo $node_cpu_list
>  }
>  
> +# Extend shrunken IPv6 address.
> +# fe80::42:bcff:fe84:e10a => fe80:0:0:0:42:bcff:fe84:e10a
> +function extend_addr6()
> +{
> +    local addr=$1
> +    local sep=:
> +    local sep2=::
> +    local sep_cnt=$(tr -cd $sep <<< $1 | wc -c)
> +    local shrink
> +
> +    # separator count : should be between 2, 7.
> +    if [[ $sep_cnt -lt 2 || $sep_cnt -gt 7 ]]; then
> +        err 5 "Invalid IP6 address sep: $1"
> +    fi
> +
> +    # if shrink '::' occurs multiple, it's malformed.
> +    shrink=( $(egrep -o "$sep{2,}" <<< $addr) )
> +    if [[ ${#shrink[@]} -ne 0 ]]; then
> +        if [[ ${#shrink[@]} -gt 1 || ( ${shrink[0]} != $sep2 ) ]]; then
> +            err 5 "Invalid IP$IP6 address shr: $1"
> +        fi
> +    fi
> +
> +    # add 0 at begin & end, and extend addr by adding :0
> +    [[ ${addr:0:1} == $sep ]] && addr=0${addr}
> +    [[ ${addr: -1} == $sep ]] && addr=${addr}0
> +    echo "${addr/$sep2/$(printf ':0%.s' $(seq $[8-sep_cnt])):}"
> +}
> +
> +
> +# Given a single IP(v4/v6) address, whether it is valid.
> +function validate_addr()
> +{
> +    # check function is called with (funcname)6
> +    [[ ${FUNCNAME[1]: -1} == 6 ]] && local IP6=6
> +    local len=$[ IP6 ? 8 : 4 ]
> +    local max=$[ 2**(len*2)-1 ]
> +    local addr
> +    local sep
> +
> +    # set separator for each IP(v4/v6)
> +    [[ $IP6 ]] && sep=: || sep=.
> +    IFS=$sep read -a addr <<< $1
> +
> +    # array length
> +    if [[ ${#addr[@]} != $len ]]; then
> +        err 5 "Invalid IP$IP6 address: $1"
> +    fi
> +
> +    # check each digit between 0, $max
> +    for digit in "${addr[@]}"; do
> +        [[ $IP6 ]] && digit=$[ 16#$digit ]
> +        if [[ $digit -lt 0 || $digit -gt $max ]]; then
> +            err 5 "Invalid IP$IP6 address: $1"
> +        fi
> +    done
> +
> +    return 0
> +}
> +
> +function validate_addr6() { validate_addr $@ ; }
> +
> +# Given a single IP(v4/v6) or CIDR, return minimum and maximum IP addr.
> +function parse_addr()

I must say that I'm impressed by your bash-shell skills, BUT below
function does look too complicated for doing this... I were expecting
that you would use the regular & (AND) operation to do the prefix
masking.


> +{
> +    # check function is called with (funcname)6
> +    [[ ${FUNCNAME[1]: -1} == 6 ]] && local IP6=6
> +    local bitlen=$[ IP6 ? 128 : 32 ]
> +
> +    local addr=$1
> +    local net
> +    local prefix
> +    local min_ip
> +    local max_ip
> +
> +    IFS='/' read net prefix <<< $addr
> +    [[ $IP6 ]] && net=$(extend_addr6 $net)
> +    validate_addr$IP6 $net
> +
> +    if [[ $prefix -gt $bitlen ]]; then
> +        err 5 "Invalid prefix: $prefix"
> +    elif [[ -z $prefix ]]; then
> +        min_ip=$net
> +        max_ip=$net
> +    else
> +        # defining array for converting Decimal 2 Binary
> +        # 00000000 00000001 00000010 00000011 00000100 ...
> +        local d2b='{0..1}{0..1}{0..1}{0..1}{0..1}{0..1}{0..1}{0..1}'
> +        [[ $IP6 ]] && d2b+=$d2b
> +        eval local D2B=($d2b)
> +
> +        local shift=$[ bitlen-prefix ]
> +        local ip_bit
> +        local ip
> +        local sep
> +
> +        # set separator for each IP(v4/v6)
> +        [[ $IP6 ]] && sep=: || sep=.
> +        IFS=$sep read -ra ip <<< $net
> +
> +        # build full size bit
> +        for digit in "${ip[@]}"; do
> +            [[ $IP6 ]] && digit=$[ 16#$digit ]
> +            ip_bit+=${D2B[$digit]}
> +        done
> +
> +        # fill 0 or 1 by $shift
> +        base_bit=${ip_bit::$prefix}
> +        min_bit="$base_bit$(printf '0%.s' $(seq $shift))"
> +        max_bit="$base_bit$(printf '1%.s' $(seq $shift))"
> +
> +        bit2addr() {
> +            local step=$[ IP6 ? 16 : 8 ]
> +            local max=$[ bitlen-step ]
> +            local result
> +            local fmt
> +            [[ $IP6 ]] && fmt='%X' || fmt='%d'
> +
> +            for i in $(seq 0 $step $max); do
> +                result+=$(printf $fmt $[ 2#${1:$i:$step} ])
> +                [[ $i != $max ]] && result+=$sep
> +            done
> +            echo $result
> +        }
> +
> +        min_ip=$(bit2addr $min_bit)
> +        max_ip=$(bit2addr $max_bit)
> +    fi
> +
> +    echo $min_ip $max_ip
> +}
> +
> +function parse_addr6() { parse_addr $@ ; }
> +
>  # Given a single or range of port(s), return minimum and maximum port number.
>  function parse_ports()
>  {



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox