Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v12 03/10] netdev: cavium: octeon: Add Octeon III BGX Ethernet Nexus
From: Jiri Pirko @ 2018-06-29  6:13 UTC (permalink / raw)
  To: David Miller; +Cc: cmunoz, andrew, steven.hill, netdev, cchavva
In-Reply-To: <20180629.111905.863023692523959153.davem@davemloft.net>

Fri, Jun 29, 2018 at 04:19:05AM CEST, davem@davemloft.net wrote:
>From: Carlos Munoz <cmunoz@cavium.com>
>Date: Thu, 28 Jun 2018 14:20:05 -0700
>
>> 
>> 
>> On 06/28/2018 01:41 AM, Andrew Lunn wrote:
>>> External Email
>>>
>>>> +static char *mix_port;
>>>> +module_param(mix_port, charp, 0444);
>>>> +MODULE_PARM_DESC(mix_port, "Specifies which ports connect to MIX interfaces.");
>>>> +
>>>> +static char *pki_port;
>>>> +module_param(pki_port, charp, 0444);
>>>> +MODULE_PARM_DESC(pki_port, "Specifies which ports connect to the PKI.");
>>> Module parameters are generally not liked. Can you do without them?
>> 
>> These parameters change the kernel port assignment required by user
>> space applications. We rather keep them as they simplify the
>> process.
>
>This is actually a terrible user experience.
>
>Please provide a way to do this by performing operations on a device object
>after the driver loads.
>
>Use something like devlink or similar if you have to.

Devlink params should be used for this. They are not upstream yet. We
will push it most likely early next week.

^ permalink raw reply

* Re: [PATCH v12 03/10] netdev: cavium: octeon: Add Octeon III BGX Ethernet Nexus
From: David Miller @ 2018-06-29  6:21 UTC (permalink / raw)
  To: Chandrakala.Chavva; +Cc: Carlos.Munoz, andrew, Steven.Hill, netdev
In-Reply-To: <BYAPR07MB4325916FAAE80A23219174C2E14E0@BYAPR07MB4325.namprd07.prod.outlook.com>

From: "Chavva, Chandrakala" <Chandrakala.Chavva@cavium.com>
Date: Fri, 29 Jun 2018 03:30:51 +0000

> How can we support NFS boot if pass the parameters via
> devlink. Basically this determines what phy to use from device tree.

initrd.

^ permalink raw reply

* Re: [PATCH v4] net: Remove depends on HAS_DMA in case of platform dependency
From: Kalle Valo @ 2018-06-29  6:22 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: David S . Miller, Yisen Zhuang, Sergey Matyukevich, Salil Mehta,
	Igor Mitsyanko, Avinash Patil, Wright Feng, Sergei Shtylyov,
	Quan Nguyen, Keyur Chudgar, Jiri Pirko, Iyappan Subramanian,
	Ido Schimmel, Hante Meuleman, Franky Lin, Chi-Hsien Lin,
	Arend van Spriel, netdev, linux-kernel
In-Reply-To: <20180622110843.31965-1-geert@linux-m68k.org>

Geert Uytterhoeven <geert@linux-m68k.org> writes:

> Remove dependencies on HAS_DMA where a Kconfig symbol depends on another
> symbol that implies HAS_DMA, and, optionally, on "|| COMPILE_TEST".
> In most cases this other symbol is an architecture or platform specific
> symbol, or PCI.
>
> Generic symbols and drivers without platform dependencies keep their
> dependencies on HAS_DMA, to prevent compiling subsystems or drivers that
> cannot work anyway.
>
> This simplifies the dependencies, and allows to improve compile-testing.
>
> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
> Reviewed-by: Mark Brown <broonie@kernel.org>
> Acked-by: Robin Murphy <robin.murphy@arm.com>
> ---
> v4:
>   - Rebase to v4.18-rc1 (applies to next-20180622, too),
>
> v3:
>   - Rebase to v4.17-rc1,
>   - Drop obsolete note about FSL_FMAN,
>
> v2:
>   - Add Reviewed-by, Acked-by,
>   - Drop RFC state,
>   - Split per subsystem.
> ---
>  drivers/net/ethernet/amd/Kconfig                | 2 +-
>  drivers/net/ethernet/apm/xgene-v2/Kconfig       | 1 -
>  drivers/net/ethernet/apm/xgene/Kconfig          | 1 -
>  drivers/net/ethernet/arc/Kconfig                | 6 ++++--
>  drivers/net/ethernet/broadcom/Kconfig           | 2 --
>  drivers/net/ethernet/calxeda/Kconfig            | 2 +-
>  drivers/net/ethernet/hisilicon/Kconfig          | 2 +-
>  drivers/net/ethernet/marvell/Kconfig            | 8 +++-----
>  drivers/net/ethernet/mellanox/mlxsw/Kconfig     | 2 +-
>  drivers/net/ethernet/renesas/Kconfig            | 2 --
>  drivers/net/wireless/broadcom/brcm80211/Kconfig | 1 -
>  drivers/net/wireless/quantenna/qtnfmac/Kconfig  | 2 +-
>  12 files changed, 12 insertions(+), 19 deletions(-)

For the wireless part:

Acked-by: Kalle Valo <kvalo@codeaurora.org>

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
From: Daniel Borkmann @ 2018-06-29  7:04 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Benc
  Cc: davem, Roopa Prabhu, jiri, jhs, xiyou.wangcong, oss-drivers,
	netdev, Pieter Jansen van Vuuren
In-Reply-To: <20180628095452.6f23fdf4@cakuba.netronome.com>

On 06/28/2018 06:54 PM, Jakub Kicinski wrote:
> On Thu, 28 Jun 2018 09:42:06 +0200, Jiri Benc wrote:
>> On Wed, 27 Jun 2018 11:49:49 +0200, Daniel Borkmann wrote:
>>> Looks good to me, and yes in BPF case a mask like TUNNEL_OPTIONS_PRESENT is
>>> right approach since this is opaque info and solely defined by the BPF prog
>>> that is using the generic helper.  
>>
>> Wouldn't it make sense to introduce some safeguards here (in a backward
>> compatible way, of course)? It's easy to mistakenly set data for a
>> different tunnel type in a BPF program and then be surprised by the
>> result. It might help users if such usage was detected by the kernel,
>> one way or another.
> 
> Well, that's how it works today ;)

Well, it was designed like that on purpose, to be i) agnostic of the underlying
device, ii) to not clutter BPF API with tens of different APIs effectively doing
the same thing, and at the same time to avoid adding protocol specifics. E.g. at
least core bits of bpf_skb_{set,get}_tunnel_key() will work whether I use vxlan
or geneve underneath (we are actually using it this way) and I could use things
like tun_id to encode custom meta data from BPF for either of them depending on flavor
picked by orchestration system. For the tunnel options in bpf_skb_{set,get}_tunnel_opt()
it's similar although here there needs to be awareness of the underlying dev depending
on whether you encode data into e.g. gbp or tlvs, etc. However, downside right now I
can see with a patch like below is that:

i) People might still just keep using 'TUNNEL_OPTIONS_PRESENT path' since available
and backwards compatible with current/older kernels, ii) we cut bits away from
size over time for each new tunnel proto added in future that would support tunnel
options, iii) that extension is one-sided (at least below) and same would be needed
in getter part, and iv) there needs to be a way for the case when folks add new
tunnel options where we don't need to worry that we forget updating BPF_F_TUN_*
each time otherwise this will easily slip through and again people will just rely
on using TUNNEL_OPTIONS_PRESENT catchall. Given latter and in particular point i)
I wouldn't think it's worth the pain, the APIs were added to BPF in v4.6 so this
would buy them 2 more years wrt kernel compatibility with same functionality level.
And point v), I just noticed the patch is actually buggy: size is ARG_CONST_SIZE and
verifier will attempt to check the value whether the buffer passed in argument 2 is
valid or not, so using flags here in upper bits would let verification fail, you'd
really have to make a new helper just for this.

Best,
Daniel

>> I'm thinking about something like the BPF program voluntarily
>> specifying the type of the data; if not specified, the wildcard would be
>> used as it is now.
> 
> Hmm... in practice we could steal top bits of the size parameter for
> some flags, since it seems to be limited to values < 256 today?  Is it
> worth it?
> 
> It would look something along the lines of:
> 
> ---
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6a40d7..194b40efa8e8 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2213,6 +2213,13 @@ enum bpf_func_id {
>  /* BPF_FUNC_perf_event_output for sk_buff input context. */
>  #define BPF_F_CTXLEN_MASK              (0xfffffULL << 32)
>  
> +#define BPF_F_TUN_VXLAN                        (1U << 31)
> +#define BPF_F_TUN_GENEVE               (1U << 30)
> +#define BPF_F_TUN_ERSPAN               (1U << 29)
> +#define BPF_F_TUN_FLAGS_ALL            (BPF_F_TUN_VXLAN | \
> +                                        BPF_F_TUN_GENEVE | \
> +                                        BPF_F_TUN_ERSPAN)
> +
>  /* Mode for BPF_FUNC_skb_adjust_room helper. */
>  enum bpf_adj_room_mode {
>         BPF_ADJ_ROOM_NET,
> diff --git a/net/core/filter.c b/net/core/filter.c
> index dade922678f6..cc592a1e8945 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3576,6 +3576,22 @@ BPF_CALL_3(bpf_skb_set_tunnel_opt, struct sk_buff *, skb,
>  {
>         struct ip_tunnel_info *info = skb_tunnel_info(skb);
>         const struct metadata_dst *md = this_cpu_ptr(md_dst);
> +       __be16 tun_flags;
> +       u32 flags;
> +
> +       BUILD_BUG_ON(BPF_F_TUN_FLAGS_ALL & IP_TUNNEL_OPTS_MAX);
> +
> +       flags = size & BPF_F_TUN_FLAGS_ALL;
> +       size &= ~flags;
> +       if (flags & BPF_F_TUN_VXLAN)
> +               tun_flags |= TUNNEL_VXLAN_OPT;
> +       if (flags & BPF_F_TUN_GENEVE)
> +               tun_flags |= TUNNEL_GENEVE_OPT;
> +       if (flags & BPF_F_TUN_ERSPAN)
> +               tun_flags |= TUNNEL_ERSPAN_OPT;
> +       /* User didn't specify the tunnel type, for backward compat set all */
> +       if (!(tun_flags & TUNNEL_OPTIONS_PRESENT))
> +               tun_flags |= TUNNEL_OPTIONS_PRESENT;
>  
>         if (unlikely(info != &md->u.tun_info || (size & (sizeof(u32) - 1))))
>                 return -EINVAL;
> 

^ permalink raw reply

* Re: [PATCH RFC 1/2] hwmon: Add support for power min, lcrit, min_alarm and lcrit_alarm
From: Andrew Lunn @ 2018-06-29  7:21 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: netdev, Florian Fainelli, Russell King, vadimp, linux-hwmon
In-Reply-To: <20180628224236.GB20118@roeck-us.net>

On Thu, Jun 28, 2018 at 03:42:36PM -0700, Guenter Roeck wrote:
> On Thu, Jun 28, 2018 at 10:41:14PM +0200, Andrew Lunn wrote:
> > Some sensors support reporting minimal and lower critical power, as
> > well as alarms when these thresholds are reached. Add support for
> > these attributes to the hwmon core.
> > 
> > Signed-off-by: Andrew Lunn <andrew@lunn.ch>
> 
> I am inclined to accept this patch immediately. I'll do that
> in the next couple of days unless someone gives me a good reason
> not to.

Hi Guenter

We need to watch out for merge dependencies. If you take it, you
probably should also take the second patch into your tree as
well. Otherwise, you need a stable branch DaveM can pull into net-next
if he takes the second patch.

I also have a patch to lm-sensors sensors, so it prints these
values. I will create a github pull request.

	Andrew

^ permalink raw reply

* Re: [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Daniel Borkmann @ 2018-06-29  7:25 UTC (permalink / raw)
  To: Tushar Dave, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan
In-Reply-To: <1529431217-5264-2-git-send-email-tushar.n.dave@oracle.com>

On 06/19/2018 08:00 PM, Tushar Dave wrote:
> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
> existing socket filter infrastructure for bpf program attach and load.
> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
> contrast to SOCKET_FILTER which deals with struct skb. This is useful
> for kernel entities that don't have skb to represent packet data but
> want to run eBPF socket filter on packet data that is in form of struct
> scatterlist e.g. IB/RDMA
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/linux/bpf_types.h      |  1 +
>  include/linux/filter.h         |  8 +++++
>  include/uapi/linux/bpf.h       |  7 ++++
>  kernel/bpf/syscall.c           |  1 +
>  kernel/bpf/verifier.c          |  1 +
>  net/core/filter.c              | 77 ++++++++++++++++++++++++++++++++++++++++--
>  samples/bpf/bpf_load.c         | 11 ++++--
>  tools/bpf/bpftool/prog.c       |  1 +
>  tools/include/uapi/linux/bpf.h |  7 ++++
>  tools/lib/bpf/libbpf.c         |  3 ++
>  tools/lib/bpf/libbpf.h         |  2 ++
>  11 files changed, 114 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index c5700c2..f8b4b56 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -16,6 +16,7 @@
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
>  #endif
>  #ifdef CONFIG_BPF_EVENTS
>  BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 45fc0f5..71618b1 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -517,6 +517,14 @@ struct bpf_skb_data_end {
>  	void *data_end;
>  };
>  
> +struct bpf_scatterlist {
> +	struct scatterlist *sg;
> +	void *start;
> +	void *end;
> +	int cur_sg;
> +	int num_sg;
> +};
> +
>  struct sk_msg_buff {
>  	void *data;
>  	void *data_end;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6..ef0a7b6 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -144,6 +144,7 @@ enum bpf_prog_type {
>  	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
>  	BPF_PROG_TYPE_LWT_SEG6LOCAL,
>  	BPF_PROG_TYPE_LIRC_MODE2,
> +	BPF_PROG_TYPE_SOCKET_SG_FILTER,
>  };
>  
>  enum bpf_attach_type {
> @@ -2358,6 +2359,12 @@ enum sk_action {
>  	SK_PASS,
>  };
>  
> +/* use accessible scatterlist */
> +struct sg_filter_md {
> +	void *data; /* sg_virt(sg) */
> +	void *data_end; /* sg_virt(sg) + sg->length */
> +};
> +
>  /* user accessible metadata for SK_MSG packet hook, new fields must
>   * be added to the end of this structure
>   */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0fa2062..74193a8 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1300,6 +1300,7 @@ static int bpf_prog_load(union bpf_attr *attr)
>  
>  	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
>  	    type != BPF_PROG_TYPE_CGROUP_SKB &&
> +	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
>  	    !capable(CAP_SYS_ADMIN))
>  		return -EPERM;
>  
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d6403b5..a00d3eb 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1320,6 +1320,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>  	case BPF_PROG_TYPE_LWT_XMIT:
>  	case BPF_PROG_TYPE_SK_SKB:
>  	case BPF_PROG_TYPE_SK_MSG:
> +	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
>  		if (meta)
>  			return meta->pkt_access;
>  
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3d9ba7e..8f67942 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1130,7 +1130,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
>  
>  static void __bpf_prog_release(struct bpf_prog *prog)
>  {
> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>  		bpf_prog_put(prog);
>  	} else {
>  		bpf_release_orig_filter(prog);
> @@ -1551,10 +1552,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>  
>  static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
>  {
> +	struct bpf_prog *prog;
> +
>  	if (sock_flag(sk, SOCK_FILTER_LOCKED))
>  		return ERR_PTR(-EPERM);
>  
> -	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
> +	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
> +	if (IS_ERR(prog))
> +		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
> +
> +	return prog;
>  }

Hmm, I don't think this works: this now means as unpriviledged I can attach a new
BPF_PROG_TYPE_SOCKET_SG_FILTER to a non-rds socket e.g. normal tcp/udp through the
SO_ATTACH_BPF sockopt, where input context is skb instead of sg list and thus crash
my box?

^ permalink raw reply

* Re: [PATCH] cfg80211: use IDA to allocate wiphy indeces
From: Johannes Berg @ 2018-06-29  7:42 UTC (permalink / raw)
  To: Brian Norris; +Cc: linux-kernel, linux-wireless, netdev
In-Reply-To: <20180621012945.185705-1-briannorris@chromium.org>

Hi Brian,

On Wed, 2018-06-20 at 18:29 -0700, Brian Norris wrote:
> It's annoying to see the phy index increase arbitrarily, just because a
> device got removed and re-probed (e.g., during a device reset, or due to
> probe testing). We can use the in-kernel index allocator for this,
> instead of just an increasing counter.

I can understand that it's somewhat annoying to people, but it was
actually done on purpose to avoid userspace talking to the wrong device.

Imagine you have some userspace process running that has remembered the
wiphy index to use it to talk to nl80211, and now underneath the device
goes away and reappears. This process should understand that situation,
and handle it accordingly, rather than being blind to the reset.

johannes

^ permalink raw reply

* Re: [PATCH RFC 2/2] net: phy: sfp: Add HWMON support for module sensors
From: Andrew Lunn @ 2018-06-29  7:45 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: netdev, Florian Fainelli, Russell King, vadimp, linux-hwmon
In-Reply-To: <20180628224123.GA20118@roeck-us.net>

> > +	case hwmon_power:
> > +		/* External calibration of receive power requires
> > +		 * floating point arithmetic. Doing that in the kernel
> > +		 * is not easy, so just skip it. If the module does
> > +		 * not require external calibration, we can however
> > +		 * show receiver power, since FP is then not needed.
> > +		 */
> > +		if (sfp->id.ext.diagmon & SFP_DIAGMON_EXT_CAL &&
> > +		    channel == 1)
> > +			return 0;
> 
> It would be nice if it was possible to convert the floting point to
> a fixed point calculation. Would that be possible ?

Maybe. I decided to leave it for later.

The kernel has some support for emulating floating point hardware, by
doing floating point operations in software. I didn't find any
examples of using that code outside of emulation, but i wondered if it
would be possible to use it here. We don't need high performance here,
when just reading a sensor once per second.

> > +/* Sensors values are stored as two bytes, MSB second */
> > +static int sfp_hwmon_read_sensor(struct sfp *sfp, int reg, long *value)
> > +{
> > +	u8 val[2];
> > +	int err;
> > +
> > +	err = sfp_read(sfp, true, reg, val, 2);
> > +	if (err < 0)
> > +		return err;
> > +
> > +	*value = val[0] << 8 | val[1];
> > +
> 
> Any chance to use something like __be16 and be16_to_cpu() ?
> You do that elsewhere - why not here ?

Yes. I want to look at this again. I don't like it either.

> > +	for (i = j = 0; sfp->hwmon_name[i]; i++) {
> > +		if (isalnum(sfp->hwmon_name[i])) {
> > +			if (i != j)
> > +				sfp->hwmon_name[j] = sfp->hwmon_name[i];
> > +			j++;
> > +		}
> > +	}
> 
> It might be better and simpler to replace invalid characters with '_'
> instead of dropping them. Also note that '_' is a valid character.
> Strictly speaking only "-* \t\n" are invalid.

I borrowed this code from the marvell10g driver. I don't know where it
borrowed it from. Is there a hwmon core function which we can pass an
arbitrary name to and it returned a sanitised one? Maybe we should add
one?

> > +	sfp->hwmon_name[j] = '\0';
> > +
> Is it possible that j == 0 ?

Hummm....

sfp->hwmon_name is derived from dev_name(sfp->dev), which comes from
pdev->dev in the probe function. That comes from the device tree node
name. I suppose it is possible to name the node $@#$@, but i suspect
Rob would NACK it :-)

I can add a check for j==0 and return -EINVAL.
 
> > +	sfp->hwmon_dev = devm_hwmon_device_register_with_info(sfp->dev,
> > +				sfp->hwmon_name, sfp, &sfp_hwmon_chip_info,
> > +				NULL);
> > +
> > +	return PTR_ERR_OR_ZERO(sfp->hwmon_dev);
> > +}
> > +
> > +static void sfp_hwmon_remove(struct sfp *sfp)
> > +{
> > +	devm_hwmon_device_unregister(sfp->hwmon_dev);
> 
> If registartion and removal are not tied to a device, it doesn't make sense
> to use devm_ functions. Either use hwmon_device_register_with_info()
> and hwmon_device_unregister(), or drop the remove function.

Yes. I can change it. We have a few different lifetimes involved
here. You can consider the driver probe being for the SFP cage. The
SFP module being inserted into the cage is a different life time, and
the lifetime of the hwmon device.

    Andrew

^ permalink raw reply

* pull-request: mac80211 2018-06-29
From: Johannes Berg @ 2018-06-29  7:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless

Hi Dave,

For the current release, I have a few fixes. No sense waiting
for more, and most of these have been around for a while.

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 4205c88eaf17b5f3ee30032d68df55cd5d9077a1:

  net: stmmac: Set DMA buffer size in HW (2018-06-28 22:24:25 +0900)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2018-06-29

for you to fetch changes up to 95bca62fb723a121954fc7ae5473bb2c1f0d5986:

  nl80211: check nla_parse_nested() return values (2018-06-29 09:44:51 +0200)

----------------------------------------------------------------
Just three fixes:
 * fix HT operation in mesh mode
 * disable preemption in control frame TX
 * check nla_parse_nested() return values
   where missing (two places)

----------------------------------------------------------------
Bob Copeland (1):
      nl80211: relax ht operation checks for mesh

Denis Kenzior (1):
      mac80211: disable BHs/preemption in ieee80211_tx_control_port()

Johannes Berg (1):
      nl80211: check nla_parse_nested() return values

 net/mac80211/tx.c      |  2 ++
 net/wireless/nl80211.c | 35 ++++++++++++++---------------------
 2 files changed, 16 insertions(+), 21 deletions(-)

^ permalink raw reply

* [PATCH bpf v2 2/4] xsk: frame could be completed more than once in SKB path
From: Magnus Karlsson @ 2018-06-29  7:48 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, eric.dumazet,
	liu.song.a23
  Cc: qi.z.zhang, pavel
In-Reply-To: <1530258500-9126-1-git-send-email-magnus.karlsson@intel.com>

Fixed a bug in which a frame could be completed more than once
when an error was returned from dev_direct_xmit(). The code
erroneously retried sending the message leading to multiple
calls to the SKB destructor and therefore multiple completions
of the same buffer to user space.

The error code in this case has been changed from EAGAIN to EBUSY
in order to tell user space that the sending of the packet failed
and the buffer has been return to user space through the completion
queue.

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Reported-by: Pavel Odintsov <pavel@fastnetmon.com>
---
 net/xdp/xsk.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 3b3410ada097..d482f727f4c2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -268,15 +268,15 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 		skb->destructor = xsk_destruct_skb;
 
 		err = dev_direct_xmit(skb, xs->queue_id);
+		xskq_discard_desc(xs->tx);
 		/* Ignore NET_XMIT_CN as packet might have been sent */
 		if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
-			err = -EAGAIN;
-			/* SKB consumed by dev_direct_xmit() */
+			/* SKB completed but not sent */
+			err = -EBUSY;
 			goto out;
 		}
 
 		sent_frame = true;
-		xskq_discard_desc(xs->tx);
 	}
 
 out:
-- 
2.7.4

^ permalink raw reply related

* [PATCH bpf v2 1/4] xsk: fix potential lost completion message in SKB path
From: Magnus Karlsson @ 2018-06-29  7:48 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, eric.dumazet,
	liu.song.a23
  Cc: qi.z.zhang, pavel
In-Reply-To: <1530258500-9126-1-git-send-email-magnus.karlsson@intel.com>

The code in xskq_produce_addr erroneously checked if there
was up to LAZY_UPDATE_THRESHOLD amount of space in the completion
queue. It only needs to check if there is one slot left in the
queue. This bug could under some circumstances lead to a WARN_ON_ONCE
being triggered and the completion message to user space being lost.

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Reported-by: Pavel Odintsov <pavel@fastnetmon.com>
---
 net/xdp/xsk_queue.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index ef6a6f0ec949..52ecaf770642 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -62,14 +62,9 @@ static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 	return (entries > dcnt) ? dcnt : entries;
 }
 
-static inline u32 xskq_nb_free_lazy(struct xsk_queue *q, u32 producer)
-{
-	return q->nentries - (producer - q->cons_tail);
-}
-
 static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
 {
-	u32 free_entries = xskq_nb_free_lazy(q, producer);
+	u32 free_entries = q->nentries - (producer - q->cons_tail);
 
 	if (free_entries >= dcnt)
 		return free_entries;
@@ -129,7 +124,7 @@ static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr)
 {
 	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 
-	if (xskq_nb_free(q, q->prod_tail, LAZY_UPDATE_THRESHOLD) == 0)
+	if (xskq_nb_free(q, q->prod_tail, 1) == 0)
 		return -ENOSPC;
 
 	ring->desc[q->prod_tail++ & q->ring_mask] = addr;
-- 
2.7.4

^ permalink raw reply related

* [PATCH bpf v2 0/4] Bug fixes to the SKB TX path of AF_XDP
From: Magnus Karlsson @ 2018-06-29  7:48 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, eric.dumazet,
	liu.song.a23
  Cc: qi.z.zhang, pavel

This patch set fixes three bugs in the SKB TX path of AF_XDP.
Details in the individual commits.

The structure of the patch set is as follows:

Patch 1: Fix for lost completion message
Patch 2-3: Fix for possible multiple completions of single packet
Patch 4: Fix potential race during error

Changes from v1:

* Added explanation of race in commit message of patch 4.

/Magnus

Magnus Karlsson (4):
  xsk: fix potential lost completion message in SKB path
  xsk: frame could be completed more than once in SKB path
  samples/bpf: deal with EBUSY return code from sendmsg in xdpsock
    sample
  xsk: fix potential race in SKB TX completion code

 include/net/xdp_sock.h     |  4 ++++
 net/xdp/xsk.c              | 10 +++++++---
 net/xdp/xsk_queue.h        |  9 ++-------
 samples/bpf/xdpsock_user.c |  2 +-
 4 files changed, 14 insertions(+), 11 deletions(-)

^ permalink raw reply

* [PATCH bpf v2 3/4] samples/bpf: deal with EBUSY return code from sendmsg in xdpsock sample
From: Magnus Karlsson @ 2018-06-29  7:48 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, eric.dumazet,
	liu.song.a23
  Cc: qi.z.zhang, pavel
In-Reply-To: <1530258500-9126-1-git-send-email-magnus.karlsson@intel.com>

Sendmsg in the SKB path of AF_XDP can now return EBUSY when a packet
was discarded and completed by the driver. Just ignore this message
in the sample application.

Fixes: b4b8faa1ded7 ("samples/bpf: sample application and documentation for AF_XDP sockets")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Reported-by: Pavel Odintsov <pavel@fastnetmon.com>
---
 samples/bpf/xdpsock_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index d69c8d78d3fd..5904b1543831 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -729,7 +729,7 @@ static void kick_tx(int fd)
 	int ret;
 
 	ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
-	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
+	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN || errno == EBUSY)
 		return;
 	lassert(0);
 }
-- 
2.7.4

^ permalink raw reply related

* [PATCH bpf v2 4/4] xsk: fix potential race in SKB TX completion code
From: Magnus Karlsson @ 2018-06-29  7:48 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, eric.dumazet,
	liu.song.a23
  Cc: qi.z.zhang, pavel
In-Reply-To: <1530258500-9126-1-git-send-email-magnus.karlsson@intel.com>

There is a potential race in the TX completion code for the SKB
case. One process enters the sendmsg code of an AF_XDP socket in order
to send a frame. The execution eventually trickles down to the driver
that is told to send the packet. However, it decides to drop the
packet due to some error condition (e.g., rings full) and frees the
SKB. This will trigger the SKB destructor and a completion will be
sent to the AF_XDP user space through its
single-producer/single-consumer queues.

At the same time a TX interrupt has fired on another core and it
dispatches the TX completion code in the driver. It does its HW
specific things and ends up freeing the SKB associated with the
transmitted packet. This will trigger the SKB destructor and a
completion will be sent to the AF_XDP user space through its
single-producer/single-consumer queues. With a pseudo call stack, it
would look like this:

Core 1:
sendmsg() being called in the application
  netdev_start_xmit()
    Driver entered through ndo_start_xmit
      Driver decides to free the SKB for some reason (e.g., rings full)
        Destructor of SKB called
          xskq_produce_addr() is called to signal completion to user space

Core 2:
TX completion irq
  NAPI loop
    Driver irq handler for TX completions
      Frees the SKB
        Destructor of SKB called
          xskq_produce_addr() is called to signal completion to user space

We now have a violation of the single-producer/single-consumer
principle for our queues as there are two threads trying to produce at
the same time on the same queue.

Fixed by introducing a spin_lock in the destructor. In regards to the
performance, I get around 1.74 Mpps for txonly before and after the
introduction of the spinlock. There is of course some impact due to
the spin lock but it is in the less significant digits that are too
noisy for me to measure. But let us say that the version without the
spin lock got 1.745 Mpps in the best case and the version with 1.735
Mpps in the worst case, then that would mean a maximum drop in
performance of 0.5%.

Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h | 4 ++++
 net/xdp/xsk.c          | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 9fe472f2ac95..7161856bcf9c 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -60,6 +60,10 @@ struct xdp_sock {
 	bool zc;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
+	/* Mutual exclusion of NAPI TX thread and sendmsg error paths
+	 * in the SKB destructor callback.
+	 */
+	spinlock_t tx_completion_lock;
 	u64 rx_dropped;
 };
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index d482f727f4c2..650c4da8dc5a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -199,8 +199,11 @@ static void xsk_destruct_skb(struct sk_buff *skb)
 {
 	u64 addr = (u64)(long)skb_shinfo(skb)->destructor_arg;
 	struct xdp_sock *xs = xdp_sk(skb->sk);
+	unsigned long flags;
 
+	spin_lock_irqsave(&xs->tx_completion_lock, flags);
 	WARN_ON_ONCE(xskq_produce_addr(xs->umem->cq, addr));
+	spin_unlock_irqrestore(&xs->tx_completion_lock, flags);
 
 	sock_wfree(skb);
 }
@@ -754,6 +757,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 
 	xs = xdp_sk(sk);
 	mutex_init(&xs->mutex);
+	spin_lock_init(&xs->tx_completion_lock);
 
 	local_bh_disable();
 	sock_prot_inuse_add(net, &xsk_proto, 1);
-- 
2.7.4

^ permalink raw reply related

* [PATCH vhost] vhost_net: Fix too many vring kick on busypoll
From: Toshiaki Makita @ 2018-06-29  8:09 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang; +Cc: netdev, virtualization, kvm

Under heavy load vhost busypoll may run without suppressing
notification. For example tx zerocopy callback can push tx work while
handle_tx() is running, then busyloop exits due to vhost_has_work()
condition and enables notification but immediately reenters handle_tx()
because the pushed work was tx. In this case handle_tx() tries to
disable notification again, but when using event_idx it by design
cannot. Then busyloop will run without suppressing notification.
Another example is the case where handle_tx() tries to enable
notification but avail idx is advanced so disables it again. This case
also lead to the same situation with event_idx.

The problem is that once we enter this situation busyloop does not work
under heavy load for considerable amount of time, because notification
is likely to happen during busyloop and handle_tx() immediately enables
notification after notification happens. Specifically busyloop detects
notification by vhost_has_work() and then handle_tx() calls
vhost_enable_notify(). Because the detected work was the tx work, it
enters handle_tx(), and enters busyloop without suppression again.
This is likely to be repeated, so with event_idx we are almost not able
to suppress notification in this case.

To fix this, poll the work instead of enabling notification when
busypoll is interrupted by something. IMHO signal_pending() and
vhost_has_work() are kind of interruptions rather than signals to
completely cancel the busypoll, so let's run busypoll after the
necessary work is done. In order to avoid too long busyloop due to
interruption, save the endtime in vq field and use it when reentering
the work function.

I observed this problem makes tx performance poor but does not rx. I
guess it is because rx notification from socket is not that heavy so
did not impact the performance even when we failed to mask the
notification. Anyway for consistency I fixed rx routine as well as tx.

Performance numbers:

- Bulk transfer from guest to external physical server.
    [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server]
- Set 10us busypoll.
- Guest disables checksum and TSO because of host XDP.
- Measured single flow Mbps by netperf, and kicks by perf kvm stat
  (EPT_MISCONFIG event).

                            Before              After
                          Mbps  kicks/s      Mbps  kicks/s
UDP_STREAM 1472byte              247758                 27
                Send   3645.37            6958.10
                Recv   3588.56            6958.10
              1byte                9865                 37
                Send      4.34               5.43
                Recv      4.17               5.26
TCP_STREAM             8801.03    45794   9592.77     2884

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/vhost/net.c   | 94 +++++++++++++++++++++++++++++++++++----------------
 drivers/vhost/vhost.c |  1 +
 drivers/vhost/vhost.h |  1 +
 3 files changed, 66 insertions(+), 30 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index eeaf6739215f..0e85f628b965 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -391,13 +391,14 @@ static inline unsigned long busy_clock(void)
 	return local_clock() >> 10;
 }
 
-static bool vhost_can_busy_poll(struct vhost_dev *dev,
-				unsigned long endtime)
+static bool vhost_can_busy_poll(unsigned long endtime)
 {
-	return likely(!need_resched()) &&
-	       likely(!time_after(busy_clock(), endtime)) &&
-	       likely(!signal_pending(current)) &&
-	       !vhost_has_work(dev);
+	return likely(!need_resched() && !time_after(busy_clock(), endtime));
+}
+
+static bool vhost_busy_poll_interrupted(struct vhost_dev *dev)
+{
+	return unlikely(signal_pending(current)) || vhost_has_work(dev);
 }
 
 static void vhost_net_disable_vq(struct vhost_net *n,
@@ -437,10 +438,21 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
 
 	if (r == vq->num && vq->busyloop_timeout) {
 		preempt_disable();
-		endtime = busy_clock() + vq->busyloop_timeout;
-		while (vhost_can_busy_poll(vq->dev, endtime) &&
-		       vhost_vq_avail_empty(vq->dev, vq))
+		if (vq->busyloop_endtime) {
+			endtime = vq->busyloop_endtime;
+			vq->busyloop_endtime = 0;
+		} else {
+			endtime = busy_clock() + vq->busyloop_timeout;
+		}
+		while (vhost_can_busy_poll(endtime)) {
+			if (vhost_busy_poll_interrupted(vq->dev)) {
+				vq->busyloop_endtime = endtime;
+				break;
+			}
+			if (!vhost_vq_avail_empty(vq->dev, vq))
+				break;
 			cpu_relax();
+		}
 		preempt_enable();
 		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
 				      out_num, in_num, NULL, NULL);
@@ -509,12 +521,16 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
-			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
+			if (unlikely(vq->busyloop_endtime)) {
+				/* Busy poll is interrupted. */
+				vhost_poll_queue(&vq->poll);
+			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
 				vhost_disable_notify(&net->dev, vq);
 				continue;
 			}
 			break;
 		}
+		vq->busyloop_endtime = 0;
 		if (in) {
 			vq_err(vq, "Unexpected descriptor format for TX: "
 			       "out %d, int %d\n", out, in);
@@ -642,39 +658,51 @@ static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
 
 static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
 {
-	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
-	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
-	struct vhost_virtqueue *vq = &nvq->vq;
+	struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX];
+	struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX];
+	struct vhost_virtqueue *rvq = &rnvq->vq;
+	struct vhost_virtqueue *tvq = &tnvq->vq;
 	unsigned long uninitialized_var(endtime);
-	int len = peek_head_len(rvq, sk);
+	int len = peek_head_len(rnvq, sk);
 
-	if (!len && vq->busyloop_timeout) {
+	if (!len && tvq->busyloop_timeout) {
 		/* Flush batched heads first */
-		vhost_rx_signal_used(rvq);
+		vhost_rx_signal_used(rnvq);
 		/* Both tx vq and rx socket were polled here */
-		mutex_lock_nested(&vq->mutex, 1);
-		vhost_disable_notify(&net->dev, vq);
+		mutex_lock_nested(&tvq->mutex, 1);
+		vhost_disable_notify(&net->dev, tvq);
 
 		preempt_disable();
-		endtime = busy_clock() + vq->busyloop_timeout;
+		if (rvq->busyloop_endtime) {
+			endtime = rvq->busyloop_endtime;
+			rvq->busyloop_endtime = 0;
+		} else {
+			endtime = busy_clock() + tvq->busyloop_timeout;
+		}
 
-		while (vhost_can_busy_poll(&net->dev, endtime) &&
-		       !sk_has_rx_data(sk) &&
-		       vhost_vq_avail_empty(&net->dev, vq))
+		while (vhost_can_busy_poll(endtime)) {
+			if (vhost_busy_poll_interrupted(&net->dev)) {
+				rvq->busyloop_endtime = endtime;
+				break;
+			}
+			if (sk_has_rx_data(sk) ||
+			    !vhost_vq_avail_empty(&net->dev, tvq))
+				break;
 			cpu_relax();
+		}
 
 		preempt_enable();
 
-		if (!vhost_vq_avail_empty(&net->dev, vq))
-			vhost_poll_queue(&vq->poll);
-		else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
-			vhost_disable_notify(&net->dev, vq);
-			vhost_poll_queue(&vq->poll);
+		if (!vhost_vq_avail_empty(&net->dev, tvq)) {
+			vhost_poll_queue(&tvq->poll);
+		} else if (unlikely(vhost_enable_notify(&net->dev, tvq))) {
+			vhost_disable_notify(&net->dev, tvq);
+			vhost_poll_queue(&tvq->poll);
 		}
 
-		mutex_unlock(&vq->mutex);
+		mutex_unlock(&tvq->mutex);
 
-		len = peek_head_len(rvq, sk);
+		len = peek_head_len(rnvq, sk);
 	}
 
 	return len;
@@ -804,6 +832,7 @@ static void handle_rx(struct vhost_net *net)
 	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
 
 	while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
+		vq->busyloop_endtime = 0;
 		sock_len += sock_hlen;
 		vhost_len = sock_len + vhost_hlen;
 		headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
@@ -889,7 +918,12 @@ static void handle_rx(struct vhost_net *net)
 			goto out;
 		}
 	}
-	vhost_net_enable_vq(net, vq);
+	if (unlikely(vq->busyloop_endtime)) {
+		/* Busy poll is interrupted. */
+		vhost_poll_queue(&vq->poll);
+	} else {
+		vhost_net_enable_vq(net, vq);
+	}
 out:
 	vhost_rx_signal_used(nvq);
 	mutex_unlock(&vq->mutex);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9beefa6ed1ce..fe83578fe336 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -323,6 +323,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vhost_reset_is_le(vq);
 	vhost_disable_cross_endian(vq);
 	vq->busyloop_timeout = 0;
+	vq->busyloop_endtime = 0;
 	vq->umem = NULL;
 	vq->iotlb = NULL;
 	__vhost_vq_meta_reset(vq);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6c844b90a168..7e9cf80ccae9 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -144,6 +144,7 @@ struct vhost_virtqueue {
 	bool user_be;
 #endif
 	u32 busyloop_timeout;
+	unsigned long busyloop_endtime;
 };
 
 struct vhost_msg_node {
-- 
2.14.2

^ permalink raw reply related

* Re: [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper
From: Daniel Borkmann @ 2018-06-29  8:18 UTC (permalink / raw)
  To: Tushar Dave, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan
In-Reply-To: <1529431217-5264-3-git-send-email-tushar.n.dave@oracle.com>

On 06/19/2018 08:00 PM, Tushar Dave wrote:
> When sg_filter_run() is invoked it runs the attached eBPF
> SOCKET_SG_FILTER program which deals with struct scatterlist.
> 
> In addition, this patch also adds bpf_sg_next helper function that
> allows users to retrieve the next sg element from sg list.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/linux/filter.h                    |  2 +
>  include/uapi/linux/bpf.h                  | 10 ++++-
>  net/core/filter.c                         | 72 +++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h            | 10 ++++-
>  tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
>  5 files changed, 95 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 71618b1..d176402 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1072,4 +1072,6 @@ struct bpf_sock_ops_kern {
>  					 */
>  };
>  
> +int sg_filter_run(struct sock *sk, struct scatterlist *sg);
> +
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index ef0a7b6..036432b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2076,6 +2076,13 @@ struct bpf_stack_build_id {
>   * 	Return
>   * 		A 64-bit integer containing the current cgroup id based
>   * 		on the cgroup within which the current task is running.
> + *
> + * int bpf_sg_next(struct bpf_scatterlist *sg)
> + *	Description
> + *		This helper allows user to retrieve next sg element from
> + *		sg list.
> + *	Return
> + *		Returns 0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -2158,7 +2165,8 @@ struct bpf_stack_build_id {
>  	FN(rc_repeat),			\
>  	FN(rc_keydown),			\
>  	FN(skb_cgroup_id),		\
> -	FN(get_current_cgroup_id),
> +	FN(get_current_cgroup_id),	\
> +	FN(sg_next),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 8f67942..702ff5b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -121,6 +121,53 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>  }
>  EXPORT_SYMBOL(sk_filter_trim_cap);
>  
> +int sg_filter_run(struct sock *sk, struct scatterlist *sg)
> +{
> +	struct sk_filter *filter;
> +	int err;
> +
> +	rcu_read_lock();
> +	filter = rcu_dereference(sk->sk_filter);
> +	if (filter) {
> +		struct bpf_scatterlist bpfsg;
> +		int num_sg;
> +
> +		if (!sg) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		num_sg = sg_nents(sg);
> +		if (num_sg <= 0) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* We store a reference  to the sg list so it can later used by
> +		 * eBPF helpers to retrieve the next sg element.
> +		 */
> +		bpfsg.num_sg = num_sg;
> +		bpfsg.cur_sg = 0;
> +		bpfsg.sg = sg;
> +
> +		/* For the first sg element, we store the pkt access pointers
> +		 * into start and end so eBPF program can have pkt access using
> +		 * data and data_end. The pkt access for subsequent element of
> +		 * sg list is possible when eBPF program invokes bpf_sg_next
> +		 * which takes care of setting start and end to the correct sg
> +		 * element.
> +		 */
> +		bpfsg.start = sg_virt(sg);
> +		bpfsg.end = bpfsg.start + sg->length;
> +		BPF_PROG_RUN(filter->prog, &bpfsg);
> +	}
> +out:
> +	rcu_read_unlock();
> +
> +	return err;
> +}
> +EXPORT_SYMBOL(sg_filter_run);
> +
>  BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
>  {
>  	return skb_get_poff(skb);
> @@ -3753,6 +3800,29 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
>  	.arg1_type      = ARG_PTR_TO_CTX,
>  };
>  
> +BPF_CALL_1(bpf_sg_next, struct bpf_scatterlist *, bpfsg)
> +{
> +	struct scatterlist *sg = bpfsg->sg;
> +	int cur_sg = bpfsg->cur_sg;
> +
> +	cur_sg++;
> +	if (cur_sg >= bpfsg->num_sg)
> +		return -ENODATA;
> +
> +	bpfsg->cur_sg = cur_sg;
> +	bpfsg->start = sg_virt(&sg[cur_sg]);
> +	bpfsg->end = bpfsg->start + sg[cur_sg].length;
> +
> +	return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_sg_next_proto = {
> +	.func		= bpf_sg_next,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +};

Should be added to bpf_helper_changes_pkt_data() in order to enforce a reload
of all pkt pointers. Otherwise this is buggy in the sense that someone could only
reload pkt_end pointer in the prog while old pkt_start still points to previous
sg entry, so you would be able to access out of bounds.

Thanks,
Daniel

^ permalink raw reply

* Re: [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Daniel Borkmann @ 2018-06-29  8:27 UTC (permalink / raw)
  To: Tushar Dave, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan
In-Reply-To: <1529431217-5264-2-git-send-email-tushar.n.dave@oracle.com>

On 06/19/2018 08:00 PM, Tushar Dave wrote:
> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
> existing socket filter infrastructure for bpf program attach and load.
> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
> contrast to SOCKET_FILTER which deals with struct skb. This is useful
> for kernel entities that don't have skb to represent packet data but
> want to run eBPF socket filter on packet data that is in form of struct
> scatterlist e.g. IB/RDMA
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/linux/bpf_types.h      |  1 +
>  include/linux/filter.h         |  8 +++++
>  include/uapi/linux/bpf.h       |  7 ++++
>  kernel/bpf/syscall.c           |  1 +
>  kernel/bpf/verifier.c          |  1 +
>  net/core/filter.c              | 77 ++++++++++++++++++++++++++++++++++++++++--
>  samples/bpf/bpf_load.c         | 11 ++++--
>  tools/bpf/bpftool/prog.c       |  1 +
>  tools/include/uapi/linux/bpf.h |  7 ++++
>  tools/lib/bpf/libbpf.c         |  3 ++
>  tools/lib/bpf/libbpf.h         |  2 ++
>  11 files changed, 114 insertions(+), 5 deletions(-)
> 
[...]
>  
> +static bool socksg_filter_is_valid_access(int off, int size,
> +					  enum bpf_access_type type,
> +					  const struct bpf_prog *prog,
> +					  struct bpf_insn_access_aux *info)
> +{
> +	switch (off) {
> +	case offsetof(struct sg_filter_md, data):
> +		info->reg_type = PTR_TO_PACKET;
> +		break;
> +	case offsetof(struct sg_filter_md, data_end):
> +		info->reg_type = PTR_TO_PACKET_END;
> +		break;
> +	}
> +
> +	if (off < 0 || off >= sizeof(struct sg_filter_md))
> +		return false;
> +	if (off % size != 0)
> +		return false;
> +	if (size != sizeof(__u64))
> +		return false;
> +
> +	return true;
> +}

Just a note, don't know much about rds, but when you make this writeable for
rds/tcp you definitely must make sure that it can be handled properly in there,
meaning when program rewrites packet data that this data is private to the BPF
prog (to avoid races/corruption) and that the rewritten data is correctly handled
from there.

^ permalink raw reply

* Re: [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper
From: Daniel Borkmann @ 2018-06-29  8:32 UTC (permalink / raw)
  To: Tushar Dave, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan
In-Reply-To: <1529431217-5264-3-git-send-email-tushar.n.dave@oracle.com>

On 06/19/2018 08:00 PM, Tushar Dave wrote:
[...]
> +int sg_filter_run(struct sock *sk, struct scatterlist *sg)
> +{
> +	struct sk_filter *filter;
> +	int err;
> +
> +	rcu_read_lock();
> +	filter = rcu_dereference(sk->sk_filter);
> +	if (filter) {
> +		struct bpf_scatterlist bpfsg;
> +		int num_sg;
> +
> +		if (!sg) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		num_sg = sg_nents(sg);
> +		if (num_sg <= 0) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* We store a reference  to the sg list so it can later used by
> +		 * eBPF helpers to retrieve the next sg element.
> +		 */
> +		bpfsg.num_sg = num_sg;
> +		bpfsg.cur_sg = 0;
> +		bpfsg.sg = sg;
> +
> +		/* For the first sg element, we store the pkt access pointers
> +		 * into start and end so eBPF program can have pkt access using
> +		 * data and data_end. The pkt access for subsequent element of
> +		 * sg list is possible when eBPF program invokes bpf_sg_next
> +		 * which takes care of setting start and end to the correct sg
> +		 * element.
> +		 */
> +		bpfsg.start = sg_virt(sg);
> +		bpfsg.end = bpfsg.start + sg->length;
> +		BPF_PROG_RUN(filter->prog, &bpfsg);

Return code here from BPF prog is ignored entirely, I thought you wanted to
use it also for dropping packets? If UAPI would get frozen like this then it's
baked in stone.

> +	}
> +out:
> +	rcu_read_unlock();
> +
> +	return err;
> +}
> +EXPORT_SYMBOL(sg_filter_run);

^ permalink raw reply

* Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jiri Pirko @ 2018-06-29  8:39 UTC (permalink / raw)
  To: Cong Wang, g
  Cc: Linux Kernel Network Developers, David Miller, Jamal Hadi Salim,
	Jakub Kicinski, Simon Horman, john.hurley, David Ahern, mlxsw,
	sridhar.samudrala
In-Reply-To: <CAM_iQpVHQHHfi1oaFbEXsGK2c8ryjuZ8SxxR54Kpbh-BBKcEQg@mail.gmail.com>

Fri, Jun 29, 2018 at 12:25:53AM CEST, xiyou.wangcong@gmail.com wrote:
>On Thu, Jun 28, 2018 at 6:10 AM Jiri Pirko <jiri@resnulli.us> wrote:
>> Add a template of type flower allowing to insert rules matching on last
>> 2 bytes of destination mac address:
>> # tc chaintemplate add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
>>
>> The template is now showed in the list:
>> # tc chaintemplate show dev dummy0 ingress
>> chaintemplate flower chain 0
>>   dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
>>   eth_type ipv4
>>
>> Add another template, this time for chain number 22:
>> # tc chaintemplate add dev dummy0 ingress proto ip chain 22 flower dst_ip 0.0.0.0/16
>> # tc chaintemplate show dev dummy0 ingress
>> chaintemplate flower chain 0
>>   dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
>>   eth_type ipv4
>> chaintemplate flower chain 22
>>   eth_type ipv4
>>   dst_ip 0.0.0.0/16
>
>So, if I want to check the template of a chain, I have to use
>'tc chaintemplate... chain X'.
>
>If I want to check the filters in a chain, I have to use
>'tc filter show .... chain X'.
>
>If you introduce 'tc chain', it would just need one command:
>`tc chain show ... X` which could list its template first and
>followed by filters in this chain, something like:
>
># tc chain show dev eth0 chain X
>template: # could be none
>....
>filter1
>...
>filter2
>...
>
>Isn't it more elegant?

Well, that is just another iproute2 command. It would use the same
kernel uapi. Filters+templates. Sure, why not. Can be easily introduced.
Let's do it in a follow-up iproute2 patch.

^ permalink raw reply

* Re: [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
From: Daniel Borkmann @ 2018-06-29  8:48 UTC (permalink / raw)
  To: Daniel Borkmann, Tushar Dave, ast, davem, jakub.kicinski,
	quentin.monnet, jiong.wang, guro, sandipan, john.fastabend, kafai,
	rdna, brakmo, netdev, acme, sowmini.varadhan
In-Reply-To: <c16262f3-b751-2229-232e-8d3d245dfb1e@iogearbox.net>

On 06/29/2018 09:25 AM, Daniel Borkmann wrote:
> On 06/19/2018 08:00 PM, Tushar Dave wrote:
>> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
>> existing socket filter infrastructure for bpf program attach and load.
>> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
>> contrast to SOCKET_FILTER which deals with struct skb. This is useful
>> for kernel entities that don't have skb to represent packet data but
>> want to run eBPF socket filter on packet data that is in form of struct
>> scatterlist e.g. IB/RDMA
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
[...]
>>  static void __bpf_prog_release(struct bpf_prog *prog)
>>  {
>> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
>> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
>> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>>  		bpf_prog_put(prog);
>>  	} else {
>>  		bpf_release_orig_filter(prog);
>> @@ -1551,10 +1552,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>>  
>>  static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
>>  {
>> +	struct bpf_prog *prog;
>> +
>>  	if (sock_flag(sk, SOCK_FILTER_LOCKED))
>>  		return ERR_PTR(-EPERM);
>>  
>> -	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
>> +	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
>> +	if (IS_ERR(prog))
>> +		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
>> +
>> +	return prog;
>>  }
> 
> Hmm, I don't think this works: this now means as unpriviledged I can attach a new
> BPF_PROG_TYPE_SOCKET_SG_FILTER to a non-rds socket e.g. normal tcp/udp through the
> SO_ATTACH_BPF sockopt, where input context is skb instead of sg list and thus crash
> my box?

... probably best to just make a setsockopt specific to rds here, so the two are fully
separated.

Also worth exploring whether you can reuse as much as possible from the struct sk_msg_buff
context and in general the BPF_PROG_TYPE_SK_MSG type that is using this which we already
have in sockmap today. At least feels like some of the concepts are a bit similar. For
pulling in more payload you have bpf_msg_pull_data() there which I think might be more
user-friendly at least in that you have the full payload from start to the 'current' end
available and don't need to navigate through individual sg entries back/forth which could
perhaps end up being bit painful for users, though I can see that it's a middle ground
between some skb_load_bytes()-alike helper that would copy the pieces out of the sg entries
vs needing to linearize. What are the requirements here, would it make sense to offer both
as an option or is this impractical based on what you've measured?

^ permalink raw reply

* Re: WARNING: ODEBUG bug in __do_softirq
From: syzbot @ 2018-06-29  8:58 UTC (permalink / raw)
  To: anna.schumaker, bfields, davem, jlayton, linux-kernel, linux-nfs,
	netdev, syzkaller-bugs, trond.myklebust, trond.myklebust
In-Reply-To: <001a1143e62eb6a9510565640e76@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    509fda105ba8 samples/bpf: xdp_rxq_info action XDP_TX must ..
git tree:       bpf-next
console output: https://syzkaller.appspot.com/x/log.txt?x=107f20b8400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=a63be0c83e84d370
dashboard link: https://syzkaller.appspot.com/bug?extid=51c9bdfa559769d2f897
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=17990d88400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+51c9bdfa559769d2f897@syzkaller.appspotmail.com

random: crng init done
------------[ cut here ]------------
ODEBUG: free active (active state 0) object type: timer_list hint:  
xprt_init_autodisconnect+0x0/0x240 net/sunrpc/xprt.c:193
WARNING: CPU: 0 PID: 0 at lib/debugobjects.c:329  
debug_print_object+0x16a/0x210 lib/debugobjects.c:326
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.18.0-rc2+ #43
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
  panic+0x238/0x4e7 kernel/panic.c:184
  __warn.cold.8+0x163/0x1ba kernel/panic.c:536
  report_bug+0x252/0x2d0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:178 [inline]
  do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:debug_print_object+0x16a/0x210 lib/debugobjects.c:326
Code: 1a 88 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 92 00 00 00 48 8b 14 dd  
80 99 1a 88 4c 89 f6 48 c7 c7 00 8f 1a 88 e8 46 aa e6 fd <0f> 0b 83 05 79  
57 3e 06 01 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f
RSP: 0018:ffff8801dae07808 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: 0000000000000100 RSI: ffffffff816318f1 RDI: 0000000000000001
RBP: ffff8801dae07848 R08: ffffffff88e75dc0 R09: ffffed003b5c3ec2
R10: ffffed003b5c3ec2 R11: ffff8801dae1f617 R12: 0000000000000001
R13: ffffffff88f9ca20 R14: ffffffff881a93a0 R15: ffffffff81690d10
  __debug_check_no_obj_freed lib/debugobjects.c:783 [inline]
  debug_check_no_obj_freed+0x3b2/0x595 lib/debugobjects.c:815
  kfree+0xc7/0x260 mm/slab.c:3812
  __rcu_reclaim kernel/rcu/rcu.h:173 [inline]
  rcu_do_batch kernel/rcu/tree.c:2558 [inline]
  invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
  __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
  rcu_process_callbacks+0x1004/0x1850 kernel/rcu/tree.c:2802
  __do_softirq+0x2e8/0xb17 kernel/softirq.c:288
  invoke_softirq kernel/softirq.c:368 [inline]
  irq_exit+0x1d1/0x200 kernel/softirq.c:408
  exiting_irq arch/x86/include/asm/apic.h:527 [inline]
  smp_apic_timer_interrupt+0x186/0x730 arch/x86/kernel/apic/apic.c:1052
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
  </IRQ>
RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:54
Code: c7 48 89 45 d8 e8 da 87 27 fa 48 8b 45 d8 e9 d2 fe ff ff 48 89 df e8  
c9 87 27 fa eb 8a 90 90 90 90 90 90 90 55 48 89 e5 fb f4 <5d> c3 0f 1f 84  
00 00 00 00 00 55 48 89 e5 f4 5d c3 90 90 90 90 90
RSP: 0018:ffffffff88e07bc0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: 1ffffffff11c0f7b RCX: 0000000000000000
RDX: 1ffffffff11e3618 RSI: 0000000000000001 RDI: ffffffff88f1b0c0
RBP: ffffffff88e07bc0 R08: ffffed003b5c46d7 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff88e07c78 R14: ffffffff899eaae0 R15: 0000000000000000
  arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
  default_idle+0xc7/0x450 arch/x86/kernel/process.c:500
  arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:491
  default_idle_call+0x6d/0x90 kernel/sched/idle.c:93
  cpuidle_idle_call kernel/sched/idle.c:153 [inline]
  do_idle+0x3aa/0x570 kernel/sched/idle.c:262
  cpu_startup_entry+0x10c/0x120 kernel/sched/idle.c:368
  rest_init+0xe1/0xe4 init/main.c:442
  start_kernel+0x90e/0x949 init/main.c:738
  x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:452
  x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:433
  secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242

======================================================
WARNING: possible circular locking dependency detected
4.18.0-rc2+ #43 Not tainted
------------------------------------------------------
swapper/0/0 is trying to acquire lock:
000000006528fc84 ((console_sem).lock){-.-.}, at: down_trylock+0x13/0x70  
kernel/locking/semaphore.c:136

but task is already holding lock:
000000004bd165f9 (&obj_hash[i].lock){-.-.}, at: __debug_check_no_obj_freed  
lib/debugobjects.c:774 [inline]
000000004bd165f9 (&obj_hash[i].lock){-.-.}, at:  
debug_check_no_obj_freed+0x16c/0x595 lib/debugobjects.c:815

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&obj_hash[i].lock){-.-.}:
        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
        _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
        __debug_object_init+0x127/0x12e0 lib/debugobjects.c:381
        debug_object_init+0x16/0x20 lib/debugobjects.c:429
        debug_hrtimer_init kernel/time/hrtimer.c:410 [inline]
        debug_init kernel/time/hrtimer.c:458 [inline]
        hrtimer_init+0x97/0x480 kernel/time/hrtimer.c:1308
        init_dl_task_timer+0x1b/0x50 kernel/sched/deadline.c:1056
        __sched_fork+0x2ae/0x590 kernel/sched/core.c:2186
        init_idle+0x75/0x7a0 kernel/sched/core.c:5408
        sched_init+0xbf3/0xd2c kernel/sched/core.c:6106
        start_kernel+0x47d/0x949 init/main.c:602
        x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:452
        x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:433
        secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242

-> #2 (&rq->lock){-.-.}:
        __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
        _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:144
        rq_lock kernel/sched/sched.h:1805 [inline]
        task_fork_fair+0x93/0x680 kernel/sched/fair.c:9953
        sched_fork+0x446/0xb40 kernel/sched/core.c:2382
        copy_process.part.39+0x1c09/0x7220 kernel/fork.c:1773
        copy_process kernel/fork.c:1616 [inline]
        _do_fork+0x291/0x12a0 kernel/fork.c:2099
        kernel_thread+0x34/0x40 kernel/fork.c:2158
        rest_init+0x22/0xe4 init/main.c:408
        start_kernel+0x90e/0x949 init/main.c:738
        x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:452
        x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:433
        secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242

-> #1 (&p->pi_lock){-.-.}:
        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
        _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
        try_to_wake_up+0xd2/0x12b0 kernel/sched/core.c:1986
        wake_up_process+0x10/0x20 kernel/sched/core.c:2149
        __up.isra.1+0x1c0/0x2a0 kernel/locking/semaphore.c:262
        up+0x13c/0x1c0 kernel/locking/semaphore.c:187
        __up_console_sem+0xbe/0x1b0 kernel/printk/printk.c:242
        console_unlock+0x7a2/0x10b0 kernel/printk/printk.c:2411
        do_con_write+0x12cc/0x22a0 drivers/tty/vt/vt.c:2435
        con_write+0x25/0xc0 drivers/tty/vt/vt.c:2784
        process_output_block drivers/tty/n_tty.c:579 [inline]
        n_tty_write+0x6c1/0x11a0 drivers/tty/n_tty.c:2308
        do_tty_write drivers/tty/tty_io.c:963 [inline]
        tty_write+0x45f/0xae0 drivers/tty/tty_io.c:1051
        __vfs_write+0x117/0x9f0 fs/read_write.c:485
        vfs_write+0x1f8/0x560 fs/read_write.c:549
        ksys_write+0x101/0x260 fs/read_write.c:598
        __do_sys_write fs/read_write.c:610 [inline]
        __se_sys_write fs/read_write.c:607 [inline]
        __x64_sys_write+0x73/0xb0 fs/read_write.c:607
        do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
        entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 ((console_sem).lock){-.-.}:
        lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
        __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
        _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
        down_trylock+0x13/0x70 kernel/locking/semaphore.c:136
        __down_trylock_console_sem+0xae/0x200 kernel/printk/printk.c:225
        console_trylock+0x15/0xa0 kernel/printk/printk.c:2230
        console_trylock_spinning kernel/printk/printk.c:1643 [inline]
        vprintk_emit+0x6ad/0xdf0 kernel/printk/printk.c:1906
        vprintk_default+0x28/0x30 kernel/printk/printk.c:1948
        vprintk_func+0x7a/0xe7 kernel/printk/printk_safe.c:382
        printk+0xa7/0xcf kernel/printk/printk.c:1981
        __warn_printk+0x8c/0xe0 kernel/panic.c:590
        debug_print_object+0x16a/0x210 lib/debugobjects.c:326
        __debug_check_no_obj_freed lib/debugobjects.c:783 [inline]
        debug_check_no_obj_freed+0x3b2/0x595 lib/debugobjects.c:815
        kfree+0xc7/0x260 mm/slab.c:3812
        __rcu_reclaim kernel/rcu/rcu.h:173 [inline]
        rcu_do_batch kernel/rcu/tree.c:2558 [inline]
        invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
        __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
        rcu_process_callbacks+0x1004/0x1850 kernel/rcu/tree.c:2802
        __do_softirq+0x2e8/0xb17 kernel/softirq.c:288
        invoke_softirq kernel/softirq.c:368 [inline]
        irq_exit+0x1d1/0x200 kernel/softirq.c:408
        exiting_irq arch/x86/include/asm/apic.h:527 [inline]
        smp_apic_timer_interrupt+0x186/0x730 arch/x86/kernel/apic/apic.c:1052
        apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
        native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:54
        arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
        default_idle+0xc7/0x450 arch/x86/kernel/process.c:500
        arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:491
        default_idle_call+0x6d/0x90 kernel/sched/idle.c:93
        cpuidle_idle_call kernel/sched/idle.c:153 [inline]
        do_idle+0x3aa/0x570 kernel/sched/idle.c:262
        cpu_startup_entry+0x10c/0x120 kernel/sched/idle.c:368
        rest_init+0xe1/0xe4 init/main.c:442
        start_kernel+0x90e/0x949 init/main.c:738
        x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:452
        x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:433
        secondary_startup_64+0xa5/0xb0 arch/x86/kernel/head_64.S:242

other info that might help us debug this:

Chain exists of:
   (console_sem).lock --> &rq->lock --> &obj_hash[i].lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&obj_hash[i].lock);
                                lock(&rq->lock);
                                lock(&obj_hash[i].lock);
   lock((console_sem).lock);

  *** DEADLOCK ***

2 locks held by swapper/0/0:
  #0: 00000000eb5a51da (rcu_callback){....}, at: __rcu_reclaim  
kernel/rcu/rcu.h:168 [inline]
  #0: 00000000eb5a51da (rcu_callback){....}, at: rcu_do_batch  
kernel/rcu/tree.c:2558 [inline]
  #0: 00000000eb5a51da (rcu_callback){....}, at: invoke_rcu_callbacks  
kernel/rcu/tree.c:2818 [inline]
  #0: 00000000eb5a51da (rcu_callback){....}, at: __rcu_process_callbacks  
kernel/rcu/tree.c:2785 [inline]
  #0: 00000000eb5a51da (rcu_callback){....}, at:  
rcu_process_callbacks+0xfc6/0x1850 kernel/rcu/tree.c:2802
  #1: 000000004bd165f9 (&obj_hash[i].lock){-.-.}, at:  
__debug_check_no_obj_freed lib/debugobjects.c:774 [inline]
  #1: 000000004bd165f9 (&obj_hash[i].lock){-.-.}, at:  
debug_check_no_obj_freed+0x16c/0x595 lib/debugobjects.c:815

stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.18.0-rc2+ #43
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
  print_circular_bug.isra.36.cold.57+0x1bd/0x27d  
kernel/locking/lockdep.c:1227
  check_prev_add kernel/locking/lockdep.c:1867 [inline]
  check_prevs_add kernel/locking/lockdep.c:1980 [inline]
  validate_chain kernel/locking/lockdep.c:2421 [inline]
  __lock_acquire+0x3449/0x5020 kernel/locking/lockdep.c:3435
  lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
  __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
  _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
  down_trylock+0x13/0x70 kernel/locking/semaphore.c:136
  __down_trylock_console_sem+0xae/0x200 kernel/printk/printk.c:225
  console_trylock+0x15/0xa0 kernel/printk/printk.c:2230
  console_trylock_spinning kernel/printk/printk.c:1643 [inline]
  vprintk_emit+0x6ad/0xdf0 kernel/printk/printk.c:1906
  vprintk_default+0x28/0x30 kernel/printk/printk.c:1948
  vprintk_func+0x7a/0xe7 kernel/printk/printk_safe.c:382
  printk+0xa7/0xcf kernel/printk/printk.c:1981
  __warn_printk+0x8c/0xe0 kernel/panic.c:590
  debug_print_object+0x16a/0x210 lib/debugobjects.c:326
  __debug_check_no_obj_freed lib/debugobjects.c:783 [inline]
  debug_check_no_obj_freed+0x3b2/0x595 lib/debugobjects.c:815
  kfree+0xc7/0x260 mm/slab.c:3812
  __rcu_reclaim kernel/rcu/rcu.h:173 [inline]
  rcu_do_batch kernel/rcu/tree.c:2558 [inline]
  invoke_rcu_callbacks kernel/rcu/tree.c:2818 [inline]
  __rcu_process_callbacks kernel/rcu/tree.c:2785 [inline]
  rcu_process_callbacks+0x1004/0x1850 kernel/rcu/tree.c:2802
  __do_softirq+0x2e8/0xb17 kernel/softirq.c:288
  invoke_softirq kernel/softirq.c:368 [inline]
  irq_exit+0x1d1/0x200 kernel/softirq.c:408
  exiting_irq arch/x86/include/asm/apic.h:527 [inline]
  smp_apic_timer_interrupt+0x186/0x730 arch/x86/kernel/apic/apic.c:1052
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
  </IRQ>
RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:54
Code: c7 48 89 45 d8 e8 da 87 27 fa 48 8b 45 d8 e9 d2 fe ff ff 48 89 df e8  
c9 87 27 fa eb 8a 90 90 90 90 90 90 90 55 48 89 e5 fb f4 <5d> c3 0f 1f 84  
00 00 00 00 00 55 48 89 e5 f4 5d c3 90 90 90 90 90
RSP: 0018:ffffffff88e07bc0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: dffffc0000000000 RBX: 1ffffffff11c0f7b RCX: 0000000000000000
RDX: 1ffffffff11e3618 RSI: 0000000000000001 RDI: ffffffff88f1b0c0
RBP: ffffffff88e07bc0 R08: ffffed003b5c46d7 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff88e07c78 R14: ffffffff899eaae0 R15: 0000000000000000
  arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
  default_idle+0xc7/0x450 arch/x86/kernel/process.c:500
  arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:491
  default_idle_call+0x6d/0x90 kernel/sched/idle.c:93
  cpuidle_idle_call kernel/sched/idle.c:153 [inline]
  do_idle+0x3aa/0x570 kernel/sched/idle.c:262
  cpu_startup_entry+0x10c/0x120 kernel/sched/idle.c:368
  rest_init+0xe1/0xe4 init/main.c:442
  start_kernel+0x90e/0x949 init/main.c:738
  x86_64_start_reservations+0x29/0x2 arch/x86/kernel/head64.c:452
Lost 2 message(s)!
Dumping ftrace buffer:
    (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
From: Jiri Benc @ 2018-06-29  9:06 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, davem, Roopa Prabhu, jiri, jhs, xiyou.wangcong,
	oss-drivers, netdev, Pieter Jansen van Vuuren
In-Reply-To: <20180628100536.71e5ca59@cakuba.netronome.com>

On Thu, 28 Jun 2018 10:05:36 -0700, Jakub Kicinski wrote:
> Can we take this as a follow up through the bpf-next tree or do you
> want us to respin as part of this set?

I think BPF is a separate issue.

 Jiri

^ permalink raw reply

* Re: [PATCH vhost] vhost_net: Fix too many vring kick on busypoll
From: Jason Wang @ 2018-06-29  9:30 UTC (permalink / raw)
  To: Toshiaki Makita, Michael S. Tsirkin
  Cc: kvm, virtualization, netdev, Tonghao Zhang
In-Reply-To: <1530259790-2414-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>



On 2018年06月29日 16:09, Toshiaki Makita wrote:
> Under heavy load vhost busypoll may run without suppressing
> notification. For example tx zerocopy callback can push tx work while
> handle_tx() is running, then busyloop exits due to vhost_has_work()
> condition and enables notification but immediately reenters handle_tx()
> because the pushed work was tx. In this case handle_tx() tries to
> disable notification again, but when using event_idx it by design
> cannot. Then busyloop will run without suppressing notification.
> Another example is the case where handle_tx() tries to enable
> notification but avail idx is advanced so disables it again. This case
> also lead to the same situation with event_idx.

Good catch.

>
> The problem is that once we enter this situation busyloop does not work
> under heavy load for considerable amount of time, because notification
> is likely to happen during busyloop and handle_tx() immediately enables
> notification after notification happens. Specifically busyloop detects
> notification by vhost_has_work() and then handle_tx() calls
> vhost_enable_notify(). Because the detected work was the tx work, it
> enters handle_tx(), and enters busyloop without suppression again.
> This is likely to be repeated, so with event_idx we are almost not able
> to suppress notification in this case.
>
> To fix this, poll the work instead of enabling notification when
> busypoll is interrupted by something. IMHO signal_pending() and
> vhost_has_work() are kind of interruptions rather than signals to
> completely cancel the busypoll, so let's run busypoll after the
> necessary work is done. In order to avoid too long busyloop due to
> interruption, save the endtime in vq field and use it when reentering
> the work function.

I think we don't care long busyloop unless e.g tx can starve rx?

>
> I observed this problem makes tx performance poor but does not rx. I
> guess it is because rx notification from socket is not that heavy so
> did not impact the performance even when we failed to mask the
> notification.

I think the reason is:

For tx, we may only process used ring under handle_tx(). Busy polling 
code can not recognize this case.
For rx, we call peek_head_len() after exiting busy loop, so if the work 
is for rx, it can be processed in handle_rx() immediately.

> Anyway for consistency I fixed rx routine as well as tx.
>
> Performance numbers:
>
> - Bulk transfer from guest to external physical server.
>      [Guest]->vhost_net->tap--(XDP_REDIRECT)-->i40e --(wire)--> [Server]

Just to confirm in this case since zerocopy is enabled, we are in fact 
use the generic XDP datapath?

> - Set 10us busypoll.
> - Guest disables checksum and TSO because of host XDP.
> - Measured single flow Mbps by netperf, and kicks by perf kvm stat
>    (EPT_MISCONFIG event).
>
>                              Before              After
>                            Mbps  kicks/s      Mbps  kicks/s
> UDP_STREAM 1472byte              247758                 27
>                  Send   3645.37            6958.10
>                  Recv   3588.56            6958.10
>                1byte                9865                 37
>                  Send      4.34               5.43
>                  Recv      4.17               5.26
> TCP_STREAM             8801.03    45794   9592.77     2884

Impressive numbers.

>
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
>   drivers/vhost/net.c   | 94 +++++++++++++++++++++++++++++++++++----------------
>   drivers/vhost/vhost.c |  1 +
>   drivers/vhost/vhost.h |  1 +
>   3 files changed, 66 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index eeaf6739215f..0e85f628b965 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -391,13 +391,14 @@ static inline unsigned long busy_clock(void)
>   	return local_clock() >> 10;
>   }
>   
> -static bool vhost_can_busy_poll(struct vhost_dev *dev,
> -				unsigned long endtime)
> +static bool vhost_can_busy_poll(unsigned long endtime)
>   {
> -	return likely(!need_resched()) &&
> -	       likely(!time_after(busy_clock(), endtime)) &&
> -	       likely(!signal_pending(current)) &&
> -	       !vhost_has_work(dev);
> +	return likely(!need_resched() && !time_after(busy_clock(), endtime));
> +}
> +
> +static bool vhost_busy_poll_interrupted(struct vhost_dev *dev)
> +{
> +	return unlikely(signal_pending(current)) || vhost_has_work(dev);
>   }
>   
>   static void vhost_net_disable_vq(struct vhost_net *n,
> @@ -437,10 +438,21 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>   
>   	if (r == vq->num && vq->busyloop_timeout) {
>   		preempt_disable();
> -		endtime = busy_clock() + vq->busyloop_timeout;
> -		while (vhost_can_busy_poll(vq->dev, endtime) &&
> -		       vhost_vq_avail_empty(vq->dev, vq))
> +		if (vq->busyloop_endtime) {
> +			endtime = vq->busyloop_endtime;
> +			vq->busyloop_endtime = 0;

Looks like endtime may be before current time. So we skip busy loop in 
this case.

> +		} else {
> +			endtime = busy_clock() + vq->busyloop_timeout;
> +		}
> +		while (vhost_can_busy_poll(endtime)) {
> +			if (vhost_busy_poll_interrupted(vq->dev)) {
> +				vq->busyloop_endtime = endtime;
> +				break;
> +			}
> +			if (!vhost_vq_avail_empty(vq->dev, vq))
> +				break;
>   			cpu_relax();
> +		}
>   		preempt_enable();
>   		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>   				      out_num, in_num, NULL, NULL);
> @@ -509,12 +521,16 @@ static void handle_tx(struct vhost_net *net)
>   			break;
>   		/* Nothing new?  Wait for eventfd to tell us they refilled. */
>   		if (head == vq->num) {
> -			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> +			if (unlikely(vq->busyloop_endtime)) {
> +				/* Busy poll is interrupted. */
> +				vhost_poll_queue(&vq->poll);
> +			} else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>   				vhost_disable_notify(&net->dev, vq);
>   				continue;
>   			}
>   			break;
>   		}
> +		vq->busyloop_endtime = 0;
>   		if (in) {
>   			vq_err(vq, "Unexpected descriptor format for TX: "
>   			       "out %d, int %d\n", out, in);
> @@ -642,39 +658,51 @@ static void vhost_rx_signal_used(struct vhost_net_virtqueue *nvq)
>   
>   static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
>   {
> -	struct vhost_net_virtqueue *rvq = &net->vqs[VHOST_NET_VQ_RX];
> -	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
> -	struct vhost_virtqueue *vq = &nvq->vq;
> +	struct vhost_net_virtqueue *rnvq = &net->vqs[VHOST_NET_VQ_RX];
> +	struct vhost_net_virtqueue *tnvq = &net->vqs[VHOST_NET_VQ_TX];
> +	struct vhost_virtqueue *rvq = &rnvq->vq;
> +	struct vhost_virtqueue *tvq = &tnvq->vq;

NIt: I admit the variable name is confusing. It would be better to do 
the renaming in a separate patch to ease review.

>   	unsigned long uninitialized_var(endtime);
> -	int len = peek_head_len(rvq, sk);
> +	int len = peek_head_len(rnvq, sk);
>   
> -	if (!len && vq->busyloop_timeout) {
> +	if (!len && tvq->busyloop_timeout) {
>   		/* Flush batched heads first */
> -		vhost_rx_signal_used(rvq);
> +		vhost_rx_signal_used(rnvq);
>   		/* Both tx vq and rx socket were polled here */
> -		mutex_lock_nested(&vq->mutex, 1);
> -		vhost_disable_notify(&net->dev, vq);
> +		mutex_lock_nested(&tvq->mutex, 1);
> +		vhost_disable_notify(&net->dev, tvq);
>   
>   		preempt_disable();
> -		endtime = busy_clock() + vq->busyloop_timeout;
> +		if (rvq->busyloop_endtime) {
> +			endtime = rvq->busyloop_endtime;
> +			rvq->busyloop_endtime = 0;
> +		} else {
> +			endtime = busy_clock() + tvq->busyloop_timeout;
> +		}
>   
> -		while (vhost_can_busy_poll(&net->dev, endtime) &&
> -		       !sk_has_rx_data(sk) &&
> -		       vhost_vq_avail_empty(&net->dev, vq))
> +		while (vhost_can_busy_poll(endtime)) {
> +			if (vhost_busy_poll_interrupted(&net->dev)) {
> +				rvq->busyloop_endtime = endtime;
> +				break;
> +			}
> +			if (sk_has_rx_data(sk) ||
> +			    !vhost_vq_avail_empty(&net->dev, tvq))
> +				break;
>   			cpu_relax();

Actually, I plan to poll guest rx refilling as well here to avoid rx 
kicks. Want to draft another patch to do it as well? It's as simple as 
add a vhost_vq_avail_empty for rvq above.

> +		}
>   
>   		preempt_enable();
>   
> -		if (!vhost_vq_avail_empty(&net->dev, vq))
> -			vhost_poll_queue(&vq->poll);
> -		else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> -			vhost_disable_notify(&net->dev, vq);
> -			vhost_poll_queue(&vq->poll);
> +		if (!vhost_vq_avail_empty(&net->dev, tvq)) {
> +			vhost_poll_queue(&tvq->poll);
> +		} else if (unlikely(vhost_enable_notify(&net->dev, tvq))) {
> +			vhost_disable_notify(&net->dev, tvq);
> +			vhost_poll_queue(&tvq->poll);
>   		}
>   
> -		mutex_unlock(&vq->mutex);
> +		mutex_unlock(&tvq->mutex);
>   
> -		len = peek_head_len(rvq, sk);
> +		len = peek_head_len(rnvq, sk);
>   	}
>   
>   	return len;
> @@ -804,6 +832,7 @@ static void handle_rx(struct vhost_net *net)
>   	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
>   
>   	while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
> +		vq->busyloop_endtime = 0;
>   		sock_len += sock_hlen;
>   		vhost_len = sock_len + vhost_hlen;
>   		headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx,
> @@ -889,7 +918,12 @@ static void handle_rx(struct vhost_net *net)
>   			goto out;
>   		}
>   	}
> -	vhost_net_enable_vq(net, vq);
> +	if (unlikely(vq->busyloop_endtime)) {
> +		/* Busy poll is interrupted. */
> +		vhost_poll_queue(&vq->poll);
> +	} else {
> +		vhost_net_enable_vq(net, vq);
> +	}

This is to reduce the rx wake ups. Better with another patch.

So I suggest to split the patches into:

1 better naming of variable of vhost_net_rx_peek_head_len().
2 avoid tx kicks (most of what this patch did)
3 avoid rx wakeups (exactly the above 6 lines did)
4 avoid rx kicks

Btw, tonghao is doing some refactoring of busy polling as well. Depends 
on the order of being merged, one of you may need rebasing.

Thanks

>   out:
>   	vhost_rx_signal_used(nvq);
>   	mutex_unlock(&vq->mutex);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 9beefa6ed1ce..fe83578fe336 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -323,6 +323,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>   	vhost_reset_is_le(vq);
>   	vhost_disable_cross_endian(vq);
>   	vq->busyloop_timeout = 0;
> +	vq->busyloop_endtime = 0;
>   	vq->umem = NULL;
>   	vq->iotlb = NULL;
>   	__vhost_vq_meta_reset(vq);
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 6c844b90a168..7e9cf80ccae9 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -144,6 +144,7 @@ struct vhost_virtqueue {
>   	bool user_be;
>   #endif
>   	u32 busyloop_timeout;
> +	unsigned long busyloop_endtime;
>   };
>   
>   struct vhost_msg_node {

^ permalink raw reply

* [PATCH v4 00/18] ARM: davinci: step towards removing at24_platform_data
From: Bartosz Golaszewski @ 2018-06-29  9:40 UTC (permalink / raw)
  To: Sekhar Nori, Kevin Hilman, Russell King, Grygorii Strashko,
	David S . Miller, Srinivas Kandagatla, Lukas Wunner, Rob Herring,
	Florian Fainelli, Dan Carpenter, Ivan Khoronzhuk, David Lechner,
	Greg Kroah-Hartman, Andrew Lunn, Jonathan Corbet
  Cc: linux-arm-kernel, linux-kernel, linux-omap, netdev,
	Bartosz Golaszewski

From: Bartosz Golaszewski <bgolaszewski@baylibre.com>

Since I took over maintainership of the at24 driver I've been working
towards removing at24_platform_data in favor for device properties.

DaVinci is the only platform that's still using it - all other users
have already been converted.

One of the obstacles in case of DaVinci is removing the setup() callback
from the pdata struct, the only user of which are some davinci boards.

Most boards use the EEPROM to store the MAC address. This series adds
support for cell lookups to the nvmem framework, registers relevant
cells for all users, converts the davinci_emac driver to using them
and replaces at24_platform_data with device properties.

The only board that's still using this callback is now mityomapl138.
Unfortunately it stores more info in EEPROM than just the MAC address
and will require some more work. Unfortunately I don't have access
to this board so I can't test any actual solutions on a live hardware.

Tested on a dm365-evm board.

v1 -> v2:
- for backward compatiblity: fall back to using of_get_mac_address() if
  we can't get an nvmem cell in patch 7
- add patch 15 which removes dead code

v2 -> v3:
- documented new nvmem cell lookup API
- renamed the nvmem cell lookup routines
- added a function for removing the lookup tables
- in order to completly remove the mac_addr from davinci's soc_info
  added a patch that moves reading the MAC address from SPI to the emac
  driver
- removed more dead code

v3 -> v4:
- only return -EPROBE_DEFER from nvmem_cell_from_lookup() if we have
  a matching lookup entry but no corresponding nvmem device yet, return
  -ENOENT if we can't match the lookup
- check for -EPROBE_DEFER in the emac driver
- put the mtd device after reading the MAC address

Bartosz Golaszewski (18):
  nvmem: add support for cell lookups
  Documentation: nvmem: document lookup entries
  ARM: davinci: dm365-evm: use nvmem lookup for mac address
  ARM: davinci: dm644-evm: use nvmem lookup for mac address
  ARM: davinci: dm646x-evm: use nvmem lookup for mac address
  ARM: davinci: da830-evm: use nvmem lookup for mac address
  ARM: davinci: mityomapl138: add nvmem cells lookup entries
  net: davinci_emac: potentially get the MAC address from MTD
  ARM: davinci: da850-evm: remove dead MTD code
  net: davinci_emac: use nvmem to retrieve the mac address
  ARM: davinci: mityomapl138: don't read the MAC address from machine
    code
  ARM: davinci: dm365-evm: use device properties for at24 eeprom
  ARM: davinci: da830-evm: use device properties for at24 eeprom
  ARM: davinci: dm644x-evm: use device properties for at24 eeprom
  ARM: davinci: dm646x-evm: use device properties for at24 eeprom
  ARM: davinci: sffsdr: fix the at24 eeprom device name
  ARM: davinci: sffsdr: use device properties for at24 eeprom
  ARM: davinci: remove dead code related to MAC address reading

 Documentation/nvmem/nvmem.txt              | 28 ++++++++
 arch/arm/mach-davinci/board-da830-evm.c    | 25 ++++---
 arch/arm/mach-davinci/board-da850-evm.c    | 28 --------
 arch/arm/mach-davinci/board-dm365-evm.c    | 25 ++++---
 arch/arm/mach-davinci/board-dm644x-evm.c   | 24 ++++---
 arch/arm/mach-davinci/board-dm646x-evm.c   | 25 ++++---
 arch/arm/mach-davinci/board-mityomapl138.c | 30 ++++++---
 arch/arm/mach-davinci/board-sffsdr.c       | 13 ++--
 arch/arm/mach-davinci/common.c             | 15 -----
 drivers/net/ethernet/ti/davinci_emac.c     | 53 ++++++++++++---
 drivers/nvmem/core.c                       | 77 +++++++++++++++++++++-
 include/linux/davinci_emac.h               |  2 -
 include/linux/nvmem-consumer.h             |  6 ++
 include/linux/nvmem-provider.h             | 10 +++
 14 files changed, 258 insertions(+), 103 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH v4 01/18] nvmem: add support for cell lookups
From: Bartosz Golaszewski @ 2018-06-29  9:40 UTC (permalink / raw)
  To: Sekhar Nori, Kevin Hilman, Russell King, Grygorii Strashko,
	David S . Miller, Srinivas Kandagatla, Lukas Wunner, Rob Herring,
	Florian Fainelli, Dan Carpenter, Ivan Khoronzhuk, David Lechner,
	Greg Kroah-Hartman, Andrew Lunn, Jonathan Corbet
  Cc: netdev, linux-omap, linux-kernel, linux-arm-kernel,
	Bartosz Golaszewski
In-Reply-To: <20180629094039.7543-1-brgl@bgdev.pl>

From: Bartosz Golaszewski <bgolaszewski@baylibre.com>

We can currently only register nvmem cells from device tree or by
manually calling nvmem_add_cells(). The latter options however forces
users to make sure that the nvmem provider with which the cells are
associated is registered before the call.

This patch proposes a new solution inspired by other frameworks that
offer resource lookups (GPIO, PWM etc.). It adds functions that allow
machine code to register nvmem lookup which are later lazily used to
add corresponding nvmem cells and remove them if no longer needed.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
---
 drivers/nvmem/core.c           | 77 +++++++++++++++++++++++++++++++++-
 include/linux/nvmem-consumer.h |  6 +++
 include/linux/nvmem-provider.h | 10 +++++
 3 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c
index b5b0cdc21d01..2495550cbbc4 100644
--- a/drivers/nvmem/core.c
+++ b/drivers/nvmem/core.c
@@ -62,6 +62,9 @@ static DEFINE_IDA(nvmem_ida);
 static LIST_HEAD(nvmem_cells);
 static DEFINE_MUTEX(nvmem_cells_mutex);
 
+static LIST_HEAD(nvmem_cell_lookups);
+static DEFINE_MUTEX(nvmem_lookup_mutex);
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 static struct lock_class_key eeprom_lock_key;
 #endif
@@ -247,6 +250,41 @@ static const struct attribute_group *nvmem_ro_root_dev_groups[] = {
 	NULL,
 };
 
+/**
+ * nvmem_add_lookup_table() - register a number of nvmem cell lookup entries
+ *
+ * @lookup: array of nvmem cell lookup entries
+ * @nentries: number of lookup entries in the array
+ */
+void nvmem_add_lookup_table(struct nvmem_cell_lookup *lookup, size_t nentries)
+{
+	int i;
+
+	mutex_lock(&nvmem_lookup_mutex);
+	for (i = 0; i < nentries; i++)
+		list_add_tail(&lookup[i].list, &nvmem_cell_lookups);
+	mutex_unlock(&nvmem_lookup_mutex);
+}
+EXPORT_SYMBOL_GPL(nvmem_add_lookup_table);
+
+/**
+ * nvmem_del_lookup_table() - unregister a set of previously added nvmem cell
+ *                            lookup entries
+ *
+ * @lookup: array of nvmem cell lookup entries
+ * @nentries: number of lookup entries in the array
+ */
+void nvmem_del_lookup_table(struct nvmem_cell_lookup *lookup, size_t nentries)
+{
+	int i;
+
+	mutex_lock(&nvmem_lookup_mutex);
+	for (i = 0; i < nentries; i++)
+		list_del(&lookup[i].list);
+	mutex_unlock(&nvmem_lookup_mutex);
+}
+EXPORT_SYMBOL_GPL(nvmem_del_lookup_table);
+
 static void nvmem_release(struct device *dev)
 {
 	struct nvmem_device *nvmem = to_nvmem_device(dev);
@@ -916,6 +954,39 @@ struct nvmem_cell *of_nvmem_cell_get(struct device_node *np,
 EXPORT_SYMBOL_GPL(of_nvmem_cell_get);
 #endif
 
+static struct nvmem_cell *nvmem_cell_from_lookup(const char *cell_id)
+{
+	struct nvmem_cell *cell = ERR_PTR(-ENOENT);
+	struct nvmem_cell_lookup *lookup;
+	struct nvmem_device *nvmem;
+	int rc;
+
+	mutex_lock(&nvmem_lookup_mutex);
+
+	list_for_each_entry(lookup, &nvmem_cell_lookups, list) {
+		if (strcmp(cell_id, lookup->info.name) == 0) {
+			nvmem = nvmem_find(lookup->nvmem_name);
+			if (!nvmem) {
+				cell = ERR_PTR(-EPROBE_DEFER);
+				goto out;
+			}
+
+			rc = nvmem_add_cells(nvmem, &lookup->info, 1);
+			if (rc) {
+				cell = ERR_PTR(rc);
+				goto out;
+			}
+
+			cell = nvmem_cell_get_from_list(cell_id);
+			break;
+		}
+	}
+
+out:
+	mutex_unlock(&nvmem_lookup_mutex);
+	return cell;
+}
+
 /**
  * nvmem_cell_get() - Get nvmem cell of device form a given cell name
  *
@@ -936,7 +1007,11 @@ struct nvmem_cell *nvmem_cell_get(struct device *dev, const char *cell_id)
 			return cell;
 	}
 
-	return nvmem_cell_get_from_list(cell_id);
+	cell = nvmem_cell_get_from_list(cell_id);
+	if (!IS_ERR(cell) || PTR_ERR(cell) == -EPROBE_DEFER)
+		return cell;
+
+	return nvmem_cell_from_lookup(cell_id);
 }
 EXPORT_SYMBOL_GPL(nvmem_cell_get);
 
diff --git a/include/linux/nvmem-consumer.h b/include/linux/nvmem-consumer.h
index 4e85447f7860..f4b5d3186e94 100644
--- a/include/linux/nvmem-consumer.h
+++ b/include/linux/nvmem-consumer.h
@@ -29,6 +29,12 @@ struct nvmem_cell_info {
 	unsigned int		nbits;
 };
 
+struct nvmem_cell_lookup {
+	struct nvmem_cell_info	info;
+	struct list_head	list;
+	const char		*nvmem_name;
+};
+
 #if IS_ENABLED(CONFIG_NVMEM)
 
 /* Cell based interface */
diff --git a/include/linux/nvmem-provider.h b/include/linux/nvmem-provider.h
index 24def6ad09bb..6a17d722062b 100644
--- a/include/linux/nvmem-provider.h
+++ b/include/linux/nvmem-provider.h
@@ -17,6 +17,7 @@
 
 struct nvmem_device;
 struct nvmem_cell_info;
+struct nvmem_cell_lookup;
 typedef int (*nvmem_reg_read_t)(void *priv, unsigned int offset,
 				void *val, size_t bytes);
 typedef int (*nvmem_reg_write_t)(void *priv, unsigned int offset,
@@ -72,6 +73,9 @@ struct nvmem_config {
 struct nvmem_device *nvmem_register(const struct nvmem_config *cfg);
 int nvmem_unregister(struct nvmem_device *nvmem);
 
+void nvmem_add_lookup_table(struct nvmem_cell_lookup *lookup, size_t nentries);
+void nvmem_del_lookup_table(struct nvmem_cell_lookup *lookup, size_t nentries);
+
 struct nvmem_device *devm_nvmem_register(struct device *dev,
 					 const struct nvmem_config *cfg);
 
@@ -92,6 +96,12 @@ static inline int nvmem_unregister(struct nvmem_device *nvmem)
 	return -ENOSYS;
 }
 
+static inline void
+nvmem_add_lookup_table(struct nvmem_cell_lookup *lookup, size_t nentries) {}
+
+static inline void
+nvmem_del_lookup_table(struct nvmem_cell_lookup *lookup, size_t nentries) {}
+
 static inline struct nvmem_device *
 devm_nvmem_register(struct device *dev, const struct nvmem_config *c)
 {
-- 
2.17.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox