Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [v1] net_sched: act_police: add 2 new attributes to support police 64bit rate and peakrate
From: David Z. Dai @ 2019-08-29 17:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: jhs, xiyou.wangcong, jiri, davem, netdev, linux-kernel, zdai
In-Reply-To: <7a8a5024-bbff-7443-71b3-9e3976af269f@gmail.com>

On Thu, 2019-08-29 at 10:32 +0200, Eric Dumazet wrote:
> 
> On 8/29/19 12:51 AM, David Dai wrote:
> > For high speed adapter like Mellanox CX-5 card, it can reach upto
> > 100 Gbits per second bandwidth. Currently htb already supports 64bit rate
> > in tc utility. However police action rate and peakrate are still limited
> > to 32bit value (upto 32 Gbits per second). Add 2 new attributes
> > TCA_POLICE_RATE64 and TCA_POLICE_RATE64 in kernel for 64bit support
> > so that tc utility can use them for 64bit rate and peakrate value to
> > break the 32bit limit, and still keep the backward binary compatibility.
> > 
> > Tested-by: David Dai <zdai@linux.vnet.ibm.com>
> > Signed-off-by: David Dai <zdai@linux.vnet.ibm.com>
> > ---
> >  include/uapi/linux/pkt_cls.h |    2 ++
> >  net/sched/act_police.c       |   27 +++++++++++++++++++++++----
> >  2 files changed, 25 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
> > index b057aee..eb4ea4d 100644
> > --- a/include/uapi/linux/pkt_cls.h
> > +++ b/include/uapi/linux/pkt_cls.h
> > @@ -159,6 +159,8 @@ enum {
> >  	TCA_POLICE_AVRATE,
> >  	TCA_POLICE_RESULT,
> >  	TCA_POLICE_TM,
> > +	TCA_POLICE_RATE64,
> > +	TCA_POLICE_PEAKRATE64,
> >  	TCA_POLICE_PAD,
> >  	__TCA_POLICE_MAX
> >  #define TCA_POLICE_RESULT TCA_POLICE_RESULT
> 
> Never insert new attributes, as this breaks compatibility with old binaries (including
> old kernels)
Thanks for reviewing it!
My change is only contained within the police part. I am trying to
follow the same way htb and tbf support their 64 bit rate.

I tested the old tc binary with the newly patched kernel. It works fine.

I agree the newly compiled tc binary that has these 2 new attributes can
cause backward compatibility issue when running on the old kernel.

If can't insert new attribute, is there any
comment/suggestion/alternative on how to support 64bit police rate and
still keep the backward compatibility?

> Keep TCA_POLICE_PAD value the same, thanks.



^ permalink raw reply

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
From: Alexei Starovoitov @ 2019-08-29 17:30 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	kernel-team, linux-api
In-Reply-To: <ed8796f5-eaea-c87d-ddd9-9d624059e5ee@iogearbox.net>

On Thu, Aug 29, 2019 at 05:32:27PM +0200, Daniel Borkmann wrote:
> On 8/29/19 7:12 AM, Alexei Starovoitov wrote:
> > Implement permissions as stated in uapi/linux/capability.h
> > 
> > Note that CAP_SYS_ADMIN is replaced with CAP_BPF.
> > All existing applications that use BPF do not drop all caps
> > and keep only CAP_SYS_ADMIN before doing bpf() syscall.
> > Hence it's highly unlikely that existing code will break.
> > If there will be reports of breakage then CAP_SYS_ADMIN
> > would be allowed as well with "it's usage is deprecated" message
> > similar to commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> [...]
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 22066a62c8c9..f459315625ac 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -244,9 +244,9 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> >   	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
> >   		     offsetof(struct htab_elem, hash_node.pprev));
> > -	if (lru && !capable(CAP_SYS_ADMIN))
> > +	if (lru && !capable(CAP_BPF))
> >   		/* LRU implementation is much complicated than other
> > -		 * maps.  Hence, limit to CAP_SYS_ADMIN for now.
> > +		 * maps.  Hence, limit to CAP_BPF.
> >   		 */
> >   		return -EPERM;
> I don't think this works, this is pretty much going to break use cases where
> orchestration daemons are deployed as containers that are explicitly granted
> specified cap set and right now this is CAP_SYS_ADMIN and not CAP_BPF for bpf().
> The former needs to be a superset of the latter in order for this to work and
> not break compatibility between kernel upgrades.
> 
> - https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container
> - https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities

These are the links that showing that k8 can delegates caps.
Are you saying that you know of folks who specifically
delegate cap_sys_admin and cap_net_admin _only_ to a container to run bpf in there?


^ permalink raw reply

* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
From: Alexei Starovoitov @ 2019-08-29 17:28 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, Andy Lutomirski, davem@davemloft.net,
	peterz@infradead.org, rostedt@goodmis.org, netdev@vger.kernel.org,
	bpf@vger.kernel.org, Kernel Team, linux-api@vger.kernel.org
In-Reply-To: <B28631A9-BB92-404A-BD58-7A737BCF10C9@fb.com>

On Thu, Aug 29, 2019 at 06:04:42AM +0000, Song Liu wrote:
> 
> 
> > On Aug 28, 2019, at 10:12 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> > 
> 
> [...]
> 
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> > index 44e2d640b088..91a7f25512ca 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -805,10 +805,20 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
> > 	}
> > }
> > 
> > +struct libcap {
> > +	struct __user_cap_header_struct hdr;
> > +	struct __user_cap_data_struct data[2];
> > +};
> > +
> 
> I am confused by struct libcap. Why do we need it? 

because libcap is not compatible with new kernel.
It needs to be recompiled with new capability.h
Otherwise it limits max to CAP_AUDIT_READ
Any value higher it will error during cap_get_flag.
And will silently ignore it during cap_set_flag.
Not a great library decision.

Thankfully this struct above is exactly the kernel api.
One doesn't really need libcap. It's imo easier to do without it.


^ permalink raw reply

* Re: [PATCH v2 2/2] PTP: add support for one-shot output
From: Richard Cochran @ 2019-08-29 17:28 UTC (permalink / raw)
  To: Felipe Balbi; +Cc: Christopher S Hall, netdev, linux-kernel, davem
In-Reply-To: <20190829172509.GB2166@localhost>


Adding davem onto CC...

On Thu, Aug 29, 2019 at 12:58:25PM +0300, Felipe Balbi wrote:
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index 98ec1395544e..a407e5f76e2d 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -177,9 +177,8 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
>  			err = -EFAULT;
>  			break;
>  		}
> -		if ((req.perout.flags || req.perout.rsv[0] || req.perout.rsv[1]
> -				|| req.perout.rsv[2] || req.perout.rsv[3])
> -			&& cmd == PTP_PEROUT_REQUEST2) {
> +		if ((req.perout.rsv[0] || req.perout.rsv[1] || req.perout.rsv[2]
> +			|| req.perout.rsv[3]) && cmd == PTP_PEROUT_REQUEST2) {

Please check that the reserved bits of req.perout.flags, namely
~PTP_PEROUT_ONE_SHOT, are clear.

>  			err = -EINVAL;
>  			break;
>  		} else if (cmd == PTP_PEROUT_REQUEST) {
> diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
> index 039cd62ec706..95840e5f5c53 100644
> --- a/include/uapi/linux/ptp_clock.h
> +++ b/include/uapi/linux/ptp_clock.h
> @@ -67,7 +67,9 @@ struct ptp_perout_request {
>  	struct ptp_clock_time start;  /* Absolute start time. */
>  	struct ptp_clock_time period; /* Desired period, zero means disable. */
>  	unsigned int index;           /* Which channel to configure. */
> -	unsigned int flags;           /* Reserved for future use. */
> +
> +#define PTP_PEROUT_ONE_SHOT BIT(0)
> +	unsigned int flags;

@davem  Any CodingStyle policy on #define within a struct?  (Some
maintainers won't allow it.)

>  	unsigned int rsv[4];          /* Reserved for future use. */
>  };
>  
> -- 
> 2.23.0
> 

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH bpf-next 0/3] tools: bpftool: improve bpftool build experience
From: Jakub Kicinski @ 2019-08-29 17:25 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, netdev, oss-drivers,
	Lorenz Bauer, Ilya Leoshkevich
In-Reply-To: <20190829105645.12285-1-quentin.monnet@netronome.com>

On Thu, 29 Aug 2019 11:56:42 +0100, Quentin Monnet wrote:
> Hi,
> This set attempts to make it easier to build bpftool, in particular when
> passing a specific output directory. This is a follow-up to the
> conversation held last month by Lorenz, Ilya and Jakub [0].
> 
> The first patch is a minor fix to bpftool's Makefile, regarding the
> retrieval of kernel version (which currently prints a non-relevant make
> warning on some invocations).
> 
> Second patch improves the Makefile commands to support more "make"
> invocations, or to fix building with custom output directory. On Jakub's
> suggestion, a script is also added to BPF selftests in order to keep track
> of the supported build variants.
> 
> At last, third patch is a sligthly modified version of Ilya's fix regarding
> libbpf.a appearing twice on the linking command for bpftool.
> 
> [0] https://lore.kernel.org/bpf/CACAyw9-CWRHVH3TJ=Tke2x8YiLsH47sLCijdp=V+5M836R9aAA@mail.gmail.com/

I think Ilya has a point, but otherwise looks good to me :)

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
From: Alexei Starovoitov @ 2019-08-29 17:25 UTC (permalink / raw)
  To: Nicolas Dichtel
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	kernel-team, linux-api
In-Reply-To: <a5ef2f94-acca-eb66-b48c-899494a9f8d0@6wind.com>

On Thu, Aug 29, 2019 at 03:36:42PM +0200, Nicolas Dichtel wrote:
> Le 29/08/2019 à 07:12, Alexei Starovoitov a écrit :
> [snip]
> > CAP_BPF and CAP_NET_ADMIN together allow the following:
> > - Attach to cgroup-bpf hooks and query
> > - skb, xdp, flow_dissector test_run command
> > 
> > CAP_NET_ADMIN allows:
> > - Attach networking bpf programs to xdp, tc, lwt, flow dissector
> I'm not sure to understand the difference between these last two points.
> But, with the current kernel, CAP_NET_ADMIN is not enough to attach bpf prog
> with tc and it's still not enough after your patch.
> The following command is rejected:
> $ tc filter add dev eth0 ingress matchall action bpf obj ./tc_test_kern.o sec test
> 
> Prog section 'test' rejected: Operation not permitted (1)!
>  - Type:         4
>  - Instructions: 22 (0 over limit)
>  - License:      GPL
> 
> Verifier analysis:
> 
> Error fetching program/map!
> bad action parsing
> parse_action: bad value (5:bpf)!
> Illegal "action"

because tc/iproute2 is doing load and attach.
Currently load needs cap_sys_admin and
attach needs cap_net_admin.


^ permalink raw reply

* Re: [PATCH v2 2/2] PTP: add support for one-shot output
From: Richard Cochran @ 2019-08-29 17:25 UTC (permalink / raw)
  To: Felipe Balbi; +Cc: Christopher S Hall, netdev, linux-kernel
In-Reply-To: <20190829095825.2108-2-felipe.balbi@linux.intel.com>

On Thu, Aug 29, 2019 at 12:58:25PM +0300, Felipe Balbi wrote:
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index 98ec1395544e..a407e5f76e2d 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -177,9 +177,8 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
>  			err = -EFAULT;
>  			break;
>  		}
> -		if ((req.perout.flags || req.perout.rsv[0] || req.perout.rsv[1]
> -				|| req.perout.rsv[2] || req.perout.rsv[3])
> -			&& cmd == PTP_PEROUT_REQUEST2) {
> +		if ((req.perout.rsv[0] || req.perout.rsv[1] || req.perout.rsv[2]
> +			|| req.perout.rsv[3]) && cmd == PTP_PEROUT_REQUEST2) {

Please check that the reserved bits of req.perout.flags, namely
~PTP_PEROUT_ONE_SHOT, are clear.

>  			err = -EINVAL;
>  			break;
>  		} else if (cmd == PTP_PEROUT_REQUEST) {
> diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
> index 039cd62ec706..95840e5f5c53 100644
> --- a/include/uapi/linux/ptp_clock.h
> +++ b/include/uapi/linux/ptp_clock.h
> @@ -67,7 +67,9 @@ struct ptp_perout_request {
>  	struct ptp_clock_time start;  /* Absolute start time. */
>  	struct ptp_clock_time period; /* Desired period, zero means disable. */
>  	unsigned int index;           /* Which channel to configure. */
> -	unsigned int flags;           /* Reserved for future use. */
> +
> +#define PTP_PEROUT_ONE_SHOT BIT(0)
> +	unsigned int flags;

@davem  Any CodingStyle policy on #define within a struct?  (Some
maintainers won't allow it.)

>  	unsigned int rsv[4];          /* Reserved for future use. */
>  };
>  
> -- 
> 2.23.0
> 

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
From: Alexei Starovoitov @ 2019-08-29 17:24 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
	kernel-team, linux-api
In-Reply-To: <87ef14iffx.fsf@toke.dk>

On Thu, Aug 29, 2019 at 09:44:18AM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <ast@kernel.org> writes:
> 
> > CAP_BPF allows the following BPF operations:
> > - Loading all types of BPF programs
> > - Creating all types of BPF maps except:
> >    - stackmap that needs CAP_TRACING
> >    - devmap that needs CAP_NET_ADMIN
> >    - cpumap that needs CAP_SYS_ADMIN
> 
> Why CAP_SYS_ADMIN instead of CAP_NET_ADMIN for cpumap?

Currently it's cap_sys_admin and I think it should stay this way
because it creates kthreads.


^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-29 17:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Peter Zijlstra, Alexei Starovoitov, Kees Cook,
	LSM List, James Morris, Jann Horn, Masami Hiramatsu,
	David S. Miller, Daniel Borkmann, Network Development, bpf,
	kernel-team, Linux API
In-Reply-To: <CALCETrWYu0XB_d-MhXFgopEmBu-pog493G1e+KsE3dS32UULgA@mail.gmail.com>

On Thu, Aug 29, 2019 at 08:43:23AM -0700, Andy Lutomirski wrote:
> 
> I can imagine splitting it into three capabilities:
> 
> CAP_TRACE_KERNEL: learn which kernel functions are called when.  This
> would allow perf profiling, for example, but not sampling of kernel
> regs.
> 
> CAP_TRACE_READ_KERNEL_DATA: allow the tracing, profiling, etc features
> that can read the kernel's data.  So you get function arguments via
> kprobe, kernel regs, and APIs that expose probe_kernel_read()
> 
> CAP_TRACE_USER: trace unrelated user processes
> 
> I'm not sure the code is written in a way that makes splitting
> CAP_TRACE_KERNEL and CAP_TRACE_READ_KERNEL_DATA, and I'm not sure that
> CAP_TRACE_KERNEL is all that useful except for plain perf record
> without CAP_TRACE_READ_KERNEL_DATA.  What do you all think?  I suppose
> it could also be:
> 
> CAP_PROFILE_KERNEL: Use perf with events that aren't kprobes or
> tracepoints.  Does not grant the ability to sample regs or the kernel
> stack directly.
> 
> CAP_TRACE_KERNEL: Use all of perf, ftrace, kprobe, etc.
> 
> CAP_TRACE_USER: Use all of perf with scope limited to user mode and uprobes.

imo that makes little sense from security pov, since
such CAP_TRACE_KERNEL (ex kprobe) can trace "unrelated user process"
just as well. Yet not letting it do cleanly via uprobe.
Sort of like giving a spare key for back door of the house and
saying no, you cannot have main door key.


^ permalink raw reply

* Re: [PATCH v2 1/2] PTP: introduce new versions of IOCTLs
From: Richard Cochran @ 2019-08-29 17:21 UTC (permalink / raw)
  To: Felipe Balbi; +Cc: Christopher S Hall, netdev, linux-kernel
In-Reply-To: <20190829095825.2108-1-felipe.balbi@linux.intel.com>

> @@ -139,11 +141,24 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
>  		break;
>  
>  	case PTP_EXTTS_REQUEST:
> +	case PTP_EXTTS_REQUEST2:
> +		memset(&req, 0, sizeof(req));
> +
>  		if (copy_from_user(&req.extts, (void __user *)arg,
>  				   sizeof(req.extts))) {
>  			err = -EFAULT;
>  			break;
>  		}
> +		if ((req.extts.flags || req.extts.rsv[0] || req.extts.rsv[1])
> +			&& cmd == PTP_EXTTS_REQUEST2) {
> +			err = -EINVAL;
> +			break;
> +		} else if (cmd == PTP_EXTTS_REQUEST) {
> +			req.extts.flags = 0;

This still isn't quite right.  Sorry that was my fault.

The req.extts.flags can be (PTP_ENABLE_FEATURE | PTP_RISING_EDGE |
PTP_FALLING_EDGE), and ENABLE is used immediately below in this case.

Please #define those bits into a valid mask, and then:

- for PTP_EXTTS_REQUEST2 check that ~mask is zero, and
- for PTP_EXTTS_REQUEST clear the ~mask bits for the drivers. 

Thanks again for cleaning this up!

Richard

> +			req.extts.rsv[0] = 0;
> +			req.extts.rsv[1] = 0;
> +		}
> +			
>  		if (req.extts.index >= ops->n_ext_ts) {
>  			err = -EINVAL;
>  			break;

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-29 17:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Andy Lutomirski, Alexei Starovoitov, Kees Cook,
	LSM List, James Morris, Jann Horn, Masami Hiramatsu,
	David S. Miller, Daniel Borkmann, Network Development, bpf,
	kernel-team, Linux API
In-Reply-To: <20190829093434.36540972@gandalf.local.home>

On Thu, Aug 29, 2019 at 09:34:34AM -0400, Steven Rostedt wrote:
> 
> As the above seems to favor the idea of CAP_TRACING allowing write
> access to tracefs, should we have a CAP_TRACING_RO for just read access
> and limited perf abilities?

read only vs writeable is an attribute of the file system.
Bringing such things into caps seem wrong to me.


^ permalink raw reply

* Re: auto-split of commit. Was: [PATCH bpf-next 04/10] tools/bpf: add libbpf_prog_type_(from|to)_str helpers
From: Alexei Starovoitov @ 2019-08-29 17:16 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jakub Kicinski, Julia Kartseva, ast, Thomas Gleixner, rdna, bpf,
	daniel, netdev, kernel-team
In-Reply-To: <20190829065151.GB30423@kroah.com>

On Thu, Aug 29, 2019 at 08:51:51AM +0200, Greg Kroah-Hartman wrote:
> On Wed, Aug 28, 2019 at 04:46:28PM -0700, Alexei Starovoitov wrote:
> > On Wed, Aug 28, 2019 at 04:34:22PM -0700, Jakub Kicinski wrote:
> > > 
> > > Greg, Thomas, libbpf is extracted from the kernel sources and
> > > maintained in a clone repo on GitHub for ease of packaging.
> > > 
> > > IIUC Alexei's concern is that since we are moving the commits from
> > > the kernel repo to the GitHub one we have to preserve the commits
> > > exactly as they are, otherwise SOB lines lose their power.
> > > 
> > > Can you provide some guidance on whether that's a valid concern, 
> > > or whether it's perfectly fine to apply a partial patch?
> > 
> > Right. That's exactly the concern.
> > 
> > Greg, Thomas,
> > could you please put your legal hat on and clarify the following.
> > Say some developer does a patch that modifies
> > include/uapi/linux/bpf.h
> > ..some other kernel code...and
> > tools/include/uapi/linux/bpf.h
> > 
> > That tools/include/uapi/linux/bpf.h is used by perf and by libbpf.
> > We have automatic mirror of tools/libbpf into github/libbpf/
> > so that external projects and can do git submodule of it,
> > can build packages out of it, etc.
> > 
> > The question is whether it's ok to split tools/* part out of
> > original commit, keep Author and SOB, create new commit out of it,
> > and automatically push that auto-generated commit into github mirror.
> 
> Note, I am not a laywer, and am not _your_ lawyer either, only _your_
> lawyer can answer questions as to what is best for you.
> 
> That being said, from a "are you keeping the correct authorship info",
> yes, it sounds like you are doing the correct thing here.
> 
> Look at what I do for stable kernels, I take the original commit and add
> it to "another tree" keeping the original author and s-o-b chain intact,
> and adding a "this is the original git commit id" type message to the
> changelog text so that people can link it back to the original.

I think you're describing 'git cherry-pick -x'.
The question was about taking pieces of the original commit. Not the whole commit.
Author field obviously stays, but SOB is questionable.
If author meant to change X and Y and Z. Silently taking only Z chunk of the diff
doesn't quite seem right.
If we document that such commit split happens in Documentation/bpf/bpf_devel_QA.rst
do you think it will be enough to properly inform developers?
The main concern is the surprise factor when people start seeing their commits
in the mirror, but not their full commits.


^ permalink raw reply

* Re: [PATCH net-next] net: dsa: mv88e6xxx: fix freeing unused SERDES IRQ
From: Andrew Lunn @ 2019-08-29 17:08 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, davem, Marek Behún, f.fainelli
In-Reply-To: <20190828185511.21956-1-vivien.didelot@gmail.com>

On Wed, Aug 28, 2019 at 02:55:11PM -0400, Vivien Didelot wrote:
> Now mv88e6xxx does not enable its ports at setup itself and let
> the DSA core handle this, unused ports are disabled without being
> powered on first. While that is expected, the SERDES powering code
> was assuming that a port was already set up before powering it down,
> resulting in freeing an unused IRQ. The patch fixes this assumption.
> 
> Fixes: b759f528ca3d ("net: dsa: mv88e6xxx: enable SERDES after setup")
> Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>

Tested-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* [PATCH net-next v2 3/3] net: tls: export protocol version, cipher, tx_conf/rx_conf to socket diag
From: Davide Caratti @ 2019-08-29 16:48 UTC (permalink / raw)
  To: borisp, jakub.kicinski, Eric Dumazet
  Cc: aviadye, davejwatson, davem, john.fastabend, Matthieu Baerts,
	netdev
In-Reply-To: <cover.1567095873.git.dcaratti@redhat.com>

When an application configures kernel TLS on top of a TCP socket, it's
now possible for inet_diag_handler() to collect information regarding the
protocol version, the cipher type and TX / RX configuration, in case
INET_DIAG_INFO is requested.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 include/net/tls.h              | 19 ++++++++++
 include/uapi/linux/inet_diag.h |  1 +
 include/uapi/linux/tls.h       | 15 ++++++++
 net/tls/tls_main.c             | 64 ++++++++++++++++++++++++++++++++++
 4 files changed, 99 insertions(+)

diff --git a/include/net/tls.h b/include/net/tls.h
index 4997742475cd..990f1d9182a3 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -431,6 +431,25 @@ static inline bool is_tx_ready(struct tls_sw_context_tx *ctx)
 	return READ_ONCE(rec->tx_ready);
 }
 
+static inline u16 tls_user_config(struct tls_context *ctx, bool tx)
+{
+	u16 config = tx ? ctx->tx_conf : ctx->rx_conf;
+
+	switch (config) {
+	case TLS_BASE:
+		return TLS_CONF_BASE;
+	case TLS_SW:
+		return TLS_CONF_SW;
+#ifdef CONFIG_TLS_DEVICE
+	case TLS_HW:
+		return TLS_CONF_HW;
+#endif
+	case TLS_HW_RECORD:
+		return TLS_CONF_HW_RECORD;
+	}
+	return 0;
+}
+
 struct sk_buff *
 tls_validate_xmit_skb(struct sock *sk, struct net_device *dev,
 		      struct sk_buff *skb);
diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index e2c6273274f3..a1ff345b3f33 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -162,6 +162,7 @@ enum {
 enum {
 	INET_ULP_INFO_UNSPEC,
 	INET_ULP_INFO_NAME,
+	INET_ULP_INFO_TLS,
 	__INET_ULP_INFO_MAX,
 };
 #define INET_ULP_INFO_MAX (__INET_ULP_INFO_MAX - 1)
diff --git a/include/uapi/linux/tls.h b/include/uapi/linux/tls.h
index 5b9c26753e46..bcd2869ed472 100644
--- a/include/uapi/linux/tls.h
+++ b/include/uapi/linux/tls.h
@@ -109,4 +109,19 @@ struct tls12_crypto_info_aes_ccm_128 {
 	unsigned char rec_seq[TLS_CIPHER_AES_CCM_128_REC_SEQ_SIZE];
 };
 
+enum {
+	TLS_INFO_UNSPEC,
+	TLS_INFO_VERSION,
+	TLS_INFO_CIPHER,
+	TLS_INFO_TXCONF,
+	TLS_INFO_RXCONF,
+	__TLS_INFO_MAX,
+};
+#define TLS_INFO_MAX (__TLS_INFO_MAX - 1)
+
+#define TLS_CONF_BASE 1
+#define TLS_CONF_SW 2
+#define TLS_CONF_HW 3
+#define TLS_CONF_HW_RECORD 4
+
 #endif /* _UAPI_LINUX_TLS_H */
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index f8f2d2c3d627..3351a2ace369 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -39,6 +39,7 @@
 #include <linux/netdevice.h>
 #include <linux/sched/signal.h>
 #include <linux/inetdevice.h>
+#include <linux/inet_diag.h>
 
 #include <net/tls.h>
 
@@ -835,6 +836,67 @@ static void tls_update(struct sock *sk, struct proto *p)
 	}
 }
 
+static int tls_get_info(const struct sock *sk, struct sk_buff *skb)
+{
+	struct tls_context *ctx;
+	u16 version, cipher_type;
+	struct nlattr *start;
+	int err;
+
+	start = nla_nest_start_noflag(skb, INET_ULP_INFO_TLS);
+	if (!start)
+		return -EMSGSIZE;
+
+	rcu_read_lock();
+	ctx = rcu_dereference(inet_csk(sk)->icsk_ulp_data);
+	if (!ctx) {
+		err = 0;
+		goto nla_failure;
+	}
+	version = ctx->prot_info.version;
+	if (version) {
+		err = nla_put_u16(skb, TLS_INFO_VERSION, version);
+		if (err)
+			goto nla_failure;
+	}
+	cipher_type = ctx->prot_info.cipher_type;
+	if (cipher_type) {
+		err = nla_put_u16(skb, TLS_INFO_CIPHER, cipher_type);
+		if (err)
+			goto nla_failure;
+	}
+	err = nla_put_u16(skb, TLS_INFO_TXCONF, tls_user_config(ctx, true));
+	if (err)
+		goto nla_failure;
+
+	err = nla_put_u16(skb, TLS_INFO_RXCONF, tls_user_config(ctx, false));
+	if (err)
+		goto nla_failure;
+
+	rcu_read_unlock();
+	nla_nest_end(skb, start);
+	return 0;
+
+nla_failure:
+	rcu_read_unlock();
+	nla_nest_cancel(skb, start);
+	return err;
+}
+
+static size_t tls_get_info_size(const struct sock *sk)
+{
+	size_t size = 0;
+
+	size += nla_total_size(0) +		/* INET_ULP_INFO_TLS */
+		nla_total_size(sizeof(u16)) +	/* TLS_INFO_VERSION */
+		nla_total_size(sizeof(u16)) +	/* TLS_INFO_CIPHER */
+		nla_total_size(sizeof(u16)) +	/* TLS_INFO_RXCONF */
+		nla_total_size(sizeof(u16)) +	/* TLS_INFO_TXCONF */
+		0;
+
+	return size;
+}
+
 void tls_register_device(struct tls_device *device)
 {
 	spin_lock_bh(&device_spinlock);
@@ -856,6 +918,8 @@ static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
 	.owner			= THIS_MODULE,
 	.init			= tls_init,
 	.update			= tls_update,
+	.get_info		= tls_get_info,
+	.get_info_size		= tls_get_info_size,
 };
 
 static int __init tls_register(void)
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v2 2/3] tcp: ulp: add functions to dump ulp-specific information
From: Davide Caratti @ 2019-08-29 16:48 UTC (permalink / raw)
  To: borisp, jakub.kicinski, Eric Dumazet
  Cc: aviadye, davejwatson, davem, john.fastabend, Matthieu Baerts,
	netdev
In-Reply-To: <cover.1567095873.git.dcaratti@redhat.com>

currently, only getsockopt(TCP_ULP) can be invoked to know if a ULP is on
top of a TCP socket. Extend idiag_get_aux() and idiag_get_aux_size(),
introduced by commit b37e88407c1d ("inet_diag: allow protocols to provide
additional data"), to report the ULP name and other information that can
be made available by the ULP through optional functions.

Users having CAP_NET_ADMIN privileges will then be able to retrieve this
information through inet_diag_handler, if they specify INET_DIAG_INFO in
the request.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 include/net/tcp.h              |  3 ++
 include/uapi/linux/inet_diag.h |  8 ++++++
 net/ipv4/tcp_diag.c            | 52 +++++++++++++++++++++++++++++++++-
 3 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 77fe87f7a992..c9a3f9688223 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2122,6 +2122,9 @@ struct tcp_ulp_ops {
 	void (*update)(struct sock *sk, struct proto *p);
 	/* cleanup ulp */
 	void (*release)(struct sock *sk);
+	/* diagnostic */
+	int (*get_info)(const struct sock *sk, struct sk_buff *skb);
+	size_t (*get_info_size)(const struct sock *sk);
 
 	char		name[TCP_ULP_NAME_MAX];
 	struct module	*owner;
diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index e8baca85bac6..e2c6273274f3 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -153,11 +153,19 @@ enum {
 	INET_DIAG_BBRINFO,	/* request as INET_DIAG_VEGASINFO */
 	INET_DIAG_CLASS_ID,	/* request as INET_DIAG_TCLASS */
 	INET_DIAG_MD5SIG,
+	INET_DIAG_ULP_INFO,
 	__INET_DIAG_MAX,
 };
 
 #define INET_DIAG_MAX (__INET_DIAG_MAX - 1)
 
+enum {
+	INET_ULP_INFO_UNSPEC,
+	INET_ULP_INFO_NAME,
+	__INET_ULP_INFO_MAX,
+};
+#define INET_ULP_INFO_MAX (__INET_ULP_INFO_MAX - 1)
+
 /* INET_DIAG_MEM */
 
 struct inet_diag_meminfo {
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index a3a386236d93..babc156deabb 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -81,13 +81,42 @@ static int tcp_diag_put_md5sig(struct sk_buff *skb,
 }
 #endif
 
+static int tcp_diag_put_ulp(struct sk_buff *skb, struct sock *sk,
+			    const struct tcp_ulp_ops *ulp_ops)
+{
+	struct nlattr *nest;
+	int err;
+
+	nest = nla_nest_start_noflag(skb, INET_DIAG_ULP_INFO);
+	if (!nest)
+		return -EMSGSIZE;
+
+	err = nla_put_string(skb, INET_ULP_INFO_NAME, ulp_ops->name);
+	if (err)
+		goto nla_failure;
+
+	if (ulp_ops->get_info)
+		err = ulp_ops->get_info(sk, skb);
+	if (err)
+		goto nla_failure;
+
+	nla_nest_end(skb, nest);
+	return 0;
+
+nla_failure:
+	nla_nest_cancel(skb, nest);
+	return err;
+}
+
 static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
 			    struct sk_buff *skb)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	int err = 0;
+
 #ifdef CONFIG_TCP_MD5SIG
 	if (net_admin) {
 		struct tcp_md5sig_info *md5sig;
-		int err = 0;
 
 		rcu_read_lock();
 		md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
@@ -99,11 +128,21 @@ static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
 	}
 #endif
 
+	if (net_admin) {
+		const struct tcp_ulp_ops *ulp_ops;
+
+		ulp_ops = icsk->icsk_ulp_ops;
+		if (ulp_ops)
+			err = tcp_diag_put_ulp(skb, sk, ulp_ops);
+		if (err)
+			return err;
+	}
 	return 0;
 }
 
 static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	size_t size = 0;
 
 #ifdef CONFIG_TCP_MD5SIG
@@ -124,6 +163,17 @@ static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
 	}
 #endif
 
+	if (net_admin) {
+		const struct tcp_ulp_ops *ulp_ops;
+
+		ulp_ops = icsk->icsk_ulp_ops;
+		if (ulp_ops) {
+			size += nla_total_size(0) +
+				nla_total_size(TCP_ULP_NAME_MAX);
+			if (ulp_ops->get_info_size)
+				size += ulp_ops->get_info_size(sk);
+		}
+	}
 	return size;
 }
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v2 1/3] net/tls: use RCU protection on icsk->icsk_ulp_data
From: Davide Caratti @ 2019-08-29 16:48 UTC (permalink / raw)
  To: borisp, jakub.kicinski, Eric Dumazet
  Cc: aviadye, davejwatson, davem, john.fastabend, Matthieu Baerts,
	netdev
In-Reply-To: <cover.1567095873.git.dcaratti@redhat.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>

We need to make sure context does not get freed while diag
code is interrogating it. Free struct tls_context with
kfree_rcu().

We add the __rcu annotation directly in icsk, and cast it
away in the datapath accessor. Presumably all ULPs will
do a similar thing.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/net/inet_connection_sock.h |  2 +-
 include/net/tls.h                  |  9 +++++++--
 net/core/sock_map.c                |  2 +-
 net/tls/tls_device.c               |  2 +-
 net/tls/tls_main.c                 | 26 +++++++++++++++++++-------
 5 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c57d53e7e02c..895546058a20 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -97,7 +97,7 @@ struct inet_connection_sock {
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	const struct tcp_ulp_ops  *icsk_ulp_ops;
-	void			  *icsk_ulp_data;
+	void __rcu		  *icsk_ulp_data;
 	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
 	struct hlist_node         icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
diff --git a/include/net/tls.h b/include/net/tls.h
index 41b2d41bb1b8..4997742475cd 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -41,6 +41,7 @@
 #include <linux/tcp.h>
 #include <linux/skmsg.h>
 #include <linux/netdevice.h>
+#include <linux/rcupdate.h>
 
 #include <net/tcp.h>
 #include <net/strparser.h>
@@ -290,6 +291,7 @@ struct tls_context {
 
 	struct list_head list;
 	refcount_t refcount;
+	struct rcu_head rcu;
 };
 
 enum tls_offload_ctx_dir {
@@ -348,7 +350,7 @@ struct tls_offload_context_rx {
 #define TLS_OFFLOAD_CONTEXT_SIZE_RX					\
 	(sizeof(struct tls_offload_context_rx) + TLS_DRIVER_STATE_SIZE_RX)
 
-void tls_ctx_free(struct tls_context *ctx);
+void tls_ctx_free(struct sock *sk, struct tls_context *ctx);
 int wait_on_pending_writer(struct sock *sk, long *timeo);
 int tls_sk_query(struct sock *sk, int optname, char __user *optval,
 		int __user *optlen);
@@ -467,7 +469,10 @@ static inline struct tls_context *tls_get_ctx(const struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
-	return icsk->icsk_ulp_data;
+	/* Use RCU on icsk_ulp_data only for sock diag code,
+	 * TLS data path doesn't need rcu_dereference().
+	 */
+	return (__force void *)icsk->icsk_ulp_data;
 }
 
 static inline void tls_advance_record_sn(struct sock *sk,
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 1330a7442e5b..01998860afaa 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -345,7 +345,7 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx,
 		return -EINVAL;
 	if (unlikely(idx >= map->max_entries))
 		return -E2BIG;
-	if (unlikely(icsk->icsk_ulp_data))
+	if (unlikely(rcu_access_pointer(icsk->icsk_ulp_data)))
 		return -EINVAL;
 
 	link = sk_psock_init_link();
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index a470df7ffcf9..e188139f0464 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -61,7 +61,7 @@ static void tls_device_free_ctx(struct tls_context *ctx)
 	if (ctx->rx_conf == TLS_HW)
 		kfree(tls_offload_ctx_rx(ctx));
 
-	tls_ctx_free(ctx);
+	tls_ctx_free(NULL, ctx);
 }
 
 static void tls_device_gc_task(struct work_struct *work)
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 43252a801c3f..f8f2d2c3d627 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -251,14 +251,26 @@ static void tls_write_space(struct sock *sk)
 	ctx->sk_write_space(sk);
 }
 
-void tls_ctx_free(struct tls_context *ctx)
+/**
+ * tls_ctx_free() - free TLS ULP context
+ * @sk:  socket to with @ctx is attached
+ * @ctx: TLS context structure
+ *
+ * Free TLS context. If @sk is %NULL caller guarantees that the socket
+ * to which @ctx was attached has no outstanding references.
+ */
+void tls_ctx_free(struct sock *sk, struct tls_context *ctx)
 {
 	if (!ctx)
 		return;
 
 	memzero_explicit(&ctx->crypto_send, sizeof(ctx->crypto_send));
 	memzero_explicit(&ctx->crypto_recv, sizeof(ctx->crypto_recv));
-	kfree(ctx);
+
+	if (sk)
+		kfree_rcu(ctx, rcu);
+	else
+		kfree(ctx);
 }
 
 static void tls_sk_proto_cleanup(struct sock *sk,
@@ -306,7 +318,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 
 	write_lock_bh(&sk->sk_callback_lock);
 	if (free_ctx)
-		icsk->icsk_ulp_data = NULL;
+		rcu_assign_pointer(icsk->icsk_ulp_data, NULL);
 	sk->sk_prot = ctx->sk_proto;
 	if (sk->sk_write_space == tls_write_space)
 		sk->sk_write_space = ctx->sk_write_space;
@@ -321,7 +333,7 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
 	ctx->sk_proto_close(sk, timeout);
 
 	if (free_ctx)
-		tls_ctx_free(ctx);
+		tls_ctx_free(sk, ctx);
 }
 
 static int do_tls_getsockopt_tx(struct sock *sk, char __user *optval,
@@ -610,7 +622,7 @@ static struct tls_context *create_ctx(struct sock *sk)
 	if (!ctx)
 		return NULL;
 
-	icsk->icsk_ulp_data = ctx;
+	rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
 	ctx->setsockopt = sk->sk_prot->setsockopt;
 	ctx->getsockopt = sk->sk_prot->getsockopt;
 	ctx->sk_proto_close = sk->sk_prot->close;
@@ -651,8 +663,8 @@ static void tls_hw_sk_destruct(struct sock *sk)
 
 	ctx->sk_destruct(sk);
 	/* Free ctx */
-	tls_ctx_free(ctx);
-	icsk->icsk_ulp_data = NULL;
+	rcu_assign_pointer(icsk->icsk_ulp_data, NULL);
+	tls_ctx_free(sk, ctx);
 }
 
 static int tls_hw_prot(struct sock *sk)
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next v2 0/3] net: tls: add socket diag
From: Davide Caratti @ 2019-08-29 16:48 UTC (permalink / raw)
  To: borisp, jakub.kicinski, Eric Dumazet
  Cc: aviadye, davejwatson, davem, john.fastabend, Matthieu Baerts,
	netdev

The current kernel does not provide any diagnostic tool, except
getsockopt(TCP_ULP), to know more about TCP sockets that have an upper
layer protocol (ULP) on top of them. This series extends the set of
information exported by INET_DIAG_INFO, to include data that are
specific to the ULP (and that might be meaningful for debug/testing
purposes).

patch 1/3 ensures that the control plane reads/updates ULP specific data
using RCU.

patch 2/3 extends INET_DIAG_INFO and allows knowing the ULP name for
each TCP socket that has done setsockopt(TCP_ULP) successfully.

patch 3/3 extends kTLS to let programs like 'ss' know the protocol
version and the cipher in use.

Changes since v1:
- don't worry about grace period when accessing ulp_ops, thanks to
  Jakub Kicinski and Eric Dumazet
- use rcu_dereference() to access ULP data in tls get_info(), and 
  test against NULL value, thanks to Jakub Kicinski
- move RCU protected section inside tls get_info(), thanks to Jakub
  Kicinski

Changes since RFC:
- some coding style fixes, thanks to Jakub Kicinski
- add X_UNSPEC as lowest value of uAPI enums, thanks to Jakub Kicinski
- fix assignment of struct nlattr *start, thanks to Jakub Kicinski
- let tls dump RXCONF and TXCONF, suggested by Jakub Kicinski
- don't dump anything if TLS version or cipher are 0 (but still return a
  constant size in get_aux_size()), thanks to Boris Pismenny
- constify first argument of get_info() and get_size()
- use RCU to access access ulp_ops, like it's done for ca_ops
- add patch 1/3, from Jakub Kicinski


Davide Caratti (2):
  tcp: ulp: add functions to dump ulp-specific information
  net: tls: export protocol version, cipher, tx_conf/rx_conf to socket
    diag

Jakub Kicinski (1):
  net/tls: use RCU protection on icsk->icsk_ulp_data

 include/net/inet_connection_sock.h |  2 +-
 include/net/tcp.h                  |  3 +
 include/net/tls.h                  | 28 +++++++++-
 include/uapi/linux/inet_diag.h     |  9 +++
 include/uapi/linux/tls.h           | 15 +++++
 net/core/sock_map.c                |  2 +-
 net/ipv4/tcp_diag.c                | 52 ++++++++++++++++-
 net/tls/tls_device.c               |  2 +-
 net/tls/tls_main.c                 | 90 +++++++++++++++++++++++++++---
 9 files changed, 190 insertions(+), 13 deletions(-)

-- 
2.20.1


^ permalink raw reply

* [PATCH v3 10/11] udp: Remove unlikely() from IS_ERR*() condition
From: Denis Efremov @ 2019-08-29 16:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denis Efremov, David S. Miller, Joe Perches, Andrew Morton,
	netdev
In-Reply-To: <20190829165025.15750-1-efremov@linux.com>

"unlikely(IS_ERR_OR_NULL(x))" is excessive. IS_ERR_OR_NULL() already uses
unlikely() internally.

Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netdev@vger.kernel.org
---
 include/net/udp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 79d141d2103b..bad74f780831 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -480,7 +480,7 @@ static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
 	 * CB fragment
 	 */
 	segs = __skb_gso_segment(skb, features, false);
-	if (unlikely(IS_ERR_OR_NULL(segs))) {
+	if (IS_ERR_OR_NULL(segs)) {
 		int segs_nr = skb_shinfo(skb)->gso_segs;
 
 		atomic_add(segs_nr, &sk->sk_drops);
-- 
2.21.0


^ permalink raw reply related

* [PATCH v3 03/11] net/mlx5e: Remove unlikely() from WARN*() condition
From: Denis Efremov @ 2019-08-29 16:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denis Efremov, Boris Pismenny, Saeed Mahameed, Leon Romanovsky,
	Joe Perches, Andrew Morton, netdev
In-Reply-To: <20190829165025.15750-1-efremov@linux.com>

"unlikely(WARN_ON_ONCE(x))" is excessive. WARN_ON_ONCE() already uses
unlikely() internally.

Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: Boris Pismenny <borisp@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
index 7833ddef0427..e5222d17df35 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_tx.c
@@ -408,7 +408,7 @@ struct sk_buff *mlx5e_ktls_handle_tx_skb(struct net_device *netdev,
 		goto out;
 
 	tls_ctx = tls_get_ctx(skb->sk);
-	if (unlikely(WARN_ON_ONCE(tls_ctx->netdev != netdev)))
+	if (WARN_ON_ONCE(tls_ctx->netdev != netdev))
 		goto err_out;
 
 	priv_tx = mlx5e_get_ktls_tx_priv_ctx(tls_ctx);
-- 
2.21.0


^ permalink raw reply related

* [PATCH v3 01/11] checkpatch: check for nested (un)?likely() calls
From: Denis Efremov @ 2019-08-29 16:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Denis Efremov, Alexander Viro, Anton Altaparmakov,
	Boris Ostrovsky, Boris Pismenny, Darrick J. Wong, David S. Miller,
	Dennis Dalessandro, Dmitry Torokhov, Inaky Perez-Gonzalez,
	Juergen Gross, Leon Romanovsky, Mike Marciniszyn, Pali Rohár,
	Rob Clark, Saeed Mahameed, Sean Paul, linux-arm-msm,
	linux-fsdevel, linux-input, linux-ntfs-dev, linux-rdma,
	linux-wimax, linux-xfs, xen-devel, netdev, dri-devel, Joe Perches,
	Andrew Morton, Andy Whitcroft

IS_ERR(), IS_ERR_OR_NULL(), IS_ERR_VALUE() and WARN*() already contain
unlikely() optimization internally. Thus, there is no point in calling
these functions and defines under likely()/unlikely().

This check is based on the coccinelle rule developed by Enrico Weigelt
https://lore.kernel.org/lkml/1559767582-11081-1-git-send-email-info@metux.net/

Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Whitcroft <apw@canonical.com>
---
 scripts/checkpatch.pl | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 93a7edfe0f05..56969ce06df4 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -6480,6 +6480,12 @@ sub process {
 			     "Using $1 should generally have parentheses around the comparison\n" . $herecurr);
 		}
 
+# nested likely/unlikely calls
+		if ($line =~ /\b(?:(?:un)?likely)\s*\(\s*!?\s*(IS_ERR(?:_OR_NULL|_VALUE)?|WARN)/) {
+			WARN("LIKELY_MISUSE",
+			     "nested (un)?likely() calls, $1 already uses unlikely() internally\n" . $herecurr);
+		}
+
 # whine mightly about in_atomic
 		if ($line =~ /\bin_atomic\s*\(/) {
 			if ($realfile =~ m@^drivers/@) {
-- 
2.21.0


^ permalink raw reply related

* Re: general protection fault in tls_sk_proto_close (2)
From: Jakub Kicinski @ 2019-08-29 16:43 UTC (permalink / raw)
  To: Hillf Danton
  Cc: john.fastabend, syzbot, aviadye, borisp, daniel, davejwatson,
	davem, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <20190829035200.3340-1-hdanton@sina.com>

On Thu, 29 Aug 2019 11:52:00 +0800, Hillf Danton wrote:
> Alternatively work is done if sock is closed again. Anyway ctx is reset
> under sock's callback lock in write mode.
> 
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -295,6 +295,8 @@ static void tls_sk_proto_close(struct so
>  	long timeo = sock_sndtimeo(sk, 0);
>  	bool free_ctx;
>  
> +	if (!ctx)
> +		return;
>  	if (ctx->tx_conf == TLS_SW)
>  		tls_sw_cancel_work_tx(ctx);

That's no bueno, the real socket's close will never get called.

^ permalink raw reply

* Re: [PATCH net-next] r8152: fix accessing skb after napi_gro_receive
From: Eric Dumazet @ 2019-08-29 16:31 UTC (permalink / raw)
  To: Hayes Wang, netdev; +Cc: nic_swsd, linux-kernel
In-Reply-To: <1394712342-15778-302-Taiwan-albertk@realtek.com>



On 8/19/19 5:15 AM, Hayes Wang wrote:
> Fix accessing skb after napi_gro_receive which is caused by
> commit 47922fcde536 ("r8152: support skb_add_rx_frag").
> 
> Fixes: 47922fcde536 ("r8152: support skb_add_rx_frag")
> Signed-off-by: Hayes Wang <hayeswang@realtek.com>
> ---

It is customary to add a tag to credit the reporter...

Something like :

Reported-by: ....

Thanks.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
From: Andy Lutomirski @ 2019-08-29 16:28 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, David S. Miller, Peter Zijlstra,
	Steven Rostedt, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <536636ad-0baf-31e9-85fe-2591b65068df@iogearbox.net>

> On Aug 29, 2019, at 8:47 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>> On 8/29/19 7:12 AM, Alexei Starovoitov wrote:
>> [...]
>>  +/*
>> + * CAP_BPF allows the following BPF operations:
>> + * - Loading all types of BPF programs
>> + * - Creating all types of BPF maps except:
>> + *    - stackmap that needs CAP_TRACING
>> + *    - devmap that needs CAP_NET_ADMIN
>> + *    - cpumap that needs CAP_SYS_ADMIN
>> + * - Advanced verifier features
>> + *   - Indirect variable access
>> + *   - Bounded loops
>> + *   - BPF to BPF function calls
>> + *   - Scalar precision tracking
>> + *   - Larger complexity limits
>> + *   - Dead code elimination
>> + *   - And potentially other features
>> + * - Use of pointer-to-integer conversions in BPF programs
>> + * - Bypassing of speculation attack hardening measures
>> + * - Loading BPF Type Format (BTF) data
>> + * - Iterate system wide loaded programs, maps, BTF objects
>> + * - Retrieve xlated and JITed code of BPF programs
>> + * - Access maps and programs via id
>> + * - Use bpf_spin_lock() helper
>
> This is still very wide. Consider following example: app has CAP_BPF +
> CAP_NET_ADMIN. Why can't we in this case *only* allow loading networking
> related [plus generic] maps and programs? If it doesn't have CAP_TRACING,
> what would be a reason to allow loading it? Same vice versa. There are
> some misc program types like the infraread stuff, but they could continue
> to live under [CAP_BPF +] CAP_SYS_ADMIN as fallback. I think categorizing
> a specific list of prog and map types might be more clear than disallowing
> some helpers like below (e.g. why choice of bpf_probe_read() but not
> bpf_probe_write_user() etc).

Wow, I didn’t notice that bpf_probe_write_user() existed. That should
need something like CAP_PTRACE or CAP_SYS_ADMIN.

I'm starting to think that something like this:

https://lore.kernel.org/bpf/968f3551247a43e1104b198f2e58fb0595d425e7.1565040372.git.luto@kernel.org/

should maybe be finished before CAP_BPF happens at all.  It really
looks like the bpf operations that need privilege need to get fully
catalogued and dealt with rather than just coming up with a new
capability that covers a huge swath.

(bpf_probe_write_user() is also terminally broken on architectures
like s390x, but that's not really relevant right now.  I'm a bit
surprised it works on x86 with SMAP, though.)

^ permalink raw reply

* [PATCH net-next v2 0/2] Fixes for unlocked cls hardware offload API refactoring
From: Vlad Buslov @ 2019-08-29 16:15 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, saeedm, idosch, sergei.shtylyov,
	Vlad Buslov

Two fixes for my "Refactor cls hardware offload API to support
rtnl-independent drivers" series.

Vlad Buslov (2):
  net: sched: cls_matchall: cleanup flow_action before deallocating
  net/mlx5e: Move local var definition into ifdef block

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 6 ++++--
 net/sched/cls_matchall.c                          | 2 ++
 2 files changed, 6 insertions(+), 2 deletions(-)

-- 
2.21.0


^ permalink raw reply

* [PATCH net-next v2 2/2] net/mlx5e: Move local var definition into ifdef block
From: Vlad Buslov @ 2019-08-29 16:15 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, davem, saeedm, idosch, sergei.shtylyov,
	Vlad Buslov, tanhuazhong
In-Reply-To: <20190829161517.20935-1-vladbu@mellanox.com>

New local variable "struct flow_block_offload *f" was added to
mlx5e_setup_tc() in recent rtnl lock removal patches. The variable is used
in code that is only compiled when CONFIG_MLX5_ESWITCH is enabled. This
results compilation warning about unused variable when CONFIG_MLX5_ESWITCH
is not set. Move the variable definition into eswitch-specific code block
from the beginning of mlx5e_setup_tc() function.

Fixes: c9f14470d048 ("net: sched: add API for registering unlocked offload block callbacks")
Reported-by: tanhuazhong <tanhuazhong@huawei.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---

Notes:
    Changes V1 -> V2:
    
    - Fixed error in commit message.

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8592b98d0e70..c10a1fc8e469 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3470,16 +3470,18 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			  void *type_data)
 {
 	struct mlx5e_priv *priv = netdev_priv(dev);
-	struct flow_block_offload *f = type_data;
 
 	switch (type) {
 #ifdef CONFIG_MLX5_ESWITCH
-	case TC_SETUP_BLOCK:
+	case TC_SETUP_BLOCK: {
+		struct flow_block_offload *f = type_data;
+
 		f->unlocked_driver_cb = true;
 		return flow_block_cb_setup_simple(type_data,
 						  &mlx5e_block_cb_list,
 						  mlx5e_setup_tc_block_cb,
 						  priv, priv, true);
+	}
 #endif
 	case TC_SETUP_QDISC_MQPRIO:
 		return mlx5e_setup_tc_mqprio(priv, type_data);
-- 
2.21.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox