* Re: [PATCH v2 2/2] PTP: add support for one-shot output
From: Richard Cochran @ 2019-08-29 17:25 UTC (permalink / raw)
To: Felipe Balbi; +Cc: Christopher S Hall, netdev, linux-kernel
In-Reply-To: <20190829095825.2108-2-felipe.balbi@linux.intel.com>
On Thu, Aug 29, 2019 at 12:58:25PM +0300, Felipe Balbi wrote:
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index 98ec1395544e..a407e5f76e2d 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -177,9 +177,8 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
> err = -EFAULT;
> break;
> }
> - if ((req.perout.flags || req.perout.rsv[0] || req.perout.rsv[1]
> - || req.perout.rsv[2] || req.perout.rsv[3])
> - && cmd == PTP_PEROUT_REQUEST2) {
> + if ((req.perout.rsv[0] || req.perout.rsv[1] || req.perout.rsv[2]
> + || req.perout.rsv[3]) && cmd == PTP_PEROUT_REQUEST2) {
Please check that the reserved bits of req.perout.flags, namely
~PTP_PEROUT_ONE_SHOT, are clear.
> err = -EINVAL;
> break;
> } else if (cmd == PTP_PEROUT_REQUEST) {
> diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
> index 039cd62ec706..95840e5f5c53 100644
> --- a/include/uapi/linux/ptp_clock.h
> +++ b/include/uapi/linux/ptp_clock.h
> @@ -67,7 +67,9 @@ struct ptp_perout_request {
> struct ptp_clock_time start; /* Absolute start time. */
> struct ptp_clock_time period; /* Desired period, zero means disable. */
> unsigned int index; /* Which channel to configure. */
> - unsigned int flags; /* Reserved for future use. */
> +
> +#define PTP_PEROUT_ONE_SHOT BIT(0)
> + unsigned int flags;
@davem Any CodingStyle policy on #define within a struct? (Some
maintainers won't allow it.)
> unsigned int rsv[4]; /* Reserved for future use. */
> };
>
> --
> 2.23.0
>
Thanks,
Richard
^ permalink raw reply
* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
From: Alexei Starovoitov @ 2019-08-29 17:25 UTC (permalink / raw)
To: Nicolas Dichtel
Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
kernel-team, linux-api
In-Reply-To: <a5ef2f94-acca-eb66-b48c-899494a9f8d0@6wind.com>
On Thu, Aug 29, 2019 at 03:36:42PM +0200, Nicolas Dichtel wrote:
> Le 29/08/2019 à 07:12, Alexei Starovoitov a écrit :
> [snip]
> > CAP_BPF and CAP_NET_ADMIN together allow the following:
> > - Attach to cgroup-bpf hooks and query
> > - skb, xdp, flow_dissector test_run command
> >
> > CAP_NET_ADMIN allows:
> > - Attach networking bpf programs to xdp, tc, lwt, flow dissector
> I'm not sure to understand the difference between these last two points.
> But, with the current kernel, CAP_NET_ADMIN is not enough to attach bpf prog
> with tc and it's still not enough after your patch.
> The following command is rejected:
> $ tc filter add dev eth0 ingress matchall action bpf obj ./tc_test_kern.o sec test
>
> Prog section 'test' rejected: Operation not permitted (1)!
> - Type: 4
> - Instructions: 22 (0 over limit)
> - License: GPL
>
> Verifier analysis:
>
> Error fetching program/map!
> bad action parsing
> parse_action: bad value (5:bpf)!
> Illegal "action"
because tc/iproute2 is doing load and attach.
Currently load needs cap_sys_admin and
attach needs cap_net_admin.
^ permalink raw reply
* Re: [PATCH bpf-next 0/3] tools: bpftool: improve bpftool build experience
From: Jakub Kicinski @ 2019-08-29 17:25 UTC (permalink / raw)
To: Quentin Monnet
Cc: Alexei Starovoitov, Daniel Borkmann, bpf, netdev, oss-drivers,
Lorenz Bauer, Ilya Leoshkevich
In-Reply-To: <20190829105645.12285-1-quentin.monnet@netronome.com>
On Thu, 29 Aug 2019 11:56:42 +0100, Quentin Monnet wrote:
> Hi,
> This set attempts to make it easier to build bpftool, in particular when
> passing a specific output directory. This is a follow-up to the
> conversation held last month by Lorenz, Ilya and Jakub [0].
>
> The first patch is a minor fix to bpftool's Makefile, regarding the
> retrieval of kernel version (which currently prints a non-relevant make
> warning on some invocations).
>
> Second patch improves the Makefile commands to support more "make"
> invocations, or to fix building with custom output directory. On Jakub's
> suggestion, a script is also added to BPF selftests in order to keep track
> of the supported build variants.
>
> At last, third patch is a sligthly modified version of Ilya's fix regarding
> libbpf.a appearing twice on the linking command for bpftool.
>
> [0] https://lore.kernel.org/bpf/CACAyw9-CWRHVH3TJ=Tke2x8YiLsH47sLCijdp=V+5M836R9aAA@mail.gmail.com/
I think Ilya has a point, but otherwise looks good to me :)
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
^ permalink raw reply
* Re: [PATCH v2 2/2] PTP: add support for one-shot output
From: Richard Cochran @ 2019-08-29 17:28 UTC (permalink / raw)
To: Felipe Balbi; +Cc: Christopher S Hall, netdev, linux-kernel, davem
In-Reply-To: <20190829172509.GB2166@localhost>
Adding davem onto CC...
On Thu, Aug 29, 2019 at 12:58:25PM +0300, Felipe Balbi wrote:
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index 98ec1395544e..a407e5f76e2d 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -177,9 +177,8 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
> err = -EFAULT;
> break;
> }
> - if ((req.perout.flags || req.perout.rsv[0] || req.perout.rsv[1]
> - || req.perout.rsv[2] || req.perout.rsv[3])
> - && cmd == PTP_PEROUT_REQUEST2) {
> + if ((req.perout.rsv[0] || req.perout.rsv[1] || req.perout.rsv[2]
> + || req.perout.rsv[3]) && cmd == PTP_PEROUT_REQUEST2) {
Please check that the reserved bits of req.perout.flags, namely
~PTP_PEROUT_ONE_SHOT, are clear.
> err = -EINVAL;
> break;
> } else if (cmd == PTP_PEROUT_REQUEST) {
> diff --git a/include/uapi/linux/ptp_clock.h b/include/uapi/linux/ptp_clock.h
> index 039cd62ec706..95840e5f5c53 100644
> --- a/include/uapi/linux/ptp_clock.h
> +++ b/include/uapi/linux/ptp_clock.h
> @@ -67,7 +67,9 @@ struct ptp_perout_request {
> struct ptp_clock_time start; /* Absolute start time. */
> struct ptp_clock_time period; /* Desired period, zero means disable. */
> unsigned int index; /* Which channel to configure. */
> - unsigned int flags; /* Reserved for future use. */
> +
> +#define PTP_PEROUT_ONE_SHOT BIT(0)
> + unsigned int flags;
@davem Any CodingStyle policy on #define within a struct? (Some
maintainers won't allow it.)
> unsigned int rsv[4]; /* Reserved for future use. */
> };
>
> --
> 2.23.0
>
Thanks,
Richard
^ permalink raw reply
* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
From: Alexei Starovoitov @ 2019-08-29 17:28 UTC (permalink / raw)
To: Song Liu
Cc: Alexei Starovoitov, Andy Lutomirski, davem@davemloft.net,
peterz@infradead.org, rostedt@goodmis.org, netdev@vger.kernel.org,
bpf@vger.kernel.org, Kernel Team, linux-api@vger.kernel.org
In-Reply-To: <B28631A9-BB92-404A-BD58-7A737BCF10C9@fb.com>
On Thu, Aug 29, 2019 at 06:04:42AM +0000, Song Liu wrote:
>
>
> > On Aug 28, 2019, at 10:12 PM, Alexei Starovoitov <ast@kernel.org> wrote:
> >
>
> [...]
>
> > diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
> > index 44e2d640b088..91a7f25512ca 100644
> > --- a/tools/testing/selftests/bpf/test_verifier.c
> > +++ b/tools/testing/selftests/bpf/test_verifier.c
> > @@ -805,10 +805,20 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
> > }
> > }
> >
> > +struct libcap {
> > + struct __user_cap_header_struct hdr;
> > + struct __user_cap_data_struct data[2];
> > +};
> > +
>
> I am confused by struct libcap. Why do we need it?
because libcap is not compatible with new kernel.
It needs to be recompiled with new capability.h
Otherwise it limits max to CAP_AUDIT_READ
Any value higher it will error during cap_get_flag.
And will silently ignore it during cap_set_flag.
Not a great library decision.
Thankfully this struct above is exactly the kernel api.
One doesn't really need libcap. It's imo easier to do without it.
^ permalink raw reply
* Re: [PATCH v2 bpf-next 2/3] bpf: implement CAP_BPF
From: Alexei Starovoitov @ 2019-08-29 17:30 UTC (permalink / raw)
To: Daniel Borkmann
Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
kernel-team, linux-api
In-Reply-To: <ed8796f5-eaea-c87d-ddd9-9d624059e5ee@iogearbox.net>
On Thu, Aug 29, 2019 at 05:32:27PM +0200, Daniel Borkmann wrote:
> On 8/29/19 7:12 AM, Alexei Starovoitov wrote:
> > Implement permissions as stated in uapi/linux/capability.h
> >
> > Note that CAP_SYS_ADMIN is replaced with CAP_BPF.
> > All existing applications that use BPF do not drop all caps
> > and keep only CAP_SYS_ADMIN before doing bpf() syscall.
> > Hence it's highly unlikely that existing code will break.
> > If there will be reports of breakage then CAP_SYS_ADMIN
> > would be allowed as well with "it's usage is deprecated" message
> > similar to commit ee24aebffb75 ("cap_syslog: accept CAP_SYS_ADMIN for now")
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> [...]
> > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > index 22066a62c8c9..f459315625ac 100644
> > --- a/kernel/bpf/hashtab.c
> > +++ b/kernel/bpf/hashtab.c
> > @@ -244,9 +244,9 @@ static int htab_map_alloc_check(union bpf_attr *attr)
> > BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
> > offsetof(struct htab_elem, hash_node.pprev));
> > - if (lru && !capable(CAP_SYS_ADMIN))
> > + if (lru && !capable(CAP_BPF))
> > /* LRU implementation is much complicated than other
> > - * maps. Hence, limit to CAP_SYS_ADMIN for now.
> > + * maps. Hence, limit to CAP_BPF.
> > */
> > return -EPERM;
> I don't think this works, this is pretty much going to break use cases where
> orchestration daemons are deployed as containers that are explicitly granted
> specified cap set and right now this is CAP_SYS_ADMIN and not CAP_BPF for bpf().
> The former needs to be a superset of the latter in order for this to work and
> not break compatibility between kernel upgrades.
>
> - https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container
> - https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities
These are the links that showing that k8 can delegates caps.
Are you saying that you know of folks who specifically
delegate cap_sys_admin and cap_net_admin _only_ to a container to run bpf in there?
^ permalink raw reply
* Re: [v1] net_sched: act_police: add 2 new attributes to support police 64bit rate and peakrate
From: David Z. Dai @ 2019-08-29 17:36 UTC (permalink / raw)
To: Eric Dumazet; +Cc: jhs, xiyou.wangcong, jiri, davem, netdev, linux-kernel, zdai
In-Reply-To: <7a8a5024-bbff-7443-71b3-9e3976af269f@gmail.com>
On Thu, 2019-08-29 at 10:32 +0200, Eric Dumazet wrote:
>
> On 8/29/19 12:51 AM, David Dai wrote:
> > For high speed adapter like Mellanox CX-5 card, it can reach upto
> > 100 Gbits per second bandwidth. Currently htb already supports 64bit rate
> > in tc utility. However police action rate and peakrate are still limited
> > to 32bit value (upto 32 Gbits per second). Add 2 new attributes
> > TCA_POLICE_RATE64 and TCA_POLICE_RATE64 in kernel for 64bit support
> > so that tc utility can use them for 64bit rate and peakrate value to
> > break the 32bit limit, and still keep the backward binary compatibility.
> >
> > Tested-by: David Dai <zdai@linux.vnet.ibm.com>
> > Signed-off-by: David Dai <zdai@linux.vnet.ibm.com>
> > ---
> > include/uapi/linux/pkt_cls.h | 2 ++
> > net/sched/act_police.c | 27 +++++++++++++++++++++++----
> > 2 files changed, 25 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
> > index b057aee..eb4ea4d 100644
> > --- a/include/uapi/linux/pkt_cls.h
> > +++ b/include/uapi/linux/pkt_cls.h
> > @@ -159,6 +159,8 @@ enum {
> > TCA_POLICE_AVRATE,
> > TCA_POLICE_RESULT,
> > TCA_POLICE_TM,
> > + TCA_POLICE_RATE64,
> > + TCA_POLICE_PEAKRATE64,
> > TCA_POLICE_PAD,
> > __TCA_POLICE_MAX
> > #define TCA_POLICE_RESULT TCA_POLICE_RESULT
>
> Never insert new attributes, as this breaks compatibility with old binaries (including
> old kernels)
Thanks for reviewing it!
My change is only contained within the police part. I am trying to
follow the same way htb and tbf support their 64 bit rate.
I tested the old tc binary with the newly patched kernel. It works fine.
I agree the newly compiled tc binary that has these 2 new attributes can
cause backward compatibility issue when running on the old kernel.
If can't insert new attribute, is there any
comment/suggestion/alternative on how to support 64bit police rate and
still keep the backward compatibility?
> Keep TCA_POLICE_PAD value the same, thanks.
^ permalink raw reply
* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Andy Lutomirski @ 2019-08-29 17:36 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andy Lutomirski, Steven Rostedt, Peter Zijlstra,
Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
Masami Hiramatsu, David S. Miller, Daniel Borkmann,
Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20190829172309.xd73ax4wgsjmv6zg@ast-mbp.dhcp.thefacebook.com>
On Thu, Aug 29, 2019 at 10:23 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Aug 29, 2019 at 08:43:23AM -0700, Andy Lutomirski wrote:
> >
> > I can imagine splitting it into three capabilities:
> >
> > CAP_TRACE_KERNEL: learn which kernel functions are called when. This
> > would allow perf profiling, for example, but not sampling of kernel
> > regs.
> >
> > CAP_TRACE_READ_KERNEL_DATA: allow the tracing, profiling, etc features
> > that can read the kernel's data. So you get function arguments via
> > kprobe, kernel regs, and APIs that expose probe_kernel_read()
> >
> > CAP_TRACE_USER: trace unrelated user processes
> >
> > I'm not sure the code is written in a way that makes splitting
> > CAP_TRACE_KERNEL and CAP_TRACE_READ_KERNEL_DATA, and I'm not sure that
> > CAP_TRACE_KERNEL is all that useful except for plain perf record
> > without CAP_TRACE_READ_KERNEL_DATA. What do you all think? I suppose
> > it could also be:
> >
> > CAP_PROFILE_KERNEL: Use perf with events that aren't kprobes or
> > tracepoints. Does not grant the ability to sample regs or the kernel
> > stack directly.
> >
> > CAP_TRACE_KERNEL: Use all of perf, ftrace, kprobe, etc.
> >
> > CAP_TRACE_USER: Use all of perf with scope limited to user mode and uprobes.
>
> imo that makes little sense from security pov, since
> such CAP_TRACE_KERNEL (ex kprobe) can trace "unrelated user process"
> just as well. Yet not letting it do cleanly via uprobe.
> Sort of like giving a spare key for back door of the house and
> saying no, you cannot have main door key.
>
Not all combinations of capabilities make total sense. CAP_SETUID,
for example, generally lets you get all the other capabilities.
CAP_TRACE_KERNEL + CAP_TRACE_USER makes sense. CAP_TRACE_USER by
itself makes sense. CAP_TRACE_READ_KERNEL_DATA without
CAP_TRACE_KERNEL does not. I don't think this is a really a problem.
^ permalink raw reply
* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Steven Rostedt @ 2019-08-29 17:47 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Peter Zijlstra, Andy Lutomirski, Alexei Starovoitov, Kees Cook,
LSM List, James Morris, Jann Horn, Masami Hiramatsu,
David S. Miller, Daniel Borkmann, Network Development, bpf,
kernel-team, Linux API
In-Reply-To: <20190829171922.hkuceiurscsxk5jq@ast-mbp.dhcp.thefacebook.com>
On Thu, 29 Aug 2019 10:19:24 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 29, 2019 at 09:34:34AM -0400, Steven Rostedt wrote:
> >
> > As the above seems to favor the idea of CAP_TRACING allowing write
> > access to tracefs, should we have a CAP_TRACING_RO for just read access
> > and limited perf abilities?
>
> read only vs writeable is an attribute of the file system.
> Bringing such things into caps seem wrong to me.
So using groups then? I'm fine with that.
-- Steve
^ permalink raw reply
* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Steven Rostedt @ 2019-08-29 17:49 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andy Lutomirski, Peter Zijlstra, Alexei Starovoitov, Kees Cook,
LSM List, James Morris, Jann Horn, Masami Hiramatsu,
David S. Miller, Daniel Borkmann, Network Development, bpf,
kernel-team, Linux API
In-Reply-To: <20190829172309.xd73ax4wgsjmv6zg@ast-mbp.dhcp.thefacebook.com>
On Thu, 29 Aug 2019 10:23:10 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > CAP_TRACE_KERNEL: Use all of perf, ftrace, kprobe, etc.
> >
> > CAP_TRACE_USER: Use all of perf with scope limited to user mode and uprobes.
>
> imo that makes little sense from security pov, since
> such CAP_TRACE_KERNEL (ex kprobe) can trace "unrelated user process"
> just as well. Yet not letting it do cleanly via uprobe.
> Sort of like giving a spare key for back door of the house and
> saying no, you cannot have main door key.
I took it as CAP_TRACE_KERNEL as a superset of CAP_TRACE_USER. That is,
if you have CAP_TRACE_KERNEL, by default you get USER. Where as
CAP_TRACE_USER, is much more limiting.
-- Steve
^ permalink raw reply
* Re: [PATCH] amd-xgbe: Fix error path in xgbe_mod_init()
From: Jakub Kicinski @ 2019-08-29 17:52 UTC (permalink / raw)
To: YueHaibing; +Cc: thomas.lendacky, davem, netdev, linux-kernel
In-Reply-To: <20190829024600.16052-1-yuehaibing@huawei.com>
On Thu, 29 Aug 2019 10:46:00 +0800, YueHaibing wrote:
> In xgbe_mod_init(), we should do cleanup if some error occurs
>
> Reported-by: Hulk Robot <hulkci@huawei.com>
> Fixes: efbaa828330a ("amd-xgbe: Add support to handle device renaming")
> Fixes: 47f164deab22 ("amd-xgbe: Add PCI device support")
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Looks correct.
For networking fixes please try to use [PATCH net] as a tag ([PATCH
net-next] for normal, non-fix patches).
^ permalink raw reply
* [PATCH net-next] net: Fail explicit bind to local reserved ports
From: Subash Abhinov Kasiviswanathan @ 2019-08-29 3:26 UTC (permalink / raw)
To: netdev, davem; +Cc: Subash Abhinov Kasiviswanathan, Sean Tranchetti
Reserved ports may have some special use cases which are not suitable
for use by general userspace applications. Currently, ports specified
in ip_local_reserved_ports will not be returned only in case of
automatic port assignment.
In some cases, it maybe required to prevent the host from assigning
the ports even in case of explicit binds. Consider the case of a
transparent proxy where packets are being redirected. In case a socket
matches this connection, packets from this application would be
incorrectly sent to one of the endpoints.
Add a boolean sysctl flag 'reserved_port_bind'. Default value is 1
which preserves the existing behavior. Setting the value to 0 will
prevent userspace applications from binding to these ports even when
they are explicitly requested.
Cc: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
Documentation/networking/ip-sysctl.txt | 5 +++++
include/net/netns/ipv4.h | 2 ++
net/ipv4/af_inet.c | 3 +++
net/ipv4/inet_connection_sock.c | 7 +++++++
net/ipv4/sysctl_net_ipv4.c | 7 +++++++
net/ipv4/udp.c | 5 +++++
6 files changed, 29 insertions(+)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 49e95f4..8a9d649 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -945,6 +945,11 @@ ip_unprivileged_port_start - INTEGER
Default: 1024
+reserved_port_bind - BOOLEAN
+ If set, allows explicit bind requests to applications requesting
+ any port within the range of ip_local_reserved_ports.
+ Default: 1
+
ip_nonlocal_bind - BOOLEAN
If set, allows processes to bind() to non-local IP addresses,
which can be quite useful - but may break some applications.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c0c0791..0941369 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -107,6 +107,8 @@ struct netns_ipv4 {
#ifdef CONFIG_NET_L3_MASTER_DEV
int sysctl_raw_l3mdev_accept;
#endif
+ int sysctl_reserved_port_bind;
+
int sysctl_tcp_early_demux;
int sysctl_udp_early_demux;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 70f92aa..e1ad45d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1814,6 +1814,9 @@ static __net_init int inet_init_net(struct net *net)
net->ipv4.ip_local_ports.range[0] = 32768;
net->ipv4.ip_local_ports.range[1] = 60999;
+ /* Allow explicit binding to reserved ports */
+ net->ipv4.sysctl_reserved_port_bind = 1;
+
seqlock_init(&net->ipv4.ping_group_range.lock);
/*
* Sane defaults - nobody may create ping sockets.
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f5c163d..6dda979 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -307,6 +307,13 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
head = &hinfo->bhash[inet_bhashfn(net, port,
hinfo->bhash_size)];
spin_lock_bh(&head->lock);
+
+ if (inet_is_local_reserved_port(net, snum) &&
+ !net->ipv4.sysctl_reserved_port_bind) {
+ ret = 1;
+ goto fail_unlock;
+ }
+
inet_bind_bucket_for_each(tb, &head->chain)
if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
tb->port == port)
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 59ded25..557fdec 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -742,6 +742,13 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
.proc_handler = proc_do_large_bitmap,
},
{
+ .procname = "reserved_port_bind",
+ .data = &init_net.ipv4.sysctl_reserved_port_bind,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+ {
.procname = "ip_no_pmtu_disc",
.data = &init_net.ipv4.sysctl_ip_no_pmtu_disc,
.maxlen = sizeof(int),
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index d88821c..59a4274 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -274,6 +274,11 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
} else {
hslot = udp_hashslot(udptable, net, snum);
spin_lock_bh(&hslot->lock);
+
+ if (inet_is_local_reserved_port(net, snum) &&
+ !net->ipv4.sysctl_reserved_port_bind)
+ goto fail_unlock;
+
if (hslot->count > 10) {
int exist;
unsigned int slot2 = udp_sk(sk)->udp_portaddr_hash ^ snum;
--
1.9.1
^ permalink raw reply related
* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Ido Schimmel @ 2019-08-29 17:57 UTC (permalink / raw)
To: Andrew Lunn
Cc: Jiri Pirko, Horatiu Vultur, alexandre.belloni, UNGLinuxDriver,
davem, allan.nielsen, ivecera, f.fainelli, netdev, linux-kernel
In-Reply-To: <20190829143732.GB17864@lunn.ch>
On Thu, Aug 29, 2019 at 04:37:32PM +0200, Andrew Lunn wrote:
> > Wait, I believe there has been some misundestanding. Promisc mode is NOT
> > about getting packets to the cpu. It's about setting hw filters in a way
> > that no rx packet is dropped.
> >
> > If you want to get packets from the hw forwarding dataplane to cpu, you
> > should not use promisc mode for that. That would be incorrect.
>
> Hi Jiri
>
> I'm not sure a wireshark/tcpdump/pcap user would agree with you. They
> want to see packets on an interface, so they use these tools. The fact
> that the interface is a switch interface should not matter. The
> switchdev model is that we try to hide away the interface happens to
> be on a switch, you can just use it as normal. So why should promisc
> mode not work as normal?
Hi Andrew,
What happens when you run tcpdump on a routed interface without putting
it in promiscuous mode ('-p')? If it is a pure software switch, then you
see all unicast packets addressed to your interface's MAC address. What
happens when the same is done on a hardware switch? With the proposed
solution you will not get the same result.
On a software switch, when you run tcpdump without '-p', do you incur
major packet loss? No. Will this happen when you punt several Tbps to
your CPU on the hardware switch? Yes.
Extending the definition of promiscuous mode to mean punt all traffic to
the CPU is wrong, IMO. You will not be able to capture all the packets
anyway. If you have both elephant and mice flows, then it is very likely
you will not be able to see any packets from the mice flows. The way
this kind of monitoring is usually done is by either sampling packets
(see tc-sample) or mirroring it to capable server. Both options are
available in Linux today.
> > If you want to get packets from the hw forwarding dataplane to cpu, you
> > should use tc trap action. It is there exactly for this purpose.
>
> Do you really think a wireshark/tcpdump/pcap user should need to use
> tc trap for the special case the interface is a switch port? Doesn't that
> break the switchdev model?
I do not object to the overall goal, but I believe to implementation is
wrong. Instead, it would be much better to extend tshark/tcpdump and
with another flag that will instruct libpcap to install a rule that will
trap all traffic to the CPU. You can do that on either ingress or egress
using matchall and trap action.
If you want to do it without specifying a special flag (I think it's
very dangerous due to the potential packet loss), you can add a flag to
the interface that will indicate to libpcap that installing a tc filter
with trap action is required.
> tc trap is more about fine grained selection of packets.
Depends on the filter you associate with the action. If it's matchall,
then it's not fine grained at all :)
> Also, it seems like trapped packets are not forwarded, which is not
> what you would expect from wireshark/tcpdump/pcap.
How do you mean? Not forwarded by the HW? Right. But the trapped packets
are forwarded by the kernel. We can also add another action that means
both trap and forward. In mlxsw terminology it's called mirror.
^ permalink raw reply
* Re: [v1] net_sched: act_police: add 2 new attributes to support police 64bit rate and peakrate
From: Toke Høiland-Jørgensen @ 2019-08-29 17:58 UTC (permalink / raw)
To: David Z. Dai, Eric Dumazet
Cc: jhs, xiyou.wangcong, jiri, davem, netdev, linux-kernel, zdai
In-Reply-To: <1567100185.20025.3.camel@oc5348122405>
"David Z. Dai" <zdai@linux.vnet.ibm.com> writes:
> On Thu, 2019-08-29 at 10:32 +0200, Eric Dumazet wrote:
>>
>> On 8/29/19 12:51 AM, David Dai wrote:
>> > For high speed adapter like Mellanox CX-5 card, it can reach upto
>> > 100 Gbits per second bandwidth. Currently htb already supports 64bit rate
>> > in tc utility. However police action rate and peakrate are still limited
>> > to 32bit value (upto 32 Gbits per second). Add 2 new attributes
>> > TCA_POLICE_RATE64 and TCA_POLICE_RATE64 in kernel for 64bit support
>> > so that tc utility can use them for 64bit rate and peakrate value to
>> > break the 32bit limit, and still keep the backward binary compatibility.
>> >
>> > Tested-by: David Dai <zdai@linux.vnet.ibm.com>
>> > Signed-off-by: David Dai <zdai@linux.vnet.ibm.com>
>> > ---
>> > include/uapi/linux/pkt_cls.h | 2 ++
>> > net/sched/act_police.c | 27 +++++++++++++++++++++++----
>> > 2 files changed, 25 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
>> > index b057aee..eb4ea4d 100644
>> > --- a/include/uapi/linux/pkt_cls.h
>> > +++ b/include/uapi/linux/pkt_cls.h
>> > @@ -159,6 +159,8 @@ enum {
>> > TCA_POLICE_AVRATE,
>> > TCA_POLICE_RESULT,
>> > TCA_POLICE_TM,
>> > + TCA_POLICE_RATE64,
>> > + TCA_POLICE_PEAKRATE64,
>> > TCA_POLICE_PAD,
>> > __TCA_POLICE_MAX
>> > #define TCA_POLICE_RESULT TCA_POLICE_RESULT
>>
>> Never insert new attributes, as this breaks compatibility with old binaries (including
>> old kernels)
> Thanks for reviewing it!
> My change is only contained within the police part. I am trying to
> follow the same way htb and tbf support their 64 bit rate.
>
> I tested the old tc binary with the newly patched kernel. It works fine.
>
> I agree the newly compiled tc binary that has these 2 new attributes can
> cause backward compatibility issue when running on the old kernel.
>
> If can't insert new attribute, is there any
> comment/suggestion/alternative on how to support 64bit police rate and
> still keep the backward compatibility?
Just put the new attributes *after* PAD instead of before :)
-Toke
^ permalink raw reply
* Re: [PATCH v2 bpf-next 1/3] capability: introduce CAP_BPF and CAP_TRACING
From: Toke Høiland-Jørgensen @ 2019-08-29 18:05 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, luto, davem, peterz, rostedt, netdev, bpf,
kernel-team, linux-api, brouer
In-Reply-To: <20190829172410.j36gjxt6oku5zh6s@ast-mbp.dhcp.thefacebook.com>
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> On Thu, Aug 29, 2019 at 09:44:18AM +0200, Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov <ast@kernel.org> writes:
>>
>> > CAP_BPF allows the following BPF operations:
>> > - Loading all types of BPF programs
>> > - Creating all types of BPF maps except:
>> > - stackmap that needs CAP_TRACING
>> > - devmap that needs CAP_NET_ADMIN
>> > - cpumap that needs CAP_SYS_ADMIN
>>
>> Why CAP_SYS_ADMIN instead of CAP_NET_ADMIN for cpumap?
>
> Currently it's cap_sys_admin and I think it should stay this way
> because it creates kthreads.
Ah, right. I can sorta see that makes sense because of the kthreads, but
it also means that you can use all of XDP *except* cpumap with
CAP_NET_ADMIN+CAP_BPF. That is bound to create confusion, isn't it?
-Toke
^ permalink raw reply
* Re: auto-split of commit. Was: [PATCH bpf-next 04/10] tools/bpf: add libbpf_prog_type_(from|to)_str helpers
From: Arnaldo Carvalho de Melo @ 2019-08-29 18:10 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Greg Kroah-Hartman, Jakub Kicinski, Julia Kartseva, ast,
Thomas Gleixner, rdna, bpf, daniel, netdev, kernel-team
In-Reply-To: <20190829171655.fww5qxtfusehcpds@ast-mbp.dhcp.thefacebook.com>
Em Thu, Aug 29, 2019 at 10:16:56AM -0700, Alexei Starovoitov escreveu:
> On Thu, Aug 29, 2019 at 08:51:51AM +0200, Greg Kroah-Hartman wrote:
> > That being said, from a "are you keeping the correct authorship info",
> > yes, it sounds like you are doing the correct thing here.
> > Look at what I do for stable kernels, I take the original commit and add
> > it to "another tree" keeping the original author and s-o-b chain intact,
> > and adding a "this is the original git commit id" type message to the
> > changelog text so that people can link it back to the original.
> I think you're describing 'git cherry-pick -x'.
> The question was about taking pieces of the original commit. Not the whole commit.
> Author field obviously stays, but SOB is questionable.
> If author meant to change X and Y and Z. Silently taking only Z chunk of the diff
> doesn't quite seem right.
> If we document that such commit split happens in Documentation/bpf/bpf_devel_QA.rst
> do you think it will be enough to properly inform developers?
Can't we instead establish the rule that for something to be added to
tools/include/ it should first land in a separate commit in include/,
ditto for the other things tools/ copies from the kernel sources.
That was the initial intention of tools/include/ and also that is how
tools/perf/check-headers.h works, warning when something ot out of sync,
etc.
I.e. the tools/ maintainers should refuse patches that touch both
tools/include and tools/.
wdyt?
- Arnaldo
^ permalink raw reply
* [PATCH AUTOSEL 5.2 01/76] batman-adv: Fix netlink dumping of all mcast_flags buckets
From: Sasha Levin @ 2019-08-29 18:11 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Sven Eckelmann, Simon Wunderlich, Sasha Levin, netdev
From: Sven Eckelmann <sven@narfation.org>
[ Upstream commit fa3a03da549a889fc9dbc0d3c5908eb7882cac8f ]
The bucket variable is only updated outside the loop over the mcast_flags
buckets. It will only be updated during a dumping run when the dumping has
to be interrupted and a new message has to be started.
This could result in repeated or missing entries when the multicast flags
are dumped to userspace.
Fixes: d2d489b7d851 ("batman-adv: Add inconsistent multicast netlink dump detection")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
net/batman-adv/multicast.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/batman-adv/multicast.c b/net/batman-adv/multicast.c
index ec54e236e3454..50fe9dfb088b6 100644
--- a/net/batman-adv/multicast.c
+++ b/net/batman-adv/multicast.c
@@ -1653,7 +1653,7 @@ __batadv_mcast_flags_dump(struct sk_buff *msg, u32 portid,
while (bucket_tmp < hash->size) {
if (batadv_mcast_flags_dump_bucket(msg, portid, cb, hash,
- *bucket, &idx_tmp))
+ bucket_tmp, &idx_tmp))
break;
bucket_tmp++;
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 5.2 20/76] Bluetooth: hidp: Let hidp_send_message return number of queued bytes
From: Sasha Levin @ 2019-08-29 18:12 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Fabian Henneke, Marcel Holtmann, Sasha Levin, linux-bluetooth,
netdev
In-Reply-To: <20190829181311.7562-1-sashal@kernel.org>
From: Fabian Henneke <fabian.henneke@gmail.com>
[ Upstream commit 48d9cc9d85dde37c87abb7ac9bbec6598ba44b56 ]
Let hidp_send_message return the number of successfully queued bytes
instead of an unconditional 0.
With the return value fixed to 0, other drivers relying on hidp, such as
hidraw, can not return meaningful values from their respective
implementations of write(). In particular, with the current behavior, a
hidraw device's write() will have different return values depending on
whether the device is connected via USB or Bluetooth, which makes it
harder to abstract away the transport layer.
Signed-off-by: Fabian Henneke <fabian.henneke@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
net/bluetooth/hidp/core.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 5abd423b55fa9..8d889969ae7ed 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -101,6 +101,7 @@ static int hidp_send_message(struct hidp_session *session, struct socket *sock,
{
struct sk_buff *skb;
struct sock *sk = sock->sk;
+ int ret;
BT_DBG("session %p data %p size %d", session, data, size);
@@ -114,13 +115,17 @@ static int hidp_send_message(struct hidp_session *session, struct socket *sock,
}
skb_put_u8(skb, hdr);
- if (data && size > 0)
+ if (data && size > 0) {
skb_put_data(skb, data, size);
+ ret = size;
+ } else {
+ ret = 0;
+ }
skb_queue_tail(transmit, skb);
wake_up_interruptible(sk_sleep(sk));
- return 0;
+ return ret;
}
static int hidp_send_ctrl_message(struct hidp_session *session,
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 5.2 31/76] cxgb4: fix a memory leak bug
From: Sasha Levin @ 2019-08-29 18:12 UTC (permalink / raw)
To: linux-kernel, stable; +Cc: Wenwen Wang, David S . Miller, Sasha Levin, netdev
In-Reply-To: <20190829181311.7562-1-sashal@kernel.org>
From: Wenwen Wang <wenwen@cs.uga.edu>
[ Upstream commit c554336efa9bbc28d6ec14efbee3c7d63c61a34f ]
In blocked_fl_write(), 't' is not deallocated if bitmap_parse_user() fails,
leading to a memory leak bug. To fix this issue, free t before returning
the error.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index 02959035ed3f2..d692251ee252c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -3236,8 +3236,10 @@ static ssize_t blocked_fl_write(struct file *filp, const char __user *ubuf,
return -ENOMEM;
err = bitmap_parse_user(ubuf, count, t, adap->sge.egr_sz);
- if (err)
+ if (err) {
+ kvfree(t);
return err;
+ }
bitmap_copy(adap->sge.blocked_fl, t, adap->sge.egr_sz);
kvfree(t);
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 5.2 45/76] wimax/i2400m: fix a memory leak bug
From: Sasha Levin @ 2019-08-29 18:12 UTC (permalink / raw)
To: linux-kernel, stable; +Cc: Wenwen Wang, David S . Miller, Sasha Levin, netdev
In-Reply-To: <20190829181311.7562-1-sashal@kernel.org>
From: Wenwen Wang <wenwen@cs.uga.edu>
[ Upstream commit 44ef3a03252844a8753479b0cea7f29e4a804bdc ]
In i2400m_barker_db_init(), 'options_orig' is allocated through kstrdup()
to hold the original command line options. Then, the options are parsed.
However, if an error occurs during the parsing process, 'options_orig' is
not deallocated, leading to a memory leak bug. To fix this issue, free
'options_orig' before returning the error.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/wimax/i2400m/fw.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wimax/i2400m/fw.c b/drivers/net/wimax/i2400m/fw.c
index e9fc168bb7345..489cba9b284d1 100644
--- a/drivers/net/wimax/i2400m/fw.c
+++ b/drivers/net/wimax/i2400m/fw.c
@@ -351,13 +351,15 @@ int i2400m_barker_db_init(const char *_options)
}
result = i2400m_barker_db_add(barker);
if (result < 0)
- goto error_add;
+ goto error_parse_add;
}
kfree(options_orig);
}
return 0;
+error_parse_add:
error_parse:
+ kfree(options_orig);
error_add:
kfree(i2400m_barker_db);
return result;
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 4.19 03/45] rxrpc: Fix local endpoint refcounting
From: Sasha Levin @ 2019-08-29 18:15 UTC (permalink / raw)
To: linux-kernel, stable
Cc: David Howells, syzbot+1e0edc4b8b7494c28450, Sasha Levin,
linux-afs, netdev
In-Reply-To: <20190829181547.8280-1-sashal@kernel.org>
From: David Howells <dhowells@redhat.com>
[ Upstream commit 730c5fd42c1e3652a065448fd235cb9fafb2bd10 ]
The object lifetime management on the rxrpc_local struct is broken in that
the rxrpc_local_processor() function is expected to clean up and remove an
object - but it may get requeued by packets coming in on the backing UDP
socket once it starts running.
This may result in the assertion in rxrpc_local_rcu() firing because the
memory has been scheduled for RCU destruction whilst still queued:
rxrpc: Assertion failed
------------[ cut here ]------------
kernel BUG at net/rxrpc/local_object.c:468!
Note that if the processor comes around before the RCU free function, it
will just do nothing because ->dead is true.
Fix this by adding a separate refcount to count active users of the
endpoint that causes the endpoint to be destroyed when it reaches 0.
The original refcount can then be used to refcount objects through the work
processor and cause the memory to be rcu freed when that reaches 0.
Fixes: 4f95dd78a77e ("rxrpc: Rework local endpoint management")
Reported-by: syzbot+1e0edc4b8b7494c28450@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
net/rxrpc/af_rxrpc.c | 4 +-
net/rxrpc/ar-internal.h | 5 ++-
net/rxrpc/input.c | 16 ++++++--
net/rxrpc/local_object.c | 86 +++++++++++++++++++++++++---------------
4 files changed, 72 insertions(+), 39 deletions(-)
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index d76e5e58905d8..7319d3ca30e94 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -195,7 +195,7 @@ static int rxrpc_bind(struct socket *sock, struct sockaddr *saddr, int len)
service_in_use:
write_unlock(&local->services_lock);
- rxrpc_put_local(local);
+ rxrpc_unuse_local(local);
ret = -EADDRINUSE;
error_unlock:
release_sock(&rx->sk);
@@ -908,7 +908,7 @@ static int rxrpc_release_sock(struct sock *sk)
rxrpc_queue_work(&rxnet->service_conn_reaper);
rxrpc_queue_work(&rxnet->client_conn_reaper);
- rxrpc_put_local(rx->local);
+ rxrpc_unuse_local(rx->local);
rx->local = NULL;
key_put(rx->key);
rx->key = NULL;
diff --git a/net/rxrpc/ar-internal.h b/net/rxrpc/ar-internal.h
index 03e0fc8c183f0..87c3677a2daf8 100644
--- a/net/rxrpc/ar-internal.h
+++ b/net/rxrpc/ar-internal.h
@@ -258,7 +258,8 @@ struct rxrpc_security {
*/
struct rxrpc_local {
struct rcu_head rcu;
- atomic_t usage;
+ atomic_t active_users; /* Number of users of the local endpoint */
+ atomic_t usage; /* Number of references to the structure */
struct rxrpc_net *rxnet; /* The network ns in which this resides */
struct list_head link;
struct socket *socket; /* my UDP socket */
@@ -998,6 +999,8 @@ struct rxrpc_local *rxrpc_lookup_local(struct net *, const struct sockaddr_rxrpc
struct rxrpc_local *rxrpc_get_local(struct rxrpc_local *);
struct rxrpc_local *rxrpc_get_local_maybe(struct rxrpc_local *);
void rxrpc_put_local(struct rxrpc_local *);
+struct rxrpc_local *rxrpc_use_local(struct rxrpc_local *);
+void rxrpc_unuse_local(struct rxrpc_local *);
void rxrpc_queue_local(struct rxrpc_local *);
void rxrpc_destroy_all_locals(struct rxrpc_net *);
diff --git a/net/rxrpc/input.c b/net/rxrpc/input.c
index d591f54cb91fb..7965600ee5dec 100644
--- a/net/rxrpc/input.c
+++ b/net/rxrpc/input.c
@@ -1106,8 +1106,12 @@ static void rxrpc_post_packet_to_local(struct rxrpc_local *local,
{
_enter("%p,%p", local, skb);
- skb_queue_tail(&local->event_queue, skb);
- rxrpc_queue_local(local);
+ if (rxrpc_get_local_maybe(local)) {
+ skb_queue_tail(&local->event_queue, skb);
+ rxrpc_queue_local(local);
+ } else {
+ rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+ }
}
/*
@@ -1117,8 +1121,12 @@ static void rxrpc_reject_packet(struct rxrpc_local *local, struct sk_buff *skb)
{
CHECK_SLAB_OKAY(&local->usage);
- skb_queue_tail(&local->reject_queue, skb);
- rxrpc_queue_local(local);
+ if (rxrpc_get_local_maybe(local)) {
+ skb_queue_tail(&local->reject_queue, skb);
+ rxrpc_queue_local(local);
+ } else {
+ rxrpc_free_skb(skb, rxrpc_skb_rx_freed);
+ }
}
/*
diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 10317dbdab5f4..2182ebfc7df4c 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -83,6 +83,7 @@ static struct rxrpc_local *rxrpc_alloc_local(struct rxrpc_net *rxnet,
local = kzalloc(sizeof(struct rxrpc_local), GFP_KERNEL);
if (local) {
atomic_set(&local->usage, 1);
+ atomic_set(&local->active_users, 1);
local->rxnet = rxnet;
INIT_LIST_HEAD(&local->link);
INIT_WORK(&local->processor, rxrpc_local_processor);
@@ -270,11 +271,8 @@ struct rxrpc_local *rxrpc_lookup_local(struct net *net,
* bind the transport socket may still fail if we're attempting
* to use a local address that the dying object is still using.
*/
- if (!rxrpc_get_local_maybe(local)) {
- cursor = cursor->next;
- list_del_init(&local->link);
+ if (!rxrpc_use_local(local))
break;
- }
age = "old";
goto found;
@@ -288,7 +286,10 @@ struct rxrpc_local *rxrpc_lookup_local(struct net *net,
if (ret < 0)
goto sock_error;
- list_add_tail(&local->link, cursor);
+ if (cursor != &rxnet->local_endpoints)
+ list_replace(cursor, &local->link);
+ else
+ list_add_tail(&local->link, cursor);
age = "new";
found:
@@ -346,7 +347,8 @@ struct rxrpc_local *rxrpc_get_local_maybe(struct rxrpc_local *local)
}
/*
- * Queue a local endpoint.
+ * Queue a local endpoint unless it has become unreferenced and pass the
+ * caller's reference to the work item.
*/
void rxrpc_queue_local(struct rxrpc_local *local)
{
@@ -355,15 +357,8 @@ void rxrpc_queue_local(struct rxrpc_local *local)
if (rxrpc_queue_work(&local->processor))
trace_rxrpc_local(local, rxrpc_local_queued,
atomic_read(&local->usage), here);
-}
-
-/*
- * A local endpoint reached its end of life.
- */
-static void __rxrpc_put_local(struct rxrpc_local *local)
-{
- _enter("%d", local->debug_id);
- rxrpc_queue_work(&local->processor);
+ else
+ rxrpc_put_local(local);
}
/*
@@ -379,10 +374,45 @@ void rxrpc_put_local(struct rxrpc_local *local)
trace_rxrpc_local(local, rxrpc_local_put, n, here);
if (n == 0)
- __rxrpc_put_local(local);
+ call_rcu(&local->rcu, rxrpc_local_rcu);
}
}
+/*
+ * Start using a local endpoint.
+ */
+struct rxrpc_local *rxrpc_use_local(struct rxrpc_local *local)
+{
+ unsigned int au;
+
+ local = rxrpc_get_local_maybe(local);
+ if (!local)
+ return NULL;
+
+ au = atomic_fetch_add_unless(&local->active_users, 1, 0);
+ if (au == 0) {
+ rxrpc_put_local(local);
+ return NULL;
+ }
+
+ return local;
+}
+
+/*
+ * Cease using a local endpoint. Once the number of active users reaches 0, we
+ * start the closure of the transport in the work processor.
+ */
+void rxrpc_unuse_local(struct rxrpc_local *local)
+{
+ unsigned int au;
+
+ au = atomic_dec_return(&local->active_users);
+ if (au == 0)
+ rxrpc_queue_local(local);
+ else
+ rxrpc_put_local(local);
+}
+
/*
* Destroy a local endpoint's socket and then hand the record to RCU to dispose
* of.
@@ -397,16 +427,6 @@ static void rxrpc_local_destroyer(struct rxrpc_local *local)
_enter("%d", local->debug_id);
- /* We can get a race between an incoming call packet queueing the
- * processor again and the work processor starting the destruction
- * process which will shut down the UDP socket.
- */
- if (local->dead) {
- _leave(" [already dead]");
- return;
- }
- local->dead = true;
-
mutex_lock(&rxnet->local_mutex);
list_del_init(&local->link);
mutex_unlock(&rxnet->local_mutex);
@@ -426,13 +446,11 @@ static void rxrpc_local_destroyer(struct rxrpc_local *local)
*/
rxrpc_purge_queue(&local->reject_queue);
rxrpc_purge_queue(&local->event_queue);
-
- _debug("rcu local %d", local->debug_id);
- call_rcu(&local->rcu, rxrpc_local_rcu);
}
/*
- * Process events on an endpoint
+ * Process events on an endpoint. The work item carries a ref which
+ * we must release.
*/
static void rxrpc_local_processor(struct work_struct *work)
{
@@ -445,8 +463,10 @@ static void rxrpc_local_processor(struct work_struct *work)
do {
again = false;
- if (atomic_read(&local->usage) == 0)
- return rxrpc_local_destroyer(local);
+ if (atomic_read(&local->active_users) == 0) {
+ rxrpc_local_destroyer(local);
+ break;
+ }
if (!skb_queue_empty(&local->reject_queue)) {
rxrpc_reject_packets(local);
@@ -458,6 +478,8 @@ static void rxrpc_local_processor(struct work_struct *work)
again = true;
}
} while (again);
+
+ rxrpc_put_local(local);
}
/*
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 4.19 05/45] hv_netvsc: Fix a warning of suspicious RCU usage
From: Sasha Levin @ 2019-08-29 18:15 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Dexuan Cui, David S . Miller, Sasha Levin, linux-hyperv, netdev
In-Reply-To: <20190829181547.8280-1-sashal@kernel.org>
From: Dexuan Cui <decui@microsoft.com>
[ Upstream commit 6d0d779dca73cd5acb649c54f81401f93098b298 ]
This fixes a warning of "suspicious rcu_dereference_check() usage"
when nload runs.
Fixes: 776e726bfb34 ("netvsc: fix RCU warning in get_stats")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/hyperv/netvsc_drv.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index cc60ef9634db2..6f6c0dbd91fc8 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1248,12 +1248,15 @@ static void netvsc_get_stats64(struct net_device *net,
struct rtnl_link_stats64 *t)
{
struct net_device_context *ndev_ctx = netdev_priv(net);
- struct netvsc_device *nvdev = rcu_dereference_rtnl(ndev_ctx->nvdev);
+ struct netvsc_device *nvdev;
struct netvsc_vf_pcpu_stats vf_tot;
int i;
+ rcu_read_lock();
+
+ nvdev = rcu_dereference(ndev_ctx->nvdev);
if (!nvdev)
- return;
+ goto out;
netdev_stats_to_stats64(t, &net->stats);
@@ -1292,6 +1295,8 @@ static void netvsc_get_stats64(struct net_device *net,
t->rx_packets += packets;
t->multicast += multicast;
}
+out:
+ rcu_read_unlock();
}
static int netvsc_set_mac_addr(struct net_device *ndev, void *p)
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 4.19 13/45] rxrpc: Fix read-after-free in rxrpc_queue_local()
From: Sasha Levin @ 2019-08-29 18:15 UTC (permalink / raw)
To: linux-kernel, stable
Cc: David Howells, syzbot+78e71c5bab4f76a6a719, Sasha Levin,
linux-afs, netdev
In-Reply-To: <20190829181547.8280-1-sashal@kernel.org>
From: David Howells <dhowells@redhat.com>
[ Upstream commit 06d9532fa6b34f12a6d75711162d47c17c1add72 ]
rxrpc_queue_local() attempts to queue the local endpoint it is given and
then, if successful, prints a trace line. The trace line includes the
current usage count - but we're not allowed to look at the local endpoint
at this point as we passed our ref on it to the workqueue.
Fix this by reading the usage count before queuing the work item.
Also fix the reading of local->debug_id for trace lines, which must be done
with the same consideration as reading the usage count.
Fixes: 09d2bf595db4 ("rxrpc: Add a tracepoint to track rxrpc_local refcounting")
Reported-by: syzbot+78e71c5bab4f76a6a719@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
include/trace/events/rxrpc.h | 6 +++---
net/rxrpc/local_object.c | 19 ++++++++++---------
2 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index 147546e0c11bd..815dcfa647430 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -500,10 +500,10 @@ rxrpc_tx_points;
#define E_(a, b) { a, b }
TRACE_EVENT(rxrpc_local,
- TP_PROTO(struct rxrpc_local *local, enum rxrpc_local_trace op,
+ TP_PROTO(unsigned int local_debug_id, enum rxrpc_local_trace op,
int usage, const void *where),
- TP_ARGS(local, op, usage, where),
+ TP_ARGS(local_debug_id, op, usage, where),
TP_STRUCT__entry(
__field(unsigned int, local )
@@ -513,7 +513,7 @@ TRACE_EVENT(rxrpc_local,
),
TP_fast_assign(
- __entry->local = local->debug_id;
+ __entry->local = local_debug_id;
__entry->op = op;
__entry->usage = usage;
__entry->where = where;
diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 7f82c4e19bd1e..27f4bbe85e799 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -97,7 +97,7 @@ static struct rxrpc_local *rxrpc_alloc_local(struct rxrpc_net *rxnet,
local->debug_id = atomic_inc_return(&rxrpc_debug_id);
memcpy(&local->srx, srx, sizeof(*srx));
local->srx.srx_service = 0;
- trace_rxrpc_local(local, rxrpc_local_new, 1, NULL);
+ trace_rxrpc_local(local->debug_id, rxrpc_local_new, 1, NULL);
}
_leave(" = %p", local);
@@ -325,7 +325,7 @@ struct rxrpc_local *rxrpc_get_local(struct rxrpc_local *local)
int n;
n = atomic_inc_return(&local->usage);
- trace_rxrpc_local(local, rxrpc_local_got, n, here);
+ trace_rxrpc_local(local->debug_id, rxrpc_local_got, n, here);
return local;
}
@@ -339,7 +339,8 @@ struct rxrpc_local *rxrpc_get_local_maybe(struct rxrpc_local *local)
if (local) {
int n = atomic_fetch_add_unless(&local->usage, 1, 0);
if (n > 0)
- trace_rxrpc_local(local, rxrpc_local_got, n + 1, here);
+ trace_rxrpc_local(local->debug_id, rxrpc_local_got,
+ n + 1, here);
else
local = NULL;
}
@@ -347,16 +348,16 @@ struct rxrpc_local *rxrpc_get_local_maybe(struct rxrpc_local *local)
}
/*
- * Queue a local endpoint unless it has become unreferenced and pass the
- * caller's reference to the work item.
+ * Queue a local endpoint and pass the caller's reference to the work item.
*/
void rxrpc_queue_local(struct rxrpc_local *local)
{
const void *here = __builtin_return_address(0);
+ unsigned int debug_id = local->debug_id;
+ int n = atomic_read(&local->usage);
if (rxrpc_queue_work(&local->processor))
- trace_rxrpc_local(local, rxrpc_local_queued,
- atomic_read(&local->usage), here);
+ trace_rxrpc_local(debug_id, rxrpc_local_queued, n, here);
else
rxrpc_put_local(local);
}
@@ -371,7 +372,7 @@ void rxrpc_put_local(struct rxrpc_local *local)
if (local) {
n = atomic_dec_return(&local->usage);
- trace_rxrpc_local(local, rxrpc_local_put, n, here);
+ trace_rxrpc_local(local->debug_id, rxrpc_local_put, n, here);
if (n == 0)
call_rcu(&local->rcu, rxrpc_local_rcu);
@@ -458,7 +459,7 @@ static void rxrpc_local_processor(struct work_struct *work)
container_of(work, struct rxrpc_local, processor);
bool again;
- trace_rxrpc_local(local, rxrpc_local_processing,
+ trace_rxrpc_local(local->debug_id, rxrpc_local_processing,
atomic_read(&local->usage), NULL);
do {
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 4.19 23/45] cx82310_eth: fix a memory leak bug
From: Sasha Levin @ 2019-08-29 18:15 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Wenwen Wang, David S . Miller, Sasha Levin, linux-usb, netdev
In-Reply-To: <20190829181547.8280-1-sashal@kernel.org>
From: Wenwen Wang <wenwen@cs.uga.edu>
[ Upstream commit 1eca92eef18719027d394bf1a2d276f43e7cf886 ]
In cx82310_bind(), 'dev->partial_data' is allocated through kmalloc().
Then, the execution waits for the firmware to become ready. If the firmware
is not ready in time, the execution is terminated. However, the allocated
'dev->partial_data' is not deallocated on this path, leading to a memory
leak bug. To fix this issue, free 'dev->partial_data' before returning the
error.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/usb/cx82310_eth.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/usb/cx82310_eth.c b/drivers/net/usb/cx82310_eth.c
index 947bea81d9241..dfbdea22fbad9 100644
--- a/drivers/net/usb/cx82310_eth.c
+++ b/drivers/net/usb/cx82310_eth.c
@@ -175,7 +175,8 @@ static int cx82310_bind(struct usbnet *dev, struct usb_interface *intf)
}
if (!timeout) {
dev_err(&udev->dev, "firmware not ready in time\n");
- return -ETIMEDOUT;
+ ret = -ETIMEDOUT;
+ goto err;
}
/* enable ethernet mode (?) */
--
2.20.1
^ permalink raw reply related
* [PATCH AUTOSEL 4.14 01/27] net: tundra: tsi108: use spin_lock_irqsave instead of spin_lock_irq in IRQ context
From: Sasha Levin @ 2019-08-29 18:16 UTC (permalink / raw)
To: linux-kernel, stable; +Cc: Fuqian Huang, David S . Miller, Sasha Levin, netdev
From: Fuqian Huang <huangfq.daxian@gmail.com>
[ Upstream commit 8c25d0887a8bd0e1ca2074ac0c6dff173787a83b ]
As spin_unlock_irq will enable interrupts.
Function tsi108_stat_carry is called from interrupt handler tsi108_irq.
Interrupts are enabled in interrupt handler.
Use spin_lock_irqsave/spin_unlock_irqrestore instead of spin_(un)lock_irq
in IRQ context to avoid this.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/ethernet/tundra/tsi108_eth.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/tundra/tsi108_eth.c b/drivers/net/ethernet/tundra/tsi108_eth.c
index c2d15d9c0c33b..455979e47424c 100644
--- a/drivers/net/ethernet/tundra/tsi108_eth.c
+++ b/drivers/net/ethernet/tundra/tsi108_eth.c
@@ -381,9 +381,10 @@ tsi108_stat_carry_one(int carry, int carry_bit, int carry_shift,
static void tsi108_stat_carry(struct net_device *dev)
{
struct tsi108_prv_data *data = netdev_priv(dev);
+ unsigned long flags;
u32 carry1, carry2;
- spin_lock_irq(&data->misclock);
+ spin_lock_irqsave(&data->misclock, flags);
carry1 = TSI_READ(TSI108_STAT_CARRY1);
carry2 = TSI_READ(TSI108_STAT_CARRY2);
@@ -451,7 +452,7 @@ static void tsi108_stat_carry(struct net_device *dev)
TSI108_STAT_TXPAUSEDROP_CARRY,
&data->tx_pause_drop);
- spin_unlock_irq(&data->misclock);
+ spin_unlock_irqrestore(&data->misclock, flags);
}
/* Read a stat counter atomically with respect to carries.
--
2.20.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox