Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] tools: bpftool: fix reading from /proc/config.gz
From: Peter Wu @ 2019-08-05 23:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann, netdev,
	Stanislav Fomichev, Quentin Monnet
In-Reply-To: <20190805120649.421211da@cakuba.netronome.com>

Hi all,

Thank you for your quick feedback, I will address them in the next
revision.

On Mon, Aug 05, 2019 at 11:41:09AM +0100, Quentin Monnet wrote:

> As far as I understood (from examining Cilium [0]), /proc/config _is_
> used by some distributions, such as CoreOS. This is why we look at that
> location in bpftool.
> 
> [0] https://github.com/cilium/cilium/blob/master/bpf/run_probes.sh#L42

This comment[1] says "CoreOS uses /proc/config", but I think that is a
typo and is supposed to say "/proc/config.gz". The original feature
request[2] uses "/boot/config" as example.

 [1]: https://github.com/cilium/cilium/pull/1065
 [2]: https://github.com/cilium/cilium/issues/891

Given that "/proc/config.gz" is the standard since at least v2.6.12-rc2,
and the official kernel has no mention of "/proc/config", I would like
to skip the latter. If someone has a need for this and it is not covered
by either /boot/config-$(uname -r) or /proc/config.gz, they could submit
a patch for it with links to documentation. How about that?

> > -static char *get_kernel_config_option(FILE *fd, const char *option)
> > +static bool get_kernel_config_option(FILE *fd, char **buf_p, size_t *n_p,
> > +				     char **value)
> 
> Maybe we could rename this function, and have "next" appear in it
> somewhere? After your changes, it does not return the value for a
> specific option anymore.

I have changed it to "read_next_kernel_config_option", let me know if
you prefer an alternative.

> >  {
> > -	size_t line_n = 0, optlen = strlen(option);
> > -	char *res, *strval, *line = NULL;
> > -	ssize_t n;
> > +	char *sep;
> > +	ssize_t linelen;
> 
> Please order the declarations in reverse-Christmas tree style.

Does this refer to the type, name, or full line length? I did not find
this in CodingStyle, the closest I could get is:
https://lore.kernel.org/patchwork/patch/732076/

I will assume the line length for now.

> >  static void probe_kernel_image_config(void)
> > @@ -386,31 +386,34 @@ static void probe_kernel_image_config(void)
> >  		/* test_bpf module for BPF tests */
> >  		"CONFIG_TEST_BPF",
> >  	};
> > +	char *values[ARRAY_SIZE(options)] = { };
> >  	char *value, *buf = NULL;
> >  	struct utsname utsn;
> >  	char path[PATH_MAX];
> >  	size_t i, n;
> >  	ssize_t ret;
> > -	FILE *fd;
> > +	FILE *fd = NULL;
> > +	bool is_pipe = false;
> 
> Reverse-Christmas-tree style please.

Even if that means moving lines? Something like this?

        char path[PATH_MAX];
   +    bool is_pipe = false;
   +    FILE *fd = NULL;
        size_t i, n;
        ssize_t ret;
   -    FILE *fd;

> >  	if (uname(&utsn))
> > -		goto no_config;
> > +		goto end_parse;
> 
> Just thinking, maybe if uname() fails we can skip /boot/config-$(uname
> -r) but still attempt to parse /proc/config{,.gz} instead of printing
> only NULL options?

Good idea, I'll try a bit harder if uname falls for whatever reason.

> Because some distributions do use /proc/config, we should keep this. You
> can probably add /proc/config.gz as another attempt below (or even
> above) the current case?

I doubt it is actually in use, it looks like a typo in the original PR.
This post only lists /proc/config.gz, /boot/config and
/boot/config-$(uname -r): https://superuser.com/questions/287371

> > +	while (get_kernel_config_option(fd, &buf, &n, &value))> +		for (i = 0; i < ARRAY_SIZE(options); i++) {
> > +			if (values[i] || strcmp(buf, options[i]))
> 
> Can we have an option set multiple times in the config file? If so,
> maybe have a p_info() if values are different to warn users that
> conflicting values were found?

scripts/kconfig/merge_config.sh seems to apply a merge strategy,
overwriting earlier values and warning about it. However this should be
rare given that it ended up at /proc/config.gz. For now I will favor
simplicity over complexity and keep the old situation. Let me know if
you prefer otherwise.


On Mon, Aug 05, 2019 at 12:06:49PM -0700, Jakub Kicinski wrote:
> On Mon, 5 Aug 2019 08:29:36 -0700, Stanislav Fomichev wrote:
> > On 08/05, Peter Wu wrote:
> > > /proc/config has never existed as far as I can see, but /proc/config.gz
> > > is present on Arch Linux. Execute an external gunzip program to avoid
> > > linking to zlib and rework the option scanning code since a pipe is not
> > > seekable. This also fixes a file handle leak on some error paths.  
> > Thanks for doing that! One question: why not link against -lz instead?
> > With fork/execing gunzip you're just hiding this dependency.
> > 
> > You can add something like this to the Makefile:
> > ifeq ($(feature-zlib),1)
> > CLFAGS += -DHAVE_ZLIB
> > endif
> > 
> > And then conditionally add support for config.gz. Thoughts?
> 
> +1

Given that the old code did not have this library dependency I did not
add it (the program would otherwise fail to run). Executing an external
process is similar to what tar does. I will look into linking directly
to zlib, thanks!
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl

^ permalink raw reply

* hv_netvsc: WARNING: suspicious RCU usage?
From: Dexuan Cui @ 2019-08-05 23:55 UTC (permalink / raw)
  To: netdev@vger.kernel.org, Yidong Ren, Haiyang Zhang,
	Stephen Hemminger, David S. Miller
  Cc: linux-hyperv@vger.kernel.org

Hi,
After the VM boots up, I always get the below call-trace when I run "nload" for the first time:

[  113.910911] WARNING: suspicious RCU usage
[  113.913244] 5.2.0+ #19 Not tainted
[  113.915216] -----------------------------
[  113.917521] drivers/net/hyperv/netvsc_drv.c:1243 suspicious rcu_dereference_check() usage!
[  113.922191]
[  113.922191] other info that might help us debug this:
[  113.926573]
[  113.926573] rcu_scheduler_active = 2, debug_locks = 1
[  113.930052] 4 locks held by nload/1977:
[  113.932251]  #0: 0000000080b71e86 (&p->lock){+.+.}, at: seq_read+0x41/0x3d0
[  113.936115]  #1: 00000000cacff770 (&of->mutex){+.+.}, at: kernfs_seq_start+0x2a/0x90
[  113.940115]  #2: 00000000287c988f (kn->count#134){.+.+}, at: kernfs_seq_start+0x32/0x90
[  113.944292]  #3: 00000000996fa9cc (dev_base_lock){++.+}, at: netstat_show.isra.25+0x4a/0xb0
[  113.958076]
[  113.958076] stack backtrace:
[  113.958081] CPU: 3 PID: 1977 Comm: nload Not tainted 5.2.0+ #19
[  113.958083] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006  04/28/2016
[  113.958084] Call Trace:
[  113.958091]  dump_stack+0x67/0x90
[  113.973663]  netvsc_get_stats64+0x159/0x170 [hv_netvsc]
[  113.973663]  dev_get_stats+0x55/0xb0
[  113.973663]  netstat_show.isra.25+0x5b/0xb0
[  113.973663]  dev_attr_show+0x15/0x40
[  113.981661]  sysfs_kf_seq_show+0xad/0xf0
[  113.981661]  seq_read+0x146/0x3d0
[  113.981661]  vfs_read+0x9c/0x160
[  113.989025]  ksys_read+0x5c/0xd0
[  113.989025]  do_syscall_64+0x5e/0x220
[  113.989025]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  113.989025] RIP: 0033:0x7f4485daaf31

nload is a console application which monitors network traffic and bandwidth usage in real time.

The warning is caused by the rcu_dereference_rtnl() :

1239 static void netvsc_get_stats64(struct net_device *net,
1240                                struct rtnl_link_stats64 *t)
1241 {
1242         struct net_device_context *ndev_ctx = netdev_priv(net);
1243         struct netvsc_device *nvdev = rcu_dereference_rtnl(ndev_ctx->nvdev);

I think here netvsc_get_stats64() neither holds rcu_read_lock() nor RTNL

IMO it should call rcu_read_lock()/unlock(), or get RTNL to fix the warning?

Thanks,
-- Dexuan


^ permalink raw reply

* Re: [PATCH 1/1] bpf: introduce new helper udp_flow_src_port
From: Jakub Kicinski @ 2019-08-06  0:10 UTC (permalink / raw)
  To: Y Song, Alexei Starovoitov; +Cc: Farid Zakaria, Daniel Borkmann, netdev, bpf
In-Reply-To: <CAH3MdRXTEN-Ra+61QA37hM2mkHx99K5NM7f+H6d8Em-bxvaenw@mail.gmail.com>

On Sat, 3 Aug 2019 23:52:16 -0700, Y Song wrote:
> >  include/uapi/linux/bpf.h                      | 21 +++++++--
> >  net/core/filter.c                             | 20 ++++++++
> >  tools/include/uapi/linux/bpf.h                | 21 +++++++--
> >  tools/testing/selftests/bpf/bpf_helpers.h     |  2 +
> >  .../bpf/prog_tests/udp_flow_src_port.c        | 28 +++++++++++
> >  .../bpf/progs/test_udp_flow_src_port_kern.c   | 47 +++++++++++++++++++
> >  6 files changed, 131 insertions(+), 8 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/udp_flow_src_port.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_udp_flow_src_port_kern.c  
> 
> First, for each review, backport and sync with libbpf repo, in the future,
> could you break the patch to two patches?
>    1. kernel changes (net/core/filter.c, include/uapi/linux/bpf.h)
>    2. tools/include/uapi/linux/bpf.h
>    3. tools/testing/ changes

A lot of people get caught off by this, could explain why this is
necessary?

git can deal with this scenario without missing a step, format-patch
takes paths:

$ git show --oneline -s
1002f3e955d7 (HEAD) bpf: introduce new helper udp_flow_src_port

$ git format-patch HEAD~ -- tools/include/uapi/linux/bpf.h
0001-bpf-introduce-new-helper-udp_flow_src_port.patch

$ grep -B1 changed 0001-bpf-introduce-new-helper-udp_flow_src_port.patch 
 tools/include/uapi/linux/bpf.h | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

$ cd ../libbpf
$ git am -p2 ../linux/0001-bpf-introduce-new-helper-udp_flow_src_port.patch
Applying: bpf: introduce new helper udp_flow_src_port
error: patch failed: include/uapi/linux/bpf.h:2853
error: include/uapi/linux/bpf.h: patch does not apply
...

Well, the patch doesn't apply to libbpf right now, but git finds the
right paths and all that.

IMO it'd be good to not have this artificial process obstacle and all
the "sync headers" commits in the tree.

^ permalink raw reply

* Re: [PATCH net-next v3] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-08-06  0:11 UTC (permalink / raw)
  To: Heiner Kallweit, Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml,
	openbmc@lists.ozlabs.org
In-Reply-To: <291a3c6e-ca8f-a9b8-a0b8-735a68dc04ea@gmail.com>

On 8/5/19 1:45 PM, Heiner Kallweit wrote:
> On 04.08.2019 21:22, Vladimir Oltean wrote:
>> On Sun, 4 Aug 2019 at 19:07, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>
>>> On 04.08.2019 17:59, Vladimir Oltean wrote:
>>>> On Sun, 4 Aug 2019 at 17:52, Andrew Lunn <andrew@lunn.ch> wrote:
>>>>>
>>>>>>> The patchset looks better now. But is it ok, I wonder, to keep
>>>>>>> PHY_BCM_FLAGS_MODE_1000BX in phydev->dev_flags, considering that
>>>>>>> phy_attach_direct is overwriting it?
>>>>>>
>>>>>
>>>>>> I checked ftgmac100 driver (used on my machine) and it calls
>>>>>> phy_connect_direct which passes phydev->dev_flags when calling
>>>>>> phy_attach_direct: that explains why the flag is not cleared in my
>>>>>> case.
>>>>>
>>>>> Yes, that is the way it is intended to be used. The MAC driver can
>>>>> pass flags to the PHY. It is a fragile API, since the MAC needs to
>>>>> know what PHY is being used, since the flags are driver specific.
>>>>>
>>>>> One option would be to modify the assignment in phy_attach_direct() to
>>>>> OR in the flags passed to it with flags which are already in
>>>>> phydev->dev_flags.
>>>>>
>>>>>         Andrew
>>>>
>>>> Even if that were the case (patching phy_attach_direct to apply a
>>>> logical-or to dev_flags), it sounds fishy to me that the genphy code
>>>> is unable to determine that this PHY is running in 1000Base-X mode.
>>>>
>>>> In my opinion it all boils down to this warning:
>>>>
>>>> "PHY advertising (0,00000200,000062c0) more modes than genphy
>>>> supports, some modes not advertised".
>>>>
>>> The genphy code deals with Clause 22 + Gigabit BaseT only.
>>> Question is whether you want aneg at all in 1000Base-X mode and
>>> what you want the config_aneg callback to do.
>>> There may be some inspiration in the Marvel PHY drivers.
>>>
>>
>> AN for 1000Base-X still gives you duplex and pause frame settings. I
>> thought the base page format for exchanging that info is standardized
>> in clause 37.
>> Does genphy cover only copper media by design, or is it desirable to
>> augment genphy_read_status?
>>
> So far we care about copper only in phylib. Some constants needed for
> Clause 37 support are defined, but used by few drivers only.
> 
> ADVERTISE_1000XHALF
> ADVERTISE_1000XFULL
> ADVERTISE_1000XPAUSE
> ADVERTISE_1000XPSE_ASYM
> 
> I think it would make sense to have something like genphy_c37_config_aneg.
> Similar for read_status.

Thank you all for the inputs on this patch.

If I understand correctly, we are going to create a set of genphy_c37_* functions for 1000x support so it can be used by phy drivers? Or are we considering other options? What's your recommendation on this specific patch?


Thanks,

Tao

^ permalink raw reply

* Re: [PATCH v1 1/3] net: hisilicon: make hip04_tx_reclaim non-reentrant
From: Jakub Kicinski @ 2019-08-06  0:46 UTC (permalink / raw)
  To: Jiangfeng Xiao
  Cc: davem, yisen.zhuang, salil.mehta, netdev, linux-kernel, leeyou.li,
	xiaowei774, nixiaoming
In-Reply-To: <1564835501-90257-2-git-send-email-xiaojiangfeng@huawei.com>

On Sat, 3 Aug 2019 20:31:39 +0800, Jiangfeng Xiao wrote:
> If hip04_tx_reclaim is interrupted while it is running
> and then __napi_schedule continues to execute
> hip04_rx_poll->hip04_tx_reclaim, reentrancy occurs
> and oops is generated. So you need to mask the interrupt
> during the hip04_tx_reclaim run.

Napi poll method for the same napi instance can't be run concurrently.
Could you explain a little more what happens here?

Also looking at hip04_rx_poll() I don't think the interrupt re-enabling
logic guarantees the interrupt is not armed when NAPI is scheduled.
Please note that NAPI is no longer scheduled if napi_complete_done()
returned false.

^ permalink raw reply

* [PATCH v2] tools: bpftool: fix reading from /proc/config.gz
From: Peter Wu @ 2019-08-06  1:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: netdev, Stanislav Fomichev, Jakub Kicinski, Quentin Monnet

/proc/config has never existed as far as I can see, but /proc/config.gz
is present on Arch Linux. Execute an external gunzip program to avoid
linking to zlib and rework the option scanning code since a pipe is not
seekable. This also fixes a file handle leak on some error paths.

Fixes: 4567b983f78c ("tools: bpftool: add probes for kernel configuration options")
Cc: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
---
 v2: fix style (reorder vars as reverse xmas tree, rename function,
     braces), fallback to /proc/config.gz if uname() fails.

Hi,

Although Stanislav and Jakub suggested to use zlib in v1, I have not
implemented that yet since the current patch is quite minimal.

Using zlib instead of executing an external gzip program would like add
another 100-150 lines. It likely requires a bigger rewrite to avoid
getline() assuming that no temporary file is used for the uncompressed
config. If zlib is desired, I would suggest doing it in another patch.

Thoughts?

Kind regards,
Peter
---
 tools/bpf/bpftool/feature.c | 104 +++++++++++++++++++++---------------
 1 file changed, 60 insertions(+), 44 deletions(-)

diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c
index d672d9086fff..b9ade5a8bc3c 100644
--- a/tools/bpf/bpftool/feature.c
+++ b/tools/bpf/bpftool/feature.c
@@ -284,34 +284,35 @@ static void probe_jit_limit(void)
 	}
 }
 
-static char *get_kernel_config_option(FILE *fd, const char *option)
+static bool read_next_kernel_config_option(FILE *fd, char **buf_p, size_t *n_p,
+					   char **value)
 {
-	size_t line_n = 0, optlen = strlen(option);
-	char *res, *strval, *line = NULL;
-	ssize_t n;
+	ssize_t linelen;
+	char *sep;
 
-	rewind(fd);
-	while ((n = getline(&line, &line_n, fd)) > 0) {
-		if (strncmp(line, option, optlen))
-			continue;
-		/* Check we have at least '=', value, and '\n' */
-		if (strlen(line) < optlen + 3)
+	while ((linelen = getline(buf_p, n_p, fd)) > 0) {
+		char *line = *buf_p;
+
+		if (strncmp(line, "CONFIG_", 7))
 			continue;
-		if (*(line + optlen) != '=')
+
+		sep = memchr(line, '=', linelen);
+		if (!sep)
 			continue;
 
 		/* Trim ending '\n' */
-		line[strlen(line) - 1] = '\0';
+		line[linelen - 1] = '\0';
+
+		/* Split on '=' and ensure that a value is present. */
+		*sep = '\0';
+		if (!sep[1])
+			continue;
 
-		/* Copy and return config option value */
-		strval = line + optlen + 1;
-		res = strdup(strval);
-		free(line);
-		return res;
+		*value = sep + 1;
+		return true;
 	}
-	free(line);
 
-	return NULL;
+	return false;
 }
 
 static void probe_kernel_image_config(void)
@@ -386,31 +387,36 @@ static void probe_kernel_image_config(void)
 		/* test_bpf module for BPF tests */
 		"CONFIG_TEST_BPF",
 	};
+	char *values[ARRAY_SIZE(options)] = { };
 	char *value, *buf = NULL;
 	struct utsname utsn;
 	char path[PATH_MAX];
+	bool is_pipe = false;
+	FILE *fd = NULL;
 	size_t i, n;
 	ssize_t ret;
-	FILE *fd;
 
-	if (uname(&utsn))
-		goto no_config;
+	if (!uname(&utsn)) {
+		snprintf(path, sizeof(path), "/boot/config-%s", utsn.release);
 
-	snprintf(path, sizeof(path), "/boot/config-%s", utsn.release);
+		fd = fopen(path, "r");
+		if (!fd && errno != ENOENT)
+			p_info("Cannot open %s: %s", path, strerror(errno));
+	}
 
-	fd = fopen(path, "r");
-	if (!fd && errno == ENOENT) {
-		/* Some distributions put the config file at /proc/config, give
-		 * it a try.
-		 * Sometimes it is also at /proc/config.gz but we do not try
-		 * this one for now, it would require linking against libz.
+	if (!fd) {
+		/* Some distributions build with CONFIG_IKCONFIG=y and put the
+		 * config file at /proc/config.gz. We try to invoke an external
+		 * gzip program to avoid linking to libz.
+		 * Hide stderr to avoid interference with the JSON output.
 		 */
-		fd = fopen("/proc/config", "r");
+		fd = popen("gunzip -c /proc/config.gz 2>/dev/null", "r");
+		is_pipe = true;
 	}
 	if (!fd) {
 		p_info("skipping kernel config, can't open file: %s",
 		       strerror(errno));
-		goto no_config;
+		goto end_parse;
 	}
 	/* Sanity checks */
 	ret = getline(&buf, &n, fd);
@@ -418,27 +424,37 @@ static void probe_kernel_image_config(void)
 	if (!buf || !ret) {
 		p_info("skipping kernel config, can't read from file: %s",
 		       strerror(errno));
-		free(buf);
-		goto no_config;
+		goto end_parse;
 	}
 	if (strcmp(buf, "# Automatically generated file; DO NOT EDIT.\n")) {
 		p_info("skipping kernel config, can't find correct file");
-		free(buf);
-		goto no_config;
+		goto end_parse;
+	}
+
+	while (read_next_kernel_config_option(fd, &buf, &n, &value)) {
+		for (i = 0; i < ARRAY_SIZE(options); i++) {
+			if (values[i] || strcmp(buf, options[i]))
+				continue;
+
+			values[i] = strdup(value);
+		}
+	}
+
+end_parse:
+	if (fd) {
+		if (is_pipe) {
+			if (pclose(fd))
+				p_info("failed to read /proc/config.gz");
+		} else {
+			fclose(fd);
+		}
 	}
 	free(buf);
 
 	for (i = 0; i < ARRAY_SIZE(options); i++) {
-		value = get_kernel_config_option(fd, options[i]);
-		print_kernel_option(options[i], value);
-		free(value);
+		print_kernel_option(options[i], values[i]);
+		free(values[i]);
 	}
-	fclose(fd);
-	return;
-
-no_config:
-	for (i = 0; i < ARRAY_SIZE(options); i++)
-		print_kernel_option(options[i], NULL);
 }
 
 static bool probe_bpf_syscall(const char *define_prefix)
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-06  1:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <CALCETrVtPs8gY-H4gmzSqPboid3CB++n50SvYd6RU9YVde_-Ow@mail.gmail.com>

On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> It tries to make the kernel respect the access modes for fds.  Without
> this patch, there seem to be some holes: nothing looked at program fds
> and, unless I missed something, you could take a readonly fd for a
> program, pin the program, and reopen it RW.

I think it's by design. iirc Daniel had a use case for something like this.

The key to understand is that bpf is not about security. It's about safety.
All features are geared to safety.
The users are trusted most of the time.
The number of unprivileged bpf use cases is tiny compared to trusted.
Hence unprivileged bpf is actually something that can be deprecated.

> Other than the issue that this patch partially fixes, can you see any
> reason that loading a program should require privilege?  Obviously the
> verifier is weakened a bit when called by privileged users, but a lot
> of that is about excessive resource usage and various less-well-tested
> features.  It seems to me that most of the value of bpf() should be
> available to programs that should not need privilege to load.  Are
> there things I'm missing?

see below.

> LPM: I don't see why this requires privilege at all.  It indeed checks
> capable(CAP_SYS_ADMIN), but I don't see why.

see below.

> 
> >
> > Attaching to a cgroup already has file based permission checks.
> > The user needs to open cgroup directory to attach.
> > acls on cgroup dir can already be used to prevent attaching to
> > certain parts of cgroup hierarchy.
> 
> The current checks seem inadequate.
> 
> $ echo 'yay' </sys/fs/cgroup/systemd/system.slice/
> 
> The ability to obtain an fd to a cgroup does *not* imply any right to
> modify that cgroup.  The ability to write to a cgroup directory
> already means something else -- it's the ability to create cgroups
> under the group in question.  I'm suggesting that a new API be added
> that allows attaching a bpf program to a cgroup without capabilities
> and that instead requires write access to a new file in the cgroup
> directory.  (It could be a single file for all bpf types or one file
> per type.  I prefer the latter -- it gives the admin finer-grained
> control.)

This is something to discuss. I don't mind something like this,
but in general bpf is not for untrusted users.
Hence I don't want to overdesign.

> 
> > What we need is to drop privileges sooner in daemons like systemd.
> 
> This is doable right now: systemd could fork off a subprocess and
> delegate its cgroup operations to it.  It would be maybe a couple
> hundred lines of code.  As an added benefit, that subprocess could
> verify that the bpf operations in question are reasonable.
> Alternatively, if there was a CAP_BPF_ADMIN, systemd could retain that
> capability and flip it on and off as needed.

See https://github.com/systemd/systemd/blob/01234e1fe777602265ffd25ab6e73823c62b0efe/src/core/bpf-firewall.c#L671-L674
bpf based IP sandboxing doesn't work in 'systemd --user'.
That is just one of the problems that people complained about.

Note that systemd bpf usage is basic. There is ongoing work to
adopt libbpf in systemd, so more features and more use cases will open up.

Inside containers and inside nested containers we need to start processes
that will use bpf. All of the processes are trusted.
We need to drop root not to be secure, but to be safe.
Consider a bug in a code that accidently did sys_kill(-1).
Dropping root is a mitigation for bugs like this.

> 
> > Container management daemon runs in the nested containers.
> > These trusted daemons need to have access to full bpf, but they
> > don't want to be root all the time.
> > They cannot flip back and forth via seteuid to root every time they
> > need to do bpf.
> > Hence the idea is to have a file that this daemon can open,
> > then drop privileges and still keep doing bpf things because FD is held.
> > Outer container daemon can pass this /dev/bpf's FD to inner daemon, etc.
> > This /dev/bpf would be accessible to root only.
> > There is no desire to open it up to non-root.
> 
> This seems extremely dangerous right now.  A program that can bypass
> *all* of the capable() checks in bpf() can do a whole lot.  Among
> other things, it can read all of kernel memory. 

It's 'dangerous' only if you think about it from security point of view.
The tracing (and sometimes networking) bpf progs need to read all of kernel
memory without being root.
That is the whole point of the /dev/bpf.

> This seems to have most of the same problems.  My main point is that
> it conflates a whole lot of different permissions, and I really don't
> think it's that much work to mostly disentangle the permissions in
> question.  My little series (if completed) plus a patch to allow
> unprivileged cgroup attach operations if you have an FMODE_WRITE fd to
> an appropriate file should get most of the way there.

I think I understand your concern. One /dev/bpf magic to by-pass
all of capable() checks in bpf doesn't look nice indeed.

Now to answer your question about capable(sys_admin) in the verifier.

See this bugfix:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/commit/?id=7c2e988f400e83501e0a3568250780609b7c8263

It's a bug in the JIT that was there pretty much forever,
but it got exposed due to two new bpf features.
Initially syzbot complained that bounded loops allow jmp to 1st insn.
syzbot reproducer contained 'never taken' branch, so buggy JIT
wouldn't have caused a kernel panic for this reproducer though
JITed x86 code is broken.
It took me few days to realize that with bpf2bpf calls _and_
bounded loops I can craft a test that will expose this JIT bug
and will crash the kernel.
So a combination of two recent verifier features exposed old JIT bug
that would have been a security issue if we didn't gate these verifier
features for root only.
Now it's simply a kernel bugfix.

Such subtle bugs happen all the time when new verifier features are
introduced. Hence we always start them with root only.
bpf2bpf calls, bounded loops, precision tracking, dead code elimination,
LPM maps, many programs type are root only.
We don't want cve-s to be filed for every bug like this.

Even if we start relaxing features (like dropping root from LPM map)
at any given time there will be a lot of useful verifier features,
maps and program types that are root only.
For root == for trusted users only.
Unfortunately this approach creates adoption problem.
The trusted users don't want to be root to use bpf.
Hence this /dev/bpf.

To solve your concern of bypassing all capable checks...
How about we do /dev/bpf/full_verifier first?
It will replace capable() checks in the verifier only.

How to delegate cgroup attach permission is a follow up discussion.
Could be special files in cgroup dir as you proposed or something else.
Let's table that and focus on the verifier first.

^ permalink raw reply

* [PATCH net-next v4] net: can: Fix compiling warnings for two functions
From: Mao Wenan @ 2019-08-06  1:37 UTC (permalink / raw)
  To: davem, socketcan; +Cc: netdev, linux-kernel, kernel-janitors, maowenan
In-Reply-To: <20190805.103002.641507066504156536.davem@davemloft.net>

There are two warnings in net/can, fix them by setting
bcm_sock_no_ioctlcmd and raw_sock_no_ioctlcmd as static.

net/can/bcm.c:1683:5: warning: symbol 'bcm_sock_no_ioctlcmd' was not declared. Should it be static?
net/can/raw.c:840:5: warning: symbol 'raw_sock_no_ioctlcmd' was not declared. Should it be static?

Fixes: 473d924d7d46 ("can: fix ioctl function removal")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
---
 v1->v2: change patch description typo error, 'warings' to 'warnings'.
 v2->v3: change subject of patch.
 v3->v4: change the alignment of two functions. 
 net/can/bcm.c | 4 ++--
 net/can/raw.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/can/bcm.c b/net/can/bcm.c
index bf1d0bbecec8..eb1d28b8c46a 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1680,8 +1680,8 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
-			 unsigned long arg)
+static int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+				unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
 	return -ENOIOCTLCMD;
diff --git a/net/can/raw.c b/net/can/raw.c
index da386f1fa815..a30aaecd9327 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -837,8 +837,8 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
-			 unsigned long arg)
+static int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+				unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
 	return -ENOIOCTLCMD;
-- 
2.20.1


^ permalink raw reply related

* [PATCH v3] mlx5: Use refcount_t for refcount
From: Chuhong Yuan @ 2019-08-06  1:59 UTC (permalink / raw)
  Cc: Leon Romanovsky, Doug Ledford, Jason Gunthorpe, Saeed Mahameed,
	David S . Miller, linux-rdma, linux-kernel, netdev, Chuhong Yuan

Reference counters are preferred to use refcount_t instead of
atomic_t.
This is because the implementation of refcount_t can prevent
overflows and detect possible use-after-free.
So convert atomic_t ref counters to refcount_t.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
---
Changes in v3:
  - Merge v2 patches together.

 drivers/infiniband/hw/mlx5/srq_cmd.c         | 6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 6 +++---
 include/linux/mlx5/driver.h                  | 3 ++-
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/srq_cmd.c b/drivers/infiniband/hw/mlx5/srq_cmd.c
index b0d0687c7a68..8fc3630a9d4c 100644
--- a/drivers/infiniband/hw/mlx5/srq_cmd.c
+++ b/drivers/infiniband/hw/mlx5/srq_cmd.c
@@ -86,7 +86,7 @@ struct mlx5_core_srq *mlx5_cmd_get_srq(struct mlx5_ib_dev *dev, u32 srqn)
 	xa_lock(&table->array);
 	srq = xa_load(&table->array, srqn);
 	if (srq)
-		atomic_inc(&srq->common.refcount);
+		refcount_inc(&srq->common.refcount);
 	xa_unlock(&table->array);
 
 	return srq;
@@ -592,7 +592,7 @@ int mlx5_cmd_create_srq(struct mlx5_ib_dev *dev, struct mlx5_core_srq *srq,
 	if (err)
 		return err;
 
-	atomic_set(&srq->common.refcount, 1);
+	refcount_set(&srq->common.refcount, 1);
 	init_completion(&srq->common.free);
 
 	err = xa_err(xa_store_irq(&table->array, srq->srqn, srq, GFP_KERNEL));
@@ -675,7 +675,7 @@ static int srq_event_notifier(struct notifier_block *nb,
 	xa_lock(&table->array);
 	srq = xa_load(&table->array, srqn);
 	if (srq)
-		atomic_inc(&srq->common.refcount);
+		refcount_inc(&srq->common.refcount);
 	xa_unlock(&table->array);
 
 	if (!srq)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index b8ba74de9555..7b44d1e49604 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -53,7 +53,7 @@ mlx5_get_rsc(struct mlx5_qp_table *table, u32 rsn)
 
 	common = radix_tree_lookup(&table->tree, rsn);
 	if (common)
-		atomic_inc(&common->refcount);
+		refcount_inc(&common->refcount);
 
 	spin_unlock_irqrestore(&table->lock, flags);
 
@@ -62,7 +62,7 @@ mlx5_get_rsc(struct mlx5_qp_table *table, u32 rsn)
 
 void mlx5_core_put_rsc(struct mlx5_core_rsc_common *common)
 {
-	if (atomic_dec_and_test(&common->refcount))
+	if (refcount_dec_and_test(&common->refcount))
 		complete(&common->free);
 }
 
@@ -209,7 +209,7 @@ static int create_resource_common(struct mlx5_core_dev *dev,
 	if (err)
 		return err;
 
-	atomic_set(&qp->common.refcount, 1);
+	refcount_set(&qp->common.refcount, 1);
 	init_completion(&qp->common.free);
 	qp->pid = current->pid;
 
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 0e6da1840c7d..5b56f343ce87 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -47,6 +47,7 @@
 #include <linux/interrupt.h>
 #include <linux/idr.h>
 #include <linux/notifier.h>
+#include <linux/refcount.h>
 
 #include <linux/mlx5/device.h>
 #include <linux/mlx5/doorbell.h>
@@ -398,7 +399,7 @@ enum mlx5_res_type {
 
 struct mlx5_core_rsc_common {
 	enum mlx5_res_type	res;
-	atomic_t		refcount;
+	refcount_t		refcount;
 	struct completion	free;
 };
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v2] tools: bpftool: fix reading from /proc/config.gz
From: Jakub Kicinski @ 2019-08-06  1:59 UTC (permalink / raw)
  To: Peter Wu
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Stanislav Fomichev,
	Quentin Monnet
In-Reply-To: <20190806010702.3303-1-peter@lekensteyn.nl>

On Tue,  6 Aug 2019 02:07:02 +0100, Peter Wu wrote:
> /proc/config has never existed as far as I can see, but /proc/config.gz
> is present on Arch Linux. Execute an external gunzip program to avoid
> linking to zlib and rework the option scanning code since a pipe is not
> seekable. This also fixes a file handle leak on some error paths.

Please post the fix for the handle leak separately against the bpf tree.

> Fixes: 4567b983f78c ("tools: bpftool: add probes for kernel configuration options")

Other than the leak I'm not sure this qualifies as a bug.
Reading config.gz was consciously left to be implemented later.

> Cc: Quentin Monnet <quentin.monnet@netronome.com>
> Signed-off-by: Peter Wu <peter@lekensteyn.nl>
> ---
>  v2: fix style (reorder vars as reverse xmas tree, rename function,
>      braces), fallback to /proc/config.gz if uname() fails.
> 
> Hi,
> 
> Although Stanislav and Jakub suggested to use zlib in v1, I have not
> implemented that yet since the current patch is quite minimal.
> 
> Using zlib instead of executing an external gzip program would like add
> another 100-150 lines. It likely requires a bigger rewrite to avoid
> getline() assuming that no temporary file is used for the uncompressed
> config. If zlib is desired, I would suggest doing it in another patch.
> 
> Thoughts?

I'd rather avoid the fork and pipe. We already implicitly link against
zlib for libelf etc.

^ permalink raw reply

* Re: [PATCH v1 1/3] net: hisilicon: make hip04_tx_reclaim non-reentrant
From: Jiangfeng Xiao @ 2019-08-06  2:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, yisen.zhuang, salil.mehta, netdev, linux-kernel, leeyou.li,
	xiaowei774, nixiaoming
In-Reply-To: <20190805174618.2b3b551a@cakuba.netronome.com>


On 2019/8/6 8:46, Jakub Kicinski wrote:
> On Sat, 3 Aug 2019 20:31:39 +0800, Jiangfeng Xiao wrote:
>> If hip04_tx_reclaim is interrupted while it is running
>> and then __napi_schedule continues to execute
>> hip04_rx_poll->hip04_tx_reclaim, reentrancy occurs
>> and oops is generated. So you need to mask the interrupt
>> during the hip04_tx_reclaim run.
> 
> Napi poll method for the same napi instance can't be run concurrently.
> Could you explain a little more what happens here?
> 
Because netif_napi_add(ndev, &priv->napi, hip04_rx_poll, NAPI_POLL_WEIGHT);
So hip04_rx_poll is a napi instance.
I did not say that hip04_rx_poll has reentered.
I am talking about the reentrant of hip04_tx_reclaim.


Pre-modification code:
static int hip04_rx_poll(struct napi_struct *napi, int budget)
{
	[...]
	/* enable rx interrupt */
	writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);

	napi_complete_done(napi, rx);
done:
	/* clean up tx descriptors and start a new timer if necessary */
	tx_remaining = hip04_tx_reclaim(ndev, false);
	[...]
}
hip04_tx_reclaim is executed after "enable rx interrupt" and napi_complete_done.

If hip04_tx_reclaim is interrupted while it is running, and then
__irq_svc->gic_handle_irq->hip04_mac_interrupt->__napi_schedule->hip04_rx_poll->hip04_tx_reclaim


Also looking at hip04_tx_reclaim

static int hip04_tx_reclaim(struct net_device *ndev, bool force)
{
[1]     struct hip04_priv *priv = netdev_priv(ndev);
[2]     unsigned tx_tail = priv->tx_tail;
[3]	[...]
[4]	bytes_compl += priv->tx_skb[tx_tail]->len;
[5]	dev_kfree_skb(priv->tx_skb[tx_tail]);
[6]	priv->tx_skb[tx_tail] = NULL;
[7]	tx_tail = TX_NEXT(tx_tail);
[8]	[...]
[9]	priv->tx_tail = tx_tail;
}

An interrupt occurs if hip04_tx_reclaim just executes to the line 6,
priv->tx_skb[tx_tail] is NULL, and then
__irq_svc->gic_handle_irq->hip04_mac_interrupt->__napi_schedule->hip04_rx_poll->hip04_tx_reclaim

Then hip04_tx_reclaim will handle kernel NULL pointer dereference on line 4.
A reentrant occurs in hip04_tx_reclaim and oops is generated.





My commit is to execute hip04_tx_reclaim before "enable rx interrupt" and napi_complete_done.
Then hip04_tx_reclaim can also be protected by the napi poll method so that no reentry occurs.

thanks.


^ permalink raw reply

* [PATCH v2 bpf-next 0/2] selftests/bpf: more loop tests
From: Alexei Starovoitov @ 2019-08-06  2:17 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, bpf, kernel-team

Add two bounded loop tests.

v1-v2: addressed feedback from Yonghong.

Alexei Starovoitov (2):
  selftests/bpf: add loop test 4
  selftests/bpf: add loop test 5

 .../bpf/prog_tests/bpf_verif_scale.c          |  2 ++
 tools/testing/selftests/bpf/progs/loop4.c     | 18 +++++++++++
 tools/testing/selftests/bpf/progs/loop5.c     | 32 +++++++++++++++++++
 3 files changed, 52 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/loop4.c
 create mode 100644 tools/testing/selftests/bpf/progs/loop5.c

-- 
2.20.0


^ permalink raw reply

* [PATCH v2 bpf-next 1/2] selftests/bpf: add loop test 4
From: Alexei Starovoitov @ 2019-08-06  2:17 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, bpf, kernel-team
In-Reply-To: <20190806021744.2953168-1-ast@kernel.org>

Add a test that returns a 'random' number between [0, 2^20)
If state pruning is not working correctly for loop body the number of
processed insns will be 2^20 * num_of_insns_in_loop_body and the program
will be rejected.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
---
 .../selftests/bpf/prog_tests/bpf_verif_scale.c |  1 +
 tools/testing/selftests/bpf/progs/loop4.c      | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/loop4.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
index b4be96162ff4..e9e72d8d7aae 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
@@ -71,6 +71,7 @@ void test_bpf_verif_scale(void)
 
 		{ "loop1.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
 		{ "loop2.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
+		{ "loop4.o", BPF_PROG_TYPE_SCHED_CLS },
 
 		/* partial unroll. 19k insn in a loop.
 		 * Total program size 20.8k insn.
diff --git a/tools/testing/selftests/bpf/progs/loop4.c b/tools/testing/selftests/bpf/progs/loop4.c
new file mode 100644
index 000000000000..650859022771
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/loop4.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("socket")
+int combinations(volatile struct __sk_buff* skb)
+{
+	int ret = 0, i;
+
+#pragma nounroll
+	for (i = 0; i < 20; i++)
+		if (skb->len)
+			ret |= 1 << i;
+	return ret;
+}
-- 
2.20.0


^ permalink raw reply related

* [PATCH v2 bpf-next 2/2] selftests/bpf: add loop test 5
From: Alexei Starovoitov @ 2019-08-06  2:17 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, bpf, kernel-team
In-Reply-To: <20190806021744.2953168-1-ast@kernel.org>

Add a test with multiple exit conditions.
It's not an infinite loop only when the verifier can properly track
all math on variable 'i' through all possible ways of executing this loop.

barrier()s are needed to disable llvm optimization that combines multiple
branches into fewer branches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 .../bpf/prog_tests/bpf_verif_scale.c          |  1 +
 tools/testing/selftests/bpf/progs/loop5.c     | 32 +++++++++++++++++++
 2 files changed, 33 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/loop5.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
index e9e72d8d7aae..0caf8eafa9eb 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
@@ -72,6 +72,7 @@ void test_bpf_verif_scale(void)
 		{ "loop1.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
 		{ "loop2.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
 		{ "loop4.o", BPF_PROG_TYPE_SCHED_CLS },
+		{ "loop5.o", BPF_PROG_TYPE_SCHED_CLS },
 
 		/* partial unroll. 19k insn in a loop.
 		 * Total program size 20.8k insn.
diff --git a/tools/testing/selftests/bpf/progs/loop5.c b/tools/testing/selftests/bpf/progs/loop5.c
new file mode 100644
index 000000000000..28d1d668f07c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/loop5.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+#define barrier() __asm__ __volatile__("": : :"memory")
+
+char _license[] SEC("license") = "GPL";
+
+SEC("socket")
+int while_true(volatile struct __sk_buff* skb)
+{
+	int i = 0;
+
+	while (1) {
+		if (skb->len)
+			i += 3;
+		else
+			i += 7;
+		if (i == 9)
+			break;
+		barrier();
+		if (i == 10)
+			break;
+		barrier();
+		if (i == 13)
+			break;
+		barrier();
+		if (i == 14)
+			break;
+	}
+	return i;
+}
-- 
2.20.0


^ permalink raw reply related

* [PATCH 1/2] cxgb4: smt: Add lock for atomic_dec_and_test
From: Chuhong Yuan @ 2019-08-06  2:58 UTC (permalink / raw)
  Cc: Vishal Kulkarni, David S . Miller, netdev, linux-kernel,
	Chuhong Yuan

The atomic_dec_and_test() is not safe because it is
outside of locks.
Move the locks of t4_smte_free() to its caller,
cxgb4_smt_release() to protect the atomic decrement.

Fixes: 3bdb376e6944 ("cxgb4: introduce SMT ops to prepare for SMAC rewrite support")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
---
 drivers/net/ethernet/chelsio/cxgb4/smt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/smt.c b/drivers/net/ethernet/chelsio/cxgb4/smt.c
index eaf1fb74689c..d6e84c8b5554 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/smt.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/smt.c
@@ -97,11 +97,9 @@ static struct smt_entry *find_or_alloc_smte(struct smt_data *s, u8 *smac)
 
 static void t4_smte_free(struct smt_entry *e)
 {
-	spin_lock_bh(&e->lock);
 	if (atomic_read(&e->refcnt) == 0) {  /* hasn't been recycled */
 		e->state = SMT_STATE_UNUSED;
 	}
-	spin_unlock_bh(&e->lock);
 }
 
 /**
@@ -111,8 +109,10 @@ static void t4_smte_free(struct smt_entry *e)
  */
 void cxgb4_smt_release(struct smt_entry *e)
 {
+	spin_lock_bh(&e->lock);
 	if (atomic_dec_and_test(&e->refcnt))
 		t4_smte_free(e);
+	spin_unlock_bh(&e->lock);
 }
 EXPORT_SYMBOL(cxgb4_smt_release);
 
-- 
2.20.1


^ permalink raw reply related

* [PATCH 2/2] cxgb4: smt: Use normal int for refcount
From: Chuhong Yuan @ 2019-08-06  2:58 UTC (permalink / raw)
  Cc: Vishal Kulkarni, David S . Miller, netdev, linux-kernel,
	Chuhong Yuan

All refcount operations are protected by spinlocks now.
Then the atomic counter can be replaced by a normal int.

This patch depends on PATCH 1/2.

Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
---
 drivers/net/ethernet/chelsio/cxgb4/smt.c | 14 +++++++-------
 drivers/net/ethernet/chelsio/cxgb4/smt.h |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/smt.c b/drivers/net/ethernet/chelsio/cxgb4/smt.c
index d6e84c8b5554..01c65d13fc0e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/smt.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/smt.c
@@ -57,7 +57,7 @@ struct smt_data *t4_init_smt(void)
 		s->smtab[i].state = SMT_STATE_UNUSED;
 		memset(&s->smtab[i].src_mac, 0, ETH_ALEN);
 		spin_lock_init(&s->smtab[i].lock);
-		atomic_set(&s->smtab[i].refcnt, 0);
+		s->smtab[i].refcnt = 0;
 	}
 	return s;
 }
@@ -68,7 +68,7 @@ static struct smt_entry *find_or_alloc_smte(struct smt_data *s, u8 *smac)
 	struct smt_entry *e, *end;
 
 	for (e = &s->smtab[0], end = &s->smtab[s->smt_size]; e != end; ++e) {
-		if (atomic_read(&e->refcnt) == 0) {
+		if (e->refcnt == 0) {
 			if (!first_free)
 				first_free = e;
 		} else {
@@ -97,7 +97,7 @@ static struct smt_entry *find_or_alloc_smte(struct smt_data *s, u8 *smac)
 
 static void t4_smte_free(struct smt_entry *e)
 {
-	if (atomic_read(&e->refcnt) == 0) {  /* hasn't been recycled */
+	if (e->refcnt == 0) {  /* hasn't been recycled */
 		e->state = SMT_STATE_UNUSED;
 	}
 }
@@ -110,7 +110,7 @@ static void t4_smte_free(struct smt_entry *e)
 void cxgb4_smt_release(struct smt_entry *e)
 {
 	spin_lock_bh(&e->lock);
-	if (atomic_dec_and_test(&e->refcnt))
+	if ((--e->refcnt) == 0)
 		t4_smte_free(e);
 	spin_unlock_bh(&e->lock);
 }
@@ -215,14 +215,14 @@ static struct smt_entry *t4_smt_alloc_switching(struct adapter *adap, u16 pfvf,
 	e = find_or_alloc_smte(s, smac);
 	if (e) {
 		spin_lock(&e->lock);
-		if (!atomic_read(&e->refcnt)) {
-			atomic_set(&e->refcnt, 1);
+		if (!e->refcnt) {
+			e->refcnt = 1;
 			e->state = SMT_STATE_SWITCHING;
 			e->pfvf = pfvf;
 			memcpy(e->src_mac, smac, ETH_ALEN);
 			write_smt_entry(adap, e);
 		} else {
-			atomic_inc(&e->refcnt);
+			++e->refcnt;
 		}
 		spin_unlock(&e->lock);
 	}
diff --git a/drivers/net/ethernet/chelsio/cxgb4/smt.h b/drivers/net/ethernet/chelsio/cxgb4/smt.h
index d6c2cc271398..1268d6e93a47 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/smt.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/smt.h
@@ -59,7 +59,7 @@ struct smt_entry {
 	u16 idx;
 	u16 pfvf;
 	u8 src_mac[ETH_ALEN];
-	atomic_t refcnt;
+	int refcnt;
 	spinlock_t lock;	/* protect smt entry add,removal */
 };
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH v1 1/3] net: hisilicon: make hip04_tx_reclaim non-reentrant
From: Jakub Kicinski @ 2019-08-06  3:05 UTC (permalink / raw)
  To: Jiangfeng Xiao
  Cc: davem, yisen.zhuang, salil.mehta, netdev, linux-kernel, leeyou.li,
	xiaowei774, nixiaoming
In-Reply-To: <c150d41b-6f0e-ad49-e8c2-00896fc9cbe4@huawei.com>

On Tue, 6 Aug 2019 10:00:52 +0800, Jiangfeng Xiao wrote:
> If hip04_tx_reclaim is interrupted while it is running, and then
> __irq_svc->gic_handle_irq->hip04_mac_interrupt->__napi_schedule->hip04_rx_poll->hip04_tx_reclaim

Ah right, obviously you can't do stuff after napi_complete_done(), 
that makes sense.

Series looks reasonable.

^ permalink raw reply

* Re: [net-next 0/8][pull request] 100GbE Intel Wired LAN Driver Updates 2019-08-04
From: Jakub Kicinski @ 2019-08-06  3:17 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, nhorman, sassmann
In-Reply-To: <20190804115926.31944-1-jeffrey.t.kirsher@intel.com>

On Sun,  4 Aug 2019 04:59:18 -0700, Jeff Kirsher wrote:
> This series contains more updates to fm10k from Jake Keller.
> 
> Jake removes the unnecessary initialization of some variables to help
> resolve static code checker warnings.  Explicitly return success during
> resume, since the value of 'err' is always success.  Fixed a issue with
> incrementing a void pointer, which can produce undefined behavior.  Used
> the __always_unused macro for function templates that are passed as
> parameters in functions, but are not used.  Simplified the code by
> removing an unnecessary macro in determining the value of NON_Q_VECTORS.
> Fixed an issue, using bitwise operations to prevent the low address
> overwriting the high portion of the address.

Looks good. AFAIK void pointer arithmetic is not uncommon in the
kernel, but shouldn't hurt to fix it.

Do you guys have any plans to fix W=1 C=1 build for Intel drivers?
That'd be very nice :)

^ permalink raw reply

* Re: [PATCH net-next v6 0/6] flow_offload: add indr-block in nf_table_offload
From: Jakub Kicinski @ 2019-08-06  3:37 UTC (permalink / raw)
  To: wenxu; +Cc: jiri, netfilter-devel, netdev
In-Reply-To: <1564925041-23530-1-git-send-email-wenxu@ucloud.cn>

On Sun,  4 Aug 2019 21:23:55 +0800, wenxu@ucloud.cn wrote:
> From: wenxu <wenxu@ucloud.cn>
> 
> This series patch make nftables offload support the vlan and
> tunnel device offload through indr-block architecture.
> 
> The first four patches mv tc indr block to flow offload and
> rename to flow-indr-block.
> Because the new flow-indr-block can't get the tcf_block
> directly. The fifth patch provide a callback list to get 
> flow_block of each subsystem immediately when the device
> register and contain a block.
> The last patch make nf_tables_offload support flow-indr-block.

Looks good to me, thanks for the changes!

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

* Re: [net-next v2 0/8][pull request] 40GbE Intel Wired LAN Driver Updates 2019-08-05
From: Jakub Kicinski @ 2019-08-06  3:46 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, nhorman, sassmann
In-Reply-To: <20190805185459.12846-1-jeffrey.t.kirsher@intel.com>

On Mon,  5 Aug 2019 11:54:51 -0700, Jeff Kirsher wrote:
> This series contains updates to i40e driver only.
> 
> Dmitrii adds missing statistic counters for VEB and VEB TC's.
> 
> Slawomir adds support for logging the "Disable Firmware LLDP" flag
> option and its current status.
> 
> Jake fixes an issue where VF's being notified of their link status
> before their queues are enabled which was causing issues.  So always
> report link status down when the VF queues are not enabled.  Also adds
> future proofing when statistics are added or removed by adding checks to
> ensure the data pointer for the strings lines up with the expected
> statistics count.
> 
> Czeslaw fixes the advertised mode reported in ethtool for FEC, where the
> "None BaseR RS" was always being displayed no matter what the mode it
> was in.  Also added logging information when the PF is entering or
> leaving "allmulti" (or promiscuous) mode.  Fixed up the logging logic
> for VF's when leaving multicast mode to not include unicast as well.

I can understand patch 2 may be useful for troubleshooting since FW
agent is involved. But I can't really say the same for patch 6 :S
If those entered allmutli/left allmutli messages were of value core
should print them instead..

But anyway, that's not a big deal, looks reasonable.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 0/2] selftests/bpf: more loop tests
From: Yonghong Song @ 2019-08-06  4:19 UTC (permalink / raw)
  To: Alexei Starovoitov, davem@davemloft.net
  Cc: daniel@iogearbox.net, netdev@vger.kernel.org, bpf@vger.kernel.org,
	Kernel Team
In-Reply-To: <20190806021744.2953168-1-ast@kernel.org>



On 8/5/19 7:17 PM, Alexei Starovoitov wrote:
> Add two bounded loop tests.
> 
> v1-v2: addressed feedback from Yonghong.
> 
> Alexei Starovoitov (2):
>    selftests/bpf: add loop test 4
>    selftests/bpf: add loop test 5

Looks good to me. Ack for the whole series.
Acked-by: Yonghong Song <yhs@fb.com>

> 
>   .../bpf/prog_tests/bpf_verif_scale.c          |  2 ++
>   tools/testing/selftests/bpf/progs/loop4.c     | 18 +++++++++++
>   tools/testing/selftests/bpf/progs/loop5.c     | 32 +++++++++++++++++++
>   3 files changed, 52 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/progs/loop4.c
>   create mode 100644 tools/testing/selftests/bpf/progs/loop5.c
> 

^ permalink raw reply

* Re: [PATCH net-next 03/10] net: stmmac: Fix issues when number of Queues >= 4
From: Jakub Kicinski @ 2019-08-06  4:34 UTC (permalink / raw)
  To: Jose Abreu
  Cc: netdev, Joao Pinto, Giuseppe Cavallaro, Alexandre Torgue,
	David S. Miller, Maxime Coquelin, linux-stm32, linux-arm-kernel,
	linux-kernel
In-Reply-To: <5e95bb1761f9438361f198d744640685a34790ea.1565027782.git.joabreu@synopsys.com>

On Mon,  5 Aug 2019 20:01:16 +0200, Jose Abreu wrote:
> When queues >= 4 we use different registers but we were not subtracting
> the offset of 4. Fix this.
> 
> Found out by Coverity.
> 
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>

Should this be a bug fix for the net tree?

^ permalink raw reply

* [RFC PATCH] kbuild: re-implement detection of CONFIG options leaked to user-space
From: Masahiro Yamada @ 2019-08-06  4:37 UTC (permalink / raw)
  To: linux-kbuild
  Cc: Arnd Bergmann, Sam Ravnborg, Masahiro Yamada, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song, bpf,
	linux-kernel, netdev

scripts/headers_check.pl can detect references to CONFIG options in
exported headers, but it has been disabled for more than a decade.

Reverting commit 7e3fa5614117 ("kbuild: drop check for CONFIG_ in
headers_check") would emit the following warnings for headers_check
on x86:

usr/include/mtd/ubi-user.h:283: leaks CONFIG_MTD_UBI_BEB_LIMIT to userspace where it is not valid
usr/include/linux/elfcore.h:62: leaks CONFIG_BINFMT_ELF_FDPIC to userspace where it is not valid
usr/include/linux/atmdev.h:104: leaks CONFIG_COMPAT to userspace where it is not valid
usr/include/linux/raw.h:17: leaks CONFIG_MAX_RAW_DEVS to userspace where it is not valid
usr/include/linux/pktcdvd.h:37: leaks CONFIG_CDROM_PKTCDVD_WCACHE to userspace where it is not valid
usr/include/linux/videodev2.h:2465: leaks CONFIG_VIDEO_ADV_DEBUG to userspace where it is not valid
usr/include/linux/bpf.h:249: leaks CONFIG_EFFICIENT_UNALIGNED_ACCESS to userspace where it is not valid
usr/include/linux/bpf.h:819: leaks CONFIG_CGROUP_NET_CLASSID to userspace where it is not valid
usr/include/linux/bpf.h:1011: leaks CONFIG_IP_ROUTE_CLASSID to userspace where it is not valid
usr/include/linux/bpf.h:1742: leaks CONFIG_BPF_KPROBE_OVERRIDE to userspace where it is not valid
usr/include/linux/bpf.h:1747: leaks CONFIG_FUNCTION_ERROR_INJECTION to userspace where it is not valid
usr/include/linux/bpf.h:1936: leaks CONFIG_XFRM to userspace where it is not valid
usr/include/linux/bpf.h:2184: leaks CONFIG_BPF_LIRC_MODE2 to userspace where it is not valid
usr/include/linux/bpf.h:2210: leaks CONFIG_BPF_LIRC_MODE2 to userspace where it is not valid
usr/include/linux/bpf.h:2227: leaks CONFIG_SOCK_CGROUP_DATA to userspace where it is not valid
usr/include/linux/bpf.h:2311: leaks CONFIG_NET to userspace where it is not valid
usr/include/linux/bpf.h:2348: leaks CONFIG_NET to userspace where it is not valid
usr/include/linux/bpf.h:2422: leaks CONFIG_BPF_LIRC_MODE2 to userspace where it is not valid
usr/include/linux/bpf.h:2528: leaks CONFIG_NET to userspace where it is not valid
usr/include/linux/eventpoll.h:82: leaks CONFIG_PM_SLEEP to userspace where it is not valid
usr/include/linux/hw_breakpoint.h:27: leaks CONFIG_HAVE_MIXED_BREAKPOINTS_REGS to userspace where it is not valid
usr/include/linux/cm4000_cs.h:26: leaks CONFIG_COMPAT to userspace where it is not valid
usr/include/linux/pkt_cls.h:301: leaks CONFIG_NET_CLS_ACT to userspace where it is not valid
usr/include/asm-generic/unistd.h:651: leaks CONFIG_MMU to userspace where it is not valid
usr/include/asm-generic/fcntl.h:119: leaks CONFIG_64BIT to userspace where it is not valid
usr/include/asm-generic/bitsperlong.h:9: leaks CONFIG_64BIT to userspace where it is not valid
usr/include/asm/e820.h:14: leaks CONFIG_NODES_SHIFT to userspace where it is not valid
usr/include/asm/e820.h:39: leaks CONFIG_X86_PMEM_LEGACY to userspace where it is not valid
usr/include/asm/e820.h:49: leaks CONFIG_INTEL_TXT to userspace where it is not valid
usr/include/asm/mman.h:7: leaks CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS to userspace where it is not valid
usr/include/asm/auxvec.h:14: leaks CONFIG_IA32_EMULATION to userspace where it is not valid

Most of these are false positives because scripts/headers_check.pl
parses comment lines.

It is also false negative. arch/x86/include/uapi/asm/auxvec.h contains
CONFIG_IA32_EMULATION and CONFIG_X86_64, but the only former is reported.

It would be possible to fix scripts/headers_check.pl, of course.
However, we already have some duplicated checks between headers_check
and CONFIG_UAPI_HEADER_TEST. At this moment of time, there are still
dozens of headers excluded from the header test (usr/include/Makefile),
but we might be able to remove headers_check when the time comes.

I re-implemented it in scripts/headers_install.sh by using sed because
the most of code in scripts/headers_install.sh is written is sed.

This patch works like this:

[1] Run scripts/unifdef first because we need to drop the code
    surrounded by #ifdef __KERNEL__ ... #endif

[2] Remove all C style comments. The sed code is somewhat complicated
    since we need to deal with both single and multi line comments.

    Precisely speaking, a comment block is replaced with a space just
    in case.

      CONFIG_FOO/* this is a comment */CONFIG_BAR

    should be converted into:

      CONFIG_FOO CONFIG_BAR

    instead of:

      CONFIG_FOOCONFIG_BAR

[3] Match CONFIG_... pattern. It correctly matches to all CONFIG options
    that appear in a single line.

After this commit, you will see the following warnings, all of which
are real ones.

warning: include/uapi/linux/elfcore.h: leaks CONFIG_BINFMT_ELF_FDPIC to user-space
warning: include/uapi/linux/atmdev.h: leaks CONFIG_COMPAT to user-space
warning: include/uapi/linux/raw.h: leaks CONFIG_MAX_RAW_DEVS to user-space
warning: include/uapi/linux/pktcdvd.h: leaks CONFIG_CDROM_PKTCDVD_WCACHE to user-space
warning: include/uapi/linux/eventpoll.h: leaks CONFIG_PM_SLEEP to user-space
warning: include/uapi/linux/hw_breakpoint.h: leaks CONFIG_HAVE_MIXED_BREAKPOINTS_REGS to user-space
warning: include/uapi/asm-generic/fcntl.h: leaks CONFIG_64BIT to user-space
warning: arch/x86/include/uapi/asm/mman.h: leaks CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS to user-space
warning: arch/x86/include/uapi/asm/auxvec.h: leaks CONFIG_IA32_EMULATION to user-space
warning: arch/x86/include/uapi/asm/auxvec.h: leaks CONFIG_X86_64 to user-space

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
---

I was playing with sed yesterday, but the resulted code might be unreadable.

Sed scripts tend to be somewhat unreadable.
I just wondered which language is appropriate for this?
Maybe perl, or what else? I am not good at perl, though.

Maybe, it will be better to fix existing warnings
before enabling this check.
If somebody takes a closer look at them, that would be great.

 scripts/headers_install.sh | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/scripts/headers_install.sh b/scripts/headers_install.sh
index bbaf29386995..73d95e457090 100755
--- a/scripts/headers_install.sh
+++ b/scripts/headers_install.sh
@@ -41,5 +41,34 @@ sed -E -e '
 scripts/unifdef -U__KERNEL__ -D__EXPORTED_HEADERS__ $TMPFILE > $OUTFILE
 [ $? -gt 1 ] && exit 1
 
+# Remove /* ... */ style comments, and find CONFIG_ references in code
+configs=$(sed -e '
+:comment
+	s:/\*[^*][^*]*:/*:
+	s:/\*\*\**\([^/]\):/*\1:
+	t comment
+	s:/\*\*/: :
+	t comment
+	/\/\*/! b check
+	N
+	b comment
+:print
+	P
+	D
+:check
+	s:^[^[:alnum:]_][^[:alnum:]_]*::
+	t check
+	s:^\(CONFIG_[[:alnum:]_]*\):\1\n:
+	t print
+	s:^[[:alnum:]_][[:alnum:]_]*::
+	t check
+	d
+' $OUTFILE)
+
+for c in $configs
+do
+	echo "warning: $INFILE: leaks $c to user-space" >&2
+done
+
 rm -f $TMPFILE
 trap - EXIT
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net-next 06/10] net: stmmac: Implement RSS and enable it in XGMAC core
From: Jakub Kicinski @ 2019-08-06  4:46 UTC (permalink / raw)
  To: Jose Abreu
  Cc: netdev, Joao Pinto, Giuseppe Cavallaro, Alexandre Torgue,
	David S. Miller, Maxime Coquelin, linux-stm32, linux-arm-kernel,
	linux-kernel
In-Reply-To: <e70981c111ac857a0bac77750bd69a3383d99ee0.1565027782.git.joabreu@synopsys.com>

On Mon,  5 Aug 2019 20:01:19 +0200, Jose Abreu wrote:
> Implement the RSS functionality and add the corresponding callbacks in
> XGMAC core.
> 
> Signed-off-by: Jose Abreu <joabreu@synopsys.com>
> ---
> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
> Cc: Alexandre Torgue <alexandre.torgue@st.com>
> Cc: Jose Abreu <joabreu@synopsys.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> Cc: netdev@vger.kernel.org
> Cc: linux-stm32@st-md-mailman.stormreply.com
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org

> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_descs.c b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_descs.c
> index c4c45402b8f8..9ff9d9ac1a50 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_descs.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_descs.c
> @@ -254,6 +254,34 @@ static void dwxgmac2_clear(struct dma_desc *p)
>  	p->des3 = 0;
>  }
>  
> +static int dwxgmac2_get_rx_hash(struct dma_desc *p, u32 *hash,
> +				enum pkt_hash_types *type)
> +{
> +	unsigned int rdes3 = le32_to_cpu(p->des3);
> +	u32 ptype;
> +
> +	if (rdes3 & XGMAC_RDES3_RSV) {
> +		ptype = (rdes3 & XGMAC_RDES3_L34T) >> XGMAC_RDES3_L34T_SHIFT;
> +
> +		switch (ptype) {
> +		case 0x1:
> +		case 0x2:
> +		case 0x9:
> +		case 0xA:

nit: it'd be nice to have defines for these constants

> +			*type = PKT_HASH_TYPE_L4;
> +			break;
> +		default:
> +			*type = PKT_HASH_TYPE_L3;
> +			break;
> +		}
> +
> +		*hash = le32_to_cpu(p->des1);
> +		return 0;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>  const struct stmmac_desc_ops dwxgmac210_desc_ops = {
>  	.tx_status = dwxgmac2_get_tx_status,
>  	.rx_status = dwxgmac2_get_rx_status,

> @@ -4182,7 +4208,7 @@ int stmmac_dvr_probe(struct device *device,
>  	struct net_device *ndev = NULL;
>  	struct stmmac_priv *priv;
>  	u32 queue, maxq;
> -	int ret = 0;
> +	int i, ret = 0;
>  
>  	ndev = devm_alloc_etherdev_mqs(device, sizeof(struct stmmac_priv),
>  				       MTL_MAX_TX_QUEUES, MTL_MAX_RX_QUEUES);
> @@ -4290,6 +4316,14 @@ int stmmac_dvr_probe(struct device *device,
>  #endif
>  	priv->msg_enable = netif_msg_init(debug, default_msg_level);
>  
> +	/* Initialize RSS */
> +	netdev_rss_key_fill(priv->rss.key, sizeof(priv->rss.key));
> +	for (i = 0; i < ARRAY_SIZE(priv->rss.table); i++)
> +		priv->rss.table[i] = i % priv->plat->rx_queues_to_use;

ethtool_rxfh_indir_default() ?

> +	if (priv->dma_cap.rssen && priv->plat->rss_en)
> +		ndev->features |= NETIF_F_RXHASH;
> +
>  	/* MTU range: 46 - hw-specific max */
>  	ndev->min_mtu = ETH_ZLEN - ETH_HLEN;
>  	if ((priv->plat->enh_desc) || (priv->synopsys_id >= DWMAC_CORE_4_00))

^ permalink raw reply

* [PATCH net] hv_netvsc: Fix a warning of suspicious RCU usage
From: Dexuan Cui @ 2019-08-06  5:17 UTC (permalink / raw)
  To: netdev@vger.kernel.org, David S. Miller, Haiyang Zhang,
	Stephen Hemminger
  Cc: sashal@kernel.org, KY Srinivasan, Michael Kelley,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	olaf@aepfle.de, apw@canonical.com, jasowang@redhat.com, vkuznets,
	marcelo.cerri@canonical.com


This fixes a warning of "suspicious rcu_dereference_check() usage"
when nload runs.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c | 44 +++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index f9209594624b5..25502d335b94f 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1236,25 +1236,10 @@ static void netvsc_get_pcpu_stats(struct net_device *net,
 	}
 }
 
-static void netvsc_get_stats64(struct net_device *net,
-			       struct rtnl_link_stats64 *t)
+static void netvsc_get_per_chan_stats(struct netvsc_device *nvdev,
+				      struct rtnl_link_stats64 *t)
 {
-	struct net_device_context *ndev_ctx = netdev_priv(net);
-	struct netvsc_device *nvdev = rcu_dereference_rtnl(ndev_ctx->nvdev);
-	struct netvsc_vf_pcpu_stats vf_tot;
-	int i;
-
-	if (!nvdev)
-		return;
-
-	netdev_stats_to_stats64(t, &net->stats);
-
-	netvsc_get_vf_stats(net, &vf_tot);
-	t->rx_packets += vf_tot.rx_packets;
-	t->tx_packets += vf_tot.tx_packets;
-	t->rx_bytes   += vf_tot.rx_bytes;
-	t->tx_bytes   += vf_tot.tx_bytes;
-	t->tx_dropped += vf_tot.tx_dropped;
+	u32 i;
 
 	for (i = 0; i < nvdev->num_chn; i++) {
 		const struct netvsc_channel *nvchan = &nvdev->chan_table[i];
@@ -1286,6 +1271,29 @@ static void netvsc_get_stats64(struct net_device *net,
 	}
 }
 
+static void netvsc_get_stats64(struct net_device *net,
+			       struct rtnl_link_stats64 *t)
+{
+	struct net_device_context *ndev_ctx = netdev_priv(net);
+	struct netvsc_device *nvdev;
+	struct netvsc_vf_pcpu_stats vf_tot;
+
+	netdev_stats_to_stats64(t, &net->stats);
+
+	netvsc_get_vf_stats(net, &vf_tot);
+	t->rx_packets += vf_tot.rx_packets;
+	t->tx_packets += vf_tot.tx_packets;
+	t->rx_bytes   += vf_tot.rx_bytes;
+	t->tx_bytes   += vf_tot.tx_bytes;
+	t->tx_dropped += vf_tot.tx_dropped;
+
+	rcu_read_lock();
+	nvdev = rcu_dereference(ndev_ctx->nvdev);
+	if (nvdev)
+		netvsc_get_per_chan_stats(nvdev, t);
+	rcu_read_unlock();
+}
+
 static int netvsc_set_mac_addr(struct net_device *ndev, void *p)
 {
 	struct net_device_context *ndc = netdev_priv(ndev);

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox