Netdev List
 help / color / mirror / Atom feed
* auto-split of commit. Was: [PATCH bpf-next 04/10] tools/bpf: add libbpf_prog_type_(from|to)_str helpers
From: Alexei Starovoitov @ 2019-08-28 23:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Julia Kartseva, ast, Greg Kroah-Hartman, Thomas Gleixner, rdna,
	bpf, daniel, netdev, kernel-team
In-Reply-To: <20190828163422.3d167c4b@cakuba.netronome.com>

On Wed, Aug 28, 2019 at 04:34:22PM -0700, Jakub Kicinski wrote:
> 
> Greg, Thomas, libbpf is extracted from the kernel sources and
> maintained in a clone repo on GitHub for ease of packaging.
> 
> IIUC Alexei's concern is that since we are moving the commits from
> the kernel repo to the GitHub one we have to preserve the commits
> exactly as they are, otherwise SOB lines lose their power.
> 
> Can you provide some guidance on whether that's a valid concern, 
> or whether it's perfectly fine to apply a partial patch?

Right. That's exactly the concern.

Greg, Thomas,
could you please put your legal hat on and clarify the following.
Say some developer does a patch that modifies
include/uapi/linux/bpf.h
..some other kernel code...and
tools/include/uapi/linux/bpf.h

That tools/include/uapi/linux/bpf.h is used by perf and by libbpf.
We have automatic mirror of tools/libbpf into github/libbpf/
so that external projects and can do git submodule of it,
can build packages out of it, etc.

The question is whether it's ok to split tools/* part out of
original commit, keep Author and SOB, create new commit out of it,
and automatically push that auto-generated commit into github mirror.

So far we've requested all developers to split their patches manually.
So that tools/* update is an individual commit that mirror can
simply git cherry-pick.


^ permalink raw reply

* Re: [PATCH net-next 0/4] mlxsw: Various updates
From: Jakub Kicinski @ 2019-08-28 23:45 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, jiri, mlxsw, Ido Schimmel
In-Reply-To: <20190828155437.9852-1-idosch@idosch.org>

On Wed, 28 Aug 2019 18:54:33 +0300, Ido Schimmel wrote:
> From: Ido Schimmel <idosch@mellanox.com>
> 
> Patch #1 from Amit removes 56G speed support. The reasons for this are
> detailed in the commit message.
> 
> Patch #2 from Shalom ensures that the hardware does not auto negotiate
> the number of used lanes. For example, if a four lane port supports 100G
> over both two and four lanes, it will not advertise the two lane link
> mode.
> 
> Patch #3 bumps the firmware version supported by the driver.
> 
> Patch #4 from Petr adds ethtool counters to help debug the internal PTP
> implementation in mlxsw. I copied Richard on this patch in case he has
> comments.

LGTM

^ permalink raw reply

* Re: [PATCH v2] riscv: add support for SECCOMP and SECCOMP_FILTER
From: Kees Cook @ 2019-08-28 23:44 UTC (permalink / raw)
  To: David Abdurachmanov
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov,
	Andy Lutomirski, Will Drewry, Shuah Khan, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David Abdurachmanov, Thomas Gleixner, Allison Randal,
	Alexios Zavras, Anup Patel, Vincent Chen, Alan Kao, linux-riscv,
	linux-kernel, linux-kselftest, netdev, bpf, me
In-Reply-To: <CAEn-LToB1atxDvehBanVaxg6sk8zDkMe_CbqeTVgKNzOvD9-Sw@mail.gmail.com>

On Wed, Aug 28, 2019 at 02:37:34PM -0700, David Abdurachmanov wrote:
>     --disk path=$PWD/disk \
>     --boot kernel=$PWD/${FIRMWARE} \

This is where I tripped over things. How do I specify the kernel to boot
from OUTSIDE the disk image?

-- 
Kees Cook

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-28 23:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, Steven Rostedt, David S. Miller,
	Daniel Borkmann, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <CALCETrX-bn2SpVzTkPz+A=z_oWDs7PNeouzK7wRWMzyaBd4+7g@mail.gmail.com>

On Tue, Aug 27, 2019 at 11:20:19PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 27, 2019 at 9:49 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Aug 27, 2019 at 07:00:40PM -0700, Andy Lutomirski wrote:
> > >
> > > Let me put this a bit differently. Part of the point is that
> > > CAP_TRACING should allow a user or program to trace without being able
> > > to corrupt the system. CAP_BPF as you’ve proposed it *can* likely
> > > crash the system.
> >
> > Really? I'm still waiting for your example where bpf+kprobe crashes the system...
> >
> 
> That's not what I meant.  bpf+kprobe causing a crash is a bug.  I'm
> referring to a totally different issue.  On my laptop:
> 
> $ sudo bpftool map
> 48: hash  name foobar  flags 0x0
>     key 8B  value 8B  max_entries 64  memlock 8192B
> 181: lpm_trie  flags 0x1
>     key 8B  value 8B  max_entries 1  memlock 4096B
> 182: lpm_trie  flags 0x1
>     key 20B  value 8B  max_entries 1  memlock 4096B
> 183: lpm_trie  flags 0x1
>     key 8B  value 8B  max_entries 1  memlock 4096B
> 184: lpm_trie  flags 0x1
>     key 20B  value 8B  max_entries 1  memlock 4096B
> 185: lpm_trie  flags 0x1
>     key 8B  value 8B  max_entries 1  memlock 4096B
> 186: lpm_trie  flags 0x1
>     key 20B  value 8B  max_entries 1  memlock 4096B
> 187: lpm_trie  flags 0x1
>     key 8B  value 8B  max_entries 1  memlock 4096B
> 188: lpm_trie  flags 0x1
>     key 20B  value 8B  max_entries 1  memlock 4096B
> 
> $ sudo bpftool map dump id 186
> key:
> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> 00 00 00 00
> value:
> 02 00 00 00 00 00 00 00
> Found 1 element
> 
> $ sudo bpftool map delete id 186 key hex 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00
> [this worked]
> 
> I don't know what my laptop was doing with map id 186 in particular,
> but, whatever it was, I definitely broke it.  If a BPF firewall is in
> use on something important enough, this could easily remove
> connectivity from part or all of the system.  Right now, this needs
> CAP_SYS_ADMIN.  With your patch, CAP_BPF is sufficient to do this, but
> you *also* need CAP_BPF to trace the system using BPF.  Tracing with
> BPF is 'safe' in the absence of bugs.  Modifying other peoples' maps
> is not.

That lpm_trie is likely systemd implementing IP sandboxing.
Not sure whether it's white or black list.
Deleting an IP address from that map will either allow or disallow
network traffic.
Out of band operation on bpf map broke some bpf program. Sure.
But calling it 'breaking the system' is quite a stretch.
Calling it 'crashing the system' is plain wrong.
Yet you're generalizing this bpf map read/write as
"CAP_BPF as you’ve proposed it *can* likely crash the system."
This is what I have a problem with.

Anyway, changing gears...
Yes. I did propose to make a task with CAP_BPF to be able to
manipulate arbitrary maps in the system.
You could have said that if CAP_BPF is given to 'bpftool'
then any user will be able to mess with other maps because
bpftool is likely chmod-ed 755.
Absolutely correct!
It's not a fault of the CAP_BPF scope.
Just don't give that cap to bpftool or do different acl/chmod.

> If the answer is the latter, then maybe it would make sense to try to
> implement some of the unprivileged bpf stuff and then to see whether
> CAP_BPF is still needed.

<broken_record_mode=on> Nack to extensions to unprivileged bpf.


^ permalink raw reply

* Re: [PATCH bpf-next 04/10] tools/bpf: add libbpf_prog_type_(from|to)_str helpers
From: Jakub Kicinski @ 2019-08-28 23:34 UTC (permalink / raw)
  To: Julia Kartseva, ast, Greg Kroah-Hartman, Thomas Gleixner
  Cc: rdna, bpf, daniel, netdev, kernel-team
In-Reply-To: <467620c966825173dbd65b37a3f9bd7dd4fb8184.1567024943.git.hex@fb.com>

On Wed, 28 Aug 2019 14:03:07 -0700, Julia Kartseva wrote:
> Standardize string representation of prog types by putting commonly used
> names to libbpf.
> The prog_type to string mapping is taken from bpftool:
> tools/bpf/bpftool/main.h
> 
> Signed-off-by: Julia Kartseva <hex@fb.com>

This "libbpf patches have to be completely separate" just went to
another level :/ Now we are splitting code moves into add and remove
parts which are 5 patches apart? How are we supposed to review this?


Greg, Thomas, libbpf is extracted from the kernel sources and
maintained in a clone repo on GitHub for ease of packaging.

IIUC Alexei's concern is that since we are moving the commits from
the kernel repo to the GitHub one we have to preserve the commits
exactly as they are, otherwise SOB lines lose their power.

Can you provide some guidance on whether that's a valid concern, 
or whether it's perfectly fine to apply a partial patch?

(HW vendors also back port tree-wide cleanups into their drivers,
 so if SOB lines are voided by git format-patch -- driver/path/
 that'd be quite an issue..)

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 72e6e5eb397f..946a4d41f223 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -296,6 +296,35 @@ struct bpf_object {
>  };
>  #define obj_elf_valid(o)	((o)->efile.elf)
>  
> +static const char *const prog_type_strs[] = {
> +	[BPF_PROG_TYPE_UNSPEC] = "unspec",
> +	[BPF_PROG_TYPE_SOCKET_FILTER] = "socket_filter",
> +	[BPF_PROG_TYPE_KPROBE] = "kprobe",
> +	[BPF_PROG_TYPE_SCHED_CLS] = "sched_cls",
> +	[BPF_PROG_TYPE_SCHED_ACT] = "sched_act",
> +	[BPF_PROG_TYPE_TRACEPOINT] = "tracepoint",
> +	[BPF_PROG_TYPE_XDP] = "xdp",
> +	[BPF_PROG_TYPE_PERF_EVENT] = "perf_event",
> +	[BPF_PROG_TYPE_CGROUP_SKB] = "cgroup_skb",
> +	[BPF_PROG_TYPE_CGROUP_SOCK] = "cgroup_sock",
> +	[BPF_PROG_TYPE_LWT_IN] = "lwt_in",
> +	[BPF_PROG_TYPE_LWT_OUT] = "lwt_out",
> +	[BPF_PROG_TYPE_LWT_XMIT] = "lwt_xmit",
> +	[BPF_PROG_TYPE_SOCK_OPS] = "sock_ops",
> +	[BPF_PROG_TYPE_SK_SKB] = "sk_skb",
> +	[BPF_PROG_TYPE_CGROUP_DEVICE] = "cgroup_device",
> +	[BPF_PROG_TYPE_SK_MSG] = "sk_msg",
> +	[BPF_PROG_TYPE_RAW_TRACEPOINT] = "raw_tracepoint",
> +	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
> +	[BPF_PROG_TYPE_LWT_SEG6LOCAL] = "lwt_seg6local",
> +	[BPF_PROG_TYPE_LIRC_MODE2] = "lirc_mode2",
> +	[BPF_PROG_TYPE_SK_REUSEPORT] = "sk_reuseport",
> +	[BPF_PROG_TYPE_FLOW_DISSECTOR] = "flow_dissector",
> +	[BPF_PROG_TYPE_CGROUP_SYSCTL] = "cgroup_sysctl",
> +	[BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE] = "raw_tracepoint_writable",
> +	[BPF_PROG_TYPE_CGROUP_SOCKOPT] = "cgroup_sockopt",
> +};
> +
>  void bpf_program__unload(struct bpf_program *prog)
>  {
>  	int i;
> @@ -4632,6 +4661,28 @@ int libbpf_attach_type_by_name(const char *name,
>  	return -EINVAL;
>  }
>  
> +int libbpf_prog_type_from_str(const char *str, enum bpf_prog_type *type)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(prog_type_strs); i++)
> +		if (prog_type_strs[i] && strcmp(prog_type_strs[i], str) == 0) {
> +			*type = i;
> +			return 0;
> +		}
> +
> +	return -EINVAL;
> +}
> +
> +int libbpf_prog_type_to_str(enum bpf_prog_type type, const char **str)
> +{
> +	if (type < BPF_PROG_TYPE_UNSPEC || type >= ARRAY_SIZE(prog_type_strs))
> +		return -EINVAL;
> +
> +	*str = prog_type_strs[type];
> +	return 0;
> +}
> +
>  static int
>  bpf_program__identify_section(struct bpf_program *prog,
>  			      enum bpf_prog_type *prog_type,
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index e8f70977d137..6846c488d8a2 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -122,12 +122,20 @@ LIBBPF_API int bpf_object__set_priv(struct bpf_object *obj, void *priv,
>  				    bpf_object_clear_priv_t clear_priv);
>  LIBBPF_API void *bpf_object__priv(const struct bpf_object *prog);
>  
> +/* Program and expected attach types by section name */
>  LIBBPF_API int
>  libbpf_prog_type_by_name(const char *name, enum bpf_prog_type *prog_type,
>  			 enum bpf_attach_type *expected_attach_type);
> +/* Attach type by section name */
>  LIBBPF_API int libbpf_attach_type_by_name(const char *name,
>  					  enum bpf_attach_type *attach_type);
>  
> +/* String representation of program type */
> +LIBBPF_API int libbpf_prog_type_from_str(const char *str,
> +					 enum bpf_prog_type *type);
> +LIBBPF_API int libbpf_prog_type_to_str(enum bpf_prog_type type,
> +				       const char **str);
> +
>  /* Accessors of bpf_program */
>  struct bpf_program;
>  LIBBPF_API struct bpf_program *bpf_program__next(struct bpf_program *prog,
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 664ce8e7a60e..2ea7c99f1579 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -188,4 +188,6 @@ LIBBPF_0.0.4 {
>  LIBBPF_0.0.5 {
>  	global:
>  		bpf_btf_get_next_id;
> +		libbpf_prog_type_from_str;
> +		libbpf_prog_type_to_str;
>  } LIBBPF_0.0.4;


^ permalink raw reply

* Re: [PATCH net-next 00/12] net: hns3: add some cleanups and optimizations
From: Jakub Kicinski @ 2019-08-28 23:19 UTC (permalink / raw)
  To: Huazhong Tan
  Cc: davem, netdev, linux-kernel, salil.mehta, yisen.zhuang, linuxarm,
	Andrew Lunn
In-Reply-To: <1567002196-63242-1-git-send-email-tanhuazhong@huawei.com>

On Wed, 28 Aug 2019 22:23:04 +0800, Huazhong Tan wrote:
> This patch-set includes cleanups, optimizations and bugfix for
> the HNS3 ethernet controller driver.

The phy loopback (patch 10) could probably benefit from an expert look
but in general LGTM.

^ permalink raw reply

* Re: [PATCH net v4 0/2] r8152: fix side effect
From: David Miller @ 2019-08-28 23:17 UTC (permalink / raw)
  To: hayeswang; +Cc: netdev, nic_swsd, linux-kernel
In-Reply-To: <1394712342-15778-323-Taiwan-albertk@realtek.com>

From: Hayes Wang <hayeswang@realtek.com>
Date: Wed, 28 Aug 2019 20:56:11 +0800

> v4:
> Add Fixes tag for both patch #1 and #2.

I applied v3, sorry.

I think it is OK as I will backport things to v5.2 -stable anyways.

^ permalink raw reply

* Re: [PATCH net-next] net: phy: force phy suspend when calling phy_stop
From: David Miller @ 2019-08-28 23:17 UTC (permalink / raw)
  To: shenjian15
  Cc: andrew, f.fainelli, hkallweit1, sergei.shtylyov, netdev,
	forest.zhouchang, linuxarm
In-Reply-To: <1566956087-37096-1-git-send-email-shenjian15@huawei.com>

From: Jian Shen <shenjian15@huawei.com>
Date: Wed, 28 Aug 2019 09:34:47 +0800

> Some ethernet drivers may call phy_start() and phy_stop() from
> ndo_open() and ndo_close() respectively.
> 
> When network cable is unconnected, and operate like below:
> step 1: ifconfig ethX up -> ndo_open -> phy_start ->start
> autoneg, and phy is no link.
> step 2: ifconfig ethX down -> ndo_close -> phy_stop -> just stop
> phy state machine.
> 
> This patch forces phy suspend even phydev->link is off.
> 
> Signed-off-by: Jian Shen <shenjian15@huawei.com>

Applied.

^ permalink raw reply

* [PATCH][V2] arcnet: capmode: remove redundant assignment to pointer pkt
From: Colin King @ 2019-08-28 23:14 UTC (permalink / raw)
  To: Michael Grzeschik, David S . Miller, netdev; +Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Pointer pkt is being initialized with a value that is never read
and pkt is being re-assigned a little later on. The assignment is
redundant and hence can be removed.

Addresses-Coverity: ("Ununsed value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---

V2: fix typo in patch description, pkg -> pkt

---
 drivers/net/arcnet/capmode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/arcnet/capmode.c b/drivers/net/arcnet/capmode.c
index b780be6f41ff..c09b567845e1 100644
--- a/drivers/net/arcnet/capmode.c
+++ b/drivers/net/arcnet/capmode.c
@@ -44,7 +44,7 @@ static void rx(struct net_device *dev, int bufnum,
 {
 	struct arcnet_local *lp = netdev_priv(dev);
 	struct sk_buff *skb;
-	struct archdr *pkt = pkthdr;
+	struct archdr *pkt;
 	char *pktbuf, *pkthdrbuf;
 	int ofs;
 
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] sky2: Disable MSI on yet another ASUS boards (P6Xxxx)
From: David Miller @ 2019-08-28 23:09 UTC (permalink / raw)
  To: tiwai; +Cc: mlindner, stephen, netdev, linux-kernel, swm
In-Reply-To: <20190828063119.22248-1-tiwai@suse.de>

From: Takashi Iwai <tiwai@suse.de>
Date: Wed, 28 Aug 2019 08:31:19 +0200

> A similar workaround for the suspend/resume problem is needed for yet
> another ASUS machines, P6X models.  Like the previous fix, the BIOS
> doesn't provide the standard DMI_SYS_* entry, so again DMI_BOARD_*
> entries are used instead.
> 
> Reported-and-tested-by: SteveM <swm@swm1.com>
> Signed-off-by: Takashi Iwai <tiwai@suse.de>

Applied, but this is getting suspicious.

It looks like MSI generally is not restored properly on resume on these
boards, so maybe there simply needs to be a generic PCI quirk for that?

^ permalink raw reply

* Re: [PATCH net 0/2] nfp: flower: fix bugs in merge tunnel encap code
From: David Miller @ 2019-08-28 23:08 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20190828055630.17331-1-jakub.kicinski@netronome.com>

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 27 Aug 2019 22:56:28 -0700

> John says:
> 
> There are few bugs in the merge encap code that have come to light with
> recent driver changes. Effectively, flow bind callbacks were being
> registered twice when using internal ports (new 'busy' code triggers
> this). There was also an issue with neighbour notifier messages being
> ignored for internal ports.

Series applied and queued up for v5.2 -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next 09/15] net: sgi: ioc3-eth: split ring cleaning/freeing and allocation
From: Jakub Kicinski @ 2019-08-28 23:04 UTC (permalink / raw)
  To: Thomas Bogendoerfer
  Cc: Ralf Baechle, Paul Burton, James Hogan, David S. Miller,
	linux-mips, linux-kernel, netdev
In-Reply-To: <20190828140315.17048-10-tbogendoerfer@suse.de>

On Wed, 28 Aug 2019 16:03:08 +0200, Thomas Bogendoerfer wrote:
> Do tx ring cleaning and freeing of rx buffers, when chip is shutdown and
> allocate buffers before bringing chip up.
> 
> Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>

Ah, so you are moving the alloc/free to open/close, that's good.

^ permalink raw reply

* Re: [PATCH net-next 06/15] net: sgi: ioc3-eth: get rid of ioc3_clean_rx_ring()
From: Jakub Kicinski @ 2019-08-28 23:02 UTC (permalink / raw)
  To: Thomas Bogendoerfer
  Cc: Ralf Baechle, Paul Burton, James Hogan, David S. Miller,
	linux-mips, linux-kernel, netdev
In-Reply-To: <20190828140315.17048-7-tbogendoerfer@suse.de>

On Wed, 28 Aug 2019 16:03:05 +0200, Thomas Bogendoerfer wrote:
> Clean rx ring is just called once after a new ring is allocated, which
> is per definition clean. So there is not need for this function.
> 
> Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
> ---
>  drivers/net/ethernet/sgi/ioc3-eth.c | 21 ---------------------
>  1 file changed, 21 deletions(-)
> 
> diff --git a/drivers/net/ethernet/sgi/ioc3-eth.c b/drivers/net/ethernet/sgi/ioc3-eth.c
> index 6ca560d4ab79..39631e067b71 100644
> --- a/drivers/net/ethernet/sgi/ioc3-eth.c
> +++ b/drivers/net/ethernet/sgi/ioc3-eth.c
> @@ -761,26 +761,6 @@ static void ioc3_mii_start(struct ioc3_private *ip)
>  	add_timer(&ip->ioc3_timer);
>  }
>  
> -static inline void ioc3_clean_rx_ring(struct ioc3_private *ip)
> -{
> -	struct ioc3_erxbuf *rxb;
> -	struct sk_buff *skb;
> -	int i;
> -
> -	for (i = ip->rx_ci; i & 15; i++) {
> -		ip->rx_skbs[ip->rx_pi] = ip->rx_skbs[ip->rx_ci];
> -		ip->rxr[ip->rx_pi++] = ip->rxr[ip->rx_ci++];
> -	}
> -	ip->rx_pi &= RX_RING_MASK;
> -	ip->rx_ci &= RX_RING_MASK;
> -
> -	for (i = ip->rx_ci; i != ip->rx_pi; i = (i + 1) & RX_RING_MASK) {
> -		skb = ip->rx_skbs[i];
> -		rxb = (struct ioc3_erxbuf *)(skb->data - RX_OFFSET);
> -		rxb->w0 = 0;

There's gotta be some purpose to setting this w0 word to zero no?
ioc3_rx() uses that to see if the descriptor is done, and dutifully
clears it after..

> -	}
> -}
> -
>  static inline void ioc3_clean_tx_ring(struct ioc3_private *ip)
>  {
>  	struct sk_buff *skb;
> @@ -860,7 +840,6 @@ static void ioc3_init_rings(struct net_device *dev)
>  	ioc3_free_rings(ip);
>  	ioc3_alloc_rings(dev);
>  
> -	ioc3_clean_rx_ring(ip);
>  	ioc3_clean_tx_ring(ip);
>  
>  	/* Now the rx ring base, consume & produce registers.  */


^ permalink raw reply

* Re: [PATCH net v3 0/2] r8152: fix side effect
From: David Miller @ 2019-08-28 23:02 UTC (permalink / raw)
  To: hayeswang; +Cc: netdev, nic_swsd, linux-kernel
In-Reply-To: <1394712342-15778-320-Taiwan-albertk@realtek.com>

From: Hayes Wang <hayeswang@realtek.com>
Date: Wed, 28 Aug 2019 09:51:40 +0800

> v3:
> Update the commit message for patch #1.
> 
> v2:
> Replace patch #2 with "r8152: remove calling netif_napi_del".
> 
> v1:
> The commit 0ee1f4734967 ("r8152: napi hangup fix after disconnect")
> add a check to avoid using napi_disable after netif_napi_del. However,
> the commit ffa9fec30ca0 ("r8152: set RTL8152_UNPLUG only for real
> disconnection") let the check useless.
> 
> Therefore, I revert commit 0ee1f4734967 ("r8152: napi hangup fix
> after disconnect") first, and add another patch to fix it.

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] net/ncsi: add response handlers for PLDM over NC-SI
From: David Miller @ 2019-08-28 23:00 UTC (permalink / raw)
  To: benwei; +Cc: sam, netdev, linux-kernel, openbmc
In-Reply-To: <CH2PR15MB3686302D8210855E5AB643B1A3A00@CH2PR15MB3686.namprd15.prod.outlook.com>

From: Ben Wei <benwei@fb.com>
Date: Tue, 27 Aug 2019 23:03:53 +0000

> This patch adds handlers for PLDM over NC-SI command response.
> 
> This enables NC-SI driver recognizes the packet type so the responses don't get dropped as unknown packet type.
> 
> PLDM over NC-SI are not handled in kernel driver for now, but can be passed back to user space via Netlink for further handling.
> 
> Signed-off-by: Ben Wei <benwei@fb.com>

I don't know why but patchwork puts part of your patch into the commit message, see:

https://patchwork.ozlabs.org/patch/1154104/

It's probably an encoding issue or similar.

> +static int ncsi_rsp_handler_pldm(struct ncsi_request *nr) {
> +	return 0;
> +}
> +
>  static int ncsi_rsp_handler_netlink(struct ncsi_request *nr)  {

I know other functions in this file do it, but please put the openning
curly braces of a function on a separate line.

Thank you.

^ permalink raw reply

* Re: [PATCH net] net/sched: pfifo_fast: fix wrong dereference in pfifo_fast_enqueue
From: David Miller @ 2019-08-28 22:58 UTC (permalink / raw)
  To: dcaratti; +Cc: xiyou.wangcong, jhs, jiri, netdev, pabeni, sbrivio, shuali
In-Reply-To: <d5a7a167ab57e035685445ee641840a0c5fd39ae.1566940693.git.dcaratti@redhat.com>

From: Davide Caratti <dcaratti@redhat.com>
Date: Tue, 27 Aug 2019 23:18:53 +0200

> Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
> 'TCQ_F_NOLOCK' bit in the parent qdisc, we can't assume anymore that
> per-cpu counters are there in the error path of skb_array_produce().
> Otherwise, the following splat can be seen:
 ...
> Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
> before dereferencing 'qdisc->cpu_qstats'.
> 
> Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
> CC: Paolo Abeni <pabeni@redhat.com>
> CC: Stefano Brivio <sbrivio@redhat.com>
> Reported-by: Li Shuang <shuali@redhat.com>
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Applied and queued up for v5.2 -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next 05/15] net: sgi: ioc3-eth: allocate space for desc rings only once
From: Jakub Kicinski @ 2019-08-28 22:58 UTC (permalink / raw)
  To: Thomas Bogendoerfer
  Cc: Ralf Baechle, Paul Burton, James Hogan, David S. Miller,
	linux-mips, linux-kernel, netdev
In-Reply-To: <20190828140315.17048-6-tbogendoerfer@suse.de>

On Wed, 28 Aug 2019 16:03:04 +0200, Thomas Bogendoerfer wrote:
> Memory for descriptor rings are allocated/freed, when interface is
> brought up/down. Since the size of the rings is not changeable by
> hardware, we now allocate rings now during probe and free it, when
> device is removed.
> 
> Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>

So the rings still get freed and allocated from ioc3_init()
but there's a set allocated from the start? I guess that makes 
some sense..

Most drivers will allocate rings in open() and free them in close().

^ permalink raw reply

* Re: [PATCH net] tcp: inherit timestamp on mtu probe
From: David Miller @ 2019-08-28 22:57 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, edumazet, jakub.kicinski, willemb
In-Reply-To: <20190827190933.227725-1-willemdebruijn.kernel@gmail.com>

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Tue, 27 Aug 2019 15:09:33 -0400

> From: Willem de Bruijn <willemb@google.com>
> 
> TCP associates tx timestamp requests with a byte in the bytestream.
> If merging skbs in tcp_mtu_probe, migrate the tstamp request.
> 
> Similar to MSG_EOR, do not allow moving a timestamp from any segment
> in the probe but the last. This to avoid merging multiple timestamps.
> 
> Tested with the packetdrill script at
> https://github.com/wdebruij/packetdrill/commits/mtu_probe-1
> 
> Link: http://patchwork.ozlabs.org/patch/1143278/#2232897
> Fixes: 4ed2d765dfac ("net-timestamp: TCP timestamping")
> Signed-off-by: Willem de Bruijn <willemb@google.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-28 22:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, Steven Rostedt, David S. Miller,
	Daniel Borkmann, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <CALCETrW1o+Lazi2Ng6b9JN6jeJffgdW9f3HvqYhNo4TpHRXW=g@mail.gmail.com>

On Tue, Aug 27, 2019 at 11:12:29PM -0700, Andy Lutomirski wrote:
> >
> > >
> > > From the previous discussion, you want to make progress toward solving
> > > a lot of problems with CAP_BPF.  One of them was making BPF
> > > firewalling more generally useful. By making CAP_BPF grant the ability
> > > to read kernel memory, you will make administrators much more nervous
> > > to grant CAP_BPF.
> >
> > Andy, were your email hacked?
> > I explained several times that in this proposal
> > CAP_BPF _and_ CAP_TRACING _both_ are necessary to read kernel memory.
> > CAP_BPF alone is _not enough_.
> 
> You have indeed said this many times.  You've stated it as a matter of
> fact as though it cannot possibly discussed.  I'm asking you to
> justify it.

That's not how I see it.
I kept stating that both CAP_BPF and CAP_TRACING are necessary to read
kernel memory whereas you kept distorting my statement by dropping second
part and then making claims that "CAP_BPF grant the ability to read
kernel memory, you will make administrators much more nervous".

Just s/CAP_BPF/CAP_BPF and CAP_TRACING/ in this above sentence.
See that meaning suddenly changes?
Now administrators would be worried about tasks that have both at once.
They also would be worried about tasks that have CAP_TRACING alone,
because that's what allows probe_kernel_read().

> It seems like you are specifically trying to add a new switch to turn
> as much of BPF as possible on and off.  Why?

Didn't I explain it several times already with multiple examples
from systemd, daemons, bpftrace ?

Let's try again.
Take your laptop with linux distro.
You're the only user there. I'm assuming you're not sharing it with
partner and kids. This is my definition of 'single user system'.
You can sudo on it at any time, but obviously prefer to run as many
apps as possible without cap_sys_admin.
Now you found some awesome open source app on the web that monitors
the health of the kernel and will pop a nice message on a screen if
something is wrong. Currently this app needs root. You hesitate,
but the apps is so useful and it has strong upstream code review process
that you keep running it 24/7.
This is open source app. New versions come. You upgrade.
You have enough trust in that app that you keep running it as root.
But there is always a chance that new version doing accidentaly
something stupid as 'kill -9 -1'. It's an open source app at the end.

Now I come with this CAP* proposal to make this app safer.
I'm not making your system more secure and not making this app
more secure. I can only make your laptop safer for day to day work
by limiting the operations this app can do.
This particular app monitros the kernel via bpf and tracing.
Hence you can give it CAP_TRACING and CAP_BPF and drop the rest.

> > speaking of MDS... I already asked you to help investigate its
> > applicability with existing bpf exposure. Are you going to do that?
> 
> I am blissfully uninvolved in MDS, and I don't know all that much more
> about the overall mechanism than a random reader of tech news :)  ISTM
> there are two meaningful ways that BPF could be involved: a BPF
> program could leak info into the state exposed by MDS, or a BPF
> program could try to read that state.  From what little I understand,
> it's essentially inevitable that BPF leaks information into MDS state,
> and this is probably even controllable by an attacker that understands
> MDS in enough detail.    So the interesting questions are: can BPF be
> used to read MDS state and can BPF be used to leak information in a
> more useful way than the rest of the kernel to an attacker.

agree. that's exactly the question to ask.

> Keeping in mind that the kernel will flush MDS state on every exit to
> usermode, I think the most likely attack is to try to read MDS state
> with BPF.  This could happen, I suppose -- BPF programs can easily
> contain the usual speculation gadgets of "do something and read an
> address that depends on the outcome".  Fortunately, outside of
> bpf_probe_read(), AFAIK BPF programs can't directly touch user memory,
> and an attacker that is allowed to use bpf_probe_read() doesn't need
> MDS to read things.

true as well.
So what do we do with that sentence in Documentation/x86/mds.rst?
Nothing?
New hw bugs will keep coming.
All of them should get similar wording?
Your understanding of MDS and BPF is way above the average.
What other users suppose to do when they read such sentence?
I think they have no choice but to do kernel.unprivileged_bpf_disabled=1.
We, as a kernel community, are forcing the users into it.
Hence I really do not see a value in any proposal today that expands
unprivileged bpf usage.
Since kernel.unprivileged_bpf_disabled=1 all bpf is under cap_sys_admin.
It's not great from security and safety pov. Hence this CAP* proposal.

> > Please take a look at Jann's var1 exploit. Was it hard to run bpf prog
> > in controlled environment without test_run command ?
> >
> 
> Can you send me a link?

https://bugs.chromium.org/p/project-zero/issues/detail?id=1272
writeup_files.tar:kernel_leak_exploit_amd_pro_a8_9600_r7/bpf_stuff.c
Execution is as trivial as write(sockfd, "X", 1) line 405.


^ permalink raw reply

* Re: [PATCH net] net: sched: act_sample: fix psample group handling on overwrite
From: David Miller @ 2019-08-28 22:54 UTC (permalink / raw)
  To: vladbu; +Cc: netdev, jhs, xiyou.wangcong, jiri, dcaratti
In-Reply-To: <20190827184938.1824-1-vladbu@mellanox.com>

From: Vlad Buslov <vladbu@mellanox.com>
Date: Tue, 27 Aug 2019 21:49:38 +0300

> Action sample doesn't properly handle psample_group pointer in overwrite
> case. Following issues need to be fixed:
> 
> - In tcf_sample_init() function RCU_INIT_POINTER() is used to set
>   s->psample_group, even though we neither setting the pointer to NULL, nor
>   preventing concurrent readers from accessing the pointer in some way.
>   Use rcu_swap_protected() instead to safely reset the pointer.
> 
> - Old value of s->psample_group is not released or deallocated in any way,
>   which results resource leak. Use psample_group_put() on non-NULL value
>   obtained with rcu_swap_protected().
> 
> - The function psample_group_put() that released reference to struct
>   psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu
>   grace period when deallocating it. Extend struct psample_group with rcu
>   head and use kfree_rcu when freeing it.
> 
> Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action")
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Applied and queued up for -stable.

^ permalink raw reply

* [v1] iproute2: police: support 64bit rate and peakrate in tc utility
From: David Dai @ 2019-08-28 22:52 UTC (permalink / raw)
  To: jhs, xiyou.wangcong, jiri, netdev, linux-kernel; +Cc: zdai, zdai

For high speed adapter like Mellanox CX-5 card, it can reach upto
100 Gbits per second bandwidth. Currently htb already supports 64bit rate
in tc utility. However police action rate and peakrate are still limited
to 32bit value (upto 32 Gbits per second). Taking advantage of the 2 new
attributes TCA_POLICE_RATE64 and TCA_POLICE_PEAKRATE64 from kernel,
tc can use them to break the 32bit limit, and still keep the backward 
binary compatibility.

Tested-by: David Dai <zdai@linux.vnet.ibm.com>
Signed-off-by: David Dai <zdai@linux.vnet.ibm.com>
---
 include/uapi/linux/pkt_cls.h |    2 +
 tc/m_police.c                |   64 +++++++++++++++++++++++++++--------------
 tc/tc_core.c                 |   29 +++++++++++++++++++
 tc/tc_core.h                 |    3 ++
 4 files changed, 76 insertions(+), 22 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index b057aee..eb4ea4d 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -159,6 +159,8 @@ enum {
 	TCA_POLICE_AVRATE,
 	TCA_POLICE_RESULT,
 	TCA_POLICE_TM,
+	TCA_POLICE_RATE64,
+	TCA_POLICE_PEAKRATE64,
 	TCA_POLICE_PAD,
 	__TCA_POLICE_MAX
 #define TCA_POLICE_RESULT TCA_POLICE_RESULT
diff --git a/tc/m_police.c b/tc/m_police.c
index 862a39f..abdbcce 100644
--- a/tc/m_police.c
+++ b/tc/m_police.c
@@ -71,6 +71,7 @@ static int act_parse_police(struct action_util *a, int *argc_p, char ***argv_p,
 	unsigned int linklayer = LINKLAYER_ETHERNET; /* Assume ethernet */
 	int Rcell_log =  -1, Pcell_log = -1;
 	struct rtattr *tail;
+	__u64 rate64 = 0, prate64 = 0;
 
 	if (a) /* new way of doing things */
 		NEXT_ARG();
@@ -121,11 +122,11 @@ static int act_parse_police(struct action_util *a, int *argc_p, char ***argv_p,
 			}
 		} else if (strcmp(*argv, "rate") == 0) {
 			NEXT_ARG();
-			if (p.rate.rate) {
+			if (rate64) {
 				fprintf(stderr, "Double \"rate\" spec\n");
 				return -1;
 			}
-			if (get_rate(&p.rate.rate, *argv)) {
+			if (get_rate64(&rate64, *argv)) {
 				explain1("rate");
 				return -1;
 			}
@@ -141,11 +142,11 @@ static int act_parse_police(struct action_util *a, int *argc_p, char ***argv_p,
 			}
 		} else if (matches(*argv, "peakrate") == 0) {
 			NEXT_ARG();
-			if (p.peakrate.rate) {
+			if (prate64) {
 				fprintf(stderr, "Double \"peakrate\" spec\n");
 				return -1;
 			}
-			if (get_rate(&p.peakrate.rate, *argv)) {
+			if (get_rate64(&prate64, *argv)) {
 				explain1("peakrate");
 				return -1;
 			}
@@ -189,23 +190,23 @@ action_ctrl_ok:
 	if (!ok)
 		return -1;
 
-	if (p.rate.rate && avrate)
+	if (rate64 && avrate)
 		return -1;
 
 	/* Must at least do late binding, use TB or ewma policing */
-	if (!p.rate.rate && !avrate && !p.index) {
+	if (!rate64 && !avrate && !p.index) {
 		fprintf(stderr, "\"rate\" or \"avrate\" MUST be specified.\n");
 		return -1;
 	}
 
 	/* When the TB policer is used, burst is required */
-	if (p.rate.rate && !buffer && !avrate) {
+	if (rate64 && !buffer && !avrate) {
 		fprintf(stderr, "\"burst\" requires \"rate\".\n");
 		return -1;
 	}
 
-	if (p.peakrate.rate) {
-		if (!p.rate.rate) {
+	if (prate64) {
+		if (!rate64) {
 			fprintf(stderr, "\"peakrate\" requires \"rate\".\n");
 			return -1;
 		}
@@ -215,22 +216,24 @@ action_ctrl_ok:
 		}
 	}
 
-	if (p.rate.rate) {
+	if (rate64) {
+		p.rate.rate = (rate64 >= (1ULL << 32)) ? ~0U : rate64;
 		p.rate.mpu = mpu;
 		p.rate.overhead = overhead;
-		if (tc_calc_rtable(&p.rate, rtab, Rcell_log, mtu,
-				   linklayer) < 0) {
+		if (tc_calc_rtable_64(&p.rate, rtab, Rcell_log, mtu,
+				   linklayer, rate64) < 0) {
 			fprintf(stderr, "POLICE: failed to calculate rate table.\n");
 			return -1;
 		}
-		p.burst = tc_calc_xmittime(p.rate.rate, buffer);
+		p.burst = tc_calc_xmittime(rate64, buffer);
 	}
 	p.mtu = mtu;
-	if (p.peakrate.rate) {
+	if (prate64) {
+		p.peakrate.rate = (prate64 >= (1ULL << 32)) ? ~0U : prate64;
 		p.peakrate.mpu = mpu;
 		p.peakrate.overhead = overhead;
-		if (tc_calc_rtable(&p.peakrate, ptab, Pcell_log, mtu,
-				   linklayer) < 0) {
+		if (tc_calc_rtable_64(&p.peakrate, ptab, Pcell_log, mtu,
+				   linklayer, prate64) < 0) {
 			fprintf(stderr, "POLICE: failed to calculate peak rate table.\n");
 			return -1;
 		}
@@ -238,10 +241,16 @@ action_ctrl_ok:
 
 	tail = addattr_nest(n, MAX_MSG, tca_id);
 	addattr_l(n, MAX_MSG, TCA_POLICE_TBF, &p, sizeof(p));
-	if (p.rate.rate)
+	if (rate64) {
 		addattr_l(n, MAX_MSG, TCA_POLICE_RATE, rtab, 1024);
-	if (p.peakrate.rate)
+		if (rate64 >= (1ULL << 32))
+			addattr64(n, MAX_MSG, TCA_POLICE_RATE64, rate64);
+	}
+	if (prate64) {
 		addattr_l(n, MAX_MSG, TCA_POLICE_PEAKRATE, ptab, 1024);
+		if (prate64 >= (1ULL << 32))
+			addattr64(n, MAX_MSG, TCA_POLICE_PEAKRATE64, prate64);
+	}
 	if (avrate)
 		addattr32(n, MAX_MSG, TCA_POLICE_AVRATE, avrate);
 	if (presult)
@@ -268,6 +277,7 @@ static int print_police(struct action_util *a, FILE *f, struct rtattr *arg)
 	struct rtattr *tb[TCA_POLICE_MAX+1];
 	unsigned int buffer;
 	unsigned int linklayer;
+	__u64 rate64, prate64;
 
 	if (arg == NULL)
 		return 0;
@@ -286,16 +296,26 @@ static int print_police(struct action_util *a, FILE *f, struct rtattr *arg)
 #endif
 	p = RTA_DATA(tb[TCA_POLICE_TBF]);
 
+	rate64 = p->rate.rate;
+	if (tb[TCA_POLICE_RATE64] &&
+	    RTA_PAYLOAD(tb[TCA_POLICE_RATE64]) >= sizeof(rate64))
+		rate64 = rta_getattr_u64(tb[TCA_POLICE_RATE64]);
+
 	fprintf(f, " police 0x%x ", p->index);
-	fprintf(f, "rate %s ", sprint_rate(p->rate.rate, b1));
-	buffer = tc_calc_xmitsize(p->rate.rate, p->burst);
+	fprintf(f, "rate %s ", sprint_rate(rate64, b1));
+	buffer = tc_calc_xmitsize(rate64, p->burst);
 	fprintf(f, "burst %s ", sprint_size(buffer, b1));
 	fprintf(f, "mtu %s ", sprint_size(p->mtu, b1));
 	if (show_raw)
 		fprintf(f, "[%08x] ", p->burst);
 
-	if (p->peakrate.rate)
-		fprintf(f, "peakrate %s ", sprint_rate(p->peakrate.rate, b1));
+	prate64 = p->peakrate.rate;
+	if (tb[TCA_POLICE_PEAKRATE64] &&
+	    RTA_PAYLOAD(tb[TCA_POLICE_PEAKRATE64]) >= sizeof(prate64))
+		prate64 = rta_getattr_u64(tb[TCA_POLICE_PEAKRATE64]);
+
+	if (prate64)
+		fprintf(f, "peakrate %s ", sprint_rate(prate64, b1));
 
 	if (tb[TCA_POLICE_AVRATE])
 		fprintf(f, "avrate %s ",
diff --git a/tc/tc_core.c b/tc/tc_core.c
index 8eb1122..498d35d 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -152,6 +152,35 @@ int tc_calc_rtable(struct tc_ratespec *r, __u32 *rtab,
 	return cell_log;
 }
 
+int tc_calc_rtable_64(struct tc_ratespec *r, __u32 *rtab,
+		   int cell_log, unsigned int mtu,
+		   enum link_layer linklayer, __u64 rate)
+{
+	int i;
+	unsigned int sz;
+	__u64 bps = rate;
+	unsigned int mpu = r->mpu;
+
+	if (mtu == 0)
+		mtu = 2047;
+
+	if (cell_log < 0) {
+		cell_log = 0;
+		while ((mtu >> cell_log) > 255)
+			cell_log++;
+	}
+
+	for (i = 0; i < 256; i++) {
+		sz = tc_adjust_size((i + 1) << cell_log, mpu, linklayer);
+		rtab[i] = tc_calc_xmittime(bps, sz);
+	}
+
+	r->cell_align =  -1;
+	r->cell_log = cell_log;
+	r->linklayer = (linklayer & TC_LINKLAYER_MASK);
+	return cell_log;
+}
+
 /*
    stab[pkt_len>>cell_log] = pkt_xmit_size>>size_log
  */
diff --git a/tc/tc_core.h b/tc/tc_core.h
index bd4a99f..40520e7 100644
--- a/tc/tc_core.h
+++ b/tc/tc_core.h
@@ -21,6 +21,9 @@ unsigned tc_calc_xmittime(__u64 rate, unsigned size);
 unsigned tc_calc_xmitsize(__u64 rate, unsigned ticks);
 int tc_calc_rtable(struct tc_ratespec *r, __u32 *rtab,
 		   int cell_log, unsigned mtu, enum link_layer link_layer);
+int tc_calc_rtable_64(struct tc_ratespec *r, __u32 *rtab,
+			int cell_log, unsigned mtu, enum link_layer link_layer,
+			__u64 rate);
 int tc_calc_size_table(struct tc_sizespec *s, __u16 **stab);
 
 int tc_setup_estimator(unsigned A, unsigned time_const, struct tc_estimator *est);
-- 
1.7.1


^ permalink raw reply related

* [v1] net_sched: act_police: add 2 new attributes to support police 64bit rate and peakrate
From: David Dai @ 2019-08-28 22:51 UTC (permalink / raw)
  To: jhs, xiyou.wangcong, jiri, davem, netdev, linux-kernel; +Cc: zdai, zdai

For high speed adapter like Mellanox CX-5 card, it can reach upto
100 Gbits per second bandwidth. Currently htb already supports 64bit rate
in tc utility. However police action rate and peakrate are still limited
to 32bit value (upto 32 Gbits per second). Add 2 new attributes
TCA_POLICE_RATE64 and TCA_POLICE_RATE64 in kernel for 64bit support
so that tc utility can use them for 64bit rate and peakrate value to
break the 32bit limit, and still keep the backward binary compatibility.

Tested-by: David Dai <zdai@linux.vnet.ibm.com>
Signed-off-by: David Dai <zdai@linux.vnet.ibm.com>
---
 include/uapi/linux/pkt_cls.h |    2 ++
 net/sched/act_police.c       |   27 +++++++++++++++++++++++----
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index b057aee..eb4ea4d 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -159,6 +159,8 @@ enum {
 	TCA_POLICE_AVRATE,
 	TCA_POLICE_RESULT,
 	TCA_POLICE_TM,
+	TCA_POLICE_RATE64,
+	TCA_POLICE_PEAKRATE64,
 	TCA_POLICE_PAD,
 	__TCA_POLICE_MAX
 #define TCA_POLICE_RESULT TCA_POLICE_RESULT
diff --git a/net/sched/act_police.c b/net/sched/act_police.c
index 49cec3e..ed5372e 100644
--- a/net/sched/act_police.c
+++ b/net/sched/act_police.c
@@ -40,6 +40,8 @@ static int tcf_police_walker(struct net *net, struct sk_buff *skb,
 	[TCA_POLICE_PEAKRATE]	= { .len = TC_RTAB_SIZE },
 	[TCA_POLICE_AVRATE]	= { .type = NLA_U32 },
 	[TCA_POLICE_RESULT]	= { .type = NLA_U32 },
+	[TCA_POLICE_RATE64]     = { .type = NLA_U64 },
+	[TCA_POLICE_PEAKRATE64] = { .type = NLA_U64 },
 };
 
 static int tcf_police_init(struct net *net, struct nlattr *nla,
@@ -58,6 +60,7 @@ static int tcf_police_init(struct net *net, struct nlattr *nla,
 	struct tcf_police_params *new;
 	bool exists = false;
 	u32 index;
+	u64 rate64, prate64;
 
 	if (nla == NULL)
 		return -EINVAL;
@@ -155,14 +158,18 @@ static int tcf_police_init(struct net *net, struct nlattr *nla,
 	}
 	if (R_tab) {
 		new->rate_present = true;
-		psched_ratecfg_precompute(&new->rate, &R_tab->rate, 0);
+		rate64 = tb[TCA_POLICE_RATE64] ?
+			 nla_get_u64(tb[TCA_POLICE_RATE64]) : 0;
+		psched_ratecfg_precompute(&new->rate, &R_tab->rate, rate64);
 		qdisc_put_rtab(R_tab);
 	} else {
 		new->rate_present = false;
 	}
 	if (P_tab) {
 		new->peak_present = true;
-		psched_ratecfg_precompute(&new->peak, &P_tab->rate, 0);
+		prate64 = tb[TCA_POLICE_PEAKRATE64] ?
+			  nla_get_u64(tb[TCA_POLICE_PEAKRATE64]) : 0;
+		psched_ratecfg_precompute(&new->peak, &P_tab->rate, prate64);
 		qdisc_put_rtab(P_tab);
 	} else {
 		new->peak_present = false;
@@ -313,10 +320,22 @@ static int tcf_police_dump(struct sk_buff *skb, struct tc_action *a,
 				      lockdep_is_held(&police->tcf_lock));
 	opt.mtu = p->tcfp_mtu;
 	opt.burst = PSCHED_NS2TICKS(p->tcfp_burst);
-	if (p->rate_present)
+	if (p->rate_present) {
 		psched_ratecfg_getrate(&opt.rate, &p->rate);
-	if (p->peak_present)
+		if ((police->params->rate.rate_bytes_ps >= (1ULL << 32)) &&
+		    nla_put_u64_64bit(skb, TCA_POLICE_RATE64,
+				      police->params->rate.rate_bytes_ps,
+				      TCA_POLICE_PAD))
+			goto nla_put_failure;
+	}
+	if (p->peak_present) {
 		psched_ratecfg_getrate(&opt.peakrate, &p->peak);
+		if ((police->params->peak.rate_bytes_ps >= (1ULL << 32)) &&
+		    nla_put_u64_64bit(skb, TCA_POLICE_PEAKRATE64,
+				      police->params->peak.rate_bytes_ps,
+				      TCA_POLICE_PAD))
+			goto nla_put_failure;
+	}
 	if (nla_put(skb, TCA_POLICE_TBF, sizeof(opt), &opt))
 		goto nla_put_failure;
 	if (p->tcfp_result &&
-- 
1.7.1


^ permalink raw reply related

* Re: [net-next 00/15][pull request] 100GbE Intel Wired LAN Driver Updates 2019-08-26
From: David Miller @ 2019-08-28 22:47 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20190827163832.8362-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 27 Aug 2019 09:38:17 -0700

> This series contains updates to ice driver only.

Pulled, but I have to agree with Jakub that using CPP ifdefs to control the
"static"'ness of a function is pushing the boundaries of good taste at best.

^ permalink raw reply

* Re: [PATCH net] ibmvnic: Do not process reset during or after device removal
From: David Miller @ 2019-08-28 22:46 UTC (permalink / raw)
  To: tlfalcon; +Cc: netdev, linuxppc-dev
In-Reply-To: <1566922204-8770-1-git-send-email-tlfalcon@linux.ibm.com>

From: Thomas Falcon <tlfalcon@linux.ibm.com>
Date: Tue, 27 Aug 2019 11:10:04 -0500

> Currently, the ibmvnic driver will not schedule device resets
> if the device is being removed, but does not check the device
> state before the reset is actually processed. This leads to a race
> where a reset is scheduled with a valid device state but is
> processed after the driver has been removed, resulting in an oops.
> 
> Fix this by checking the device state before processing a queued
> reset event.
> 
> Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
> Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
> Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>

Applied.

^ permalink raw reply

* Re: [net-next 13/15] ixgbe: sync the first fragment unconditionally
From: Jakub Kicinski @ 2019-08-28 22:40 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: davem, Firo Yang, netdev, nhorman, sassmann, Alexander Duyck,
	Andrew Bowers
In-Reply-To: <20190828064407.30168-14-jeffrey.t.kirsher@intel.com>

On Tue, 27 Aug 2019 23:44:05 -0700, Jeff Kirsher wrote:
> Fixes: f3213d932173 ("ixgbe: Update driver to make use of DMA
> attributes in Rx path")

wrapped tag

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox