Netdev List
 help / color / mirror / Atom feed
* [PATCH] restore behaviour of CAP_SYS_ADMIN allowing the loading of net bpf program
From: Maciej Żenczykowski @ 2020-06-18 19:59 UTC (permalink / raw)
  To: Maciej Żenczykowski, Alexei Starovoitov, Daniel Borkmann
  Cc: Linux Network Development Mailing List, Linux Kernel Mailing List,
	BPF Mailing List, David S . Miller
In-Reply-To: <CAHo-OoyU5OHQuqpTEo-uAQcwcLpzkXezFY6Re-Hv6jGM9aSFSA@mail.gmail.com>

From: Maciej Żenczykowski <maze@google.com>

This is a 5.8-rc1 regression.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
---
 kernel/bpf/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8da159936bab..7d946435587d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2121,7 +2121,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
 	    !bpf_capable())
 		return -EPERM;
 
-	if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN))
+	if (is_net_admin_prog_type(type) && !capable(CAP_NET_ADMIN) && !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 	if (is_perfmon_prog_type(type) && !perfmon_capable())
 		return -EPERM;
-- 
2.27.0.290.gba653c62da-goog


^ permalink raw reply related

* Re: [Patch net] net: change addr_list_lock back to static key
From: Cong Wang @ 2020-06-18 19:56 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, syzbot+f3a0e80c34b3fc28ac5e, Taehee Yoo, Dmitry Vyukov
In-Reply-To: <CA+h21hr_epEqWukZMQmZ2ecS9Y0yvX9mzR3g3OA39rg_96FfnQ@mail.gmail.com>

On Thu, Jun 18, 2020 at 12:40 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> It's me with the stacked DSA devices again:

It looks like DSA never uses netdev API to link master
device with slave devices? If so, their dev->lower_level
are always 1, therefore triggers this warning.

I think it should call one of these netdev_upper_dev_link()
API's when creating a slave device.

Thanks.

^ permalink raw reply

* RE: [Intel-wired-lan] [PATCH net] i40e: fix crash when Rx descriptor count is changed
From: Bowers, AndrewX @ 2020-06-18 19:46 UTC (permalink / raw)
  To: intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, bpf@vger.kernel.org
In-Reply-To: <20200612114731.144630-1-bjorn.topel@gmail.com>

> -----Original Message-----
> From: Intel-wired-lan <intel-wired-lan-bounces@osuosl.org> On Behalf Of
> Björn Töpel
> Sent: Friday, June 12, 2020 4:48 AM
> To: intel-wired-lan@lists.osuosl.org
> Cc: netdev@vger.kernel.org; bpf@vger.kernel.org; Topel, Bjorn
> <bjorn.topel@intel.com>; Karlsson, Magnus <magnus.karlsson@intel.com>
> Subject: [Intel-wired-lan] [PATCH net] i40e: fix crash when Rx descriptor
> count is changed
> 
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> When the AF_XDP buffer allocator was introduced, the Rx SW ring "rx_bi"
> allocation was moved from i40e_setup_rx_descriptors() function, and was
> instead done in the i40e_configure_rx_ring() function.
> 
> This broke the ethtool set_ringparam() hook for changing the Rx descriptor
> count, which was relying on i40e_setup_rx_descriptors() to handle the
> alloction.
> 
> Fix this by adding an explicit i40e_alloc_rx_bi() call to i40e_set_ringparam().
> 
> Fixes: be1222b585fd ("i40e: Separate kernel allocated rx_bi rings from
> AF_XDP rings")
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 3 +++
>  1 file changed, 3 insertions(+)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>




^ permalink raw reply

* Re: [PATCH] net/9p: Fix sparse rcu warnings in client.c
From: Alexander Kapshuk @ 2020-06-18 19:40 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: ericvh, lucho, davem, kuba, v9fs-developer, netdev, linux-kernel
In-Reply-To: <20200618190807.GA20699@nautica>

On Thu, Jun 18, 2020 at 10:08 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Alexander Kapshuk wrote on Thu, Jun 18, 2020:
> > Address sparse nonderef rcu warnings:
> > net/9p/client.c:790:17: warning: incorrect type in argument 1 (different address spaces)
> > net/9p/client.c:790:17:    expected struct spinlock [usertype] *lock
> > net/9p/client.c:790:17:    got struct spinlock [noderef] <asn:4> *
> > net/9p/client.c:792:48: warning: incorrect type in argument 1 (different address spaces)
> > net/9p/client.c:792:48:    expected struct spinlock [usertype] *lock
> > net/9p/client.c:792:48:    got struct spinlock [noderef] <asn:4> *
> > net/9p/client.c:872:17: warning: incorrect type in argument 1 (different address spaces)
> > net/9p/client.c:872:17:    expected struct spinlock [usertype] *lock
> > net/9p/client.c:872:17:    got struct spinlock [noderef] <asn:4> *
> > net/9p/client.c:874:48: warning: incorrect type in argument 1 (different address spaces)
> > net/9p/client.c:874:48:    expected struct spinlock [usertype] *lock
> > net/9p/client.c:874:48:    got struct spinlock [noderef] <asn:4> *
> >
> > Signed-off-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>
>
> Thanks for this patch.
> From what I can see, there are tons of other parts of the code doing the
> same noderef access pattern to access current->sighand->siglock and I
> don't see much doing that.
> A couple of users justify this by saying SLAB_TYPESAFE_BY_RCU ensures
> we'll always get a usable lock which won't be reinitialized however we
> access it... It's a bit dubious we'll get the same lock than unlock to
> me, so I agree to some change though.
>
> After a second look I think we should use something like the following:
>
> if (!lock_task_sighand(current, &flags))
>         warn & skip (or some error, we'd null deref if this happened currently);
> recalc_sigpending();
> unlock_task_sighand(current, &flags);
>
> As you can see, the rcu_read_lock() isn't kept until the unlock so I'm
> not sure it will be enough to please sparse, but I've convinced myself
> current->sighand cannot change while we hold the lock and there just are
> too many such patterns in the kernel.
>
> Please let me know if I missed something or if there is an ongoing
> effort to change how this works; I'll wait for a v2.
>
> --
> Dominique

Thanks for your prompt response.
I too made the same observation of the numerous patterns in the kernel
where current->sighand is accessed without being rcu_dereference()'d.
For this patch I used kernel/signal.c:1368,1398: __lock_task_sighand()
as an example.

I will give your suggestion a careful consideration and will get back
to you soon.
Thanks.

^ permalink raw reply

* Re: [Patch net] net: change addr_list_lock back to static key
From: Vladimir Oltean @ 2020-06-18 19:40 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, syzbot+f3a0e80c34b3fc28ac5e, Taehee Yoo, Dmitry Vyukov
In-Reply-To: <20200608215301.26772-1-xiyou.wangcong@gmail.com>

Hi Cong, Taehee,

On Tue, 9 Jun 2020 at 00:54, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> The dynamic key update for addr_list_lock still causes troubles,
> for example the following race condition still exists:
>
> CPU 0:                          CPU 1:
> (RCU read lock)                 (RTNL lock)
> dev_mc_seq_show()               netdev_update_lockdep_key()
>                                   -> lockdep_unregister_key()
>  -> netif_addr_lock_bh()
>
> because lockdep doesn't provide an API to update it atomically.
> Therefore, we have to move it back to static keys and use subclass
> for nest locking like before.
>
> In commit 1a33e10e4a95 ("net: partially revert dynamic lockdep key
> changes"), I already reverted most parts of commit ab92d68fc22f
> ("net: core: add generic lockdep keys").
>
> This patch reverts the rest and also part of commit f3b0a18bb6cb
> ("net: remove unnecessary variables and callback"). After this
> patch, addr_list_lock changes back to using static keys and
> subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
> not have to change back to ->ndo_get_lock_subclass().
>
> And hopefully this reduces some syzbot lockdep noises too.
>
> Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
> Cc: Taehee Yoo <ap420073@gmail.com>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  drivers/net/bonding/bond_main.c               |  2 --
>  drivers/net/bonding/bond_options.c            |  2 --
>  drivers/net/hamradio/bpqether.c               |  2 ++
>  drivers/net/macsec.c                          |  5 ++++
>  drivers/net/macvlan.c                         | 13 ++++++--
>  drivers/net/vxlan.c                           |  4 +--
>  .../net/wireless/intersil/hostap/hostap_hw.c  |  3 ++
>  include/linux/netdevice.h                     | 12 +++++---
>  net/8021q/vlan_dev.c                          |  8 +++--
>  net/batman-adv/soft-interface.c               |  2 ++
>  net/bridge/br_device.c                        |  8 +++++
>  net/core/dev.c                                | 30 ++++++++++---------
>  net/core/dev_addr_lists.c                     | 12 ++++----
>  net/core/rtnetlink.c                          |  1 -
>  net/dsa/master.c                              |  4 +++
>  net/netrom/af_netrom.c                        |  2 ++
>  net/rose/af_rose.c                            |  2 ++
>  17 files changed, 76 insertions(+), 36 deletions(-)
>

It's me with the stacked DSA devices again:

[   11.424642] ============================================
[   11.429967] WARNING: possible recursive locking detected
[   11.435295] 5.8.0-rc1-00133-g923e4b5032dd-dirty #208 Not tainted
[   11.441319] --------------------------------------------
[   11.446646] dhcpcd/323 is trying to acquire lock:
[   11.451362] ffff000066dd4268
(&dsa_master_addr_list_lock_key/1){+...}-{2:2}, at:
dev_mc_sync+0x44/0x90
[   11.460713]
[   11.460713] but task is already holding lock:
[   11.466561] ffff00006608c268
(&dsa_master_addr_list_lock_key/1){+...}-{2:2}, at:
dev_mc_sync+0x44/0x90
[   11.475905]
[   11.475905] other info that might help us debug this:
[   11.482450]  Possible unsafe locking scenario:
[   11.482450]
[   11.488386]        CPU0
[   11.490833]        ----
[   11.493280]   lock(&dsa_master_addr_list_lock_key/1);
[   11.498347]   lock(&dsa_master_addr_list_lock_key/1);
[   11.503413]
[   11.503413]  *** DEADLOCK ***
[   11.503413]
[   11.509349]  May be due to missing lock nesting notation
[   11.509349]
[   11.516158] 3 locks held by dhcpcd/323:
[   11.520001]  #0: ffffdbd1381dda18 (rtnl_mutex){+.+.}-{3:3}, at:
rtnl_lock+0x24/0x30
[   11.527688]  #1: ffff00006614b268 (_xmit_ETHER){+...}-{2:2}, at:
dev_set_rx_mode+0x28/0x48
[   11.535987]  #2: ffff00006608c268
(&dsa_master_addr_list_lock_key/1){+...}-{2:2}, at:
dev_mc_sync+0x44/0x90
[   11.545766]
[   11.545766] stack backtrace:
[   11.564098] Call trace:
[   11.566549]  dump_backtrace+0x0/0x1e0
[   11.570220]  show_stack+0x20/0x30
[   11.573544]  dump_stack+0xec/0x158
[   11.576955]  __lock_acquire+0xca0/0x2398
[   11.580886]  lock_acquire+0xe8/0x440
[   11.584469]  _raw_spin_lock_nested+0x64/0x90
[   11.588749]  dev_mc_sync+0x44/0x90
[   11.592159]  dsa_slave_set_rx_mode+0x34/0x50
[   11.596438]  __dev_set_rx_mode+0x60/0xa0
[   11.600369]  dev_mc_sync+0x84/0x90
[   11.603778]  dsa_slave_set_rx_mode+0x34/0x50
[   11.608057]  __dev_set_rx_mode+0x60/0xa0
[   11.611989]  dev_set_rx_mode+0x30/0x48
[   11.615745]  __dev_open+0x10c/0x180
[   11.619240]  __dev_change_flags+0x170/0x1c8
[   11.623432]  dev_change_flags+0x2c/0x70
[   11.627279]  devinet_ioctl+0x774/0x878
[   11.631036]  inet_ioctl+0x348/0x3b0
[   11.634532]  sock_do_ioctl+0x50/0x310
[   11.638202]  sock_ioctl+0x1f8/0x580
[   11.641698]  ksys_ioctl+0xb0/0xf0
[   11.645019]  __arm64_sys_ioctl+0x28/0x38
[   11.648951]  el0_svc_common.constprop.0+0x7c/0x180
[   11.653753]  do_el0_svc+0x2c/0x98
[   11.657075]  el0_sync_handler+0x9c/0x1b8
[   11.661005]  el0_sync+0x158/0x180

Could you please share some suggestions?

Thanks,
-Vladimir

^ permalink raw reply

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Roman Gushchin @ 2020-06-18 19:36 UTC (permalink / raw)
  To: Cong Wang
  Cc: Zefan Li, Linux Kernel Network Developers, Cameron Berkenpas,
	Peter Geis, Lu Fengqi, Daniël Sonck, Daniel Borkmann,
	Tejun Heo
In-Reply-To: <CAM_iQpUFFHPnMxS2sAHZqMUs80tTn0+C_jCcne4Ddx2b9omCxg@mail.gmail.com>

On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
> >
> > Cc: Roman Gushchin <guro@fb.com>
> >
> > Thanks for fixing this.
> >
> > On 2020/6/17 2:03, Cong Wang wrote:
> > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > even when cgroup_sk_alloc is disabled.
> > >
> > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > would terminate this function if called there. And for sk_alloc()
> > > skcd->val is always zero. So it's safe to factor out the code
> > > to make it more readable.
> > >
> > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
> >
> > but I don't think the bug was introduced by this commit, because there
> > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > write_classid(), which can be triggered by writing to ifpriomap or
> > classid in cgroupfs. This commit just made it much easier to happen
> > with systemd invovled.
> >
> > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> > which added cgroup_bpf_get() in cgroup_sk_alloc().
> 
> Good point.
> 
> I take a deeper look, it looks like commit d979a39d7242e06
> is the one to blame, because it is the first commit that began to
> hold cgroup refcnt in cgroup_sk_alloc().

I agree, ut seems that the issue is not related to bpf and probably
can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
seems closer to the origin.

Btw, based on the number of reported-by tags it seems that there was
a real issue which the patch is fixing. Maybe you'll a couple of words
about how it reveals itself in the real life?

Thanks!

^ permalink raw reply

* Re: [PATCH] net/9p: Fix sparse endian warning in trans_fd.c
From: Alexander Kapshuk @ 2020-06-18 19:27 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: ericvh, lucho, davem, kuba, v9fs-developer, netdev, linux-kernel
In-Reply-To: <20200618190934.GB20699@nautica>

On Thu, Jun 18, 2020 at 10:09 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Alexander Kapshuk wrote on Thu, Jun 18, 2020:
> > Address sparse endian warning:
> > net/9p/trans_fd.c:932:28: warning: incorrect type in assignment (different base types)
> > net/9p/trans_fd.c:932:28:    expected restricted __be32 [addressable] [assigned] [usertype] s_addr
> > net/9p/trans_fd.c:932:28:    got unsigned long
> >
> > Signed-off-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>
>
> INADDR_ANY is 0 so this really is noop but sure, less warnings is always
> good. I'll take this one for 5.9.
> Thanks!
> --
> Dominique

Noted.
Thanks.

^ permalink raw reply

* [PATCH net] ibmvnic: continue to init in CRQ reset returns H_CLOSED
From: Dany Madden @ 2020-06-18 19:24 UTC (permalink / raw)
  To: davem; +Cc: netdev, linuxppc-dev, tlfalcon, Dany Madden

Continue the reset path when partner adapter is not ready or H_CLOSED is
returned from reset crq. This patch allows the CRQ init to proceed to
establish a valid CRQ for traffic to flow after reset.

Signed-off-by: Dany Madden <drt@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 2baf7b3ff4cb..4b7cb483c47f 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1971,13 +1971,18 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 			release_sub_crqs(adapter, 1);
 		} else {
 			rc = ibmvnic_reset_crq(adapter);
-			if (!rc)
+			if (rc == H_CLOSED || rc == H_SUCCESS) {
 				rc = vio_enable_interrupts(adapter->vdev);
+				if (rc)
+					netdev_err(adapter->netdev,
+						   "Reset failed to enable interrupts. rc=%d\n",
+						   rc);
+			}
 		}
 
 		if (rc) {
 			netdev_err(adapter->netdev,
-				   "Couldn't initialize crq. rc=%d\n", rc);
+				   "Reset couldn't initialize crq. rc=%d\n", rc);
 			goto out;
 		}
 
-- 
2.18.2


^ permalink raw reply related

* Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
From: Cong Wang @ 2020-06-18 19:19 UTC (permalink / raw)
  To: Zefan Li
  Cc: Linux Kernel Network Developers, Cameron Berkenpas, Peter Geis,
	Lu Fengqi, Daniël Sonck, Daniel Borkmann, Tejun Heo,
	Roman Gushchin
In-Reply-To: <141629e1-55b5-34b1-b2ab-bab6b68f0671@huawei.com>

On Wed, Jun 17, 2020 at 6:44 PM Zefan Li <lizefan@huawei.com> wrote:
>
> Cc: Roman Gushchin <guro@fb.com>
>
> Thanks for fixing this.
>
> On 2020/6/17 2:03, Cong Wang wrote:
> > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > copied, so the cgroup refcnt must be taken too. And, unlike the
> > sk_alloc() path, sock_update_netprioidx() is not called here.
> > Therefore, it is safe and necessary to grab the cgroup refcnt
> > even when cgroup_sk_alloc is disabled.
> >
> > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > would terminate this function if called there. And for sk_alloc()
> > skcd->val is always zero. So it's safe to factor out the code
> > to make it more readable.
> >
> > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups")
>
> but I don't think the bug was introduced by this commit, because there
> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> write_classid(), which can be triggered by writing to ifpriomap or
> classid in cgroupfs. This commit just made it much easier to happen
> with systemd invovled.
>
> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself"),
> which added cgroup_bpf_get() in cgroup_sk_alloc().

Good point.

I take a deeper look, it looks like commit d979a39d7242e06
is the one to blame, because it is the first commit that began to
hold cgroup refcnt in cgroup_sk_alloc().

The commit you mentioned above merely adds a refcnt for
cgroup bpf on to of cgroup refcnt.

Thanks.

^ permalink raw reply

* Re: [PATCH] net: macb: reject unsupported rgmii delays
From: Russell King - ARM Linux admin @ 2020-06-18 19:10 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Helmut Grohne, Nicolas Ferre, David S. Miller, Jakub Kicinski,
	Palmer Dabbelt, Paul Walmsley, netdev@vger.kernel.org
In-Reply-To: <3eb49188-fb74-99fe-a9d9-7e295d2eaa8b@gmail.com>

On Thu, Jun 18, 2020 at 12:02:13PM -0700, Florian Fainelli wrote:
> It should not, but that means that when you describe the fixed-link, the
> 'phy-mode' within the local fixed-link node is meant to describe how the
> other side is configured not how you are configured. For some reason I
> did not consider that to be intuitive when I sent out my response, but I
> suppose spelling it out would greatly help.

Right, so in the MAC-to-MAC setup that is being discussed in this
thread, the _local_ side MAC where phy-mode is specified should not
care which of the RGMII modes is specified, because the delays are
a function of "the other side" of the link.

So, the proposal that macb should limit itself to only accepting
"rgmii" in it's _local_ phy-mode property because the _local_ MAC
does not support delays is wrong.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH] net/9p: Fix sparse endian warning in trans_fd.c
From: Dominique Martinet @ 2020-06-18 19:09 UTC (permalink / raw)
  To: Alexander Kapshuk
  Cc: ericvh, lucho, davem, kuba, v9fs-developer, netdev, linux-kernel
In-Reply-To: <20200618183417.5423-1-alexander.kapshuk@gmail.com>

Alexander Kapshuk wrote on Thu, Jun 18, 2020:
> Address sparse endian warning:
> net/9p/trans_fd.c:932:28: warning: incorrect type in assignment (different base types)
> net/9p/trans_fd.c:932:28:    expected restricted __be32 [addressable] [assigned] [usertype] s_addr
> net/9p/trans_fd.c:932:28:    got unsigned long
> 
> Signed-off-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>

INADDR_ANY is 0 so this really is noop but sure, less warnings is always
good. I'll take this one for 5.9.
Thanks!
-- 
Dominique

^ permalink raw reply

* RE: [PATCH bpf 2/2] selftests/bpf: add variable-length data concatenation pattern test
From: John Fastabend @ 2020-06-18 19:09 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast, daniel
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Christoph Hellwig
In-Reply-To: <20200616050432.1902042-2-andriin@fb.com>

Andrii Nakryiko wrote:
> Add selftest that validates variable-length data reading and concatentation
> with one big shared data array. This is a common pattern in production use for
> monitoring and tracing applications, that potentially can read a lot of data,
> but usually reads much less. Such pattern allows to determine precisely what
> amount of data needs to be sent over perfbuf/ringbuf and maximize efficiency.
> 
> This is the first BPF selftest that at all looks at and tests
> bpf_probe_read_str()-like helper's return value, closing a major gap in BPF
> testing. It surfaced the problem with bpf_probe_read_kernel_str() returning
> 0 on success, instead of amount of bytes successfully read.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---

[...]

> +/* .data */
> +int payload2_len1 = -1;
> +int payload2_len2 = -1;
> +int total2 = -1;
> +char payload2[MAX_LEN + MAX_LEN] = { 1 };
> +
> +SEC("raw_tp/sys_enter")
> +int handler64(void *regs)
> +{
> +	int pid = bpf_get_current_pid_tgid() >> 32;
> +	void *payload = payload1;
> +	u64 len;
> +
> +	/* ignore irrelevant invocations */
> +	if (test_pid != pid || !capture)
> +		return 0;
> +
> +	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in1[0]);
> +	if (len <= MAX_LEN) {

Took me a bit grok this. You are relying on the fact that in errors,
such as a page fault, will encode to a large u64 value and so you
verifier is happy. But most of my programs actually want to distinguish
between legitimate errors on the probe vs buffer overrun cases.

Can we make these tests do explicit check for errors. For example,

  if (len < 0) goto abort;

But this also breaks your types here. This is what I was trying to
point out in the 1/2 patch thread. Wanted to make the point here as
well in case it wasn't clear. Not sure I did the best job explaining.

> +		payload += len;
> +		payload1_len1 = len;
> +	}
> +
> +	len = bpf_probe_read_kernel_str(payload, MAX_LEN, &buf_in2[0]);
> +	if (len <= MAX_LEN) {
> +		payload += len;
> +		payload1_len2 = len;
> +	}
> +
> +	total1 = payload - (void *)payload1;
> +
> +	return 0;
> +}

^ permalink raw reply

* Re: [PATCH] net/9p: Fix sparse rcu warnings in client.c
From: Dominique Martinet @ 2020-06-18 19:08 UTC (permalink / raw)
  To: Alexander Kapshuk
  Cc: ericvh, lucho, davem, kuba, v9fs-developer, netdev, linux-kernel
In-Reply-To: <20200618183310.5352-1-alexander.kapshuk@gmail.com>

Alexander Kapshuk wrote on Thu, Jun 18, 2020:
> Address sparse nonderef rcu warnings:
> net/9p/client.c:790:17: warning: incorrect type in argument 1 (different address spaces)
> net/9p/client.c:790:17:    expected struct spinlock [usertype] *lock
> net/9p/client.c:790:17:    got struct spinlock [noderef] <asn:4> *
> net/9p/client.c:792:48: warning: incorrect type in argument 1 (different address spaces)
> net/9p/client.c:792:48:    expected struct spinlock [usertype] *lock
> net/9p/client.c:792:48:    got struct spinlock [noderef] <asn:4> *
> net/9p/client.c:872:17: warning: incorrect type in argument 1 (different address spaces)
> net/9p/client.c:872:17:    expected struct spinlock [usertype] *lock
> net/9p/client.c:872:17:    got struct spinlock [noderef] <asn:4> *
> net/9p/client.c:874:48: warning: incorrect type in argument 1 (different address spaces)
> net/9p/client.c:874:48:    expected struct spinlock [usertype] *lock
> net/9p/client.c:874:48:    got struct spinlock [noderef] <asn:4> *
> 
> Signed-off-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>

Thanks for this patch.
From what I can see, there are tons of other parts of the code doing the
same noderef access pattern to access current->sighand->siglock and I
don't see much doing that.
A couple of users justify this by saying SLAB_TYPESAFE_BY_RCU ensures
we'll always get a usable lock which won't be reinitialized however we
access it... It's a bit dubious we'll get the same lock than unlock to
me, so I agree to some change though.

After a second look I think we should use something like the following:

if (!lock_task_sighand(current, &flags))
	warn & skip (or some error, we'd null deref if this happened currently);
recalc_sigpending();
unlock_task_sighand(current, &flags);

As you can see, the rcu_read_lock() isn't kept until the unlock so I'm
not sure it will be enough to please sparse, but I've convinced myself
current->sighand cannot change while we hold the lock and there just are
too many such patterns in the kernel.

Please let me know if I missed something or if there is an ongoing
effort to change how this works; I'll wait for a v2.

-- 
Dominique

^ permalink raw reply

* Re: [PATCH] net: macb: reject unsupported rgmii delays
From: Florian Fainelli @ 2020-06-18 19:02 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: Helmut Grohne, Nicolas Ferre, David S. Miller, Jakub Kicinski,
	Palmer Dabbelt, Paul Walmsley, netdev@vger.kernel.org
In-Reply-To: <20200618184908.GH1551@shell.armlinux.org.uk>



On 6/18/2020 11:49 AM, Russell King - ARM Linux admin wrote:
> On Thu, Jun 18, 2020 at 11:06:43AM -0700, Florian Fainelli wrote:
>>
>>
>> On 6/18/2020 1:45 AM, Russell King - ARM Linux admin wrote:
>>> On Thu, Jun 18, 2020 at 10:14:33AM +0200, Helmut Grohne wrote:
>>>> On Wed, Jun 17, 2020 at 02:08:09PM +0200, Russell King - ARM Linux admin wrote:
>>>>> With a fixed link, we could be in either a MAC-to-PHY or MAC-to-MAC
>>>>> setup; we just don't know.  However, we don't have is access to the
>>>>> PHY (if it exists) in the fixed link case to configure it for the
>>>>> delay.
>>>>
>>>> Let me twist that a little: We may have access to the PHY, but we don't
>>>> always have access. When we do have access, we have a separate device
>>>> tree node with another fixed-link and another phy-mode. For fixed-links,
>>>> we specify the phy-mode for each end.
>>>
>>> If you have access to the PHY, you shouldn't be using fixed-link. In
>>> any case, there is no support for a fixed-link specification at the
>>> PHY end in the kernel.  The PHY binding doc does not allow for this
>>> either.
>>>
>>>>> In the MAC-to-MAC RGMII setup, where neither MAC can insert the
>>>>> necessary delay, the only way to have a RGMII conformant link is to
>>>>> have the PCB traces induce the necessary delay. That errs towards
>>>>> PHY_INTERFACE_MODE_RGMII for this case.
>>>>
>>>> Yes.
>>>>
>>>>> However, considering the MAC-to-PHY RGMII fixed link case, where the
>>>>> PHY may not be accessible, and may be configured with the necessary
>>>>> delay, should that case also use PHY_INTERFACE_MODE_RGMII - clearly
>>>>> that would be as wrong as using PHY_INTERFACE_MODE_RGMII_ID would
>>>>> be for the MAC-to-MAC RGMII with PCB-delays case.
>>>>
>>>> If you take into account that the PHY has a separate node with phy-mode
>>>> being rgmii-id, it makes a lot more sense to use rgmii for the phy-mode
>>>> of the MAC. So I don't think it is that clear that doing so is wrong.
>>>
>>> The PHY binding document does not allow for this, neither does the
>>> kernel.
>>>
>>>> In an earlier discussion Florian Fainelli said:
>>>> https://patchwork.ozlabs.org/project/netdev/patch/20190410005700.31582-19-olteanv@gmail.com/#2149183
>>>> | fixed-link really should denote a MAC to MAC connection so if you have
>>>> | "rgmii-id" on one side, you would expect "rgmii" on the other side
>>>> | (unless PCB traces account for delays, grrr).
>>>>
>>>> For these reasons, I still think that rgmii would be a useful
>>>> description for the fixed-link to PHY connection where the PHY adds the
>>>> delay.
>>>
>>> I think Florian is wrong; consider what it means when you have a
>>> fixed-link connection between a MAC and an inaccessible PHY and
>>> the link as a whole is operating in what we would call "rgmii-id"
>>> mode if the PHY were accessible.
>>>
>>> Taking Florian's stance, it basically means that DT no longer
>>> describes the hardware, but how we have chosen to interpret the
>>> properties in _one_ specific case in a completely different way
>>> to all the other cases.
>>
>> How do you ensure that a MAC to MAC connection using RGMII actually
>> works if you do not provide a 'phy-mode' for both sides right now?
>>
>> Yes this is more of a DT now describes a configuration choice rather
>> than the hardware but given that Ethernet MACs are usually capable of
>> supporting all 4 possible RGMII modes, how do you propose to solve the
>> negotiation of the appropriate RGMII mode here?
> 
> This is actually answered by yourself below.
> 
>>>>> So, I think a MAC driver should not care about the specific RGMII
>>>>> mode being asked for in any case, and just accept them all.
>>>>
>>>> I would like to agree to this. However, the implication is that when you
>>>> get your delays wrong, your driver silently ignores you and you never
>>>> notice your mistake until you see no traffic passing and wonder why.
>>>
>>> So explain to me this aspect of your reasoning:
>>>
>>> - If the link is operating in non-fixed-link mode, the rgmii* PHY modes
>>>   describe the delay to be applied at the PHY end.
>>> - If the link is operating in fixed-link mode, the rgmii* PHY modes
>>>   describe the delay to be applied at the MAC end.
>>>
>>> That doesn't make sense, and excludes properly describing a MAC-to-
>>> inaccessible-PHY setup.
>>
>> The point with a fixed link is that it does not matter what is the other
>> end, it could be a MAC, it could be a PHY, it could go nowhere, it just
>> does not matter, the only thing that you can know is you configure your
>> side of the fixed link.
> 
> That is *exactly* my point.
> 
> Today, with a PHY on the other end, the four RGMII modes tell you how
> the PHY is to be configured, not the MAC.  The documentation states
> this.
> 
> So, why should a fixed-link be any different?

It should not, but that means that when you describe the fixed-link, the
'phy-mode' within the local fixed-link node is meant to describe how the
other side is configured not how you are configured. For some reason I
did not consider that to be intuitive when I sent out my response, but I
suppose spelling it out would greatly help.
-- 
Florian

^ permalink raw reply

* Re: [PATCH] macvlan: Fix memory leak in macvlan_changelink_sources()
From: Markus Elfring @ 2020-06-18 19:00 UTC (permalink / raw)
  To: Zheng Bin, netdev
  Cc: linux-kernel, David S. Miller, Jakub Kicinski, Michael Braun,
	Yi Zhang

>     -->If fail, need to free previous malloc memory

I suggest to improve this change description.
How does the proposed addition of the function call “macvlan_flush_sources”
fit to this information?

Regards,
Markus

^ permalink raw reply

* Re: [RFC PATCH] net/sched: add indirect call wrapper hint.
From: Eric Dumazet @ 2020-06-18 19:00 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: David S. Miller, Eric Dumazet
In-Reply-To: <da175b76ca89e57876cf55d3d56aef126054d12c.1592501362.git.pabeni@redhat.com>



On 6/18/20 10:31 AM, Paolo Abeni wrote:
> The sched layer can use several indirect calls per
> packet, with not work-conservative qdisc being
> more affected due to the lack of the BYPASS path.
> 
> This change tries to improve the situation using
> the indirect call wrappers infrastructure for the
> qdisc enqueue end dequeue indirect calls.
> 
> To cope with non-trivial scenarios, a compile-time know is
> introduced, so that the qdisc used by ICW can be different
> from the default one.
> 
> Tested with pktgen over qdisc, with CONFIG_HINT_FQ_CODEL=y:
> 
> qdisc		threads vanilla	patched delta
> 		nr	Kpps	Kpps	%
> pfifo_fast	1	3300	3700	12
> pfifo_fast	2	3940	4070	3
> fq_codel	1	3840	4110	7
> fq_codel	2	1920	2260	17
> fq		1	2230	2210	-1
> fq		2	1530	1540	1

Hi Paolo

This test is a bit misleading, pktgen has a way to bypass the qdisc.

Real numbers for more typical workloads would be more appealing,
before we consider a quite invasive patch ?

What is the status of static_call infrastructure ?


>  
> +#ifndef CODEL_SCOPE
> +#define CODEL_SCOPE static
> +#endif

This looks additional burden, just remove the static attribute,
if a function might be called directly.

Eg, we have EXPORT_SYMBOL() all over the places, even if the modules needing
a symbol might not be compiled at all, or being part of vmlinux.

Thanks !





^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: switch most helper return values from 32-bit int to 64-bit long
From: John Fastabend @ 2020-06-18 18:58 UTC (permalink / raw)
  To: Andrii Nakryiko, John Fastabend
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team
In-Reply-To: <CAEf4BzZi5pMTC9Fq53Mi_mXUm-EQZDyqS_pxEYuGoc0J1ETGUA@mail.gmail.com>

Andrii Nakryiko wrote:
> On Wed, Jun 17, 2020 at 11:49 PM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Andrii Nakryiko wrote:
> > > Switch most of BPF helper definitions from returning int to long. These
> > > definitions are coming from comments in BPF UAPI header and are used to
> > > generate bpf_helper_defs.h (under libbpf) to be later included and used from
> > > BPF programs.
> > >
> > > In actual in-kernel implementation, all the helpers are defined as returning
> > > u64, but due to some historical reasons, most of them are actually defined as
> > > returning int in UAPI (usually, to return 0 on success, and negative value on
> > > error).
> >
> > Could we change the helpers side to return correct types now? Meaning if the
> > UAPI claims its an int lets actually return the int.
> 
> I'm not sure how exactly you see this being done. BPF ABI dictates
> that the helper's result is passed in a full 64-bit r0 register. Are
> you suggesting that in addition to RET_ANYTHING we should add
> RET_ANYTHING32 and teach verifier that higher 32 bits of r0 are
> guaranteed to be zero? And then make helpers actually return 32-bit
> values without up-casting them to u64?

Yes this is what I was thinking, having a RET_ANYTHING32 but I would
assume the upper 32-bits could be anything not zeroed. For +alu32
and programmer using correct types I would expect clang to generate
good code here and mostly not need to zero upper bits.

I think this discussion can be independent of your changes though and
its not at the top of my todo list so probably wont get to investigating
more for awhile.

> 
> >
> > >
> > > This actually causes Clang to quite often generate sub-optimal code, because
> > > compiler believes that return value is 32-bit, and in a lot of cases has to be
> > > up-converted (usually with a pair of 32-bit bit shifts) to 64-bit values,
> > > before they can be used further in BPF code.
> > >
> > > Besides just "polluting" the code, these 32-bit shifts quite often cause
> > > problems for cases in which return value matters. This is especially the case
> > > for the family of bpf_probe_read_str() functions. There are few other similar
> > > helpers (e.g., bpf_read_branch_records()), in which return value is used by
> > > BPF program logic to record variable-length data and process it. For such
> > > cases, BPF program logic carefully manages offsets within some array or map to
> > > read variable-length data. For such uses, it's crucial for BPF verifier to
> > > track possible range of register values to prove that all the accesses happen
> > > within given memory bounds. Those extraneous zero-extending bit shifts,
> > > inserted by Clang (and quite often interleaved with other code, which makes
> > > the issues even more challenging and sometimes requires employing extra
> > > per-variable compiler barriers), throws off verifier logic and makes it mark
> > > registers as having unknown variable offset. We'll study this pattern a bit
> > > later below.
> >
> > With latest verifier zext with alu32 support should be implemented as a
> > MOV insn.
> 
> Code generation is independent of verifier version or am I not getting
> what you are saying? Also all this code was compiled with up-to-date
> Clang.

Agh sorry I read the example too fast and too late. The above is a typo
what I meant is "With the latest _clang_ zext with alu32 support...". But,
I also read your code example wrong and the left/right shift pattern here,

> > >   19:   w1 = w0                              19:   r1 = 0 ll
> > >   20:   r1 <<= 32
> > >   21:   r1 s>>= 32

above is a _signed_ zext to deal with the types. OK thanks for bearing with me
on this one. Some more thoughts below.

> 
> >
> > >
> > > Another common pattern is to check return of BPF helper for non-zero state to
> > > detect error conditions and attempt alternative actions in such case. Even in
> > > this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode
> > > quite often leads to sub-optimal and unnecessary extra code. We'll look at
> > > this pattern as well.
> > >
> > > Clang's BPF target supports two modes of code generation: ALU32, in which it
> > > is capable of using lower 32-bit parts of registers, and no-ALU32, in which
> > > only full 64-bit registers are being used. ALU32 mode somewhat mitigates the
> > > above described problems, but not in all cases.
> >
> > A bit curious, do you see users running with no-ALU32 support? I have enabled
> > it by default now. It seems to generate better code and with latest 32-bit
> > bounds tracking I haven't hit any issues with verifier.
> 
> Yes, all Facebook apps are built with no-ALU32. And those apps have to
> run on quite old kernels as well, so relying on latest bug fixes in
> kernel is not an option right now.

OK got it.

> 
> >
> > >
> > > This patch switches all the cases in which BPF helpers return 0 or negative
> > > error from returning int to returning long. It is shown below that such change
> > > in definition leads to equivalent or better code. No-ALU32 mode benefits more,
> > > but ALU32 mode doesn't degrade or still gets improved code generation.
> > >
> > > Another class of cases switched from int to long are bpf_probe_read_str()-like
> > > helpers, which encode successful case as non-negative values, while still
> > > returning negative value for errors.
> > >
> > > In all of such cases, correctness is preserved due to two's complement
> > > encoding of negative values and the fact that all helpers return values with
> > > 32-bit absolute value. Two's complement ensures that for negative values
> > > higher 32 bits are all ones and when truncated, leave valid negative 32-bit
> > > value with the same value. Non-negative values have upper 32 bits set to zero
> > > and similarly preserve value when high 32 bits are truncated. This means that
> > > just casting to int/u32 is correct and efficient (and in ALU32 mode doesn't
> > > require any extra shifts).
> > >
> > > To minimize the chances of regressions, two code patterns were investigated,
> > > as mentioned above. For both patterns, BPF assembly was analyzed in
> > > ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and
> > > new 64-bit long return type.
> > >
> > > Case 1. Variable-length data reading and concatenation. This is quite
> > > ubiquitous pattern in tracing/monitoring applications, reading data like
> > > process's environment variables, file path, etc. In such case, many pieces of
> > > string-like variable-length data are read into a single big buffer, and at the
> > > end of the process, only a part of array containing actual data is sent to
> > > user-space for further processing. This case is tested in test_varlen.c
> > > selftest (in the next patch). Code flow is roughly as follows:
> > >
> > >   void *payload = &sample->payload;
> > >   u64 len;
> > >
> > >   len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
> > >   if (len <= MAX_SZ1) {
> > >       payload += len;
> > >       sample->len1 = len;
> > >   }
> > >   len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
> > >   if (len <= MAX_SZ2) {
> > >       payload += len;
> > >       sample->len2 = len;
> > >   }
> > >   /* and so on */
> > >   sample->total_len = payload - &sample->payload;
> > >   /* send over, e.g., perf buffer */
> > >
> > > There could be two variations with slightly different code generated: when len
> > > is 64-bit integer and when it is 32-bit integer. Both variations were analysed.
> > > BPF assembly instructions between two successive invocations of
> > > bpf_probe_read_kernel_str() were used to check code regressions. Results are
> > > below, followed by short analysis. Left side is using helpers with int return
> > > type, the right one is after the switch to long.
> > >
> > > ALU32 + INT                                ALU32 + LONG
> > > ===========                                ============
> > >
> > > 64-BIT (13 insns):                         64-BIT (10 insns):
> > > ------------------------------------       ------------------------------------
> > >   17:   call 115                             17:   call 115
> > >   18:   if w0 > 256 goto +9 <LBB0_4>         18:   if r0 > 256 goto +6 <LBB0_4>
> > >   19:   w1 = w0                              19:   r1 = 0 ll
> > >   20:   r1 <<= 32                            21:   *(u64 *)(r1 + 0) = r0
> > >   21:   r1 s>>= 32                           22:   r6 = 0 ll
> >
> > What version of clang is this? That is probably a zext in llvm-ir that in
> > latest should be sufficient with the 'w1=w0'. I'm guessing (hoping?) you
> > might not have latest?
> 
> Just double-checked, very latest Clang, built today. Still generates
> the same code.
> 
> But I think this makes sense, because r1 is u64, and it gets assigned
> from int, so int first has to be converted to s64, then casted to u64.
> So sign extension is necessary. I've confirmed with this simple
> program:
> 
> $ cat bla.c
> #include <stdio.h>
> 
> int main() {
>         int a = -1;
>         unsigned long b = a;
>         printf("%lx\n", b);
>         return 0;
> }
> $ clang bla.c -o test && ./test
> ffffffffffffffff
> 
> ^^^^^^^^--- not zeroes
> 
> So I don't think it's a bug or inefficiency, C language requires that.

Agreed. Sorry for the confusion on my side. Poked at this a bit more this
morning trying to see why I don't hit the same pattern when we have many
cases very similar to above.

In your C code you never check for negative return codes? Oops, this
means you can walk backwards off the front of payload? This is probably
not valid either from program logic side and/or verifier will probably
yell. Commented where I think you want a <0 check here,

 void *payload = &sample->payload;
 u64 len; // but how do you get a len < 0 check now with u64 type?

 len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
 // should insert a negative case check here
 // if (len < 0) goto abort;
 if (len <= MAX_SZ1) {
	payload += len;
	sample->len1 = len;
 }
 // without the <0 case you may have walked backwards in payload?
 len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);

So in all of my C code I have something like this,

 int myfunc(void *a, void *b, void *c) {
	void *payload = a;
	int len;

	len = probe_read_str(payload, 1000, b);
	if (len < 0) return len;
	if (len <= 1000) { // yeah we have room
		ayload += len;
	}
	len = probe_read_str(payload, 1000, c);
	[...]
 }

And then when I look at generated code I get this,

; int myfunc(void *a, void *b, void *c) {
       4:	call 45
; 	if (len < 0) return len;
       5:	if w0 s< 0 goto +9 <LBB0_4>
; 	if (len <= 1000) {
       6:	w2 = w0
       7:	r1 = r7
       8:	r1 += r2
       9:	if w0 s< 1001 goto +1 <LBB0_3>
      10:	r1 = r7

0000000000000058 <LBB0_3>:
; 	len = probe_read_str(payload, 1000, b);
      11:	w2 = 1000
      12:	r3 = r6
      13:	call 45


Here the <0 check means we can skip the sext and do a
simple zext which is just a w2=w0 by bpf zero extension
rules.

> 
> >
> > >   22:   r2 = 0 ll                            24:   r6 += r0
> > >   24:   *(u64 *)(r2 + 0) = r1              00000000000000c8 <LBB0_4>:
> > >   25:   r6 = 0 ll                            25:   r1 = r6
> > >   27:   r6 += r1                             26:   w2 = 256
> > > 00000000000000e0 <LBB0_4>:                   27:   r3 = 0 ll
> > >   28:   r1 = r6                              29:   call 115
> > >   29:   w2 = 256
> > >   30:   r3 = 0 ll
> > >   32:   call 115
> > >
> > > 32-BIT (11 insns):                         32-BIT (12 insns):
> > > ------------------------------------       ------------------------------------
> > >   17:   call 115                             17:   call 115
> > >   18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
> > >   19:   r1 = 0 ll                            19:   r1 = 0 ll
> > >   21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
> > >   22:   w1 = w0                              22:   r0 <<= 32
> > >   23:   r6 = 0 ll                            23:   r0 >>= 32
> > >   25:   r6 += r1                             24:   r6 = 0 ll
> > > 00000000000000d0 <LBB1_4>:                   26:   r6 += r0
> > >   26:   r1 = r6                            00000000000000d8 <LBB1_4>:
> > >   27:   w2 = 256                             27:   r1 = r6
> > >   28:   r3 = 0 ll                            28:   w2 = 256
> > >   30:   call 115                             29:   r3 = 0 ll
> > >                                              31:   call 115
> > >
> > > In ALU32 mode, the variant using 64-bit length variable clearly wins and
> > > avoids unnecessary zero-extension bit shifts. In practice, this is even more
> > > important and good, because BPF code won't need to do extra checks to "prove"
> > > that payload/len are within good bounds.
> >
> > I bet with latest clang the shifts are removed. But if not we probably
> > should fix clang regardless of if helpers return longs or ints.
> 
> are we still talking about bit shifts for INT HELPER + U64 len case?
> Or now about bit shifts in LONG HELPER + U32 len case?

Forget I was confused. I put the <0 checks in there mentally and it messed up
how I read your code.

> 
> >
> > >
> > > 32-bit len is one instruction longer. Clang decided to do 64-to-32 casting
> > > with two bit shifts, instead of equivalent `w1 = w0` assignment. The former
> > > uses extra register. The latter might potentially lose some range information,
> > > but not for 32-bit value. So in this case, verifier infers that r0 is [0, 256]
> > > after check at 18:, and shifting 32 bits left/right keeps that range intact.
> > > We should probably look into Clang's logic and see why it chooses bitshifts
> > > over sub-register assignments for this.
> > >
> > > NO-ALU32 + INT                             NO-ALU32 + LONG
> > > ==============                             ===============
> > >
> > > 64-BIT (14 insns):                         64-BIT (10 insns):
> > > ------------------------------------       ------------------------------------
> > >   17:   call 115                             17:   call 115
> > >   18:   r0 <<= 32                            18:   if r0 > 256 goto +6 <LBB0_4>
> > >   19:   r1 = r0                              19:   r1 = 0 ll
> > >   20:   r1 >>= 32                            21:   *(u64 *)(r1 + 0) = r0
> > >   21:   if r1 > 256 goto +7 <LBB0_4>         22:   r6 = 0 ll
> > >   22:   r0 s>>= 32                           24:   r6 += r0
> > >   23:   r1 = 0 ll                          00000000000000c8 <LBB0_4>:
> > >   25:   *(u64 *)(r1 + 0) = r0                25:   r1 = r6
> > >   26:   r6 = 0 ll                            26:   r2 = 256
> > >   28:   r6 += r0                             27:   r3 = 0 ll
> > > 00000000000000e8 <LBB0_4>:                   29:   call 115
> > >   29:   r1 = r6
> > >   30:   r2 = 256
> > >   31:   r3 = 0 ll
> > >   33:   call 115
> > >
> > > 32-BIT (13 insns):                         32-BIT (13 insns):
> > > ------------------------------------       ------------------------------------
> > >   17:   call 115                             17:   call 115
> > >   18:   r1 = r0                              18:   r1 = r0
> > >   19:   r1 <<= 32                            19:   r1 <<= 32
> > >   20:   r1 >>= 32                            20:   r1 >>= 32
> > >   21:   if r1 > 256 goto +6 <LBB1_4>         21:   if r1 > 256 goto +6 <LBB1_4>
> > >   22:   r2 = 0 ll                            22:   r2 = 0 ll
> > >   24:   *(u32 *)(r2 + 0) = r0                24:   *(u32 *)(r2 + 0) = r0
> > >   25:   r6 = 0 ll                            25:   r6 = 0 ll
> > >   27:   r6 += r1                             27:   r6 += r1
> > > 00000000000000e0 <LBB1_4>:                 00000000000000e0 <LBB1_4>:
> > >   28:   r1 = r6                              28:   r1 = r6
> > >   29:   r2 = 256                             29:   r2 = 256
> > >   30:   r3 = 0 ll                            30:   r3 = 0 ll
> > >   32:   call 115                             32:   call 115
> > >
> > > In NO-ALU32 mode, for the case of 64-bit len variable, Clang generates much
> > > superior code, as expected, eliminating unnecessary bit shifts. For 32-bit
> > > len, code is identical.
> >
> > Right I can't think of any way clang can avoid it here. OTOH I fix this
> > by enabling alu32 ;)
> >
> > >
> > > So overall, only ALU-32 32-bit len case is more-or-less equivalent and the
> > > difference stems from internal Clang decision, rather than compiler lacking
> > > enough information about types.
> > >
> > > Case 2. Let's look at the simpler case of checking return result of BPF helper
> > > for errors. The code is very simple:
> > >
> > >   long bla;
> > >   if (bpf_probe_read_kenerl(&bla, sizeof(bla), 0))
> > >       return 1;
> > >   else
> > >       return 0;
> > >
> > > ALU32 + CHECK (9 insns)                    ALU32 + CHECK (9 insns)
> > > ====================================       ====================================
> > >   0:    r1 = r10                             0:    r1 = r10
> > >   1:    r1 += -8                             1:    r1 += -8
> > >   2:    w2 = 8                               2:    w2 = 8
> > >   3:    r3 = 0                               3:    r3 = 0
> > >   4:    call 113                             4:    call 113
> > >   5:    w1 = w0                              5:    r1 = r0
> > >   6:    w0 = 1                               6:    w0 = 1
> > >   7:    if w1 != 0 goto +1 <LBB2_2>          7:    if r1 != 0 goto +1 <LBB2_2>
> > >   8:    w0 = 0                               8:    w0 = 0
> > > 0000000000000048 <LBB2_2>:                 0000000000000048 <LBB2_2>:
> > >   9:    exit                                 9:    exit
> > >
> > > Almost identical code, the only difference is the use of full register
> > > assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
> > > architectures, new BPF assembly might be slightly less optimal, in theory. But
> > > one can argue that's not a big issue, given that use of full registers is
> > > still prevalent (e.g., for parameter passing).
> > >
> > > NO-ALU32 + CHECK (11 insns)                NO-ALU32 + CHECK (9 insns)
> > > ====================================       ====================================
> > >   0:    r1 = r10                             0:    r1 = r10
> > >   1:    r1 += -8                             1:    r1 += -8
> > >   2:    r2 = 8                               2:    r2 = 8
> > >   3:    r3 = 0                               3:    r3 = 0
> > >   4:    call 113                             4:    call 113
> > >   5:    r1 = r0                              5:    r1 = r0
> > >   6:    r1 <<= 32                            6:    r0 = 1
> > >   7:    r1 >>= 32                            7:    if r1 != 0 goto +1 <LBB2_2>
> > >   8:    r0 = 1                               8:    r0 = 0
> > >   9:    if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
> > >  10:    r0 = 0                               9:    exit
> > > 0000000000000058 <LBB2_2>:
> > >  11:    exit
> > >
> > > NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
> > > shifts.
> >
> > It seems a win for the NO-ALU32 case but for the +ALU32 case I think its
> > the same with latest clang although I haven't tried yet. I was actually
> > considering going the other way and avoiding always returning u64 on
> > the other side. From a purely aesethetics point of view I prefer the
> > int type because it seems more clear/standard C. I'm also not so interested
> > in optimizing the no-alu32 case but curious if there is a use case for
> > that?
> 
> My point was that this int -> long switch doesn't degrade ALU32 and
> helps no-ALU32, and thus is good :)

With the long vs int I do see worse code when using the <0 check.
Using C function below which I took from some real code and renamed
variables. 

int myfunc(void *a, void *b, void *c) {
	void *payload = a;
	int len;

	len = probe_read_str(payload, 1000, a);
	if (len < 0) return len;
	if (len <= 1000) {
		payload += len;
	}
	len = probe_read_str(payload, 1000, b);
	if (len <= 1000) {
    		payload += len;
	}
	return 1;
}

Then here is the side-by-side of generated code, with +ALU32.

  int BPF_FUNC(probe_read, ...                  long BPF_FUNC(probe_read, ...
-------------------------------                ---------------------------------
       0:	r6 = r2                         0:	r6 = r2
       1:	r7 = r1                         1:	r7 = r1
       2:	w2 = 1000                       2:	w2 = 1000
       3:	r3 = r7                         3:	r3 = r7
       4:	call 45                         4:	call 45
       5:	if w0 s< 0 goto +9 <LBB0_4>     5:      r2 = r0
       6:	w2 = w0                         6:	if w0 s< 0 goto +10 <LBB0_4>
       7:	r1 = r7                         7:	r2 <<= 32
       8:	r1 += r2                        8:	r2 s>>= 32
       9:	if w0 s< 1001 goto +1 <LBB0_3>  9:	r1 = r7
      10:	r1 = r7                        10:	r1 += r2
      11:	w2 = 1000                      11:	if w0 s< 1001 goto +1 <LBB0_3>
      12:	r3 = r6                        12:	r1 = r7
      13:	call 45                        13:	w2 = 1000
      14:	w0 = 1                         14:	r3 = r6
      15:       exit                           15:	call 45
                                               16:	w0 = 1
                                               17:      exit

So a couple extra instruction, but more concerning we created a
<<=,s>> pattern. I'll need to do some more tests but my concern
is that could break verifier for real programs we have. I guess
it didn't in the selftests? Surely, this thread has at least
pointed out some gaps in our test cases. I guess I _could_ make
len a u64 type to remove the sext but then <0 check on a u64?!

> 
> Overall, long as a return type matches reality and BPF ABI
> specification. BTW, one of the varlen programs from patch 2 doesn't
> even validate successfully on latest kernel with latest Clang right
> now, if helpers return int, even though it's completely correct code.
> That's a real problem we have to deal with in few major BPF
> applications right now, and we have to use inline assembly to enforce
> Clang to do the right thing. A bunch of those problems are simply
> avoided with correct return types for helpers.

Do the real programs check <0? Did I miss something? I'll try
applying your patch to our real code base and see what happens.

^ permalink raw reply

* Re: [PATCH] net: macb: reject unsupported rgmii delays
From: Russell King - ARM Linux admin @ 2020-06-18 18:49 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Helmut Grohne, Nicolas Ferre, David S. Miller, Jakub Kicinski,
	Palmer Dabbelt, Paul Walmsley, netdev@vger.kernel.org
In-Reply-To: <91866c2b-77ea-c175-d672-a9ca937835bd@gmail.com>

On Thu, Jun 18, 2020 at 11:06:43AM -0700, Florian Fainelli wrote:
> 
> 
> On 6/18/2020 1:45 AM, Russell King - ARM Linux admin wrote:
> > On Thu, Jun 18, 2020 at 10:14:33AM +0200, Helmut Grohne wrote:
> >> On Wed, Jun 17, 2020 at 02:08:09PM +0200, Russell King - ARM Linux admin wrote:
> >>> With a fixed link, we could be in either a MAC-to-PHY or MAC-to-MAC
> >>> setup; we just don't know.  However, we don't have is access to the
> >>> PHY (if it exists) in the fixed link case to configure it for the
> >>> delay.
> >>
> >> Let me twist that a little: We may have access to the PHY, but we don't
> >> always have access. When we do have access, we have a separate device
> >> tree node with another fixed-link and another phy-mode. For fixed-links,
> >> we specify the phy-mode for each end.
> > 
> > If you have access to the PHY, you shouldn't be using fixed-link. In
> > any case, there is no support for a fixed-link specification at the
> > PHY end in the kernel.  The PHY binding doc does not allow for this
> > either.
> > 
> >>> In the MAC-to-MAC RGMII setup, where neither MAC can insert the
> >>> necessary delay, the only way to have a RGMII conformant link is to
> >>> have the PCB traces induce the necessary delay. That errs towards
> >>> PHY_INTERFACE_MODE_RGMII for this case.
> >>
> >> Yes.
> >>
> >>> However, considering the MAC-to-PHY RGMII fixed link case, where the
> >>> PHY may not be accessible, and may be configured with the necessary
> >>> delay, should that case also use PHY_INTERFACE_MODE_RGMII - clearly
> >>> that would be as wrong as using PHY_INTERFACE_MODE_RGMII_ID would
> >>> be for the MAC-to-MAC RGMII with PCB-delays case.
> >>
> >> If you take into account that the PHY has a separate node with phy-mode
> >> being rgmii-id, it makes a lot more sense to use rgmii for the phy-mode
> >> of the MAC. So I don't think it is that clear that doing so is wrong.
> > 
> > The PHY binding document does not allow for this, neither does the
> > kernel.
> > 
> >> In an earlier discussion Florian Fainelli said:
> >> https://patchwork.ozlabs.org/project/netdev/patch/20190410005700.31582-19-olteanv@gmail.com/#2149183
> >> | fixed-link really should denote a MAC to MAC connection so if you have
> >> | "rgmii-id" on one side, you would expect "rgmii" on the other side
> >> | (unless PCB traces account for delays, grrr).
> >>
> >> For these reasons, I still think that rgmii would be a useful
> >> description for the fixed-link to PHY connection where the PHY adds the
> >> delay.
> > 
> > I think Florian is wrong; consider what it means when you have a
> > fixed-link connection between a MAC and an inaccessible PHY and
> > the link as a whole is operating in what we would call "rgmii-id"
> > mode if the PHY were accessible.
> > 
> > Taking Florian's stance, it basically means that DT no longer
> > describes the hardware, but how we have chosen to interpret the
> > properties in _one_ specific case in a completely different way
> > to all the other cases.
> 
> How do you ensure that a MAC to MAC connection using RGMII actually
> works if you do not provide a 'phy-mode' for both sides right now?
> 
> Yes this is more of a DT now describes a configuration choice rather
> than the hardware but given that Ethernet MACs are usually capable of
> supporting all 4 possible RGMII modes, how do you propose to solve the
> negotiation of the appropriate RGMII mode here?

This is actually answered by yourself below.

> >>> So, I think a MAC driver should not care about the specific RGMII
> >>> mode being asked for in any case, and just accept them all.
> >>
> >> I would like to agree to this. However, the implication is that when you
> >> get your delays wrong, your driver silently ignores you and you never
> >> notice your mistake until you see no traffic passing and wonder why.
> > 
> > So explain to me this aspect of your reasoning:
> > 
> > - If the link is operating in non-fixed-link mode, the rgmii* PHY modes
> >   describe the delay to be applied at the PHY end.
> > - If the link is operating in fixed-link mode, the rgmii* PHY modes
> >   describe the delay to be applied at the MAC end.
> > 
> > That doesn't make sense, and excludes properly describing a MAC-to-
> > inaccessible-PHY setup.
> 
> The point with a fixed link is that it does not matter what is the other
> end, it could be a MAC, it could be a PHY, it could go nowhere, it just
> does not matter, the only thing that you can know is you configure your
> side of the fixed link.

That is *exactly* my point.

Today, with a PHY on the other end, the four RGMII modes tell you how
the PHY is to be configured, not the MAC.  The documentation states
this.

So, why should a fixed-link be any different?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* [PATCH] net/9p: Fix sparse endian warning in trans_fd.c
From: Alexander Kapshuk @ 2020-06-18 18:34 UTC (permalink / raw)
  To: ericvh, lucho, asmadeus
  Cc: davem, kuba, v9fs-developer, netdev, linux-kernel,
	alexander.kapshuk

Address sparse endian warning:
net/9p/trans_fd.c:932:28: warning: incorrect type in assignment (different base types)
net/9p/trans_fd.c:932:28:    expected restricted __be32 [addressable] [assigned] [usertype] s_addr
net/9p/trans_fd.c:932:28:    got unsigned long

Signed-off-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>
---
 net/9p/trans_fd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 13cd683a658a..2581f5145a22 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -929,7 +929,7 @@ static int p9_bind_privport(struct socket *sock)

 	memset(&cl, 0, sizeof(cl));
 	cl.sin_family = AF_INET;
-	cl.sin_addr.s_addr = INADDR_ANY;
+	cl.sin_addr.s_addr = htonl(INADDR_ANY);
 	for (port = p9_ipport_resv_max; port >= p9_ipport_resv_min; port--) {
 		cl.sin_port = htons((ushort)port);
 		err = kernel_bind(sock, (struct sockaddr *)&cl, sizeof(cl));
--
2.27.0


^ permalink raw reply related

* [PATCH] net/9p: Fix sparse rcu warnings in client.c
From: Alexander Kapshuk @ 2020-06-18 18:33 UTC (permalink / raw)
  To: ericvh, lucho, asmadeus
  Cc: davem, kuba, v9fs-developer, netdev, linux-kernel,
	alexander.kapshuk

Address sparse nonderef rcu warnings:
net/9p/client.c:790:17: warning: incorrect type in argument 1 (different address spaces)
net/9p/client.c:790:17:    expected struct spinlock [usertype] *lock
net/9p/client.c:790:17:    got struct spinlock [noderef] <asn:4> *
net/9p/client.c:792:48: warning: incorrect type in argument 1 (different address spaces)
net/9p/client.c:792:48:    expected struct spinlock [usertype] *lock
net/9p/client.c:792:48:    got struct spinlock [noderef] <asn:4> *
net/9p/client.c:872:17: warning: incorrect type in argument 1 (different address spaces)
net/9p/client.c:872:17:    expected struct spinlock [usertype] *lock
net/9p/client.c:872:17:    got struct spinlock [noderef] <asn:4> *
net/9p/client.c:874:48: warning: incorrect type in argument 1 (different address spaces)
net/9p/client.c:874:48:    expected struct spinlock [usertype] *lock
net/9p/client.c:874:48:    got struct spinlock [noderef] <asn:4> *

Signed-off-by: Alexander Kapshuk <alexander.kapshuk@gmail.com>
---
 net/9p/client.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index fc1f3635e5dd..807e0e2e2e5a 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -787,9 +787,15 @@ p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...)
 	}
 recalc_sigpending:
 	if (sigpending) {
-		spin_lock_irqsave(&current->sighand->siglock, flags);
+		struct sighand_struct *sighand;
+		rcu_read_lock();
+		sighand = rcu_dereference(current->sighand);
+
+		spin_lock_irqsave(&sighand->siglock, flags);
 		recalc_sigpending();
-		spin_unlock_irqrestore(&current->sighand->siglock, flags);
+		spin_unlock_irqrestore(&sighand->siglock, flags);
+
+		rcu_read_unlock();
 	}
 	if (err < 0)
 		goto reterr;
@@ -869,9 +875,15 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type,
 	}
 recalc_sigpending:
 	if (sigpending) {
-		spin_lock_irqsave(&current->sighand->siglock, flags);
+		struct sighand_struct *sighand;
+		rcu_read_lock();
+		sighand = rcu_dereference(current->sighand);
+
+		spin_lock_irqsave(&sighand->siglock, flags);
 		recalc_sigpending();
-		spin_unlock_irqrestore(&current->sighand->siglock, flags);
+		spin_unlock_irqrestore(&sighand->siglock, flags);
+
+		rcu_read_unlock();
 	}
 	if (err < 0)
 		goto reterr;
--
2.27.0


^ permalink raw reply related

* Re: [PATCH] net: macb: reject unsupported rgmii delays
From: Florian Fainelli @ 2020-06-18 18:26 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Helmut Grohne
  Cc: Nicolas Ferre, David S. Miller, Jakub Kicinski, Palmer Dabbelt,
	Paul Walmsley, netdev@vger.kernel.org
In-Reply-To: <20200618100143.GZ1551@shell.armlinux.org.uk>



On 6/18/2020 3:01 AM, Russell King - ARM Linux admin wrote:
> On Thu, Jun 18, 2020 at 11:05:26AM +0200, Helmut Grohne wrote:
>> On Thu, Jun 18, 2020 at 10:45:54AM +0200, Russell King - ARM Linux admin wrote:
>>> Why do we need that complexity?  If we decide that we can allow
>>> phy-mode = "rgmii" and introduce new properties to control the
>>> delay, then we just need:
>>>
>>>   rgmii-tx-delay-ps = <nnn>;
>>>   rgmii-rx-delay-ps = <nnn>;
>>>
>>> specified in the MAC node (to be applied only at the MAC end) or
>>> specified in the PHY node (to be applied only at the PHY end.)
>>> In the normal case, this would be the standard delay value, but
>>> in exceptional cases where supported, the delays can be arbitary.
>>> We know there are PHYs out there which allow other delays.
>>>
>>> This means each end is responsible for parsing these properties in
>>> its own node and applying them - or raising an error if they can't
>>> be supported.
>>
>> Thank you. That makes a lot more sense while keeping the (imo) important
>> properties of my proposal:
>>  * It is backwards compatible. These properties override delays
>>    specified inside phy-mode. Otherwise the vague phy-mode meaning is
>>    retained.
>>  * The ambiguity is resolved. It is always clear where delays should be
>>    configure and the properties properly account for possible PCB
>>    traces.
>>
>> It also resolves my original problem. If support for these properties is
>> added to macb_main.c, it would simply check that both delays are 0 as
>> internal delays are not supported by the hardware.  When I would have
>> attempted to configure an rx delay, it would have nicely errored out.
> 
> I think we'd want a helper or two to do the parsing and return the
> delays, something like:

Can you review Dan's patch series since he is attempting something that
may not end up solving Helmut's problem completely:

http://patchwork.ozlabs.org/project/netdev/list/?series=184052
-- 
Florian

^ permalink raw reply

* Re: [PATCH] net: macb: reject unsupported rgmii delays
From: Florian Fainelli @ 2020-06-18 18:25 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Vladimir Oltean
  Cc: Helmut Grohne, Nicolas Ferre, David S. Miller, Jakub Kicinski,
	Palmer Dabbelt, Paul Walmsley, netdev
In-Reply-To: <20200617115725.GR1551@shell.armlinux.org.uk>



On 6/17/2020 4:57 AM, Russell King - ARM Linux admin wrote:
> On Wed, Jun 17, 2020 at 02:38:25PM +0300, Vladimir Oltean wrote:
>> On Wed, 17 Jun 2020 at 14:34, Russell King - ARM Linux admin
>> <linux@armlinux.org.uk> wrote:
>>>
>>
>>>
>>> Why are you so abrasive?
>>>
>>> Not responding to this until you start behaving more reasonably.
>>>
>>> --
>>> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
>>> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
>>
>> Sorry.
>> What I meant to say is: the documentation is unclear. It has been
>> interpreted in conflicting ways, where the strict interpretations have
>> proven to have practical limitations, and the practical
>> interpretations have been accused of being incorrect. In my opinion
>> there is no way to fix it without introducing new bindings, which I am
>> not sure is really worth doing.
> 
> The documentation was added in 2016, many years after we have had users
> of this, in an attempt to clear up some of the confusion.  It is likely
> that it had to cater for existing users though - I'm sure if Florian
> cares, he can comment on that.

The need to clarify the 'phy-mode' property came in while reviewing the
aurora network driver and trying to make sense of what should be done
for a new driver in a way that would result in hopefully less confusion
moving forward.

> 
> It would be better if it made a definitive statement about it, but doing
> so would likely attract pedants to try to fix everything to conform,
> causing breakage in the process.

Exactly.

> 
> I've recently had a painful experience of this with the Atheros PHYs,
> where there are lots of platforms using "rgmii" when they should have
> been using "rgmii-id".  A patch changed this in the Atheros code breaking
> all these platforms, breakage which persisted over several kernel
> versions, needing fixes to DT files that then had to be back-ported.
> That's fine if you know what happened to break it, but if you don't, and
> you don't know what the fix is, you're mostly stuffed and stuck with non-
> working ethernet.  That really was not nice.
> 
> This is one of the reasons why I press for any new PHY interface mode
> to be documented in the phylib documentation right from the start, so
> that hopefully we can avoid this kind of thing in the future.
> 

It seems to be that this is a very RGMII specific problem and no other
electrical interface appears to have that problem, or it solves it
differently without requiring massive baby sitting from software (or
even more, just not for that particular problem maybe).
-- 
Florian

^ permalink raw reply

* Re: [RFC PATCH 06/21] mlx5: add header_split flag
From: Eric Dumazet @ 2020-06-18 18:12 UTC (permalink / raw)
  To: Jonathan Lemon, netdev
  Cc: kernel-team, axboe, Govindarajulu Varadarajan, Michal Kubecek
In-Reply-To: <20200618160941.879717-7-jonathan.lemon@gmail.com>



On 6/18/20 9:09 AM, Jonathan Lemon wrote:
> Adds a "rx_hd_split" private flag parameter to ethtool.
> 
> This enables header splitting, and sets up the fragment mappings.
> The feature is currently only enabled for netgpu channels.

We are using a similar idea (pseudo header split) to implement 4096+(headers) MTU at Google,
to enable TCP RX zerocopy on x86.

Patch for mlx4 has not been sent upstream yet.

For mlx4, we are using a single buffer of 128*(number_of_slots_per_RX_RING),
and 86 bytes for the first frag, so that the payload exactly fits a 4096 bytes page.

(In our case, most of our data TCP packets only have 12 bytes of TCP options)


I suggest that instead of a flag, you use a tunable, that can be set by ethtool,
so that the exact number of bytes can be tuned, instead of hard coded in the driver.

(Patch for the counter part of [1] was resent 10 days ago on netdev@ by Govindarajulu Varadarajan)
(Not sure if this has been merged yet)

[1]

commit f0db9b073415848709dd59a6394969882f517da9
Author: Govindarajulu Varadarajan <_govind@gmx.com>
Date:   Wed Sep 3 03:17:20 2014 +0530

    ethtool: Add generic options for tunables
    
    This patch adds new ethtool cmd, ETHTOOL_GTUNABLE & ETHTOOL_STUNABLE for getting
    tunable values from driver.
    
    Add get_tunable and set_tunable to ethtool_ops. Driver implements these
    functions for getting/setting tunable value.
    
    Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>




> 
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  .../ethernet/mellanox/mlx5/core/en_ethtool.c  | 15 +++++++
>  .../net/ethernet/mellanox/mlx5/core/en_main.c | 45 +++++++++++++++----
>  2 files changed, 52 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> index ec5658bbe3c5..a1b5d8b33b0b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> @@ -1905,6 +1905,20 @@ static int set_pflag_xdp_tx_mpwqe(struct net_device *netdev, bool enable)
>  	return err;
>  }
>  
> +static int set_pflag_rx_hd_split(struct net_device *netdev, bool enable)
> +{
> +	struct mlx5e_priv *priv = netdev_priv(netdev);
> +	int err;
> +
> +	priv->channels.params.hd_split = enable;
> +	err = mlx5e_safe_reopen_channels(priv);
> +	if (err)
> +		netdev_err(priv->netdev,
> +			   "%s failed to reopen channels, err(%d).\n",
> +				__func__, err);
> +	return err;
> +}
> +
>  static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
>  	{ "rx_cqe_moder",        set_pflag_rx_cqe_based_moder },
>  	{ "tx_cqe_moder",        set_pflag_tx_cqe_based_moder },
> @@ -1912,6 +1926,7 @@ static const struct pflag_desc mlx5e_priv_flags[MLX5E_NUM_PFLAGS] = {
>  	{ "rx_striding_rq",      set_pflag_rx_striding_rq },
>  	{ "rx_no_csum_complete", set_pflag_rx_no_csum_complete },
>  	{ "xdp_tx_mpwqe",        set_pflag_xdp_tx_mpwqe },
> +	{ "rx_hd_split",	 set_pflag_rx_hd_split },
>  };
>  
>  static int mlx5e_handle_pflag(struct net_device *netdev,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index a836a02a2116..cc8d30aa8a33 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -123,7 +123,8 @@ bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
>  
>  void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params)
>  {
> -	params->rq_wq_type = mlx5e_striding_rq_possible(mdev, params) &&
> +	params->rq_wq_type = MLX5E_HD_SPLIT(params) ? MLX5_WQ_TYPE_CYCLIC :
> +		mlx5e_striding_rq_possible(mdev, params) &&
>  		MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ) ?
>  		MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
>  		MLX5_WQ_TYPE_CYCLIC;
> @@ -323,6 +324,8 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
>  				if (prev)
>  					prev->last_in_page = true;
>  			}
> +			next_frag.di->netgpu_source =
> +						!!frag_info[f].frag_source;
>  			*frag = next_frag;
>  
>  			/* prepare next */
> @@ -373,6 +376,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  	struct mlx5_core_dev *mdev = c->mdev;
>  	void *rqc = rqp->rqc;
>  	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
> +	bool hd_split = MLX5E_HD_SPLIT(params) && (umem == (void *)0x1);
> +	u32 num_xsk_frames = 0;
>  	u32 rq_xdp_ix;
>  	u32 pool_size;
>  	int wq_sz;
> @@ -391,9 +396,10 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  	rq->mdev    = mdev;
>  	rq->hw_mtu  = MLX5E_SW2HW_MTU(params, params->sw_mtu);
>  	rq->xdpsq   = &c->rq_xdpsq;
> -	rq->umem    = umem;
> +	if (xsk)
> +		rq->umem    = umem;
>  
> -	if (rq->umem)
> +	if (umem)
>  		rq->stats = &c->priv->channel_stats[c->ix].xskrq;
>  	else
>  		rq->stats = &c->priv->channel_stats[c->ix].rq;
> @@ -404,14 +410,18 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  	rq->xdp_prog = params->xdp_prog;
>  
>  	rq_xdp_ix = rq->ix;
> -	if (xsk)
> +	if (umem)
>  		rq_xdp_ix += params->num_channels * MLX5E_RQ_GROUP_XSK;
>  	err = xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq_xdp_ix);
>  	if (err < 0)
>  		goto err_rq_wq_destroy;
>  
> +	if (umem == (void *)0x1)
> +		rq->buff.headroom = 0;
> +	else
> +		rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk);
> +
>  	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> -	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params, xsk);
>  	pool_size = 1 << params->log_rq_mtu_frames;
>  
>  	switch (rq->wq_type) {
> @@ -509,6 +519,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  
>  		rq->wqe.skb_from_cqe = xsk ?
>  			mlx5e_xsk_skb_from_cqe_linear :
> +			hd_split ? mlx5e_skb_from_cqe_nonlinear :
>  			mlx5e_rx_is_linear_skb(params, NULL) ?
>  				mlx5e_skb_from_cqe_linear :
>  				mlx5e_skb_from_cqe_nonlinear;
> @@ -2035,13 +2046,19 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  	int frag_size_max = DEFAULT_FRAG_SIZE;
>  	u32 buf_size = 0;
>  	int i;
> +	bool hd_split = MLX5E_HD_SPLIT(params) && xsk;
> +
> +	if (hd_split)
> +		frag_size_max = HD_SPLIT_DEFAULT_FRAG_SIZE;
> +	else
> +		frag_size_max = DEFAULT_FRAG_SIZE;
>  
>  #ifdef CONFIG_MLX5_EN_IPSEC
>  	if (MLX5_IPSEC_DEV(mdev))
>  		byte_count += MLX5E_METADATA_ETHER_LEN;
>  #endif
>  
> -	if (mlx5e_rx_is_linear_skb(params, xsk)) {
> +	if (!hd_split && mlx5e_rx_is_linear_skb(params, xsk)) {
>  		int frag_stride;
>  
>  		frag_stride = mlx5e_rx_get_linear_frag_sz(params, xsk);
> @@ -2059,6 +2076,16 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  		frag_size_max = PAGE_SIZE;
>  
>  	i = 0;
> +
> +	if (hd_split) {
> +		// Start with one fragment for all headers (implementing HDS)
> +		info->arr[0].frag_size = TOTAL_HEADERS;
> +		info->arr[0].frag_stride = roundup_pow_of_two(PAGE_SIZE);
> +		buf_size += TOTAL_HEADERS;
> +		// Now, continue with the payload frags.
> +		i = 1;
> +	}
> +
>  	while (buf_size < byte_count) {
>  		int frag_size = byte_count - buf_size;
>  
> @@ -2066,8 +2093,10 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  			frag_size = min(frag_size, frag_size_max);
>  
>  		info->arr[i].frag_size = frag_size;
> -		info->arr[i].frag_stride = roundup_pow_of_two(frag_size);
> -
> +		info->arr[i].frag_stride = roundup_pow_of_two(hd_split ?
> +							      PAGE_SIZE :
> +							      frag_size);
> +		info->arr[i].frag_source = hd_split;
>  		buf_size += frag_size;
>  		i++;
>  	}
> 

^ permalink raw reply

* Re: [PATCH] net: macb: reject unsupported rgmii delays
From: Florian Fainelli @ 2020-06-18 18:06 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, Helmut Grohne
  Cc: Nicolas Ferre, David S. Miller, Jakub Kicinski, Palmer Dabbelt,
	Paul Walmsley, netdev@vger.kernel.org
In-Reply-To: <20200618084554.GY1551@shell.armlinux.org.uk>



On 6/18/2020 1:45 AM, Russell King - ARM Linux admin wrote:
> On Thu, Jun 18, 2020 at 10:14:33AM +0200, Helmut Grohne wrote:
>> On Wed, Jun 17, 2020 at 02:08:09PM +0200, Russell King - ARM Linux admin wrote:
>>> With a fixed link, we could be in either a MAC-to-PHY or MAC-to-MAC
>>> setup; we just don't know.  However, we don't have is access to the
>>> PHY (if it exists) in the fixed link case to configure it for the
>>> delay.
>>
>> Let me twist that a little: We may have access to the PHY, but we don't
>> always have access. When we do have access, we have a separate device
>> tree node with another fixed-link and another phy-mode. For fixed-links,
>> we specify the phy-mode for each end.
> 
> If you have access to the PHY, you shouldn't be using fixed-link. In
> any case, there is no support for a fixed-link specification at the
> PHY end in the kernel.  The PHY binding doc does not allow for this
> either.
> 
>>> In the MAC-to-MAC RGMII setup, where neither MAC can insert the
>>> necessary delay, the only way to have a RGMII conformant link is to
>>> have the PCB traces induce the necessary delay. That errs towards
>>> PHY_INTERFACE_MODE_RGMII for this case.
>>
>> Yes.
>>
>>> However, considering the MAC-to-PHY RGMII fixed link case, where the
>>> PHY may not be accessible, and may be configured with the necessary
>>> delay, should that case also use PHY_INTERFACE_MODE_RGMII - clearly
>>> that would be as wrong as using PHY_INTERFACE_MODE_RGMII_ID would
>>> be for the MAC-to-MAC RGMII with PCB-delays case.
>>
>> If you take into account that the PHY has a separate node with phy-mode
>> being rgmii-id, it makes a lot more sense to use rgmii for the phy-mode
>> of the MAC. So I don't think it is that clear that doing so is wrong.
> 
> The PHY binding document does not allow for this, neither does the
> kernel.
> 
>> In an earlier discussion Florian Fainelli said:
>> https://patchwork.ozlabs.org/project/netdev/patch/20190410005700.31582-19-olteanv@gmail.com/#2149183
>> | fixed-link really should denote a MAC to MAC connection so if you have
>> | "rgmii-id" on one side, you would expect "rgmii" on the other side
>> | (unless PCB traces account for delays, grrr).
>>
>> For these reasons, I still think that rgmii would be a useful
>> description for the fixed-link to PHY connection where the PHY adds the
>> delay.
> 
> I think Florian is wrong; consider what it means when you have a
> fixed-link connection between a MAC and an inaccessible PHY and
> the link as a whole is operating in what we would call "rgmii-id"
> mode if the PHY were accessible.
> 
> Taking Florian's stance, it basically means that DT no longer
> describes the hardware, but how we have chosen to interpret the
> properties in _one_ specific case in a completely different way
> to all the other cases.

How do you ensure that a MAC to MAC connection using RGMII actually
works if you do not provide a 'phy-mode' for both sides right now?

Yes this is more of a DT now describes a configuration choice rather
than the hardware but given that Ethernet MACs are usually capable of
supporting all 4 possible RGMII modes, how do you propose to solve the
negotiation of the appropriate RGMII mode here?

> 
>>> So, I think a MAC driver should not care about the specific RGMII
>>> mode being asked for in any case, and just accept them all.
>>
>> I would like to agree to this. However, the implication is that when you
>> get your delays wrong, your driver silently ignores you and you never
>> notice your mistake until you see no traffic passing and wonder why.
> 
> So explain to me this aspect of your reasoning:
> 
> - If the link is operating in non-fixed-link mode, the rgmii* PHY modes
>   describe the delay to be applied at the PHY end.
> - If the link is operating in fixed-link mode, the rgmii* PHY modes
>   describe the delay to be applied at the MAC end.
> 
> That doesn't make sense, and excludes properly describing a MAC-to-
> inaccessible-PHY setup.

The point with a fixed link is that it does not matter what is the other
end, it could be a MAC, it could be a PHY, it could go nowhere, it just
does not matter, the only thing that you can know is you configure your
side of the fixed link. If you have visibility into the other side as
well, you can go one step further an put something in the other
fixed-link node that makes sense and will result in a working RGMII link.

> 
> It also means that we're having to conditionalise how we deal with
> this PHY mode in every single driver, which means that every single
> driver is going to end up interpreting it differently, and it's going
> to become a buggy mess.

Newsflash: it is already a buggy mess.

> 
>> In this case, I was faced with a PHY that would do rgmii-txid and I
>> configured that on the MAC. Unfortunately, macb_main.c didn't tell me
>> that it did rgmii-id instead.
> 
> The documentation makes it clear that "rgmii-*" (note the hyphen) are
> to be applied by the PHY *only*, and not by the MAC.
> 
>>> I also think that some of this ought to be put in the documentation
>>> as guidance for new implementations.
>>
>> That seems to be the part where everyone agrees.
>>
>> Given the state of the discussion, I'm wondering whether this could be
>> fixed at a more fundamental level in the device tree bindings.
>>
>> A number of people (including you) pointed out that retroactively fixing
>> the meaning of phy modes does not work and causes pain instead. That
>> hints that the only way to fix this is adding new properties. How about
>> this?
>>
>> rgmii-delay-type:
>>   description:
>>     Responsibility for adding the rgmii delay
>>   enum:
>>     # The remote PHY or MAC to this MAC is responsible for adding the
>>     # delay.
>>     - remote
>>     # The delay is added by neither MAC nor MAC, but using PCB traces
>>     # instead.
>>     - pcb
>>     # The MAC must add the delay.
>>     - local
> 
> Why do we need that complexity?  If we decide that we can allow
> phy-mode = "rgmii" and introduce new properties to control the
> delay, then we just need:
> 
>   rgmii-tx-delay-ps = <nnn>;
>   rgmii-rx-delay-ps = <nnn>;
> 
> specified in the MAC node (to be applied only at the MAC end) or
> specified in the PHY node (to be applied only at the PHY end.)
> In the normal case, this would be the standard delay value, but
> in exceptional cases where supported, the delays can be arbitary.
> We know there are PHYs out there which allow other delays.
> 
> This means each end is responsible for parsing these properties in
> its own node and applying them - or raising an error if they can't
> be supported.

And as we all know, RGMII is the most plug and play standard out there
so this is "just going to work".

> 
> With your "rgmii-delay-type" idea, if this is only specified at the
> MAC end, then the PHY code somehow needs to know what that setting is,
> which adds a lot of complexity - the PHY code has to go digging for
> the MAC node, we may even have to introduce a back-reference from the
> PHY node to the MAC node so the PHY can find it.  There are MAC drivers
> out there where there is one struct device, but multiple net devices,
> so digging inside struct net_device to get at the parent struct device,
> and then trying to parse "rgmii-delay-type" from the of-node won't work.
>-- 
Florian

^ permalink raw reply

* Re: [PATCH v1 2/3] net/fsl: acpize xgmac_mdio
From: Andy Shevchenko @ 2020-06-18 17:55 UTC (permalink / raw)
  To: Calvin Johnson
  Cc: Jeremy Linton, Andrew Lunn, Russell King - ARM Linux admin, Jon,
	Cristi Sovaiala, Ioana Ciornei, Florian Fainelli, Madalin Bucur,
	netdev, linux.cj
In-Reply-To: <20200618174252.GA9430@lsv03152.swis.in-blr01.nxp.com>

On Thu, Jun 18, 2020 at 8:43 PM Calvin Johnson
<calvin.johnson@oss.nxp.com> wrote:
> On Thu, Jun 18, 2020 at 07:00:20PM +0300, Andy Shevchenko wrote:
> > On Thu, Jun 18, 2020 at 6:46 PM Jeremy Linton <jeremy.linton@arm.com> wrote:

...

> > And here is the question, do we have (in form of email or other means)
> > an official response from NXP about above mentioned ID?
>
> At NXP, we've assgined NXP ids to various controllers and "NXP0006" is
> assigned to the xgmac_mdio controller.

Thanks!

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox