Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2 bpf-next] selftests/bpf: fix clearing buffered output between tests/subtests
From: Alexei Starovoitov @ 2019-07-31  4:14 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Network Development, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Kernel Team
In-Reply-To: <20190730180541.212452-1-andriin@fb.com>

On Tue, Jul 30, 2019 at 11:17 AM Andrii Nakryiko <andriin@fb.com> wrote:
>
> Clear buffered output once test or subtests finishes even if test was
> successful. Not doing this leads to accumulation of output from previous
> tests and on first failed tests lots of irrelevant output will be
> dumped, greatly confusing things.
>
> v1->v2: fix Fixes tag, add more context to patch
>
> Fixes: 3a516a0a3a7b ("selftests/bpf: add sub-tests support for test_progs")
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Applied. Thanks

^ permalink raw reply

* Re: [bpf-next,v2 0/6] Introduce a BPF helper to generate SYN cookies
From: Alexei Starovoitov @ 2019-07-31  4:12 UTC (permalink / raw)
  To: Petar Penkov
  Cc: Petar Penkov, Networking, bpf, David S . Miller,
	Alexei Starovoitov, Daniel Borkmann, Eric Dumazet, Lorenz Bauer,
	Stanislav Fomichev, Toke Høiland-Jørgensen
In-Reply-To: <CAG4SDVV9oBYkXqof=FoD0DeRY=+tSwZo3E1jhqMnF8F8+bVTbg@mail.gmail.com>

On Mon, Jul 29, 2019 at 4:46 PM Petar Penkov <ppenkov@google.com> wrote:
>>
> > What is cpu utilization at this rate?
> > Is it cpu or nic limited if you crank up the syn flood?
> > Original 7M with all cores or single core?
> My receiver was configured with 16rx queues and 16 cores. 7M all cores
> are at 100% so I believe this case is CPU limited. At XDP, all cores
> are at roughly 40%. I couldn't reliably generate higher SYN flood rate
> than that, and the highest numbers I could see for XDP did not go past
> 10.65Mpps with ~42% utilization on each core. I think I am hitting a
> NIC limit here since the CPUs are free.

Applied. Thanks!

^ permalink raw reply

* [PATCH] atm: iphase: Fix Spectre v1 vulnerability
From: Gustavo A. R. Silva @ 2019-07-31  3:21 UTC (permalink / raw)
  To: Chas Williams
  Cc: linux-atm-general, netdev, linux-kernel, Gustavo A. R. Silva

board is controlled by user-space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.

This issue was detected with the help of Smatch:

drivers/atm/iphase.c:2765 ia_ioctl() warn: potential spectre issue 'ia_dev' [r] (local cap)
drivers/atm/iphase.c:2774 ia_ioctl() warn: possible spectre second half.  'iadev'
drivers/atm/iphase.c:2782 ia_ioctl() warn: possible spectre second half.  'iadev'
drivers/atm/iphase.c:2816 ia_ioctl() warn: possible spectre second half.  'iadev'
drivers/atm/iphase.c:2823 ia_ioctl() warn: possible spectre second half.  'iadev'
drivers/atm/iphase.c:2830 ia_ioctl() warn: potential spectre issue '_ia_dev' [r] (local cap)
drivers/atm/iphase.c:2845 ia_ioctl() warn: possible spectre second half.  'iadev'
drivers/atm/iphase.c:2856 ia_ioctl() warn: possible spectre second half.  'iadev'

Fix this by sanitizing board before using it to index ia_dev and _ia_dev

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://lore.kernel.org/lkml/20180423164740.GY17484@dhcp22.suse.cz/

Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 drivers/atm/iphase.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/atm/iphase.c b/drivers/atm/iphase.c
index 302cf0ba1600..8c7a996d1f16 100644
--- a/drivers/atm/iphase.c
+++ b/drivers/atm/iphase.c
@@ -63,6 +63,7 @@
 #include <asm/byteorder.h>  
 #include <linux/vmalloc.h>
 #include <linux/jiffies.h>
+#include <linux/nospec.h>
 #include "iphase.h"		  
 #include "suni.h"		  
 #define swap_byte_order(x) (((x & 0xff) << 8) | ((x & 0xff00) >> 8))
@@ -2760,8 +2761,11 @@ static int ia_ioctl(struct atm_dev *dev, unsigned int cmd, void __user *arg)
    }
    if (copy_from_user(&ia_cmds, arg, sizeof ia_cmds)) return -EFAULT; 
    board = ia_cmds.status;
-   if ((board < 0) || (board > iadev_count))
-         board = 0;    
+
+	if ((board < 0) || (board > iadev_count))
+		board = 0;
+	board = array_index_nospec(board, iadev_count + 1);
+
    iadev = ia_dev[board];
    switch (ia_cmds.cmd) {
    case MEMDUMP:
-- 
2.22.0

^ permalink raw reply related

* Re: [RFC] net: phy: read link status twice when phy_check_link_status()
From: liuyonglong @ 2019-07-31  3:33 UTC (permalink / raw)
  To: Heiner Kallweit, andrew, davem
  Cc: netdev, linux-kernel, linuxarm, salil.mehta, yisen.zhuang,
	shiju.jose
In-Reply-To: <9a0a8094-42ee-0a18-0e9a-d3ca783d6d4b@gmail.com>



On 2019/7/31 3:04, Heiner Kallweit wrote:
> On 30.07.2019 08:35, liuyonglong wrote:
>> :/sys/kernel/debug/tracing$ cat trace
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 45/45   #P:128
>> #
>> #                              _-----=> irqs-off
>> #                             / _----=> need-resched
>> #                            | / _---=> hardirq/softirq
>> #                            || / _--=> preempt-depth
>> #                            ||| /     delay
>> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
>> #              | |       |   ||||       |         |
>>     kworker/64:2-1028  [064] ....   172.295687: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x02 val:0x001c
>>     kworker/64:2-1028  [064] ....   172.295726: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x03 val:0xc916
>>     kworker/64:2-1028  [064] ....   172.296902: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x01 val:0x79ad
>>     kworker/64:2-1028  [064] ....   172.296938: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x0f val:0x2000
>>     kworker/64:2-1028  [064] ....   172.321213: mdio_access: mii-0000:bd:00.0 read  phy:0x01 reg:0x00 val:0x1040
>>     kworker/64:2-1028  [064] ....   172.343209: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x02 val:0x001c
>>     kworker/64:2-1028  [064] ....   172.343245: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x03 val:0xc916
>>     kworker/64:2-1028  [064] ....   172.343882: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>>     kworker/64:2-1028  [064] ....   172.343918: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x0f val:0x2000
>>     kworker/64:2-1028  [064] ....   172.362658: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>>     kworker/64:2-1028  [064] ....   172.385961: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x02 val:0x001c
>>     kworker/64:2-1028  [064] ....   172.385996: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x03 val:0xc916
>>     kworker/64:2-1028  [064] ....   172.386646: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x01 val:0x79ad
>>     kworker/64:2-1028  [064] ....   172.386681: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x0f val:0x2000
>>     kworker/64:2-1028  [064] ....   172.411286: mdio_access: mii-0000:bd:00.2 read  phy:0x05 reg:0x00 val:0x1040
>>     kworker/64:2-1028  [064] ....   172.433225: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x02 val:0x001c
>>     kworker/64:2-1028  [064] ....   172.433260: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x03 val:0xc916
>>     kworker/64:2-1028  [064] ....   172.433887: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x01 val:0x79ad
>>     kworker/64:2-1028  [064] ....   172.433922: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x0f val:0x2000
>>     kworker/64:2-1028  [064] ....   172.452862: mdio_access: mii-0000:bd:00.3 read  phy:0x07 reg:0x00 val:0x1040
>>         ifconfig-1324  [011] ....   177.325585: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>>   kworker/u257:0-8     [012] ....   177.325642: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x04 val:0x01e1
>>   kworker/u257:0-8     [012] ....   177.325654: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x04 val:0x05e1
>>   kworker/u257:0-8     [012] ....   177.325708: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x79ad
>>   kworker/u257:0-8     [012] ....   177.325744: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x09 val:0x0200
>>   kworker/u257:0-8     [012] ....   177.325779: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x00 val:0x1040
>>   kworker/u257:0-8     [012] ....   177.325788: mdio_access: mii-0000:bd:00.1 write phy:0x03 reg:0x00 val:0x1240
>>   kworker/u257:0-8     [012] ....   177.325843: mdio_access: mii-0000:bd:00.1 read  phy:0x03 reg:0x01 val:0x798d
> 
> What I think that happens here:
> Writing 0x1240 to BMCR starts aneg. When reading BMSR immediately after that then the PHY seems to have cleared
> the "aneg complete" bit already, but not yet the "link up" bit. This results in the false "link up" notification.
> The following patch is based on the fact that in case of enabled aneg we can't have a valid link if aneg isn't
> finished. Could you please test whether this works for you?
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 6b5cb87f3..7ddd91df9 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1774,6 +1774,12 @@ int genphy_update_link(struct phy_device *phydev)
>  	phydev->link = status & BMSR_LSTATUS ? 1 : 0;
>  	phydev->autoneg_complete = status & BMSR_ANEGCOMPLETE ? 1 : 0;
>  
> +	/* Consider the case that autoneg was started and "aneg complete"
> +	 * bit has been reset, but "link up" bit not yet.
> +	 */
> +	if (phydev->autoneg == AUTONEG_ENABLE && !phydev->autoneg_complete)
> +		phydev->link = 0;
> +
>  	return 0;
>  }
>  EXPORT_SYMBOL(genphy_update_link);
> 

This patch can solve the issue! Will it be upstream?

So it's nothing to do with the bios, and just the PHY's own behavior,
the "link up" bit can not reset immediately，right?

ps: It will take 1 second more to make the link up for RTL8211F when 0x798d happend.



^ permalink raw reply

* Fwd: net: ipv6: Fix a bug in ndisc_send_ns when netdev only has a global address
From: Mark Smith @ 2019-07-31  3:32 UTC (permalink / raw)
  To: David Ahern, Su Yanjun, netdev
In-Reply-To: <CAO42Z2we930YTmMUkaWXZOqFQVFxH=bd=y+JFXG8V0Y736kzug@mail.gmail.com>

Re-sending in text format due to Gmail preserving the HTML email I
received and vger (quite reasonably) rejecting my response.

---------- Forwarded message ---------
From: Mark Smith <markzzzsmith@gmail.com>
Date: Wed, 31 Jul 2019 at 12:23
Subject: Re: net: ipv6: Fix a bug in ndisc_send_ns when netdev only
has a global address
To: David Ahern <dsahern@gmail.com>
Cc: Su Yanjun <suyj.fnst@cn.fujitsu.com>, <netdev@vger.kernel.org>

Hi David,

On Wed., 31 Jul. 2019, 00:11 David Ahern, <dsahern@gmail.com> wrote:
>
> On 7/30/19 4:28 AM, Mark Smith wrote:
> > Hi Su,
> >

>
> <snip>

>
> >>> This patch is not the correct solution to this issue.
> >>>
> >
> > <snip>
> >
> >> In linux implementation, one interface may have no link local address if
> >> kernel config
> >>
> >> *addr_gen_mode* is set to IN6_ADDR_GEN_MODE_NONE. My patch is to fix
> >> this problem.
> >>
> >
> > So this "IN6_ADDR_GEN_MODE_NONE" behaviour doesn't comply with RFC 4291.
> >
> > As RFC 4291 says,
> >
> > "All interfaces are *required* to have *at least one* Link-Local
> > unicast address."
> >
> > That's not an ambiguous requirement.
>
> Interesting. Going back to the original commit:
>
> commit bc91b0f07ada5535427373a4e2050877bcc12218
> Author: Jiri Pirko <jiri@resnulli.us>
> Date:   Fri Jul 11 21:10:18 2014 +0200
>
>     ipv6: addrconf: implement address generation modes
>
>     This patch introduces a possibility for userspace to set various (so far
>     two) modes of generating addresses. This is useful for example for
>     NetworkManager because it can set the mode to NONE and take care of link
>     local addresses itself. That allow it to have the interface up,
>     monitoring carrier but still don't have any addresses on it.
>
> So the intention of IN6_ADDR_GEN_MODE_NONE was for userspace to control
> it. If an LLA is required (4291 says yes, 4861 suggests no) then the
> current behavior is correct and if IN6_ADDR_GEN_MODE_NONE is used by an
> admin some userspace agent is required to add it for IPv6 to work on
> that link.
>

Ok. That seems to be saying that IN6_ADDR_GEN_MODE_NONE means that the
kernel is not going perform any address configuration on the interface
for any prefixes.

That would then place the RFC 4291 burden to generate at least one LL
address for the interface onto the user space application that has
taken over performing IPv6 address configuration on an interface.

> <snip>
> >
> > It is an IPv6 enabled interface, so it requires a link-local address,
> > per RFC 4291. RFC 4291 doesn't exclude any interfaces types from the
> > LL address requirement.
>
> There is no 'link' for loopback, so really no point in generating an LLA
> for it.
>

If your 'link' mean something physical, then I agree, the loopback
virtual interface doesn't have a link.

From IPv6's perspective, there is a link attached, because the
interface is operationally UP and IPv6 can send and receive packets
over it. They just happen to be returned to the sender by the
link-layer below the IPv6 layer. This behaviour is functionally no
different to when a physical loopback cable/plug is plugged into a
physical interface.

IPv6 tries to be fairly generic with definitions such as 'link' and
'interface' to be future proof. Here's the RFC 8200 definitons:

"link         a communication facility or medium over which nodes can
                communicate at the link layer, i.e., the layer
                immediately below IPv6.  Examples are Ethernets [...];
                and internet-layer or higher-layer "tunnels",
                such as tunnels over IPv4 or IPv6 itself.

interface    a node's attachment to a link."

The loopback virtual interface is providing both "a communication
facility [...] over which nodes can communicate at the link layer,
i.e., the layer immediately below IPv6" and an "attachment to a link".

So the loopback virtual interface is by definition a interface per the
IPv6 specification, and therefore requires a link-local address to be
compliant.

Regards,
Mark.

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Andrew Lunn @ 2019-07-31  3:31 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Ido Schimmel, Nikolay Aleksandrov, Horatiu Vultur, roopa, davem,
	bridge, netdev, linux-kernel
In-Reply-To: <20190730190000.diacyjw6owqkf7uf@lx-anielsen.microsemi.net>

> Our plan was to implement this in pure SW, and then look at how to HW offload
> it.

Great.

> But this will take some time before we have anything meaning full to show.
> 
> > Make it an alternative to the STP code?
> I'm still working on learning the details of DLR, but I actually believe that it
> in some situations may co-exists with STP ;-)

The PDF you linked to suggests this as well. But i think you will need
to make some core changes to the bridge. At the moment, STP is a
bridge level property. But you are going to need it to be a per-port
option. You can then use DLR on the ring ports, and optionally STP on
the other ports.

> But what we are looking at here, is to offload a
> non-aware-(DLR|MRP)-switch which happens to be placed in a network
> with these protocols running.

So we need to think about why we are passing traffic to the CPU port,
and under what conditions can it be blocked.

1) The interface is not part of a bridge. In this case, we only need
the switch to pass to the CPU port MC addresses which have been set
via set_rx_mode().

I think this case does not apply for what you want. You have two ports
bridges together as part of the ring.

2) The interface is part of a bridge. There are a few sub-cases

a) IGMP snooping is being performed. We can block multicast where
there is no interest in the group. But this is limited to IP
multicast.

b) IGMP snooping is not being used and all interfaces in the bridge
are ports of the switch. IP Multicast can be blocked to the CPU.

c) IGMP snooping is not being used and there is a non-switch interface
in the bridge. Multicast needed is needed, so it can be flooded out
this port.

d) set_rx_mode() has been called on the br0 interface, indicating
there is interest in the packets on the host. They must be sent to the
CPU so they can be delivered locally.

e) ????

Does the Multicast MAC address being used by DLR also map to an IP
mmulticast address? 01:21:6C:00:00:0[123] appear to be the MAC
addresses used by DLR. IPv4 multicast MAC addresses are
01:00:5E:XX:XX:XX. IPv6 multicast MAC addresses are 33:33:XX:XX:XX:XX.

So one possibility here is to teach the SW bridge about non-IP
multicast addresses. Initially the switch should forward all MAC
multicast frames to the CPU. If the frame is not an IPv4 or IPv6
frame, and there has not been a call to set_rx_mode() for the MAC
address on the br0 interface, and the bridge only contains switch
ports, switchdev could be used to block the multicast to the CPU
frame, but forward it out all other ports of the bridge.

      Andrew

^ permalink raw reply

* Re: Reminder: 99 open syzbot bugs in net subsystem
From: Eric Biggers @ 2019-07-31  2:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, dvyukov, netdev, fw, i.maximets, edumazet, dsahern,
	linux-kernel, syzkaller-bugs
In-Reply-To: <1e07462d-61e2-9885-edd0-97a82dd7883e@gmail.com>

On Thu, Jul 25, 2019 at 07:04:47AM +0200, Eric Dumazet wrote:
> 
> 
> On 7/24/19 11:09 PM, Eric Biggers wrote:
> > On Wed, Jul 24, 2019 at 01:09:28PM -0700, David Miller wrote:
> >> From: Eric Biggers <ebiggers@kernel.org>
> >> Date: Wed, 24 Jul 2019 11:37:12 -0700
> >>
> >>> We can argue about what words to use to describe this situation, but
> >>> it doesn't change the situation itself.
> >>
> >> And we should argue about those words because it matters to humans and
> >> effects how they feel, and humans ultimately fix these bugs.
> >>
> >> So please stop with the hyperbole.
> >>
> >> Thank you.
> > 
> > Okay, there are 151 bugs that syzbot saw on the mainline Linux kernel in the
> > last 7 days (90.1% with reproducers).  Of those, 59 were reported over 3 months
> > ago (89.8% with reproducers).  Of those, 12 were reported over a year ago (83.3%
> > with reproducers).
> > 
> > No opinion on whether those are small/medium/large numbers, in case it would
> > hurt someone's feelings.
> > 
> > These numbers do *not* include bugs that are still valid but weren't seen on
> > mainline in last 7 days, e.g.:
> > 
> > - Bugs that are seen only rarely, so by chance weren't seen in last 7 days.
> > - Bugs only in linux-next and/or subsystem branches.
> > - Bugs that were seen in mainline more than 7 days ago, and then only on
> >   linux-next or subsystem branch in last 7 days.
> > - Bugs that stopped being seen due to a change in syzkaller.
> > - Bugs that stopped being seen due to a change in kernel config.
> > - Bugs that stopped being seen due to other environment changes such as kernel
> >   command line parameters.
> > - Bugs that stopped being seen due to a kernel change that hid the bug but
> >   didn't actually fix it, i.e. still reachable in other ways.
> > 
> 
> We do not doubt syzkaller is an incredible tool.
> 
> But netdev@ and lkml@ are mailing lists for humans to interact,
> exchange ideas, send patches and review them.
> 
> To me, an issue that was reported to netdev by a real user is _way_ more important
> than potential issues that a bot might have found doing crazy things.
> 
> We need to keep optimal S/N on mailing lists, so any bots trying to interact
> with these lists must be very cautious and damn smart.
> 
> When I have time to spare and can work on syzbot reports, I am going to a web
> page where I can see them and select the ones it makes sense to fix.
> I hate having to set up email filters.
> 

syzbot finds a lot of security bugs, and security bugs are important.  And the
bugs are still there regardless of whether they're reported by human or bot.

Also, there *are* bugs being fixed because of these reminders; some subsystem
maintainers have even fixed all the bugs in their subsystem.  But I can
understand that for subsystems with a lot of open bug reports it's overwhelming.

What I'll try doing next time (if there *is* a next time; it isn't actually my
job to do any of this, I just care about the security and reliability of
Linux...) is for subsystems with lots of open bug reports, only listing the ones
actually seen in the last week or so, and perhaps also spending some more time
manually checking those bugs.  That should cut down the noise a lot.

- Eric

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Andrew Lunn @ 2019-07-31  2:34 UTC (permalink / raw)
  To: Tao Ren
  Cc: Vladimir Oltean, Florian Fainelli, Heiner Kallweit,
	David S . Miller, Arun Parameswaran, Justin Chen, netdev, lkml,
	Andrew Jeffery, openbmc@lists.ozlabs.org
In-Reply-To: <885e48dd-df5b-7f08-ef58-557fc2347fa6@fb.com>

> Hi Andrew,
> 
> The BCM54616S PHY on my machine is connected to a BCM5396 switch chip over backplane (1000Base-KX).

Ah, that is different. So the board is using it for RGMII to 1000Base-KX?

phy-mode is about the MAC-PHY link. So in this case RGMII.

There is no DT way to configure the PHY-Switch link. However, it
sounds like you have the PHY strapped so it is doing 1000BaseX on the
PHY-Switch link. So do you actually need to configure this?

You report you never see link up? So maybe the problem is actually in
read_status? When in 1000BaseX mode, do you need to read the link
status from some other register? The Marvell PHYs use a second page
for 1000BaseX.

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-07-31  2:09 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Vladimir Oltean, Florian Fainelli, Heiner Kallweit,
	David S . Miller, Arun Parameswaran, Justin Chen, netdev, lkml,
	Andrew Jeffery, openbmc@lists.ozlabs.org
In-Reply-To: <20190731013636.GC25700@lunn.ch>

On 7/30/19 6:36 PM, Andrew Lunn wrote:
>> The INTF_SEL pins report correct mode (RGMII-Fiber) on my machine,
>> but there are 2 "sub-modes" (1000Base-X and 100Base-FX) and I
>> couldn't find a proper/safe way to auto-detect which "sub-mode" is
>> active. The datasheet just describes instructions to enable a
>> specific mode, but it doesn't say 1000Base-X/100Base-FX mode will be
>> auto-selected. And that's why I came up with the patch to specify
>> 1000Base-X mode.
> 
> Fibre does not perform any sort of auto-negotiation. I assume you have
> an SFP connected? When using PHYLINK, the sfp driver will get the
> supported baud rate from SFP EEPROM to determine what speed could be
> used. However, there is currently no mainline support for having a
> chain MAC-PHY-SFP. For that you need Russells out of tree patches.

Hi Andrew,

The BCM54616S PHY on my machine is connected to a BCM5396 switch chip over backplane (1000Base-KX).


Thanks,

Tao

^ permalink raw reply

* [PATCH net v3] net: ipv6: Fix a bug in ndisc_send_ns when netdev only has a global address
From: Su Yanjun @ 2019-07-31  1:52 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji; +Cc: netdev, linux-kernel, suyj.fnst

When the egress interface does not have a link local address, it can
not communicate with other hosts.

In RFC4861, 7.2.2 says
"If the source address of the packet prompting the solicitation is the
same as one of the addresses assigned to the outgoing interface, that
address SHOULD be placed in the IP Source Address of the outgoing
solicitation.  Otherwise, any one of the addresses assigned to the
interface should be used."

In this patch we try get a global address if we get ll address failed.

Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
---
Changes since V2:
	- Let banned_flags under the scope of its use.
---
 include/net/addrconf.h |  2 ++
 net/ipv6/addrconf.c    | 34 ++++++++++++++++++++++++++++++++++
 net/ipv6/ndisc.c       | 10 +++++++---
 3 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 269ec27..eae1167 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -107,6 +107,8 @@ int __ipv6_get_lladdr(struct inet6_dev *idev, struct in6_addr *addr,
 		      u32 banned_flags);
 int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr,
 		    u32 banned_flags);
+int ipv6_get_addr(struct net_device *dev, struct in6_addr *addr,
+		    u32 banned_flags);
 bool inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2,
 			  bool match_wildcard);
 bool inet_rcv_saddr_any(const struct sock *sk);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4ae17a9..9e537bd 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1873,6 +1873,40 @@ int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr,
 	return err;
 }
 
+int __ipv6_get_addr(struct inet6_dev *idev, struct in6_addr *addr,
+		    u32 banned_flags)
+{
+	struct inet6_ifaddr *ifp;
+	int err = -EADDRNOTAVAIL;
+
+	list_for_each_entry(ifp, &idev->addr_list, if_list) {
+		if (ifp->scope == 0 &&
+		    !(ifp->flags & banned_flags)) {
+			*addr = ifp->addr;
+			err = 0;
+			break;
+		}
+	}
+	return err;
+}
+
+int ipv6_get_addr(struct net_device *dev, struct in6_addr *addr,
+		  u32 banned_flags)
+{
+	struct inet6_dev *idev;
+	int err = -EADDRNOTAVAIL;
+
+	rcu_read_lock();
+	idev = __in6_dev_get(dev);
+	if (idev) {
+		read_lock_bh(&idev->lock);
+		err = __ipv6_get_addr(idev, addr, banned_flags);
+		read_unlock_bh(&idev->lock);
+	}
+	rcu_read_unlock();
+	return err;
+}
+
 static int ipv6_count_addresses(const struct inet6_dev *idev)
 {
 	const struct inet6_ifaddr *ifp;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 659ecf4e..fa58c6e 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -592,9 +592,13 @@ void ndisc_send_ns(struct net_device *dev, const struct in6_addr *solicit,
 	struct nd_msg *msg;
 
 	if (!saddr) {
-		if (ipv6_get_lladdr(dev, &addr_buf,
-				   (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)))
-			return;
+		u32 banned_flags = IFA_F_TENTATIVE | IFA_F_OPTIMISTIC;
+
+		if (ipv6_get_lladdr(dev, &addr_buf, banned_flags)) {
+			/* try global address */
+			if (ipv6_get_addr(dev, &addr_buf, banned_flags))
+				return;
+		}
 		saddr = &addr_buf;
 	}
 
-- 
2.7.4




^ permalink raw reply related

* Re: [PATCH bpf-next v10 10/10] landlock: Add user and kernel documentation for Landlock
From: Randy Dunlap @ 2019-07-31  1:53 UTC (permalink / raw)
  To: Mickaël Salaün, linux-kernel
  Cc: Alexander Viro, Alexei Starovoitov, Andrew Morton,
	Andy Lutomirski, Arnaldo Carvalho de Melo, Casey Schaufler,
	Daniel Borkmann, David Drysdale, David S . Miller,
	Eric W . Biederman, James Morris, Jann Horn, John Johansen,
	Jonathan Corbet, Kees Cook, Michael Kerrisk,
	Mickaël Salaün, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Stephen Smalley, Tejun Heo,
	Tetsuo Handa, Thomas Graf, Tycho Andersen, Will Drewry,
	kernel-hardening, linux-api, linux-fsdevel, linux-security-module,
	netdev
In-Reply-To: <20190721213116.23476-11-mic@digikod.net>

On 7/21/19 2:31 PM, Mickaël Salaün wrote:
> This documentation can be built with the Sphinx framework.
> 
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: James Morris <jmorris@namei.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Serge E. Hallyn <serge@hallyn.com>
> ---
> 
> Changes since v9:
> * update with expected attach type and expected attach triggers
> 
> Changes since v8:
> * remove documentation related to chaining and tagging according to this
>   patch series
> 
> Changes since v7:
> * update documentation according to the Landlock revamp
> 
> Changes since v6:
> * add a check for ctx->event
> * rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE
> * rename Landlock version to ABI to better reflect its purpose and add a
>   dedicated changelog section
> * update tables
> * relax no_new_privs recommendations
> * remove ABILITY_WRITE related functions
> * reword rule "appending" to "prepending" and explain it
> * cosmetic fixes
> 
> Changes since v5:
> * update the rule hierarchy inheritance explanation
> * briefly explain ctx->arg2
> * add ptrace restrictions
> * explain EPERM
> * update example (subtype)
> * use ":manpage:"
> ---
>  Documentation/security/index.rst           |   1 +
>  Documentation/security/landlock/index.rst  |  20 +++
>  Documentation/security/landlock/kernel.rst |  99 ++++++++++++++
>  Documentation/security/landlock/user.rst   | 147 +++++++++++++++++++++
>  4 files changed, 267 insertions(+)
>  create mode 100644 Documentation/security/landlock/index.rst
>  create mode 100644 Documentation/security/landlock/kernel.rst
>  create mode 100644 Documentation/security/landlock/user.rst


> diff --git a/Documentation/security/landlock/kernel.rst b/Documentation/security/landlock/kernel.rst
> new file mode 100644
> index 000000000000..7d1e06d544bf
> --- /dev/null
> +++ b/Documentation/security/landlock/kernel.rst
> @@ -0,0 +1,99 @@
> +==============================
> +Landlock: kernel documentation
> +==============================
> +
> +eBPF properties
> +===============
> +
> +To get an expressive language while still being safe and small, Landlock is
> +based on eBPF. Landlock should be usable by untrusted processes and must
> +therefore expose a minimal attack surface. The eBPF bytecode is minimal,
> +powerful, widely used and designed to be used by untrusted applications. Thus,
> +reusing the eBPF support in the kernel enables a generic approach while
> +minimizing new code.
> +
> +An eBPF program has access to an eBPF context containing some fields used to
> +inspect the current object. These arguments can be used directly (e.g. cookie)
> +or passed to helper functions according to their types (e.g. inode pointer). It
> +is then possible to do complex access checks without race conditions or
> +inconsistent evaluation (i.e.  `incorrect mirroring of the OS code and state
> +<https://www.ndss-symposium.org/ndss2003/traps-and-pitfalls-practical-problems-system-call-interposition-based-security-tools/>`_).
> +
> +A Landlock hook describes a particular access type.  For now, there is two

                                                                 there are two

> +hooks dedicated to filesystem related operations: LANDLOCK_HOOK_FS_PICK and
> +LANDLOCK_HOOK_FS_WALK.  A Landlock program is tied to one hook.  This makes it
> +possible to statically check context accesses, potentially performed by such
> +program, and hence prevents kernel address leaks and ensure the right use of

                                                        ensures

> +hook arguments with eBPF functions.  Any user can add multiple Landlock
> +programs per Landlock hook.  They are stacked and evaluated one after the
> +other, starting from the most recent program, as seccomp-bpf does with its
> +filters.  Underneath, a hook is an abstraction over a set of LSM hooks.
> +
> +
> +Guiding principles
> +==================
> +
> +Unprivileged use
> +----------------
> +
> +* Landlock helpers and context should be usable by any unprivileged and
> +  untrusted program while following the system security policy enforced by
> +  other access control mechanisms (e.g. DAC, LSM).
> +
> +
> +Landlock hook and context
> +-------------------------
> +
> +* A Landlock hook shall be focused on access control on kernel objects instead
> +  of syscall filtering (i.e. syscall arguments), which is the purpose of
> +  seccomp-bpf.
> +* A Landlock context provided by a hook shall express the minimal and more
> +  generic interface to control an access for a kernel object.
> +* A hook shall guaranty that all the BPF function calls from a program are> +  safe.  Thus, the related Landlock context arguments shall always be of the
> +  same type for a particular hook.  For example, a network hook could share
> +  helpers with a file hook because of UNIX socket.  However, the same helpers
> +  may not be compatible for a file system handle and a net handle.
> +* Multiple hooks may use the same context interface.
> +
> +
> +Landlock helpers
> +----------------
> +
> +* Landlock helpers shall be as generic as possible while at the same time being
> +  as simple as possible and following the syscall creation principles (cf.
> +  *Documentation/adding-syscalls.txt*).
> +* The only behavior change allowed on a helper is to fix a (logical) bug to
> +  match the initial semantic.
> +* Helpers shall be reentrant, i.e. only take inputs from arguments (e.g. from
> +  the BPF context), to enable a hook to use a cache.  Future program options
> +  might change this cache behavior.
> +* It is quite easy to add new helpers to extend Landlock.  The main concern
> +  should be about the possibility to leak information from the kernel that may
> +  not be accessible otherwise (i.e. side-channel attack).
> +
> +
> +Questions and answers
> +=====================
> +
> +Why not create a custom hook for each kind of action?
> +-----------------------------------------------------
> +
> +Landlock programs can handle these checks.  Adding more exceptions to the
> +kernel code would lead to more code complexity.  A decision to ignore a kind of
> +action can and should be done at the beginning of a Landlock program.
> +
> +
> +Why a program does not return an errno or a kill code?
> +------------------------------------------------------
> +
> +seccomp filters can return multiple kind of code, including an errno value or a

                                       kinds

> +kill signal, which may be convenient for access control.  Those return codes
> +are hardwired in the userland ABI.  Instead, Landlock's approach is to return a
> +boolean to allow or deny an action, which is much simpler and more generic.
> +Moreover, we do not really have a choice because, unlike to seccomp, Landlock
> +programs are not enforced at the syscall entry point but may be executed at any
> +point in the kernel (through LSM hooks) where an errno return code may not make
> +sense.  However, with this simple ABI and with the ability to call helpers,
> +Landlock may gain features similar to seccomp-bpf in the future while being
> +compatible with previous programs.
> diff --git a/Documentation/security/landlock/user.rst b/Documentation/security/landlock/user.rst
> new file mode 100644
> index 000000000000..14c4f3b377bd
> --- /dev/null
> +++ b/Documentation/security/landlock/user.rst
> @@ -0,0 +1,147 @@
> +================================
> +Landlock: userland documentation
> +================================
> +
> +Landlock programs
> +=================
> +
> +eBPF programs are used to create security programs.  They are contained and can
> +call only a whitelist of dedicated functions. Moreover, they can only loop
> +under strict conditions, which protects from denial of service.  More
> +information on BPF can be found in *Documentation/networking/filter.txt*.
> +
> +
> +Writing a program
> +-----------------
> +
> +To enforce a security policy, a thread first needs to create a Landlock program.
> +The easiest way to write an eBPF program depicting a security program is to write
> +it in the C language.  As described in *samples/bpf/README.rst*, LLVM can
> +compile such programs.  Files *samples/bpf/landlock1_kern.c* and those in
> +*tools/testing/selftests/landlock/* can be used as examples.
> +
> +Once the eBPF program is created, the next step is to create the metadata
> +describing the Landlock program.  This metadata includes an expected attach type which
> +contains the hook type to which the program is tied, and expected attach
> +triggers which identify the actions for which the program should be run.
> +
> +A hook is a policy decision point which exposes the same context type for
> +each program evaluation.
> +
> +A Landlock hook describes the kind of kernel object for which a program will be
> +triggered to allow or deny an action.  For example, the hook
> +BPF_LANDLOCK_FS_PICK can be triggered every time a landlocked thread performs a
> +set of action related to the filesystem (e.g. open, read, write, mount...).

          actions

> +This actions are identified by the `triggers` bitfield.
> +
> +The next step is to fill a :c:type:`struct bpf_load_program_attr
> +<bpf_load_program_attr>` with BPF_PROG_TYPE_LANDLOCK_HOOK, the expected attach
> +type and other BPF program metadata.  This bpf_attr must then be passed to the
> +:manpage:`bpf(2)` syscall alongside the BPF_PROG_LOAD command.  If everything
> +is deemed correct by the kernel, the thread gets a file descriptor referring to
> +this program.
> +
> +In the following code, the *insn* variable is an array of BPF instructions
> +which can be extracted from an ELF file as is done in bpf_load_file() from
> +*samples/bpf/bpf_load.c*.

A little confusing.  Is there a mixup of <insn> and <insns>?

> +
> +.. code-block:: c
> +
> +    int prog_fd;
> +    struct bpf_load_program_attr load_attr;
> +
> +    memset(&load_attr, 0, sizeof(struct bpf_load_program_attr));
> +    load_attr.prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK;
> +    load_attr.expected_attach_type = BPF_LANDLOCK_FS_PICK;
> +    load_attr.expected_attach_triggers = LANDLOCK_TRIGGER_FS_PICK_OPEN;
> +    load_attr.insns = insns;
> +    load_attr.insns_cnt = sizeof(insn) / sizeof(struct bpf_insn);
> +    load_attr.license = "GPL";
> +
> +    prog_fd = bpf_load_program_xattr(&load_attr, log_buf, log_buf_sz);
> +    if (prog_fd == -1)
> +        exit(1);
> +
> +
> +Enforcing a program
> +-------------------
> +
> +Once the Landlock program has been created or received (e.g. through a UNIX
> +socket), the thread willing to sandbox itself (and its future children) should
> +perform the following two steps.
> +
> +The thread should first request to never be allowed to get new privileges with a
> +call to :manpage:`prctl(2)` and the PR_SET_NO_NEW_PRIVS option.  More
> +information can be found in *Documentation/prctl/no_new_privs.txt*.
> +
> +.. code-block:: c
> +
> +    if (prctl(PR_SET_NO_NEW_PRIVS, 1, NULL, 0, 0))
> +        exit(1);
> +
> +A thread can apply a program to itself by using the :manpage:`seccomp(2)` syscall.
> +The operation is SECCOMP_PREPEND_LANDLOCK_PROG, the flags must be empty and the
> +*args* argument must point to a valid Landlock program file descriptor.
> +
> +.. code-block:: c
> +
> +    if (seccomp(SECCOMP_PREPEND_LANDLOCK_PROG, 0, &fd))
> +        exit(1);
> +
> +If the syscall succeeds, the program is now enforced on the calling thread and
> +will be enforced on all its subsequently created children of the thread as
> +well.  Once a thread is landlocked, there is no way to remove this security
> +policy, only stacking more restrictions is allowed.  The program evaluation is
> +performed from the newest to the oldest.
> +
> +When a syscall ask for an action on a kernel object, if this action is denied,

                  asks

> +then an EACCES errno code is returned through the syscall.
> +
> +
> +.. _inherited_programs:
> +
> +Inherited programs
> +------------------
> +
> +Every new thread resulting from a :manpage:`clone(2)` inherits Landlock program
> +restrictions from its parent.  This is similar to the seccomp inheritance as
> +described in *Documentation/prctl/seccomp_filter.txt*.
> +
> +
> +Ptrace restrictions
> +-------------------
> +
> +A landlocked process has less privileges than a non-landlocked process and must
> +then be subject to additional restrictions when manipulating another process.
> +To be allowed to use :manpage:`ptrace(2)` and related syscalls on a target
> +process, a landlocked process must have a subset of the target process programs.
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Maybe that last statement is correct, but it seems to me that it is missing something.

> +
> +
> +Landlock structures and constants
> +=================================
> +
> +Hook types
> +----------
> +
> +.. kernel-doc:: include/uapi/linux/landlock.h
> +    :functions: landlock_hook_type
> +
> +
> +Contexts
> +--------
> +
> +.. kernel-doc:: include/uapi/linux/landlock.h
> +    :functions: landlock_ctx_fs_pick landlock_ctx_fs_walk landlock_ctx_fs_get
> +
> +
> +Triggers for fs_pick
> +--------------------
> +
> +.. kernel-doc:: include/uapi/linux/landlock.h
> +    :functions: landlock_triggers
> +
> +
> +Additional documentation
> +========================
> +
> +See https://landlock.io
> 


-- 
~Randy

^ permalink raw reply

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Alexei Starovoitov @ 2019-07-31  1:52 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel T. Lee, Stephen Hemminger, David Ahern,
	Jesper Dangaard Brouer, John Fastabend, Daniel Borkmann,
	Alexei Starovoitov, David Miller, netdev
In-Reply-To: <20190730182144.1355bf50@cakuba.netronome.com>

On Tue, Jul 30, 2019 at 06:21:44PM -0700, Jakub Kicinski wrote:
> > > Duplicating the same features in bpftool will only diminish the
> > > incentive for moving iproute2 to libbpf.   
> > 
> > not at all. why do you think so?
> 
> Because iproute2's BPF has fallen behind so the simplest thing is to
> just contribute to bpftool. But iproute2 is the tool set for Linux
> networking, we can't let it bit rot :(

where were you when a lot of libbpf was copy pasted into iproute2 ?!
Now it diverged a lot and it's difficult to move iproute2 back to the main
train which is libbpf.
Same thing with at least 5 copy-pastes of samples/bpf/bpf_load.c
that spawned a bunch of different bpf loaders.

> IMHO vaguely competent users of Linux networking will know ip link.
> If they are not vaguely competent, they should not attach XDP programs
> to interfaces by hand...

I'm a prime example of moderately competent linux user who
doesn't know iproute2. I'm not joking.
I don't know tc syntax and always use my cheat sheet of ip/tc commands
to do anything.

> > 
> > bpftool must be able to introspect every aspect of bpf programming.
> > That includes detaching and attaching anywhere.
> > Anyone doing 'bpftool p s' should be able to switch off particular
> > prog id without learning ten different other tools.
> 
> I think the fact that we already have an implementation in iproute2,
> which is at the risk of bit rot is more important to me that the
> hypothetical scenario where everyone knows to just use bpftool (for
> XDP, for TC it's still iproute2 unless there's someone crazy enough 
> to reimplement the TC functionality :))

I think you're missing the point that iproute2 is still necessary
to configure it.
bpftool should be able to attach/detach from anything.
But configuring that thing (whether it's cgroup or tc/xdp) is
a job of corresponding apis and tools.

> I'm not sure we can settle our differences over email :)
> I have tremendous respect for all the maintainers I CCed here, 
> if nobody steps up to agree with me I'll concede the bpftool net
> battle entirely :)

we can keep arguing forever. Respect to ease-of-use only comes
from the pain of operational experience. I don't think I can
convey that pain in the email either.

^ permalink raw reply

* [PATCH bpf 0/2] bpf: fix bug in x64 JIT
From: Alexei Starovoitov @ 2019-07-31  1:38 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, bpf, kernel-team

Fix bug in computation of the jump offset to 1st instruction in BPF JIT
and add 2 tests.

Alexei Starovoitov (2):
  bpf: fix x64 JIT code generation for jmp to 1st insn
  selftests/bpf: tests for jmp to 1st insn

 arch/x86/net/bpf_jit_comp.c                   |  7 +++--
 tools/testing/selftests/bpf/verifier/loops1.c | 28 +++++++++++++++++++
 2 files changed, 32 insertions(+), 3 deletions(-)

-- 
2.20.0


^ permalink raw reply

* [PATCH bpf 2/2] selftests/bpf: tests for jmp to 1st insn
From: Alexei Starovoitov @ 2019-07-31  1:38 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, bpf, kernel-team
In-Reply-To: <20190731013827.2445262-1-ast@kernel.org>

Add 2 tests that check JIT code generation to jumps to 1st insn.
1st test is similar to syzbot reproducer.
The backwards branch is never taken at runtime.
2nd test has branch to 1st insn that executes.
The test is written as two bpf functions, since it's not possible
to construct valid single bpf program that jumps to 1st insn.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/verifier/loops1.c | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tools/testing/selftests/bpf/verifier/loops1.c b/tools/testing/selftests/bpf/verifier/loops1.c
index 5e980a5ab69d..1fc4e61e9f9f 100644
--- a/tools/testing/selftests/bpf/verifier/loops1.c
+++ b/tools/testing/selftests/bpf/verifier/loops1.c
@@ -159,3 +159,31 @@
 	.errstr = "loop detected",
 	.prog_type = BPF_PROG_TYPE_TRACEPOINT,
 },
+{
+	"not-taken loop with back jump to 1st insn",
+	.insns = {
+	BPF_MOV64_IMM(BPF_REG_0, 123),
+	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 4, -2),
+	BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_XDP,
+	.retval = 123,
+},
+{
+	"taken loop with back jump to 1st insn",
+	.insns = {
+	BPF_MOV64_IMM(BPF_REG_1, 10),
+	BPF_MOV64_IMM(BPF_REG_2, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_1),
+	BPF_ALU64_IMM(BPF_SUB, BPF_REG_1, 1),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, -3),
+	BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+	BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_XDP,
+	.retval = 55,
+},
-- 
2.20.0


^ permalink raw reply related

* [PATCH bpf 1/2] bpf: fix x64 JIT code generation for jmp to 1st insn
From: Alexei Starovoitov @ 2019-07-31  1:38 UTC (permalink / raw)
  To: davem; +Cc: daniel, netdev, bpf, kernel-team
In-Reply-To: <20190731013827.2445262-1-ast@kernel.org>

Introduction of bounded loops exposed old bug in x64 JIT.
JIT maintains the array of offsets to the end of all instructions to
compute jmp offsets.
addrs[0] - offset of the end of the 1st insn (that includes prologue).
addrs[1] - offset of the end of the 2nd insn.
JIT didn't keep the offset of the beginning of the 1st insn,
since classic BPF didn't have backward jumps and valid extended BPF
couldn't have a branch to 1st insn, because it didn't allow loops.
With bounded loops it's possible to construct a valid program that
jumps backwards to the 1st insn.
Fix JIT by computing:
addrs[0] - offset of the end of prologue == start of the 1st insn.
addrs[1] - offset of the end of 1st insn.

Reported-by: syzbot+35101610ff3e83119b1b@syzkaller.appspotmail.com
Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
Fixes: 0a14842f5a3c ("net: filter: Just In Time compiler for x86-64")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/net/bpf_jit_comp.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index eaaed5bfc4a4..a56c95805732 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -390,8 +390,9 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 
 	emit_prologue(&prog, bpf_prog->aux->stack_depth,
 		      bpf_prog_was_classic(bpf_prog));
+	addrs[0] = prog - temp;
 
-	for (i = 0; i < insn_cnt; i++, insn++) {
+	for (i = 1; i <= insn_cnt; i++, insn++) {
 		const s32 imm32 = insn->imm;
 		u32 dst_reg = insn->dst_reg;
 		u32 src_reg = insn->src_reg;
@@ -1105,7 +1106,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		extra_pass = true;
 		goto skip_init_addrs;
 	}
-	addrs = kmalloc_array(prog->len, sizeof(*addrs), GFP_KERNEL);
+	addrs = kmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL);
 	if (!addrs) {
 		prog = orig_prog;
 		goto out_addrs;
@@ -1115,7 +1116,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 	 * Before first pass, make a rough estimation of addrs[]
 	 * each BPF instruction is translated to less than 64 bytes
 	 */
-	for (proglen = 0, i = 0; i < prog->len; i++) {
+	for (proglen = 0, i = 0; i <= prog->len; i++) {
 		proglen += 64;
 		addrs[i] = proglen;
 	}
-- 
2.20.0


^ permalink raw reply related

* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Andrew Lunn @ 2019-07-31  1:36 UTC (permalink / raw)
  To: Tao Ren
  Cc: Vladimir Oltean, Florian Fainelli, Heiner Kallweit,
	David S . Miller, Arun Parameswaran, Justin Chen, netdev, lkml,
	Andrew Jeffery, openbmc@lists.ozlabs.org
In-Reply-To: <e8f85ef3-1216-8efb-a54d-84426234fe82@fb.com>

> The INTF_SEL pins report correct mode (RGMII-Fiber) on my machine,
> but there are 2 "sub-modes" (1000Base-X and 100Base-FX) and I
> couldn't find a proper/safe way to auto-detect which "sub-mode" is
> active. The datasheet just describes instructions to enable a
> specific mode, but it doesn't say 1000Base-X/100Base-FX mode will be
> auto-selected. And that's why I came up with the patch to specify
> 1000Base-X mode.

Fibre does not perform any sort of auto-negotiation. I assume you have
an SFP connected? When using PHYLINK, the sfp driver will get the
supported baud rate from SFP EEPROM to determine what speed could be
used. However, there is currently no mainline support for having a
chain MAC-PHY-SFP. For that you need Russells out of tree patches.

      Andrew

^ permalink raw reply

* [PATCH v2 net] hv_sock: Fix hang when a connection is closed
From: Dexuan Cui @ 2019-07-31  1:25 UTC (permalink / raw)
  To: Sunil Muthuswamy, David Miller, netdev@vger.kernel.org
  Cc: KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
	sashal@kernel.org, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, olaf@aepfle.de, apw@canonical.com,
	jasowang@redhat.com, vkuznets, marcelo.cerri@canonical.com

There is a race condition for an established connection that is being closed
by the guest: the refcnt is 4 at the end of hvs_release() (Note: here the
'remove_sock' is false):

1 for the initial value;
1 for the sk being in the bound list;
1 for the sk being in the connected list;
1 for the delayed close_work.

After hvs_release() finishes, __vsock_release() -> sock_put(sk) *may*
decrease the refcnt to 3.

Concurrently, hvs_close_connection() runs in another thread:
  calls vsock_remove_sock() to decrease the refcnt by 2;
  call sock_put() to decrease the refcnt to 0, and free the sk;
  next, the "release_sock(sk)" may hang due to use-after-free.

In the above, after hvs_release() finishes, if hvs_close_connection() runs
faster than "__vsock_release() -> sock_put(sk)", then there is not any issue,
because at the beginning of hvs_close_connection(), the refcnt is still 4.

The issue can be resolved if an extra reference is taken when the
connection is established.

Fixes: a9eeb998c28d ("hv_sock: Add support for delayed close")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Cc: stable@vger.kernel.org

---

Changes in v2: 
  Changed the location of the sock_hold() call. 
  Updated the changelog accordingly.

  Thanks Sunil for the suggestion!

With the proper kernel debugging options enabled, first a warning can
appear:

kworker/1:0/4467 is freeing memory ..., with a lock still held there!
stack backtrace:
Workqueue: events vmbus_onmessage_work [hv_vmbus]
Call Trace:
 dump_stack+0x67/0x90
 debug_check_no_locks_freed.cold.52+0x78/0x7d
 slab_free_freelist_hook+0x85/0x140
 kmem_cache_free+0xa5/0x380
 __sk_destruct+0x150/0x260
 hvs_close_connection+0x24/0x30 [hv_sock]
 vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
 process_one_work+0x241/0x600
 worker_thread+0x3c/0x390
 kthread+0x11b/0x140
 ret_from_fork+0x24/0x30

and then the following release_sock(sk) can hang:

watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:4467]
...
irq event stamp: 62890
CPU: 1 PID: 4467 Comm: kworker/1:0 Tainted: G        W         5.2.0+ #39
Workqueue: events vmbus_onmessage_work [hv_vmbus]
RIP: 0010:queued_spin_lock_slowpath+0x2b/0x1e0
...
Call Trace:
 do_raw_spin_lock+0xab/0xb0
 release_sock+0x19/0xb0
 vmbus_onmessage_work+0x1d/0x30 [hv_vmbus]
 process_one_work+0x241/0x600
 worker_thread+0x3c/0x390
 kthread+0x11b/0x140
 ret_from_fork+0x24/0x30

 net/vmw_vsock/hyperv_transport.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index f2084e3f7aa4..9d864ebeb7b3 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -312,6 +312,11 @@ static void hvs_close_connection(struct vmbus_channel *chan)
 	lock_sock(sk);
 	hvs_do_close_lock_held(vsock_sk(sk), true);
 	release_sock(sk);
+
+	/* Release the refcnt for the channel that's opened in
+	 * hvs_open_connection().
+	 */
+	sock_put(sk);
 }

 static void hvs_open_connection(struct vmbus_channel *chan)
@@ -407,6 +412,9 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 	}

 	set_per_channel_state(chan, conn_from_host ? new : sk);
+
+	/* This reference will be dropped by hvs_close_connection(). */
+	sock_hold(conn_from_host ? new : sk);
 	vmbus_set_chn_rescind_callback(chan, hvs_close_connection);

 	/* Set the pending send size to max packet size to always get
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Jakub Kicinski @ 2019-07-31  1:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel T. Lee, Stephen Hemminger, David Ahern,
	Jesper Dangaard Brouer, John Fastabend, Daniel Borkmann,
	Alexei Starovoitov, David Miller, netdev
In-Reply-To: <20190731002338.d4lp2grsmm3aaav3@ast-mbp>

On Tue, 30 Jul 2019 17:23:39 -0700, Alexei Starovoitov wrote:
> On Tue, Jul 30, 2019 at 05:07:25PM -0700, Jakub Kicinski wrote:
> > Nothing meaning you disagree it's duplicated effort and unnecessary 
> > LoC the community has to maintain, review, test..?  
> 
> I don't see duplicated effort.

Could you walk me through a scenario where, say, a new xdp attachment
flag is added? How can there be support in both bpftool and iproute2 
for specifying it without modifying both?

How can we say we want to make iproute2 use libbpf instead of it's own
library to avoid duplicated effort on the back end, and at the same time
claim having two tools do the same thing is no duplication 🤯

> > Duplicating the same features in bpftool will only diminish the
> > incentive for moving iproute2 to libbpf.   
> 
> not at all. why do you think so?

Because iproute2's BPF has fallen behind so the simplest thing is to
just contribute to bpftool. But iproute2 is the tool set for Linux
networking, we can't let it bit rot :(

Admittedly that's just me reading the tea leaves.

> > And for folks who deal
> > with a wide variety of customers, often novices maintaining two
> > ways of doing the same thing is a hassle :(  
> 
> It's not the same thing.
> I'm talking about my experience dealing with 'wide variety of bpf customers'.
> They only have a fraction of their time to learn one tool.
> Making every bpf customer learn ten tools is not an option.

IMHO vaguely competent users of Linux networking will know ip link.
If they are not vaguely competent, they should not attach XDP programs
to interfaces by hand...

> > Let's just put the two commands next to each other:
> > 
> >        ip link set xdp $PROG dev $DEV
> > 
> > bpftool net attach xdp $PROG dev $DEV
> > 
> > Are they that different?  
> 
> yes.
> they're different tools with they own upgrade/rollout cycle

I think this came up when bpftool net was added and I never really
understood it.

> > > If bpftool was taught to do equivalent of 'ip link' that would be
> > > very different story and I would be opposed to that.  
> > 
> > Yes, that'd be pretty clear cut, only the XDP stuff is a bit more 
> > of a judgement call.  
> 
> bpftool must be able to introspect every aspect of bpf programming.
> That includes detaching and attaching anywhere.
> Anyone doing 'bpftool p s' should be able to switch off particular
> prog id without learning ten different other tools.

I think the fact that we already have an implementation in iproute2,
which is at the risk of bit rot is more important to me that the
hypothetical scenario where everyone knows to just use bpftool (for
XDP, for TC it's still iproute2 unless there's someone crazy enough 
to reimplement the TC functionality :))

I'm not sure we can settle our differences over email :)
I have tremendous respect for all the maintainers I CCed here, 
if nobody steps up to agree with me I'll concede the bpftool net
battle entirely :)

^ permalink raw reply

* Re: [PATCH v4 net-next 15/19] ionic: Add netdev-event handling
From: Shannon Nelson @ 2019-07-31  1:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, davem
In-Reply-To: <20190730163947.3e730f25@hermes.lan>

On 7/30/19 4:39 PM, Stephen Hemminger wrote:
> On Mon, 22 Jul 2019 14:40:19 -0700
> Shannon Nelson <snelson@pensando.io> wrote:
>
>> +
>> +static void ionic_lif_set_netdev_info(struct lif *lif)
>> +{
>> +	struct ionic_admin_ctx ctx = {
>> +		.work = COMPLETION_INITIALIZER_ONSTACK(ctx.work),
>> +		.cmd.lif_setattr = {
>> +			.opcode = CMD_OPCODE_LIF_SETATTR,
>> +			.index = cpu_to_le16(lif->index),
>> +			.attr = IONIC_LIF_ATTR_NAME,
>> +		},
>> +	};
>> +
>> +	strlcpy(ctx.cmd.lif_setattr.name, lif->netdev->name,
>> +		sizeof(ctx.cmd.lif_setattr.name));
>> +
>> +	dev_info(lif->ionic->dev, "NETDEV_CHANGENAME %s %s\n",
>> +		 lif->name, ctx.cmd.lif_setattr.name);
>> +
> There is already a kernel message for this. Repeating the same thing in the
> driver is redundant.

True, except for the lif name information.  But since that really is 
debugging information, and I was getting tired of seeing those redundant 
messages, I was planning to remove them soon anyway. Thanks for the 
extra nudge.

sln


^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Andrii Nakryiko @ 2019-07-31  1:00 UTC (permalink / raw)
  To: Song Liu
  Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <4AB53FC1-5390-4BC7-83B4-7DDBAFD78ABC@fb.com>

On Tue, Jul 30, 2019 at 5:39 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> >
> > This patch implements the core logic for BPF CO-RE offsets relocations.
> > Every instruction that needs to be relocated has corresponding
> > bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> > to match recorded "local" relocation spec against potentially many
> > compatible "target" types, creating corresponding spec. Details of the
> > algorithm are noted in corresponding comments in the code.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
> > tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
> > tools/lib/bpf/libbpf.h |   1 +
> > 2 files changed, 909 insertions(+), 7 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c

[...]

Please trim irrelevant parts. It doesn't matter with desktop Gmail,
but pretty much everywhere else is very hard to work with.

> > +
> > +     for (i = 1; i < spec->raw_len; i++) {
> > +             t = skip_mods_and_typedefs(btf, id, &id);
> > +             if (!t)
> > +                     return -EINVAL;
> > +
> > +             access_idx = spec->raw_spec[i];
> > +
> > +             if (btf_is_composite(t)) {
> > +                     const struct btf_member *m = (void *)(t + 1);
>
> Why (void *) instead of (const struct btf_member *)? There are a few more
> in the rest of the patch.
>

I just picked the most succinct and non-repetitive form. It's
immediately apparent which type it's implicitly converted to, so I
felt there is no need to repeat it. Also, just (void *) is much
shorter. :)


> > +                     __u32 offset;
> > +
> > +                     if (access_idx >= BTF_INFO_VLEN(t->info))
> > +                             return -EINVAL;
> > +

[...]

^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-31  0:39 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, andrii.nakryiko@gmail.com,
	Kernel Team
In-Reply-To: <20190730195408.670063-3-andriin@fb.com>



> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
> 
> This patch implements the core logic for BPF CO-RE offsets relocations.
> Every instruction that needs to be relocated has corresponding
> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
> to match recorded "local" relocation spec against potentially many
> compatible "target" types, creating corresponding spec. Details of the
> algorithm are noted in corresponding comments in the code.
> 
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> ---
> tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
> tools/lib/bpf/libbpf.h |   1 +
> 2 files changed, 909 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ead915aec349..75da90928257 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -38,6 +38,7 @@
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <sys/vfs.h>
> +#include <sys/utsname.h>
> #include <tools/libc_compat.h>
> #include <libelf.h>
> #include <gelf.h>
> @@ -47,6 +48,7 @@
> #include "btf.h"
> #include "str_error.h"
> #include "libbpf_internal.h"
> +#include "hashmap.h"
> 
> #ifndef EM_BPF
> #define EM_BPF 247
> @@ -1015,17 +1017,22 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict)
> 	return 0;
> }
> 
> -static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> -						     __u32 id)
> +static const struct btf_type *
> +skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
> {
> 	const struct btf_type *t = btf__type_by_id(btf, id);
> 
> +	if (res_id)
> +		*res_id = id;
> +
> 	while (true) {
> 		switch (BTF_INFO_KIND(t->info)) {
> 		case BTF_KIND_VOLATILE:
> 		case BTF_KIND_CONST:
> 		case BTF_KIND_RESTRICT:
> 		case BTF_KIND_TYPEDEF:
> +			if (res_id)
> +				*res_id = t->type;
> 			t = btf__type_by_id(btf, t->type);
> 			break;
> 		default:
> @@ -1044,7 +1051,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
> static bool get_map_field_int(const char *map_name, const struct btf *btf,
> 			      const struct btf_type *def,
> 			      const struct btf_member *m, __u32 *res) {
> -	const struct btf_type *t = skip_mods_and_typedefs(btf, m->type);
> +	const struct btf_type *t = skip_mods_and_typedefs(btf, m->type, NULL);
> 	const char *name = btf__name_by_offset(btf, m->name_off);
> 	const struct btf_array *arr_info;
> 	const struct btf_type *arr_t;
> @@ -1110,7 +1117,7 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj,
> 		return -EOPNOTSUPP;
> 	}
> 
> -	def = skip_mods_and_typedefs(obj->btf, var->type);
> +	def = skip_mods_and_typedefs(obj->btf, var->type, NULL);
> 	if (BTF_INFO_KIND(def->info) != BTF_KIND_STRUCT) {
> 		pr_warning("map '%s': unexpected def kind %u.\n",
> 			   map_name, BTF_INFO_KIND(var->info));
> @@ -2292,6 +2299,893 @@ bpf_program_reloc_btf_ext(struct bpf_program *prog, struct bpf_object *obj,
> 	return 0;
> }
> 
> +#define BPF_CORE_SPEC_MAX_LEN 64
> +
> +/* represents BPF CO-RE field or array element accessor */
> +struct bpf_core_accessor {
> +	__u32 type_id;		/* struct/union type or array element type */
> +	__u32 idx;		/* field index or array index */
> +	const char *name;	/* field name or NULL for array accessor */
> +};
> +
> +struct bpf_core_spec {
> +	const struct btf *btf;
> +	/* high-level spec: named fields and array indices only */
> +	struct bpf_core_accessor spec[BPF_CORE_SPEC_MAX_LEN];
> +	/* high-level spec length */
> +	int len;
> +	/* raw, low-level spec: 1-to-1 with accessor spec string */
> +	int raw_spec[BPF_CORE_SPEC_MAX_LEN];
> +	/* raw spec length */
> +	int raw_len;
> +	/* field byte offset represented by spec */
> +	__u32 offset;
> +};
> +
> +static bool str_is_empty(const char *s)
> +{
> +	return !s || !s[0];
> +}
> +
> +static int btf_kind(const struct btf_type *t)
> +{
> +	return BTF_INFO_KIND(t->info);
> +}
> +
> +static bool btf_is_composite(const struct btf_type *t)
> +{
> +	int kind = btf_kind(t);
> +
> +	return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
> +}
> +
> +static bool btf_is_array(const struct btf_type *t)
> +{
> +	return btf_kind(t) == BTF_KIND_ARRAY;
> +}
> +
> +/* 
> + * Turn bpf_offset_reloc into a low- and high-level spec representation,
> + * validating correctness along the way, as well as calculating resulting
> + * field offset (in bytes), specified by accessor string. Low-level spec
> + * captures every single level of nestedness, including traversing anonymous
> + * struct/union members. High-level one only captures semantically meaningful
> + * "turning points": named fields and array indicies.
> + * E.g., for this case:
> + *
> + *   struct sample {
> + *       int __unimportant;
> + *       struct {
> + *           int __1;
> + *           int __2;
> + *           int a[7];
> + *       };
> + *   };
> + *
> + *   struct sample *s = ...;
> + *
> + *   int x = &s->a[3]; // access string = '0:1:2:3'
> + *
> + * Low-level spec has 1:1 mapping with each element of access string (it's
> + * just a parsed access string representation): [0, 1, 2, 3].
> + *
> + * High-level spec will capture only 3 points:
> + *   - intial zero-index access by pointer (&s->... is the same as &s[0]...);
> + *   - field 'a' access (corresponds to '2' in low-level spec);
> + *   - array element #3 access (corresponds to '3' in low-level spec).
> + *
> + */
> +static int bpf_core_spec_parse(const struct btf *btf,
> +			       __u32 type_id,
> +			       const char *spec_str,
> +			       struct bpf_core_spec *spec)
> +{
> +	int access_idx, parsed_len, i;
> +	const struct btf_type *t;
> +	const char *name;
> +	__u32 id;
> +	__s64 sz;
> +
> +	if (str_is_empty(spec_str) || *spec_str == ':')
> +		return -EINVAL;
> +
> +	memset(spec, 0, sizeof(*spec));
> +	spec->btf = btf;
> +
> +	/* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */
> +	while (*spec_str) {
> +		if (*spec_str == ':')
> +			++spec_str;
> +		if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
> +			return -EINVAL;
> +		if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> +			return -E2BIG;
> +		spec_str += parsed_len;
> +		spec->raw_spec[spec->raw_len++] = access_idx;
> +	}
> +
> +	if (spec->raw_len == 0)
> +		return -EINVAL;
> +
> +	/* first spec value is always reloc type array index */
> +	t = skip_mods_and_typedefs(btf, type_id, &id);
> +	if (!t)
> +		return -EINVAL;
> +
> +	access_idx = spec->raw_spec[0];
> +	spec->spec[0].type_id = id;
> +	spec->spec[0].idx = access_idx;
> +	spec->len++;
> +
> +	sz = btf__resolve_size(btf, id);
> +	if (sz < 0)
> +		return sz;
> +	spec->offset = access_idx * sz;
> +
> +	for (i = 1; i < spec->raw_len; i++) {
> +		t = skip_mods_and_typedefs(btf, id, &id);
> +		if (!t)
> +			return -EINVAL;
> +
> +		access_idx = spec->raw_spec[i];
> +
> +		if (btf_is_composite(t)) {
> +			const struct btf_member *m = (void *)(t + 1);

Why (void *) instead of (const struct btf_member *)? There are a few more
in the rest of the patch. 

> +			__u32 offset;
> +
> +			if (access_idx >= BTF_INFO_VLEN(t->info))
> +				return -EINVAL;
> +
> +			m = &m[access_idx];
> +
> +			if (BTF_INFO_KFLAG(t->info)) {
> +				if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> +					return -EINVAL;
> +				offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> +			} else {
> +				offset = m->offset;
> +			}
> +			if (m->offset % 8)
> +				return -EINVAL;
> +			spec->offset += offset / 8;
> +
> +			if (m->name_off) {
> +				name = btf__name_by_offset(btf, m->name_off);
> +				if (str_is_empty(name))
> +					return -EINVAL;
> +
> +				spec->spec[spec->len].type_id = id;
> +				spec->spec[spec->len].idx = access_idx;
> +				spec->spec[spec->len].name = name;
> +				spec->len++;
> +			}
> +
> +			id = m->type;
> +		} else if (btf_is_array(t)) {
> +			const struct btf_array *a = (void *)(t + 1);
> +
> +			t = skip_mods_and_typedefs(btf, a->type, &id);
> +			if (!t || access_idx >= a->nelems)
> +				return -EINVAL;
> +
> +			spec->spec[spec->len].type_id = id;
> +			spec->spec[spec->len].idx = access_idx;
> +			spec->len++;
> +
> +			sz = btf__resolve_size(btf, id);
> +			if (sz < 0)
> +				return sz;
> +			spec->offset += access_idx * sz;
> +		} else {
> +			pr_warning("relo for [%u] %s (at idx %d) captures type [%d] of unexpected kind %d\n",
> +				   type_id, spec_str, i, id, btf_kind(t));
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/* Given 'some_struct_name___with_flavor' return the length of a name prefix
> + * before last triple underscore. Struct name part after last triple
> + * underscore is ignored by BPF CO-RE relocation during relocation matching.
> + */
> +static size_t bpf_core_essential_name_len(const char *name)
> +{
> +	size_t n = strlen(name);
> +	int i = n - 3;
> +
> +	while (i > 0) {
> +		if (name[i] == '_' && name[i + 1] == '_' && name[i + 2] == '_')
> +			return i;
> +		i--;
> +	}
> +	return n;
> +}
> +
> +/* dynamically sized list of type IDs */
> +struct ids_vec {
> +	__u32 *data;
> +	int len;
> +};
> +
> +static void bpf_core_free_cands(struct ids_vec *cand_ids)
> +{
> +	free(cand_ids->data);
> +	free(cand_ids);
> +}
> +
> +static struct ids_vec *bpf_core_find_cands(const struct btf *local_btf,
> +					   __u32 local_type_id,
> +					   const struct btf *targ_btf)
> +{
> +	size_t local_essent_len, targ_essent_len;
> +	const char *local_name, *targ_name;
> +	const struct btf_type *t;
> +	struct ids_vec *cand_ids;
> +	__u32 *new_ids;
> +	int i, err, n;
> +
> +	t = btf__type_by_id(local_btf, local_type_id);
> +	if (!t)
> +		return ERR_PTR(-EINVAL);
> +
> +	local_name = btf__name_by_offset(local_btf, t->name_off);
> +	if (str_is_empty(local_name))
> +		return ERR_PTR(-EINVAL);
> +	local_essent_len = bpf_core_essential_name_len(local_name);
> +
> +	cand_ids = calloc(1, sizeof(*cand_ids));
> +	if (!cand_ids)
> +		return ERR_PTR(-ENOMEM);
> +
> +	n = btf__get_nr_types(targ_btf);
> +	for (i = 1; i <= n; i++) {
> +		t = btf__type_by_id(targ_btf, i);
> +		targ_name = btf__name_by_offset(targ_btf, t->name_off);
> +		if (str_is_empty(targ_name))
> +			continue;
> +
> +		targ_essent_len = bpf_core_essential_name_len(targ_name);
> +		if (targ_essent_len != local_essent_len)
> +			continue;
> +
> +		if (strncmp(local_name, targ_name, local_essent_len) == 0) {
> +			pr_debug("[%d] (%s): found candidate [%d] (%s)\n",
> +				 local_type_id, local_name, i, targ_name);
> +			new_ids = realloc(cand_ids->data, cand_ids->len + 1);
> +			if (!new_ids) {
> +				err = -ENOMEM;
> +				goto err_out;
> +			}
> +			cand_ids->data = new_ids;
> +			cand_ids->data[cand_ids->len++] = i;
> +		}
> +	}
> +	return cand_ids;
> +err_out:
> +	bpf_core_free_cands(cand_ids);
> +	return ERR_PTR(err);
> +}
> +
> +/* Check two types for compatibility, skipping const/volatile/restrict and
> + * typedefs, to ensure we are relocating offset to the compatible entities:
> + *   - any two STRUCTs/UNIONs are compatible and can be mixed;
> + *   - any two FWDs are compatible;
> + *   - any two PTRs are always compatible;
> + *   - for ENUMs, check sizes, names are ignored;
> + *   - for INT, size and bitness should match, signedness is ignored;
> + *   - for ARRAY, dimensionality is ignored, element types are checked for
> + *     compatibility recursively;
> + *   - everything else shouldn't be ever a target of relocation.
> + * These rules are not set in stone and probably will be adjusted as we get
> + * more experience with using BPF CO-RE relocations.
> + */
> +static int bpf_core_fields_are_compat(const struct btf *local_btf,
> +				      __u32 local_id,
> +				      const struct btf *targ_btf,
> +				      __u32 targ_id)
> +{
> +	const struct btf_type *local_type, *targ_type;
> +	__u16 kind;
> +
> +recur:
> +	local_type = skip_mods_and_typedefs(local_btf, local_id, &local_id);
> +	targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> +	if (!local_type || !targ_type)
> +		return -EINVAL;
> +
> +	if (btf_is_composite(local_type) && btf_is_composite(targ_type))
> +		return 1;
> +	if (BTF_INFO_KIND(local_type->info) != BTF_INFO_KIND(targ_type->info))
> +		return 0;
> +
> +	kind = BTF_INFO_KIND(local_type->info);
> +	switch (kind) {
> +	case BTF_KIND_FWD:
> +	case BTF_KIND_PTR:
> +		return 1;
> +	case BTF_KIND_ENUM:
> +		return local_type->size == targ_type->size;
> +	case BTF_KIND_INT: {
> +		__u32 loc_int = *(__u32 *)(local_type + 1);
> +		__u32 targ_int = *(__u32 *)(targ_type + 1);
> +
> +		return BTF_INT_OFFSET(loc_int) == 0 &&
> +		       BTF_INT_OFFSET(targ_int) == 0 &&
> +		       local_type->size == targ_type->size &&
> +		       BTF_INT_BITS(loc_int) == BTF_INT_BITS(targ_int);
> +	}
> +	case BTF_KIND_ARRAY: {
> +		const struct btf_array *loc_a, *targ_a;
> +
> +		loc_a = (void *)(local_type + 1);
> +		targ_a = (void *)(targ_type + 1);
> +		local_id = loc_a->type;
> +		targ_id = targ_a->type;
> +		goto recur;
> +	}
> +	default:
> +		pr_warning("unexpected kind %d relocated, local [%d], target [%d]\n",
> +			   kind, local_id, targ_id);
> +		return 0;
> +	}
> +}
> +
> +/* 
> + * Given single high-level named field accessor in local type, find
> + * corresponding high-level accessor for a target type. Along the way,
> + * maintain low-level spec for target as well. Also keep updating target
> + * offset.
> + *
> + * Searching is performed through recursive exhaustive enumeration of all
> + * fields of a struct/union. If there are any anonymous (embedded)
> + * structs/unions, they are recursively searched as well. If field with
> + * desired name is found, check compatibility between local and target types,
> + * before returning result.
> + *
> + * 1 is returned, if field is found.
> + * 0 is returned if no compatible field is found.
> + * <0 is returned on error.
> + */
> +static int bpf_core_match_member(const struct btf *local_btf,
> +				 const struct bpf_core_accessor *local_acc,
> +				 const struct btf *targ_btf,
> +				 __u32 targ_id,
> +				 struct bpf_core_spec *spec,
> +				 __u32 *next_targ_id)
> +{
> +	const struct btf_type *local_type, *targ_type;
> +	const struct btf_member *local_member, *m;
> +	const char *local_name, *targ_name;
> +	__u32 local_id;
> +	int i, n, found;
> +
> +	targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id);
> +	if (!targ_type)
> +		return -EINVAL;
> +	if (!btf_is_composite(targ_type))
> +		return 0;
> +
> +	local_id = local_acc->type_id;
> +	local_type = btf__type_by_id(local_btf, local_id);
> +	local_member = (void *)(local_type + 1);
> +	local_member += local_acc->idx;
> +	local_name = btf__name_by_offset(local_btf, local_member->name_off);
> +
> +	n = BTF_INFO_VLEN(targ_type->info);
> +	m = (void *)(targ_type + 1);
> +	for (i = 0; i < n; i++, m++) {
> +		__u32 offset;
> +
> +		/* bitfield relocations not supported */
> +		if (BTF_INFO_KFLAG(targ_type->info)) {
> +			if (BTF_MEMBER_BITFIELD_SIZE(m->offset))
> +				continue;
> +			offset = BTF_MEMBER_BIT_OFFSET(m->offset);
> +		} else {
> +			offset = m->offset;
> +		}
> +		if (offset % 8)
> +			continue;
> +
> +		/* too deep struct/union/array nesting */
> +		if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> +			return -E2BIG;
> +
> +		/* speculate this member will be the good one */
> +		spec->offset += offset / 8;
> +		spec->raw_spec[spec->raw_len++] = i;
> +
> +		targ_name = btf__name_by_offset(targ_btf, m->name_off);
> +		if (str_is_empty(targ_name)) {
> +			/* embedded struct/union, we need to go deeper */
> +			found = bpf_core_match_member(local_btf, local_acc,
> +						      targ_btf, m->type,
> +						      spec, next_targ_id);
> +			if (found) /* either found or error */
> +				return found;
> +		} else if (strcmp(local_name, targ_name) == 0) {
> +			/* matching named field */
> +			struct bpf_core_accessor *targ_acc;
> +
> +			targ_acc = &spec->spec[spec->len++];
> +			targ_acc->type_id = targ_id;
> +			targ_acc->idx = i;
> +			targ_acc->name = targ_name;
> +
> +			*next_targ_id = m->type;
> +			found = bpf_core_fields_are_compat(local_btf,
> +							   local_member->type,
> +							   targ_btf, m->type);
> +			if (!found)
> +				spec->len--; /* pop accessor */
> +			return found;
> +		}
> +		/* member turned out not to be what we looked for */
> +		spec->offset -= offset / 8;
> +		spec->raw_len--;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Try to match local spec to a target type and, if successful, produce full
> + * target spec (high-level, low-level + offset).
> + */
> +static int bpf_core_spec_match(struct bpf_core_spec *local_spec,
> +			       const struct btf *targ_btf, __u32 targ_id,
> +			       struct bpf_core_spec *targ_spec)
> +{
> +	const struct btf_type *targ_type;
> +	const struct bpf_core_accessor *local_acc;
> +	struct bpf_core_accessor *targ_acc;
> +	int i, sz, matched;
> +
> +	memset(targ_spec, 0, sizeof(*targ_spec));
> +	targ_spec->btf = targ_btf;
> +
> +	local_acc = &local_spec->spec[0];
> +	targ_acc = &targ_spec->spec[0];
> +
> +	for (i = 0; i < local_spec->len; i++, local_acc++, targ_acc++) {
> +		targ_type = skip_mods_and_typedefs(targ_spec->btf, targ_id,
> +						   &targ_id);
> +		if (!targ_type)
> +			return -EINVAL;
> +
> +		if (local_acc->name) {
> +			matched = bpf_core_match_member(local_spec->btf,
> +							local_acc,
> +							targ_btf, targ_id,
> +							targ_spec, &targ_id);
> +			if (matched <= 0)
> +				return matched;
> +		} else {
> +			/* for i=0, targ_id is already treated as array element
> +			 * type (because it's the original struct), for others
> +			 * we should find array element type first
> +			 */
> +			if (i > 0) {
> +				const struct btf_array *a;
> +
> +				if (!btf_is_array(targ_type))
> +					return 0;
> +
> +				a = (void *)(targ_type + 1);
> +				if (local_acc->idx >= a->nelems)
> +					return 0;
> +				if (!skip_mods_and_typedefs(targ_btf, a->type,
> +							    &targ_id))
> +					return -EINVAL;
> +			}
> +
> +			/* too deep struct/union/array nesting */
> +			if (targ_spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
> +				return -E2BIG;
> +
> +			targ_acc->type_id = targ_id;
> +			targ_acc->idx = local_acc->idx;
> +			targ_acc->name = NULL;
> +			targ_spec->len++;
> +			targ_spec->raw_spec[targ_spec->raw_len] = targ_acc->idx;
> +			targ_spec->raw_len++;
> +
> +			sz = btf__resolve_size(targ_btf, targ_id);
> +			if (sz < 0)
> +				return sz;
> +			targ_spec->offset += local_acc->idx * sz;
> +		}
> +	}
> +
> +	return 1;
> +}
> +
> +/*
> + * Patch relocatable BPF instruction.
> + * Expected insn->imm value is provided for validation, as well as the new
> + * relocated value.
> + *
> + * Currently three kinds of BPF instructions are supported:
> + * 1. rX = <imm> (assignment with immediate operand);
> + * 2. rX += <imm> (arithmetic operations with immediate operand);
> + * 3. *(rX) = <imm> (indirect memory assignment with immediate operand).
> + *
> + * If actual insn->imm value is wrong, bail out.
> + */
> +static int bpf_core_reloc_insn(struct bpf_program *prog, int insn_off,
> +			       __u32 orig_off, __u32 new_off)
> +{
> +	struct bpf_insn *insn;
> +	int insn_idx;
> +	__u8 class;
> +
> +	if (insn_off % sizeof(struct bpf_insn))
> +		return -EINVAL;
> +	insn_idx = insn_off / sizeof(struct bpf_insn);
> +
> +	insn = &prog->insns[insn_idx];
> +	class = BPF_CLASS(insn->code);
> +
> +	if (class == BPF_ALU || class == BPF_ALU64) {
> +		if (BPF_SRC(insn->code) != BPF_K)
> +			return -EINVAL;
> +		if (insn->imm != orig_off)
> +			return -EINVAL;
> +		insn->imm = new_off;
> +		pr_debug("prog '%s': patched insn #%d (ALU/ALU64) imm %d -> %d\n",
> +			 bpf_program__title(prog, false),
> +			 insn_idx, orig_off, new_off);
> +	} else {
> +		pr_warning("prog '%s': trying to relocate unrecognized insn #%d, code:%x, src:%x, dst:%x, off:%x, imm:%x\n",
> +			   bpf_program__title(prog, false),
> +			   insn_idx, insn->code, insn->src_reg, insn->dst_reg,
> +			   insn->off, insn->imm);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Probe few well-known locations for vmlinux kernel image and try to load BTF
> + * data out of it to use for target BTF.
> + */
> +static struct btf *bpf_core_find_kernel_btf(void)
> +{
> +	const char *locations[] = {
> +		"/lib/modules/%1$s/vmlinux-%1$s",
> +		"/usr/lib/modules/%1$s/kernel/vmlinux",
> +	};
> +	char path[PATH_MAX + 1];
> +	struct utsname buf;
> +	struct btf *btf;
> +	int i, err;
> +
> +	err = uname(&buf);
> +	if (err) {
> +		pr_warning("failed to uname(): %d\n", err);
> +		return ERR_PTR(err);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(locations); i++) {
> +		snprintf(path, PATH_MAX, locations[i], buf.release);
> +		pr_debug("attempting to load kernel BTF from '%s'\n", path);
> +
> +		if (access(path, R_OK))
> +			continue;
> +
> +		btf = btf__parse_elf(path, NULL);
> +		if (IS_ERR(btf))
> +			continue;
> +
> +		pr_debug("successfully loaded kernel BTF from '%s'\n", path);
> +		return btf;
> +	}
> +
> +	pr_warning("failed to find valid kernel BTF\n");
> +	return ERR_PTR(-ESRCH);
> +}
> +
> +/* Output spec definition in the format:
> + * [<type-id>] (<type-name>) + <raw-spec> => <offset>@<spec>,
> + * where <spec> is a C-syntax view of recorded field access, e.g.: x.a[3].b
> + */
> +static void bpf_core_dump_spec(int level, const struct bpf_core_spec *spec)
> +{
> +	const struct btf_type *t;
> +	const char *s;
> +	__u32 type_id;
> +	int i;
> +
> +	type_id = spec->spec[0].type_id;
> +	t = btf__type_by_id(spec->btf, type_id);
> +	s = btf__name_by_offset(spec->btf, t->name_off);
> +	libbpf_print(level, "[%u] (%s) + ", type_id, s);
> +
> +	for (i = 0; i < spec->raw_len; i++)
> +		libbpf_print(level, "%d%s", spec->raw_spec[i],
> +			     i == spec->raw_len - 1 ? " => " : ":");
> +
> +	libbpf_print(level, "%u @ &x", spec->offset);
> +
> +	for (i = 0; i < spec->len; i++) {
> +		if (spec->spec[i].name)
> +			libbpf_print(level, ".%s", spec->spec[i].name);
> +		else
> +			libbpf_print(level, "[%u]", spec->spec[i].idx);
> +	}
> +
> +}
> +
> +static size_t bpf_core_hash_fn(const void *key, void *ctx)
> +{
> +	return (size_t)key;
> +}
> +
> +static bool bpf_core_equal_fn(const void *k1, const void *k2, void *ctx)
> +{
> +	return k1 == k2;
> +}
> +
> +static void *u32_to_ptr(__u32 x)
> +{
> +	return (void *)(uintptr_t)x;
> +}
> +
> +/* 
> + * CO-RE relocate single instruction.
> + *
> + * The outline and important points of the algorithm:
> + * 1. For given local type, find corresponding candidate target types.
> + *    Candidate type is a type with the same "essential" name, ignoring
> + *    everything after last triple underscore (___). E.g., `sample`,
> + *    `sample___flavor_one`, `sample___flavor_another_one`, are all candidates
> + *    for each other. Names with triple underscore are referred to as
> + *    "flavors" and are useful, among other things, to allow to
> + *    specify/support incompatible variations of the same kernel struct, which
> + *    might differ between different kernel versions and/or build
> + *    configurations.
> + *
> + *    N.B. Struct "flavors" could be generated by bpftool's BTF-to-C
> + *    converter, when deduplicated BTF of a kernel still contains more than
> + *    one different types with the same name. In that case, ___2, ___3, etc
> + *    are appended starting from second name conflict. But start flavors are
> + *    also useful to be defined "locally", in BPF program, to extract same
> + *    data from incompatible changes between different kernel
> + *    versions/configurations. For instance, to handle field renames between
> + *    kernel versions, one can use two flavors of the struct name with the
> + *    same common name and use conditional relocations to extract that field,
> + *    depending on target kernel version.
> + * 2. For each candidate type, try to match local specification to this
> + *    candidate target type. Matching involves finding corresponding
> + *    high-level spec accessors, meaning that all named fields should match,
> + *    as well as all array accesses should be within the actual bounds. Also,
> + *    types should be compatible (see bpf_core_fields_are_compat for details).
> + * 3. It is supported and expected that there might be multiple flavors
> + *    matching the spec. As long as all the specs resolve to the same set of
> + *    offsets across all candidates, there is not error. If there is any
> + *    ambiguity, CO-RE relocation will fail. This is necessary to accomodate
> + *    imprefection of BTF deduplication, which can cause slight duplication of
> + *    the same BTF type, if some directly or indirectly referenced (by
> + *    pointer) type gets resolved to different actual types in different
> + *    object files. If such situation occurs, deduplicated BTF will end up
> + *    with two (or more) structurally identical types, which differ only in
> + *    types they refer to through pointer. This should be OK in most cases and
> + *    is not an error.
> + * 4. Candidate types search is performed by linearly scanning through all
> + *    types in target BTF. It is anticipated that this is overall more
> + *    efficient memory-wise and not significantly worse (if not better)
> + *    CPU-wise compared to prebuilding a map from all local type names to
> + *    a list of candidate type names. It's also sped up by caching resolved
> + *    list of matching candidates per each local "root" type ID, that has at
> + *    least one bpf_offset_reloc associated with it. This list is shared
> + *    between multiple relocations for the same type ID and is updated as some
> + *    of the candidates are pruned due to structural incompatibility.
> + */
> +static int bpf_core_reloc_offset(struct bpf_program *prog,
> +				 const struct bpf_offset_reloc *relo,
> +				 int relo_idx,
> +				 const struct btf *local_btf,
> +				 const struct btf *targ_btf,
> +				 struct hashmap *cand_cache)
> +{
> +	const char *prog_name = bpf_program__title(prog, false);
> +	struct bpf_core_spec local_spec, cand_spec, targ_spec;
> +	const void *type_key = u32_to_ptr(relo->type_id);
> +	const struct btf_type *local_type, *cand_type;
> +	const char *local_name, *cand_name;
> +	struct ids_vec *cand_ids;
> +	__u32 local_id, cand_id;
> +	const char *spec_str;
> +	int i, j, err;
> +
> +	local_id = relo->type_id;
> +	local_type = btf__type_by_id(local_btf, local_id);
> +	if (!local_type)
> +		return -EINVAL;
> +
> +	local_name = btf__name_by_offset(local_btf, local_type->name_off);
> +	if (str_is_empty(local_name))
> +		return -EINVAL;
> +
> +	spec_str = btf__name_by_offset(local_btf, relo->access_str_off);
> +	if (str_is_empty(spec_str))
> +		return -EINVAL;
> +
> +	err = bpf_core_spec_parse(local_btf, local_id, spec_str, &local_spec);
> +	if (err) {
> +		pr_warning("prog '%s': relo #%d: parsing [%d] (%s) + %s failed: %d\n",
> +			   prog_name, relo_idx, local_id, local_name, spec_str,
> +			   err);
> +		return -EINVAL;
> +	}
> +
> +	pr_debug("prog '%s': relo #%d: spec is ", prog_name, relo_idx);
> +	bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> +	libbpf_print(LIBBPF_DEBUG, "\n");
> +
> +	if (!hashmap__find(cand_cache, type_key, (void **)&cand_ids)) {
> +		cand_ids = bpf_core_find_cands(local_btf, local_id, targ_btf);
> +		if (IS_ERR(cand_ids)) {
> +			pr_warning("prog '%s': relo #%d: target candidate search failed for [%d] (%s): %ld",
> +				   prog_name, relo_idx, local_id, local_name,
> +				   PTR_ERR(cand_ids));
> +			return PTR_ERR(cand_ids);
> +		}
> +		err = hashmap__set(cand_cache, type_key, cand_ids, NULL, NULL);
> +		if (err) {
> +			bpf_core_free_cands(cand_ids);
> +			return err;
> +		}
> +	}
> +
> +	for (i = 0, j = 0; i < cand_ids->len; i++) {
> +		cand_id = cand_ids->data[i];
> +		cand_type = btf__type_by_id(targ_btf, cand_id);
> +		cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> +
> +		err = bpf_core_spec_match(&local_spec, targ_btf,
> +					  cand_id, &cand_spec);
> +		if (err < 0) {
> +			pr_warning("prog '%s': relo #%d: failed to match spec ",
> +				   prog_name, relo_idx);
> +			bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> +			libbpf_print(LIBBPF_WARN,
> +				     " to candidate #%d [%d] (%s): %d\n",
> +				     i, cand_id, cand_name, err);
> +			return err;
> +		}
> +		if (err == 0) {
> +			pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> +				 prog_name, relo_idx, i, cand_id, cand_name);
> +			bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> +			libbpf_print(LIBBPF_DEBUG, "\n");
> +			continue;
> +		}
> +
> +		pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> +			 prog_name, relo_idx, i);
> +		bpf_core_dump_spec(LIBBPF_DEBUG, &cand_spec);
> +		libbpf_print(LIBBPF_DEBUG, "\n");
> +
> +		if (j == 0) {
> +			targ_spec = cand_spec;
> +		} else if (cand_spec.offset != targ_spec.offset) {
> +			/* if there are many candidates, they should all
> +			 * resolve to the same offset
> +			 */
> +			pr_warning("prog '%s': relo #%d: candidate #%d spec ",
> +				   prog_name, relo_idx, i);
> +			bpf_core_dump_spec(LIBBPF_WARN, &cand_spec);
> +			libbpf_print(LIBBPF_WARN,
> +				     " conflicts with target spec ");
> +			bpf_core_dump_spec(LIBBPF_WARN, &targ_spec);
> +			libbpf_print(LIBBPF_WARN, "\n");
> +			return -EINVAL;
> +		}
> +
> +		cand_ids->data[j++] = cand_spec.spec[0].type_id;
> +	}
> +
> +	cand_ids->len = j;
> +	if (cand_ids->len == 0) {
> +		pr_warning("prog '%s': relo #%d: no matching targets found for [%d] (%s) + %s\n",
> +			   prog_name, relo_idx, local_id, local_name, spec_str);
> +		return -ESRCH;
> +	}
> +
> +	err = bpf_core_reloc_insn(prog, relo->insn_off,
> +				  local_spec.offset, targ_spec.offset);
> +	if (err) {
> +		pr_warning("prog '%s': relo #%d: failed to patch insn at offset %d: %d\n",
> +			   prog_name, relo_idx, relo->insn_off, err);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +bpf_core_reloc_offsets(struct bpf_object *obj, const char *targ_btf_path)
> +{
> +	const struct btf_ext_info_sec *sec;
> +	const struct bpf_offset_reloc *rec;
> +	const struct btf_ext_info *seg;
> +	struct hashmap_entry *entry;
> +	struct hashmap *cand_cache = NULL;
> +	struct bpf_program *prog;
> +	struct btf *targ_btf;
> +	const char *sec_name;
> +	int i, err = 0;
> +
> +	if (targ_btf_path)
> +		targ_btf = btf__parse_elf(targ_btf_path, NULL);
> +	else
> +		targ_btf = bpf_core_find_kernel_btf();
> +	if (IS_ERR(targ_btf)) {
> +		pr_warning("failed to get target BTF: %ld\n",
> +			   PTR_ERR(targ_btf));
> +		return PTR_ERR(targ_btf);
> +	}
> +
> +	cand_cache = hashmap__new(bpf_core_hash_fn, bpf_core_equal_fn, NULL);
> +	if (IS_ERR(cand_cache)) {
> +		err = PTR_ERR(cand_cache);
> +		goto out;
> +	}
> +
> +	seg = &obj->btf_ext->offset_reloc_info;
> +	for_each_btf_ext_sec(seg, sec) {
> +		sec_name = btf__name_by_offset(obj->btf, sec->sec_name_off);
> +		if (str_is_empty(sec_name)) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +		prog = bpf_object__find_program_by_title(obj, sec_name);
> +		if (!prog) {
> +			pr_warning("failed to find program '%s' for CO-RE offset relocation\n",
> +				   sec_name);
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		pr_debug("prog '%s': performing %d CO-RE offset relocs\n",
> +			 sec_name, sec->num_info);
> +
> +		for_each_btf_ext_rec(seg, sec, i, rec) {
> +			err = bpf_core_reloc_offset(prog, rec, i, obj->btf,
> +						    targ_btf, cand_cache);
> +			if (err) {
> +				pr_warning("prog '%s': relo #%d: failed to relocate: %d\n",
> +					   sec_name, i, err);
> +				goto out;
> +			}
> +		}
> +	}
> +
> +out:
> +	btf__free(targ_btf);
> +	if (!IS_ERR_OR_NULL(cand_cache)) {
> +		hashmap__for_each_entry(cand_cache, entry, i) {
> +			bpf_core_free_cands(entry->value);
> +		}
> +		hashmap__free(cand_cache);
> +	}
> +	return err;
> +}
> +
> +static int
> +bpf_object__relocate_core(struct bpf_object *obj, const char *targ_btf_path)
> +{
> +	int err = 0;
> +
> +	if (obj->btf_ext->offset_reloc_info.len)
> +		err = bpf_core_reloc_offsets(obj, targ_btf_path);
> +
> +	return err;
> +}
> +
> static int
> bpf_program__reloc_text(struct bpf_program *prog, struct bpf_object *obj,
> 			struct reloc_desc *relo)
> @@ -2399,14 +3293,21 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj)
> 	return 0;
> }
> 
> -
> static int
> -bpf_object__relocate(struct bpf_object *obj)
> +bpf_object__relocate(struct bpf_object *obj, const char *targ_btf_path)
> {
> 	struct bpf_program *prog;
> 	size_t i;
> 	int err;
> 
> +	if (obj->btf_ext) {
> +		err = bpf_object__relocate_core(obj, targ_btf_path);
> +		if (err) {
> +			pr_warning("failed to perform CO-RE relocations: %d\n",
> +				   err);
> +			return err;
> +		}
> +	}
> 	for (i = 0; i < obj->nr_programs; i++) {
> 		prog = &obj->programs[i];
> 
> @@ -2807,7 +3708,7 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr)
> 	obj->loaded = true;
> 
> 	CHECK_ERR(bpf_object__create_maps(obj), err, out);
> -	CHECK_ERR(bpf_object__relocate(obj), err, out);
> +	CHECK_ERR(bpf_object__relocate(obj, attr->target_btf_path), err, out);
> 	CHECK_ERR(bpf_object__load_progs(obj, attr->log_level), err, out);
> 
> 	return 0;
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 8a9d462a6f6d..e8f70977d137 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -92,6 +92,7 @@ LIBBPF_API void bpf_object__close(struct bpf_object *object);
> struct bpf_object_load_attr {
> 	struct bpf_object *obj;
> 	int log_level;
> +	const char *target_btf_path;
> };
> 
> /* Load/unload object into/from kernel */
> -- 
> 2.17.1
> 


^ permalink raw reply

* Re: [PATCH 0/2] tools: bpftool: add net (un)load command to load XDP
From: Alexei Starovoitov @ 2019-07-31  0:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel T. Lee, Stephen Hemminger, David Ahern,
	Jesper Dangaard Brouer, John Fastabend, Daniel Borkmann,
	Alexei Starovoitov, netdev
In-Reply-To: <20190730170725.279761e7@cakuba.netronome.com>

On Tue, Jul 30, 2019 at 05:07:25PM -0700, Jakub Kicinski wrote:
> On Tue, 30 Jul 2019 16:17:56 -0700, Alexei Starovoitov wrote:
> > On Tue, Jul 30, 2019 at 03:59:15PM -0700, Jakub Kicinski wrote:
> > > On Wed, 31 Jul 2019 03:48:19 +0900, Daniel T. Lee wrote:  
> > > > Currently, bpftool net only supports dumping progs loaded on the
> > > > interface. To load XDP prog on interface, user must use other tool
> > > > (eg. iproute2). By this patch, with `bpftool net (un)load`, user can
> > > > (un)load XDP prog on interface.  
> > > 
> > > I don't understand why using another tool is a bad thing :(
> > > What happened to the Unix philosophy?
> > > 
> > > I remain opposed to duplicating iproute2's functionality under 
> > > bpftool net :( The way to attach bpf programs in the networking
> > > subsystem is through the iproute2 commends - ip and tc.. 
> > > 
> > > It seems easy enough to add a feature to bpftool but from 
> > > a perspective of someone adding a new feature to the kernel, 
> > > and wanting to update user space components it's quite painful :(
> > > 
> > > So could you describe to me in more detail why this is a good idea?
> > > Perhaps others can chime in?  
> > 
> > I don't think it has anything to do with 'unix philosophy'.
> > Here the proposal to teach bpftool to attach xdp progs.
> > I see nothing wrong with that.
> 
> Nothing meaning you disagree it's duplicated effort and unnecessary 
> LoC the community has to maintain, review, test..?

I don't see duplicated effort.

> > Another reason is iproute2 is still far away from adopting libbpf.
> > So all the latest goodness like BTF, introspection, etc will not
> > be available to iproute2 users for some time.
> 
> Duplicating the same features in bpftool will only diminish the
> incentive for moving iproute2 to libbpf. 

not at all. why do you think so?

> And for folks who deal
> with a wide variety of customers, often novices maintaining two
> ways of doing the same thing is a hassle :(

It's not the same thing.
I'm talking about my experience dealing with 'wide variety of bpf customers'.
They only have a fraction of their time to learn one tool.
Making every bpf customer learn ten tools is not an option.

> > Even when iproute2 is ready it would be convenient for folks like me
> > (who need to debug stuff in production) to remember cmd line of
> > bpftool only to introspect the server. Debugging often includes
> > detaching/attaching progs. Not only doing 'bpftool p s'.
> 
> Let's just put the two commands next to each other:
> 
>        ip link set xdp $PROG dev $DEV
> 
> bpftool net attach xdp $PROG dev $DEV
> 
> Are they that different?

yes.
they're different tools with they own upgrade/rollout cycle

> 
> > If bpftool was taught to do equivalent of 'ip link' that would be
> > very different story and I would be opposed to that.
> 
> Yes, that'd be pretty clear cut, only the XDP stuff is a bit more 
> of a judgement call.

bpftool must be able to introspect every aspect of bpf programming.
That includes detaching and attaching anywhere.
Anyone doing 'bpftool p s' should be able to switch off particular
prog id without learning ten different other tools.
iproute2 is a small bit of it. There is cgroup and tracing too.
bpftool should be one tool to do everything directly related to bpf.


^ permalink raw reply

* Re: [PATCH nf] netfilter: nf_tables: map basechain priority to hardware priority
From: Jakub Kicinski @ 2019-07-31  0:21 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, wenxu, jiri, marcelo.leitner, saeedm, gerlitz.or,
	paulb, netdev
In-Reply-To: <20190730105417.14538-1-pablo@netfilter.org>

On Tue, 30 Jul 2019 12:54:17 +0200, Pablo Neira Ayuso wrote:
> This patch maps basechain netfilter priorities from -8192 to 8191 to
> hardware priority 0xC000 + 1. tcf_auto_prio() uses 0xC000 if the user
> specifies no priority, then it subtract 1 for each new tcf_proto object.
> This patch uses the hardware priority range from 0xC000 to 0xFFFF for
> netfilter.
> 
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
> This follows a rather conservative approach, I could just expose the
> 2^16 hardware priority range, but we may need to split this priority
> range among the ethtool_rx, tc and netfilter subsystems to start with
> and it should be possible to extend the priority range later on.
> 
> By netfilter priority, I'm refering to the basechain priority:
> 
> 	add chain x y { type filter hook ingress device eth0 priority 0; }
>                                                              ^^^^^^^^^^^
> 
> This is no transparently mapped to hardware, this patch shifts it to
> make it fit into the 0xC000 + 1 .. 0xFFFF hardware priority range.

Mmm.. so the ordering of tables is intended to be decided by priority
and not block type (nft, tc, ethtool)?  I was always expecting we 
would just follow the software order when it comes to inter-subsystem
decisions.  So ethtool first, then XDP, then TC, then nft, then
bridging etc. TC vs NFT based on:

static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
				    struct packet_type **ppt_prev)
{
...
	if (static_branch_unlikely(&ingress_needed_key)) {
		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev);
		if (!skb)
			goto out;

		if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
			goto out;
	}

Are they solid use cases for choosing the ordering arbitrarily?

^ permalink raw reply

* Re: [PATCH v2 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Song Liu @ 2019-07-31  0:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, bpf, netdev@vger.kernel.org, Alexei Starovoitov,
	daniel@iogearbox.net, Yonghong Song, Kernel Team
In-Reply-To: <CAEf4BzZpcP1aBwrz8DbToQ=nVUukPwiG-PBCFGZNb2wXg_msnA@mail.gmail.com>



> On Jul 30, 2019, at 4:55 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Tue, Jul 30, 2019 at 4:44 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jul 30, 2019, at 12:53 PM, Andrii Nakryiko <andriin@fb.com> wrote:
>>> 
>>> This patch implements the core logic for BPF CO-RE offsets relocations.
>>> Every instruction that needs to be relocated has corresponding
>>> bpf_offset_reloc as part of BTF.ext. Relocations are performed by trying
>>> to match recorded "local" relocation spec against potentially many
>>> compatible "target" types, creating corresponding spec. Details of the
>>> algorithm are noted in corresponding comments in the code.
>>> 
>>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
>>> ---
>>> tools/lib/bpf/libbpf.c | 915 ++++++++++++++++++++++++++++++++++++++++-
>>> tools/lib/bpf/libbpf.h |   1 +
>>> 2 files changed, 909 insertions(+), 7 deletions(-)
>>> 
>>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>>> index ead915aec349..75da90928257 100644
>>> --- a/tools/lib/bpf/libbpf.c
>>> +++ b/tools/lib/bpf/libbpf.c
>>> @@ -38,6 +38,7 @@
> 
> [...]
> 
>>> 
>>> -static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
>>> -                                                  __u32 id)
>>> +static const struct btf_type *
>>> +skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id)
>>> {
>>>      const struct btf_type *t = btf__type_by_id(btf, id);
>>> 
>>> +     if (res_id)
>>> +             *res_id = id;
>>> +
>>>      while (true) {
>>>              switch (BTF_INFO_KIND(t->info)) {
>>>              case BTF_KIND_VOLATILE:
>>>              case BTF_KIND_CONST:
>>>              case BTF_KIND_RESTRICT:
>>>              case BTF_KIND_TYPEDEF:
>>> +                     if (res_id)
>>> +                             *res_id = t->type;
>>>                      t = btf__type_by_id(btf, t->type);
>> 
>> So btf->types[*res_id] == retval, right? Then with retval and btf, we can
>> calculate *res_id without this change?
> 
> Unless I'm missing something very clever here, no. btf->types is array
> of pointers (it's an index into a variable-sized types). This function
> returns `struct btf_type *`, which is one of the **values** stored in
> that array. You are claiming that by having value of one of array
> elements you can easily find element's index? If it was possible to do
> in O(1), we wouldn't have so many algorithms and data structures for
> search and indexing. You can do that only with linear search, not some
> clever pointer arithmetic or at least binary search. So I'm not sure
> what you are proposing here...

oops.. Clearly, I made some silly mistake. Sorry for the noise. 

Song

> 
> The way BTF is defined, struct btf_type doesn't know its own type ID,
> which is often inconvenient and requires to keep track of that ID, if
> it's necessary, but that's how it is.
> 
> But then again, what are we trying to achieve here? Eliminate
> returning id and pointer? I could always return id and easily look up
> pointer, but having both is super convenient and makes code simpler
> and shorter, so I'd like to keep it.
> 
>> 
>>>                      break;
>>>              default:
>>> @@ -1044,7 +1051,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
>>> static bool get_map_field_int(const char *map_name, const struct btf *btf,
>>>                            const struct btf_type *def,
>>>                            const struct btf_member *m, __u32 *res) {
> 
> [...]


^ permalink raw reply

* Re: [PATCH net-next 2/2] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-07-31  0:15 UTC (permalink / raw)
  To: Andrew Lunn, Vladimir Oltean
  Cc: Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml, Andrew Jeffery,
	openbmc@lists.ozlabs.org
In-Reply-To: <20190730131719.GA28552@lunn.ch>

On 7/30/19 6:17 AM, Andrew Lunn wrote:
>> Again, I don't think Linux has generic support for overwriting (or
>> even describing) the operating mode of a PHY, although maybe that's a
>> direction we would want to push the discussion towards. RGMII to
>> copper, RGMII to fiber, SGMII to copper, copper to fiber (media
>> converter), even RGMII to SGMII (RTL8211FS supports this) - lots of
>> modes, and this is only for gigabit PHYs...
> 
> This is something Russell King has PHYLINK patches for, which have not
> yet been merged. There are some boards which use a PHY as a media
> converter, placed between the MAC and an SFP.

Thank you Andrew. Let me see if the operating mode can be auto-detected on BCM54616S chip, and I will update back soon.

Thanks,

Tao

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox