Netdev List

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: David Ahern @ 2018-11-08 16:06 UTC
  To: Paweł Staszewski, Jesper Dangaard Brouer; +Cc: netdev, Yoel Caspersen
In-Reply-To: <8dde3b32-59ce-38f3-5913-2ce08264e9dc@itcare.pl>

On 11/8/18 6:33 AM, Paweł Staszewski wrote:
> 
> 
> On 07.11.2018 at 22:06, David Ahern wrote:
>> On 11/3/18 6:24 PM, Paweł Staszewski wrote:
>>>> Does your setup have any other device types besides physical ports with
>>>> VLANs (e.g., any macvlans or bonds)?
>>>>
>>>>
>>> no.
>>> just
>>> phy(mlnx)->vlans only config
>> VLAN and non-VLAN (and a mix) seem to work ok. Patches are here:
>>     https://github.com/dsahern/linux.git bpf/kernel-tables-wip
>>
>> I got lazy with the vlan exports; right now it requires 8021q to be
>> builtin (CONFIG_VLAN_8021Q=y)
>>
>> You can use the xdp_fwd sample:
>>    make O=kbuild -C samples/bpf -j 8
>>
>> Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server
>> and run:
>>     ./xdp_fwd <list of NIC ports>
>>
>> e.g., in my testing I run:
>>     xdp_fwd eth1 eth2 eth3 eth4
>>
>> All of the relevant forwarding ports need to be on the same command
>> line. This version populates a second map to verify the egress port has
>> XDP enabled.
> Installed today on some lab server with mellanox connectx4
> 
> And trying some simple static routing first - but after enabling xdp
> program - receiver is not receiving frames
> 
> Route table is as simple as possible for tests :)
> 
> icmp ping test sent from 192.168.22.237 to 172.16.0.2 - incoming
> packets on vlan 4081
> 
> ip r
> default via 192.168.22.236 dev vlan4081
> 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1
> 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205
> 
> neigh table:
> ip neigh ls
> 
> 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE
> 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE
> 
> and interfaces:
> 4: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state
> UP mode DEFAULT group default qlen 1000
>     link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
> 5: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state
> UP mode DEFAULT group default qlen 1000
>     link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
> 6: vlan4081@enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
> 7: vlan1740@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
> 
> 5: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:5 qdisc
> mq state UP group default qlen 1000
>     link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>     inet6 fe80::ae1f:6bff:fe07:c891/64 scope link
>        valid_lft forever preferred_lft forever
> 6: vlan4081@enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue state UP group default qlen 1000
>     link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.22.205/24 scope global vlan4081
>        valid_lft forever preferred_lft forever
>     inet6 fe80::ae1f:6bff:fe07:c890/64 scope link
>        valid_lft forever preferred_lft forever
> 7: vlan1740@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue state UP group default qlen 1000
>     link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>     inet 172.16.0.1/30 scope global vlan1740
>        valid_lft forever preferred_lft forever
>     inet6 fe80::ae1f:6bff:fe07:c891/64 scope link
>        valid_lft forever preferred_lft forever
> 
> 
> xdp program detached:
> Receiving side tcpdump:
> 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id
> 30227, seq 487, length 64
> 
> I can see icmp requests
> 
> 
> enabling xdp
> ./xdp_fwd enp175s0f1 enp175s0f0
> 
> 4: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq
> state UP mode DEFAULT group default qlen 1000
>     link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>     prog/xdp id 5 tag 3c231ff1e5e77f3f
> 5: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq
> state UP mode DEFAULT group default qlen 1000
>     link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>     prog/xdp id 5 tag 3c231ff1e5e77f3f
> 6: vlan4081@enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
> 7: vlan1740@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue state UP mode DEFAULT group default qlen 1000
>     link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
> 
What hardware is this?

Start with:

echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
cat /sys/kernel/debug/tracing/trace_pipe

From there, you can check the FIB lookups:
sysctl -w kernel.perf_event_max_stack=16
perf record -e fib:* -a -g -- sleep 5
perf script

* Re: [PATCH] e1000e: Change watchdog task to be delayed work
From: Jeff Kirsher @ 2018-11-09  1:49 UTC
  To: Robert Eshleman; +Cc: David S. Miller, intel-wired-lan, netdev, linux-kernel
In-Reply-To: <1541290645-25033-1-git-send-email-bobbyeshleman@gmail.com>

On Sat, 2018-11-03 at 17:17 -0700, Robert Eshleman wrote:
> This completes a pending TODO to use queue_delayed_work() instead of
> schedule_work().
> 
> Signed-off-by: Robert Eshleman <bobbyeshleman@gmail.com>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)

Dropping this patch due to the problems seen with it applied.


* Re: [PATCH net-next v2 3/5] virtio_ring: add packed ring support
From: Tiwei Bie @ 2018-11-09  1:50 UTC
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtualization, linux-kernel, netdev, virtio-dev,
	wexu, jfreimann
In-Reply-To: <20181108103155-mutt-send-email-mst@kernel.org>

On Thu, Nov 08, 2018 at 10:56:02AM -0500, Michael S. Tsirkin wrote:
> On Thu, Nov 08, 2018 at 07:51:48PM +0800, Tiwei Bie wrote:
> > On Thu, Nov 08, 2018 at 04:18:25PM +0800, Jason Wang wrote:
> > > 
> > > On 2018/11/8 9:38 AM, Tiwei Bie wrote:
> > > > > > +
> > > > > > +	if (vq->vq.num_free < descs_used) {
> > > > > > +		pr_debug("Can't add buf len %i - avail = %i\n",
> > > > > > +			 descs_used, vq->vq.num_free);
> > > > > > +		/* FIXME: for historical reasons, we force a notify here if
> > > > > > +		 * there are outgoing parts to the buffer.  Presumably the
> > > > > > +		 * host should service the ring ASAP. */
> > > > > I don't think we have a reason to do this for packed ring.
> > > > > No historical baggage there, right?
> > > > Based on the original commit log, it seems that the notify here
> > > > is just an "optimization". But I don't quite understand what
> > > > "the heuristics which KVM uses" refers to. If it's safe to drop
> > > > this in packed ring, I'd like to do it.
> > > 
> > > 
> > > According to the commit log, it seems like a workaround for the lguest
> > > networking backend.
> > 
> > Do you know why removing this notify in Tx will break "the
> > heuristics which KVM uses"? Or what does "the heuristics
> > which KVM uses" refer to?
> 
> Yes. QEMU has a mode where it disables notifications and processes TX
> ring periodically from a timer.  It's off by default but used to be on
> by default a long time ago. If ring becomes full this causes traffic
> stalls.  As a work-around Rusty put in this hack to kick on ring full
> even with notifications disabled.  It's easy enough to make sure QEMU
> does not combine devices with packed ring support with the timer hack.
> And I am guessing it's safe enough to also block that option completely
> e.g. when virtio 1.0 is enabled.

I see. Thanks!
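
The behaviour discussed above can be modelled roughly as follows. This is a simplified sketch, not the virtio_ring.c code: `toy_ring`, `toy_add_buf` and the `kicks` counter are hypothetical names, and the `force_kick_on_full` flag is an assumption that stands in for the historical split-ring hack which the packed ring would drop.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy model of a virtqueue; not the kernel's vring_virtqueue. */
struct toy_ring {
	unsigned int num_free;   /* free descriptors left in the ring */
	unsigned int kicks;      /* number of notifications sent to the host */
	bool force_kick_on_full; /* true: legacy hack, false: no forced notify */
};

/*
 * Try to add a buffer with `out` driver-to-host parts and `in`
 * host-to-driver parts. Returns 0 on success, or -ENOSPC when the
 * ring cannot hold the buffer.
 */
static int toy_add_buf(struct toy_ring *r, unsigned int out, unsigned int in)
{
	unsigned int descs_used = out + in;

	if (r->num_free < descs_used) {
		/* Legacy hack: notify only if the buffer has outgoing parts. */
		if (r->force_kick_on_full && out)
			r->kicks++;
		return -ENOSPC;
	}

	r->num_free -= descs_used;
	return 0;
}
```

With `force_kick_on_full` cleared, a full ring only returns -ENOSPC, which is the behaviour proposed for the packed ring in this thread.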

> 
> > 
> > > I agree to drop it, we should not have such burden.
> > > 
> > > But we should note that, with this removed, the comparison between packed
> > > and split rings is kind of unfair. Considering the recent removal of lguest
> > > support, maybe we can drop this for split ring as well?
> > > 
> > > Thanks
> > > 
> > > 
> > > > 
> > > > commit 44653eae1407f79dff6f52fcf594ae84cb165ec4
> > > > > Author: Rusty Russell <rusty@rustcorp.com.au>
> > > > Date:   Fri Jul 25 12:06:04 2008 -0500
> > > > 
> > > >      virtio: don't always force a notification when ring is full
> > > >      We force notification when the ring is full, even if the host has
> > > >      indicated it doesn't want to know.  This seemed like a good idea at
> > > >      the time: if we fill the transmit ring, we should tell the host
> > > >      immediately.
> > > >      Unfortunately this logic also applies to the receiving ring, which is
> > > > >      refilled constantly.  We should introduce real notification thresholds
> > > >      to replace this logic.  Meanwhile, removing the logic altogether breaks
> > > >      the heuristics which KVM uses, so we use a hack: only notify if there are
> > > >      outgoing parts of the new buffer.
> > > >      Here are the number of exits with lguest's crappy network implementation:
> > > >      Before:
> > > >              network xmit 7859051 recv 236420
> > > >      After:
> > > >              network xmit 7858610 recv 118136
> > > > >      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > > > 
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 72bf8bc09014..21d9a62767af 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -87,8 +87,11 @@ static int vring_add_buf(struct virtqueue *_vq,
> > > >   	if (vq->num_free < out + in) {
> > > >   		pr_debug("Can't add buf len %i - avail = %i\n",
> > > >   			 out + in, vq->num_free);
> > > > > -		/* We notify *even if* VRING_USED_F_NO_NOTIFY is set here. */
> > > > -		vq->notify(&vq->vq);
> > > > +		/* FIXME: for historical reasons, we force a notify here if
> > > > +		 * there are outgoing parts to the buffer.  Presumably the
> > > > +		 * host should service the ring ASAP. */
> > > > +		if (out)
> > > > +			vq->notify(&vq->vq);
> > > >   		END_USE(vq);
> > > >   		return -ENOSPC;
> > > >   	}
> > > > 
> > > > 

* Re: [PATCH net-next 1/3] devlink: Add fw_version_check generic parameter
From: Ido Schimmel @ 2018-11-08 16:22 UTC
  To: Jakub Kicinski
  Cc: Ido Schimmel, netdev@vger.kernel.org, davem@davemloft.net,
	Jiri Pirko, Shalom Toledo, Moshe Shemesh, dsahern@gmail.com,
	andrew@lunn.ch, f.fainelli@gmail.com, mlxsw
In-Reply-To: <20181107110518.6502665a@cakuba.netronome.com>

On Wed, Nov 07, 2018 at 11:05:18AM -0800, Jakub Kicinski wrote:
> On Wed, 7 Nov 2018 12:11:32 +0200, Ido Schimmel wrote:
> > On Tue, Nov 06, 2018 at 02:47:13PM -0800, Jakub Kicinski wrote:
> > > On Tue, 6 Nov 2018 22:37:51 +0200, Ido Schimmel wrote:  
> > > > On Tue, Nov 06, 2018 at 12:19:13PM -0800, Jakub Kicinski wrote:  
> > > > > We have a FW loading policy for NFP, too, so it'd be good to see if we
> > > > > can find a common ground.    
> > > > 
> > > > If the parameter is set, then device runs with whatever firmware version
> > > > was last flashed (via ethtool, for example). Otherwise, the driver will
> > > > flash a version according to its policy. In mlxsw, it is a specific
> > > > version.
> > > > 
> > > > Will that work for you?  
> > > 
> > > Our FW is always backward compatible so there is no need to downgrade.
> > > 
> > > What we have is more along these lines: there are two images, one
> > > on disk and a second in the flash.  The FW loading policy can decide
> > > which of those should be preferred, or should the versions be compared
> > > and the newer one win (default).  But we don't flash the newer FW, just
> > > potentially load it from disk today.  
> > 
> > Not sure I understand. You have a currently flashed firmware and another
> > firmware image on disk. 
> 
> Correct.
> 
> > You potentially load the firmware from the disk, but never flash it?
> 
> You can flash it if you want, but default is to load the
> linux-firmware/disk one when system boots.  
> 
> Flashing is useful for example if you have some super special FW that
> you prefer over whatever comes from linux-firmware updates, or (most
> commonly) if distributing FWs is hard in your provisioning system (ugh).
> 
> > If so, why load it?
> 
> We need to load some FW..

OK. Got it now. mlxsw must flash a firmware in order to load it, but in
your case it's not necessary.

> > > I'm not sure whether 'fw_version_check' describes the general behaviour
> > > of not updating the FW in flash.  The policy of updating the FW in the
> > > flash if the one on disk is newer seems to be something we could adopt
> > > as well.  Can we come up with a more general parameter which could
> > > select FW loading policy that'd work for both cases?
> > > 
> > > Would values like these make any sense to you?
> > >  - driver preferred (your default behaviour, we don't support since
> > >    driver doesn't care);
> > >  - newest (our default, device compares images and picks newer);
> > >  - always disk (always run with what's on the disk, regardless of
> > >    versions);
> > >  - always flash (always run with what's already in flash, don't look at
> > >    disk);
> > > 
> > > Separate bool parameter 'fw_flash_auto_update' would decide whether the
> > > selected FW should be flashed to the device (always true for you AFAIU).
> > > 
> > > Let me know if that makes sense, it would be nice if we could converge
> > > on a common solution, or at least name our parameters sufficiently
> > > distinctly to avoid confusion :)  
> > 
> > I think that the above scheme is a bit too complicated and I'm not sure
> > this is warranted. I'll try to better explain the motivation for this
> > parameter and where we are coming from.
> 
> Certainly, let me know if what I wrote above helps to understand the
> motivation.

Yes, it does. Thanks!

> 
> > We want to keep things as simple as possible. This means we don't want
> > users to fiddle with devlink parameter unless they have to. Things
> > should just work.
> 
> 100% agree, you can choose the default for the parameter to be whatever
> you want, nobody will have to touch it in normal operation.
> 
> I'm just proposing widening the values of the parameter so it works for
> others (given you propose it as generic).
> 
> > This parameter should only be used in exceptional cases.
> > 
> > For example, when user reports a problem with current firmware version
> > enforced by the driver. Assuming we have a new firmware version with a
> > fix, we would like the user to try it and confirm bug was fixed.
> > Ideally, the user would do something like this:
> > 
> > 1. Flash new firmware via ethtool
> > 2. Perform a reset via devlink to have changes take effect
> > 
> > Problem is that after the reset the driver's init sequence will run and
> > overwrite the new firmware version with the one specified in its source
> > as a compatible version. The driver needs to enforce a specific version
> > because newer versions are not necessarily backward compatible.
> > 
> > Therefore, we added this new parameter that gives the user the ability
> > to explicitly run with a different version than what was specified as
> > compatible. New sequence is therefore:
> > 
> > 1. Flash new firmware via ethtool
> > 2. Toggle devlink parameter
> > 3. Perform a reset via devlink to have changes take effect
> > 
> > Firmware loading policy is basically always go with what the driver is
> > enforcing (it knows best), unless user specified he/she knows better.
> 
> Thanks, got it.
> 
> > I think this is both generic and simple, but I possibly didn't
> > understand the full scope of your use cases.
> 
> I agree, it is simple, but the semantics of the parameter you're adding
> unnecessarily involve firmware flashing, which is mlxsw quirk.  My
> impression (mostly from my WiFi driver days) is that the FW is either in
> flash or in /lib/firmware, but rarely /lib/firmware autofeeds the flash.
> 
> The commit message says "Many drivers checking the device's firmware
> version during the initialization flow and flashing a compatible
> version if the current version is not."  Do Ethernet drivers do that 
> or some other drivers?

Yes, the terminology is not accurate. The intention was that drivers
load (we used flashing...) a compatible firmware version during their
initialization routine.

> To reiterate I think the definition of the flag unnecessarily involves
> flashing.  It may make sense to mlxsw users, but I'd postulate it won't
> to almost everyone else :)
> 
> Therefore AFAICS we could make one of two improvements:
>  - make this a full-fledged flash update policy with clear operational
>    semantics which others can reuse; or 
>  - remove the flashing part and leave it unmentioned - definition of the
>    parameter becomes in a nutshell "ignore all FW version
>    incompatibilities"; the limitation of mlxsw having to flash a FW
>    image to load it is then an implementation detail.

Discussed the topic with Jakub during the bi-weekly switchdev call. We
will submit a v2 with a parameter called 'fw_load_policy' that will have
these options:

* driver: Load firmware version preferred by the driver (default in 
  mlxsw)
* flash: Load firmware currently stored in flash

Therefore, new sequence is:

1. Flash new firmware via ethtool
2. Set devlink 'fw_load_policy' parameter to 'flash'
3. Perform a reset via devlink to have changes take effect

The parameter can be later extended with more options such as 'newest'
and 'disk' that Jakub mentioned earlier in the thread.
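
As a rough illustration of the two options, the decision a driver makes at init time could look like the sketch below. The enum and function names are hypothetical and not taken from the planned v2, and the plain string inequality used for "the flashed version is not the one the driver wants" is an assumption made for the example.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical names; the identifiers used in the actual patch may differ. */
enum fw_load_policy {
	FW_LOAD_POLICY_DRIVER, /* load the version preferred by the driver */
	FW_LOAD_POLICY_FLASH,  /* load whatever is currently stored in flash */
};

/*
 * Decide whether the driver should overwrite the flashed firmware
 * during its init sequence.
 */
static bool fw_should_reflash(enum fw_load_policy policy,
			      const char *flashed_ver,
			      const char *driver_ver)
{
	/* The user explicitly asked to run with the flashed image. */
	if (policy == FW_LOAD_POLICY_FLASH)
		return false;

	/* Driver policy: enforce the version the driver was built for. */
	return strcmp(flashed_ver, driver_ver) != 0;
}
```

Further values such as 'newest' or 'disk' would become additional enum members with their own branch, without changing the meaning of the two existing ones.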

We will not add the 'fw_flash_auto_update' parameter as it is not needed
for mlxsw, but can be added for nfp/others in the future.

Jakub, thanks again for your time!

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-08 16:25 UTC
  To: David Ahern, Jesper Dangaard Brouer; +Cc: netdev, Yoel Caspersen
In-Reply-To: <6165513d-1e27-31dc-8f94-9de029a73f93@gmail.com>



On 08.11.2018 at 17:06, David Ahern wrote:
> On 11/8/18 6:33 AM, Paweł Staszewski wrote:
>>
>> On 07.11.2018 at 22:06, David Ahern wrote:
>>> On 11/3/18 6:24 PM, Paweł Staszewski wrote:
>>>>> Does your setup have any other device types besides physical ports with
>>>>> VLANs (e.g., any macvlans or bonds)?
>>>>>
>>>>>
>>>> no.
>>>> just
>>>> phy(mlnx)->vlans only config
>>> VLAN and non-VLAN (and a mix) seem to work ok. Patches are here:
>>>      https://github.com/dsahern/linux.git bpf/kernel-tables-wip
>>>
>>> I got lazy with the vlan exports; right now it requires 8021q to be
>>> builtin (CONFIG_VLAN_8021Q=y)
>>>
>>> You can use the xdp_fwd sample:
>>>     make O=kbuild -C samples/bpf -j 8
>>>
>>> Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server
>>> and run:
>>>      ./xdp_fwd <list of NIC ports>
>>>
>>> e.g., in my testing I run:
>>>      xdp_fwd eth1 eth2 eth3 eth4
>>>
>>> All of the relevant forwarding ports need to be on the same command
>>> line. This version populates a second map to verify the egress port has
>>> XDP enabled.
>> Installed today on some lab server with mellanox connectx4
>>
>> And trying some simple static routing first - but after enabling xdp
>> program - receiver is not receiving frames
>>
>> Route table is as simple as possible for tests :)
>>
>> icmp ping test sent from 192.168.22.237 to 172.16.0.2 - incoming
>> packets on vlan 4081
>>
>> ip r
>> default via 192.168.22.236 dev vlan4081
>> 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1
>> 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205
>>
>> neigh table:
>> ip neigh ls
>>
>> 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE
>> 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE
>>
>> and interfaces:
>> 4: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state
>> UP mode DEFAULT group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>> 5: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state
>> UP mode DEFAULT group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>> 6: vlan4081@enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
>> noqueue state UP mode DEFAULT group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>> 7: vlan1740@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
>> noqueue state UP mode DEFAULT group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>
>> 5: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:5 qdisc
>> mq state UP group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>      inet6 fe80::ae1f:6bff:fe07:c891/64 scope link
>>         valid_lft forever preferred_lft forever
>> 6: vlan4081@enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
>> noqueue state UP group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>>      inet 192.168.22.205/24 scope global vlan4081
>>         valid_lft forever preferred_lft forever
>>      inet6 fe80::ae1f:6bff:fe07:c890/64 scope link
>>         valid_lft forever preferred_lft forever
>> 7: vlan1740@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
>> noqueue state UP group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>      inet 172.16.0.1/30 scope global vlan1740
>>         valid_lft forever preferred_lft forever
>>      inet6 fe80::ae1f:6bff:fe07:c891/64 scope link
>>         valid_lft forever preferred_lft forever
>>
>>
>> xdp program detached:
>> Receiving side tcpdump:
>> 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id
>> 30227, seq 487, length 64
>>
>> I can see icmp requests
>>
>>
>> enabling xdp
>> ./xdp_fwd enp175s0f1 enp175s0f0
>>
>> 4: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq
>> state UP mode DEFAULT group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>>      prog/xdp id 5 tag 3c231ff1e5e77f3f
>> 5: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq
>> state UP mode DEFAULT group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>      prog/xdp id 5 tag 3c231ff1e5e77f3f
>> 6: vlan4081@enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
>> noqueue state UP mode DEFAULT group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>> 7: vlan1740@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
>> noqueue state UP mode DEFAULT group default qlen 1000
>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>
> What hardware is this?
>
> Start with:
>
> echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
> cat /sys/kernel/debug/tracing/trace_pipe
  cat /sys/kernel/debug/tracing/trace_pipe
          <idle>-0     [045] ..s. 68469.467752: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
          <idle>-0     [045] ..s. 68470.483836: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5
          <idle>-0     [045] ..s. 68470.483837: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
          <idle>-0     [045] ..s. 68471.503853: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5
          <idle>-0     [045] ..s. 68471.503853: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
          <idle>-0     [045] ..s. 68472.527871: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5
          <idle>-0     [045] ..s. 68472.527877: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
          <idle>-0     [045] ..s. 68473.551876: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5
          <idle>-0     [045] ..s. 68473.551880: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
          <idle>-0     [045] ..s. 68474.575893: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5
          <idle>-0     [045] ..s. 68474.575897: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
          <idle>-0     [045] ..s. 68475.599909: xdp_redirect_map: prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 map_index=5
          <idle>-0     [045] ..s. 68475.599912: xdp_devmap_xmit: ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 from_ifindex=4 to_ifindex=5 err=-6
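
To read the counters out of trace lines like the ones above, a small helper along these lines may be handy. It is only a sketch: `trace_field` is a hypothetical name, and it assumes the `key=value` layout seen in this output. On Linux, errno 6 is ENXIO ("No such device or address"); the trace alone does not say why the transmit path returns it here.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the integer that follows `key` (for example "err=") in one
 * trace line. The first occurrence of `key` wins, so pick keys that are
 * unique in the line ("err=", "drops=", "sent=").
 * Returns 0 on success, -1 if the key is missing or not followed by
 * a number.
 */
static int trace_field(const char *line, const char *key, int *val)
{
	const char *p = strstr(line, key);

	if (!p)
		return -1;

	return sscanf(p + strlen(key), "%d", val) == 1 ? 0 : -1;
}
```

Passing the negated `err` value to `strerror()` turns it into readable text, and summing `drops=` over all xdp_devmap_xmit lines gives the total number of frames lost on redirect.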



>
> From there, you can check the FIB lookups:
> sysctl -w kernel.perf_event_max_stack=16
> perf record -e fib:* -a -g -- sleep 5
> perf script
>
swapper     0 [045] 68493.746274: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])

swapper     0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])

swapper     0 [045] 68495.794304: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])

swapper     0 [045] 68496.818308: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])

swapper     0 [045] 68497.842313: fib:fib_table_lookup: table 254 oif 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
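
The lookups above resolve 172.16.0.2 to vlan1740, which is what a longest-prefix match over the three routes from `ip r` gives. The sketch below reproduces that result with a plain linear scan; this is a deliberately simple stand-in, not how the kernel's FIB is implemented, and `toy_route`, `ip4` and `toy_lookup` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route entry, addresses kept in host byte order. */
struct toy_route {
	uint32_t prefix;  /* network address */
	unsigned int len; /* prefix length in bits, 0..32 */
	const char *dev;  /* egress device name */
};

/* Build an IPv4 address from its four octets. */
static uint32_t ip4(unsigned int a, unsigned int b, unsigned int c,
		    unsigned int d)
{
	return (uint32_t)((a << 24) | (b << 16) | (c << 8) | d);
}

/* Return the device of the most specific matching route, or NULL. */
static const char *toy_lookup(const struct toy_route *tbl, size_t n,
			      uint32_t dst)
{
	const char *best = NULL;
	int best_len = -1;

	for (size_t i = 0; i < n; i++) {
		/* A /0 route needs an all-zero mask; shifting by 32 is undefined. */
		uint32_t mask = tbl[i].len ?
				~(uint32_t)0 << (32 - tbl[i].len) : 0;

		if ((dst & mask) == tbl[i].prefix &&
		    (int)tbl[i].len > best_len) {
			best = tbl[i].dev;
			best_len = (int)tbl[i].len;
		}
	}
	return best;
}
```

Filled with the default route, 172.16.0.0/30 on vlan1740 and 192.168.22.0/24 on vlan4081, the function picks vlan1740 for 172.16.0.2, matching the perf output.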

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-08 16:27 UTC
  To: David Ahern, Jesper Dangaard Brouer; +Cc: netdev, Yoel Caspersen
In-Reply-To: <11199f9f-da21-527b-f5db-0bbf1e448a8b@itcare.pl>



On 08.11.2018 at 17:25, Paweł Staszewski wrote:
>
>
> On 08.11.2018 at 17:06, David Ahern wrote:
>> On 11/8/18 6:33 AM, Paweł Staszewski wrote:
>>>
>>> On 07.11.2018 at 22:06, David Ahern wrote:
>>>> On 11/3/18 6:24 PM, Paweł Staszewski wrote:
>>>>>> Does your setup have any other device types besides physical ports with
>>>>>> VLANs (e.g., any macvlans or bonds)?
>>>>>>
>>>>>>
>>>>> no.
>>>>> just
>>>>> phy(mlnx)->vlans only config
>>>> VLAN and non-VLAN (and a mix) seem to work ok. Patches are here:
>>>>      https://github.com/dsahern/linux.git bpf/kernel-tables-wip
>>>>
>>>> I got lazy with the vlan exports; right now it requires 8021q to be
>>>> builtin (CONFIG_VLAN_8021Q=y)
>>>>
>>>> You can use the xdp_fwd sample:
>>>>     make O=kbuild -C samples/bpf -j 8
>>>>
>>>> Copy samples/bpf/xdp_fwd_kern.o and samples/bpf/xdp_fwd to the server
>>>> and run:
>>>>      ./xdp_fwd <list of NIC ports>
>>>>
>>>> e.g., in my testing I run:
>>>>      xdp_fwd eth1 eth2 eth3 eth4
>>>>
>>>> All of the relevant forwarding ports need to be on the same command
>>>> line. This version populates a second map to verify the egress port has
>>>> XDP enabled.
>>> Installed today on some lab server with mellanox connectx4
>>>
>>> And trying some simple static routing first - but after enabling xdp
>>> program - receiver is not receiving frames
>>>
>>> Route table is as simple as possible for tests :)
>>>
>>> icmp ping test sent from 192.168.22.237 to 172.16.0.2 - incoming
>>> packets on vlan 4081
>>>
>>> ip r
>>> default via 192.168.22.236 dev vlan4081
>>> 172.16.0.0/30 dev vlan1740 proto kernel scope link src 172.16.0.1
>>> 192.168.22.0/24 dev vlan4081 proto kernel scope link src 192.168.22.205
>>>
>>> neigh table:
>>> ip neigh ls
>>>
>>> 192.168.22.237 dev vlan4081 lladdr 00:25:90:fb:a6:8d REACHABLE
>>> 172.16.0.2 dev vlan1740 lladdr ac:1f:6b:2c:2e:5a REACHABLE
>>>
>>> and interfaces:
>>> 4: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq 
>>> state
>>> UP mode DEFAULT group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>>> 5: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq 
>>> state
>>> UP mode DEFAULT group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>> 6: vlan4081@enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
>>> qdisc
>>> noqueue state UP mode DEFAULT group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>>> 7: vlan1740@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
>>> qdisc
>>> noqueue state UP mode DEFAULT group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>>
>>> 5: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp/id:5 
>>> qdisc
>>> mq state UP group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>>      inet6 fe80::ae1f:6bff:fe07:c891/64 scope link
>>>         valid_lft forever preferred_lft forever
>>> 6: vlan4081@enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
>>> qdisc
>>> noqueue state UP group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>>>      inet 192.168.22.205/24 scope global vlan4081
>>>         valid_lft forever preferred_lft forever
>>>      inet6 fe80::ae1f:6bff:fe07:c890/64 scope link
>>>         valid_lft forever preferred_lft forever
>>> 7: vlan1740@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
>>> qdisc
>>> noqueue state UP group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>>      inet 172.16.0.1/30 scope global vlan1740
>>>         valid_lft forever preferred_lft forever
>>>      inet6 fe80::ae1f:6bff:fe07:c891/64 scope link
>>>         valid_lft forever preferred_lft forever
>>>
>>>
>>> xdp program detached:
>>> Receiving side tcpdump:
>>> 14:28:09.141233 IP 192.168.22.237 > 172.16.0.2: ICMP echo request, id
>>> 30227, seq 487, length 64
>>>
>>> I can see icmp requests
>>>
>>>
>>> enabling xdp
>>> ./xdp_fwd enp175s0f1 enp175s0f0
>>>
>>> 4: enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq
>>> state UP mode DEFAULT group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>>>      prog/xdp id 5 tag 3c231ff1e5e77f3f
>>> 5: enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq
>>> state UP mode DEFAULT group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>>      prog/xdp id 5 tag 3c231ff1e5e77f3f
>>> 6: vlan4081@enp175s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
>>> qdisc
>>> noqueue state UP mode DEFAULT group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:90 brd ff:ff:ff:ff:ff:ff
>>> 7: vlan1740@enp175s0f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 
>>> qdisc
>>> noqueue state UP mode DEFAULT group default qlen 1000
>>>      link/ether ac:1f:6b:07:c8:91 brd ff:ff:ff:ff:ff:ff
>>>
>> What hardware is this?
>>
mellanox connectx 4
ethtool -i enp175s0f0
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_2001000001033)
expansion-rom-version:
bus-info: 0000:af:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

ethtool -i enp175s0f1
driver: mlx5_core
version: 5.0-0
firmware-version: 12.21.1000 (SM_2001000001033)
expansion-rom-version:
bus-info: 0000:af:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

>> Start with:
>>
>> echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
>> cat /sys/kernel/debug/tracing/trace_pipe
>  cat /sys/kernel/debug/tracing/trace_pipe
>          <idle>-0     [045] ..s. 68469.467752: xdp_devmap_xmit: 
> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 
> from_ifindex=4 to_ifindex=5 err=-6
>           <idle>-0     [045] ..s. 68470.483836: xdp_redirect_map: 
> prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 
> map_index=5
>           <idle>-0     [045] ..s. 68470.483837: xdp_devmap_xmit: 
> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 
> from_ifindex=4 to_ifindex=5 err=-6
>           <idle>-0     [045] ..s. 68471.503853: xdp_redirect_map: 
> prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 
> map_index=5
>           <idle>-0     [045] ..s. 68471.503853: xdp_devmap_xmit: 
> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 
> from_ifindex=4 to_ifindex=5 err=-6
>           <idle>-0     [045] ..s. 68472.527871: xdp_redirect_map: 
> prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 
> map_index=5
>           <idle>-0     [045] ..s. 68472.527877: xdp_devmap_xmit: 
> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 
> from_ifindex=4 to_ifindex=5 err=-6
>           <idle>-0     [045] ..s. 68473.551876: xdp_redirect_map: 
> prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 
> map_index=5
>           <idle>-0     [045] ..s. 68473.551880: xdp_devmap_xmit: 
> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 
> from_ifindex=4 to_ifindex=5 err=-6
>           <idle>-0     [045] ..s. 68474.575893: xdp_redirect_map: 
> prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 
> map_index=5
>           <idle>-0     [045] ..s. 68474.575897: xdp_devmap_xmit: 
> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 
> from_ifindex=4 to_ifindex=5 err=-6
>           <idle>-0     [045] ..s. 68475.599909: xdp_redirect_map: 
> prog_id=30 action=REDIRECT ifindex=4 to_ifindex=5 err=0 map_id=32 
> map_index=5
>           <idle>-0     [045] ..s. 68475.599912: xdp_devmap_xmit: 
> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1 
> from_ifindex=4 to_ifindex=5 err=-6
>
>
>
>>
>> From there, you can check the FIB lookups:
>> sysctl -w kernel.perf_event_max_stack=16
>> perf record -e fib:* -a -g -- sleep 5
>> perf script
>>
> swapper     0 [045] 68493.746274: fib:fib_table_lookup: table 254 oif 
> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 
> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>
> swapper     0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif 
> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 
> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>
> swapper     0 [045] 68495.794304: fib:fib_table_lookup: table 254 oif 
> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 
> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>
> swapper     0 [045] 68496.818308: fib:fib_table_lookup: table 254 oif 
> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 
> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>
> swapper     0 [045] 68497.842313: fib:fib_table_lookup: table 254 oif 
> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 
> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>
>

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: David Ahern @ 2018-11-08 16:32 UTC (permalink / raw)
  To: Paweł Staszewski, Jesper Dangaard Brouer; +Cc: netdev, Yoel Caspersen
In-Reply-To: <87a2a15c-f9bf-743b-b4c5-7d37da0bd887@itcare.pl>

On 11/8/18 9:27 AM, Paweł Staszewski wrote:
>>> What hardware is this?
>>>
> mellanox connectx 4
> ethtool -i enp175s0f0
> driver: mlx5_core
> version: 5.0-0
> firmware-version: 12.21.1000 (SM_2001000001033)
> expansion-rom-version:
> bus-info: 0000:af:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: no
> supports-register-dump: no
> supports-priv-flags: yes
> 
> ethtool -i enp175s0f1
> driver: mlx5_core
> version: 5.0-0
> firmware-version: 12.21.1000 (SM_2001000001033)
> expansion-rom-version:
> bus-info: 0000:af:00.1
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: no
> supports-register-dump: no
> supports-priv-flags: yes
> 
>>> Start with:
>>>
>>> echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
>>> cat /sys/kernel/debug/tracing/trace_pipe
>>  cat /sys/kernel/debug/tracing/trace_pipe
>>          <idle>-0     [045] ..s. 68469.467752: xdp_devmap_xmit:
>> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1
>> from_ifindex=4 to_ifindex=5 err=-6

FIB lookup is good, the redirect is happening, but the mlx5 driver does
not like it.

I think the -6 is coming from the mlx5 driver and the packet is getting
dropped. Perhaps this check in mlx5e_xdp_xmit:

       if (unlikely(sq_num >= priv->channels.num))
                return -ENXIO;
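
As a quick sanity check on the number itself: the err=-6 in the trace
matches ENXIO on Linux. The sketch below is only an illustration; the
helper names xmit_err_name() and xmit_err_text() are made up here and
are not part of the kernel or the mlx5 driver.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Map a kernel-style return code, as printed by the xdp_devmap_xmit
 * tracepoint, to a symbolic name. Only a few codes are listed; the
 * helper is hypothetical and exists purely for this example.
 */
static const char *xmit_err_name(int err)
{
	assert(err <= 0);	/* the tracepoint prints 0 or a negative errno */

	switch (-err) {
	case 0:
		return "OK";
	case ENXIO:
		return "ENXIO";
	case ENOSPC:
		return "ENOSPC";
	case EINVAL:
		return "EINVAL";
	default:
		return "unknown";
	}
}

/* Human-readable text for the same code, taken from the C library. */
static const char *xmit_err_text(int err)
{
	return strerror(-err);
}
```

Calling xmit_err_name(-6) gives "ENXIO", which is consistent with the
sq_num check above being the source, although other paths could return
the same code.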



>> swapper     0 [045] 68493.746274: fib:fib_table_lookup: table 254 oif
>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>>
>> swapper     0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif
>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>>
>> swapper     0 [045] 68495.794304: fib:fib_table_lookup: table 254 oif
>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>>
>> swapper     0 [045] 68496.818308: fib:fib_table_lookup: table 254 oif
>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>>
>> swapper     0 [045] 68497.842313: fib:fib_table_lookup: table 254 oif
>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])

^ permalink raw reply

* [PATCH 3/3] ath6kl: Use debug instead of error message when disabled
From: Kyle Roeschley @ 2018-11-08 16:36 UTC (permalink / raw)
  To: Kalle Valo
  Cc: David S . Miller, linux-wireless, netdev, linux-kernel,
	Kyle Roeschley
In-Reply-To: <20181108163659.19535-1-kyle.roeschley@ni.com>

This is not an unexpected condition, so we don't need to be shouting to the
world about it.

Signed-off-by: Kyle Roeschley <kyle.roeschley@ni.com>
---
 drivers/net/wireless/ath/ath6kl/cfg80211.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/ath/ath6kl/cfg80211.c b/drivers/net/wireless/ath/ath6kl/cfg80211.c
index d7c626d9594e..59dd50866932 100644
--- a/drivers/net/wireless/ath/ath6kl/cfg80211.c
+++ b/drivers/net/wireless/ath/ath6kl/cfg80211.c
@@ -291,7 +291,7 @@ static bool ath6kl_cfg80211_ready(struct ath6kl_vif *vif)
 	}
 
 	if (!test_bit(WLAN_ENABLED, &vif->flags)) {
-		ath6kl_err("wlan disabled\n");
+		ath6kl_dbg(ATH6KL_DBG_WLAN_CFG, "wlan disabled\n");
 		return false;
 	}
 
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH net-next v3 6/6] net/ncsi: Configure multi-package, multi-channel modes with failover
From: Samuel Mendoza-Jonas @ 2018-11-09  2:17 UTC (permalink / raw)
  To: Justin.Lee1, netdev; +Cc: davem, linux-kernel, openbmc
In-Reply-To: <f9d0309dafe248a6aac8198f96f9bc4b@AUSX13MPS306.AMER.DELL.COM>

On Thu, 2018-11-08 at 22:48 +0000, Justin.Lee1@Dell.com wrote:
> Hi Samuel,
> 
> For the multi-package and multi-channel case, the channel seems to be selected correctly. Except that,
> I still see the timing issue for back-to-back netlink commands. Because of that, a channel might be
> set to the invisible state. Please refer to ncsi0 and ncsi2 below. The channel state is set to 3.
> 
> cat /sys/kernel/debug/ncsi_protocol/ncsi_device_status
> IFIDX IFNAME NAME   PID CID RX TX MP MC WP WC PC CS PS LS RU CR NQ HA
> =====================================================================
>   2   eth2   ncsi0  000 000 1  1  1  1  1  0  0  3  0  1  1  1  0  1
>   2   eth2   ncsi1  000 001 0  0  1  1  1  0  0  1  0  1  1  1  0  1
>   2   eth2   ncsi2  001 000 1  0  1  1  1  1  0  3  0  1  1  1  0  1
>   2   eth2   ncsi3  001 001 1  0  1  1  1  1  0  2  1  1  1  1  0  1
> =====================================================================
> MP: Multi-mode Package  WP: Whitelist Package
> MC: Multi-mode Channel  WC: Whitelist Channel
> PC: Primary Channel     CS: Channel State IA/A/IV 1/2/3
> PS: Poll Status         LS: Link Status
> RU: Running             CR: Carrier OK
> NQ: Queue Stopped       HA: Hardware Arbitration
> 
> The timing issue does not happen only in the application. If I send the requests in the following
> way, I can see the issue as well.
> 
> ncsi_netlink -l 2 -a 0x01 -m; ncsi_netlink -l 2 -p 0 -b 0x03 -m; ncsi_netlink -l 2 -p 1 -b 0x00 -m;
> ncsi_netlink -l 2 -a 0x03 -m; ncsi_netlink -l 2 -p 0 -b 0x00 -m; ncsi_netlink -l 2 -p 1 -b 0x03 -m;

This actually reproduces for me as well; I see now what you mean about
channels getting stuck in the invisible state. I believe I've narrowed
down the issue. I've pasted an additional patch below if you are able to
test on your machine.

> 
> 
> Also, there is one issue below for non-multi-package/non-multi-channel case.
> 
> Thanks,
> Justin
> 
> 
> > @@ -1008,32 +1164,49 @@ static int ncsi_choose_active_channel(struct ncsi_dev_priv *ndp)
> >  
> >  			ncm = &nc->modes[NCSI_MODE_LINK];
> >  			if (ncm->data[2] & 0x1) {
> > -				spin_unlock_irqrestore(&nc->lock, flags);
> >  				found = nc;
> > -				goto out;
> > +				with_link = true;
> >  			}
> >  
> > -			spin_unlock_irqrestore(&nc->lock, flags);
> > +			/* If multi_channel is enabled configure all valid
> > +			 * channels whether or not they currently have link
> > +			 * so they will have AENs enabled.
> > +			 */
> > +			if (with_link || np->multi_channel) {
> > +				spin_lock_irqsave(&ndp->lock, flags);
> > +				list_add_tail_rcu(&nc->link,
> > +						  &ndp->channel_queue);
> > +				spin_unlock_irqrestore(&ndp->lock, flags);
> > +
> > +				netdev_dbg(ndp->ndev.dev,
> > +					   "NCSI: Channel %u added to queue (link %s)\n",
> > +					   nc->id,
> > +					   ncm->data[2] & 0x1 ? "up" : "down");
> > +			}
> > +
> > +			spin_unlock_irqrestore(&nc->lock, cflags);
> > +
> > +			if (with_link && !np->multi_channel)
> > +				break;
> 
> The line needs to change to "goto found". If not, all channels with link will be added
> even if the multi-channel is not enabled for that package. The ncsi1 below is enabled.
> There is no netlink command sent to enable multi-package or multi-channel.
> 
> IFIDX IFNAME NAME   PID CID RX TX MP MC WP WC PC CS PS LS RU CR NQ HA
> =====================================================================
>   2   eth2   ncsi0  000 000 1  1  0  0  1  1  0  2  1  1  1  1  0  1
>   2   eth2   ncsi1  000 001 1  0  0  0  1  1  0  2  1  1  1  1  0  1
>   2   eth2   ncsi2  001 000 0  0  0  0  1  1  0  1  0  1  1  1  0  1
>   2   eth2   ncsi3  001 001 0  0  0  0  1  1  0  1  0  1  1  1  0  1
> =====================================================================
> MP: Multi-mode Package  WP: Whitelist Package
> MC: Multi-mode Channel  WC: Whitelist Channel
> PC: Primary Channel     CS: Channel State IA/A/IV 1/2/3
> PS: Poll Status         LS: Link Status
> RU: Running             CR: Carrier OK
> NQ: Queue Stopped       HA: Hardware Arbitration
> 
> >  		}
> > +		if (with_link && !ndp->multi_package)
> > +			break;
> >  	}
> 
> found:

This *may* be part of the above issue; I don't see this in normal
operation. The combination of (with_link && !np->multi_channel) and
(with_link && !ndp->multi_package) should prevent additional channels
being added without the need for 'goto found'. Please let me know if you
still see it with the extra patch.
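
To make the reasoning about the two break conditions concrete, here is
a minimal userspace sketch of the loop structure. It is not the kernel
code: the struct layout and choose_channels() are hypothetical, locking
and the whitelist checks are left out, and it assumes with_link is
initialised once before the outer loop (that part is not visible in the
quoted hunk).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHANNELS 4

/* Hypothetical stand-ins for struct ncsi_channel / struct ncsi_package. */
struct channel {
	bool has_link;
	bool queued;
};

struct package {
	bool multi_channel;
	size_t n_channels;
	struct channel ch[MAX_CHANNELS];
};

/* Walk packages and channels, queue candidates, return how many were queued. */
static size_t choose_channels(struct package *pkgs, size_t n_pkgs,
			      bool multi_package)
{
	bool with_link = false;	/* assumption: set once, never reset */
	size_t queued = 0;

	for (size_t p = 0; p < n_pkgs; p++) {
		struct package *np = &pkgs[p];

		assert(np->n_channels <= MAX_CHANNELS);
		for (size_t c = 0; c < np->n_channels; c++) {
			struct channel *nc = &np->ch[c];

			if (nc->has_link)
				with_link = true;

			if (with_link || np->multi_channel) {
				nc->queued = true;
				queued++;
			}

			/* Single-channel package: stop at the first link. */
			if (with_link && !np->multi_channel)
				break;
		}
		/* Single-package mode: do not look at further packages. */
		if (with_link && !multi_package)
			break;
	}
	return queued;
}
```

With every link up and both multi modes off, the sketch queues exactly
one channel. With multi_channel set on the first package only, it queues
both channels of that package and then stops.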

> 
> After applying this change, I notice that if there is no link available to BMC when BMC
> starts, NC-SI can't properly configure channel once I plug in the Ethernet cable. 
> 
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_aen_handler_lsc() - pkg 0 ch 0 state up
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_aen_handler_lsc() - had_link 0, has_link 1, chained 0
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_stop_channel_monitor() - pkg 0 ch 0
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel() - pkg 0 ch 0 INVISIBLE
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel() - suspending pkg 0 ch 0
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0400 select
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0403 dc
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0404 deselect
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0405 done
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_rsp_handler_dp() - pkg 0 ch 0 INACTIVE
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_rsp_handler_dp() - pkg 0 ch 1 INACTIVE
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0406 deselect
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 INACTIVE
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel() - No more channels to process
> npcm7xx-emc f0825000.eth eth2: NCSI interface down

Good find; there was a corner case in the LSC AEN handler changes that
led to this, and I've fixed it in the patch as well. Thanks for testing!


From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Date: Fri, 9 Nov 2018 13:11:03 +1100
Subject: [PATCH] net/ncsi: Reset state fixes, single-channel LSC

---
 net/ncsi/ncsi-aen.c    |  8 +++++---
 net/ncsi/ncsi-manage.c | 19 +++++++++++++++----
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/net/ncsi/ncsi-aen.c b/net/ncsi/ncsi-aen.c
index 39c2e9eea2ba..034cb1dc5566 100644
--- a/net/ncsi/ncsi-aen.c
+++ b/net/ncsi/ncsi-aen.c
@@ -93,14 +93,16 @@ static int ncsi_aen_handler_lsc(struct ncsi_dev_priv *ndp,
 	if ((had_link == has_link) || chained)
 		return 0;
 
-	if (!ndp->multi_package && !nc->package->multi_channel) {
-		if (had_link)
-			ndp->flags |= NCSI_DEV_RESHUFFLE;
+	if (!ndp->multi_package && !nc->package->multi_channel && had_link) {
+		ndp->flags |= NCSI_DEV_RESHUFFLE;
 		ncsi_stop_channel_monitor(nc);
 		spin_lock_irqsave(&ndp->lock, flags);
 		list_add_tail_rcu(&nc->link, &ndp->channel_queue);
 		spin_unlock_irqrestore(&ndp->lock, flags);
 		return ncsi_process_next_channel(ndp);
+	} else {
+		/* Configured channel came up */
+		return 0;
 	}
 
 	if (had_link) {
diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index fa3c2144f5ba..92e59f07f9a7 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -1063,17 +1063,17 @@ static void ncsi_configure_channel(struct ncsi_dev_priv *ndp)
 	case ncsi_dev_state_config_done:
 		netdev_dbg(ndp->ndev.dev, "NCSI: channel %u config done\n",
 			   nc->id);
+		spin_lock_irqsave(&nc->lock, flags);
+		nc->state = NCSI_CHANNEL_ACTIVE;
+
 		if (ndp->flags & NCSI_DEV_RESET) {
 			/* A reset event happened during config, start it now */
-			spin_lock_irqsave(&nc->lock, flags);
 			nc->reconfigure_needed = false;
 			spin_unlock_irqrestore(&nc->lock, flags);
-			nd->state = ncsi_dev_state_functional;
 			ncsi_reset_dev(nd);
 			break;
 		}
 
-		spin_lock_irqsave(&nc->lock, flags);
 		if (nc->reconfigure_needed) {
 			/* This channel's configuration has been updated
 			 * part-way during the config state - start the
@@ -1092,7 +1092,6 @@ static void ncsi_configure_channel(struct ncsi_dev_priv *ndp)
 			break;
 		}
 
-		nc->state = NCSI_CHANNEL_ACTIVE;
 		if (nc->modes[NCSI_MODE_LINK].data[2] & 0x1) {
 			hot_nc = nc;
 		} else {
@@ -1803,6 +1802,18 @@ int ncsi_reset_dev(struct ncsi_dev *nd)
 			spin_unlock_irqrestore(&ndp->lock, flags);
 			return 0;
 		}
+	} else {
+		switch (nd->state) {
+		case ncsi_dev_state_suspend_done:
+		case ncsi_dev_state_config_done:
+		case ncsi_dev_state_functional:
+			/* Ok */
+			break;
+		default:
+			/* Current reset operation happening */
+			spin_unlock_irqrestore(&ndp->lock, flags);
+			return 0;
+		}
 	}
 
 	if (!list_empty(&ndp->channel_queue)) {
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH net-next v2 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-11-09  2:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tiwei Bie, virtualization, linux-kernel, netdev, virtio-dev, wexu,
	jfreimann
In-Reply-To: <20181108091337-mutt-send-email-mst@kernel.org>


On 2018/11/8 下午10:14, Michael S. Tsirkin wrote:
> On Thu, Nov 08, 2018 at 04:18:25PM +0800, Jason Wang wrote:
>> On 2018/11/8 上午9:38, Tiwei Bie wrote:
>>>>> +
>>>>> +	if (vq->vq.num_free < descs_used) {
>>>>> +		pr_debug("Can't add buf len %i - avail = %i\n",
>>>>> +			 descs_used, vq->vq.num_free);
>>>>> +		/* FIXME: for historical reasons, we force a notify here if
>>>>> +		 * there are outgoing parts to the buffer.  Presumably the
>>>>> +		 * host should service the ring ASAP. */
>>>> I don't think we have a reason to do this for packed ring.
>>>> No historical baggage there, right?
>>> Based on the original commit log, it seems that the notify here
>>> is just an "optimization". But I don't quite understand what
>>> "the heuristics which KVM uses" refers to. If it's safe to drop
>>> this in packed ring, I'd like to do it.
>>
>> According to the commit log, it seems like a workaround for the lguest
>> networking backend. I agree to drop it; we should not carry such a burden.
>>
>> But we should note that, with this removed, the comparison between packed
>> and split is kind of unfair.
> I don't think this ever triggers to be frank. When would it?


I think it can happen, e.g. in the XDP transmission path in
__virtnet_xdp_xmit_one():


         err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
         if (unlikely(err))
                 return -ENOSPC; /* Caller handle free/refcnt */


>
>> Considering the recent removal of lguest support,
>> maybe we can drop this for split ring as well?
>>
>> Thanks
> If it's helpful, then for sure we can drop it for virtio 1.
> Can you see any perf differences at all? With which device?


I haven't tested it, but consider the case of XDP_TX in the guest plus
vhost_net in the host. Since vhost_net is half duplex, it's pretty easy
to trigger this condition.
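
For reference, the ring-full path being discussed can be modelled in a
few lines of userspace C. This is only a sketch under assumptions:
mock_vq and mock_add_buf() are invented names, the kick counter stands
in for vq->notify(), and the force_notify_on_full flag models keeping
or dropping the historical hack.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for a virtqueue. */
struct mock_vq {
	unsigned int num_free;	/* free descriptors left in the ring */
	unsigned int kicks;	/* how often the host would have been notified */
};

/*
 * Add a buffer with 'out' driver-to-device and 'in' device-to-driver
 * parts. On a full ring, optionally kick the host if the buffer has
 * outgoing parts, then fail with -ENOSPC.
 */
static int mock_add_buf(struct mock_vq *vq, unsigned int out,
			unsigned int in, bool force_notify_on_full)
{
	unsigned int descs_used = out + in;

	assert(vq);
	if (vq->num_free < descs_used) {
		if (force_notify_on_full && out)
			vq->kicks++;
		return -ENOSPC;
	}

	vq->num_free -= descs_used;
	return 0;
}
```

In this model a transmit-only caller that keeps hitting a full ring
produces one kick per failed add when the flag is set and none when it
is cleared, which is the difference that would show up in a packed vs
split comparison.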

Thanks


>
>>> commit 44653eae1407f79dff6f52fcf594ae84cb165ec4
>>> Author: Rusty Russell<rusty@rustcorp.com.au>
>>> Date:   Fri Jul 25 12:06:04 2008 -0500
>>>
>>>       virtio: don't always force a notification when ring is full
>>>       We force notification when the ring is full, even if the host has
>>>       indicated it doesn't want to know.  This seemed like a good idea at
>>>       the time: if we fill the transmit ring, we should tell the host
>>>       immediately.
>>>       Unfortunately this logic also applies to the receiving ring, which is
>>>       refilled constantly.  We should introduce real notification thesholds
>>>       to replace this logic.  Meanwhile, removing the logic altogether breaks
>>>       the heuristics which KVM uses, so we use a hack: only notify if there are
>>>       outgoing parts of the new buffer.
>>>       Here are the number of exits with lguest's crappy network implementation:
>>>       Before:
>>>               network xmit 7859051 recv 236420
>>>       After:
>>>               network xmit 7858610 recv 118136
>>>       Signed-off-by: Rusty Russell<rusty@rustcorp.com.au>
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index 72bf8bc09014..21d9a62767af 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -87,8 +87,11 @@ static int vring_add_buf(struct virtqueue *_vq,
>>>    	if (vq->num_free < out + in) {
>>>    		pr_debug("Can't add buf len %i - avail = %i\n",
>>>    			 out + in, vq->num_free);
>>> -		/* We notify*even if*  VRING_USED_F_NO_NOTIFY is set here. */
>>> -		vq->notify(&vq->vq);
>>> +		/* FIXME: for historical reasons, we force a notify here if
>>> +		 * there are outgoing parts to the buffer.  Presumably the
>>> +		 * host should service the ring ASAP. */
>>> +		if (out)
>>> +			vq->notify(&vq->vq);
>>>    		END_USE(vq);
>>>    		return -ENOSPC;
>>>    	}
>>>
>>>

^ permalink raw reply

* Re: [PATCH net-next v2 3/5] virtio_ring: add packed ring support
From: Jason Wang @ 2018-11-09  2:30 UTC (permalink / raw)
  To: Michael S. Tsirkin, Tiwei Bie
  Cc: virtualization, linux-kernel, netdev, virtio-dev, wexu, jfreimann
In-Reply-To: <20181108103155-mutt-send-email-mst@kernel.org>


On 2018/11/8 下午11:56, Michael S. Tsirkin wrote:
> On Thu, Nov 08, 2018 at 07:51:48PM +0800, Tiwei Bie wrote:
>> On Thu, Nov 08, 2018 at 04:18:25PM +0800, Jason Wang wrote:
>>> On 2018/11/8 上午9:38, Tiwei Bie wrote:
>>>>>> +
>>>>>> +	if (vq->vq.num_free < descs_used) {
>>>>>> +		pr_debug("Can't add buf len %i - avail = %i\n",
>>>>>> +			 descs_used, vq->vq.num_free);
>>>>>> +		/* FIXME: for historical reasons, we force a notify here if
>>>>>> +		 * there are outgoing parts to the buffer.  Presumably the
>>>>>> +		 * host should service the ring ASAP. */
>>>>> I don't think we have a reason to do this for packed ring.
>>>>> No historical baggage there, right?
>>>> Based on the original commit log, it seems that the notify here
>>>> is just an "optimization". But I don't quite understand what
>>>> "the heuristics which KVM uses" refers to. If it's safe to drop
>>>> this in packed ring, I'd like to do it.
>>>
>>> According to the commit log, it seems like a workaround for the lguest
>>> networking backend.
>> Do you know why removing this notify in Tx will break "the
>> heuristics which KVM uses"? Or what does "the heuristics
>> which KVM uses" refer to?
> Yes. QEMU has a mode where it disables notifications and processes TX
> ring periodically from a timer.  It's off by default but used to be on
> by default a long time ago. If ring becomes full this causes traffic
> stalls.


Do you mean tx-timer? If so, we can still enable it for the packed ring;
the timer will eventually fire and we can proceed.


> As a work-around Rusty put in this hack to kick on ring full
> even with notifications disabled.


From the commit log it looks more like a performance workaround than
a bug fix.


> It's easy enough to make sure QEMU
> does not combine devices with packed ring support with the timer hack.
> And I am guessing it's safe enough to also block that option completely
> e.g. when virtio 1.0 is enabled.


I agree.

Thanks


>>> I agree to drop it; we should not carry such a burden.
>>>
>>> But we should note that, with this removed, the comparison between packed
>>> and split is kind of unfair. Considering the recent removal of lguest
>>> support, maybe we can drop this for split ring as well?
>>>
>>> Thanks
>>>
>>>
>>>> commit 44653eae1407f79dff6f52fcf594ae84cb165ec4
>>>> Author: Rusty Russell<rusty@rustcorp.com.au>
>>>> Date:   Fri Jul 25 12:06:04 2008 -0500
>>>>
>>>>       virtio: don't always force a notification when ring is full
>>>>       We force notification when the ring is full, even if the host has
>>>>       indicated it doesn't want to know.  This seemed like a good idea at
>>>>       the time: if we fill the transmit ring, we should tell the host
>>>>       immediately.
>>>>       Unfortunately this logic also applies to the receiving ring, which is
>>>>       refilled constantly.  We should introduce real notification thesholds
>>>>       to replace this logic.  Meanwhile, removing the logic altogether breaks
>>>>       the heuristics which KVM uses, so we use a hack: only notify if there are
>>>>       outgoing parts of the new buffer.
>>>>       Here are the number of exits with lguest's crappy network implementation:
>>>>       Before:
>>>>               network xmit 7859051 recv 236420
>>>>       After:
>>>>               network xmit 7858610 recv 118136
>>>>       Signed-off-by: Rusty Russell<rusty@rustcorp.com.au>
>>>>
>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>> index 72bf8bc09014..21d9a62767af 100644
>>>> --- a/drivers/virtio/virtio_ring.c
>>>> +++ b/drivers/virtio/virtio_ring.c
>>>> @@ -87,8 +87,11 @@ static int vring_add_buf(struct virtqueue *_vq,
>>>>    	if (vq->num_free < out + in) {
>>>>    		pr_debug("Can't add buf len %i - avail = %i\n",
>>>>    			 out + in, vq->num_free);
>>>> -		/* We notify*even if*  VRING_USED_F_NO_NOTIFY is set here. */
>>>> -		vq->notify(&vq->vq);
>>>> +		/* FIXME: for historical reasons, we force a notify here if
>>>> +		 * there are outgoing parts to the buffer.  Presumably the
>>>> +		 * host should service the ring ASAP. */
>>>> +		if (out)
>>>> +			vq->notify(&vq->vq);
>>>>    		END_USE(vq);
>>>>    		return -ENOSPC;
>>>>    	}
>>>>
>>>>

^ permalink raw reply

* [PATCH bpf-next 2/4] bpf: Split bpf_sk_lookup
From: Andrey Ignatov @ 2018-11-08 16:54 UTC (permalink / raw)
  To: netdev; +Cc: Andrey Ignatov, ast, daniel, joe, kernel-team
In-Reply-To: <cover.1541695683.git.rdna@fb.com>

Split bpf_sk_lookup to separate core functionality, that can be reused
to make socket lookup available to more program types, from
functionality specific to program types that have access to skb.

Core functionality is placed in __bpf_sk_lookup. bpf_sk_lookup only gets
the caller netns and ifindex from the skb and passes them to
__bpf_sk_lookup.

Program types that don't have access to an skb can just pass NULL to
__bpf_sk_lookup; NULL is handled correctly by both inet{,6}_sdif and the
lookup functions.

This is a refactoring that simply moves blocks around and does NOT change
existing logic.
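
The shape of the split can be sketched outside the kernel as below.
Everything here is a mock: the struct names, core_lookup() and
skb_lookup() are hypothetical stand-ins for __bpf_sk_lookup and
bpf_sk_lookup, and the sketch only records which netns and ifindex the
core would use instead of doing a real socket lookup. Unlike the patch,
the wrapper uses an if/else so the mock never touches a missing socket.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal mock types; none of these are the kernel structures. */
struct mock_net { int id; };
struct mock_dev { struct mock_net *net; int ifindex; };
struct mock_sock { struct mock_net *net; };
struct mock_skb { struct mock_dev *dev; struct mock_sock *sk; };

/* What the core lookup ended up using; stands in for the socket result. */
struct lookup_key {
	const struct mock_net *net;
	int ifindex;
};

/* Core: takes netns and ifindex explicitly, so skb may be NULL. */
static struct lookup_key core_lookup(const struct mock_skb *skb,
				     const struct mock_net *caller_net,
				     int ifindex)
{
	struct lookup_key key = { caller_net, ifindex };

	(void)skb;	/* the mock core never needs the packet itself */
	assert(caller_net);
	return key;
}

/* skb-based wrapper: derives netns and ifindex from the packet. */
static struct lookup_key skb_lookup(const struct mock_skb *skb)
{
	const struct mock_net *caller_net;
	int ifindex = 0;

	if (skb->dev) {
		caller_net = skb->dev->net;
		ifindex = skb->dev->ifindex;
	} else {
		caller_net = skb->sk->net;
	}

	return core_lookup(skb, caller_net, ifindex);
}
```

A caller without an skb would call core_lookup(NULL, net, ifindex)
directly and supply its own netns and ifindex.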

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 net/core/filter.c | 38 +++++++++++++++++++++++---------------
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 9a1327eb25fa..dc0f86a707b7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4825,14 +4825,10 @@ static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 
 #ifdef CONFIG_INET
 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
-			      struct sk_buff *skb, u8 family, u8 proto)
+			      struct sk_buff *skb, u8 family, u8 proto, int dif)
 {
 	bool refcounted = false;
 	struct sock *sk = NULL;
-	int dif = 0;
-
-	if (skb->dev)
-		dif = skb->dev->ifindex;
 
 	if (family == AF_INET) {
 		__be32 src4 = tuple->ipv4.saddr;
@@ -4875,16 +4871,16 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 	return sk;
 }
 
-/* bpf_sk_lookup performs the core lookup for different types of sockets,
+/* __bpf_sk_lookup performs the core lookup for different types of sockets,
  * taking a reference on the socket if it doesn't have the flag SOCK_RCU_FREE.
  * Returns the socket as an 'unsigned long' to simplify the casting in the
  * callers to satisfy BPF_CALL declarations.
  */
 static unsigned long
-bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
-	      u8 proto, u64 netns_id, u64 flags)
+__bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
+		u8 proto, u64 netns_id, struct net *caller_net, int ifindex,
+		u64 flags)
 {
-	struct net *caller_net;
 	struct sock *sk = NULL;
 	u8 family = AF_UNSPEC;
 	struct net *net;
@@ -4893,19 +4889,15 @@ bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
 	if (unlikely(family == AF_UNSPEC || netns_id > U32_MAX || flags))
 		goto out;
 
-	if (skb->dev)
-		caller_net = dev_net(skb->dev);
-	else
-		caller_net = sock_net(skb->sk);
 	if (netns_id) {
 		net = get_net_ns_by_id(caller_net, netns_id);
 		if (unlikely(!net))
 			goto out;
-		sk = sk_lookup(net, tuple, skb, family, proto);
+		sk = sk_lookup(net, tuple, skb, family, proto, ifindex);
 		put_net(net);
 	} else {
 		net = caller_net;
-		sk = sk_lookup(net, tuple, skb, family, proto);
+		sk = sk_lookup(net, tuple, skb, family, proto, ifindex);
 	}
 
 	if (sk)
@@ -4914,6 +4906,22 @@ bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
 	return (unsigned long) sk;
 }
 
+static unsigned long
+bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
+	      u8 proto, u64 netns_id, u64 flags)
+{
+	struct net *caller_net = sock_net(skb->sk);
+	int ifindex = 0;
+
+	if (skb->dev) {
+		caller_net = dev_net(skb->dev);
+		ifindex = skb->dev->ifindex;
+	}
+
+	return __bpf_sk_lookup(skb, tuple, len, proto, netns_id, caller_net,
+			       ifindex, flags);
+}
+
 BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb,
 	   struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags)
 {
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 0/4] bpf: Support socket lookup in CGROUP_SOCK_ADDR progs
From: Andrey Ignatov @ 2018-11-08 16:54 UTC (permalink / raw)
  To: netdev; +Cc: Andrey Ignatov, ast, daniel, joe, kernel-team

This patch set makes bpf_sk_lookup_tcp, bpf_sk_lookup_udp and
bpf_sk_release helpers available in programs of type
BPF_PROG_TYPE_CGROUP_SOCK_ADDR.

Patch 1 is a fix for bpf_sk_lookup_udp that was already sent to netdev
separately for the bpf (stable) tree. Here it's a prerequisite for patch 4.

Patch 2 is a refactoring to prepare for patch 3. A similar refactoring was
done as part of the "bpf: Extend the sk_lookup() helper to XDP hookpoint."
patch published on netdev earlier but not merged yet. This patch set can
reuse the work done for XDP if it's merged. This patch doesn't make any
logic changes and simply moves code around.

Patch 3 is the main patch in the set; it makes the helpers available for
BPF_PROG_TYPE_CGROUP_SOCK_ADDR and provides more details about the use case.

Patch 4 adds a selftest for the new functionality.

Andrey Ignatov (4):
  bpf: Fix IPv6 dport byte order in bpf_sk_lookup_udp
  bpf: Split bpf_sk_lookup
  bpf: Support socket lookup in CGROUP_SOCK_ADDR progs
  selftest/bpf: Use bpf_sk_lookup_{tcp,udp} in test_sock_addr

 net/core/filter.c                           | 96 +++++++++++++++++----
 tools/testing/selftests/bpf/connect4_prog.c | 43 +++++++--
 tools/testing/selftests/bpf/connect6_prog.c | 56 +++++++++---
 3 files changed, 156 insertions(+), 39 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH bpf-next 4/4] selftest/bpf: Use bpf_sk_lookup_{tcp,udp} in test_sock_addr
From: Andrey Ignatov @ 2018-11-08 16:54 UTC (permalink / raw)
  To: netdev; +Cc: Andrey Ignatov, ast, daniel, joe, kernel-team
In-Reply-To: <cover.1541695683.git.rdna@fb.com>

Use bpf_sk_lookup_tcp, bpf_sk_lookup_udp and bpf_sk_release helpers from
test_sock_addr programs to make sure they're available and can look up
and release a socket properly for IPv4/IPv6, TCP/UDP.

Reading from a few fields of returned struct bpf_sock is also tested.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/connect4_prog.c | 43 ++++++++++++----
 tools/testing/selftests/bpf/connect6_prog.c | 56 ++++++++++++++++-----
 2 files changed, 78 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/bpf/connect4_prog.c b/tools/testing/selftests/bpf/connect4_prog.c
index 5a88a681d2ab..b8395f3c43e9 100644
--- a/tools/testing/selftests/bpf/connect4_prog.c
+++ b/tools/testing/selftests/bpf/connect4_prog.c
@@ -21,23 +21,48 @@ int _version SEC("version") = 1;
 SEC("cgroup/connect4")
 int connect_v4_prog(struct bpf_sock_addr *ctx)
 {
+	struct bpf_sock_tuple tuple = {};
 	struct sockaddr_in sa;
+	struct bpf_sock *sk;
+
+	/* Verify that new destination is available. */
+	memset(&tuple.ipv4.saddr, 0, sizeof(tuple.ipv4.saddr));
+	memset(&tuple.ipv4.sport, 0, sizeof(tuple.ipv4.sport));
+
+	tuple.ipv4.daddr = bpf_htonl(DST_REWRITE_IP4);
+	tuple.ipv4.dport = bpf_htons(DST_REWRITE_PORT4);
+
+	if (ctx->type != SOCK_STREAM && ctx->type != SOCK_DGRAM)
+		return 0;
+	else if (ctx->type == SOCK_STREAM)
+		sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv4), 0, 0);
+	else
+		sk = bpf_sk_lookup_udp(ctx, &tuple, sizeof(tuple.ipv4), 0, 0);
+
+	if (!sk)
+		return 0;
+
+	if (sk->src_ip4 != tuple.ipv4.daddr ||
+	    sk->src_port != DST_REWRITE_PORT4) {
+		bpf_sk_release(sk);
+		return 0;
+	}
+
+	bpf_sk_release(sk);
 
 	/* Rewrite destination. */
 	ctx->user_ip4 = bpf_htonl(DST_REWRITE_IP4);
 	ctx->user_port = bpf_htons(DST_REWRITE_PORT4);
 
-	if (ctx->type == SOCK_DGRAM || ctx->type == SOCK_STREAM) {
-		/* Rewrite source. */
-		memset(&sa, 0, sizeof(sa));
+	/* Rewrite source. */
+	memset(&sa, 0, sizeof(sa));
 
-		sa.sin_family = AF_INET;
-		sa.sin_port = bpf_htons(0);
-		sa.sin_addr.s_addr = bpf_htonl(SRC_REWRITE_IP4);
+	sa.sin_family = AF_INET;
+	sa.sin_port = bpf_htons(0);
+	sa.sin_addr.s_addr = bpf_htonl(SRC_REWRITE_IP4);
 
-		if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)) != 0)
-			return 0;
-	}
+	if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)) != 0)
+		return 0;
 
 	return 1;
 }
diff --git a/tools/testing/selftests/bpf/connect6_prog.c b/tools/testing/selftests/bpf/connect6_prog.c
index 8ea3f7d12dee..25f5dc7b7aa0 100644
--- a/tools/testing/selftests/bpf/connect6_prog.c
+++ b/tools/testing/selftests/bpf/connect6_prog.c
@@ -29,7 +29,41 @@ int _version SEC("version") = 1;
 SEC("cgroup/connect6")
 int connect_v6_prog(struct bpf_sock_addr *ctx)
 {
+	struct bpf_sock_tuple tuple = {};
 	struct sockaddr_in6 sa;
+	struct bpf_sock *sk;
+
+	/* Verify that new destination is available. */
+	memset(&tuple.ipv6.saddr, 0, sizeof(tuple.ipv6.saddr));
+	memset(&tuple.ipv6.sport, 0, sizeof(tuple.ipv6.sport));
+
+	tuple.ipv6.daddr[0] = bpf_htonl(DST_REWRITE_IP6_0);
+	tuple.ipv6.daddr[1] = bpf_htonl(DST_REWRITE_IP6_1);
+	tuple.ipv6.daddr[2] = bpf_htonl(DST_REWRITE_IP6_2);
+	tuple.ipv6.daddr[3] = bpf_htonl(DST_REWRITE_IP6_3);
+
+	tuple.ipv6.dport = bpf_htons(DST_REWRITE_PORT6);
+
+	if (ctx->type != SOCK_STREAM && ctx->type != SOCK_DGRAM)
+		return 0;
+	else if (ctx->type == SOCK_STREAM)
+		sk = bpf_sk_lookup_tcp(ctx, &tuple, sizeof(tuple.ipv6), 0, 0);
+	else
+		sk = bpf_sk_lookup_udp(ctx, &tuple, sizeof(tuple.ipv6), 0, 0);
+
+	if (!sk)
+		return 0;
+
+	if (sk->src_ip6[0] != tuple.ipv6.daddr[0] ||
+	    sk->src_ip6[1] != tuple.ipv6.daddr[1] ||
+	    sk->src_ip6[2] != tuple.ipv6.daddr[2] ||
+	    sk->src_ip6[3] != tuple.ipv6.daddr[3] ||
+	    sk->src_port != DST_REWRITE_PORT6) {
+		bpf_sk_release(sk);
+		return 0;
+	}
+
+	bpf_sk_release(sk);
 
 	/* Rewrite destination. */
 	ctx->user_ip6[0] = bpf_htonl(DST_REWRITE_IP6_0);
@@ -39,21 +73,19 @@ int connect_v6_prog(struct bpf_sock_addr *ctx)
 
 	ctx->user_port = bpf_htons(DST_REWRITE_PORT6);
 
-	if (ctx->type == SOCK_DGRAM || ctx->type == SOCK_STREAM) {
-		/* Rewrite source. */
-		memset(&sa, 0, sizeof(sa));
+	/* Rewrite source. */
+	memset(&sa, 0, sizeof(sa));
 
-		sa.sin6_family = AF_INET6;
-		sa.sin6_port = bpf_htons(0);
+	sa.sin6_family = AF_INET6;
+	sa.sin6_port = bpf_htons(0);
 
-		sa.sin6_addr.s6_addr32[0] = bpf_htonl(SRC_REWRITE_IP6_0);
-		sa.sin6_addr.s6_addr32[1] = bpf_htonl(SRC_REWRITE_IP6_1);
-		sa.sin6_addr.s6_addr32[2] = bpf_htonl(SRC_REWRITE_IP6_2);
-		sa.sin6_addr.s6_addr32[3] = bpf_htonl(SRC_REWRITE_IP6_3);
+	sa.sin6_addr.s6_addr32[0] = bpf_htonl(SRC_REWRITE_IP6_0);
+	sa.sin6_addr.s6_addr32[1] = bpf_htonl(SRC_REWRITE_IP6_1);
+	sa.sin6_addr.s6_addr32[2] = bpf_htonl(SRC_REWRITE_IP6_2);
+	sa.sin6_addr.s6_addr32[3] = bpf_htonl(SRC_REWRITE_IP6_3);
 
-		if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)) != 0)
-			return 0;
-	}
+	if (bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa)) != 0)
+		return 0;
 
 	return 1;
 }
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 3/4] bpf: Support socket lookup in CGROUP_SOCK_ADDR progs
From: Andrey Ignatov @ 2018-11-08 16:54 UTC (permalink / raw)
  To: netdev; +Cc: Andrey Ignatov, ast, daniel, joe, kernel-team
In-Reply-To: <cover.1541695683.git.rdna@fb.com>

Make bpf_sk_lookup_tcp, bpf_sk_lookup_udp and bpf_sk_release helpers
available in programs of type BPF_PROG_TYPE_CGROUP_SOCK_ADDR.

Such programs operate on sockets and have access to socket and struct
sockaddr passed by user to system calls such as sys_bind, sys_connect,
sys_sendmsg.

It's useful to be able to lookup other sockets from these programs.
E.g. sys_connect may lookup IP:port endpoint and if there is a server
socket bound to that endpoint ("server" can be defined by saddr & sport
being zero), redirect client connection to it by rewriting IP:port in
sockaddr passed to sys_connect.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 net/core/filter.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index dc0f86a707b7..2e8575a34a1e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4971,6 +4971,51 @@ static const struct bpf_func_proto bpf_sk_release_proto = {
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_SOCKET,
 };
+
+static unsigned long
+bpf_sock_addr_sk_lookup(struct sock *sk, struct bpf_sock_tuple *tuple, u32 len,
+			u8 proto, u64 netns_id, u64 flags)
+{
+	return __bpf_sk_lookup(NULL, tuple, len, proto, netns_id, sock_net(sk),
+			       0, flags);
+}
+
+BPF_CALL_5(bpf_sock_addr_sk_lookup_tcp, struct bpf_sock_addr_kern *, ctx,
+	   struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags)
+{
+	return bpf_sock_addr_sk_lookup(ctx->sk, tuple, len, IPPROTO_TCP,
+				       netns_id, flags);
+}
+
+static const struct bpf_func_proto bpf_sock_addr_sk_lookup_tcp_proto = {
+	.func		= bpf_sock_addr_sk_lookup_tcp,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_5(bpf_sock_addr_sk_lookup_udp, struct bpf_sock_addr_kern *, ctx,
+	   struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags)
+{
+	return bpf_sock_addr_sk_lookup(ctx->sk, tuple, len, IPPROTO_UDP,
+				       netns_id, flags);
+}
+
+static const struct bpf_func_proto bpf_sock_addr_sk_lookup_udp_proto = {
+	.func		= bpf_sock_addr_sk_lookup_udp,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+	.arg5_type	= ARG_ANYTHING,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5077,6 +5122,14 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_cookie_sock_addr_proto;
 	case BPF_FUNC_get_local_storage:
 		return &bpf_get_local_storage_proto;
+#ifdef CONFIG_INET
+	case BPF_FUNC_sk_lookup_tcp:
+		return &bpf_sock_addr_sk_lookup_tcp_proto;
+	case BPF_FUNC_sk_lookup_udp:
+		return &bpf_sock_addr_sk_lookup_udp_proto;
+	case BPF_FUNC_sk_release:
+		return &bpf_sk_release_proto;
+#endif /* CONFIG_INET */
 	default:
 		return bpf_base_func_proto(func_id);
 	}
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf-next 1/4] bpf: Fix IPv6 dport byte order in bpf_sk_lookup_udp
From: Andrey Ignatov @ 2018-11-08 16:54 UTC (permalink / raw)
  To: netdev; +Cc: Andrey Ignatov, ast, daniel, joe, kernel-team
In-Reply-To: <cover.1541695683.git.rdna@fb.com>

Lookup functions in sk_lookup have different expectations about byte
order of provided arguments.

Specifically __inet_lookup, __udp4_lib_lookup and __udp6_lib_lookup
expect dport to be in network byte order and do ntohs(dport) internally.

At the same time __inet6_lookup expects dport to be in host byte order
and correspondingly names the argument hnum.

sk_lookup works correctly with __inet_lookup, __udp4_lib_lookup and
__inet6_lookup with regard to dport. But in the __udp6_lib_lookup case it
passes host instead of the expected network byte order. This makes the
result returned by bpf_sk_lookup_udp for IPv6 incorrect.

The patch fixes byte order of dport passed to __udp6_lib_lookup.

Originally sk_lookup properly handled UDPv6, but not TCPv6. 5ef0ae84f02a
fixes TCPv6 but breaks UDPv6.

Fixes: 5ef0ae84f02a ("bpf: Fix IPv6 dport byte-order in bpf_sk_lookup")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e521c5ebc7d1..9a1327eb25fa 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4852,18 +4852,17 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 	} else {
 		struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr;
 		struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
-		u16 hnum = ntohs(tuple->ipv6.dport);
 		int sdif = inet6_sdif(skb);

 		if (proto == IPPROTO_TCP)
 			sk = __inet6_lookup(net, &tcp_hashinfo, skb, 0,
 					    src6, tuple->ipv6.sport,
-					    dst6, hnum,
+					    dst6, ntohs(tuple->ipv6.dport),
 					    dif, sdif, &refcounted);
 		else if (likely(ipv6_bpf_stub))
 			sk = ipv6_bpf_stub->udp6_lib_lookup(net,
 							    src6, tuple->ipv6.sport,
-							    dst6, hnum,
+							    dst6, tuple->ipv6.dport,
 							    dif, sdif,
 							    &udp_table, skb);
 #endif
-- 
2.17.1
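
The byte-order mismatch described in the commit message above can be
reproduced in user space. What follows is only a minimal sketch: the two
lookup stubs (lookup_net_order, lookup_host_order) and the port 4040 are
made up for illustration and are not kernel functions. They merely mimic
the two calling conventions the message describes.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a lookup that takes dport in network byte
 * order and converts it internally, as the commit message says
 * __udp6_lib_lookup does.
 */
bool lookup_net_order(uint16_t dport_be, uint16_t bound_port)
{
	return ntohs(dport_be) == bound_port;
}

/* Hypothetical stand-in for a lookup that takes the port already in host
 * byte order (the "hnum" convention of __inet6_lookup).
 */
bool lookup_host_order(uint16_t hnum, uint16_t bound_port)
{
	return hnum == bound_port;
}

/* Exercises both conventions with a tuple port stored in network order. */
void byte_order_demo(void)
{
	uint16_t dport_be = htons(4040);	/* as stored in the tuple */

	/* Correct pairings. */
	assert(lookup_net_order(dport_be, 4040));
	assert(lookup_host_order(ntohs(dport_be), 4040));

	/* The bug: a host-order value handed to the network-order lookup
	 * is converted twice. On a little-endian machine this misses; on
	 * a big-endian machine the conversions are no-ops and it matches.
	 */
	bool little_endian = htons(1) != 1;
	assert(lookup_net_order(ntohs(dport_be), 4040) == !little_endian);
}
```

The last assertion is written to hold on either endianness, which is also
why a bug of this kind can go unnoticed on big-endian test machines.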

^ permalink raw reply related

* Re: [PATCH net-next] SUNRPC: drop pointless static qualifier in xdr_get_next_encode_buffer()
From: bfields @ 2018-11-08 17:26 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: yuehaibing@huawei.com, anna.schumaker@netapp.com,
	davem@davemloft.net, jlayton@kernel.org, netdev@vger.kernel.org,
	linux-nfs@vger.kernel.org, kernel-janitors@vger.kernel.org
In-Reply-To: <d23105929adb66c626d382525adef77265585404.camel@hammerspace.com>

On Thu, Nov 08, 2018 at 03:13:25AM +0000, Trond Myklebust wrote:
> On Thu, 2018-11-08 at 02:04 +0000, YueHaibing wrote:
> > There is no need to have the '__be32 *p' variable static since a new
> > value is always assigned before it is used.

Applying for 4.20 and stable, thanks!

> > 
> > Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> > ---
> >  net/sunrpc/xdr.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> > index 2bbb8d3..d80b156 100644
> > --- a/net/sunrpc/xdr.c
> > +++ b/net/sunrpc/xdr.c
> > @@ -546,7 +546,7 @@ void xdr_commit_encode(struct xdr_stream *xdr)
> >  static __be32 *xdr_get_next_encode_buffer(struct xdr_stream *xdr,
> >  		size_t nbytes)
> >  {
> > -	static __be32 *p;
> > +	__be32 *p;
> >  	int space_left;
> >  	int frag1bytes, frag2bytes;
> > 
> 
> Ouch, that's a really nasty bug that could definitely cause corruption
> if you have 2 threads simultaneously calling this function! This really
> deserves to be a stable patch.

Agreed.  Looks like I introduced that in 3.16, over 4 years ago, so I'm
a little surprised not to have seen a bug report that this would
explain.  Maybe it's just that the critical section is only a few lines
of arithmetic at the end of the function.  Also it only gets called when
an xdr reply other than a read reaches the end of a page.  So you'd need
a lot of concurrent READDIRs of large directories or something.  Still,
I'd think it would be possible.....
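
To make the hazard concrete, here is a small user-space sketch of why a
function-local `static` pointer is shared state. All names in it
(scratch_slot, caller_store, caller_load) are hypothetical and it is not
the SUNRPC code; it only shows that every caller reads and writes the
same storage, which is what allows two concurrent callers to clobber
each other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of the bug: the scratch pointer is `static`, so
 * there is exactly one copy, shared by every caller on every thread.
 */
int **scratch_slot(void)
{
	static int *p;

	return &p;	/* legal only because p has static storage */
}

/* One caller stores its buffer in the scratch pointer... */
void caller_store(int *buf)
{
	assert(buf != NULL);
	*scratch_slot() = buf;
}

/* ...and any other caller sees (or overwrites) that same value. */
int *caller_load(void)
{
	return *scratch_slot();
}

void static_local_demo(void)
{
	int a = 0, b = 0;

	caller_store(&a);
	caller_store(&b);		/* second caller clobbers the first */
	assert(caller_load() == &b);	/* first caller's pointer is gone */
	assert(scratch_slot() == scratch_slot());
}
```

With an automatic (non-static) variable each call gets its own copy, so
the interleaving shown in static_local_demo() cannot occur.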

> Thank you, YueHaibing!
> 
> Bruce, do you want to shepherd this one in?

Yes, I've got 3 bugfixes queued up now, I should send them along later
today or tomorrow.

--b.

^ permalink raw reply

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-08 17:30 UTC (permalink / raw)
  To: David Ahern, Jesper Dangaard Brouer; +Cc: netdev, Yoel Caspersen
In-Reply-To: <68cc8279-5e3f-85c2-673c-aa3d4a47b353@gmail.com>



W dniu 08.11.2018 o 17:32, David Ahern pisze:
> On 11/8/18 9:27 AM, Paweł Staszewski wrote:
>>>> What hardware is this?
>>>>
>> mellanox connectx 4
>> ethtool -i enp175s0f0
>> driver: mlx5_core
>> version: 5.0-0
>> firmware-version: 12.21.1000 (SM_2001000001033)
>> expansion-rom-version:
>> bus-info: 0000:af:00.0
>> supports-statistics: yes
>> supports-test: yes
>> supports-eeprom-access: no
>> supports-register-dump: no
>> supports-priv-flags: yes
>>
>> ethtool -i enp175s0f1
>> driver: mlx5_core
>> version: 5.0-0
>> firmware-version: 12.21.1000 (SM_2001000001033)
>> expansion-rom-version:
>> bus-info: 0000:af:00.1
>> supports-statistics: yes
>> supports-test: yes
>> supports-eeprom-access: no
>> supports-register-dump: no
>> supports-priv-flags: yes
>>
>>>> Start with:
>>>>
>>>> echo 1 > /sys/kernel/debug/tracing/events/xdp/enable
>>>> cat /sys/kernel/debug/tracing/trace_pipe
>>>   cat /sys/kernel/debug/tracing/trace_pipe
>>>           <idle>-0     [045] ..s. 68469.467752: xdp_devmap_xmit:
>>> ndo_xdp_xmit map_id=32 map_index=5 action=REDIRECT sent=0 drops=1
>>> from_ifindex=4 to_ifindex=5 err=-6
> FIB lookup is good, the redirect is happening, but the mlx5 driver does
> not like it.
>
> I think the -6 is coming from the mlx5 driver and the packet is getting
> dropped. Perhaps this check in mlx5e_xdp_xmit:
>
>         if (unlikely(sq_num >= priv->channels.num))
>                  return -ENXIO;
>
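
For reference, err=-6 in the trace corresponds to -ENXIO on Linux. The
sketch below reduces the quoted check to a stand-alone function. The name
xdp_xmit_check and the channel counts are made up, and the idea that the
send-queue index follows the CPU number (45 in the trace) is an
assumption, not something stated in this thread.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical reduction of the quoted check: reject a send-queue index
 * that is not below the number of configured channels.
 */
int xdp_xmit_check(unsigned int sq_num, unsigned int num_channels)
{
	if (sq_num >= num_channels)
		return -ENXIO;
	return 0;
}

void enxio_demo(void)
{
	/* CPU 45 from the trace, with an assumed (made-up) 8 channels. */
	assert(xdp_xmit_check(45, 8) == -ENXIO);
	/* With enough channels the same index would pass the check. */
	assert(xdp_xmit_check(45, 64) == 0);
	assert(ENXIO == 6);	/* Linux value, matching err=-6 */
}
```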
Wondering about this:
swapper     0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif 0 
iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0 ==> 
dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
             7fff818c13b5 fib_table_lookup ([kernel.kallsyms])

oif 0 ?

Is that correct here ?


>
>>> swapper     0 [045] 68493.746274: fib:fib_table_lookup: table 254 oif
>>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>>              7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>>>
>>> swapper     0 [045] 68494.770287: fib:fib_table_lookup: table 254 oif
>>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>>              7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>>>
>>> swapper     0 [045] 68495.794304: fib:fib_table_lookup: table 254 oif
>>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>>              7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>>>
>>> swapper     0 [045] 68496.818308: fib:fib_table_lookup: table 254 oif
>>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>>              7fff818c13b5 fib_table_lookup ([kernel.kallsyms])
>>>
>>> swapper     0 [045] 68497.842313: fib:fib_table_lookup: table 254 oif
>>> 0 iif 6 proto 1 192.168.22.237/0 -> 172.16.0.2/0 tos 0 scope 0 flags 0
>>> ==> dev vlan1740 gw 0.0.0.0 src 172.16.0.1 err 0
>>>              7fff818c13b5 fib_table_lookup ([kernel.kallsyms])

^ permalink raw reply

* Re: [PATCH 0/5] phy: core: rework phy_set_mode to accept phy mode and submode
From: David Miller @ 2018-11-09  3:20 UTC (permalink / raw)
  To: grygorii.strashko
  Cc: kishon, netdev, nsekhar, linux-kernel, linux-arm-kernel, tony,
	linux-amlogic, linux-mediatek, alexandre.belloni, antoine.tenart,
	quentin.schulz, vivek.gautam, maxime.ripard, wens, carlo,
	chunfeng.yun, matthias.bgg, mgautam
In-Reply-To: <20181108003617.10334-1-grygorii.strashko@ti.com>

From: Grygorii Strashko <grygorii.strashko@ti.com>
Date: Wed, 7 Nov 2018 18:36:12 -0600

> As was discussed in [1] I'm posting series which introduces rework of
> phy_set_mode to accept phy mode and submode. I've dropped TI specific patches as
> this change is pretty big by itself.
> 
> Patch 1 is cumulative change which refactors PHY framework code to
> support dual level PHYs mode configuration - PHY mode and PHY submode. It
> extends .set_mode() callback to support additional parameter "int submode"
> and converts all corresponding PHY drivers to support new .set_mode()
> callback declaration.
> The new extended PHY API
>  int phy_set_mode_ext(struct phy *phy, enum phy_mode mode, int submode)
> is introduced to support dual level PHYs mode configuration and existing
> phy_set_mode() API is converted to macros, so PHY framework consumers do
> not need to be changed (~21 matches).
> 
> Patches 2-4: Add new PHY's mode to be used by Ethernet PHY interface drivers or
> multipurpose PHYs like serdes and convert ocelot-serdes and mvebu-cp110-comphy
> PHY drivers to use recently introduced PHY_MODE_ETHERNET and phy_set_mode_ext().
> 
> Patch 5 - removes unused, ethernet specific phy modes from enum phy_mode.
> 
> [1] https://lkml.org/lkml/2018/10/25/366

I guess this will go via Kishon's tree.
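
The macro conversion described in the quoted cover letter might look
roughly like the sketch below. It is a user-space mock-up under stated
assumptions: struct phy is pared down to two fields, the enum is trimmed
to three values, and the error handling is invented for illustration.
Only the shape of phy_set_mode_ext() and the idea of keeping
phy_set_mode() as a macro come from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed-down enum for illustration; the real list is longer. */
enum phy_mode {
	PHY_MODE_INVALID,
	PHY_MODE_USB_HOST,
	PHY_MODE_ETHERNET,
};

/* Hypothetical, pared-down stand-in for the kernel's struct phy. */
struct phy {
	enum phy_mode mode;
	int submode;
};

/* Extended setter: records both the mode and the submode. */
int phy_set_mode_ext(struct phy *phy, enum phy_mode mode, int submode)
{
	if (!phy)
		return -1;	/* sketch-only error handling */
	phy->mode = mode;
	phy->submode = submode;
	return 0;
}

/* Existing callers keep compiling unchanged; submode defaults to 0. */
#define phy_set_mode(phy, mode) phy_set_mode_ext(phy, mode, 0)

void phy_mode_demo(void)
{
	struct phy p = { PHY_MODE_INVALID, -1 };

	assert(phy_set_mode(&p, PHY_MODE_USB_HOST) == 0);
	assert(p.mode == PHY_MODE_USB_HOST && p.submode == 0);

	assert(phy_set_mode_ext(&p, PHY_MODE_ETHERNET, 7) == 0);
	assert(p.mode == PHY_MODE_ETHERNET && p.submode == 7);
}
```

Keeping the old name as a macro is what lets the existing consumers stay
untouched while drivers move to the two-level call at their own pace.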

^ permalink raw reply

* [PATCH net-next 1/4] i40iw: remove use of VLAN_TAG_PRESENT
From: Michał Mirosław @ 2018-11-08 17:44 UTC (permalink / raw)
  To: netdev
  Cc: Faisal Latif, Shiraz Saleem, Claudiu Manoil, Pravin B Shelar, dev,
	linux-rdma
In-Reply-To: <cover.1541698641.git.mirq-linux@rere.qmqm.pl>

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 drivers/infiniband/hw/i40iw/i40iw_cm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c b/drivers/infiniband/hw/i40iw/i40iw_cm.c
index 771eb6bd0785..4b3999d88c9e 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_cm.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
@@ -404,7 +404,7 @@ static struct i40iw_puda_buf *i40iw_form_cm_frame(struct i40iw_cm_node *cm_node,
 	if (pdata)
 		pd_len = pdata->size;
 
-	if (cm_node->vlan_id < VLAN_TAG_PRESENT)
+	if (cm_node->vlan_id <= VLAN_VID_MASK)
 		eth_hlen += 4;
 
 	if (cm_node->ipv4)
@@ -433,7 +433,7 @@ static struct i40iw_puda_buf *i40iw_form_cm_frame(struct i40iw_cm_node *cm_node,
 
 		ether_addr_copy(ethh->h_dest, cm_node->rem_mac);
 		ether_addr_copy(ethh->h_source, cm_node->loc_mac);
-		if (cm_node->vlan_id < VLAN_TAG_PRESENT) {
+		if (cm_node->vlan_id <= VLAN_VID_MASK) {
 			((struct vlan_ethhdr *)ethh)->h_vlan_proto = htons(ETH_P_8021Q);
 			vtag = (cm_node->user_pri << VLAN_PRIO_SHIFT) | cm_node->vlan_id;
 			((struct vlan_ethhdr *)ethh)->h_vlan_TCI = htons(vtag);
@@ -463,7 +463,7 @@ static struct i40iw_puda_buf *i40iw_form_cm_frame(struct i40iw_cm_node *cm_node,
 
 		ether_addr_copy(ethh->h_dest, cm_node->rem_mac);
 		ether_addr_copy(ethh->h_source, cm_node->loc_mac);
-		if (cm_node->vlan_id < VLAN_TAG_PRESENT) {
+		if (cm_node->vlan_id <= VLAN_VID_MASK) {
 			((struct vlan_ethhdr *)ethh)->h_vlan_proto = htons(ETH_P_8021Q);
 			vtag = (cm_node->user_pri << VLAN_PRIO_SHIFT) | cm_node->vlan_id;
 			((struct vlan_ethhdr *)ethh)->h_vlan_TCI = htons(vtag);
@@ -3323,7 +3323,7 @@ static void i40iw_init_tcp_ctx(struct i40iw_cm_node *cm_node,
 
 	tcp_info->flow_label = 0;
 	tcp_info->snd_mss = cpu_to_le32(((u32)cm_node->tcp_cntxt.mss));
-	if (cm_node->vlan_id < VLAN_TAG_PRESENT) {
+	if (cm_node->vlan_id <= VLAN_VID_MASK) {
 		tcp_info->insert_vlan_tag = true;
 		tcp_info->vlan_tag = cpu_to_le16(((u16)cm_node->user_pri << I40IW_VLAN_PRIO_SHIFT) |
 						  cm_node->vlan_id);
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next 2/4] cnic: remove use of VLAN_TAG_PRESENT
From: Michał Mirosław @ 2018-11-08 17:44 UTC (permalink / raw)
  To: netdev
  Cc: Claudiu Manoil, Faisal Latif, Pravin B Shelar, Shiraz Saleem, dev,
	linux-rdma
In-Reply-To: <cover.1541698641.git.mirq-linux@rere.qmqm.pl>

This just removes VLAN_TAG_PRESENT use.  VLAN TCI=0 special meaning is
deeply embedded in the driver code and so is left as is.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 drivers/net/ethernet/broadcom/cnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index d83233ae4a15..510dfc1c236b 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -5731,7 +5731,7 @@ static int cnic_netdev_event(struct notifier_block *this, unsigned long event,
 		if (realdev) {
 			dev = cnic_from_netdev(realdev);
 			if (dev) {
-				vid |= VLAN_TAG_PRESENT;
+				vid |= VLAN_CFI_MASK;	/* make non-zero */
 				cnic_rcv_netevent(dev->cnic_priv, event, vid);
 				cnic_put(dev);
 			}
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next 0/4] Remove VLAN_TAG_PRESENT from drivers
From: Michał Mirosław @ 2018-11-08 17:44 UTC (permalink / raw)
  To: netdev
  Cc: Claudiu Manoil, Faisal Latif, Pravin B Shelar, Shiraz Saleem, dev,
	linux-rdma

This series removes VLAN_TAG_PRESENT use from network drivers in
preparation to removing its special meaning.

Michał Mirosław (4):
  i40iw: remove use of VLAN_TAG_PRESENT
  cnic: remove use of VLAN_TAG_PRESENT
  gianfar: remove use of VLAN_TAG_PRESENT
  OVS: remove use of VLAN_TAG_PRESENT

 drivers/infiniband/hw/i40iw/i40iw_cm.c        |  8 +++----
 drivers/net/ethernet/broadcom/cnic.c          |  2 +-
 .../net/ethernet/freescale/gianfar_ethtool.c  |  8 +++----
 net/openvswitch/actions.c                     | 13 +++++++----
 net/openvswitch/flow.c                        |  4 ++--
 net/openvswitch/flow.h                        |  2 +-
 net/openvswitch/flow_netlink.c                | 22 +++++++++----------
 7 files changed, 31 insertions(+), 28 deletions(-)

-- 
2.19.1

^ permalink raw reply

* [PATCH net-next 3/4] gianfar: remove use of VLAN_TAG_PRESENT
From: Michał Mirosław @ 2018-11-08 17:44 UTC (permalink / raw)
  To: netdev
  Cc: Claudiu Manoil, Faisal Latif, Pravin B Shelar, Shiraz Saleem, dev,
	linux-rdma
In-Reply-To: <cover.1541698641.git.mirq-linux@rere.qmqm.pl>

Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 drivers/net/ethernet/freescale/gianfar_ethtool.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 0d76e15cd6dd..241325c35cb4 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -1134,11 +1134,9 @@ static int gfar_convert_to_filer(struct ethtool_rx_flow_spec *rule,
 		prio = vlan_tci_prio(rule);
 		prio_mask = vlan_tci_priom(rule);
 
-		if (cfi == VLAN_TAG_PRESENT && cfi_mask == VLAN_TAG_PRESENT) {
-			vlan |= RQFPR_CFI;
-			vlan_mask |= RQFPR_CFI;
-		} else if (cfi != VLAN_TAG_PRESENT &&
-			   cfi_mask == VLAN_TAG_PRESENT) {
+		if (cfi_mask) {
+			if (cfi)
+				vlan |= RQFPR_CFI;
 			vlan_mask |= RQFPR_CFI;
 		}
 	}
-- 
2.19.1

^ permalink raw reply related

* [PATCH net-next 4/4] OVS: remove use of VLAN_TAG_PRESENT
From: Michał Mirosław @ 2018-11-08 17:44 UTC (permalink / raw)
  To: netdev
  Cc: Pravin B Shelar, Claudiu Manoil, Faisal Latif, Shiraz Saleem, dev,
	linux-rdma
In-Reply-To: <cover.1541698641.git.mirq-linux@rere.qmqm.pl>

This is a minimal change to allow removing of VLAN_TAG_PRESENT.
It leaves OVS unable to use CFI bit, as fixing this would need
a deeper surgery involving userspace interface.

Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
---
 net/openvswitch/actions.c      | 13 +++++++++----
 net/openvswitch/flow.c         |  4 ++--
 net/openvswitch/flow.h         |  2 +-
 net/openvswitch/flow_netlink.c | 22 +++++++++++-----------
 4 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 85ae53d8fd09..e47ebbbe71b8 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -301,7 +301,7 @@ static int push_vlan(struct sk_buff *skb, struct sw_flow_key *key,
 		key->eth.vlan.tpid = vlan->vlan_tpid;
 	}
 	return skb_vlan_push(skb, vlan->vlan_tpid,
-			     ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT);
+			     ntohs(vlan->vlan_tci) & ~VLAN_CFI_MASK);
 }
 
 /* 'src' is already properly masked. */
@@ -822,8 +822,10 @@ static int ovs_vport_output(struct net *net, struct sock *sk, struct sk_buff *sk
 	__skb_dst_copy(skb, data->dst);
 	*OVS_CB(skb) = data->cb;
 	skb->inner_protocol = data->inner_protocol;
-	skb->vlan_tci = data->vlan_tci;
-	skb->vlan_proto = data->vlan_proto;
+	if (data->vlan_tci & VLAN_CFI_MASK)
+		__vlan_hwaccel_put_tag(skb, data->vlan_proto, data->vlan_tci & ~VLAN_CFI_MASK);
+	else
+		__vlan_hwaccel_clear_tag(skb);
 
 	/* Reconstruct the MAC header.  */
 	skb_push(skb, data->l2_len);
@@ -867,7 +869,10 @@ static void prepare_frag(struct vport *vport, struct sk_buff *skb,
 	data->cb = *OVS_CB(skb);
 	data->inner_protocol = skb->inner_protocol;
 	data->network_offset = orig_network_offset;
-	data->vlan_tci = skb->vlan_tci;
+	if (skb_vlan_tag_present(skb))
+		data->vlan_tci = skb_vlan_tag_get(skb) | VLAN_CFI_MASK;
+	else
+		data->vlan_tci = 0;
 	data->vlan_proto = skb->vlan_proto;
 	data->mac_proto = mac_proto;
 	data->l2_len = hlen;
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 35966da84769..fa393815991e 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -325,7 +325,7 @@ static int parse_vlan_tag(struct sk_buff *skb, struct vlan_head *key_vh,
 		return -ENOMEM;
 
 	vh = (struct vlan_head *)skb->data;
-	key_vh->tci = vh->tci | htons(VLAN_TAG_PRESENT);
+	key_vh->tci = vh->tci | htons(VLAN_CFI_MASK);
 	key_vh->tpid = vh->tpid;
 
 	if (unlikely(untag_vlan)) {
@@ -358,7 +358,7 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
 	int res;
 
 	if (skb_vlan_tag_present(skb)) {
-		key->eth.vlan.tci = htons(skb->vlan_tci);
+		key->eth.vlan.tci = htons(skb->vlan_tci) | htons(VLAN_CFI_MASK);
 		key->eth.vlan.tpid = skb->vlan_proto;
 	} else {
 		/* Parse outer vlan tag in the non-accelerated case. */
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index c670dd24b8b7..ba01fc4270bd 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -60,7 +60,7 @@ struct ovs_tunnel_info {
 
 struct vlan_head {
 	__be16 tpid; /* Vlan type. Generally 802.1q or 802.1ad.*/
-	__be16 tci;  /* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
+	__be16 tci;  /* 0 if no VLAN, VLAN_CFI_MASK set otherwise. */
 };
 
 #define OVS_SW_FLOW_KEY_METADATA_SIZE			\
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 865ecef68196..435a4bdf8f89 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -990,9 +990,9 @@ static int validate_vlan_from_nlattrs(const struct sw_flow_match *match,
 	if (a[OVS_KEY_ATTR_VLAN])
 		tci = nla_get_be16(a[OVS_KEY_ATTR_VLAN]);
 
-	if (!(tci & htons(VLAN_TAG_PRESENT))) {
+	if (!(tci & htons(VLAN_CFI_MASK))) {
 		if (tci) {
-			OVS_NLERR(log, "%s TCI does not have VLAN_TAG_PRESENT bit set.",
+			OVS_NLERR(log, "%s TCI does not have VLAN_CFI_MASK bit set.",
 				  (inner) ? "C-VLAN" : "VLAN");
 			return -EINVAL;
 		} else if (nla_len(a[OVS_KEY_ATTR_ENCAP])) {
@@ -1013,9 +1013,9 @@ static int validate_vlan_mask_from_nlattrs(const struct sw_flow_match *match,
 	__be16 tci = 0;
 	__be16 tpid = 0;
 	bool encap_valid = !!(match->key->eth.vlan.tci &
-			      htons(VLAN_TAG_PRESENT));
+			      htons(VLAN_CFI_MASK));
 	bool i_encap_valid = !!(match->key->eth.cvlan.tci &
-				htons(VLAN_TAG_PRESENT));
+				htons(VLAN_CFI_MASK));
 
 	if (!(key_attrs & (1 << OVS_KEY_ATTR_ENCAP))) {
 		/* Not a VLAN. */
@@ -1039,8 +1039,8 @@ static int validate_vlan_mask_from_nlattrs(const struct sw_flow_match *match,
 			  (inner) ? "C-VLAN" : "VLAN", ntohs(tpid));
 		return -EINVAL;
 	}
-	if (!(tci & htons(VLAN_TAG_PRESENT))) {
-		OVS_NLERR(log, "%s TCI mask does not have exact match for VLAN_TAG_PRESENT bit.",
+	if (!(tci & htons(VLAN_CFI_MASK))) {
+		OVS_NLERR(log, "%s TCI mask does not have exact match for VLAN_CFI_MASK bit.",
 			  (inner) ? "C-VLAN" : "VLAN");
 		return -EINVAL;
 	}
@@ -1095,7 +1095,7 @@ static int parse_vlan_from_nlattrs(struct sw_flow_match *match,
 	if (err)
 		return err;
 
-	encap_valid = !!(match->key->eth.vlan.tci & htons(VLAN_TAG_PRESENT));
+	encap_valid = !!(match->key->eth.vlan.tci & htons(VLAN_CFI_MASK));
 	if (encap_valid) {
 		err = __parse_vlan_from_nlattrs(match, key_attrs, true, a,
 						is_mask, log);
@@ -2943,7 +2943,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			vlan = nla_data(a);
 			if (!eth_type_vlan(vlan->vlan_tpid))
 				return -EINVAL;
-			if (!(vlan->vlan_tci & htons(VLAN_TAG_PRESENT)))
+			if (!(vlan->vlan_tci & htons(VLAN_CFI_MASK)))
 				return -EINVAL;
 			vlan_tci = vlan->vlan_tci;
 			break;
@@ -2959,7 +2959,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 			/* Prohibit push MPLS other than to a white list
 			 * for packets that have a known tag order.
 			 */
-			if (vlan_tci & htons(VLAN_TAG_PRESENT) ||
+			if (vlan_tci & htons(VLAN_CFI_MASK) ||
 			    (eth_type != htons(ETH_P_IP) &&
 			     eth_type != htons(ETH_P_IPV6) &&
 			     eth_type != htons(ETH_P_ARP) &&
@@ -2971,7 +2971,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 		}
 
 		case OVS_ACTION_ATTR_POP_MPLS:
-			if (vlan_tci & htons(VLAN_TAG_PRESENT) ||
+			if (vlan_tci & htons(VLAN_CFI_MASK) ||
 			    !eth_p_mpls(eth_type))
 				return -EINVAL;
 
@@ -3036,7 +3036,7 @@ static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
 		case OVS_ACTION_ATTR_POP_ETH:
 			if (mac_proto != MAC_PROTO_ETHERNET)
 				return -EINVAL;
-			if (vlan_tci & htons(VLAN_TAG_PRESENT))
+			if (vlan_tci & htons(VLAN_CFI_MASK))
 				return -EINVAL;
 			mac_proto = MAC_PROTO_NONE;
 			break;
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Edward Cree @ 2018-11-08 17:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Martin Lau, Yonghong Song, Alexei Starovoitov,
	daniel@iogearbox.net, netdev@vger.kernel.org, Kernel Team
In-Reply-To: <20181107214922.xjqcacj5rc5hmepw@ast-mbp.dhcp.thefacebook.com>

On 07/11/18 21:49, Alexei Starovoitov wrote:
> On Wed, Nov 07, 2018 at 07:29:31PM +0000, Edward Cree wrote:
>> Whereas I don't, and I don't feel like my core criticisms have
>>  been addressed _at all_.  The only answer I get to "BTF should
>>  store type and instance information in separate records" is
>>  "it's a debuginfo",
> ...
>>  I am just trying to organise
>>  BTF to consist of separate _parts_ for types and instances,
>>  rather than forcing both into the same Procrustean bed.
> BTF does not have and should not have instances.
> BTF is debug info only.
> This is not negotiable.
I'm not saying the instances go in BTF, I'm saying that debug info
 *about* instances goes in BTF (it already does, as you keep saying
 BTF is "not just pure types"), and that that ought to be
 distinguished within the format from debug info about types.

> So I'm looking forward to your ideas how to describe BTF in .s
> Note such .s must have freedom to describe 'int bar(struct __sk_buff *a1, char a2)'
> as debug info while having '.globl foo; foo:' as symbol name.
I've pushed out a branch with what I have; see
 https://github.com/solarflarecom/ebpf_asm/tree/btfdoc
 (with some examples in dropper.s and documentation in the README).
In particular note that the BTF section is currently entirely
 decoupled from the .text, so indeed nothing ties function names
 to symbol names.  I do not yet have anything generating FuncInfo
 (or LineInfo) tables, but when I do that will remain decoupled.

> Your other 'criticism' was about libbpf's bpf_map_find_btf_info()
> and ____btf_map_* hack. Yes. It is a hack and I'm open to change it
> if there are better suggestions. It's a convention between
> libbpf and program writers that produce elf. It's not a kernel abi.
> Nothing to do with BTF and this instance vs debug info discussion.
It's everything to do with it: it's defining a type with a magic name
 (____btf_map_foo) when what we really want to do is declare an
 instance (the map 'foo').  And it may not be a kernel ABI, but it's
 a part of the file format you're defining (whether that's just a
 'convention' or something more), and if you want the BTF ecosystem
 to be more than just an llvm monoculture then the format needs to be
 properly specified so that others can work with it.
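To make the objection concrete, here is a rough sketch of the convention
 as I understand it, simplified from the annotation macro in the bpf
 selftests helpers.  The macro name ANNOTATE_KV_PAIR, the struct
 flow_stats and the two accessor functions are made up for illustration;
 the real macro also places the dummy variable in a per-map ELF section,
 which is left out here so the fragment builds as ordinary host C.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified sketch of the annotation convention: declare a struct type
 * with the magic name ____btf_map_<name> whose two members carry the
 * map's key and value types, plus a dummy variable so the compiler
 * emits debug info for that type.  (Assumption: the section attribute
 * used by the real macro is omitted.)
 */
#define ANNOTATE_KV_PAIR(name, type_key, type_val)		\
	struct ____btf_map_##name {				\
		type_key key;					\
		type_val value;					\
	};							\
	struct ____btf_map_##name ____btf_map_##name = { 0 }

/* Hypothetical value type for a map called 'foo'. */
struct flow_stats {
	uint64_t packets;
	uint64_t bytes;
};

/* What the program writer wants to say is "map foo has u32 keys and
 * struct flow_stats values".  What actually gets recorded is a *type*
 * named ____btf_map_foo; the instance 'foo' is only implied by the name.
 */
ANNOTATE_KV_PAIR(foo, uint32_t, struct flow_stats);

/* A loader that finds the type by its magic name can recover the key
 * and value layouts from the two members.
 */
size_t foo_key_size(void)
{
	return sizeof(((struct ____btf_map_foo *)0)->key);
}

size_t foo_value_size(void)
{
	return sizeof(((struct ____btf_map_foo *)0)->value);
}
```

Nothing in the resulting record says "this describes map foo" other than
 the string prefix on the type name, which is the part I would like to
 see expressed as a separate kind of record.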

> Happy to jump on the call to explain it again.
> 10:30am pacific time works for me tomorrow.
That works for me (that's in ~30 minutes from now if I've converted
 correctly.)  Please email me offlist with the phone number to call.

-Ed

