Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next] liquidio: bump up driver version to 1.7.2 to match newer NIC firmware
From: Felix Manlunas @ 2018-05-09 18:49 UTC (permalink / raw)
  To: davem; +Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla

Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
 drivers/net/ethernet/cavium/liquidio/liquidio_common.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
index 285b248..690424b 100644
--- a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
+++ b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
@@ -28,7 +28,7 @@
 #define LIQUIDIO_PACKAGE ""
 #define LIQUIDIO_BASE_MAJOR_VERSION 1
 #define LIQUIDIO_BASE_MINOR_VERSION 7
-#define LIQUIDIO_BASE_MICRO_VERSION 0
+#define LIQUIDIO_BASE_MICRO_VERSION 2
 #define LIQUIDIO_BASE_VERSION   __stringify(LIQUIDIO_BASE_MAJOR_VERSION) "." \
 				__stringify(LIQUIDIO_BASE_MINOR_VERSION)
 #define LIQUIDIO_MICRO_VERSION  "." __stringify(LIQUIDIO_BASE_MICRO_VERSION)

^ permalink raw reply related

* Re: [PATCH net-next v3 0/2] socket statistics for ss
From: Eric Dumazet @ 2018-05-09 18:53 UTC (permalink / raw)
  To: Stephen Hemminger, Eric Dumazet
  Cc: davem, gerrit, kuznet, yoshfuji, netdev, dccp, Stephen Hemminger
In-Reply-To: <20180509114456.0303a026@xeon-e3>



On 05/09/2018 11:44 AM, Stephen Hemminger wrote:
> On Wed, 9 May 2018 10:53:58 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On 05/09/2018 10:31 AM, Stephen Hemminger wrote:
>>> On Wed, 9 May 2018 10:18:23 -0700
>>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>   
>>>> On 05/09/2018 08:22 AM, Stephen Hemminger wrote:
>>>>  
>>>>> I am not sure if these patches are worth applying.
>>>>> The 'ss -s' command has had missing values since 2.4 kernel.
>>>>> And the first complaints came in only this year.
>>>>>
>>>>> Another alternative would be just to remove these fields from ss -s
>>>>> output and move on.
>>>>>     
>>>>
>>>> Anyway your patches are not netns ready, so lets remove these fields from ss.
>>>>
>>>> Or you have to spend _much_ more time on writing and testing the kernel part.
>>>>
>>>> Thanks.  
>>>
>>> The patches only expose the existing TCP socket accounting infrastructure.
>>> Several other pieces that sockstat has are not netns aware.
>>> That is a completely different problem.  
>>
>>
>> Adding a new field counting 'bounds ports' without being netns ready is a total mistake,
>> as it is useless by current standards.
>>
>> The first thing that users will do is add proper netns support, with extra complexity in the kernel.
>>
>> So, instead of pushing some incomplete feature, trying to fool ourselves with a sentiment of 'small cost'
>> that will later need another 100 lines of code in the kernel, please give us the complete picture.
>>
>> I am just saying, you can of course ignore my feedback.
> 
> The current TCP hashinfo should be moved into netns. The current method of scanning and matching
> by net namespace is a scalability issue now.

It is not the plan yet, and we have no scalability issue.

Before switching to netns hash table, this would need rhashtable conversions
but so far this has not been done.

- Time to create/delete netns is critical.
- Adding few Mbytes of overhead per netns is a nogo,

Please do not change subject, this is adding noise to this particular thread.

^ permalink raw reply

* Re: Performance regression between 4.13 and 4.14
From: Ben Greear @ 2018-05-09 19:02 UTC (permalink / raw)
  To: Eric Dumazet, netdev
In-Reply-To: <988a00db-bab3-7318-5e31-2a7c0948262a@gmail.com>

On 05/09/2018 11:48 AM, Eric Dumazet wrote:
>
>
> On 05/09/2018 11:43 AM, Ben Greear wrote:
>> On 05/08/2018 10:10 AM, Eric Dumazet wrote:
>>>
>>>
>>> On 05/08/2018 09:44 AM, Ben Greear wrote:
>>>> Hello,
>>>>
>>>> I am trying to track down a performance regression that appears to be between 4.13
>>>> and 4.14.
>>>>
>>>> I first saw the problem with a hacked version of pktgen on some ixgbe NICs.  4.13 can do
>>>> right at 10G bi-directional on two ports, and 4.14 and later can do only about 6Gbps.
>>>>
>>>> I also tried with user-space UDP traffic on a stock kernel, and I can get about 3.2Gbps combined tx+rx
>>>> on 4.14 and about 4.4Gbps on 4.13.
>>>>
>>>> Attempting to bisect seems to be triggering a weirdness in git, and also lots of commits
>>>> crash or do not bring up networking, which makes the bisect difficult.
>>>>
>>>> Looking at perf top, it would appear that some lock is probably to blame.
>>>
>>>
>>> perf record -a -g -e cycles:pp sleep 5
>>> perf report
>>>
>>> Then you'll be able to tell us which lock (or call graph) is killing your perf.
>>>
>>
>> I seem to be chasing multiple issues.  For 4.13, at least part of my problem was that LOCKDEP was enabled,
>> during my bisect, though it does NOT appear enabled in 4.16.  I think maybe CONFIG_LOCKDEP moved to CONFIG_PROVE_LOCKING
>> in 4.16, or something like that?  My 4.16 .config does have CONFIG_LOCKDEP_SUPPORT enabled, and I see no option to disable it:
>>
>> [greearb@ben-dt3 linux-4.16.x64]$ grep LOCKDEP .config
>> CONFIG_LOCKDEP_SUPPORT=y
>>
>>
>> For 4.16, I am disabling RETRAMPOLINE...are there any other such things I need
>> to disable to keep from getting a performance hit from the spectre-related bug
>> fixes?  At this point, I do not care about the security implications.
>>
>> greearb@ben-dt3 linux-4.16.x64]$ grep RETPO .config
>> # CONFIG_RETPOLINE is not set
>>
>>
>> Thanks,
>> Ben
>>
>
> No idea really, you mention a 4.13 -> 4.14 regression and jump then to 4.16 :/

I initially saw the problem in 4.16, then bisected, and 4.14 still showed the
issue.

4.13 works, but only when I use a .config I originally built for 4.13, not the 4.16 .config
that I ended up using with the bisect (make oldconfig, accept all defaults).  I originally
configured 4.16 with a .config that had lockdep enabled, then manually tried to disable it
through 'make xconfig'.  I think that must leave "CONFIG_LOCKDEP=y" in the .config, which
screws up older builds during bisect, perhaps?

> Before doing a (painful) dissection, the perf output would immediately tell you if
> something is really wrong on your .config.

I didn't realize lockdep might be an issue at the time, but here is a 'bad' run from
a 4.13+ (plus pktgen hacks).  I guess lockdep is why this runs slowly, but I see no obvious
proof of that in the output:

4.13+, patched pktgen, 6Gbps throughput, on commit 906dde0f355bd97c080c215811ae7db1137c4af8

Samples: 26K of event 'cycles:pp', Event count (approx.): 20119166736
   Children      Self  Command          Shared Object                Symbol
+   87.97%     0.00%  kpktgend_1       [kernel.kallsyms]            [k] ret_from_fork
+   87.97%     0.00%  kpktgend_1       [kernel.kallsyms]            [k] kthread
+   86.89%     5.42%  kpktgend_1       [kernel.kallsyms]            [k] pktgen_thread_worker
+   33.75%     0.18%  kpktgend_1       [kernel.kallsyms]            [k] getnstimeofday64
+   32.77%     4.47%  kpktgend_1       [kernel.kallsyms]            [k] __getnstimeofday64
+   24.60%    10.91%  kpktgend_1       [kernel.kallsyms]            [k] lock_acquire
+   23.59%     0.03%  kpktgend_1       [kernel.kallsyms]            [k] __do_softirq
+   23.55%     0.07%  kpktgend_1       [kernel.kallsyms]            [k] net_rx_action
+   22.29%     0.47%  kpktgend_1       [kernel.kallsyms]            [k] getRelativeCurNs
+   21.33%     1.71%  kpktgend_1       [kernel.kallsyms]            [k] ixgbe_poll
+   15.79%     0.02%  kpktgend_1       [kernel.kallsyms]            [k] ret_from_intr
+   15.78%     0.01%  kpktgend_1       [kernel.kallsyms]            [k] do_IRQ
+   15.34%     0.01%  kpktgend_1       [kernel.kallsyms]            [k] irq_exit
+   13.95%    10.00%  kpktgend_1       [kernel.kallsyms]            [k] ip_send_check
+   13.80%    13.80%  kpktgend_1       [kernel.kallsyms]            [k] __lock_acquire.isra.31
+   12.98%     0.53%  kpktgend_1       [kernel.kallsyms]            [k] pktgen_finalize_skb
+   12.31%     0.20%  kpktgend_1       [kernel.kallsyms]            [k] timestamp_skb.isra.24
+   11.68%     0.13%  kpktgend_1       [kernel.kallsyms]            [k] napi_gro_receive
+   11.36%     0.25%  kpktgend_1       [kernel.kallsyms]            [k] netif_receive_skb_internal
+   10.93%     0.00%  swapper          [kernel.kallsyms]            [k] verify_cpu
+   10.93%     0.00%  swapper          [kernel.kallsyms]            [k] cpu_startup_entry
+   10.92%     0.02%  swapper          [kernel.kallsyms]            [k] do_idle
+   10.71%     0.00%  swapper          [kernel.kallsyms]            [k] cpuidle_enter
+   10.71%     0.00%  swapper          [kernel.kallsyms]            [k] call_cpuidle
+   10.66%     0.06%  swapper          [kernel.kallsyms]            [k] cpuidle_enter_state
+    9.78%     0.00%  swapper          [kernel.kallsyms]            [k] x86_64_start_kernel
+    9.78%     0.00%  swapper          [kernel.kallsyms]            [k] x86_64_start_reservations
+    9.78%     0.00%  swapper          [kernel.kallsyms]            [k] start_kernel
+    9.78%     0.00%  swapper          [kernel.kallsyms]            [k] rest_init
+    9.25%     0.00%  swapper          [kernel.kallsyms]            [k] ret_from_intr
+    9.25%     0.00%  swapper          [kernel.kallsyms]            [k] do_IRQ
+    9.24%     0.30%  kpktgend_1       [kernel.kallsyms]            [k] pktgen_alloc_skb
+    9.14%     0.00%  swapper          [kernel.kallsyms]            [k] irq_exit
+    8.89%     0.00%  swapper          [kernel.kallsyms]            [k] __do_softirq
+    8.82%     0.01%  swapper          [kernel.kallsyms]            [k] net_rx_action
+    8.80%     0.47%  kpktgend_1       [kernel.kallsyms]            [k] __local_bh_enable_ip
+    8.67%     6.87%  kpktgend_1       [kernel.kallsyms]            [k] lock_release
+    8.57%     1.10%  kpktgend_1       [kernel.kallsyms]            [k] __netdev_alloc_skb
+    8.28%     0.00%  kpktgend_1       [kernel.kallsyms]            [k] do_softirq.part.17
+    8.28%     0.00%  kpktgend_1       [kernel.kallsyms]            [k] do_softirq_own_stack
+    8.01%     0.74%  swapper          [kernel.kallsyms]            [k] ixgbe_poll
+    6.85%     2.28%  kpktgend_1       [kernel.kallsyms]            [k] __build_skb
+    4.89%     0.07%  kpktgend_1       [kernel.kallsyms]            [k] __netif_receive_skb
+    4.87%     0.75%  kpktgend_1       [kernel.kallsyms]            [k] __netif_receive_skb_core
+    4.66%     0.16%  kpktgend_1       [kernel.kallsyms]            [k] _raw_spin_lock
+    4.37%     0.06%  swapper          [kernel.kallsyms]            [k] napi_gro_receive
+    4.23%     0.10%  swapper          [kernel.kallsyms]            [k] netif_receive_skb_internal



Here is a 'bad' run of 4.16.0 + my full out-of-tree patches, including my normal pktgen changes
(pktgen code should be nearly identical between the run above and the run here)

Samples: 26K of event 'cycles:pp', Event count (approx.): 19888249841
   Children      Self  Command          Shared Object                 Symbol
+   74.24%     5.38%  kpktgend_1       [kernel.kallsyms]             [k] pktgen_thread_worker
+   73.89%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] ret_from_fork
+   73.89%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] kthread
+   27.31%     0.25%  kpktgend_1       [kernel.kallsyms]             [k] getnstimeofday64
+   26.15%     4.24%  kpktgend_1       [kernel.kallsyms]             [k] __getnstimeofday64
+   22.64%     0.02%  kpktgend_1       [kernel.kallsyms]             [k] __softirqentry_text_s
+   22.62%     0.06%  kpktgend_1       [kernel.kallsyms]             [k] net_rx_action
+   22.59%     2.04%  kpktgend_1       [kernel.kallsyms]             [k] ixgbe_poll
+   21.49%     7.92%  kpktgend_1       [kernel.kallsyms]             [k] lock_acquire
+   17.78%    17.78%  kpktgend_1       [kernel.kallsyms]             [k] __lock_acquire.isra.3
+   15.95%     0.39%  kpktgend_1       [kernel.kallsyms]             [k] getRelativeCurNs
+   14.49%     0.02%  kpktgend_1       [kernel.kallsyms]             [k] ret_from_intr
+   14.48%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] do_IRQ
+   14.35%     0.01%  kpktgend_1       [kernel.kallsyms]             [k] irq_exit
+   13.46%     9.90%  kpktgend_1       [kernel.kallsyms]             [k] ip_send_check
+   11.36%     0.18%  kpktgend_1       [kernel.kallsyms]             [k] napi_gro_receive
+   11.25%     0.17%  kpktgend_1       [kernel.kallsyms]             [k] netif_receive_skb_int
+   10.37%     0.16%  kpktgend_1       [kernel.kallsyms]             [k] timestamp_skb.isra.25
+    8.85%     0.45%  kpktgend_1       [kernel.kallsyms]             [k] __local_bh_enable_ip
+    8.61%     0.02%  swapper          [kernel.kallsyms]             [k] do_idle
+    8.60%     0.00%  swapper          [kernel.kallsyms]             [k] secondary_startup_64
+    8.60%     0.00%  swapper          [kernel.kallsyms]             [k] cpu_startup_entry
+    8.37%     0.05%  swapper          [kernel.kallsyms]             [k] cpuidle_enter_state
+    8.35%     7.57%  kpktgend_1       [kernel.kallsyms]             [k] lock_release
+    8.32%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] do_softirq.part.16
+    8.31%     0.00%  kpktgend_1       [kernel.kallsyms]             [k] do_softirq_own_stack
+    7.16%     0.36%  kpktgend_1       [kernel.kallsyms]             [k] pktgen_alloc_skb
+    7.09%     1.14%  kpktgend_1       [kernel.kallsyms]             [k] __netdev_alloc_skb
+    6.87%     0.00%  swapper          [kernel.kallsyms]             [k] start_kernel
+    6.55%     0.00%  swapper          [kernel.kallsyms]             [k] ret_from_intr
+    6.55%     0.01%  swapper          [kernel.kallsyms]             [k] do_IRQ
+    6.51%     0.01%  swapper          [kernel.kallsyms]             [k] irq_exit
+    6.34%     0.02%  swapper          [kernel.kallsyms]             [k] __softirqentry_text_s
+    6.25%     0.01%  swapper          [kernel.kallsyms]             [k] net_rx_action
+    6.06%     0.45%  swapper          [kernel.kallsyms]             [k] ixgbe_poll
+    5.35%     2.11%  kpktgend_1       [kernel.kallsyms]             [k] __build_skb


I tried testing with stock kernels and UDP user-space traffic, but results varied widely
from run to run, so it was useless for bisecting.

I have irqbalance disabled on this system in order to pin pktgen process and ixgbe IRQs
to be efficient, so maybe that is why user-space UDP traffic works so unreliably.  My hacked pktgen
reliably shows good on my good 4.13 kernel and reliably bad on higher kernels I have compiled
so far, so I decided it was the only useful way to bisect this problem.

I will now try 4.14 and higher again with lockdep definitely disabled.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: dsa: mv88e6xxx: add a cascade port op
From: Andrew Lunn @ 2018-05-09 19:05 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, linux-kernel, kernel, davem, f.fainelli
In-Reply-To: <20180509153851.10207-2-vivien.didelot@savoirfairelinux.com>

On Wed, May 09, 2018 at 11:38:49AM -0400, Vivien Didelot wrote:
> Only the 88E6185 family has bits 15:12 Cascade Port bits in the Global
> Control 2 register. Hence inconsistent values are actually written in
> this register for other families.
> 
> Add a .set_cascade_port operation to isolate the 88E6185 case, and call
> it from the device mapping setup function.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 2/3] net: dsa: mv88e6xxx: set device number
From: Andrew Lunn @ 2018-05-09 19:06 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, linux-kernel, kernel, davem, f.fainelli
In-Reply-To: <20180509153851.10207-3-vivien.didelot@savoirfairelinux.com>

On Wed, May 09, 2018 at 11:38:50AM -0400, Vivien Didelot wrote:
> All Marvell switches supported by mv88e6xxx have to set their device
> number in the Global Control 2 register. Extract this in a read then
> write function, called from the device mapping setup code.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: dsa: mv88e6xxx: add RMU disable op
From: Andrew Lunn @ 2018-05-09 19:08 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, linux-kernel, kernel, davem, f.fainelli
In-Reply-To: <20180509153851.10207-4-vivien.didelot@savoirfairelinux.com>

On Wed, May 09, 2018 at 11:38:51AM -0400, Vivien Didelot wrote:
> The RMU mode bits moved a lot within the Global Control 2 register of
> the Marvell switch families. Add an .rmu_disable op to support at least
> 3 known alternatives.
> 
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* [PATCH bpf-next v4 0/6] ipv6: sr: introduce seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-05-09 21:16 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov

As of Linux 4.14, it is possible to define advanced local processing for
IPv6 packets with a Segment Routing Header through the seg6local LWT
infrastructure. This LWT implements the network programming principles
defined in the IETF “SRv6 Network Programming” draft.

The implemented operations are generic, and it would be very interesting to
be able to implement user-specific seg6local actions, without having to
modify the kernel directly. To do so, this patchset adds an End.BPF action
to seg6local, powered by some specific Segment Routing-related helpers,
which provide SR functionalities that can be applied on the packet. This
BPF hook would then allow to implement specific actions at native kernel
speed such as OAM features, advanced SR SDN policies, SRv6 actions like
Segment Routing Header (SRH) encapsulation depending on the content of
the packet, etc.

This patchset is divided in 6 patches, whose main features are :

- A new seg6local action End.BPF with the corresponding new BPF program
  type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
  passed to the LWT seg6local through netlink, the same way as the LWT
  BPF hook operates.
- 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/
  shrink a SRH and apply on a packet some of the generic SRv6 actions.
- 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through
  encapsulation (via IPv6 encapsulation or inlining if the packet contains
  already an IPv6 header).

As this patchset adds a new LWT BPF hook, I took into account the result of
the discussions when the LWT BPF infrastructure got merged. Hence, the
seg6local BPF hook doesn’t allow write access to skb->data directly, only
the SRH can be modified through specific helpers, which ensures that the
integrity of the packet is maintained.
More details are available in the related patches messages.

The performances of this BPF hook have been assessed with the BPF JIT
enabled on a Intel Xeon X3440 processors with 4 cores and 8 threads
clocked at 2.53 GHz. No throughput losses are noted with the seg6local
BPF hook when the BPF program does nothing (440kpps). Adding a 8-bytes
TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
drops the throughput to 410kpps, and inlining a SRH via
bpf_lwt_seg6_action drops the throughput to 420kpps.
All throughputs are stable.

-------
v2: move the SRH integrity state from skb->cb to a per-cpu buffer
v3: - document helpers in man-page style
    - fix kbuild bugs
    - un-break BPF LWT out hook
    - bpf_push_seg6_encap is now static
    - preempt_enable is now called when the packet is dropped in
      input_action_end_bpf
v4: fix kbuild bugs when CONFIG_IPV6=m

Thanks.


Mathieu Xhonneux (6):
  ipv6: sr: make seg6.h includable without IPv6
  ipv6: sr: export function lookup_nexthop
  bpf: Add IPv6 Segment Routing helpers
  bpf: Split lwt inout verifier structures
  ipv6: sr: Add seg6local action End.BPF
  selftests/bpf: test for seg6local End.BPF action

 include/linux/bpf_types.h                         |   7 +-
 include/net/seg6.h                                |   7 +-
 include/net/seg6_local.h                          |  32 ++
 include/uapi/linux/bpf.h                          |  96 ++++-
 include/uapi/linux/seg6_local.h                   |   3 +
 kernel/bpf/verifier.c                             |   1 +
 net/core/filter.c                                 | 390 ++++++++++++++++---
 net/ipv6/Kconfig                                  |   5 +
 net/ipv6/seg6_local.c                             | 180 ++++++++-
 tools/include/uapi/linux/bpf.h                    |  97 ++++-
 tools/testing/selftests/bpf/Makefile              |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h         |  12 +
 tools/testing/selftests/bpf/test_lwt_seg6local.c  | 438 ++++++++++++++++++++++
 tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++++++
 14 files changed, 1340 insertions(+), 73 deletions(-)
 create mode 100644 include/net/seg6_local.h
 create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh

-- 
2.16.1

^ permalink raw reply

* [PATCH bpf-next v4 1/6] ipv6: sr: make seg6.h includable without IPv6
From: Mathieu Xhonneux @ 2018-05-09 21:16 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1525898587.git.m.xhonneux@gmail.com>

include/net/seg6.h cannot be included in a source file if CONFIG_IPV6 is
not enabled:
   include/net/seg6.h: In function 'seg6_pernet':
>> include/net/seg6.h:52:14: error: 'struct net' has no member named
                                        'ipv6'; did you mean 'ipv4'?
     return net->ipv6.seg6_data;
                 ^~~~
                 ipv4

This commit makes seg6_pernet return NULL if IPv6 is not compiled, hence
allowing seg6.h to be included regardless of the configuration.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
---
 include/net/seg6.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/net/seg6.h b/include/net/seg6.h
index 099bad59dc90..70b4cfac52d7 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -49,7 +49,11 @@ struct seg6_pernet_data {
 
 static inline struct seg6_pernet_data *seg6_pernet(struct net *net)
 {
+#if IS_ENABLED(CONFIG_IPV6)
 	return net->ipv6.seg6_data;
+#else
+	return NULL;
+#endif
 }
 
 extern int seg6_init(void);
-- 
2.16.1

^ permalink raw reply related

* [PATCH bpf-next v4 2/6] ipv6: sr: export function lookup_nexthop
From: Mathieu Xhonneux @ 2018-05-09 21:16 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1525898587.git.m.xhonneux@gmail.com>

The function lookup_nexthop is essential to implement most of the seg6local
actions. As we want to provide a BPF helper allowing to apply some of these
actions on the packet being processed, the helper should be able to call
this function, hence the need to make it public.

Moreover, if one argument is incorrect or if the next hop can not be found,
an error should be returned by the BPF helper so the BPF program can adapt
its processing of the packet (return an error, properly force the drop,
...). This patch hence makes this function return dst->error to indicate a
possible error.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
 include/net/seg6.h       |  3 ++-
 include/net/seg6_local.h | 24 ++++++++++++++++++++++++
 net/ipv6/seg6_local.c    | 20 +++++++++++---------
 3 files changed, 37 insertions(+), 10 deletions(-)
 create mode 100644 include/net/seg6_local.h

diff --git a/include/net/seg6.h b/include/net/seg6.h
index 70b4cfac52d7..e029e301faa5 100644
--- a/include/net/seg6.h
+++ b/include/net/seg6.h
@@ -67,5 +67,6 @@ extern bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len);
 extern int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh,
 			     int proto);
 extern int seg6_do_srh_inline(struct sk_buff *skb, struct ipv6_sr_hdr *osrh);
-
+extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+			       u32 tbl_id);
 #endif
diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
new file mode 100644
index 000000000000..57498b23085d
--- /dev/null
+++ b/include/net/seg6_local.h
@@ -0,0 +1,24 @@
+/*
+ *  SR-IPv6 implementation
+ *
+ *  Authors:
+ *  David Lebrun <david.lebrun@uclouvain.be>
+ *  eBPF support: Mathieu Xhonneux <m.xhonneux@gmail.com>
+ *
+ *
+ *  This program is free software; you can redistribute it and/or
+ *      modify it under the terms of the GNU General Public License
+ *      as published by the Free Software Foundation; either version
+ *      2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _NET_SEG6_LOCAL_H
+#define _NET_SEG6_LOCAL_H
+
+#include <linux/net.h>
+#include <linux/ipv6.h>
+
+extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+			       u32 tbl_id);
+
+#endif
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index 45722327375a..e9b23fb924ad 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -30,6 +30,7 @@
 #ifdef CONFIG_IPV6_SEG6_HMAC
 #include <net/seg6_hmac.h>
 #endif
+#include <net/seg6_local.h>
 #include <linux/etherdevice.h>
 
 struct seg6_local_lwt;
@@ -140,8 +141,8 @@ static void advance_nextseg(struct ipv6_sr_hdr *srh, struct in6_addr *daddr)
 	*daddr = *addr;
 }
 
-static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
-			   u32 tbl_id)
+int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
+			u32 tbl_id)
 {
 	struct net *net = dev_net(skb->dev);
 	struct ipv6hdr *hdr = ipv6_hdr(skb);
@@ -187,6 +188,7 @@ static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
 
 	skb_dst_drop(skb);
 	skb_dst_set(skb, dst);
+	return dst->error;
 }
 
 /* regular endpoint function */
@@ -200,7 +202,7 @@ static int input_action_end(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	lookup_nexthop(skb, NULL, 0);
+	seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
@@ -220,7 +222,7 @@ static int input_action_end_x(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	lookup_nexthop(skb, &slwt->nh6, 0);
+	seg6_lookup_nexthop(skb, &slwt->nh6, 0);
 
 	return dst_input(skb);
 
@@ -239,7 +241,7 @@ static int input_action_end_t(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 
 	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
 
-	lookup_nexthop(skb, NULL, slwt->table);
+	seg6_lookup_nexthop(skb, NULL, slwt->table);
 
 	return dst_input(skb);
 
@@ -331,7 +333,7 @@ static int input_action_end_dx6(struct sk_buff *skb,
 	if (!ipv6_addr_any(&slwt->nh6))
 		nhaddr = &slwt->nh6;
 
-	lookup_nexthop(skb, nhaddr, 0);
+	seg6_lookup_nexthop(skb, nhaddr, 0);
 
 	return dst_input(skb);
 drop:
@@ -380,7 +382,7 @@ static int input_action_end_dt6(struct sk_buff *skb,
 	if (!pskb_may_pull(skb, sizeof(struct ipv6hdr)))
 		goto drop;
 
-	lookup_nexthop(skb, NULL, slwt->table);
+	seg6_lookup_nexthop(skb, NULL, slwt->table);
 
 	return dst_input(skb);
 
@@ -406,7 +408,7 @@ static int input_action_end_b6(struct sk_buff *skb, struct seg6_local_lwt *slwt)
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
 	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 
-	lookup_nexthop(skb, NULL, 0);
+	seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
@@ -438,7 +440,7 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
 	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
 
-	lookup_nexthop(skb, NULL, 0);
+	seg6_lookup_nexthop(skb, NULL, 0);
 
 	return dst_input(skb);
 
-- 
2.16.1

^ permalink raw reply related

* [PATCH bpf-next v4 3/6] bpf: Add IPv6 Segment Routing helpers
From: Mathieu Xhonneux @ 2018-05-09 21:16 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1525898587.git.m.xhonneux@gmail.com>

The BPF seg6local hook should be powerful enough to enable users to
implement most of the use-cases one could think of. After some thinking,
we figured out that the following actions should be possible on a SRv6
packet, requiring 3 specific helpers :
    - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
    - bpf_lwt_seg6_adjust_srh: Allow to grow or shrink a SRH
                               (to add/delete TLVs)
    - bpf_lwt_seg6_action: Apply some SRv6 network programming actions
                           (specifically End.X, End.T, End.B6 and
                            End.B6.Encap)

The specifications of these helpers are provided in the patch (see
include/uapi/linux/bpf.h).

The non-sensitive fields of the SRH are the following : flags, tag and
TLVs. The other fields can not be modified, to maintain the SRH
integrity. Flags, tag and TLVs can easily be modified as their validity
can be checked afterwards via seg6_validate_srh. It is not allowed to
modify the segments directly. If one wants to add segments on the path,
he should stack a new SRH using the End.B6 action via
bpf_lwt_seg6_action.

Growing, shrinking or editing TLVs via the helpers will flag the SRH as
invalid, and it will have to be re-validated before re-entering the IPv6
layer. This flag is stored in a per-CPU buffer, along with the current
header length in bytes.

Storing the SRH len in bytes in the control block is mandatory when using
bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
len rounded to 8 bytes (a padding TLV can be inserted to ensure the 8-bytes
boundary). When adding/deleting TLVs within the BPF program, the SRH may
temporary be in an invalid state where its length cannot be rounded to 8
bytes without remainder, hence the need to store the length in bytes
separately. The caller of the BPF program can then ensure that the SRH's
final length is valid using this value. Again, a final SRH modified by a
BPF program which doesn’t respect the 8-bytes boundary will be discarded
as it will be considered as invalid.

Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
available from the LWT BPF IN hook, but not from the seg6local BPF one.
This helper allows to encapsulate a Segment Routing Header (either with
a new outer IPv6 header, or by inlining it directly in the existing IPv6
header) into a non-SRv6 packet. This helper is required if we want to
offer the possibility to dynamically encapsulate a SRH for non-SRv6 packet,
as the BPF seg6local hook only works on traffic already containing a SRH.
This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
the same purpose but with a static SRH per route.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
 include/net/seg6_local.h |   8 ++
 include/uapi/linux/bpf.h |  95 +++++++++++++++-
 net/core/filter.c        | 282 +++++++++++++++++++++++++++++++++++++++++++----
 net/ipv6/Kconfig         |   5 +
 net/ipv6/seg6_local.c    |   2 +
 5 files changed, 368 insertions(+), 24 deletions(-)

diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h
index 57498b23085d..661fd5b4d3e0 100644
--- a/include/net/seg6_local.h
+++ b/include/net/seg6_local.h
@@ -15,10 +15,18 @@
 #ifndef _NET_SEG6_LOCAL_H
 #define _NET_SEG6_LOCAL_H
 
+#include <linux/percpu.h>
 #include <linux/net.h>
 #include <linux/ipv6.h>
 
 extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr,
 			       u32 tbl_id);
 
+struct seg6_bpf_srh_state {
+	bool valid;
+	u16 hdrlen;
+};
+
+DECLARE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+
 #endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d615c777b573..39006c4a30d0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1828,6 +1828,89 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
+ * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ *	Description
+ *		Encapsulate the packet associated to *skb* within a Layer 3
+ *		protocol header. This header is provided in the buffer at
+ *		address *hdr*, with *len* its size in bytes. *type* indicates
+ *		the protocol of the header and can be one of:
+ *
+ *		**BPF_LWT_ENCAP_SEG6**
+ *			IPv6 encapsulation with Segment Routing Header
+ *			(**struct ipv6_sr_hdr**). *hdr* only contains the SRH,
+ *			the IPv6 header is computed by the kernel.
+ *		**BPF_LWT_ENCAP_SEG6_INLINE**
+ *			Only works if *skb* contains an IPv6 packet. Insert a
+ *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
+ *			the IPv6 header.
+ *
+ * 		A call to this helper is susceptible to change the underlaying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
+ *	Description
+ *		Store *len* bytes from address *from* into the packet
+ *		associated to *skb*, at *offset*. Only the flags, tag and TLVs
+ *		inside the outermost IPv6 Segment Routing Header can be
+ *		modified through this helper.
+ *
+ * 		A call to this helper is susceptible to change the underlaying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
+ *	Description
+ *		Adjust the size allocated to TLVs in the outermost IPv6
+ *		Segment Routing Header contained in the packet associated to
+ *		*skb*, at position *offset* by *delta* bytes. Only offsets
+ *		after the segments are accepted. *delta* can be as well
+ *		positive (growing) as negative (shrinking).
+ *
+ * 		A call to this helper is susceptible to change the underlaying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
+ *	Description
+ *		Apply an IPv6 Segment Routing action of type *action* to the
+ *		packet associated to *skb*. Each action takes a parameter
+ *		contained at address *param*, and of length *param_len* bytes.
+ *		*action* can be one of:
+ *
+ *		**SEG6_LOCAL_ACTION_END_X**
+ *			End.X action: Endpoint with Layer-3 cross-connect.
+ *			Type of *param*: **struct in6_addr**.
+ *		**SEG6_LOCAL_ACTION_END_T**
+ *			End.T action: Endpoint with specific IPv6 table lookup.
+ *			Type of *param*: **int**.
+ *		**SEG6_LOCAL_ACTION_END_B6**
+ *			End.B6 action: Endpoint bound to an SRv6 policy.
+ *			Type of param: **struct ipv6_sr_hdr**.
+ *		**SEG6_LOCAL_ACTION_END_B6_ENCAP**
+ *			End.B6.Encap action: Endpoint bound to an SRv6
+ *			encapsulation policy.
+ *			Type of param: **struct ipv6_sr_hdr**.
+ *
+ * 		A call to this helper is susceptible to change the underlaying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1898,7 +1981,11 @@ union bpf_attr {
 	FN(xdp_adjust_tail),		\
 	FN(skb_get_xfrm_state),		\
 	FN(get_stack),			\
-	FN(skb_load_bytes_relative),
+	FN(skb_load_bytes_relative),    \
+	FN(lwt_push_encap),		\
+	FN(lwt_seg6_store_bytes),	\
+	FN(lwt_seg6_adjust_srh),	\
+	FN(lwt_seg6_action),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1965,6 +2052,12 @@ enum bpf_hdr_start_off {
 	BPF_HDR_START_NET,
 };
 
+/* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
+enum bpf_lwt_encap_mode {
+	BPF_LWT_ENCAP_SEG6,
+	BPF_LWT_ENCAP_SEG6_INLINE
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/net/core/filter.c b/net/core/filter.c
index 0baa715e4699..48b3e9390f1e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -60,6 +60,10 @@
 #include <net/xfrm.h>
 #include <linux/bpf_trace.h>
 #include <net/xdp_sock.h>
+#include <net/ipv6.h>
+#include <linux/seg6_local.h>
+#include <net/seg6.h>
+#include <net/seg6_local.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -3322,28 +3326,6 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-bool bpf_helper_changes_pkt_data(void *func)
-{
-	if (func == bpf_skb_vlan_push ||
-	    func == bpf_skb_vlan_pop ||
-	    func == bpf_skb_store_bytes ||
-	    func == bpf_skb_change_proto ||
-	    func == bpf_skb_change_head ||
-	    func == bpf_skb_change_tail ||
-	    func == bpf_skb_adjust_room ||
-	    func == bpf_skb_pull_data ||
-	    func == bpf_clone_redirect ||
-	    func == bpf_l3_csum_replace ||
-	    func == bpf_l4_csum_replace ||
-	    func == bpf_xdp_adjust_head ||
-	    func == bpf_xdp_adjust_meta ||
-	    func == bpf_msg_pull_data ||
-	    func == bpf_xdp_adjust_tail)
-		return true;
-
-	return false;
-}
-
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -4032,6 +4014,261 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+{
+	int err;
+	struct ipv6_sr_hdr *srh = (struct ipv6_sr_hdr *)hdr;
+
+	if (!seg6_validate_srh(srh, len))
+		return -EINVAL;
+
+	switch (type) {
+	case BPF_LWT_ENCAP_SEG6_INLINE:
+		if (skb->protocol != htons(ETH_P_IPV6))
+			return -EBADMSG;
+
+		err = seg6_do_srh_inline(skb, srh);
+		break;
+	case BPF_LWT_ENCAP_SEG6:
+		skb_reset_inner_headers(skb);
+		skb->encapsulation = 1;
+		err = seg6_do_srh_encap(skb, srh, IPPROTO_IPV6);
+		break;
+	default:
+		return -EINVAL;
+	}
+	if (err)
+		return err;
+
+	ipv6_hdr(skb)->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
+	skb_set_transport_header(skb, sizeof(struct ipv6hdr));
+
+	return seg6_lookup_nexthop(skb, NULL, 0);
+}
+#endif /* CONFIG_IPV6_SEG6_BPF */
+
+BPF_CALL_4(bpf_lwt_push_encap, struct sk_buff *, skb, u32, type, void *, hdr,
+	   u32, len)
+{
+	switch (type) {
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+	case BPF_LWT_ENCAP_SEG6:
+	case BPF_LWT_ENCAP_SEG6_INLINE:
+		return bpf_push_seg6_encap(skb, type, hdr, len);
+#endif
+	default:
+		return -EINVAL;
+	}
+}
+
+static const struct bpf_func_proto bpf_lwt_push_encap_proto = {
+	.func		= bpf_lwt_push_encap,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset,
+	   const void *, from, u32, len)
+{
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+	struct seg6_bpf_srh_state *srh_state =
+		this_cpu_ptr(&seg6_bpf_srh_states);
+	void *srh_tlvs, *srh_end, *ptr;
+	struct ipv6_sr_hdr *srh;
+	int srhoff = 0;
+
+	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+		return -EINVAL;
+
+	srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+	srh_tlvs = (void *)((char *)srh + ((srh->first_segment + 1) << 4));
+	srh_end = (void *)((char *)srh + sizeof(*srh) + srh_state->hdrlen);
+
+	ptr = skb->data + offset;
+	if (ptr >= srh_tlvs && ptr + len <= srh_end)
+		srh_state->valid = 0;
+	else if (ptr < (void *)&srh->flags ||
+		 ptr + len > (void *)&srh->segments)
+		return -EFAULT;
+
+	if (unlikely(bpf_try_make_writable(skb, offset + len)))
+		return -EFAULT;
+
+	memcpy(ptr, from, len);
+	return 0;
+#else /* CONFIG_IPV6_SEG6_BPF */
+	return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_lwt_seg6_store_bytes_proto = {
+	.func		= bpf_lwt_seg6_store_bytes,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb,
+	   u32, action, void *, param, u32, param_len)
+{
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+	struct seg6_bpf_srh_state *srh_state =
+		this_cpu_ptr(&seg6_bpf_srh_states);
+	struct ipv6_sr_hdr *srh;
+	int srhoff = 0;
+	int err;
+
+	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+		return -EINVAL;
+	srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+
+	if (!srh_state->valid) {
+		if (unlikely((srh_state->hdrlen & 7) != 0))
+			return -EBADMSG;
+
+		srh->hdrlen = (u8)(srh_state->hdrlen >> 3);
+		if (unlikely(!seg6_validate_srh(srh, (srh->hdrlen + 1) << 3)))
+			return -EBADMSG;
+
+		srh_state->valid = 1;
+	}
+
+	switch (action) {
+	case SEG6_LOCAL_ACTION_END_X:
+		if (param_len != sizeof(struct in6_addr))
+			return -EINVAL;
+		return seg6_lookup_nexthop(skb, (struct in6_addr *)param, 0);
+	case SEG6_LOCAL_ACTION_END_T:
+		if (param_len != sizeof(int))
+			return -EINVAL;
+		return seg6_lookup_nexthop(skb, NULL, *(int *)param);
+	case SEG6_LOCAL_ACTION_END_B6:
+		err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6_INLINE,
+					  param, param_len);
+		if (!err)
+			srh_state->hdrlen =
+				((struct ipv6_sr_hdr *)param)->hdrlen << 3;
+		return err;
+	case SEG6_LOCAL_ACTION_END_B6_ENCAP:
+		err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6,
+					  param, param_len);
+		if (!err)
+			srh_state->hdrlen =
+				((struct ipv6_sr_hdr *)param)->hdrlen << 3;
+		return err;
+	default:
+		return -EINVAL;
+	}
+#else /* CONFIG_IPV6_SEG6_BPF */
+	return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_lwt_seg6_action_proto = {
+	.func		= bpf_lwt_seg6_action,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_MEM,
+	.arg4_type	= ARG_CONST_SIZE
+};
+
+BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset,
+	   s32, len)
+{
+#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
+	struct seg6_bpf_srh_state *srh_state =
+		this_cpu_ptr(&seg6_bpf_srh_states);
+	void *srh_end, *srh_tlvs, *ptr;
+	struct ipv6_sr_hdr *srh;
+	struct ipv6hdr *hdr;
+	int srhoff = 0;
+	int ret;
+
+	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+		return -EINVAL;
+	srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+
+	srh_tlvs = (void *)((unsigned char *)srh + sizeof(*srh) +
+			((srh->first_segment + 1) << 4));
+	srh_end = (void *)((unsigned char *)srh + sizeof(*srh) +
+			srh_state->hdrlen);
+	ptr = skb->data + offset;
+
+	if (unlikely(ptr < srh_tlvs || ptr > srh_end))
+		return -EFAULT;
+	if (unlikely(len < 0 && (void *)((char *)ptr - len) > srh_end))
+		return -EFAULT;
+
+	if (len > 0) {
+		ret = skb_cow_head(skb, len);
+		if (unlikely(ret < 0))
+			return ret;
+
+		ret = bpf_skb_net_hdr_push(skb, offset, len);
+	} else {
+		ret = bpf_skb_net_hdr_pop(skb, offset, -1 * len);
+	}
+	if (unlikely(ret < 0))
+		return ret;
+
+	hdr = (struct ipv6hdr *)skb->data;
+	hdr->payload_len = htons(skb->len - sizeof(struct ipv6hdr));
+
+	bpf_compute_data_pointers(skb);
+	srh_state->hdrlen += len;
+	srh_state->valid = 0;
+	return 0;
+#else /* CONFIG_IPV6_SEG6_BPF */
+	return -EOPNOTSUPP;
+#endif
+}
+
+static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
+	.func		= bpf_lwt_seg6_adjust_srh,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+bool bpf_helper_changes_pkt_data(void *func)
+{
+	if (func == bpf_skb_vlan_push ||
+	    func == bpf_skb_vlan_pop ||
+	    func == bpf_skb_store_bytes ||
+	    func == bpf_skb_change_proto ||
+	    func == bpf_skb_change_head ||
+	    func == bpf_skb_change_tail ||
+	    func == bpf_skb_adjust_room ||
+	    func == bpf_skb_pull_data ||
+	    func == bpf_clone_redirect ||
+	    func == bpf_l3_csum_replace ||
+	    func == bpf_l4_csum_replace ||
+	    func == bpf_xdp_adjust_head ||
+	    func == bpf_xdp_adjust_meta ||
+	    func == bpf_msg_pull_data ||
+	    func == bpf_xdp_adjust_tail ||
+	    func == bpf_lwt_push_encap ||
+	    func == bpf_lwt_seg6_store_bytes ||
+	    func == bpf_lwt_seg6_adjust_srh ||
+	    func == bpf_lwt_seg6_action
+	    )
+		return true;
+
+	return false;
+}
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -4436,7 +4673,6 @@ static bool lwt_is_valid_access(int off, int size,
 	return bpf_skb_is_valid_access(off, size, type, prog, info);
 }
 
-
 /* Attach type specific accesses */
 static bool __sock_filter_check_attach_type(int off,
 					    enum bpf_access_type access_type,
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 6794ddf0547c..f0e8a762ae0c 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -330,4 +330,9 @@ config IPV6_SEG6_HMAC
 
 	  If unsure, say N.
 
+config IPV6_SEG6_BPF
+	def_bool y
+	depends on IPV6_SEG6_LWTUNNEL
+	depends on IPV6 = y
+
 endif # IPV6
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index e9b23fb924ad..ae68c1ef8fb0 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -449,6 +449,8 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 	return err;
 }
 
+DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
+
 static struct seg6_action_desc seg6_action_table[] = {
 	{
 		.action		= SEG6_LOCAL_ACTION_END,
-- 
2.16.1

^ permalink raw reply related

* [PATCH bpf-next v4 4/6] bpf: Split lwt inout verifier structures
From: Mathieu Xhonneux @ 2018-05-09 21:16 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1525898587.git.m.xhonneux@gmail.com>

The new bpf_lwt_push_encap helper should only be accessible within the
LWT BPF IN hook, and not the OUT one, as this may lead to a skb under
panic.

At the moment, both LWT BPF IN and OUT share the same list of helpers,
whose calls are authorized by the verifier. This patch separates the
verifier ops for the IN and OUT hooks, and allows the IN hook to call the
bpf_lwt_push_encap helper.

This patch is also the occasion to put all lwt_*_func_proto functions
together for clarity. At the moment, socks_op_func_proto is in the middle
of lwt_inout_func_proto and lwt_xmit_func_proto.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
 include/linux/bpf_types.h |  4 +--
 net/core/filter.c         | 83 +++++++++++++++++++++++++++++------------------
 2 files changed, 54 insertions(+), 33 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index d7df1b323082..cc9d7e031330 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -9,8 +9,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock)
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr)
-BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout)
-BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_in)
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_out)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
diff --git a/net/core/filter.c b/net/core/filter.c
index 48b3e9390f1e..4962e3a44fcc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4448,33 +4448,6 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
-static const struct bpf_func_proto *
-lwt_inout_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
-{
-	switch (func_id) {
-	case BPF_FUNC_skb_load_bytes:
-		return &bpf_skb_load_bytes_proto;
-	case BPF_FUNC_skb_pull_data:
-		return &bpf_skb_pull_data_proto;
-	case BPF_FUNC_csum_diff:
-		return &bpf_csum_diff_proto;
-	case BPF_FUNC_get_cgroup_classid:
-		return &bpf_get_cgroup_classid_proto;
-	case BPF_FUNC_get_route_realm:
-		return &bpf_get_route_realm_proto;
-	case BPF_FUNC_get_hash_recalc:
-		return &bpf_get_hash_recalc_proto;
-	case BPF_FUNC_perf_event_output:
-		return &bpf_skb_event_output_proto;
-	case BPF_FUNC_get_smp_processor_id:
-		return &bpf_get_smp_processor_id_proto;
-	case BPF_FUNC_skb_under_cgroup:
-		return &bpf_skb_under_cgroup_proto;
-	default:
-		return bpf_base_func_proto(func_id);
-	}
-}
-
 static const struct bpf_func_proto *
 sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -4534,6 +4507,44 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+static const struct bpf_func_proto *
+lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	case BPF_FUNC_skb_pull_data:
+		return &bpf_skb_pull_data_proto;
+	case BPF_FUNC_csum_diff:
+		return &bpf_csum_diff_proto;
+	case BPF_FUNC_get_cgroup_classid:
+		return &bpf_get_cgroup_classid_proto;
+	case BPF_FUNC_get_route_realm:
+		return &bpf_get_route_realm_proto;
+	case BPF_FUNC_get_hash_recalc:
+		return &bpf_get_hash_recalc_proto;
+	case BPF_FUNC_perf_event_output:
+		return &bpf_skb_event_output_proto;
+	case BPF_FUNC_get_smp_processor_id:
+		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_skb_under_cgroup:
+		return &bpf_skb_under_cgroup_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
+lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_lwt_push_encap:
+		return &bpf_lwt_push_encap_proto;
+	default:
+		return lwt_out_func_proto(func_id, prog);
+	}
+}
+
 static const struct bpf_func_proto *
 lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -4565,7 +4576,7 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_set_hash_invalid:
 		return &bpf_set_hash_invalid_proto;
 	default:
-		return lwt_inout_func_proto(func_id, prog);
+		return lwt_out_func_proto(func_id, prog);
 	}
 }
 
@@ -6138,13 +6149,23 @@ const struct bpf_prog_ops cg_skb_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
-const struct bpf_verifier_ops lwt_inout_verifier_ops = {
-	.get_func_proto		= lwt_inout_func_proto,
+const struct bpf_verifier_ops lwt_in_verifier_ops = {
+	.get_func_proto		= lwt_in_func_proto,
+	.is_valid_access	= lwt_is_valid_access,
+	.convert_ctx_access	= bpf_convert_ctx_access,
+};
+
+const struct bpf_prog_ops lwt_in_prog_ops = {
+	.test_run		= bpf_prog_test_run_skb,
+};
+
+const struct bpf_verifier_ops lwt_out_verifier_ops = {
+	.get_func_proto		= lwt_out_func_proto,
 	.is_valid_access	= lwt_is_valid_access,
 	.convert_ctx_access	= bpf_convert_ctx_access,
 };
 
-const struct bpf_prog_ops lwt_inout_prog_ops = {
+const struct bpf_prog_ops lwt_out_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
-- 
2.16.1

^ permalink raw reply related

* [PATCH bpf-next v4 5/6] ipv6: sr: Add seg6local action End.BPF
From: Mathieu Xhonneux @ 2018-05-09 21:16 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1525898587.git.m.xhonneux@gmail.com>

This patch adds the End.BPF action to the LWT seg6local infrastructure.
This action works like any other seg6local End action, meaning that an IPv6
header with SRH is needed, whose DA has to be equal to the SID of the
action. It will also advance the SRH to the next segment, the BPF program
does not have to take care of this.

Since the BPF program may not be a source of instability in the kernel, it
is important to ensure that the integrity of the packet is maintained
before yielding it back to the IPv6 layer. The hook hence keeps track if
the SRH has been altered through the helpers, and re-validates its
content if needed with seg6_validate_srh. The state kept for validation is
stored in a per-CPU buffer. The BPF program is not allowed to directly
write into the packet, and only some fields of the SRH can be altered
through the helper bpf_lwt_seg6_store_bytes.

Performances profiling has shown that the SRH re-validation does not induce
a significant overhead. If the altered SRH is deemed as invalid, the packet
is dropped.

This validation is also done before executing any action through
bpf_lwt_seg6_action, and will not be performed again if the SRH is not
modified after calling the action.

The BPF program may return 3 types of return codes:
    - BPF_OK: the End.BPF action will look up the next destination through
             seg6_lookup_nexthop.
    - BPF_REDIRECT: if an action has been executed through the
          bpf_lwt_seg6_action helper, the BPF program should return this
          value, as the skb's destination is already set and the default
          lookup should not be performed.
    - BPF_DROP : the packet will be dropped.

This action requires CONFIG_IPV6=y (and not =m).

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
---
 include/linux/bpf_types.h       |   3 +
 include/uapi/linux/bpf.h        |   1 +
 include/uapi/linux/seg6_local.h |   3 +
 kernel/bpf/verifier.c           |   1 +
 net/core/filter.c               |  25 +++++++
 net/ipv6/seg6_local.c           | 158 +++++++++++++++++++++++++++++++++++++++-
 6 files changed, 188 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cc9d7e031330..f9b24f275869 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -12,6 +12,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_in)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_out)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit)
+#ifdef CONFIG_IPV6_SEG6_BPF
+BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_SEG6LOCAL, lwt_seg6local)
+#endif
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 39006c4a30d0..7ee36887858a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -140,6 +140,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_MSG,
 	BPF_PROG_TYPE_RAW_TRACEPOINT,
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 };
 
 enum bpf_attach_type {
diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h
index ef2d8c3e76c1..aadcc11fb918 100644
--- a/include/uapi/linux/seg6_local.h
+++ b/include/uapi/linux/seg6_local.h
@@ -25,6 +25,7 @@ enum {
 	SEG6_LOCAL_NH6,
 	SEG6_LOCAL_IIF,
 	SEG6_LOCAL_OIF,
+	SEG6_LOCAL_BPF,
 	__SEG6_LOCAL_MAX,
 };
 #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1)
@@ -59,6 +60,8 @@ enum {
 	SEG6_LOCAL_ACTION_END_AS	= 13,
 	/* forward to SR-unaware VNF with masquerading */
 	SEG6_LOCAL_ACTION_END_AM	= 14,
+	/* custom BPF action */
+	SEG6_LOCAL_ACTION_END_BPF	= 15,
 
 	__SEG6_LOCAL_ACTION_MAX,
 };
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d92d9c37affd..c6b5eadcad16 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1262,6 +1262,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_LWT_IN:
 	case BPF_PROG_TYPE_LWT_OUT:
+	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
 		/* dst_input() and dst_output() can't write for now */
 		if (t == BPF_WRITE)
 			return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index 4962e3a44fcc..ce10f203b5b9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4580,6 +4580,21 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+static const struct bpf_func_proto *
+lwt_seg6local_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_lwt_seg6_store_bytes:
+		return &bpf_lwt_seg6_store_bytes_proto;
+	case BPF_FUNC_lwt_seg6_action:
+		return &bpf_lwt_seg6_action_proto;
+	case BPF_FUNC_lwt_seg6_adjust_srh:
+		return &bpf_lwt_seg6_adjust_srh_proto;
+	default:
+		return lwt_out_func_proto(func_id, prog);
+	}
+}
+
 static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type,
 				    const struct bpf_prog *prog,
 				    struct bpf_insn_access_aux *info)
@@ -6180,6 +6195,16 @@ const struct bpf_prog_ops lwt_xmit_prog_ops = {
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops lwt_seg6local_verifier_ops = {
+	.get_func_proto		= lwt_seg6local_func_proto,
+	.is_valid_access	= lwt_is_valid_access,
+	.convert_ctx_access	= bpf_convert_ctx_access,
+};
+
+const struct bpf_prog_ops lwt_seg6local_prog_ops = {
+	.test_run		= bpf_prog_test_run_skb,
+};
+
 const struct bpf_verifier_ops cg_sock_verifier_ops = {
 	.get_func_proto		= sock_filter_func_proto,
 	.is_valid_access	= sock_filter_is_valid_access,
diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c
index ae68c1ef8fb0..2ac887da63e2 100644
--- a/net/ipv6/seg6_local.c
+++ b/net/ipv6/seg6_local.c
@@ -1,8 +1,9 @@
 /*
  *  SR-IPv6 implementation
  *
- *  Author:
+ *  Authors:
  *  David Lebrun <david.lebrun@uclouvain.be>
+ *  eBPF support: Mathieu Xhonneux <m.xhonneux@gmail.com>
  *
  *
  *  This program is free software; you can redistribute it and/or
@@ -32,6 +33,7 @@
 #endif
 #include <net/seg6_local.h>
 #include <linux/etherdevice.h>
+#include <linux/bpf.h>
 
 struct seg6_local_lwt;
 
@@ -42,6 +44,11 @@ struct seg6_action_desc {
 	int static_headroom;
 };
 
+struct bpf_lwt_prog {
+	struct bpf_prog *prog;
+	char *name;
+};
+
 struct seg6_local_lwt {
 	int action;
 	struct ipv6_sr_hdr *srh;
@@ -50,6 +57,7 @@ struct seg6_local_lwt {
 	struct in6_addr nh6;
 	int iif;
 	int oif;
+	struct bpf_lwt_prog bpf;
 
 	int headroom;
 	struct seg6_action_desc *desc;
@@ -451,6 +459,69 @@ static int input_action_end_b6_encap(struct sk_buff *skb,
 
 DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states);
 
+static int input_action_end_bpf(struct sk_buff *skb,
+				struct seg6_local_lwt *slwt)
+{
+	struct seg6_bpf_srh_state *srh_state =
+		this_cpu_ptr(&seg6_bpf_srh_states);
+	struct seg6_bpf_srh_state local_srh_state;
+	struct ipv6_sr_hdr *srh;
+	int srhoff = 0;
+	int ret;
+
+	srh = get_and_validate_srh(skb);
+	if (!srh)
+		goto drop;
+	advance_nextseg(srh, &ipv6_hdr(skb)->daddr);
+
+	/* preempt_disable is needed to protect the per-CPU buffer srh_state,
+	 * which is also accessed by the bpf_lwt_seg6_* helpers
+	 */
+	preempt_disable();
+	srh_state->hdrlen = srh->hdrlen << 3;
+	srh_state->valid = 1;
+
+	rcu_read_lock();
+	bpf_compute_data_pointers(skb);
+	ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb);
+	rcu_read_unlock();
+
+	local_srh_state = *srh_state;
+	preempt_enable();
+
+	switch (ret) {
+	case BPF_OK:
+	case BPF_REDIRECT:
+		break;
+	case BPF_DROP:
+		goto drop;
+	default:
+		pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret);
+		goto drop;
+	}
+
+	if (unlikely((local_srh_state.hdrlen & 7) != 0))
+		goto drop;
+
+	if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0)
+		goto drop;
+	srh = (struct ipv6_sr_hdr *)(skb->data + srhoff);
+	srh->hdrlen = (u8)(local_srh_state.hdrlen >> 3);
+
+	if (!local_srh_state.valid &&
+	    unlikely(!seg6_validate_srh(srh, (srh->hdrlen + 1) << 3)))
+		goto drop;
+
+	if (ret != BPF_REDIRECT)
+		seg6_lookup_nexthop(skb, NULL, 0);
+
+	return dst_input(skb);
+
+drop:
+	kfree_skb(skb);
+	return -EINVAL;
+}
+
 static struct seg6_action_desc seg6_action_table[] = {
 	{
 		.action		= SEG6_LOCAL_ACTION_END,
@@ -497,7 +568,13 @@ static struct seg6_action_desc seg6_action_table[] = {
 		.attrs		= (1 << SEG6_LOCAL_SRH),
 		.input		= input_action_end_b6_encap,
 		.static_headroom	= sizeof(struct ipv6hdr),
-	}
+	},
+	{
+		.action		= SEG6_LOCAL_ACTION_END_BPF,
+		.attrs		= (1 << SEG6_LOCAL_BPF),
+		.input		= input_action_end_bpf,
+	},
+
 };
 
 static struct seg6_action_desc *__get_action_desc(int action)
@@ -542,6 +619,7 @@ static const struct nla_policy seg6_local_policy[SEG6_LOCAL_MAX + 1] = {
 				    .len = sizeof(struct in6_addr) },
 	[SEG6_LOCAL_IIF]	= { .type = NLA_U32 },
 	[SEG6_LOCAL_OIF]	= { .type = NLA_U32 },
+	[SEG6_LOCAL_BPF]	= { .type = NLA_NESTED },
 };
 
 static int parse_nla_srh(struct nlattr **attrs, struct seg6_local_lwt *slwt)
@@ -719,6 +797,71 @@ static int cmp_nla_oif(struct seg6_local_lwt *a, struct seg6_local_lwt *b)
 	return 0;
 }
 
+#define MAX_PROG_NAME 256
+static const struct nla_policy bpf_prog_policy[LWT_BPF_PROG_MAX + 1] = {
+	[LWT_BPF_PROG_FD]   = { .type = NLA_U32, },
+	[LWT_BPF_PROG_NAME] = { .type = NLA_NUL_STRING,
+				.len = MAX_PROG_NAME },
+};
+
+static int parse_nla_bpf(struct nlattr **attrs, struct seg6_local_lwt *slwt)
+{
+	struct nlattr *tb[LWT_BPF_PROG_MAX + 1];
+	struct bpf_prog *p;
+	int ret;
+	u32 fd;
+
+	ret = nla_parse_nested(tb, LWT_BPF_PROG_MAX, attrs[SEG6_LOCAL_BPF],
+			       bpf_prog_policy, NULL);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[LWT_BPF_PROG_FD] || !tb[LWT_BPF_PROG_NAME])
+		return -EINVAL;
+
+	slwt->bpf.name = nla_memdup(tb[LWT_BPF_PROG_NAME], GFP_KERNEL);
+	if (!slwt->bpf.name)
+		return -ENOMEM;
+
+	fd = nla_get_u32(tb[LWT_BPF_PROG_FD]);
+	p = bpf_prog_get_type(fd, BPF_PROG_TYPE_LWT_SEG6LOCAL);
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	slwt->bpf.prog = p;
+
+	return 0;
+}
+
+static int put_nla_bpf(struct sk_buff *skb, struct seg6_local_lwt *slwt)
+{
+	struct nlattr *nest;
+
+	if (!slwt->bpf.prog)
+		return 0;
+
+	nest = nla_nest_start(skb, SEG6_LOCAL_BPF);
+	if (!nest)
+		return -EMSGSIZE;
+
+	if (slwt->bpf.name &&
+	    nla_put_string(skb, LWT_BPF_PROG_NAME, slwt->bpf.name))
+		return -EMSGSIZE;
+
+	return nla_nest_end(skb, nest);
+}
+
+static int cmp_nla_bpf(struct seg6_local_lwt *a, struct seg6_local_lwt *b)
+{
+	if (!a->bpf.name && !b->bpf.name)
+		return 0;
+
+	if (!a->bpf.name || !b->bpf.name)
+		return 1;
+
+	return strcmp(a->bpf.name, b->bpf.name);
+}
+
 struct seg6_action_param {
 	int (*parse)(struct nlattr **attrs, struct seg6_local_lwt *slwt);
 	int (*put)(struct sk_buff *skb, struct seg6_local_lwt *slwt);
@@ -749,6 +892,11 @@ static struct seg6_action_param seg6_action_params[SEG6_LOCAL_MAX + 1] = {
 	[SEG6_LOCAL_OIF]	= { .parse = parse_nla_oif,
 				    .put = put_nla_oif,
 				    .cmp = cmp_nla_oif },
+
+	[SEG6_LOCAL_BPF]	= { .parse = parse_nla_bpf,
+				    .put = put_nla_bpf,
+				    .cmp = cmp_nla_bpf },
+
 };
 
 static int parse_nla_action(struct nlattr **attrs, struct seg6_local_lwt *slwt)
@@ -797,7 +945,6 @@ static int seg6_local_build_state(struct nlattr *nla, unsigned int family,
 
 	err = nla_parse_nested(tb, SEG6_LOCAL_MAX, nla, seg6_local_policy,
 			       extack);
-
 	if (err < 0)
 		return err;
 
@@ -886,6 +1033,11 @@ static int seg6_local_get_encap_size(struct lwtunnel_state *lwt)
 	if (attrs & (1 << SEG6_LOCAL_OIF))
 		nlsize += nla_total_size(4);
 
+	if (attrs & (1 << SEG6_LOCAL_BPF))
+		nlsize += nla_total_size(sizeof(struct nlattr)) +
+		       nla_total_size(MAX_PROG_NAME) +
+		       nla_total_size(4);
+
 	return nlsize;
 }
 
-- 
2.16.1

^ permalink raw reply related

* [PATCH bpf-next v4 6/6] selftests/bpf: test for seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-05-09 21:16 UTC (permalink / raw)
  To: netdev; +Cc: dlebrun, alexei.starovoitov
In-Reply-To: <cover.1525898587.git.m.xhonneux@gmail.com>

Add a new test for the seg6local End.BPF action. The following helpers
are also tested :

- bpf_lwt_push_encap within the LWT BPF IN hook
- bpf_lwt_seg6_action
- bpf_lwt_seg6_adjust_srh
- bpf_lwt_seg6_store_bytes

A chain of End.BPF actions is built. The SRH is injected through a LWT
BPF IN hook before the chain. Each End.BPF action validates the previous
one, otherwise the packet is dropped.
The test succeeds if the last node in the chain receives the packet and
the UDP datagram contained can be retrieved from userspace.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
---
 tools/include/uapi/linux/bpf.h                    |  97 ++++-
 tools/testing/selftests/bpf/Makefile              |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h         |  12 +
 tools/testing/selftests/bpf/test_lwt_seg6local.c  | 438 ++++++++++++++++++++++
 tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++++++
 5 files changed, 689 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
 create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index fff51c187d1e..7ee36887858a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -117,6 +117,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_XSKMAP,
 };
 
 enum bpf_prog_type {
@@ -139,6 +140,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SK_MSG,
 	BPF_PROG_TYPE_RAW_TRACEPOINT,
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 };
 
 enum bpf_attach_type {
@@ -1827,6 +1829,89 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
+ * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len)
+ *	Description
+ *		Encapsulate the packet associated to *skb* within a Layer 3
+ *		protocol header. This header is provided in the buffer at
+ *		address *hdr*, with *len* its size in bytes. *type* indicates
+ *		the protocol of the header and can be one of:
+ *
+ *		**BPF_LWT_ENCAP_SEG6**
+ *			IPv6 encapsulation with Segment Routing Header
+ *			(**struct ipv6_sr_hdr**). *hdr* only contains the SRH,
+ *			the IPv6 header is computed by the kernel.
+ *		**BPF_LWT_ENCAP_SEG6_INLINE**
+ *			Only works if *skb* contains an IPv6 packet. Insert a
+ *			Segment Routing Header (**struct ipv6_sr_hdr**) inside
+ *			the IPv6 header.
+ *
+ * 		A call to this helper is susceptible to change the underlaying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len)
+ *	Description
+ *		Store *len* bytes from address *from* into the packet
+ *		associated to *skb*, at *offset*. Only the flags, tag and TLVs
+ *		inside the outermost IPv6 Segment Routing Header can be
+ *		modified through this helper.
+ *
+ * 		A call to this helper is susceptible to change the underlaying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta)
+ *	Description
+ *		Adjust the size allocated to TLVs in the outermost IPv6
+ *		Segment Routing Header contained in the packet associated to
+ *		*skb*, at position *offset* by *delta* bytes. Only offsets
+ *		after the segments are accepted. *delta* can be as well
+ *		positive (growing) as negative (shrinking).
+ *
+ * 		A call to this helper is susceptible to change the underlaying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len)
+ *	Description
+ *		Apply an IPv6 Segment Routing action of type *action* to the
+ *		packet associated to *skb*. Each action takes a parameter
+ *		contained at address *param*, and of length *param_len* bytes.
+ *		*action* can be one of:
+ *
+ *		**SEG6_LOCAL_ACTION_END_X**
+ *			End.X action: Endpoint with Layer-3 cross-connect.
+ *			Type of *param*: **struct in6_addr**.
+ *		**SEG6_LOCAL_ACTION_END_T**
+ *			End.T action: Endpoint with specific IPv6 table lookup.
+ *			Type of *param*: **int**.
+ *		**SEG6_LOCAL_ACTION_END_B6**
+ *			End.B6 action: Endpoint bound to an SRv6 policy.
+ *			Type of param: **struct ipv6_sr_hdr**.
+ *		**SEG6_LOCAL_ACTION_END_B6_ENCAP**
+ *			End.B6.Encap action: Endpoint bound to an SRv6
+ *			encapsulation policy.
+ *			Type of param: **struct ipv6_sr_hdr**.
+ *
+ * 		A call to this helper is susceptible to change the underlaying
+ * 		packet buffer. Therefore, at load time, all checks on pointers
+ * 		previously done by the verifier are invalidated and must be
+ * 		performed again, if the helper is used in combination with
+ * 		direct packet access.
+ *	Return
+ * 		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1897,7 +1982,11 @@ union bpf_attr {
 	FN(xdp_adjust_tail),		\
 	FN(skb_get_xfrm_state),		\
 	FN(get_stack),			\
-	FN(skb_load_bytes_relative),
+	FN(skb_load_bytes_relative),    \
+	FN(lwt_push_encap),		\
+	FN(lwt_seg6_store_bytes),	\
+	FN(lwt_seg6_adjust_srh),	\
+	FN(lwt_seg6_action),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -1964,6 +2053,12 @@ enum bpf_hdr_start_off {
 	BPF_HDR_START_NET,
 };
 
+/* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */
+enum bpf_lwt_encap_mode {
+	BPF_LWT_ENCAP_SEG6,
+	BPF_LWT_ENCAP_SEG6_INLINE
+};
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 9d762184b805..e7455e0e43b6 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -33,7 +33,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
 	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
-	test_get_stack_rawtp.o
+	test_get_stack_rawtp.o test_lwt_seg6local.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -42,7 +42,8 @@ TEST_PROGS := test_kmod.sh \
 	test_xdp_meta.sh \
 	test_offload.py \
 	test_sock_addr.sh \
-	test_tunnel.sh
+	test_tunnel.sh \
+	test_lwt_seg6local.sh
 
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 265f8e0e8ada..a3d556819559 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -103,6 +103,18 @@ static int (*bpf_skb_get_xfrm_state)(void *ctx, int index, void *state,
 	(void *) BPF_FUNC_skb_get_xfrm_state;
 static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
 	(void *) BPF_FUNC_get_stack;
+static int (*bpf_lwt_push_encap)(void *ctx, unsigned int type, void *hdr,
+				 unsigned int len) =
+	(void *) BPF_FUNC_lwt_push_encap;
+static int (*bpf_lwt_seg6_store_bytes)(void *ctx, unsigned int offset,
+				       void *from, unsigned int len) =
+	(void *) BPF_FUNC_lwt_seg6_store_bytes;
+static int (*bpf_lwt_seg6_action)(void *ctx, unsigned int action, void *param,
+				  unsigned int param_len) =
+	(void *) BPF_FUNC_lwt_seg6_action;
+static int (*bpf_lwt_seg6_adjust_srh)(void *ctx, unsigned int offset,
+				      unsigned int len) =
+	(void *) BPF_FUNC_lwt_seg6_adjust_srh;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_lwt_seg6local.c b/tools/testing/selftests/bpf/test_lwt_seg6local.c
new file mode 100644
index 000000000000..d752bc1fe81c
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_seg6local.c
@@ -0,0 +1,438 @@
+#include <stddef.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <linux/seg6_local.h>
+#include <linux/bpf.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define bpf_printk(fmt, ...)				\
+({							\
+	char ____fmt[] = fmt;				\
+	bpf_trace_printk(____fmt, sizeof(____fmt),	\
+			##__VA_ARGS__);			\
+})
+
+/* Packet parsing state machine helpers. */
+#define cursor_advance(_cursor, _len) \
+	({ void *_tmp = _cursor; _cursor += _len; _tmp; })
+
+#define SR6_FLAG_ALERT (1 << 4)
+
+#define htonll(x) ((bpf_htonl(1)) == 1 ? (x) : ((uint64_t)bpf_htonl((x) & \
+				0xFFFFFFFF) << 32) | bpf_htonl((x) >> 32))
+#define ntohll(x) ((bpf_ntohl(1)) == 1 ? (x) : ((uint64_t)bpf_ntohl((x) & \
+				0xFFFFFFFF) << 32) | bpf_ntohl((x) >> 32))
+#define BPF_PACKET_HEADER __attribute__((packed))
+
+struct ip6_t {
+	unsigned int ver:4;
+	unsigned int priority:8;
+	unsigned int flow_label:20;
+	unsigned short payload_len;
+	unsigned char next_header;
+	unsigned char hop_limit;
+	unsigned long long src_hi;
+	unsigned long long src_lo;
+	unsigned long long dst_hi;
+	unsigned long long dst_lo;
+} BPF_PACKET_HEADER;
+
+struct ip6_addr_t {
+	unsigned long long hi;
+	unsigned long long lo;
+} BPF_PACKET_HEADER;
+
+struct ip6_srh_t {
+	unsigned char nexthdr;
+	unsigned char hdrlen;
+	unsigned char type;
+	unsigned char segments_left;
+	unsigned char first_segment;
+	unsigned char flags;
+	unsigned short tag;
+
+	struct ip6_addr_t segments[0];
+} BPF_PACKET_HEADER;
+
+struct sr6_tlv_t {
+	unsigned char type;
+	unsigned char len;
+	unsigned char value[0];
+} BPF_PACKET_HEADER;
+
+__attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb)
+{
+	void *cursor, *data_end;
+	struct ip6_srh_t *srh;
+	struct ip6_t *ip;
+	uint8_t *ipver;
+
+	data_end = (void *)(long)skb->data_end;
+	cursor = (void *)(long)skb->data;
+	ipver = (uint8_t *)cursor;
+
+	if ((void *)ipver + sizeof(*ipver) > data_end)
+		return NULL;
+
+	if ((*ipver >> 4) != 6)
+		return NULL;
+
+	ip = cursor_advance(cursor, sizeof(*ip));
+	if ((void *)ip + sizeof(*ip) > data_end)
+		return NULL;
+
+	if (ip->next_header != 43)
+		return NULL;
+
+	srh = cursor_advance(cursor, sizeof(*srh));
+	if ((void *)srh + sizeof(*srh) > data_end)
+		return NULL;
+
+	if (srh->type != 4)
+		return NULL;
+
+	return srh;
+}
+
+__attribute__((always_inline))
+int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad,
+		   uint32_t old_pad, uint32_t pad_off)
+{
+	int err;
+
+	if (new_pad != old_pad) {
+		err = bpf_lwt_seg6_adjust_srh(skb, pad_off,
+					  (int) new_pad - (int) old_pad);
+		if (err)
+			return err;
+	}
+
+	if (new_pad > 0) {
+		char pad_tlv_buf[16] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+					0, 0, 0};
+		struct sr6_tlv_t *pad_tlv = (struct sr6_tlv_t *) pad_tlv_buf;
+
+		pad_tlv->type = SR6_TLV_PADDING;
+		pad_tlv->len = new_pad - 2;
+
+		err = bpf_lwt_seg6_store_bytes(skb, pad_off,
+					       (void *)pad_tlv_buf, new_pad);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+__attribute__((always_inline))
+int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh,
+			  uint32_t *tlv_off, uint32_t *pad_size,
+			  uint32_t *pad_off)
+{
+	uint32_t srh_off, cur_off;
+	int offset_valid = 0;
+	int err;
+
+	srh_off = (char *)srh - (char *)(long)skb->data;
+	// cur_off = end of segments, start of possible TLVs
+	cur_off = srh_off + sizeof(*srh) +
+		sizeof(struct ip6_addr_t) * (srh->first_segment + 1);
+
+	*pad_off = 0;
+
+	// we can only go as far as ~10 TLVs due to the BPF max stack size
+	#pragma clang loop unroll(full)
+	for (int i = 0; i < 10; i++) {
+		struct sr6_tlv_t tlv;
+
+		if (cur_off == *tlv_off)
+			offset_valid = 1;
+
+		if (cur_off >= srh_off + ((srh->hdrlen + 1) << 3))
+			break;
+
+		err = bpf_skb_load_bytes(skb, cur_off, &tlv, sizeof(tlv));
+		if (err)
+			return err;
+
+		if (tlv.type == SR6_TLV_PADDING) {
+			*pad_size = tlv.len + sizeof(tlv);
+			*pad_off = cur_off;
+
+			if (*tlv_off == srh_off) {
+				*tlv_off = cur_off;
+				offset_valid = 1;
+			}
+			break;
+
+		} else if (tlv.type == SR6_TLV_HMAC) {
+			break;
+		}
+
+		cur_off += sizeof(tlv) + tlv.len;
+	} // we reached the padding or HMAC TLVs, or the end of the SRH
+
+	if (*pad_off == 0)
+		*pad_off = cur_off;
+
+	if (*tlv_off == -1)
+		*tlv_off = cur_off;
+	else if (!offset_valid)
+		return -EINVAL;
+
+	return 0;
+}
+
+__attribute__((always_inline))
+int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off,
+	    struct sr6_tlv_t *itlv, uint8_t tlv_size)
+{
+	uint32_t srh_off = (char *)srh - (char *)(long)skb->data;
+	uint8_t len_remaining, new_pad;
+	uint32_t pad_off = 0;
+	uint32_t pad_size = 0;
+	uint32_t partial_srh_len;
+	int err;
+
+	if (tlv_off != -1)
+		tlv_off += srh_off;
+
+	if (itlv->type == SR6_TLV_PADDING || itlv->type == SR6_TLV_HMAC)
+		return -EINVAL;
+
+	err = is_valid_tlv_boundary(skb, srh, &tlv_off, &pad_size, &pad_off);
+	if (err)
+		return err;
+
+	err = bpf_lwt_seg6_adjust_srh(skb, tlv_off, sizeof(*itlv) + itlv->len);
+	if (err)
+		return err;
+
+	err = bpf_lwt_seg6_store_bytes(skb, tlv_off, (void *)itlv, tlv_size);
+	if (err)
+		return err;
+
+	// the following can't be moved inside update_tlv_pad because the
+	// bpf verifier has some issues with it
+	pad_off += sizeof(*itlv) + itlv->len;
+	partial_srh_len = pad_off - srh_off;
+	len_remaining = partial_srh_len % 8;
+	new_pad = 8 - len_remaining;
+
+	if (new_pad == 1) // cannot pad for 1 byte only
+		new_pad = 9;
+	else if (new_pad == 8)
+		new_pad = 0;
+
+	return update_tlv_pad(skb, new_pad, pad_size, pad_off);
+}
+
+__attribute__((always_inline))
+int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh,
+	       uint32_t tlv_off)
+{
+	uint32_t srh_off = (char *)srh - (char *)(long)skb->data;
+	uint8_t len_remaining, new_pad;
+	uint32_t partial_srh_len;
+	uint32_t pad_off = 0;
+	uint32_t pad_size = 0;
+	struct sr6_tlv_t tlv;
+	int err;
+
+	tlv_off += srh_off;
+
+	err = is_valid_tlv_boundary(skb, srh, &tlv_off, &pad_size, &pad_off);
+	if (err)
+		return err;
+
+	err = bpf_skb_load_bytes(skb, tlv_off, &tlv, sizeof(tlv));
+	if (err)
+		return err;
+
+	err = bpf_lwt_seg6_adjust_srh(skb, tlv_off, -(sizeof(tlv) + tlv.len));
+	if (err)
+		return err;
+
+	pad_off -= sizeof(tlv) + tlv.len;
+	partial_srh_len = pad_off - srh_off;
+	len_remaining = partial_srh_len % 8;
+	new_pad = 8 - len_remaining;
+	if (new_pad == 1) // cannot pad for 1 byte only
+		new_pad = 9;
+	else if (new_pad == 8)
+		new_pad = 0;
+
+	return update_tlv_pad(skb, new_pad, pad_size, pad_off);
+}
+
+__attribute__((always_inline))
+int has_egr_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh)
+{
+	int tlv_offset = sizeof(struct ip6_t) + sizeof(struct ip6_srh_t) +
+		((srh->first_segment + 1) << 4);
+	struct sr6_tlv_t tlv;
+
+	if (bpf_skb_load_bytes(skb, tlv_offset, &tlv, sizeof(struct sr6_tlv_t)))
+		return 0;
+
+	if (tlv.type == SR6_TLV_EGRESS && tlv.len == 18) {
+		struct ip6_addr_t egr_addr;
+
+		if (bpf_skb_load_bytes(skb, tlv_offset + 4, &egr_addr, 16))
+			return 0;
+
+		// check if egress TLV value is correct
+		if (ntohll(egr_addr.hi) == 0xfd00000000000000 &&
+				ntohll(egr_addr.lo) == 0x4)
+			return 1;
+	}
+
+	return 0;
+}
+
+// This function will push a SRH with segments fd00::1, fd00::2, fd00::3,
+// fd00::4
+SEC("encap_srh")
+int __encap_srh(struct __sk_buff *skb)
+{
+	bpf_printk("got pkt\n");
+	unsigned long long hi = 0xfd00000000000000;
+	struct ip6_addr_t *seg;
+	struct ip6_srh_t *srh;
+	char srh_buf[72]; // room for 4 segments
+	int err;
+
+	srh = (struct ip6_srh_t *)srh_buf;
+	srh->nexthdr = 0;
+	srh->hdrlen = 8;
+	srh->type = 4;
+	srh->segments_left = 3;
+	srh->first_segment = 3;
+	srh->flags = 0;
+	srh->tag = 0;
+
+	seg = (struct ip6_addr_t *)((char *)srh + sizeof(*srh));
+
+	#pragma clang loop unroll(full)
+	for (unsigned long long lo = 0; lo < 4; lo++) {
+		seg->lo = htonll(4 - lo);
+		seg->hi = htonll(hi);
+		seg = (struct ip6_addr_t *)((char *)seg + sizeof(*seg));
+	}
+
+	err = bpf_lwt_push_encap(skb, 0, (void *)srh, sizeof(srh_buf));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_REDIRECT;
+}
+
+// Add an Egress TLV fc00::4, add the flag A,
+// and apply End.X action to fc42::1
+SEC("add_egr_x")
+int __add_egr_x(struct __sk_buff *skb)
+{
+	unsigned long long hi = 0xfc42000000000000;
+	unsigned long long lo = 0x1;
+	struct ip6_srh_t *srh = get_srh(skb);
+	uint8_t new_flags = SR6_FLAG_ALERT;
+	struct ip6_addr_t addr;
+	int err, offset;
+
+	if (srh == NULL)
+		return BPF_DROP;
+
+	uint8_t tlv[20] = {2, 18, 0, 0, 0xfd, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
+			   0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x4};
+
+	err = add_tlv(skb, srh, (srh->hdrlen+1) << 3,
+		      (struct sr6_tlv_t *)&tlv, 20);
+	if (err)
+		return BPF_DROP;
+
+	offset = sizeof(struct ip6_t) + offsetof(struct ip6_srh_t, flags);
+	err = bpf_lwt_seg6_store_bytes(skb, offset,
+				       (void *)&new_flags, sizeof(new_flags));
+	if (err)
+		return BPF_DROP;
+
+	addr.lo = htonll(lo);
+	addr.hi = htonll(hi);
+	err = bpf_lwt_seg6_action(skb, SEG6_LOCAL_ACTION_END_X,
+				  (void *)&addr, sizeof(addr));
+	if (err)
+		return BPF_DROP;
+	return BPF_REDIRECT;
+}
+
+// Pop the Egress TLV, reset the flags, change the tag 2442 and finally do a
+// simple End action
+SEC("pop_egr")
+int __pop_egr(struct __sk_buff *skb)
+{
+	struct ip6_srh_t *srh = get_srh(skb);
+	uint16_t new_tag = bpf_htons(2442);
+	uint8_t new_flags = 0;
+	int err, offset;
+
+	if (srh == NULL)
+		return BPF_DROP;
+
+	if (srh->flags != SR6_FLAG_ALERT)
+		return BPF_DROP;
+
+	if (srh->hdrlen != 11) // 4 segments + Egress TLV + Padding TLV
+		return BPF_DROP;
+
+	if (!has_egr_tlv(skb, srh))
+		return BPF_DROP;
+
+	err = delete_tlv(skb, srh, 8 + (srh->first_segment + 1) * 16);
+	if (err)
+		return BPF_DROP;
+
+	offset = sizeof(struct ip6_t) + offsetof(struct ip6_srh_t, flags);
+	if (bpf_lwt_seg6_store_bytes(skb, offset, (void *)&new_flags,
+				     sizeof(new_flags)))
+		return BPF_DROP;
+
+	offset = sizeof(struct ip6_t) + offsetof(struct ip6_srh_t, tag);
+	if (bpf_lwt_seg6_store_bytes(skb, offset, (void *)&new_tag,
+				     sizeof(new_tag)))
+		return BPF_DROP;
+
+	return BPF_OK;
+}
+
+// Inspect if the Egress TLV and flag have been removed, if the tag is correct,
+// then apply a End.T action to reach the last segment
+SEC("inspect_t")
+int __inspect_t(struct __sk_buff *skb)
+{
+	struct ip6_srh_t *srh = get_srh(skb);
+	int table = 117;
+	int err;
+
+	if (srh == NULL)
+		return BPF_DROP;
+
+	if (srh->flags != 0)
+		return BPF_DROP;
+
+	if (srh->tag != bpf_htons(2442))
+		return BPF_DROP;
+
+	if (srh->hdrlen != 8) // 4 segments
+		return BPF_DROP;
+
+	err = bpf_lwt_seg6_action(skb, SEG6_LOCAL_ACTION_END_T,
+				  (void *)&table, sizeof(table));
+
+	if (err)
+		return BPF_DROP;
+
+	return BPF_REDIRECT;
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_lwt_seg6local.sh b/tools/testing/selftests/bpf/test_lwt_seg6local.sh
new file mode 100755
index 000000000000..1c77994b5e71
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lwt_seg6local.sh
@@ -0,0 +1,140 @@
+#!/bin/bash
+# Connects 6 network namespaces through veths.
+# Each NS may have different IPv6 global scope addresses :
+#   NS1 ---- NS2 ---- NS3 ---- NS4 ---- NS5 ---- NS6
+# fb00::1           fd00::1  fd00::2  fd00::3  fb00::6
+#                   fc42::1           fd00::4
+#
+# All IPv6 packets going to fb00::/16 through NS2 will be encapsulated in a
+# IPv6 header with a Segment Routing Header, with segments :
+# 	fd00::1 -> fd00::2 -> fd00::3 -> fd00::4
+#
+# 3 fd00::/16 IPv6 addresses are binded to seg6local End.BPF actions :
+# - fd00::1 : add a TLV, change the flags and apply a End.X action to fc42::1
+# - fd00::2 : remove the TLV, change the flags, add a tag
+# - fd00::3 : apply an End.T action to fd00::4, through routing table 117
+#
+# fd00::4 is a simple Segment Routing node decapsulating the inner IPv6 packet.
+# Each End.BPF action will validate the operations applied on the SRH by the
+# previous BPF program in the chain, otherwise the packet is dropped.
+#
+# An UDP datagram is sent from fb00::1 to fb00::6. The test succeeds if this
+# datagram can be read on NS6 when binding to fb00::6.
+
+TMP_FILE="/tmp/selftest_lwt_seg6local.txt"
+
+cleanup()
+{
+	if [ "$?" = "0" ]; then
+		echo "selftests: test_lwt_seg6local [PASS]";
+	else
+		echo "selftests: test_lwt_seg6local [FAILED]";
+	fi
+
+	set +e
+	ip netns del ns1 2> /dev/null
+	ip netns del ns2 2> /dev/null
+	ip netns del ns3 2> /dev/null
+	ip netns del ns4 2> /dev/null
+	ip netns del ns5 2> /dev/null
+	ip netns del ns6 2> /dev/null
+	rm -f $TMP_FILE
+}
+
+set -e
+
+ip netns add ns1
+ip netns add ns2
+ip netns add ns3
+ip netns add ns4
+ip netns add ns5
+ip netns add ns6
+
+trap cleanup 0 2 3 6 9
+
+ip link add veth1 type veth peer name veth2
+ip link add veth3 type veth peer name veth4
+ip link add veth5 type veth peer name veth6
+ip link add veth7 type veth peer name veth8
+ip link add veth9 type veth peer name veth10
+
+ip link set veth1 netns ns1
+ip link set veth2 netns ns2
+ip link set veth3 netns ns2
+ip link set veth4 netns ns3
+ip link set veth5 netns ns3
+ip link set veth6 netns ns4
+ip link set veth7 netns ns4
+ip link set veth8 netns ns5
+ip link set veth9 netns ns5
+ip link set veth10 netns ns6
+
+ip netns exec ns1 ip link set dev veth1 up
+ip netns exec ns2 ip link set dev veth2 up
+ip netns exec ns2 ip link set dev veth3 up
+ip netns exec ns3 ip link set dev veth4 up
+ip netns exec ns3 ip link set dev veth5 up
+ip netns exec ns4 ip link set dev veth6 up
+ip netns exec ns4 ip link set dev veth7 up
+ip netns exec ns5 ip link set dev veth8 up
+ip netns exec ns5 ip link set dev veth9 up
+ip netns exec ns6 ip link set dev veth10 up
+ip netns exec ns6 ip link set dev lo up
+
+# All link scope addresses and routes required between veths
+ip netns exec ns1 ip -6 addr add fb00::12/16 dev veth1 scope link
+ip netns exec ns1 ip -6 route add fb00::21 dev veth1 scope link
+ip netns exec ns2 ip -6 addr add fb00::21/16 dev veth2 scope link
+ip netns exec ns2 ip -6 addr add fb00::34/16 dev veth3 scope link
+ip netns exec ns2 ip -6 route add fb00::43 dev veth3 scope link
+ip netns exec ns3 ip -6 route add fb00::65 dev veth5 scope link
+ip netns exec ns3 ip -6 addr add fb00::43/16 dev veth4 scope link
+ip netns exec ns3 ip -6 addr add fb00::56/16 dev veth5 scope link
+ip netns exec ns4 ip -6 addr add fb00::65/16 dev veth6 scope link
+ip netns exec ns4 ip -6 addr add fb00::78/16 dev veth7 scope link
+ip netns exec ns4 ip -6 route add fb00::87 dev veth7 scope link
+ip netns exec ns5 ip -6 addr add fb00::87/16 dev veth8 scope link
+ip netns exec ns5 ip -6 addr add fb00::910/16 dev veth9 scope link
+ip netns exec ns5 ip -6 route add fb00::109 dev veth9 scope link
+ip netns exec ns5 ip -6 route add fb00::109 table 117 dev veth9 scope link
+ip netns exec ns6 ip -6 addr add fb00::109/16 dev veth10 scope link
+
+ip netns exec ns1 ip -6 addr add fb00::1/16 dev lo
+ip netns exec ns1 ip -6 route add fb00::6 dev veth1 via fb00::21
+
+ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj test_lwt_seg6local.o sec encap_srh dev veth2
+ip netns exec ns2 ip -6 route add fd00::1 dev veth3 via fb00::43 scope link
+
+ip netns exec ns3 ip -6 route add fc42::1 dev veth5 via fb00::65
+ip netns exec ns3 ip -6 route add fd00::1 encap seg6local action End.BPF obj test_lwt_seg6local.o sec add_egr_x dev veth4
+
+ip netns exec ns4 ip -6 route add fd00::2 encap seg6local action End.BPF obj test_lwt_seg6local.o sec pop_egr dev veth6
+ip netns exec ns4 ip -6 addr add fc42::1 dev lo
+ip netns exec ns4 ip -6 route add fd00::3 dev veth7 via fb00::87
+
+ip netns exec ns5 ip -6 route add fd00::4 table 117 dev veth9 via fb00::109
+ip netns exec ns5 ip -6 route add fd00::3 encap seg6local action End.BPF obj test_lwt_seg6local.o sec inspect_t dev veth8
+
+ip netns exec ns6 ip -6 addr add fb00::6/16 dev lo
+ip netns exec ns6 ip -6 addr add fd00::4/16 dev lo
+
+ip netns exec ns1 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns2 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns3 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns4 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+ip netns exec ns5 sysctl net.ipv6.conf.all.forwarding=1 > /dev/null
+
+ip netns exec ns6 sysctl net.ipv6.conf.all.seg6_enabled=1 > /dev/null
+ip netns exec ns6 sysctl net.ipv6.conf.lo.seg6_enabled=1 > /dev/null
+ip netns exec ns6 sysctl net.ipv6.conf.veth10.seg6_enabled=1 > /dev/null
+
+ip netns exec ns6 nc -l -6 -u -d 7330 > $TMP_FILE &
+ip netns exec ns1 bash -c "echo 'foobar' | nc -w0 -6 -u -p 2121 -s fb00::1 fb00::6 7330"
+sleep 5 # wait enough time to ensure the UDP datagram arrived to the last segment
+kill -INT $!
+
+if [[ $(< $TMP_FILE) != "foobar" ]]; then
+	exit 1
+fi
+
+exit 0
-- 
2.16.1

^ permalink raw reply related

* Re: [PATCH bpf-next v4 0/6] ipv6: sr: introduce seg6local End.BPF action
From: Mathieu Xhonneux @ 2018-05-09 19:20 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: David Lebrun, netdev
In-Reply-To: <cover.1525898587.git.m.xhonneux@gmail.com>

@Alexei: I created the CONFIG_IPV6_SEG6_BPF symbol, fearing that using
ipv6_bpf_stub would be detrimental for the bpf_lwt_seg6_action helper,
since all the calls to seg6_* are in the critical path.

2018-05-09 22:16 GMT+01:00 Mathieu Xhonneux <m.xhonneux@gmail.com>:
> As of Linux 4.14, it is possible to define advanced local processing for
> IPv6 packets with a Segment Routing Header through the seg6local LWT
> infrastructure. This LWT implements the network programming principles
> defined in the IETF “SRv6 Network Programming” draft.
>
> The implemented operations are generic, and it would be very interesting to
> be able to implement user-specific seg6local actions, without having to
> modify the kernel directly. To do so, this patchset adds an End.BPF action
> to seg6local, powered by some specific Segment Routing-related helpers,
> which provide SR functionalities that can be applied on the packet. This
> BPF hook would then allow to implement specific actions at native kernel
> speed such as OAM features, advanced SR SDN policies, SRv6 actions like
> Segment Routing Header (SRH) encapsulation depending on the content of
> the packet, etc.
>
> This patchset is divided in 6 patches, whose main features are :
>
> - A new seg6local action End.BPF with the corresponding new BPF program
>   type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such attached BPF program can be
>   passed to the LWT seg6local through netlink, the same way as the LWT
>   BPF hook operates.
> - 3 new BPF helpers for the seg6local BPF hook, allowing to edit/grow/
>   shrink a SRH and apply on a packet some of the generic SRv6 actions.
> - 1 new BPF helper for the LWT BPF IN hook, allowing to add a SRH through
>   encapsulation (via IPv6 encapsulation or inlining if the packet contains
>   already an IPv6 header).
>
> As this patchset adds a new LWT BPF hook, I took into account the result of
> the discussions when the LWT BPF infrastructure got merged. Hence, the
> seg6local BPF hook doesn’t allow write access to skb->data directly, only
> the SRH can be modified through specific helpers, which ensures that the
> integrity of the packet is maintained.
> More details are available in the related patches messages.
>
> The performances of this BPF hook have been assessed with the BPF JIT
> enabled on a Intel Xeon X3440 processors with 4 cores and 8 threads
> clocked at 2.53 GHz. No throughput losses are noted with the seg6local
> BPF hook when the BPF program does nothing (440kpps). Adding a 8-bytes
> TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
> drops the throughput to 410kpps, and inlining a SRH via
> bpf_lwt_seg6_action drops the throughput to 420kpps.
> All throughputs are stable.
>
> -------
> v2: move the SRH integrity state from skb->cb to a per-cpu buffer
> v3: - document helpers in man-page style
>     - fix kbuild bugs
>     - un-break BPF LWT out hook
>     - bpf_push_seg6_encap is now static
>     - preempt_enable is now called when the packet is dropped in
>       input_action_end_bpf
> v4: fix kbuild bugs when CONFIG_IPV6=m
>
> Thanks.
>
>
> Mathieu Xhonneux (6):
>   ipv6: sr: make seg6.h includable without IPv6
>   ipv6: sr: export function lookup_nexthop
>   bpf: Add IPv6 Segment Routing helpers
>   bpf: Split lwt inout verifier structures
>   ipv6: sr: Add seg6local action End.BPF
>   selftests/bpf: test for seg6local End.BPF action
>
>  include/linux/bpf_types.h                         |   7 +-
>  include/net/seg6.h                                |   7 +-
>  include/net/seg6_local.h                          |  32 ++
>  include/uapi/linux/bpf.h                          |  96 ++++-
>  include/uapi/linux/seg6_local.h                   |   3 +
>  kernel/bpf/verifier.c                             |   1 +
>  net/core/filter.c                                 | 390 ++++++++++++++++---
>  net/ipv6/Kconfig                                  |   5 +
>  net/ipv6/seg6_local.c                             | 180 ++++++++-
>  tools/include/uapi/linux/bpf.h                    |  97 ++++-
>  tools/testing/selftests/bpf/Makefile              |   5 +-
>  tools/testing/selftests/bpf/bpf_helpers.h         |  12 +
>  tools/testing/selftests/bpf/test_lwt_seg6local.c  | 438 ++++++++++++++++++++++
>  tools/testing/selftests/bpf/test_lwt_seg6local.sh | 140 +++++++
>  14 files changed, 1340 insertions(+), 73 deletions(-)
>  create mode 100644 include/net/seg6_local.h
>  create mode 100644 tools/testing/selftests/bpf/test_lwt_seg6local.c
>  create mode 100755 tools/testing/selftests/bpf/test_lwt_seg6local.sh
>
> --
> 2.16.1
>

^ permalink raw reply

* Re: KASAN: use-after-free Read in __dev_queue_xmit
From: Willem de Bruijn @ 2018-05-09 19:21 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Eric Dumazet, syzbot, alexander.deucher, Andrey Konovalov,
	Anoob Soman, chris, David Miller, Reshetova, Elena,
	Greg Kroah-Hartman, Kees Cook, LKML, Mike Maloney, mchehab,
	netdev, Rosen, Rami, Sowmini Varadhan, syzkaller-bugs,
	Willem de Bruijn
In-Reply-To: <CAF=yD-KGO=+=j3yTr4H8cF_WH3RGtnyMDgYRzd7da-r7D=U=GQ@mail.gmail.com>

On Wed, May 9, 2018 at 12:38 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>>> But a crash with the same signature is still occurring, so it should eventually
>>> get reported again.  C reproducer is here, it works on Linus' tree (commit
>>> 036db8bd963): https://syzkaller.appspot.com/text?tag=ReproC&x=105b1ae7800000
>>
>> This appears to be a separate issue.
>>
>> This reproducer requires a setsockopt SOL_SOCKET/SO_TIMESTAMPING
>> to trigger the use-after-free. And the freed path also points at a timestamping
>> skb:
>>
>> [   31.963619] Freed by task 2672:
>> [   31.964006]  __kasan_slab_free+0x125/0x170
>> [   31.964509]  kfree+0x8b/0x1a0
>> [   31.964875]  skb_free_head+0x6f/0xa0
>> [   31.965314]  skb_release_data+0x420/0x5a0
>> [   31.965802]  skb_release_all+0x46/0x60
>> [   31.966260]  kfree_skb+0x91/0x1c0
>> [   31.966669]  __skb_complete_tx_timestamp+0x2e9/0x3d0
>> [   31.967273]  __skb_tstamp_tx+0x3b3/0x620
>> [   31.967774]  __dev_queue_xmit+0xed5/0x1a20
>> [   31.968300]  packet_sendmsg+0x36fd/0x5400
>> [   31.968821]  sock_sendmsg+0xc0/0x100
>> [   31.969284]  ___sys_sendmsg+0x367/0x880
>> [   31.969777]  __sys_sendmmsg+0x178/0x410
>> [   31.970267]  __x64_sys_sendmmsg+0x99/0x100
>> [   31.970789]  do_syscall_64+0x9a/0x2c0
>> [   31.971260]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> This is a rare path taken when the timestamp skb cannot be queued
> onto the socket (likely because of insufficient rcvbuf).
>
> Somehow, freeing the timestamp skb triggers this use-after-free in
> the original skb from which the timestamp was cloned. As if there
> is a bug in the shared info dataref.

Indeed. The skb shared info struct is zeroed by dev_validate_header
as a result of dev->hard_header_len exceeding skb->end - skb->data.

Not exactly sure yet how this can happen. The hard header length space
is accounted for during allocation as reserved memory. But,
packet_alloc_skb does call skb_reserve(), moving skb->data
effectively beyond this reserved region.

It may be incorrect to pass skb->data to dev_validate_header, as that
does not point to the start of the ll_header anymore. Still figuring out what
the right fix is..

^ permalink raw reply

* Re: [PATCH net-next 3/3] net/mlx4_core: Use msi_x module param to limit num of MSI-X irqs
From: Ajaykumar Hotchandani @ 2018-05-09 19:29 UTC (permalink / raw)
  To: Tariq Toukan, David S. Miller; +Cc: netdev, Eran Ben Elisha
In-Reply-To: <1525879744-1858-4-git-send-email-tariqt@mellanox.com>

Thanks Tariq.

Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

On 05/09/2018 08:29 AM, Tariq Toukan wrote:
> Extend the boolean interpretation of msi_x module parameter
> to numerical, as follows:
>
> 0   - Don't use MSI-X.
> 1   - Use MSI-X, driver decides the num of MSI-X irqs.
>> =2 - Use MSI-X, limit number of MSI-X irqs to msi_x.
>        In SRIOV, this limits the number of MSI-X irqs per VF.
>
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> Cc: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/main.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index b6aaf34d6648..80a75c80a463 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -73,7 +73,7 @@
>   
>   static int msi_x = 1;
>   module_param(msi_x, int, 0444);
> -MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero");
> +MODULE_PARM_DESC(msi_x, "0 - don't use MSI-X, 1 - use MSI-X, >1 - limit number of MSI-X irqs to msi_x");
>   
>   #else /* CONFIG_PCI_MSI */
>   
> @@ -2815,6 +2815,9 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
>   				dev->caps.num_eqs - dev->caps.reserved_eqs,
>   				MAX_MSIX);
>   
> +		if (msi_x > 1)
> +			nreq = min_t(int, nreq, msi_x);
> +
>   		entries = kcalloc(nreq, sizeof(*entries), GFP_KERNEL);
>   		if (!entries)
>   			goto no_msi;
> @@ -4182,6 +4185,11 @@ static int mlx4_resume(struct pci_dev *pdev)
>   
>   static int __init mlx4_verify_params(void)
>   {
> +	if (msi_x < 0) {
> +		pr_warn("mlx4_core: bad msi_x: %d\n", msi_x);
> +		return -1;
> +	}
> +
>   	if ((log_num_mac < 0) || (log_num_mac > 7)) {
>   		pr_warn("mlx4_core: bad num_mac: %d\n", log_num_mac);
>   		return -1;

^ permalink raw reply

* Re: [PATCH net-next v10 3/4] netdev: octeon-ethernet: Add Cavium Octeon III support.
From: Steven J. Hill @ 2018-05-09 19:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20180508.222824.409067086455315118.davem@davemloft.net>

On 05/08/2018 09:28 PM, David Miller wrote:
> 
> That's all I have the stomache for at the moment.
> 
> This thing is really large, making it nearly impossible to review
> as one huge patch #3. Perhaps you can find a way to split it up
> logically somehow?
> 
Hey David.

This code was inherited by me, so there a lot of parts I personally
have not looked at. The third patch indeed needs to be broken up.
Let me devote some hours to cleaning, splitting and simplifying. My
apologies for the time you spent looking at it. The first patch,
however, only makes changes in the existing 'staging' directory. It
seems like that one could go in without trouble? Cheers.

Steve

^ permalink raw reply

* Re: [PATCH net-next v3 0/2] socket statistics for ss
From: Stephen Hemminger @ 2018-05-09 19:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, gerrit, kuznet, yoshfuji, netdev, dccp, Stephen Hemminger
In-Reply-To: <ba77dfc0-56fe-9c1c-f68d-e8c06ffa1cb4@gmail.com>

On Wed, 9 May 2018 11:53:50 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 05/09/2018 11:44 AM, Stephen Hemminger wrote:
> > On Wed, 9 May 2018 10:53:58 -0700
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >   
> >> On 05/09/2018 10:31 AM, Stephen Hemminger wrote:  
> >>> On Wed, 9 May 2018 10:18:23 -0700
> >>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>     
> >>>> On 05/09/2018 08:22 AM, Stephen Hemminger wrote:
> >>>>    
> >>>>> I am not sure if these patches are worth applying.
> >>>>> The 'ss -s' command has had missing values since 2.4 kernel.
> >>>>> And the first complaints came in only this year.
> >>>>>
> >>>>> Another alternative would be just to remove these fields from ss -s
> >>>>> output and move on.
> >>>>>       
> >>>>
> >>>> Anyway your patches are not netns ready, so lets remove these fields from ss.
> >>>>
> >>>> Or you have to spend _much_ more time on writing and testing the kernel part.
> >>>>
> >>>> Thanks.    
> >>>
> >>> The patches only expose the existing TCP socket accounting infrastructure.
> >>> Several other pieces that sockstat has are not netns aware.
> >>> That is a completely different problem.    
> >>
> >>
> >> Adding a new field counting 'bounds ports' without being netns ready is a total mistake,
> >> as it is useless by current standards.
> >>
> >> The first thing that users will do is add proper netns support, with extra complexity in the kernel.
> >>
> >> So, instead of pushing some incomplete feature, trying to fool ourselves with a sentiment of 'small cost'
> >> that will later need another 100 lines of code in the kernel, please give us the complete picture.
> >>
> >> I am just saying, you can of course ignore my feedback.  
> > 
> > The current TCP hashinfo should be moved into netns. The current method of scanning and matching
> > by net namespace is a scalability issue now.  
> 
> It is not the plan yet, and we have no scalability issue.
> 
> Before switching to netns hash table, this would need rhashtable conversions
> but so far this has not been done.
> 
> - Time to create/delete netns is critical.
> - Adding few Mbytes of overhead per netns is a nogo,
> 
> Please do not change subject, this is adding noise to this particular thread.

Back to original subject. My current intention is to just pull all these statistics
from ss command. They are always zero now, and very few people noticed and  no one
really needs them.

^ permalink raw reply

* pull-request: mac80211 2018-05-09
From: Johannes Berg @ 2018-05-09 19:36 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless

Hi Dave,

We just have a few fixes this time around.

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 64e86fec54069266ba32be551d7b7f75e88ab60c:

  net: qualcomm: rmnet: Fix warning seen with fill_info (2018-04-18 21:23:06 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2018-05-09

for you to fetch changes up to 914eac248d876f9c00cd1792ffec3d182c863f13:

  mac80211: use timeout from the AddBA response instead of the request (2018-05-07 20:35:15 +0200)

----------------------------------------------------------------
We only have a few fixes this time:
 * WMM element validation
 * SAE timeout
 * add-BA timeout
 * docbook parsing
 * a few memory leaks in error paths

----------------------------------------------------------------
Ilan Peer (2):
      mac80211: Fix condition validating WMM IE
      mac80211: Adjust SAE authentication timeout

Johan Hovold (1):
      rfkill: gpio: fix memory leak in probe error path

Johannes Berg (1):
      cfg80211: limit wiphy names to 128 bytes

Randy Dunlap (1):
      mac80211: fix kernel-doc "bad line" warning

Sara Sharon (1):
      mac80211: use timeout from the AddBA response instead of the request

Srinivas Dasari (1):
      nl80211: Free connkeys on external authentication failure

YueHaibing (1):
      mac80211_hwsim: fix a possible memory leak in hwsim_new_radio_nl()

weiyongjun (A) (1):
      cfg80211: fix possible memory leak in regdb_query_country()

 drivers/net/wireless/mac80211_hwsim.c |  1 +
 include/net/mac80211.h                |  2 +-
 include/uapi/linux/nl80211.h          |  2 ++
 net/mac80211/agg-tx.c                 |  4 ++++
 net/mac80211/mlme.c                   | 27 +++++++++++++++++++--------
 net/mac80211/tx.c                     |  3 ++-
 net/rfkill/rfkill-gpio.c              |  7 ++++++-
 net/wireless/core.c                   |  3 +++
 net/wireless/nl80211.c                |  1 +
 net/wireless/reg.c                    |  1 +
 10 files changed, 40 insertions(+), 11 deletions(-)

^ permalink raw reply

* Re: KASAN: use-after-free Read in __dev_queue_xmit
From: Eric Dumazet @ 2018-05-09 19:36 UTC (permalink / raw)
  To: Willem de Bruijn, Eric Biggers
  Cc: Eric Dumazet, syzbot, alexander.deucher, Andrey Konovalov,
	Anoob Soman, chris, David Miller, Reshetova, Elena,
	Greg Kroah-Hartman, Kees Cook, LKML, Mike Maloney, mchehab,
	netdev, Rosen, Rami, Sowmini Varadhan, syzkaller-bugs,
	Willem de Bruijn
In-Reply-To: <CAF=yD-JPmU-q1iUoxtmyQa4BM=fU3CMR3BUDGNMg-4v8=2_deA@mail.gmail.com>



On 05/09/2018 12:21 PM, Willem de Bruijn wrote:

> Indeed. The skb shared info struct is zeroed by dev_validate_header
> as a result of dev->hard_header_len exceeding skb->end - skb->data.
> 
> Not exactly sure yet how this can happen. The hard header length space
> is accounted for during allocation as reserved memory. But,
> packet_alloc_skb does call skb_reserve(), moving skb->data
> effectively beyond this reserved region.
> 
> It may be incorrect to pass skb->data to dev_validate_header, as that
> does not point to the start of the ll_header anymore. Still figuring out what
> the right fix is..
> 

I believe the bug happens if the sock_wmalloc() call at line 1921 has to sleep.

device can change (or at lest dev->hard_header_len can change)

So we need to bailout if reserved/hhlen had changed.

Or revert some patches, since dev_hold() and dev_put() are no longer high cost,
since it is now using per cpu counter.

 

^ permalink raw reply

* pull-request: mac80211-next 2018-05-09
From: Johannes Berg @ 2018-05-09 19:40 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-wireless

Hi Dave,

Nothing major here either. I'm still waiting for the HE/802.11ax
stuff, but even when we do get that it'll just add rates and not
be very exciting at this point :-)

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 415787d7799f4fccbe8d49cb0b8e5811be6b0389:

  ipv6: frags: fix a lockdep false positive (2018-04-18 23:19:39 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git tags/mac80211-next-for-davem-2018-05-09

for you to fetch changes up to 57c6cb81717f957fb741f2e4c79bd0e2f4f55910:

  mac80211: ethtool: avoid 32 bit multiplication overflow (2018-05-08 15:02:03 +0200)

----------------------------------------------------------------
While we're still waiting for HE/802.11ax code, we just
have a few things:
 * some new statistics (ACK RSSI, TXQ)
 * TXQ configuration
 * infrastructure to handle CSA in firmware
 * and some various cleanups/improvements

----------------------------------------------------------------
Amar Singhal (1):
      cfg80211: Call reg_notifier for self managed hints conditionally

Balaji Pothunoori (2):
      cfg80211: average ack rssi support for data frames
      mac80211: average ack rssi support for data frames

Bjoern Johansson (1):
      mac80211_hwsim: indicate support for powersave.

Colin Ian King (1):
      mac80211: ethtool: avoid 32 bit multiplication overflow

Gregory Greenman (1):
      mac80211: add api to set CSA counter in mac80211

Haim Dreyfuss (1):
      nl80211: Add wmm rule attribute to NL80211_CMD_GET_WIPHY dump command

Johannes Berg (4):
      mac80211: rename rtap_vendor_space to rtap_space
      mac80211: clean up rate info bandwidth setting
      mac80211: ethtool: memset the whole sinfo struct to 0
      mac80211: remove pointless flags=0 assignment

Toke Høiland-Jørgensen (3):
      regulatory: Rename confusing 'country IE' in log output
      cfg80211: Expose TXQ stats and parameters to userspace
      mac80211: Support the new cfg80211 TXQ stats API

 drivers/net/wireless/mac80211_hwsim.c |   1 +
 include/net/cfg80211.h                |  53 +++++++
 include/net/mac80211.h                |  13 ++
 include/uapi/linux/nl80211.h          |  93 ++++++++++++
 net/mac80211/cfg.c                    |  99 +++++++++++++
 net/mac80211/ethtool.c                |  37 +++--
 net/mac80211/ieee80211_i.h            |   3 +
 net/mac80211/main.c                   |   3 +
 net/mac80211/rx.c                     |  40 +++---
 net/mac80211/sta_info.c               |  24 +++-
 net/mac80211/sta_info.h               |   2 +
 net/mac80211/status.c                 |   2 +
 net/mac80211/tx.c                     |  45 ++++++
 net/mac80211/util.c                   |   6 +-
 net/wireless/nl80211.c                | 262 ++++++++++++++++++++++++++++++----
 net/wireless/rdev-ops.h               |  12 ++
 net/wireless/reg.c                    |  37 ++++-
 net/wireless/trace.h                  |  14 ++
 net/wireless/wext-compat.c            |  23 +--
 19 files changed, 683 insertions(+), 86 deletions(-)

^ permalink raw reply

* Re: [PATCH v6 12/13] Documentation: remove stale firmware API reference
From: Luis R. Rodriguez @ 2018-05-09 19:43 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Luis R. Rodriguez, gregkh, akpm, keescook, josh, maco, andy.gross,
	david.brown, bjorn.andersson, teg, wagi, hdegoede, andresx7,
	zohar, kubakici, shuah, mfuzzey, dhowells, pali.rohar, tiwai,
	kvalo, arend.vanspriel, zajec5, nbroeking, markivx, broonie,
	dmitry.torokhov, dwmw2, torvalds, Abhay_Salunke, jewalt, oneukum,
	cantabile.desu, ast, hare, jejb
In-Reply-To: <20180509121209.1e332f63@vento.lan>

On Wed, May 09, 2018 at 12:12:09PM -0300, Mauro Carvalho Chehab wrote:
> Em Tue,  8 May 2018 11:12:46 -0700
> "Luis R. Rodriguez" <mcgrof@kernel.org> escreveu:
> 
> > It refers to a pending patch, but this was merged eons ago.
> 
> Didn't know that such patch was already merged. Great!
> 
> Regards,
> Mauro
> 
> > 
> > Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
> > ---
> >  Documentation/dell_rbu.txt | 3 ---
> >  1 file changed, 3 deletions(-)
> > 
> > diff --git a/Documentation/dell_rbu.txt b/Documentation/dell_rbu.txt
> > index 0fdb6aa2704c..077fdc29a0d0 100644
> > --- a/Documentation/dell_rbu.txt
> > +++ b/Documentation/dell_rbu.txt
> > @@ -121,9 +121,6 @@ read back the image downloaded.
> >  
> >  .. note::
> >  
> > -   This driver requires a patch for firmware_class.c which has the modified
> > -   request_firmware_nowait function.
> > -
> 
> >     Also after updating the BIOS image a user mode application needs to execute
> >     code which sends the BIOS update request to the BIOS. So on the next reboot
> >     the BIOS knows about the new image downloaded and it updates itself.
> 
> You should likely remove the "Also" here.

Good catch. Will adjust.

> 
> With that,
> 
> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>

Great, thanks.

  Luis

^ permalink raw reply

* Re: [net-next PATCH 1/3] net: Refactor XPS for CPUs and Rx queues
From: Tom Herbert @ 2018-05-09 20:31 UTC (permalink / raw)
  To: Amritha Nambiar
  Cc: Linux Kernel Network Developers, David S. Miller, Alexander Duyck,
	Sridhar Samudrala, Eric Dumazet, Hannes Frederic Sowa
In-Reply-To: <152418626993.5832.6216077462845621375.stgit@anamdev.jf.intel.com>

On Thu, Apr 19, 2018 at 6:04 PM, Amritha Nambiar
<amritha.nambiar@intel.com> wrote:
> Refactor XPS code to support Tx queue selection based on
> CPU map or Rx queue map.
>
> Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
> ---
>  include/linux/netdevice.h |   82 +++++++++++++++++-
>  net/core/dev.c            |  206 +++++++++++++++++++++++++++++----------------
>  net/core/net-sysfs.c      |    4 -
>  3 files changed, 216 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 14e0777..40a9171 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -730,10 +730,21 @@ struct xps_map {
>   */
>  struct xps_dev_maps {
>         struct rcu_head rcu;
> -       struct xps_map __rcu *cpu_map[0];
> +       struct xps_map __rcu *attr_map[0];

This seems unnecessarily complicated to me. Why not just add another
map called something like "rxq2txq_map". Then when selecting TXQ just
check the new map first and then the normal cpu_map if there's not a
hit.

>  };
> -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +         \
> +
> +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) +     \
>         (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
> +
> +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
> +       (_rxqs * (_tcs) * sizeof(struct xps_map *)))
> +
> +enum xps_map_type {
> +       XPS_MAP_RXQS,
> +       XPS_MAP_CPUS,
> +       __XPS_MAP_MAX
> +};
> +
>  #endif /* CONFIG_XPS */
>
>  #define TC_MAX_QUEUE   16
> @@ -1867,7 +1878,7 @@ struct net_device {
>         int                     watchdog_timeo;
>
>  #ifdef CONFIG_XPS
> -       struct xps_dev_maps __rcu *xps_maps;
> +       struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX];
>  #endif
>  #ifdef CONFIG_NET_CLS_ACT
>         struct mini_Qdisc __rcu *miniq_egress;
> @@ -3204,6 +3215,71 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
>  #ifdef CONFIG_XPS
>  int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>                         u16 index);
> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> +                         u16 index, enum xps_map_type type);
> +
> +static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
> +                                 unsigned int nr_bits)
> +{
> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> +       WARN_ON_ONCE(j >= nr_bits);
> +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */

This #ifdef block appears 3 times in the patch. Seems like it should
be replace by simple macro.

> +       return test_bit(j, mask);
> +}
> +
> +static inline bool attr_test_online(unsigned long j,
> +                                   const unsigned long *online_mask,
> +                                   unsigned int nr_bits)
> +{
> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> +       WARN_ON_ONCE(j >= nr_bits);
> +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
> +
> +       if (online_mask)
> +               return test_bit(j, online_mask);
> +
> +       if (j >= 0 && j < nr_bits)
> +               return true;
> +
> +       return false;
> +}
> +
> +static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
> +                                        unsigned int nr_bits)
> +{
> +       /* -1 is a legal arg here. */
> +       if (n != -1) {
> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> +               WARN_ON_ONCE(n >= nr_bits);
> +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
> +       }
> +
> +       if (srcp)
> +               return find_next_bit(srcp, nr_bits, n + 1);
> +
> +       return n + 1;
> +}
> +
> +static inline int attrmask_next_and(int n, const unsigned long *src1p,
> +                                   const unsigned long *src2p,
> +                                   unsigned int nr_bits)
> +{
> +       /* -1 is a legal arg here. */
> +       if (n != -1) {
> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
> +               WARN_ON_ONCE(n >= nr_bits);
> +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */
> +       }
> +
> +       if (src1p && src2p)
> +               return find_next_and_bit(src1p, src2p, nr_bits, n + 1);
> +       else if (src1p)
> +               return find_next_bit(src1p, nr_bits, n + 1);
> +       else if (src2p)
> +               return find_next_bit(src2p, nr_bits, n + 1);
> +
> +       return n + 1;
> +}
>  #else
>  static inline int netif_set_xps_queue(struct net_device *dev,
>                                       const struct cpumask *mask,
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a490ef6..17c4883 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2092,7 +2092,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
>         int pos;
>
>         if (dev_maps)
> -               map = xmap_dereference(dev_maps->cpu_map[tci]);
> +               map = xmap_dereference(dev_maps->attr_map[tci]);
>         if (!map)
>                 return false;
>
> @@ -2105,7 +2105,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
>                         break;
>                 }
>
> -               RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
> +               RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL);
>                 kfree_rcu(map, rcu);
>                 return false;
>         }
> @@ -2138,30 +2138,47 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
>  static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
>                                    u16 count)
>  {
> +       const unsigned long *possible_mask = NULL;
> +       enum xps_map_type type = XPS_MAP_RXQS;
>         struct xps_dev_maps *dev_maps;
> -       int cpu, i;
>         bool active = false;
> +       unsigned int nr_ids;
> +       int i, j;
>
>         mutex_lock(&xps_map_mutex);
> -       dev_maps = xmap_dereference(dev->xps_maps);
>
> -       if (!dev_maps)
> -               goto out_no_maps;
> +       while (type < __XPS_MAP_MAX) {
> +               dev_maps = xmap_dereference(dev->xps_maps[type]);
> +               if (!dev_maps)
> +                       goto out_no_maps;
> +
> +               if (type == XPS_MAP_CPUS) {
> +                       if (num_possible_cpus() > 1)
> +                               possible_mask = cpumask_bits(cpu_possible_mask);
> +                       nr_ids = nr_cpu_ids;
> +               } else if (type == XPS_MAP_RXQS) {
> +                       nr_ids = dev->num_rx_queues;
> +               }
>
> -       for_each_possible_cpu(cpu)
> -               active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
> -                                              offset, count);
> +               for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +                    j < nr_ids;)
> +                       active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
> +                                                      count);
> +               if (!active) {
> +                       RCU_INIT_POINTER(dev->xps_maps[type], NULL);
> +                       kfree_rcu(dev_maps, rcu);
> +               }
>
> -       if (!active) {
> -               RCU_INIT_POINTER(dev->xps_maps, NULL);
> -               kfree_rcu(dev_maps, rcu);
> +               if (type == XPS_MAP_CPUS) {
> +                       for (i = offset + (count - 1); count--; i--)
> +                               netdev_queue_numa_node_write(
> +                                       netdev_get_tx_queue(dev, i),
> +                                                           NUMA_NO_NODE);
> +               }
> +out_no_maps:
> +               type++;
>         }
>
> -       for (i = offset + (count - 1); count--; i--)
> -               netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
> -                                            NUMA_NO_NODE);
> -
> -out_no_maps:
>         mutex_unlock(&xps_map_mutex);
>  }
>
> @@ -2170,11 +2187,11 @@ static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
>         netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
>  }
>
> -static struct xps_map *expand_xps_map(struct xps_map *map,
> -                                     int cpu, u16 index)
> +static struct xps_map *expand_xps_map(struct xps_map *map, int attr_index,
> +                                     u16 index, enum xps_map_type type)
>  {
> -       struct xps_map *new_map;
>         int alloc_len = XPS_MIN_MAP_ALLOC;
> +       struct xps_map *new_map = NULL;
>         int i, pos;
>
>         for (pos = 0; map && pos < map->len; pos++) {
> @@ -2183,7 +2200,7 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>                 return map;
>         }
>
> -       /* Need to add queue to this CPU's existing map */
> +       /* Need to add tx-queue to this CPU's/rx-queue's existing map */
>         if (map) {
>                 if (pos < map->alloc_len)
>                         return map;
> @@ -2191,9 +2208,14 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>                 alloc_len = map->alloc_len * 2;
>         }
>
> -       /* Need to allocate new map to store queue on this CPU's map */
> -       new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
> -                              cpu_to_node(cpu));
> +       /* Need to allocate new map to store tx-queue on this CPU's/rx-queue's
> +        *  map
> +        */
> +       if (type == XPS_MAP_RXQS)
> +               new_map = kzalloc(XPS_MAP_SIZE(alloc_len), GFP_KERNEL);
> +       else if (type == XPS_MAP_CPUS)
> +               new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
> +                                      cpu_to_node(attr_index));
>         if (!new_map)
>                 return NULL;
>
> @@ -2205,14 +2227,16 @@ static struct xps_map *expand_xps_map(struct xps_map *map,
>         return new_map;
>  }
>
> -int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> -                       u16 index)
> +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
> +                         u16 index, enum xps_map_type type)
>  {
> +       const unsigned long *online_mask = NULL, *possible_mask = NULL;
>         struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
> -       int i, cpu, tci, numa_node_id = -2;
> +       int i, j, tci, numa_node_id = -2;
>         int maps_sz, num_tc = 1, tc = 0;
>         struct xps_map *map, *new_map;
>         bool active = false;
> +       unsigned int nr_ids;
>
>         if (dev->num_tc) {
>                 num_tc = dev->num_tc;
> @@ -2221,16 +2245,33 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>                         return -EINVAL;
>         }
>
> -       maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
> +       switch (type) {
> +       case XPS_MAP_RXQS:
> +               maps_sz = XPS_RXQ_DEV_MAPS_SIZE(num_tc, dev->num_rx_queues);
> +               dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_RXQS]);
> +               nr_ids = dev->num_rx_queues;
> +               break;
> +       case XPS_MAP_CPUS:
> +               maps_sz = XPS_CPU_DEV_MAPS_SIZE(num_tc);
> +               if (num_possible_cpus() > 1) {
> +                       online_mask = cpumask_bits(cpu_online_mask);
> +                       possible_mask = cpumask_bits(cpu_possible_mask);
> +               }
> +               dev_maps = xmap_dereference(dev->xps_maps[XPS_MAP_CPUS]);
> +               nr_ids = nr_cpu_ids;
> +               break;
> +       default:
> +               return -EINVAL;
> +       }
> +
>         if (maps_sz < L1_CACHE_BYTES)
>                 maps_sz = L1_CACHE_BYTES;
>
>         mutex_lock(&xps_map_mutex);
>
> -       dev_maps = xmap_dereference(dev->xps_maps);
> -
>         /* allocate memory for queue storage */
> -       for_each_cpu_and(cpu, cpu_online_mask, mask) {
> +       for (j = -1; j = attrmask_next_and(j, online_mask, mask, nr_ids),
> +            j < nr_ids;) {
>                 if (!new_dev_maps)
>                         new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
>                 if (!new_dev_maps) {
> @@ -2238,73 +2279,81 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>                         return -ENOMEM;
>                 }
>
> -               tci = cpu * num_tc + tc;
> -               map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
> +               tci = j * num_tc + tc;
> +               map = dev_maps ? xmap_dereference(dev_maps->attr_map[tci]) :
>                                  NULL;
>
> -               map = expand_xps_map(map, cpu, index);
> +               map = expand_xps_map(map, j, index, type);
>                 if (!map)
>                         goto error;
>
> -               RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +               RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>         }
>
>         if (!new_dev_maps)
>                 goto out_no_new_maps;
>
> -       for_each_possible_cpu(cpu) {
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
>                 /* copy maps belonging to foreign traffic classes */
> -               for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
> +               for (i = tc, tci = j * num_tc; dev_maps && i--; tci++) {
>                         /* fill in the new device map from the old device map */
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>                 }
>
>                 /* We need to explicitly update tci as prevous loop
>                  * could break out early if dev_maps is NULL.
>                  */
> -               tci = cpu * num_tc + tc;
> +               tci = j * num_tc + tc;
>
> -               if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
> -                       /* add queue to CPU maps */
> +               if (attr_test_mask(j, mask, nr_ids) &&
> +                   attr_test_online(j, online_mask, nr_ids)) {
> +                       /* add tx-queue to CPU/rx-queue maps */
>                         int pos = 0;
>
> -                       map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> +                       map = xmap_dereference(new_dev_maps->attr_map[tci]);
>                         while ((pos < map->len) && (map->queues[pos] != index))
>                                 pos++;
>
>                         if (pos == map->len)
>                                 map->queues[map->len++] = index;
>  #ifdef CONFIG_NUMA
> -                       if (numa_node_id == -2)
> -                               numa_node_id = cpu_to_node(cpu);
> -                       else if (numa_node_id != cpu_to_node(cpu))
> -                               numa_node_id = -1;
> +                       if (type == XPS_MAP_CPUS) {
> +                               if (numa_node_id == -2)
> +                                       numa_node_id = cpu_to_node(j);
> +                               else if (numa_node_id != cpu_to_node(j))
> +                                       numa_node_id = -1;
> +                       }
>  #endif
>                 } else if (dev_maps) {
>                         /* fill in the new device map from the old device map */
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>                 }
>
>                 /* copy maps belonging to foreign traffic classes */
>                 for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
>                         /* fill in the new device map from the old device map */
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> -                       RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
> +                       RCU_INIT_POINTER(new_dev_maps->attr_map[tci], map);
>                 }
>         }
>
> -       rcu_assign_pointer(dev->xps_maps, new_dev_maps);
> +       if (type == XPS_MAP_RXQS)
> +               rcu_assign_pointer(dev->xps_maps[XPS_MAP_RXQS], new_dev_maps);
> +       else if (type == XPS_MAP_CPUS)
> +               rcu_assign_pointer(dev->xps_maps[XPS_MAP_CPUS], new_dev_maps);
>
>         /* Cleanup old maps */
>         if (!dev_maps)
>                 goto out_no_old_maps;
>
> -       for_each_possible_cpu(cpu) {
> -               for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
> -                       new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> -                       map = xmap_dereference(dev_maps->cpu_map[tci]);
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
> +               for (i = num_tc, tci = j * num_tc; i--; tci++) {
> +                       new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
> +                       map = xmap_dereference(dev_maps->attr_map[tci]);
>                         if (map && map != new_map)
>                                 kfree_rcu(map, rcu);
>                 }
> @@ -2317,19 +2366,23 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>         active = true;
>
>  out_no_new_maps:
> -       /* update Tx queue numa node */
> -       netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
> -                                    (numa_node_id >= 0) ? numa_node_id :
> -                                    NUMA_NO_NODE);
> +       if (type == XPS_MAP_CPUS) {
> +               /* update Tx queue numa node */
> +               netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
> +                                            (numa_node_id >= 0) ?
> +                                            numa_node_id : NUMA_NO_NODE);
> +       }
>
>         if (!dev_maps)
>                 goto out_no_maps;
>
> -       /* removes queue from unused CPUs */
> -       for_each_possible_cpu(cpu) {
> -               for (i = tc, tci = cpu * num_tc; i--; tci++)
> +       /* removes tx-queue from unused CPUs/rx-queues */
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
> +               for (i = tc, tci = j * num_tc; i--; tci++)
>                         active |= remove_xps_queue(dev_maps, tci, index);
> -               if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
> +               if (!attr_test_mask(j, mask, nr_ids) ||
> +                   !attr_test_online(j, online_mask, nr_ids))
>                         active |= remove_xps_queue(dev_maps, tci, index);
>                 for (i = num_tc - tc, tci++; --i; tci++)
>                         active |= remove_xps_queue(dev_maps, tci, index);
> @@ -2337,7 +2390,10 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>
>         /* free map if not active */
>         if (!active) {
> -               RCU_INIT_POINTER(dev->xps_maps, NULL);
> +               if (type == XPS_MAP_RXQS)
> +                       RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_RXQS], NULL);
> +               else if (type == XPS_MAP_CPUS)
> +                       RCU_INIT_POINTER(dev->xps_maps[XPS_MAP_CPUS], NULL);
>                 kfree_rcu(dev_maps, rcu);
>         }
>
> @@ -2347,11 +2403,12 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>         return 0;
>  error:
>         /* remove any maps that we added */
> -       for_each_possible_cpu(cpu) {
> -               for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
> -                       new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
> +       for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
> +            j < nr_ids;) {
> +               for (i = num_tc, tci = j * num_tc; i--; tci++) {
> +                       new_map = xmap_dereference(new_dev_maps->attr_map[tci]);
>                         map = dev_maps ?
> -                             xmap_dereference(dev_maps->cpu_map[tci]) :
> +                             xmap_dereference(dev_maps->attr_map[tci]) :
>                               NULL;
>                         if (new_map && new_map != map)
>                                 kfree(new_map);
> @@ -2363,6 +2420,13 @@ int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
>         kfree(new_dev_maps);
>         return -ENOMEM;
>  }
> +
> +int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
> +                       u16 index)
> +{
> +       return __netif_set_xps_queue(dev, cpumask_bits(mask), index,
> +                                    XPS_MAP_CPUS);
> +}
>  EXPORT_SYMBOL(netif_set_xps_queue);
>
>  #endif
> @@ -3400,7 +3464,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>         int queue_index = -1;
>
>         rcu_read_lock();
> -       dev_maps = rcu_dereference(dev->xps_maps);
> +       dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
>         if (dev_maps) {
>                 unsigned int tci = skb->sender_cpu - 1;
>
> @@ -3409,7 +3473,7 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
>                         tci += netdev_get_prio_tc_map(dev, skb->priority);
>                 }
>
> -               map = rcu_dereference(dev_maps->cpu_map[tci]);
> +               map = rcu_dereference(dev_maps->attr_map[tci]);
>                 if (map) {
>                         if (map->len == 1)
>                                 queue_index = map->queues[0];
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index c476f07..d7abd33 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1227,13 +1227,13 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
>         }
>
>         rcu_read_lock();
> -       dev_maps = rcu_dereference(dev->xps_maps);
> +       dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]);
>         if (dev_maps) {
>                 for_each_possible_cpu(cpu) {
>                         int i, tci = cpu * num_tc + tc;
>                         struct xps_map *map;
>
> -                       map = rcu_dereference(dev_maps->cpu_map[tci]);
> +                       map = rcu_dereference(dev_maps->attr_map[tci]);
>                         if (!map)
>                                 continue;
>
>

^ permalink raw reply

* Re: RTL8723BE performance regression
From: João Paulo Rechi Vita @ 2018-05-09 20:33 UTC (permalink / raw)
  To: Pkshih
  Cc: linux-kernel@vger.kernel.org, Larry.Finger@lwfinger.net,
	jprvita@endlessm.com, Birming Chiu, drake@endlessm.com,
	Chaoming_Li, kvalo@codeaurora.org, 莊彥宣,
	derosier@gmail.com, Steven Ting, netdev@vger.kernel.org,
	linux@endlessm.com, Shaofu, linux-wireless@vger.kernel.org
In-Reply-To: <1525768634.2885.11.camel@realtek.com>

On Tue, May 8, 2018 at 1:37 AM, Pkshih <pkshih@realtek.com> wrote:
> On Mon, 2018-05-07 at 14:49 -0700, João Paulo Rechi Vita wrote:
>> On Tue, May 1, 2018 at 10:58 PM, Pkshih <pkshih@realtek.com> wrote:
>> > On Wed, 2018-05-02 at 05:44 +0000, Pkshih wrote:
>> >>
>> >> > -----Original Message-----
>> >> > From: João Paulo Rechi Vita [mailto:jprvita@gmail.com]
>> >> > Sent: Wednesday, May 02, 2018 6:41 AM
>> >> > To: Larry Finger
>> >> > Cc: Steve deRosier; 莊彥宣; Pkshih; Birming Chiu; Shaofu; Steven Ting; Chaoming_Li; Kalle Valo;
>> >> > linux-wireless; Network Development; LKML; Daniel Drake; João Paulo Rechi Vita; linux@endless
>> m.c
>> >> om
>> >> > Subject: Re: RTL8723BE performance regression
>> >> >
>> >> > On Tue, Apr 3, 2018 at 7:51 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
>> >> > > On 04/03/2018 09:37 PM, João Paulo Rechi Vita wrote:
>> >> > >>
>> >> > >> On Tue, Apr 3, 2018 at 7:28 PM, Larry Finger <Larry.Finger@lwfinger.net>
>> >> > >> wrote:
>> >> > >>
>> >> > >> (...)
>> >> > >>
>> >> > >>> As the antenna selection code changes affected your first bisection, do
>> >> > >>> you
>> >> > >>> have one of those HP laptops with only one antenna and the incorrect
>> >> > >>> coding
>> >> > >>> in the FUSE?
>> >> > >>
>> >> > >>
>> >> > >> Yes, that is why I've been passing ant_sel=1 during my tests -- this
>> >> > >> was needed to achieve a good performance in the past, before this
>> >> > >> regression. I've also opened the laptop chassis and confirmed the
>> >> > >> antenna cable is plugged to the connector labeled with "1" on the
>> >> > >> card.
>> >> > >>
>> >> > >>> If so, please make sure that you still have the same signal
>> >> > >>> strength for good and bad cases. I have tried to keep the driver and the
>> >> > >>> btcoex code in sync, but there may be some combinations of antenna
>> >> > >>> configuration and FUSE contents that cause the code to fail.
>> >> > >>>
>> >> > >>
>> >> > >> What is the recommended way to monitor the signal strength?
>> >> > >
>> >> > >
>> >> > > The btcoex code is developed for multiple platforms by a different group
>> >> > > than the Linux driver. I think they made a change that caused ant_sel to
>> >> > > switch from 1 to 2. At least numerous comments at
>> >> > > github.com/lwfinger/rtlwifi_new claimed they needed to make that change.
>> >> > >
>> >> > > Mhy recommended method is to verify the wifi device name with "iw dev". Then
>> >> > > using that device
>> >> > >
>> >> > > sudo iw dev <dev_name> scan | egrep "SSID|signal"
>> >> > >
>> >> >
>> >> > I have confirmed that the performance regression is indeed tied to
>> >> > signal strength: on the good cases signal was between -16 and -8 dBm,
>> >> > whereas in bad cases signal was always between -50 to - 40 dBm. I've
>> >> > also switched to testing bandwidth in controlled LAN environment using
>> >> > iperf3, as suggested by Steve deRosier, with the DUT being the only
>> >> > machine connected to the 2.4 GHz radio and the machine running the
>> >> > iperf3 server connected via ethernet.
>> >> >
>> >>
>> >> We have new experimental results in commit af8a41cccf8f46 ("rtlwifi: cleanup
>> >> 8723be ant_sel definition"). You can use the above commit and do the same
>> >> experiments (with ant_sel=0, 1 and 2) in your side, and then share your results.
>> >> Since performance is tied to signal strength, you can only share signal strength.
>> >>
>> >
>> > Please pay attention to cold reboot once ant_sel is changed.
>> >
>>
>> I've tested the commit mentioned above and it fixes the problem on top
>> of v4.16 (in addition to the latest wireless-drivers-next also been
>> fixed as it already contains such commit). On v4.15, we also need the
>> following commits before "af8a41cccf8f rtlwifi: cleanup 8723be ant_sel
>> definition" to have a good performance again:
>>
>>   874e837d67d0 rtlwifi: fill FW version and subversion
>>   a44709bba70f rtlwifi: btcoex: Add power_on_setting routine
>>   40d9dd4f1c5d rtlwifi: btcoex: Remove global variables from btcoex
>
> v4.15 isn't longterm version and had been EOL.
>

Right, but this is a performace regression in comparison to v4.11, so
if "af8a41cccf8f rtlwifi: cleanup 8723be ant_sel definition" is marked
for stable, shouldn't these other patches be brought as well? All
releases since v4.11 are probably affected, but honestly I don't have
a strong understanding of how the stable trees operate in situations
like this.

>>
>> Surprisingly, it seems forcing ant_sel=1 is not needed anymore on
>> these machines, as the shown by the numbers bellow (ant_sel=0 means
>> that actually no parameter was passed to the module). I have powered
>> off the machine and done a cold boot for every test. It seems
>> something have changed in the antenna auto-selection code since v4.11,
>> the latest point where I could confirm we definitely need to force
>> ant_sel=1. I've been trying to understand what causes this difference,
>> but haven't made progress on that so far, so any suggestions are
>> appreciated (we are trying to decide if we can confidently drop the
>> downstream DMI quirks for these specific machines).
>>
> I think your rtl8723be module programed correct efuse content, so it
> works properly with ant_sel=0, and quirk isn't required for your
> machine.
>
>>   w-d-n ant_sel=0: -14.00 dBm,  69.5 Mbps -> good
>>   w-d-n ant_sel=1: -10.00 dBm,  41.1 Mbps -> good
>>   w-d-n ant_sel=2: -44.00 dBm,   607 kbps -> bad
>>
>>   v4.16 ant_sel=0: -12.00 dBm,  63.0 Mbps -> good
>>   v4.16 ant_sel=1: - 8.00 dBm,  69.0 Mbps -> good
>>   v4.16 ant_sel=2: -50.00 dBm,   224 kbps -> bad
>>
>>   v4.15 ant_sel=0: - 8.00 dBm,  33.0 Mbps -> good
>>   v4.15 ant_sel=1: -10.00 dBm,  38.1 Mbps -> good
>>   v4.15 ant_sel=2: -48.00 dBm,   206 kbps -> bad
>>
>
> With your results, the efuse content is programmed as one or two antenna
> on AUX path.
>

With v4.11 I had good performance results on this very same machine
(thus same efuse contents) only when passing ant_sel=1, so there has
to be some change on the code that parses the efuse contents and
decides which antenna will be used.

--
João Paulo Rechi Vita
http://about.me/jprvita

^ permalink raw reply

* Re: [PATCH v3 next-next] drivers: net: davinci_mdio: prevent spurious timeout
From: Grygorii Strashko @ 2018-05-09 20:36 UTC (permalink / raw)
  To: Sekhar Nori, David S . Miller; +Cc: linux-omap, netdev, Andrew Lunn
In-Reply-To: <20180509154515.5968-1-nsekhar@ti.com>



On 05/09/2018 10:45 AM, Sekhar Nori wrote:
> A well timed kernel preemption in the time_after() loop
> in wait_for_idle() can result in a spurious timeout
> error to be returned.
> 
> Fix it by using readl_poll_timeout() which takes care of
> this issue.
> 
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Sekhar Nori <nsekhar@ti.com>
> ---
> v3: simplify return path based on comment from Andrew
> 
> The issue has not been personally observed by me, but has
> been reported by users. Sending for next-next given the
> non-critical nature. There is seems to be no easy way to
> reproduce this.
> 

Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>



-- 
regards,
-grygorii

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox