Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next 2/7] lib: reciprocal_div: implement the improved algorithm on the paper mentioned
From: Jakub Kicinski @ 2018-06-26 20:52 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
	Jiong Wang
In-Reply-To: <CAPhsuW4P5J1ZgUVKGfwWs7vX_7thc5cr1x6mrY8Wx3d2dqEDTw@mail.gmail.com>

On Mon, 25 Jun 2018 23:21:10 -0700, Song Liu wrote:
> > +struct reciprocal_value_adv reciprocal_value_adv(u32 d, u8 prec)
> > +{
> > +       struct reciprocal_value_adv R;
> > +       u32 l, post_shift;
> > +       u64 mhigh, mlow;
> > +
> > +       l = fls(d - 1);
> > +       post_shift = l;
> > +       /* NOTE: mlow/mhigh could overflow u64 when l == 32 which means d has
> > +        * MSB set. This case needs to be handled before calling
> > +        * "reciprocal_value_adv", please see the comment at
> > +        * include/linux/reciprocal_div.h.
> > +        */  
> 
> Shall we handle l == 32 case better? I guess the concern here is extra
> handling may slow down the fast path? If that's the case, we should
> at least add a WARNING on the slow path.

Agreed, I think Jiong is travelling, hence no response.  We'll respin.

^ permalink raw reply

* Re: [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage
From: Tom Herbert @ 2018-06-26 20:48 UTC (permalink / raw)
  To: Edward Cree
  Cc: linux-net-drivers, Linux Kernel Network Developers,
	David S. Miller
In-Reply-To: <fa3d7e58-e7b6-ad0c-619f-824c25ed0d97@solarflare.com>

On Tue, Jun 26, 2018 at 11:15 AM, Edward Cree <ecree@solarflare.com> wrote:
>
> This patch series adds the capability for the network stack to receive a
>  list of packets and process them as a unit, rather than handling each
>  packet singly in sequence.  This is done by factoring out the existing
>  datapath code at each layer and wrapping it in list handling code.
>
> The motivation for this change is twofold:
> * Instruction cache locality.  Currently, running the entire network
>   stack receive path on a packet involves more code than will fit in the
>   lowest-level icache, meaning that when the next packet is handled, the
>   code has to be reloaded from more distant caches.  By handling packets
>   in "row-major order", we ensure that the code at each layer is hot for
>   most of the list.  (There is a corresponding downside in _data_ cache
>   locality, since we are now touching every packet at every layer, but in
>   practice there is easily enough room in dcache to hold one cacheline of
>   each of the 64 packets in a NAPI poll.)
> * Reduction of indirect calls.  Owing to Spectre mitigations, indirect
>   function calls are now more expensive than ever; they are also heavily
>   used in the network stack's architecture (see [1]).  By replacing 64
>   indirect calls to the next-layer per-packet function with a single
>   indirect call to the next-layer list function, we can save CPU cycles.
>
> Drivers pass an SKB list to the stack at the end of the NAPI poll; this
>  gives a natural batch size (the NAPI poll weight) and avoids waiting at
>  the software level for further packets to make a larger batch (which
>  would add latency).  It also means that the batch size is automatically
>  tuned by the existing interrupt moderation mechanism.
> The stack then runs each layer of processing over all the packets in the
>  list before proceeding to the next layer.  Where the 'next layer' (or
>  the context in which it must run) differs among the packets, the stack
>  splits the list; this 'late demux' means that packets which differ only
>  in later headers (e.g. same L2/L3 but different L4) can traverse the
>  early part of the stack together.
> Also, where the next layer is not (yet) list-aware, the stack can revert
>  to calling the rest of the stack in a loop; this allows gradual/creeping
>  listification, with no 'flag day' patch needed to listify everything.
>
> Patches 1-2 simply place received packets on a list during the event
>  processing loop on the sfc EF10 architecture, then call the normal stack
>  for each packet singly at the end of the NAPI poll.  (Analogues of patch
>  #2 for other NIC drivers should be fairly straightforward.)
> Patches 3-9 extend the list processing as far as the IP receive handler.
> Patches 10-12 apply the list techniques to Generic XDP, since the bpf_func
>  there is an indirect call.  In patch #12 we JIT a list_func that performs
>  list unwrapping and makes direct calls to the bpf_func.
>
> Patches 1-2 alone give about a 10% improvement in packet rate in the
>  baseline test; adding patches 3-9 raises this to around 25%.  Patches 10-
>  12, intended to improve Generic XDP performance, have in fact slightly
>  worsened it; I am unsure why this is and have included them in this RFC
>  in the hopes that someone will spot the reason.  If no progress is made I
>  will drop them from the series.
>
> Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
>  packets and a single core to handle interrupts on the RX side; this was
>  in order to measure as simply as possible the packet rate handled by a
>  single core.  Figures are in Mbit/s; divide by 8 to obtain Mpps.  The
>  setup was tuned for maximum reproducibility, rather than raw performance.
>  Full details and more results (both with and without retpolines) are
>  presented in [2].
>
> The baseline test uses four streams, and multiple RXQs all bound to a
>  single CPU (the netperf binary is bound to a neighbouring CPU).  These
>  tests were run with retpolines.
> net-next: 6.60 Mb/s (datum)
>  after 9: 8.35 Mb/s (datum + 26.6%)
> after 12: 8.29 Mb/s (datum + 25.6%)
> Note however that these results are not robust; changes in the parameters
>  of the test often shrink the gain to single-digit percentages.  For
>  instance, when using only a single RXQ, only a 4% gain was seen.  The
>  results also seem to change significantly each time the patch series is
>  rebased onto a new net-next; for instance the results in [3] with
>  retpolines (slide 9) show only 11.6% gain in the same test as above (the
>  post-patch performance is the same but the pre-patch datum is 7.5Mb/s).
>
Very nice! I really like the deliberate progression of functionality
in the patches makes follwing them very readable. I do think that XDP
related patches at the end of the set should be separated out.

I suspects the effects will vary a lot between architectures and
configuration, so I'm not too worried about the variance mentioned in
the performance numbers. For future work, it might also be worth it to
compare to techniques done in VPP.

Tom

>
> I also performed tests with Generic XDP enabled (using a simple map-based
>  UDP port drop program with no entries in the map), both with and without
>  the eBPF JIT enabled.
> No JIT:
> net-next: 3.52 Mb/s (datum)
>  after 9: 4.91 Mb/s (datum + 39.5%)
> after 12: 4.83 Mb/s (datum + 37.3%)
>
> With JIT:
> net-next: 5.23 Mb/s (datum)
>  after 9: 6.64 Mb/s (datum + 27.0%)
> after 12: 6.46 Mb/s (datum + 23.6%)
>
> Another test variation was the use of software filtering/firewall rules.
>  Adding a single iptables rule (a UDP port drop on a port range not
>  matching the test traffic), thus making the netfilter hook have work to
>  do, reduced baseline performance but showed a similar delta from the
>  patches.  Similarly, testing with a set of TC flower filters (kindly
>  supplied by Cong Wang) in the single-RXQ setup (that previously gave 4%)
>  slowed down the baseline but not the patched performance, giving a 5.7%
>  performance delta.  These data suggest that the batching approach
>  remains effective in the presence of software switching rules.
>
> Changes from v1 (see [3]):
> * Rebased across 2 years' net-next movement (surprisingly straightforward).
>   - Added Generic XDP handling to netif_receive_skb_list_internal()
>   - Dealt with changes to PFMEMALLOC setting APIs
> * General cleanup of code and comments.
> * Skipped function calls for empty lists at various points in the stack
>   (patch #9).
> * Added listified Generic XDP handling (patches 10-12), though it doesn't
>   seem to help (see above).
> * Extended testing to cover software firewalls / netfilter etc.
>
> [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
> [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
> [3] http://lists.openwall.net/netdev/2016/04/19/89
>
> Edward Cree (12):
>   net: core: trivial netif_receive_skb_list() entry point
>   sfc: batch up RX delivery
>   net: core: unwrap skb list receive slightly further
>   net: core: Another step of skb receive list processing
>   net: core: another layer of lists, around PF_MEMALLOC skb handling
>   net: core: propagate SKB lists through packet_type lookup
>   net: ipv4: listified version of ip_rcv
>   net: ipv4: listify ip_rcv_finish
>   net: don't bother calling list RX functions on empty lists
>   net: listify Generic XDP processing, part 1
>   net: listify Generic XDP processing, part 2
>   net: listify jited Generic XDP processing on x86_64
>
>  arch/x86/net/bpf_jit_comp.c           | 164 ++++++++++++++
>  drivers/net/ethernet/sfc/efx.c        |  12 +
>  drivers/net/ethernet/sfc/net_driver.h |   3 +
>  drivers/net/ethernet/sfc/rx.c         |   7 +-
>  include/linux/filter.h                |  43 +++-
>  include/linux/netdevice.h             |   4 +
>  include/linux/netfilter.h             |  27 +++
>  include/linux/skbuff.h                |  16 ++
>  include/net/ip.h                      |   2 +
>  include/trace/events/net.h            |  14 ++
>  kernel/bpf/core.c                     |  38 +++-
>  net/core/dev.c                        | 415 +++++++++++++++++++++++++++++-----
>  net/core/filter.c                     |  10 +-
>  net/ipv4/af_inet.c                    |   1 +
>  net/ipv4/ip_input.c                   | 129 ++++++++++-
>  15 files changed, 810 insertions(+), 75 deletions(-)
>

^ permalink raw reply

* Re: [PATCH rdma-next 00/12] RDMA fixes 2018-06-24
From: Jason Gunthorpe @ 2018-06-26 20:39 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, RDMA mailing list, Matan Barak, Michael J Ruhl,
	Noa Osherovich, Raed Salem, Yishai Hadas, Saeed Mahameed,
	linux-netdev
In-Reply-To: <20180626042126.GM17747@mtr-leonro.mtl.com>

On Tue, Jun 26, 2018 at 07:21:26AM +0300, Leon Romanovsky wrote:
> On Mon, Jun 25, 2018 at 03:34:38PM -0600, Jason Gunthorpe wrote:
> > On Sun, Jun 24, 2018 at 11:23:41AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@mellanox.com>
> > >
> > > Hi,
> > >
> > > This is bunch of patches trigged by running syzkaller internally.
> > >
> > > I'm sending them based on rdma-next mainly for two reasons:
> > > 1, Most of the patches fix the old issues and it doesn't matter when
> > > they will hit the Linus's tree: now or later in a couple of weeks
> > > during merge window.
> > > 2. They interleave with code cleanup, mlx5-next patches and Michael's
> > > feedback on flow counters series.
> > >
> > > Thanks
> > >
> > > Leon Romanovsky (12):
> > >   RDMA/uverbs: Protect from attempts to create flows on unsupported QP
> > >   RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow
> >
> > I applied these two to for-rc
> >
> > >   RDMA/uverbs: Check existence of create_flow callback
> > >   RDMA/verbs: Drop kernel variant of create_flow
> > >   RDMA/verbs: Drop kernel variant of destroy_flow
> > >   net/mlx5: Rate limit errors in command interface
> > >   RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
> > >   RDMA/umem: Don't check for negative return value of dma_map_sg_attrs()
> > >   RDMA/uverbs: Remove redundant check
> >
> > These to for-next
> 
> Jason,
> 
> We would like to see patch "[PATCH mlx5-next 05/12] net/mlx5:
> Rate limit errors in command interface" in out mlx5-next. Is it possible
> at this point to drop it from for-next, so I'll be able to take it into
> mlx5-next?

Okay, you got to this while it was still 'wip', so it is dropped. Add
it to the mlx5-next branch and netdev or rdma can pull it next time
there is some reason to pull the branch..

Jason

^ permalink raw reply

* Re: Suspend of SDIO function devices
From: Daniel Mack @ 2018-06-26 20:34 UTC (permalink / raw)
  To: Ulf Hansson
  Cc: Chris Ball, linux-mmc@vger.kernel.org,
	libertas-dev@lists.infradead.org, linux-wireless,
	netdev@vger.kernel.org
In-Reply-To: <CAPDyKFrgcFrah34+eFMsBQYWr03bTSMMUuYw61jYfpffSQrQAg@mail.gmail.com>

Hi Ulf,

Thanks for the prompt reply!

On Monday, June 25, 2018 05:00 PM, Ulf Hansson wrote:
> On 24 June 2018 at 22:46, Daniel Mack <daniel@zonque.org> wrote:
>> Please advise, I'm happy to test approaches and send patches.
> 
>  From a top level point of view, I think this needs to be changed:
> 
> 1)
> In cases when the libertas sdio driver's ->suspend() callback, thinks
> of returning -ENOSYS, it should instead call if_sdio_power_off().
> Depending if if_sdio_power_save() has already been called, this shall
> be skipped.
> 
> The important thing here is to disable the SDIO func device and to
> release the SDIO irq.
> 
> 2)
> During resume, depending on whether the earlier ->suspend() callback
> invoked if_sdio_power_off(),  libertas sdio driver's ->resume()
> callback should call if_sdio_power_on().
> 
> This should re-initiate the libertas sdio device and re-program the
> firmware. To complete these actions, the firmware file also needs to
> be fetched, which requires file system accesses also to be resumed.
> 
> We also need to wait for the firmware programming to be completed,
> hence also do a "wait_event(card->pwron_waitq, priv->fw_ready);" from
> somewhere.

Great, thanks, that helped a lot! I have something that does the job ob 
my board. Will send out a patch shortly.


Best regards,
Daniel

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Jakub Kicinski @ 2018-06-26 20:31 UTC (permalink / raw)
  To: Okash Khawaja
  Cc: Martin KaFai Lau, Daniel Borkmann, Alexei Starovoitov,
	Yonghong Song, Quentin Monnet, David S. Miller, netdev,
	kernel-team, linux-kernel
In-Reply-To: <20180626164820.GA12981@w1t1fb>

On Tue, 26 Jun 2018 17:48:22 +0100, Okash Khawaja wrote:
> On Fri, Jun 22, 2018 at 05:26:39PM -0700, Martin KaFai Lau wrote:
> > On Fri, Jun 22, 2018 at 04:32:00PM -0700, Jakub Kicinski wrote:  
> > > On Fri, 22 Jun 2018 15:54:08 -0700, Martin KaFai Lau wrote:  
> > > > > > > > > > > > >         "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > > > > > >         ],
> > > > > > > > > > > > > 	"value_struct":{
> > > > > > > > > > > > > 		"src_ip":2,        
> > > > > > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > > > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > > > > > Is it breaking backward compat?
> > > > > > > > i.e.
> > > > > > > > struct five_tuples {
> > > > > > > > -	int src_ip;
> > > > > > > > +	int src_ip[4];
> > > > > > > > /* ... */
> > > > > > > > };      
> > > > > > > 
> > > > > > > Well, it is breaking backward compat, but it's the program doing it,
> > > > > > > not bpftool :)  BTF changes so does the output.      
> > > > > > As we see, the key/value's btf-output is inherently not backward compat.
> > > > > > Hence, "-j" and "-p" will stay as is.  The whole existing json will
> > > > > > be backward compat instead of only partly backward compat.    
> > > > > 
> > > > > No.  There is a difference between user of a facility changing their
> > > > > input and kernel/libraries providing different output in response to
> > > > > that, and the libraries suddenly changing the output on their own.
> > > > > 
> > > > > Your example is like saying if user started using IPv6 addresses
> > > > > instead of IPv4 the netlink attributes in dumps will be different so
> > > > > kernel didn't keep backwards compat.  While what you're doing is more
> > > > > equivalent to dropping support for old ioctl interfaces because there
> > > > > is a better mechanism now.    
> > > > Sorry, I don't follow this.  I don't see netlink suffer json issue like
> > > > the one on "key" and "value".
> > > > 
> > > > All I can grasp is, the json should normally be backward compat but now
> > > > we are saying anything added by btf-output is an exception because
> > > > the script parsing it will treat it differently than "key" and "value"  
> > > 
> > > Backward compatibility means that if I run *the same* program against
> > > different kernels/libraries it continues to work.  If someone decides
> > > to upgrade their program to work with IPv6 (which was your example)
> > > obviously there is no way system as a whole will look 1:1 the same.
> > >   
> > > > > BTF in JSON is very useful, and will help people who writes simple
> > > > > orchestration/scripts based on bpftool *a* *lot*.  I really appreciate    
> > > > Can you share what the script will do?  I want to understand why
> > > > it cannot directly use the BTF format and the map data.  
> > > 
> > > Think about a python script which wants to read a counter in a map.
> > > Right now it would have to get the BTF, find out which bytes are the
> > > counter, then convert the bytes into a larger int.  With JSON BTF if
> > > just does entry["formatted"]["value"]["counter"].
> > > 
> > > Real life example from my test code (conversion of 3 element counter
> > > array):
> > > 
> > > def str2int(strtab):
> > >     inttab = []
> > >     for i in strtab:
> > >         inttab.append(int(i, 16))
> > >     ba = bytearray(inttab)
> > >     if len(strtab) == 4:
> > >         fmt = "I"
> > >     elif len(strtab) == 8:
> > >         fmt = "Q"
> > >     else:
> > >         raise Exception("String array of len %d can't be unpacked to an int" %
> > >                         (len(strtab)))
> > >     return struct.unpack(fmt, ba)[0]
> > > 
> > > def convert(elems, idx):
> > >     val = []
> > >     for i in range(3):
> > >         part = elems[idx]["value"][i * length:(i + 1) * length]
> > >         val.append(str2int(part))
> > >     return val
> > > 
> > > With BTF it would be:
> > > 
> > > 	elems[idx]["formatted"]["value"]
> > > 
> > > Which is fairly awesome.  
> > Thanks for the example.  Agree that with BTF, things are easier in general.
> > 
> > btw, what more awesome is,  
> > #> bpftool map find id 100 key 1  
> > {
> > 	"counter_x": 1,
> > 	"counter_y": 10
> > }
> >   
> > >   
> > > > > this addition to bpftool and will start using it myself as soon as it
> > > > > lands.  I'm not sure why the reluctance to slightly change the output
> > > > > format?    
> > > > The initial change argument is because the json has to be backward compat.
> > > > 
> > > > Then we show that btf-output is inherently not backward compat, so
> > > > printing it in json does not make sense at all.
> > > > 
> > > > However, now it is saying part of it does not have to be backward compat.  
> > > 
> > > Compatibility of "formatted" member is defined as -> fields broken out
> > > according to BTF.  So it is backward compatible.  The definition of
> > > "value" member is -> an array of unfortunately formatted array of
> > > ugly hex strings :(
> > >   
> > > > I am fine putting it under "formatted" for "-j" or "-p" if that is the
> > > > case, other than the double output is still confusing.  Lets wait for
> > > > Okash's input.
> > > >
> > > > At the same time, the same output will be used as the default plaintext
> > > > output when BTF is available.  Then the plaintext BTF output
> > > > will not be limited by the json restrictions when we want
> > > > to improve human readability later.  Apparently, the
> > > > improvements on plaintext will not be always applicable
> > > > to json output.  
> > >   
> 
> hi,
> 
> so i guess following is what we want:
> 
> 1. a "formatted" object nested inside -p and -j switches for bpf map
>   dump. this will be JSON and backward compatible
> 2. an output for humans - which is like the current output. this will
> not be JSON. this won't have to be backward compatible. this output will
> be shown when neither of -j and -p are supplied and btf info is
> available.
> 
> i can update the patches to v2 which covers 2 above + all other comments
> on the patchset. later we can follow up with a patch for 1.

Please do both at the same time.  I've learnt not to trust people when
they say things like "we can follow up with xyz" :(  In my experience it
_always_ backfires.

Implementing both outputs in one series will help you structure your
code to best suit both of the formats up front.

^ permalink raw reply

* Re: [PATCH] dt-bindings: Fix unbalanced quotation marks
From: Rob Herring @ 2018-06-26 20:19 UTC (permalink / raw)
  To: Jonathan Neuschäfer
  Cc: devicetree, Kukjin Kim, Krzysztof Kozlowski, Mark Rutland,
	Linus Walleij, Dmitry Torokhov, Thomas Gleixner, Jason Cooper,
	Marc Zyngier, Thierry Reding, Jonathan Hunter, Maxime Coquelin,
	Alexandre Torgue, Hauke Mehrtens, Rafał Miłecki,
	Ralf Baechle, Paul Burton, James Hogan, Madalin Bucur
In-Reply-To: <20180617143127.11421-1-j.neuschaefer@gmx.net>

On Sun, Jun 17, 2018 at 04:31:18PM +0200, Jonathan Neuschäfer wrote:
> Multiple binding documents have various forms of unbalanced quotation
> marks. Fix them.
> 
> Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
> ---
> 
> Should I split this patch so that different parts can go through different trees?

No, I applied it. Thanks.

> ---
>  .../devicetree/bindings/arm/samsung/samsung-boards.txt          | 2 +-
>  .../devicetree/bindings/gpio/nintendo,hollywood-gpio.txt        | 2 +-
>  Documentation/devicetree/bindings/input/touchscreen/hideep.txt  | 2 +-
>  .../bindings/interrupt-controller/nvidia,tegra20-ictlr.txt      | 2 +-
>  .../devicetree/bindings/interrupt-controller/st,stm32-exti.txt  | 2 +-
>  Documentation/devicetree/bindings/mips/brcm/soc.txt             | 2 +-
>  Documentation/devicetree/bindings/net/fsl-fman.txt              | 2 +-
>  Documentation/devicetree/bindings/power/power_domain.txt        | 2 +-
>  Documentation/devicetree/bindings/regulator/tps65090.txt        | 2 +-
>  Documentation/devicetree/bindings/reset/st,sti-softreset.txt    | 2 +-
>  Documentation/devicetree/bindings/sound/qcom,apq8016-sbc.txt    | 2 +-
>  Documentation/devicetree/bindings/sound/qcom,apq8096.txt        | 2 +-
>  12 files changed, 12 insertions(+), 12 deletions(-)

^ permalink raw reply

* Re: [PATCH net-next 2/4] net/sched: act_tunnel_key: add extended ack support
From: David Ahern @ 2018-06-26 19:50 UTC (permalink / raw)
  To: Jakub Kicinski, davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
	netdev, Simon Horman, Alexander Aring, Pieter Jansen van Vuuren
In-Reply-To: <20180626185308.3605-3-jakub.kicinski@netronome.com>

On 6/26/18 12:53 PM, Jakub Kicinski wrote:
> From: Simon Horman <simon.horman@netronome.com>
> 
> Add extended ack support for the tunnel key action by using NL_SET_ERR_MSG
> during validation of user input.
> 
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Alexander Aring <aring@mojatatu.com>
> Signed-off-by: Simon Horman <simon.horman@netronome.com>
> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
>  net/sched/act_tunnel_key.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 

Reviewed-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* [PATCH net-next] netlink: Return extack message if attribute validation fails
From: dsahern @ 2018-06-26 19:39 UTC (permalink / raw)
  To: netdev; +Cc: jakub.kicinski, David Ahern

From: David Ahern <dsahern@gmail.com>

Have one extack message for parsing and validating.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 lib/nlattr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/nlattr.c b/lib/nlattr.c
index dfa55c873c13..e335bcafa9e4 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -253,8 +253,8 @@ int nla_parse(struct nlattr **tb, int maxtype, const struct nlattr *head,
 			if (policy) {
 				err = validate_nla(nla, maxtype, policy);
 				if (err < 0) {
-					if (extack)
-						extack->bad_attr = nla;
+					NL_SET_ERR_MSG_ATTR(extack, nla,
+							    "Attribute failed policy validation");
 					goto errout;
 				}
 			}
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net-next v2 3/3] net: phy: xgmiitorgmii: Check read_status results
From: Andrew Lunn @ 2018-06-26 19:38 UTC (permalink / raw)
  To: Brandon Maier
  Cc: netdev, f.fainelli, davem, michal.simek, clayton.shotwell,
	kristopher.cory, linux-kernel
In-Reply-To: <20180626175050.71165-3-brandon.maier@rockwellcollins.com>

On Tue, Jun 26, 2018 at 12:50:50PM -0500, Brandon Maier wrote:
> We're ignoring the result of the attached phy device's read_status().
> Return it so we can detect errors.
> 
> Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next v2 2/3] net: phy: xgmiitorgmii: Use correct mdio bus
From: Andrew Lunn @ 2018-06-26 19:37 UTC (permalink / raw)
  To: Brandon Maier
  Cc: netdev, f.fainelli, davem, michal.simek, clayton.shotwell,
	kristopher.cory, linux-kernel
In-Reply-To: <20180626175050.71165-2-brandon.maier@rockwellcollins.com>

On Tue, Jun 26, 2018 at 12:50:49PM -0500, Brandon Maier wrote:
> The xgmiitorgmii is using the mii_bus of the device it's attached to,
> instead of the bus it was given during probe.
> 
> Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next v2 1/3] net: phy: xgmiitorgmii: Check phy_driver ready before accessing
From: Andrew Lunn @ 2018-06-26 19:36 UTC (permalink / raw)
  To: Brandon Maier
  Cc: netdev, f.fainelli, davem, michal.simek, clayton.shotwell,
	kristopher.cory, linux-kernel
In-Reply-To: <20180626175050.71165-1-brandon.maier@rockwellcollins.com>

On Tue, Jun 26, 2018 at 12:50:48PM -0500, Brandon Maier wrote:
> Since a phy_device is added to the global mdio_bus list during
> phy_device_register(), but a phy_device's phy_driver doesn't get
> attached until phy_probe(). It's possible of_phy_find_device() in
> xgmiitorgmii will return a valid phy with a NULL phy_driver. Leading to
> a NULL pointer access during the memcpy().


> Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Andrew Lunn @ 2018-06-26 19:35 UTC (permalink / raw)
  To: Vadim Pasternak
  Cc: Guenter Roeck, linux-pm@vger.kernel.org, netdev@vger.kernel.org,
	rui.zhang@intel.com, edubezval@gmail.com, jiri@resnulli.us
In-Reply-To: <HE1PR0502MB37535AE1B1E7093E198F0A67A2490@HE1PR0502MB3753.eurprd05.prod.outlook.com>

On Tue, Jun 26, 2018 at 07:01:32PM +0000, Vadim Pasternak wrote:
> 
> 
> > -----Original Message-----
> > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > Sent: Tuesday, June 26, 2018 9:18 PM
> > To: Vadim Pasternak <vadimp@mellanox.com>
> > Cc: Guenter Roeck <linux@roeck-us.net>; linux-pm@vger.kernel.org;
> > netdev@vger.kernel.org; rui.zhang@intel.com; edubezval@gmail.com;
> > jiri@resnulli.us
> > Subject: Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment
> > module for port temperature reading
> > 
> > > However, I have some concerns on this matter.
> > > Our hardware provides bulk reading of the modules temperature, means I
> > > can get all inputs by one hardware request, which is important optimization.
> > > Reading each module individually will be resulted in huge overhead and
> > > will require maybe some cashing of temperature inputs.
> > 
> > Well, you can cache the SFP calibration values, and the 4 limit values. To get an
> > actually temperature you need to read 2 bytes from the SFP module. I don't see
> > why that would be expensive. You talk to the firmware over PCIe right? So you
> > have lots of bandwidth.
> 
> Yes, but FW in its turn will run I2C transaction to read temperature sensor.

So how does that add overhead? It needs to read the same two bytes
independent of if it is getting readings from one sensor, or all
sensors.

> And we also run hwmon and thermal parts of our driver on BMC (Based
> Management Controller) on system equipped with it.
> In such case host CPU performs networking stuff, while BMC system related
> stuff. And in such configuration BMC talks to FW over I2C.

So you have a 20MHz I2C bus between your BMC and the firmware. Lets
assume a relativity dumb protocol. 2 bytes for command to read an sfp
sensor, 3 bytes for a replay. 5 bytes, at 20Mbps allows you to read
500,000 sensors per second. And for environment monitoring, 64 sensors
one per second should be sufficient.

    Andrew

^ permalink raw reply

* RE: [patch net-next RFC 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Vadim Pasternak @ 2018-06-26 19:01 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Guenter Roeck, linux-pm@vger.kernel.org, netdev@vger.kernel.org,
	rui.zhang@intel.com, edubezval@gmail.com, jiri@resnulli.us
In-Reply-To: <20180626181821.GA9800@lunn.ch>



> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@lunn.ch]
> Sent: Tuesday, June 26, 2018 9:18 PM
> To: Vadim Pasternak <vadimp@mellanox.com>
> Cc: Guenter Roeck <linux@roeck-us.net>; linux-pm@vger.kernel.org;
> netdev@vger.kernel.org; rui.zhang@intel.com; edubezval@gmail.com;
> jiri@resnulli.us
> Subject: Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment
> module for port temperature reading
> 
> > However, I have some concerns on this matter.
> > Our hardware provides bulk reading of the modules temperature, means I
> > can get all inputs by one hardware request, which is important optimization.
> > Reading each module individually will be resulted in huge overhead and
> > will require maybe some cashing of temperature inputs.
> 
> Well, you can cache the SFP calibration values, and the 4 limit values. To get an
> actually temperature you need to read 2 bytes from the SFP module. I don't see
> why that would be expensive. You talk to the firmware over PCIe right? So you
> have lots of bandwidth.

Yes, but FW in its turn will run I2C transaction to read temperature sensor.

And we also run hwmon and thermal parts of our driver on BMC (Based
Management Controller) on system equipped with it.
In such case host CPU performs networking stuff, while BMC system related
stuff. And in such configuration BMC talks to FW over I2C.
So I'll must to cache.

> 
>     Andrew

^ permalink raw reply

* [PATCH net-next 4/4] net/sched: add tunnel option support to act_tunnel_key
From: Jakub Kicinski @ 2018-06-26 18:53 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
	netdev, Simon Horman, Pieter Jansen van Vuuren
In-Reply-To: <20180626185308.3605-1-jakub.kicinski@netronome.com>

From: Simon Horman <simon.horman@netronome.com>

Allow setting tunnel options using the act_tunnel_key action.

Options are expressed as class:type:data and multiple options
may be listed using a comma delimiter.

 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev eth0 ingress
 # tc filter add dev eth0 protocol ip parent ffff: \
     flower indev eth0 \
        ip_proto udp \
        action tunnel_key \
            set src_ip 10.0.99.192 \
            dst_ip 10.0.99.193 \
            dst_port 6081 \
            id 11 \
            geneve_opts 0102:80:00800022,0102:80:00800022 \
    action mirred egress redirect dev geneve0

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/uapi/linux/tc_act/tc_tunnel_key.h |  26 +++
 net/sched/act_tunnel_key.c                | 214 +++++++++++++++++++++-
 2 files changed, 236 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/tc_act/tc_tunnel_key.h b/include/uapi/linux/tc_act/tc_tunnel_key.h
index 72bbefe5d1d1..1b7bdd841b98 100644
--- a/include/uapi/linux/tc_act/tc_tunnel_key.h
+++ b/include/uapi/linux/tc_act/tc_tunnel_key.h
@@ -36,9 +36,35 @@ enum {
 	TCA_TUNNEL_KEY_PAD,
 	TCA_TUNNEL_KEY_ENC_DST_PORT,	/* be16 */
 	TCA_TUNNEL_KEY_NO_CSUM,		/* u8 */
+	TCA_TUNNEL_KEY_ENC_OPTS,	/* Nested TCA_TUNNEL_KEY_ENC_OPTS_
+					 * attributes
+					 */
 	__TCA_TUNNEL_KEY_MAX,
 };
 
 #define TCA_TUNNEL_KEY_MAX (__TCA_TUNNEL_KEY_MAX - 1)
 
+enum {
+	TCA_TUNNEL_KEY_ENC_OPTS_UNSPEC,
+	TCA_TUNNEL_KEY_ENC_OPTS_GENEVE,		/* Nested
+						 * TCA_TUNNEL_KEY_ENC_OPTS_
+						 * attributes
+						 */
+	__TCA_TUNNEL_KEY_ENC_OPTS_MAX,
+};
+
+#define TCA_TUNNEL_KEY_ENC_OPTS_MAX (__TCA_TUNNEL_KEY_ENC_OPTS_MAX - 1)
+
+enum {
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_UNSPEC,
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS,		/* u16 */
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE,		/* u8 */
+	TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA,		/* 4 to 128 bytes */
+
+	__TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
+};
+
+#define TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX \
+	(__TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX - 1)
+
 #endif
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 20e98ed8d498..5eace86ea687 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -13,6 +13,7 @@
 #include <linux/kernel.h>
 #include <linux/skbuff.h>
 #include <linux/rtnetlink.h>
+#include <net/geneve.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
 #include <net/dst.h>
@@ -57,6 +58,135 @@ static int tunnel_key_act(struct sk_buff *skb, const struct tc_action *a,
 	return action;
 }
 
+static const struct nla_policy
+enc_opts_policy[TCA_TUNNEL_KEY_ENC_OPTS_MAX + 1] = {
+	[TCA_TUNNEL_KEY_ENC_OPTS_GENEVE]	= { .type = NLA_NESTED },
+};
+
+static const struct nla_policy
+geneve_opt_policy[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1] = {
+	[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]	   = { .type = NLA_U16 },
+	[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]	   = { .type = NLA_U8 },
+	[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]	   = { .type = NLA_BINARY,
+						       .len = 128 },
+};
+
+static int
+tunnel_key_copy_geneve_opt(const struct nlattr *nla, void *dst, int dst_len,
+			   struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1];
+	int err, data_len, opt_len;
+	u8 *data;
+
+	err = nla_parse_nested(tb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
+			       nla, geneve_opt_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS] ||
+	    !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE] ||
+	    !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]) {
+		NL_SET_ERR_MSG(extack, "Missing tunnel key geneve option class, type or data");
+		return -EINVAL;
+	}
+
+	data = nla_data(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
+	data_len = nla_len(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
+	if (data_len < 4) {
+		NL_SET_ERR_MSG(extack, "Tunnel key geneve option data is less than 4 bytes long");
+		return -ERANGE;
+	}
+	if (data_len % 4) {
+		NL_SET_ERR_MSG(extack, "Tunnel key geneve option data is not a multiple of 4 bytes long");
+		return -ERANGE;
+	}
+
+	opt_len = sizeof(struct geneve_opt) + data_len;
+	if (dst) {
+		struct geneve_opt *opt = dst;
+
+		WARN_ON(dst_len < opt_len);
+
+		opt->opt_class =
+			nla_get_u16(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]);
+		opt->type = nla_get_u8(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]);
+		opt->length = data_len / 4; /* length is in units of 4 bytes */
+		opt->r1 = 0;
+		opt->r2 = 0;
+		opt->r3 = 0;
+
+		memcpy(opt + 1, data, data_len);
+	}
+
+	return opt_len;
+}
+
+static int tunnel_key_copy_opts(const struct nlattr *nla, u8 *dst,
+				int dst_len, struct netlink_ext_ack *extack)
+{
+	int err, rem, opt_len, len = nla_len(nla), opts_len = 0;
+	const struct nlattr *attr, *head = nla_data(nla);
+
+	err = nla_validate(head, len, TCA_TUNNEL_KEY_ENC_OPTS_MAX,
+			   enc_opts_policy, extack);
+	if (err)
+		return err;
+
+	nla_for_each_attr(attr, head, len, rem) {
+		switch (nla_type(attr)) {
+		case TCA_TUNNEL_KEY_ENC_OPTS_GENEVE:
+			opt_len = tunnel_key_copy_geneve_opt(attr, dst,
+							     dst_len, extack);
+			if (opt_len < 0)
+				return opt_len;
+			opts_len += opt_len;
+			if (dst) {
+				dst_len -= opt_len;
+				dst += opt_len;
+			}
+			break;
+		}
+	}
+
+	if (!opts_len) {
+		NL_SET_ERR_MSG(extack, "Empty list of tunnel options");
+		return -EINVAL;
+	}
+
+	if (rem > 0) {
+		NL_SET_ERR_MSG(extack, "Trailing data after parsing tunnel key options attributes");
+		return -EINVAL;
+	}
+
+	return opts_len;
+}
+
+static int tunnel_key_get_opts_len(struct nlattr *nla,
+				   struct netlink_ext_ack *extack)
+{
+	return tunnel_key_copy_opts(nla, NULL, 0, extack);
+}
+
+static int tunnel_key_opts_set(struct nlattr *nla, struct ip_tunnel_info *info,
+			       int opts_len, struct netlink_ext_ack *extack)
+{
+	info->options_len = opts_len;
+	switch (nla_type(nla_data(nla))) {
+	case TCA_TUNNEL_KEY_ENC_OPTS_GENEVE:
+#if IS_ENABLED(CONFIG_INET)
+		info->key.tun_flags |= TUNNEL_GENEVE_OPT;
+		return tunnel_key_copy_opts(nla, ip_tunnel_info_opts(info),
+					    opts_len, extack);
+#else
+		return -EAFNOSUPPORT;
+#endif
+	default:
+		NL_SET_ERR_MSG(extack, "Cannot set tunnel options for unknown tunnel type");
+		return -EINVAL;
+	}
+}
+
 static const struct nla_policy tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
 	[TCA_TUNNEL_KEY_PARMS]	    = { .len = sizeof(struct tc_tunnel_key) },
 	[TCA_TUNNEL_KEY_ENC_IPV4_SRC] = { .type = NLA_U32 },
@@ -66,6 +196,7 @@ static const struct nla_policy tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
 	[TCA_TUNNEL_KEY_ENC_KEY_ID]   = { .type = NLA_U32 },
 	[TCA_TUNNEL_KEY_ENC_DST_PORT] = {.type = NLA_U16},
 	[TCA_TUNNEL_KEY_NO_CSUM]      = { .type = NLA_U8 },
+	[TCA_TUNNEL_KEY_ENC_OPTS]     = { .type = NLA_NESTED },
 };
 
 static int tunnel_key_init(struct net *net, struct nlattr *nla,
@@ -81,6 +212,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	struct tcf_tunnel_key *t;
 	bool exists = false;
 	__be16 dst_port = 0;
+	int opts_len = 0;
 	__be64 key_id;
 	__be16 flags;
 	int ret = 0;
@@ -128,6 +260,15 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		if (tb[TCA_TUNNEL_KEY_ENC_DST_PORT])
 			dst_port = nla_get_be16(tb[TCA_TUNNEL_KEY_ENC_DST_PORT]);
 
+		if (tb[TCA_TUNNEL_KEY_ENC_OPTS]) {
+			opts_len = tunnel_key_get_opts_len(tb[TCA_TUNNEL_KEY_ENC_OPTS],
+							   extack);
+			if (opts_len < 0) {
+				ret = opts_len;
+				goto err_out;
+			}
+		}
+
 		if (tb[TCA_TUNNEL_KEY_ENC_IPV4_SRC] &&
 		    tb[TCA_TUNNEL_KEY_ENC_IPV4_DST]) {
 			__be32 saddr;
@@ -138,7 +279,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 
 			metadata = __ip_tun_set_dst(saddr, daddr, 0, 0,
 						    dst_port, flags,
-						    key_id, 0);
+						    key_id, opts_len);
 		} else if (tb[TCA_TUNNEL_KEY_ENC_IPV6_SRC] &&
 			   tb[TCA_TUNNEL_KEY_ENC_IPV6_DST]) {
 			struct in6_addr saddr;
@@ -162,6 +303,14 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 			goto err_out;
 		}
 
+		if (opts_len) {
+			ret = tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS],
+						  &metadata->u.tun_info,
+						  opts_len, extack);
+			if (ret < 0)
+				goto err_out;
+		}
+
 		metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
 		break;
 	default:
@@ -234,6 +383,61 @@ static void tunnel_key_release(struct tc_action *a)
 	}
 }
 
+static int tunnel_key_geneve_opts_dump(struct sk_buff *skb,
+				       const struct ip_tunnel_info *info)
+{
+	int len = info->options_len;
+	u8 *src = (u8 *)(info + 1);
+	struct nlattr *start;
+
+	start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS_GENEVE);
+	if (!start)
+		return -EMSGSIZE;
+
+	while (len > 0) {
+		struct geneve_opt *opt = (struct geneve_opt *)src;
+
+		if (nla_put_u16(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS,
+				opt->opt_class) ||
+		    nla_put_u8(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE,
+			       opt->type) ||
+		    nla_put(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA,
+			    opt->length * 4, opt + 1))
+			return -EMSGSIZE;
+
+		len -= sizeof(struct geneve_opt) + opt->length * 4;
+		src += sizeof(struct geneve_opt) + opt->length * 4;
+	}
+
+	nla_nest_end(skb, start);
+	return 0;
+}
+
+static int tunnel_key_opts_dump(struct sk_buff *skb,
+				const struct ip_tunnel_info *info)
+{
+	struct nlattr *start;
+	int err;
+
+	if (!info->options_len)
+		return 0;
+
+	start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS);
+	if (!start)
+		return -EMSGSIZE;
+
+	if (info->key.tun_flags & TUNNEL_GENEVE_OPT) {
+		err = tunnel_key_geneve_opts_dump(skb, info);
+		if (err)
+			return err;
+	} else {
+		return -EINVAL;
+	}
+
+	nla_nest_end(skb, start);
+	return 0;
+}
+
 static int tunnel_key_dump_addresses(struct sk_buff *skb,
 				     const struct ip_tunnel_info *info)
 {
@@ -284,8 +488,9 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 		goto nla_put_failure;
 
 	if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET) {
-		struct ip_tunnel_key *key =
-			&params->tcft_enc_metadata->u.tun_info.key;
+		struct ip_tunnel_info *info =
+			&params->tcft_enc_metadata->u.tun_info;
+		struct ip_tunnel_key *key = &info->key;
 		__be32 key_id = tunnel_id_to_key32(key->tun_id);
 
 		if (nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id) ||
@@ -293,7 +498,8 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 					      &params->tcft_enc_metadata->u.tun_info) ||
 		    nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) ||
 		    nla_put_u8(skb, TCA_TUNNEL_KEY_NO_CSUM,
-			       !(key->tun_flags & TUNNEL_CSUM)))
+			       !(key->tun_flags & TUNNEL_CSUM)) ||
+		    tunnel_key_opts_dump(skb, info))
 			goto nla_put_failure;
 	}
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next 3/4] net: check tunnel option type in tunnel flags
From: Jakub Kicinski @ 2018-06-26 18:53 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
	netdev, Pieter Jansen van Vuuren, Jakub Kicinski
In-Reply-To: <20180626185308.3605-1-jakub.kicinski@netronome.com>

From: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>

Check the tunnel option type stored in tunnel flags when creating options
for tunnels. Thereby ensuring we do not set geneve, vxlan or erspan tunnel
options on interfaces that are not associated with them.

Make sure all users of the infrastructure set correct flags, for the BPF
helper we have to set all bits to keep backward compatibility.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/geneve.c           | 6 ++++--
 drivers/net/vxlan.c            | 3 ++-
 include/net/ip_tunnels.h       | 8 ++++++--
 net/core/filter.c              | 2 +-
 net/ipv4/ip_gre.c              | 2 ++
 net/ipv6/ip6_gre.c             | 2 ++
 net/openvswitch/flow_netlink.c | 8 ++++++--
 7 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 3e94375b9b01..471edd76ff55 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -236,7 +236,8 @@ static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs,
 		}
 		/* Update tunnel dst according to Geneve options. */
 		ip_tunnel_info_opts_set(&tun_dst->u.tun_info,
-					gnvh->options, gnvh->opt_len * 4);
+					gnvh->options, gnvh->opt_len * 4,
+					TUNNEL_GENEVE_OPT);
 	} else {
 		/* Drop packets w/ critical options,
 		 * since we don't support any...
@@ -675,7 +676,8 @@ static void geneve_build_header(struct genevehdr *geneveh,
 	geneveh->proto_type = htons(ETH_P_TEB);
 	geneveh->rsvd2 = 0;
 
-	ip_tunnel_info_opts_get(geneveh->options, info);
+	if (info->key.tun_flags & TUNNEL_GENEVE_OPT)
+		ip_tunnel_info_opts_get(geneveh->options, info);
 }
 
 static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb,
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index cc14e0cd5647..7eb30d7c8bd7 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2122,7 +2122,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		vni = tunnel_id_to_key32(info->key.tun_id);
 		ifindex = 0;
 		dst_cache = &info->dst_cache;
-		if (info->options_len)
+		if (info->options_len &&
+		    info->key.tun_flags & TUNNEL_VXLAN_OPT)
 			md = ip_tunnel_info_opts(info);
 		ttl = info->key.ttl;
 		tos = info->key.tos;
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 90ff430f5e9d..b0d022ff6ea1 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -466,10 +466,12 @@ static inline void ip_tunnel_info_opts_get(void *to,
 }
 
 static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
-					   const void *from, int len)
+					   const void *from, int len,
+					   __be16 flags)
 {
 	memcpy(ip_tunnel_info_opts(info), from, len);
 	info->options_len = len;
+	info->key.tun_flags |= flags;
 }
 
 static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate)
@@ -511,9 +513,11 @@ static inline void ip_tunnel_info_opts_get(void *to,
 }
 
 static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
-					   const void *from, int len)
+					   const void *from, int len,
+					   __be16 flags)
 {
 	info->options_len = 0;
+	info->key.tun_flags |= flags;
 }
 
 #endif /* CONFIG_INET */
diff --git a/net/core/filter.c b/net/core/filter.c
index e7f12e9f598c..dade922678f6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3582,7 +3582,7 @@ BPF_CALL_3(bpf_skb_set_tunnel_opt, struct sk_buff *, skb,
 	if (unlikely(size > IP_TUNNEL_OPTS_MAX))
 		return -ENOMEM;
 
-	ip_tunnel_info_opts_set(info, from, size);
+	ip_tunnel_info_opts_set(info, from, size, TUNNEL_OPTIONS_PRESENT);
 
 	return 0;
 }
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2d8efeecf619..c8ca5d8f0f75 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -587,6 +587,8 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
 		goto err_free_skb;
 
 	key = &tun_info->key;
+	if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT))
+		goto err_free_rt;
 	md = ip_tunnel_info_opts(tun_info);
 	if (!md)
 		goto err_free_rt;
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index c8cf2fdbb13b..367177786e34 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -990,6 +990,8 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
 		fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL);
 
 		dsfield = key->tos;
+		if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT))
+			goto tx_err;
 		md = ip_tunnel_info_opts(tun_info);
 		if (!md)
 			goto tx_err;
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 492ab0c36f7c..26ab964923fd 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2515,8 +2515,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	struct ip_tunnel_info *tun_info;
 	struct ovs_tunnel_info *ovs_tun;
 	struct nlattr *a;
-	int err = 0, start, opts_type;
+	int err = 0, start, opts_type, dst_opt_type;
 
+	dst_opt_type = 0;
 	ovs_match_init(&match, &key, true, NULL);
 	opts_type = ip_tun_from_nlattr(nla_data(attr), &match, false, log);
 	if (opts_type < 0)
@@ -2528,10 +2529,13 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 			err = validate_geneve_opts(&key);
 			if (err < 0)
 				return err;
+			dst_opt_type = TUNNEL_GENEVE_OPT;
 			break;
 		case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+			dst_opt_type = TUNNEL_VXLAN_OPT;
 			break;
 		case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
+			dst_opt_type = TUNNEL_ERSPAN_OPT;
 			break;
 		}
 	}
@@ -2574,7 +2578,7 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	 */
 	ip_tunnel_info_opts_set(tun_info,
 				TUN_METADATA_OPTS(&key, key.tun_opts_len),
-				key.tun_opts_len);
+				key.tun_opts_len, dst_opt_type);
 	add_nested_action_end(*sfa, start);
 
 	return err;
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next 2/4] net/sched: act_tunnel_key: add extended ack support
From: Jakub Kicinski @ 2018-06-26 18:53 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
	netdev, Simon Horman, David Ahern, Alexander Aring,
	Pieter Jansen van Vuuren
In-Reply-To: <20180626185308.3605-1-jakub.kicinski@netronome.com>

From: Simon Horman <simon.horman@netronome.com>

Add extended ack support for the tunnel key action by using NL_SET_ERR_MSG
during validation of user input.

Cc: David Ahern <dsahern@gmail.com>
Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 net/sched/act_tunnel_key.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 2edd389e7c92..20e98ed8d498 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -86,16 +86,22 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	int ret = 0;
 	int err;
 
-	if (!nla)
+	if (!nla) {
+		NL_SET_ERR_MSG(extack, "Tunnel requires attributes to be passed");
 		return -EINVAL;
+	}
 
 	err = nla_parse_nested(tb, TCA_TUNNEL_KEY_MAX, nla, tunnel_key_policy,
-			       NULL);
-	if (err < 0)
+			       extack);
+	if (err < 0) {
+		NL_SET_ERR_MSG(extack, "Failed to parse nested tunnel key attributes");
 		return err;
+	}
 
-	if (!tb[TCA_TUNNEL_KEY_PARMS])
+	if (!tb[TCA_TUNNEL_KEY_PARMS]) {
+		NL_SET_ERR_MSG(extack, "Missing tunnel key parameters");
 		return -EINVAL;
+	}
 
 	parm = nla_data(tb[TCA_TUNNEL_KEY_PARMS]);
 	exists = tcf_idr_check(tn, parm->index, a, bind);
@@ -107,6 +113,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		break;
 	case TCA_TUNNEL_KEY_ACT_SET:
 		if (!tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
+			NL_SET_ERR_MSG(extack, "Missing tunnel key id");
 			ret = -EINVAL;
 			goto err_out;
 		}
@@ -144,11 +151,13 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 						      0, flags,
 						      key_id, 0);
 		} else {
+			NL_SET_ERR_MSG(extack, "Missing either ipv4 or ipv6 src and dst");
 			ret = -EINVAL;
 			goto err_out;
 		}
 
 		if (!metadata) {
+			NL_SET_ERR_MSG(extack, "Cannot allocate tunnel metadata dst");
 			ret = -ENOMEM;
 			goto err_out;
 		}
@@ -156,6 +165,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
 		break;
 	default:
+		NL_SET_ERR_MSG(extack, "Unknown tunnel key action");
 		ret = -EINVAL;
 		goto err_out;
 	}
@@ -163,14 +173,18 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	if (!exists) {
 		ret = tcf_idr_create(tn, parm->index, est, a,
 				     &act_tunnel_key_ops, bind, true);
-		if (ret)
+		if (ret) {
+			NL_SET_ERR_MSG(extack, "Cannot create TC IDR");
 			return ret;
+		}
 
 		ret = ACT_P_CREATED;
 	} else {
 		tcf_idr_release(*a, bind);
-		if (!ovr)
+		if (!ovr) {
+			NL_SET_ERR_MSG(extack, "TC IDR already exists");
 			return -EEXIST;
+		}
 	}
 
 	t = to_tunnel_key(*a);
@@ -180,6 +194,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	if (unlikely(!params_new)) {
 		if (ret == ACT_P_CREATED)
 			tcf_idr_release(*a, bind);
+		NL_SET_ERR_MSG(extack, "Cannot allocate tunnel key parameters");
 		return -ENOMEM;
 	}
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next 1/4] net/sched: act_tunnel_key: disambiguate metadata dst error cases
From: Jakub Kicinski @ 2018-06-26 18:53 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
	netdev, Simon Horman, David Ahern, Alexander Aring
In-Reply-To: <20180626185308.3605-1-jakub.kicinski@netronome.com>

From: Simon Horman <simon.horman@netronome.com>

Metadata may be NULL for one of two reasons:
* Missing user input
* Failure to allocate the metadata dst

Disambiguate these case by returning -EINVAL for the former and -ENOMEM
for the latter rather than -EINVAL for both cases.

This is in preparation for using extended ack to provide more information
to users when parsing their input.

Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 net/sched/act_tunnel_key.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 626dac81a48a..2edd389e7c92 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -143,10 +143,13 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 			metadata = __ipv6_tun_set_dst(&saddr, &daddr, 0, 0, dst_port,
 						      0, flags,
 						      key_id, 0);
+		} else {
+			ret = -EINVAL;
+			goto err_out;
 		}
 
 		if (!metadata) {
-			ret = -EINVAL;
+			ret = -ENOMEM;
 			goto err_out;
 		}
 
-- 
2.17.1

^ permalink raw reply related

* [PATCH net-next 0/4] net: Geneve options support for TC act_tunnel_key
From: Jakub Kicinski @ 2018-06-26 18:53 UTC (permalink / raw)
  To: davem, jbenc
  Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
	netdev, Jakub Kicinski

Hi,

Simon & Pieter say:

This set adds Geneve Options support to the TC tunnel key action.
It provides the plumbing required to configure Geneve variable length
options.  The options can be configured in the form CLASS:TYPE:DATA,
where CLASS is represented as a 16bit hexadecimal value, TYPE as an 8bit
hexadecimal value and DATA as a variable length hexadecimal value.
Additionally multiple options may be listed using a comma delimiter.


Pieter Jansen van Vuuren (1):
  net: check tunnel option type in tunnel flags

Simon Horman (3):
  net/sched: act_tunnel_key: disambiguate metadata dst error cases
  net/sched: act_tunnel_key: add extended ack support
  net/sched: add tunnel option support to act_tunnel_key

 drivers/net/geneve.c                      |   6 +-
 drivers/net/vxlan.c                       |   3 +-
 include/net/ip_tunnels.h                  |   8 +-
 include/uapi/linux/tc_act/tc_tunnel_key.h |  26 +++
 net/core/filter.c                         |   2 +-
 net/ipv4/ip_gre.c                         |   2 +
 net/ipv6/ip6_gre.c                        |   2 +
 net/openvswitch/flow_netlink.c            |   8 +-
 net/sched/act_tunnel_key.c                | 242 +++++++++++++++++++++-
 9 files changed, 280 insertions(+), 19 deletions(-)

-- 
2.17.1

^ permalink raw reply

* Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface with FAN fault attribute
From: Guenter Roeck @ 2018-06-26 18:32 UTC (permalink / raw)
  To: Vadim Pasternak
  Cc: Andrew Lunn, davem@davemloft.net, netdev@vger.kernel.org,
	rui.zhang@intel.com, edubezval@gmail.com, jiri@resnulli.us, mlxsw,
	Michael Shych
In-Reply-To: <HE1PR0502MB3753FB368101C4B5B5849BE6A2490@HE1PR0502MB3753.eurprd05.prod.outlook.com>

On Tue, Jun 26, 2018 at 04:47:05PM +0000, Vadim Pasternak wrote:
> 
> 
> > -----Original Message-----
> > From: Guenter Roeck [mailto:linux@roeck-us.net]
> > Sent: Tuesday, June 26, 2018 7:33 PM
> > To: Vadim Pasternak <vadimp@mellanox.com>
> > Cc: Andrew Lunn <andrew@lunn.ch>; davem@davemloft.net;
> > netdev@vger.kernel.org; rui.zhang@intel.com; edubezval@gmail.com;
> > jiri@resnulli.us; mlxsw <mlxsw@mellanox.com>; Michael Shych
> > <michaelsh@mellanox.com>
> > Subject: Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon interface
> > with FAN fault attribute
> > 
> > On Tue, Jun 26, 2018 at 02:47:01PM +0000, Vadim Pasternak wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > > > Sent: Tuesday, June 26, 2018 5:29 PM
> > > > To: Vadim Pasternak <vadimp@mellanox.com>
> > > > Cc: davem@davemloft.net; netdev@vger.kernel.org; linux@roeck-us.net;
> > > > rui.zhang@intel.com; edubezval@gmail.com; jiri@resnulli.us; mlxsw
> > > > <mlxsw@mellanox.com>; Michael Shych <michaelsh@mellanox.com>
> > > > Subject: Re: [patch net-next RFC 11/12] mlxsw: core: Extend hwmon
> > > > interface with FAN fault attribute
> > > >
> > > > > +static ssize_t mlxsw_hwmon_fan_fault_show(struct device *dev,
> > > > > +					  struct device_attribute *attr,
> > > > > +					  char *buf)
> > > > > +{
> > > > > +	struct mlxsw_hwmon_attr *mlwsw_hwmon_attr =
> > > > > +			container_of(attr, struct mlxsw_hwmon_attr,
> > > > dev_attr);
> > > > > +	struct mlxsw_hwmon *mlxsw_hwmon = mlwsw_hwmon_attr->hwmon;
> > > > > +	char mfsm_pl[MLXSW_REG_MFSM_LEN];
> > > > > +	u16 tach;
> > > > > +	int err;
> > > > > +
> > > > > +	mlxsw_reg_mfsm_pack(mfsm_pl, mlwsw_hwmon_attr->type_index);
> > > > > +	err = mlxsw_reg_query(mlxsw_hwmon->core, MLXSW_REG(mfsm),
> > > > mfsm_pl);
> > > > > +	if (err) {
> > > > > +		dev_err(mlxsw_hwmon->bus_info->dev, "Failed to query
> > > > fan\n");
> > > > > +		return err;
> > > > > +	}
> > > > > +	tach = mlxsw_reg_mfsm_rpm_get(mfsm_pl);
> > > > > +
> > > > > +	return sprintf(buf, "%u\n", (tach < mlxsw_hwmon->tach_min) ? 1 :
> > > > > +0); }
> > > >
> > > > Documentation/hwmon/sysfs-interface says:
> > > >
> > > > Alarms are direct indications read from the chips. The drivers do
> > > > NOT make comparisons of readings to thresholds. This allows
> > > > violations between readings to be caught and alarmed. The exact
> > > > definition of an alarm (for example, whether a threshold must be met
> > > > or must be exceeded to cause an alarm) is chip-dependent.
> > > >
> > > > Now, this is a fault, not an alarm. But does the same apply?
> > >
> > Yes, it does. There are no "soft" alarms / faults.
> > 
> > > Hi Andrew,
> > >
> > > Hardware provides minimum value for tachometer.
> > > Tachometer is considered as faulty in case it's below this value.
> > 
> > This is for user space to decide, not for the kernel.
> 
> Hi Guenter,
> 
> Do you suggest to expose provide fan{x}_min, instead of fan{x}_fault
> and give to user to compare fan{x}_input versus fan{x}_min for the
> fault decision?
> 

fanX_min only makes sense if programmed into or reported by the chip
or controller (that is what the attribute is for), usually to enable
the chip/controller to set an alarm. If the chip or controller does
not have a minimum speed register, the attribute should not exist,
and any decision based on a comparison between a minimum fan speed
and the actual fan speed is a user space problem.

I don't know what the tach_min calculation is about, but setting
it to the minimum of all tachometer speeds (or of all reported
minimums ?) is not the task of a hwmon driver. A hwmon driver
reports what it gets from hardware; the interpretation is up
to other parts of the system (eg userspace or the thermal
subsystem). That includes a software-based decision if an alarm
or fault should be reported or not.

> > 
> > > In case any tachometer is faulty, PWM according to the system
> > > requirements should be set to 100% until the fault
> > 
> > system requirements. Again, this is for user space to decide.
> 
> 
> Yes, user should decide in this case and I wanted to provide to user
> fan{x}_fault for this matter. But it could do it based on input and min
> attributes, of course.
> 
Note that "fault" and "alarm" do have distinct different meanings.
Many fan controllers can detect if a fan is faulty (eg no sensor
connected or it is deemed faulty) or if it just runs too slow.
The typical remedy is also different: A slow fan may just need
more pwm or voltage, a faulty fan needs to be replaced.

Guenter

> > 
> > > is not recovered (f.e. by physical replacing of bad unit).
> > > This is the motivation to expose fan{x}_fault in the way it's exposed.
> > >
> > > Thanks,
> > > Vadim.
> > >
> > > >
> > > >      Andrew

^ permalink raw reply

* [RFC PATCH v2 net-next 12/12] net: listify jited Generic XDP processing on x86_64
From: Edward Cree @ 2018-06-26 18:22 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem
In-Reply-To: <fa3d7e58-e7b6-ad0c-619f-824c25ed0d97@solarflare.com>

When JITing an eBPF program on x86_64, also JIT a list_func that calls the
 bpf_func in a loop.  Since this is a direct call, it should perform better
 than indirect-calling bpf_func in a loop.
Since getting the percpu 'redirect_info' variable is ugly and magic, pass
 a pointer into the list_func as a parameter instead of calculating it
 inside the loop.  This is safe because BPF execution is not preemptible,
 so the percpu variable can't get moved while we're using it.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 arch/x86/net/bpf_jit_comp.c | 164 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/filter.h      |  19 +++--
 kernel/bpf/core.c           |  18 +++--
 net/core/dev.c              |   5 +-
 4 files changed, 195 insertions(+), 11 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 2580cd2e98b1..3e06dd79adda 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1060,6 +1060,169 @@ struct x64_jit_data {
 	struct jit_context ctx;
 };
 
+static void try_do_jit_list(struct bpf_prog *bpf_prog,
+			    unsigned int (*bpf_func)(const void *ctx,
+						     const struct bpf_insn *insn))
+{
+	/* list_func takes three arguments:
+	 * struct list_head *list [RDI]
+	 * const struct bpf_insn *insn [RSI]
+	 * const struct redirect_info *percpu_ri [RDX]
+	 *
+	 * Layout of struct bpf_work on x86_64:
+	 * struct list_head {
+	 *	struct list_head *next; // 0x0
+	 *	struct list_head *prev; // 0x8
+	 * } list; // 0x0
+	 * void *ctx; 0x10
+	 * struct redirect_info {
+	 *	u32 ifindex; // 0x18
+	 *	u32 flags; // 0x1c
+	 *	struct bpf_map *map; // 0x20
+	 *	struct bpf_map *map_to_flush; // 0x28
+	 *	unsigned long   map_owner; // 0x30
+	 * } ri; // 0x18
+	 * unsigned long ret; // 0x38
+	 * (total size 0x40 bytes)
+	 *
+	 * Desired function body:
+	 * struct redirect_info *ri = percpu_ri; [R12]
+	 * struct bpf_work *work; [RBP]
+	 *
+	 * list_for_each_entry(work, list, list) {
+	 *	work->ret = (*bpf_func)(work->ctx, insn);
+	 *	work->ri = *ri;
+	 * }
+	 *
+	 * Assembly to emit:
+	 * ; save CSRs
+	 *	push %rbx
+	 *	push %rbp
+	 *	push %r12
+	 * ; stash pointer to redirect_info
+	 *	mov %rdx,%r12	; ri = percpu_ri
+	 * ; start list
+	 *	mov %rdi,%rbp	; head = list
+	 * next:		; while (true) {
+	 *	mov (%rbp),%rbx	; rbx = head->next
+	 *	cmp %rbx,%rdi	; if (rbx == list)
+	 *	je out		;	break
+	 *	mov %rbx,%rbp	; head = rbx
+	 *	push %rdi	; save list
+	 *	push %rsi	; save insn (is still arg2)
+	 * ; struct bpf_work *work = head (container_of, but list is first member)
+	 *	mov 0x10(%rbp),%rdi; arg1 = work->ctx
+	 *	callq bpf_func  ; rax = (*bpf_func)(work->ctx, insn)
+	 *	mov %rax,0x38(%rbp); work->ret = rax
+	 * ; work->ri = *ri
+	 *	mov (%r12),%rdx
+	 *	mov %rdx,0x18(%rbp)
+	 *	mov 0x8(%r12),%rdx
+	 *	mov %rdx,0x20(%rbp)
+	 *	mov 0x10(%r12),%rdx
+	 *	mov %rdx,0x28(%rbp)
+	 *	mov 0x18(%r12),%rdx
+	 *	mov %rdx,0x30(%rbp)
+	 *	pop %rsi	; restore insn
+	 *	pop %rdi	; restore list
+	 *	jmp next	; }
+	 * out:
+	 * ; restore CSRs and return
+	 *	pop %r12
+	 *	pop %rbp
+	 *	pop %rbx
+	 *	retq
+	 */
+	u8 *image, *prog, *from_to_out, *next;
+	struct bpf_binary_header *header;
+	int off, cnt = 0;
+	s64 jmp_offset;
+
+	/* Prog should be 81 bytes; let's round up to 128 */
+	header = bpf_jit_binary_alloc(128, &image, 1, jit_fill_hole);
+	prog = image;
+
+	/* push rbx */
+	EMIT1(0x53);
+	/* push rbp */
+	EMIT1(0x55);
+	/* push %r12 */
+	EMIT2(0x41, 0x54);
+	/* mov %rdx,%r12 */
+	EMIT3(0x49, 0x89, 0xd4);
+	/* mov %rdi,%rbp */
+	EMIT3(0x48, 0x89, 0xfd);
+	next = prog;
+	/* mov 0x0(%rbp),%rbx */
+	EMIT4(0x48, 0x8b, 0x5d, 0x00);
+	/* cmp %rbx,%rdi */
+	EMIT3(0x48, 0x39, 0xdf);
+	/* je out */
+	EMIT2(X86_JE, 0);
+	from_to_out = prog; /* record . to patch this jump later */
+	/* mov %rbx,%rbp */
+	EMIT3(0x48, 0x89, 0xdd);
+	/* push %rdi */
+	EMIT1(0x57);
+	/* push %rsi */
+	EMIT1(0x56);
+	/* mov 0x10(%rbp),%rdi */
+	EMIT4(0x48, 0x8b, 0x7d, 0x10);
+	/* e8 callq bpf_func */
+	jmp_offset = (u8 *)bpf_func - (prog + 5);
+	if (!is_simm32(jmp_offset)) {
+		pr_err("call out of range to BPF func %p from list image %p\n",
+		       bpf_func, image);
+		goto fail;
+	}
+	EMIT1_off32(0xE8, jmp_offset);
+	/* mov %rax,0x38(%rbp) */
+	EMIT4(0x48, 0x89, 0x45, 0x38);
+	/* mov (%r12),%rdx */
+	EMIT4(0x49, 0x8b, 0x14, 0x24);
+	/* mov %rdx,0x18(%rbp) */
+	EMIT4(0x48, 0x89, 0x55, 0x18);
+	for (off = 0x8; off < 0x20; off += 0x8) {
+		/* mov off(%r12),%rdx */
+		EMIT4(0x49, 0x8b, 0x54, 0x24);
+		EMIT1(off);
+		/* mov %rdx,0x18+off(%rbp) */
+		EMIT4(0x48, 0x89, 0x55, 0x18 + off);
+	}
+	/* pop %rsi */
+	EMIT1(0x5e);
+	/* pop %rdi */
+	EMIT1(0x5f);
+	/* jmp next */
+	jmp_offset = next - (prog + 2);
+	if (WARN_ON(!is_imm8(jmp_offset))) /* can't happen */
+		goto fail;
+	EMIT2(0xeb, jmp_offset);
+	/* out: */
+	jmp_offset = prog - from_to_out;
+	if (WARN_ON(!is_imm8(jmp_offset))) /* can't happen */
+		goto fail;
+	from_to_out[-1] = jmp_offset;
+	/* pop %r12 */
+	EMIT2(0x41, 0x5c);
+	/* pop %rbp */
+	EMIT1(0x5d);
+	/* pop %rbx */
+	EMIT1(0x5b);
+	/* retq */
+	EMIT1(0xc3);
+	/* If we were wrong about how much space we needed, scream and shout */
+	WARN_ON(cnt != 81);
+	if (bpf_jit_enable > 1)
+		bpf_jit_dump(0, cnt, 0, image);
+	bpf_jit_binary_lock_ro(header);
+	bpf_prog->list_func = (void *)image;
+	bpf_prog->jited_list = 1;
+	return;
+fail:
+	bpf_jit_binary_free(header);
+}
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 {
 	struct bpf_binary_header *header = NULL;
@@ -1176,6 +1339,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
 		prog->bpf_func = (void *)image;
 		prog->jited = 1;
 		prog->jited_len = proglen;
+		try_do_jit_list(prog, prog->bpf_func);
 	} else {
 		prog = orig_prog;
 	}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7d813034e286..ad1e75bf0991 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -517,7 +517,8 @@ struct bpf_prog {
 					    const struct bpf_insn *insn);
 	/* Takes a list of struct bpf_work */
 	void			(*list_func)(struct list_head *list,
-					     const struct bpf_insn *insn);
+					     const struct bpf_insn *insn,
+					     const struct redirect_info *percpu_ri);
 	/* Instructions for interpreter */
 	union {
 		struct sock_filter	insns[0];
@@ -532,7 +533,7 @@ struct sk_filter {
 };
 
 #define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
-#define BPF_LIST_PROG_RUN(filter, list) (*(filter)->list_func)(list, (filter)->insnsi)
+#define BPF_LIST_PROG_RUN(filter, list, percpu) (*(filter)->list_func)(list, (filter)->insnsi, percpu)
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
@@ -638,10 +639,11 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 }
 
 static __always_inline void bpf_list_prog_run_xdp(const struct bpf_prog *prog,
-						  struct list_head *list)
+						  struct list_head *list,
+						  const struct redirect_info *percpu_ri)
 {
 	/* Caller must hold rcu_read_lock(), as per bpf_prog_run_xdp(). */
-	BPF_LIST_PROG_RUN(prog, list);
+	BPF_LIST_PROG_RUN(prog, list, percpu_ri);
 }
 
 static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog)
@@ -756,6 +758,15 @@ bpf_jit_binary_hdr(const struct bpf_prog *fp)
 	return (void *)addr;
 }
 
+static inline struct bpf_binary_header *
+bpf_list_jit_binary_hdr(const struct bpf_prog *fp)
+{
+	unsigned long real_start = (unsigned long)fp->list_func;
+	unsigned long addr = real_start & PAGE_MASK;
+
+	return (void *)addr;
+}
+
 #ifdef CONFIG_ARCH_HAS_SET_MEMORY
 static inline int bpf_prog_check_pages_ro_single(const struct bpf_prog *fp)
 {
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c35da826cc3b..028be88c4af8 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -621,15 +621,22 @@ void bpf_jit_binary_free(struct bpf_binary_header *hdr)
  */
 void __weak bpf_jit_free(struct bpf_prog *fp)
 {
-	if (fp->jited) {
-		struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp);
+	struct bpf_binary_header *hdr;
 
+	if (fp->jited) {
+		hdr = bpf_jit_binary_hdr(fp);
 		bpf_jit_binary_unlock_ro(hdr);
 		bpf_jit_binary_free(hdr);
 
 		WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp));
 	}
 
+	if (fp->jited_list) {
+		hdr = bpf_list_jit_binary_hdr(fp);
+		bpf_jit_binary_unlock_ro(hdr);
+		bpf_jit_binary_free(hdr);
+	}
+
 	bpf_prog_unlock_free(fp);
 }
 
@@ -1358,13 +1365,13 @@ static u64 PROG_NAME_ARGS(stack_size)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5, \
 
 #define LIST_PROG_NAME(stack_size) __bpf_list_prog_run##stack_size
 #define DEFINE_BPF_LIST_PROG_RUN(stack_size) \
-static void LIST_PROG_NAME(stack_size)(struct list_head *list, const struct bpf_insn *insn) \
+static void LIST_PROG_NAME(stack_size)(struct list_head *list, const struct bpf_insn *insn, const struct redirect_info *percpu_ri) \
 { \
 	struct bpf_work *work; \
 \
 	list_for_each_entry(work, list, list) { \
 		work->ret = PROG_NAME(stack_size)(work->ctx, insn); \
-		work->ri = *this_cpu_ptr(&redirect_info); \
+		work->ri = *percpu_ri; \
 	} \
 }
 
@@ -1398,7 +1405,8 @@ EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
 #undef PROG_NAME_LIST
 #define PROG_NAME_LIST(stack_size) LIST_PROG_NAME(stack_size),
 static void (*list_interpreters[])(struct list_head *list,
-				   const struct bpf_insn *insn) = {
+				   const struct bpf_insn *insn,
+				   const struct redirect_info *percpu_ri) = {
 EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192)
 EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
 EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
diff --git a/net/core/dev.c b/net/core/dev.c
index 746112c22afd..7c1879045ef8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4214,6 +4214,7 @@ static void do_xdp_list_generic(struct bpf_prog *xdp_prog,
 				struct sk_buff_head *list,
 				struct sk_buff_head *pass_list)
 {
+	const struct redirect_info *percpu_ri = this_cpu_ptr(&redirect_info);
 	struct xdp_work (*xwa)[NAPI_POLL_WEIGHT], *xw;
 	struct bpf_work *bw;
 	struct sk_buff *skb;
@@ -4249,11 +4250,11 @@ static void do_xdp_list_generic(struct bpf_prog *xdp_prog,
 
 	if (xdp_prog->list_func && (xdp_prog->jited_list ||
 				    !xdp_prog->jited))
-		bpf_list_prog_run_xdp(xdp_prog, &xdp_list);
+		bpf_list_prog_run_xdp(xdp_prog, &xdp_list, percpu_ri);
 	else
 		list_for_each_entry(bw, &xdp_list, list) {
 			bw->ret = bpf_prog_run_xdp(xdp_prog, bw->ctx);
-			bw->ri = *this_cpu_ptr(&redirect_info);
+			bw->ri = *percpu_ri;
 		}
 
 	for (i = 0; i < n; i++) {

^ permalink raw reply related

* Re: [PATCH] selftests/net: Fix permissions for fib_tests.sh
From: Shuah Khan @ 2018-06-26 18:22 UTC (permalink / raw)
  To: Daniel Díaz, shuahkh, linux-kselftest, David S. Miller
  Cc: open list:NETWORKING [GENERAL], open list, Shuah Khan
In-Reply-To: <1529425232-25242-1-git-send-email-daniel.diaz@linaro.org>

On 06/19/2018 10:20 AM, Daniel Díaz wrote:
> fib_tests.sh became non-executable at some point. This is
> what happens:
>   selftests: net: fib_tests.sh: Warning: file fib_tests.sh is
>   not executable, correct this.
>   not ok 1..11 selftests: net: fib_tests.sh [FAIL]
> 
> Fixes: d69faad76584 ("selftests: fib_tests: Add prefix route tests with metric")
> 
> Signed-off-by: Daniel Díaz <daniel.diaz@linaro.org>
> ---
>  tools/testing/selftests/net/fib_tests.sh | 0
>  1 file changed, 0 insertions(+), 0 deletions(-)
>  mode change 100644 => 100755 tools/testing/selftests/net/fib_tests.sh
> 
> diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
> old mode 100644
> new mode 100755
> 

Hi David,

Are you planning to pick this up for 4.18-rc3 or later?

Acked-by: Shuah Khan <shuah@kernel.org>

thanks,
-- Shuah

^ permalink raw reply

* [RFC PATCH v2 net-next 11/12] net: listify Generic XDP processing, part 2
From: Edward Cree @ 2018-06-26 18:22 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem
In-Reply-To: <fa3d7e58-e7b6-ad0c-619f-824c25ed0d97@solarflare.com>

Adds listified versions of the eBPF interpreter functions, and uses them when
 the single func is not JITed.  If the single func is JITed (and the list
 func is not, which currently it never is), then use the single func since
 the cost of interpreting is probably much worse than the cost of the extra
 indirect calls.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/filter.h | 38 +++++++++++++++++++++++++++++---------
 kernel/bpf/core.c      | 26 ++++++++++++++++++++++++++
 net/core/dev.c         | 19 ++++++++-----------
 3 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 75db6cbf78a3..7d813034e286 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -477,6 +477,21 @@ struct bpf_binary_header {
 	u8 image[] __aligned(4);
 };
 
+struct redirect_info {
+	u32 ifindex;
+	u32 flags;
+	struct bpf_map *map;
+	struct bpf_map *map_to_flush;
+	unsigned long   map_owner;
+};
+
+struct bpf_work {
+	struct list_head list;
+	void *ctx;
+	struct redirect_info ri;
+	unsigned long ret;
+};
+
 struct bpf_prog {
 	u16			pages;		/* Number of allocated pages */
 	u16			jited:1,	/* Is our filter JIT'ed? */
@@ -488,7 +503,9 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				jited_list:1;	/* Is list func JIT'ed? */
+				/* 5 bits left */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
@@ -498,6 +515,9 @@ struct bpf_prog {
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
 	unsigned int		(*bpf_func)(const void *ctx,
 					    const struct bpf_insn *insn);
+	/* Takes a list of struct bpf_work */
+	void			(*list_func)(struct list_head *list,
+					     const struct bpf_insn *insn);
 	/* Instructions for interpreter */
 	union {
 		struct sock_filter	insns[0];
@@ -512,6 +532,7 @@ struct sk_filter {
 };
 
 #define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
+#define BPF_LIST_PROG_RUN(filter, list) (*(filter)->list_func)(list, (filter)->insnsi)
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
@@ -616,6 +637,13 @@ static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
 	return BPF_PROG_RUN(prog, xdp);
 }
 
+static __always_inline void bpf_list_prog_run_xdp(const struct bpf_prog *prog,
+						  struct list_head *list)
+{
+	/* Caller must hold rcu_read_lock(), as per bpf_prog_run_xdp(). */
+	BPF_LIST_PROG_RUN(prog, list);
+}
+
 static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog)
 {
 	return prog->len * sizeof(struct bpf_insn);
@@ -820,14 +848,6 @@ static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
 	return 0;
 }
 
-struct redirect_info {
-	u32 ifindex;
-	u32 flags;
-	struct bpf_map *map;
-	struct bpf_map *map_to_flush;
-	unsigned long   map_owner;
-};
-
 DECLARE_PER_CPU(struct redirect_info, redirect_info);
 
 /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a9e6c04d0f4a..c35da826cc3b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1356,6 +1356,18 @@ static u64 PROG_NAME_ARGS(stack_size)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5, \
 	return ___bpf_prog_run(regs, insn, stack); \
 }
 
+#define LIST_PROG_NAME(stack_size) __bpf_list_prog_run##stack_size
+#define DEFINE_BPF_LIST_PROG_RUN(stack_size) \
+static void LIST_PROG_NAME(stack_size)(struct list_head *list, const struct bpf_insn *insn) \
+{ \
+	struct bpf_work *work; \
+\
+	list_for_each_entry(work, list, list) { \
+		work->ret = PROG_NAME(stack_size)(work->ctx, insn); \
+		work->ri = *this_cpu_ptr(&redirect_info); \
+	} \
+}
+
 #define EVAL1(FN, X) FN(X)
 #define EVAL2(FN, X, Y...) FN(X) EVAL1(FN, Y)
 #define EVAL3(FN, X, Y...) FN(X) EVAL2(FN, Y)
@@ -1367,6 +1379,10 @@ EVAL6(DEFINE_BPF_PROG_RUN, 32, 64, 96, 128, 160, 192);
 EVAL6(DEFINE_BPF_PROG_RUN, 224, 256, 288, 320, 352, 384);
 EVAL4(DEFINE_BPF_PROG_RUN, 416, 448, 480, 512);
 
+EVAL6(DEFINE_BPF_LIST_PROG_RUN, 32, 64, 96, 128, 160, 192);
+EVAL6(DEFINE_BPF_LIST_PROG_RUN, 224, 256, 288, 320, 352, 384);
+EVAL4(DEFINE_BPF_LIST_PROG_RUN, 416, 448, 480, 512);
+
 EVAL6(DEFINE_BPF_PROG_RUN_ARGS, 32, 64, 96, 128, 160, 192);
 EVAL6(DEFINE_BPF_PROG_RUN_ARGS, 224, 256, 288, 320, 352, 384);
 EVAL4(DEFINE_BPF_PROG_RUN_ARGS, 416, 448, 480, 512);
@@ -1380,6 +1396,14 @@ EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
 EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
 };
 #undef PROG_NAME_LIST
+#define PROG_NAME_LIST(stack_size) LIST_PROG_NAME(stack_size),
+static void (*list_interpreters[])(struct list_head *list,
+				   const struct bpf_insn *insn) = {
+EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192)
+EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
+EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
+};
+#undef PROG_NAME_LIST
 #define PROG_NAME_LIST(stack_size) PROG_NAME_ARGS(stack_size),
 static u64 (*interpreters_args[])(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5,
 				  const struct bpf_insn *insn) = {
@@ -1472,8 +1496,10 @@ static void bpf_prog_select_func(struct bpf_prog *fp)
 	u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);
 
 	fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];
+	fp->list_func = list_interpreters[(round_up(stack_depth, 32) / 32) - 1];
 #else
 	fp->bpf_func = __bpf_prog_ret0_warn;
+	fp->list_func = NULL;
 #endif
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 22cbd5314d56..746112c22afd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4198,13 +4198,6 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(do_xdp_generic);
 
-struct bpf_work {
-	struct list_head list;
-	void *ctx;
-	struct redirect_info ri;
-	unsigned long ret;
-};
-
 struct xdp_work {
 	struct bpf_work w;
 	struct xdp_buff xdp;
@@ -4254,10 +4247,14 @@ static void do_xdp_list_generic(struct bpf_prog *xdp_prog,
 			list_add_tail(&xw->w.list, &xdp_list);
 	}
 
-	list_for_each_entry(bw, &xdp_list, list) {
-		bw->ret = bpf_prog_run_xdp(xdp_prog, bw->ctx);
-		bw->ri = *this_cpu_ptr(&redirect_info);
-	}
+	if (xdp_prog->list_func && (xdp_prog->jited_list ||
+				    !xdp_prog->jited))
+		bpf_list_prog_run_xdp(xdp_prog, &xdp_list);
+	else
+		list_for_each_entry(bw, &xdp_list, list) {
+			bw->ret = bpf_prog_run_xdp(xdp_prog, bw->ctx);
+			bw->ri = *this_cpu_ptr(&redirect_info);
+		}
 
 	for (i = 0; i < n; i++) {
 		xw = (*xwa) + i;

^ permalink raw reply related

* [RFC PATCH v2 net-next 10/12] net: listify Generic XDP processing, part 1
From: Edward Cree @ 2018-06-26 18:21 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem
In-Reply-To: <fa3d7e58-e7b6-ad0c-619f-824c25ed0d97@solarflare.com>

Deals with all the pre- and post-amble to the BPF program itself, which is
 still called one packet at a time.
Involves some fiddly percpu variables to cope with XDP_REDIRECT handling.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/filter.h |  10 +++
 net/core/dev.c         | 165 +++++++++++++++++++++++++++++++++++++++++++------
 net/core/filter.c      |  10 +--
 3 files changed, 156 insertions(+), 29 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 20f2659dd829..75db6cbf78a3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -820,6 +820,16 @@ static inline int __xdp_generic_ok_fwd_dev(struct sk_buff *skb,
 	return 0;
 }
 
+struct redirect_info {
+	u32 ifindex;
+	u32 flags;
+	struct bpf_map *map;
+	struct bpf_map *map_to_flush;
+	unsigned long   map_owner;
+};
+
+DECLARE_PER_CPU(struct redirect_info, redirect_info);
+
 /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
  * same cpu context. Further for best results no more than a single map
  * for the do_redirect/do_flush pair should be used. This limitation is
diff --git a/net/core/dev.c b/net/core/dev.c
index 11f80d4502b9..22cbd5314d56 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4015,15 +4015,14 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 	return rxqueue;
 }
 
-static u32 netif_receive_generic_xdp(struct sk_buff *skb,
-				     struct xdp_buff *xdp,
-				     struct bpf_prog *xdp_prog)
+static u32 netif_receive_generic_xdp_prepare(struct sk_buff *skb,
+					     struct xdp_buff *xdp,
+					     void **orig_data,
+					     void **orig_data_end,
+					     u32 *mac_len)
 {
 	struct netdev_rx_queue *rxqueue;
-	void *orig_data, *orig_data_end;
-	u32 metalen, act = XDP_DROP;
-	int hlen, off;
-	u32 mac_len;
+	int hlen;
 
 	/* Reinjected packets coming from act_mirred or similar should
 	 * not get XDP generic processing.
@@ -4054,19 +4053,35 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	/* The XDP program wants to see the packet starting at the MAC
 	 * header.
 	 */
-	mac_len = skb->data - skb_mac_header(skb);
-	hlen = skb_headlen(skb) + mac_len;
-	xdp->data = skb->data - mac_len;
+	*mac_len = skb->data - skb_mac_header(skb);
+	hlen = skb_headlen(skb) + *mac_len;
+	xdp->data = skb->data - *mac_len;
 	xdp->data_meta = xdp->data;
 	xdp->data_end = xdp->data + hlen;
 	xdp->data_hard_start = skb->data - skb_headroom(skb);
-	orig_data_end = xdp->data_end;
-	orig_data = xdp->data;
+	*orig_data_end = xdp->data_end;
+	*orig_data = xdp->data;
 
 	rxqueue = netif_get_rxqueue(skb);
 	xdp->rxq = &rxqueue->xdp_rxq;
+	/* is actually XDP_ABORTED, but here we use it to mean "go ahead and
+	 * run the xdp program"
+	 */
+	return 0;
+do_drop:
+	kfree_skb(skb);
+	return XDP_DROP;
+}
 
-	act = bpf_prog_run_xdp(xdp_prog, xdp);
+static u32 netif_receive_generic_xdp_finish(struct sk_buff *skb,
+					    struct xdp_buff *xdp,
+					    struct bpf_prog *xdp_prog,
+					    void *orig_data,
+					    void *orig_data_end,
+					    u32 act, u32 mac_len)
+{
+	u32 metalen;
+	int off;
 
 	off = xdp->data - orig_data;
 	if (off > 0)
@@ -4082,7 +4097,6 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	if (off != 0) {
 		skb_set_tail_pointer(skb, xdp->data_end - xdp->data);
 		skb->len -= off;
-
 	}
 
 	switch (act) {
@@ -4102,7 +4116,6 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 		trace_xdp_exception(skb->dev, xdp_prog, act);
 		/* fall through */
 	case XDP_DROP:
-	do_drop:
 		kfree_skb(skb);
 		break;
 	}
@@ -4110,6 +4123,23 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	return act;
 }
 
+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct xdp_buff *xdp,
+				     struct bpf_prog *xdp_prog)
+{
+	void *orig_data, *orig_data_end;
+	u32 act, mac_len;
+
+	act = netif_receive_generic_xdp_prepare(skb, xdp, &orig_data,
+						&orig_data_end, &mac_len);
+	if (act)
+		return act;
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
+	return netif_receive_generic_xdp_finish(skb, xdp, xdp_prog,
+						orig_data, orig_data_end, act,
+						mac_len);
+}
+
 /* When doing generic XDP we have to bypass the qdisc layer and the
  * network taps in order to match in-driver-XDP behavior.
  */
@@ -4168,6 +4198,93 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(do_xdp_generic);
 
+struct bpf_work {
+	struct list_head list;
+	void *ctx;
+	struct redirect_info ri;
+	unsigned long ret;
+};
+
+struct xdp_work {
+	struct bpf_work w;
+	struct xdp_buff xdp;
+	struct sk_buff *skb;
+	void *orig_data;
+	void *orig_data_end;
+	u32 mac_len;
+};
+
+/* Storage area for per-packet Generic XDP metadata */
+static DEFINE_PER_CPU(struct xdp_work[NAPI_POLL_WEIGHT], xdp_work);
+
+static void do_xdp_list_generic(struct bpf_prog *xdp_prog,
+				struct sk_buff_head *list,
+				struct sk_buff_head *pass_list)
+{
+	struct xdp_work (*xwa)[NAPI_POLL_WEIGHT], *xw;
+	struct bpf_work *bw;
+	struct sk_buff *skb;
+	LIST_HEAD(xdp_list);
+	int n = 0, i, err;
+	u32 act;
+
+	if (!xdp_prog) {
+		/* PASS everything */
+		skb_queue_splice_init(list, pass_list);
+		return;
+	}
+
+	xwa = this_cpu_ptr(&xdp_work);
+
+	skb_queue_for_each(skb, list) {
+		if (WARN_ON(n > NAPI_POLL_WEIGHT))
+			 /* checked in caller, can't happen */
+			 return;
+		xw = (*xwa) + n++;
+		memset(xw, 0, sizeof(*xw));
+		xw->skb = skb;
+		xw->w.ctx = &xw->xdp;
+		act = netif_receive_generic_xdp_prepare(skb, &xw->xdp,
+							&xw->orig_data,
+							&xw->orig_data_end,
+							&xw->mac_len);
+		if (act)
+			xw->w.ret = act;
+		else
+			list_add_tail(&xw->w.list, &xdp_list);
+	}
+
+	list_for_each_entry(bw, &xdp_list, list) {
+		bw->ret = bpf_prog_run_xdp(xdp_prog, bw->ctx);
+		bw->ri = *this_cpu_ptr(&redirect_info);
+	}
+
+	for (i = 0; i < n; i++) {
+		xw = (*xwa) + i;
+		act = netif_receive_generic_xdp_finish(xw->skb, &xw->xdp,
+						       xdp_prog, xw->orig_data,
+						       xw->orig_data_end,
+						       xw->w.ret, xw->mac_len);
+		if (act != XDP_PASS) {
+			switch (act) {
+			case XDP_REDIRECT:
+				*this_cpu_ptr(&redirect_info) = xw->w.ri;
+				err = xdp_do_generic_redirect(xw->skb->dev,
+							      xw->skb, &xw->xdp,
+							      xdp_prog);
+				if (err) /* free and drop */
+					kfree_skb(xw->skb);
+				break;
+			case XDP_TX:
+				generic_xdp_tx(xw->skb, xdp_prog);
+				break;
+			}
+		} else {
+			__skb_queue_tail(pass_list, xw->skb);
+		}
+	}
+}
+
 static int netif_rx_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -4878,7 +4995,7 @@ static void netif_receive_skb_list_internal(struct sk_buff_head *list)
 {
 	/* Two sublists so we can go back and forth between them */
 	struct sk_buff_head sublist, sublist2;
-	struct bpf_prog *xdp_prog = NULL;
+	struct bpf_prog *xdp_prog = NULL, *curr_prog = NULL;
 	struct sk_buff *skb;
 
 	__skb_queue_head_init(&sublist);
@@ -4893,15 +5010,23 @@ static void netif_receive_skb_list_internal(struct sk_buff_head *list)
 
 	__skb_queue_head_init(&sublist2);
 	if (static_branch_unlikely(&generic_xdp_needed_key)) {
+		struct sk_buff_head sublist3;
+		int n = 0;
+
+		__skb_queue_head_init(&sublist3);
 		preempt_disable();
 		rcu_read_lock();
 		while ((skb = __skb_dequeue(&sublist)) != NULL) {
 			xdp_prog = rcu_dereference(skb->dev->xdp_prog);
-			if (do_xdp_generic(xdp_prog, skb) != XDP_PASS)
-				/* Dropped, don't add to sublist */
-				continue;
-			__skb_queue_tail(&sublist2, skb);
+			if (++n >= NAPI_POLL_WEIGHT || xdp_prog != curr_prog) {
+				do_xdp_list_generic(curr_prog, &sublist3, &sublist2);
+				__skb_queue_head_init(&sublist3);
+				n = 0;
+				curr_prog = xdp_prog;
+			}
+			__skb_queue_tail(&sublist3, skb);
 		}
+		do_xdp_list_generic(curr_prog, &sublist3, &sublist2);
 		rcu_read_unlock();
 		preempt_enable();
 		/* Move all packets onto first sublist */
diff --git a/net/core/filter.c b/net/core/filter.c
index e7f12e9f598c..c96aff14d76a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2039,15 +2039,7 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
-struct redirect_info {
-	u32 ifindex;
-	u32 flags;
-	struct bpf_map *map;
-	struct bpf_map *map_to_flush;
-	unsigned long   map_owner;
-};
-
-static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+DEFINE_PER_CPU(struct redirect_info, redirect_info);
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {

^ permalink raw reply related

* [RFC PATCH v2 net-next 09/12] net: don't bother calling list RX functions on empty lists
From: Edward Cree @ 2018-06-26 18:21 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem
In-Reply-To: <fa3d7e58-e7b6-ad0c-619f-824c25ed0d97@solarflare.com>

Generally the check should be very cheap, as the sk_buff_head is in cache.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c      | 8 ++++++--
 net/ipv4/ip_input.c | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f0eb00e9fb57..11f80d4502b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4780,7 +4780,8 @@ static void __netif_receive_skb_list(struct sk_buff_head *list)
 	while ((skb = __skb_dequeue(list)) != NULL) {
 		if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) {
 			/* Handle the previous sublist */
-			__netif_receive_skb_list_core(&sublist, pfmemalloc);
+			if (!skb_queue_empty(&sublist))
+				__netif_receive_skb_list_core(&sublist, pfmemalloc);
 			pfmemalloc = !pfmemalloc;
 			/* See comments in __netif_receive_skb */
 			if (pfmemalloc)
@@ -4792,7 +4793,8 @@ static void __netif_receive_skb_list(struct sk_buff_head *list)
 		__skb_queue_tail(&sublist, skb);
 	}
 	/* Handle the last sublist */
-	__netif_receive_skb_list_core(&sublist, pfmemalloc);
+	if (!skb_queue_empty(&sublist))
+		__netif_receive_skb_list_core(&sublist, pfmemalloc);
 	/* Restore pflags */
 	if (pfmemalloc)
 		memalloc_noreclaim_restore(noreclaim_flag);
@@ -4968,6 +4970,8 @@ void netif_receive_skb_list(struct sk_buff_head *list)
 {
 	struct sk_buff *skb;
 
+	if (skb_queue_empty(list))
+		return;
 	skb_queue_for_each(skb, list)
 		trace_netif_receive_skb_list_entry(skb);
 	netif_receive_skb_list_internal(list);
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 63d4dfdb1766..65a5ed9e4b3c 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -570,6 +570,8 @@ static void ip_sublist_rcv(struct sk_buff_head *list, struct net_device *dev,
 {
 	struct sk_buff_head sublist;
 
+	if (skb_queue_empty(list))
+		return;
 	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
 		     list, &sublist, dev, NULL, ip_rcv_finish);
 	ip_list_rcv_finish(net, NULL, &sublist);

^ permalink raw reply related

* [RFC PATCH v2 net-next 08/12] net: ipv4: listify ip_rcv_finish
From: Edward Cree @ 2018-06-26 18:20 UTC (permalink / raw)
  To: linux-net-drivers, netdev; +Cc: davem
In-Reply-To: <fa3d7e58-e7b6-ad0c-619f-824c25ed0d97@solarflare.com>

ip_rcv_finish_core(), if it does not drop, sets skb->dst by either early
 demux or route lookup.  The last step, calling dst_input(skb), is left to
 the caller; in the listified case, we split to form sublists with a common
 dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
The next step in listification would thus be to add a list_input() method
 to struct dst_entry.

Early demux is an indirect call based on iph->protocol; this is another
 opportunity for listification which is not taken here (it would require
 slicing up ip_rcv_finish_core() to allow splitting on protocol changes).

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/ipv4/ip_input.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 53 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7a8af8ff3f07..63d4dfdb1766 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -307,7 +307,8 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	return true;
 }
 
-static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+static int ip_rcv_finish_core(struct net *net, struct sock *sk,
+			      struct sk_buff *skb)
 {
 	const struct iphdr *iph = ip_hdr(skb);
 	int (*edemux)(struct sk_buff *skb);
@@ -393,7 +394,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	return dst_input(skb);
+	return NET_RX_SUCCESS;
 
 drop:
 	kfree_skb(skb);
@@ -405,6 +406,15 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 	goto drop;
 }
 
+static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	int ret = ip_rcv_finish_core(net, sk, skb);
+
+	if (ret != NET_RX_DROP)
+		ret = dst_input(skb);
+	return ret;
+}
+
 /*
  * 	Main IP Receive routine.
  */
@@ -515,16 +525,54 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 }
 
+static void ip_sublist_rcv_finish(struct sk_buff_head *list)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(list)) != NULL)
+		dst_input(skb);
+}
+
+static void ip_list_rcv_finish(struct net *net, struct sock *sk,
+			       struct sk_buff_head *list)
+{
+	struct dst_entry *curr_dst = NULL;
+	struct sk_buff_head sublist;
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&sublist);
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		struct dst_entry *dst;
+
+		if (ip_rcv_finish_core(net, sk, skb) == NET_RX_DROP)
+			continue;
+
+		dst = skb_dst(skb);
+		if (skb_queue_empty(&sublist)) {
+			curr_dst = dst;
+		} else if (curr_dst != dst) {
+			/* dispatch old sublist */
+			ip_sublist_rcv_finish(&sublist);
+			/* start new sublist */
+			__skb_queue_head_init(&sublist);
+			curr_dst = dst;
+		}
+		/* add to current sublist */
+		__skb_queue_tail(&sublist, skb);
+	}
+	/* dispatch final sublist */
+	ip_sublist_rcv_finish(&sublist);
+}
+
 static void ip_sublist_rcv(struct sk_buff_head *list, struct net_device *dev,
 			   struct net *net)
 {
 	struct sk_buff_head sublist;
-	struct sk_buff *skb;
 
 	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
 		     list, &sublist, dev, NULL, ip_rcv_finish);
-	while ((skb = __skb_dequeue(&sublist)) != NULL)
-		ip_rcv_finish(net, NULL, skb);
+	ip_list_rcv_finish(net, NULL, &sublist);
 }
 
 /* Receive a list of IP packets */

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox