Netdev List


* Re: [PATCH net] net/rds: Fix info leak in rds6_inc_info_copy()
From: David Miller @ 2019-08-28  4:08 UTC (permalink / raw)
  To: ka-cheong.poon; +Cc: netdev, santosh.shilimkar, rds-devel
In-Reply-To: <1566812352-27332-1-git-send-email-ka-cheong.poon@oracle.com>

From: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Date: Mon, 26 Aug 2019 02:39:12 -0700

> The rds6_inc_info_copy() function has a couple struct members which
> are leaking stack information.  The ->tos field should hold actual
> information and the ->flags field needs to be zeroed out.
> 
> Fixes: 3eb450367d08 ("rds: add type of service(tos) infrastructure")
> Fixes: b7ff8b1036f0 ("rds: Extend RDS API for IPv6 support")
> Reported-by: 黄ID蝴蝶 <butterflyhuangxx@gmail.com>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

Applied and queued up for -stable.
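
A minimal userspace sketch of the kind of leak described above, assuming a hypothetical record type (struct demo_info and demo_info_fill are illustrative names, not the RDS structures or the actual fix). It shows one common way to make sure that a field such as flags never carries stale stack bytes to the caller: clear the whole record first, then set the fields that hold real data.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a record that is copied out to user space. */
struct demo_info {
    uint64_t seq;
    uint32_t len;
    uint8_t  tos;
    uint8_t  flags;
    /* the compiler may add padding after flags */
};

/* Fill every byte of *out before it is handed to the caller. */
static void demo_info_fill(struct demo_info *out, uint64_t seq,
                           uint32_t len, uint8_t tos)
{
    memset(out, 0, sizeof(*out)); /* clears flags and any padding */
    out->seq = seq;
    out->len = len;
    out->tos = tos;               /* real value instead of stack garbage */
}
```

Clearing the whole record is an assumption made for the sketch; assigning each member explicitly achieves the same for named fields, though it leaves padding untouched.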


* Re: [net-next 00/15][pull request] 100GbE Intel Wired LAN Driver Updates 2019-08-26
From: Jakub Kicinski @ 2019-08-28  4:09 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, nhorman, sassmann
In-Reply-To: <20190827163832.8362-1-jeffrey.t.kirsher@intel.com>

On Tue, 27 Aug 2019 09:38:17 -0700, Jeff Kirsher wrote:
> This series contains updates to ice driver only.

Looks clear from uAPI perspective. It does mix fixes with -next, 
but I guess that's your call.

Code-wise changes like this are perhaps the low-light:

@@ -2105,7 +2108,10 @@ void ice_trigger_sw_intr(struct ice_hw *hw, struct ice_q_vector *q_vector)
  * @ring: Tx ring to be stopped
  * @txq_meta: Meta data of Tx ring to be stopped
  */
-static int
+#ifndef CONFIG_PCI_IOV
+static
+#endif /* !CONFIG_PCI_IOV */
+int
 ice_vsi_stop_tx_ring(struct ice_vsi *vsi, enum ice_disq_rst_src rst_src,
 		     u16 rel_vmvf_num, struct ice_ring *ring,
 		     struct ice_txq_meta *txq_meta)


* Re: [net-next 00/15][pull request] 100GbE Intel Wired LAN Driver Updates 2019-08-26
From: Jeff Kirsher @ 2019-08-28  4:17 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, nhorman, sassmann
In-Reply-To: <20190827210928.576c5fef@cakuba.netronome.com>

On Tue, 2019-08-27 at 21:09 -0700, Jakub Kicinski wrote:
> On Tue, 27 Aug 2019 09:38:17 -0700, Jeff Kirsher wrote:
> > This series contains updates to ice driver only.
> 
> Looks clear from uAPI perspective. It does mix fixes with -next, 
> but I guess that's your call.

Yeah, I always debate about sending the fixes to net, but many of them do
not apply cleanly or at all to the previous kernel version since we are
actively adding new features and functionality to this driver.

Once this device gets released, I will be more concerned about getting
fixes into older kernels.

> 
> Code-wise changes like this are perhaps the low-light:
> 
> @@ -2105,7 +2108,10 @@ void ice_trigger_sw_intr(struct ice_hw *hw, struct
> ice_q_vector *q_vector)
>   * @ring: Tx ring to be stopped
>   * @txq_meta: Meta data of Tx ring to be stopped
>   */
> -static int
> +#ifndef CONFIG_PCI_IOV
> +static
> +#endif /* !CONFIG_PCI_IOV */
> +int
>  ice_vsi_stop_tx_ring(struct ice_vsi *vsi, enum ice_disq_rst_src rst_src,
>  		     u16 rel_vmvf_num, struct ice_ring *ring,
>  		     struct ice_txq_meta *txq_meta)



* Re: [PATCH net v3 0/2] r8152: fix side effect
From: Jakub Kicinski @ 2019-08-28  4:17 UTC (permalink / raw)
  To: Hayes Wang; +Cc: netdev, nic_swsd, linux-kernel
In-Reply-To: <1394712342-15778-320-Taiwan-albertk@realtek.com>

On Wed, 28 Aug 2019 09:51:40 +0800, Hayes Wang wrote:
> v3:
> Update the commit message for patch #1.
> 
> v2:
> Replace patch #2 with "r8152: remove calling netif_napi_del".
> 
> v1:
> The commit 0ee1f4734967 ("r8152: napi hangup fix after disconnect")
> added a check to avoid using napi_disable after netif_napi_del. However,
> the commit ffa9fec30ca0 ("r8152: set RTL8152_UNPLUG only for real
> disconnection") made the check useless.
> 
> Therefore, I revert commit 0ee1f4734967 ("r8152: napi hangup fix
> after disconnect") first, and add another patch to fix it.

LGTM, seems like if we were to add a Fixes tag it'd point to the

ffa9fec30ca0 ("r8152: set RTL8152_UNPLUG only for real disconnection")

commit, then? So only net needs it, v5.2 is fine.


* Re: [PATCH V3 net 1/2] openvswitch: Properly set L4 keys on "later" IP fragments
From: Gregory Rose @ 2019-08-28  4:19 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: Linux Kernel Network Developers, Joe Stringer
In-Reply-To: <CAOrHB_DXXSoe9rjamp_OSxDonsqTADrbV4GdUdct=uq_eOXN-Q@mail.gmail.com>


On 8/27/2019 5:33 PM, Pravin Shelar wrote:
> On Tue, Aug 27, 2019 at 7:58 AM Greg Rose <gvrose8192@gmail.com> wrote:
>> When IP fragments are reassembled before being sent to conntrack, the
>> key from the last fragment is used.  Unless there are reordering
>> issues, the last fragment received will not contain the L4 ports, so the
>> key for the reassembled datagram won't contain them.  This patch updates
>> the key once we have a reassembled datagram.
>>
>> The handle_fragments() function works on L3 headers so we pull the L3/L4
>> flow key update code from key_extract into a new function
>> 'key_extract_l3l4'.  Then we add another new function
>> ovs_flow_key_update_l3l4() and export it so that it is accessible by
>> handle_fragments() for conntrack packet reassembly.
>>
>> Co-authored by: Justin Pettit <jpettit@ovn.org>
>> Signed-off-by: Greg Rose <gvrose8192@gmail.com>
>>
> Looks good to me.
>
> Acked-by: Pravin B Shelar <pshelar@ovn.org>
>
> Thanks,
> Pravin.

Thanks Pravin.

I missed a dash in the Co-authored-by line.  If that could be fixed up 
on commit then good, otherwise I can resend.

- Greg


* Re: [PATCH net] tcp: remove empty skb from write queue in error cases
From: David Miller @ 2019-08-28  4:38 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, soheil, ncardwell, eric.dumazet, jbaron, rutsky
In-Reply-To: <20190826161915.81676-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Mon, 26 Aug 2019 09:19:15 -0700

> Vladimir Rutsky reported stuck TCP sessions after memory pressure
> events. Edge Trigger epoll() user would never receive an EPOLLOUT
> notification allowing them to retry a sendmsg().
> 
> Jason tested the case of sk_stream_alloc_skb() returning NULL,
> but there are other paths that could lead both sendmsg() and sendpage()
> to return -1 (EAGAIN), with an empty skb queued on the write queue.
> 
> This patch makes sure we remove this empty skb so that
> Jason's code can detect that the queue is empty, and
> call sk->sk_write_space(sk) accordingly.
> 
> Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jason Baron <jbaron@akamai.com>
> Reported-by: Vladimir Rutsky <rutsky@google.com>

Applied and queued up for -stable.


* Re: [PATCH] net/hamradio/6pack: Fix the size of a sk_buff used in 'sp_bump()'
From: David Miller @ 2019-08-28  4:39 UTC (permalink / raw)
  To: christophe.jaillet; +Cc: ajk, linux-hams, netdev, linux-kernel, kernel-janitors
In-Reply-To: <20190826190209.16795-1-christophe.jaillet@wanadoo.fr>

From: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date: Mon, 26 Aug 2019 21:02:09 +0200

> We 'allocate' 'count' bytes here. In fact, 'dev_alloc_skb' already adds some
> extra space for padding, so a bit more is allocated.
> 
> However, we use 1 byte for the KISS command, then copy 'count' bytes, so
> count+1 bytes.
> 
> Explicitly allocate and use 1 more byte to be safe.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
> ---
> This patch should be safe, but it may not be the correct way to fix the
> "buffer overflow". Maybe the allocated size is correct and we should have:
>    memcpy(ptr, sp->cooked_buf + 1, count - 1);
> or
>    memcpy(ptr, sp->cooked_buf + 1, sp->rcount);
> 
> I've not dug deep enough to understand the link between 'rcount' and
> how 'cooked_buf' is used.

I'm trying to figure out how this code works too.

Why are they skipping over the first byte?  Is that to avoid the
command byte?  If so, then using sp->rcount as the memcpy length makes
sense.

Why is the caller subtracting 2 from the RX buffer count when
calculating sp->rcount?  This makes the situation even more confusing.
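
The size arithmetic in the quoted patch description can be sketched in plain userspace C. This is only an illustration under stated assumptions: demo_build_frame is a hypothetical helper, malloc() stands in for the skb allocation, and the command byte value 0 is a placeholder. It does not answer the open questions about the skipped first byte or sp->rcount.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build a frame made of one command byte followed by
 * count payload bytes. The buffer therefore needs count + 1 bytes.
 */
static unsigned char *demo_build_frame(const unsigned char *payload,
                                       size_t count, size_t *frame_len)
{
    unsigned char *buf = malloc(count + 1); /* 1 command byte + payload */

    if (!buf)
        return NULL;
    buf[0] = 0;                      /* placeholder command byte */
    memcpy(buf + 1, payload, count); /* stays inside the count + 1 bytes */
    *frame_len = count + 1;
    return buf;
}
```

Allocating only count bytes here would make the memcpy() write one byte past the end of the buffer, which is the situation the patch description is worried about.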



* Re: [PATCH net-next v5 0/6] net: dsa: mv88e6xxx: Peridot/Topaz SERDES changes
From: David Miller @ 2019-08-28  4:42 UTC (permalink / raw)
  To: marek.behun; +Cc: vivien.didelot, netdev, andrew, f.fainelli, olteanv
In-Reply-To: <20190826213155.14685-1-marek.behun@nic.cz>

From: Marek Behún <marek.behun@nic.cz>
Date: Mon, 26 Aug 2019 23:31:49 +0200

> this is the fifth version of changes for the Topaz/Peridot family of
> switches. The patches apply on net-next.
> Changes since v4:
>  - added Reviewed-by and Tested-by tags on first 2 patches, the others
>    are affected by changes in patch 3/6, so I did not add
>    the tags, except for 5/6, which is just macro renaming
>  - patch 3 was changed: the serdes_get_lane returns 0 on success (lane
>    was discovered), -ENODEV if no lane is present on the port, and
>    other error if other error occurred. Lane is put into a pointer of
>    type u8
>  - patches 4 and 6 were affected by this (error detecting from
>    serdes_get_lane)
>  - Andrew's complaint about the two additional parameters
>    (allow_over_2500 and make_cmode_writable) was addressed, by Vivien's
>    advice: I put a new method into chip operations structure, named
>    port_set_cmode_writable. This is called from mv88e6xxx_port_setup_mac
>    just before port_set_cmode. The method is implemented for Topaz.
>    The check if cmodes over 2500 should be allowed on given port is now
>    done in the specific port_set_cmode() that requires it, thus the
>    allow_over_2500 argument is not needed
> 
> Again, tested on Turris Mox with Peridot, Topaz, and Peridot + Topaz.

Series applied, thank you.
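
The return convention described for serdes_get_lane in the changelog above (0 on success, -ENODEV when the port has no lane, another negative errno on other failures, lane written through a u8 pointer) might look roughly like the following userspace sketch. The function names, the port numbers and the caller's decision to treat -ENODEV as "nothing to do" are assumptions for illustration, not the mv88e6xxx code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical lookup: only ports 9 and 10 have a SERDES lane. */
static int demo_serdes_get_lane(int port, uint8_t *lane)
{
    if (port < 0)
        return -EINVAL;        /* some other error */
    if (port == 9 || port == 10) {
        *lane = (uint8_t)port; /* lane number is made up for the demo */
        return 0;              /* lane was discovered */
    }
    return -ENODEV;            /* no lane on this port */
}

/*
 * Returns 1 if a lane was found and handled, 0 if the port has no lane,
 * or a negative errno for any other failure.
 */
static int demo_serdes_power_up(int port)
{
    uint8_t lane;
    int err = demo_serdes_get_lane(port, &lane);

    if (err == -ENODEV)
        return 0;   /* not an error for the caller */
    if (err)
        return err; /* propagate real failures */
    (void)lane;     /* a real driver would program the lane here */
    return 1;
}
```

Returning the lane through a pointer keeps the full return value free for error codes, so a valid lane number can never be confused with an error.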


* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-28  4:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, Steven Rostedt, David S. Miller,
	Daniel Borkmann, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <CALCETrVbPPPr=BdPAx=tJKxD3oLXP4OVSgCYrB_E4vb6idELow@mail.gmail.com>

On Tue, Aug 27, 2019 at 05:55:41PM -0700, Andy Lutomirski wrote:
> 
> I was hoping for something in Documentation/admin-guide, not in a
> changelog that's hard to find.

eventually yes.

> >
> > > Changing the capability that some existing operation requires could
> > > break existing programs.  The old capability may need to be accepted
> > > as well.
> >
> > As far as I can see there is no ABI breakage. Please point out
> > which line of the patch may break it.
> 
> As a more or less arbitrary selection:
> 
>  void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>  {
>         if (!bpf_prog_kallsyms_candidate(fp) ||
> -           !capable(CAP_SYS_ADMIN))
> +           !capable(CAP_BPF))
>                 return;
> 
> Before your patch, a task with CAP_SYS_ADMIN could do this.  Now it
> can't.  Per the usual Linux definition of "ABI break", this is an ABI
> break if and only if someone actually did this in a context where they
> have CAP_SYS_ADMIN but not all capabilities.  How confident are you
> that no one does things like this?

Yes. I'm confident that apps don't drop everything and
leave cap_sys_admin only before doing bpf() syscall, since it would
break their own use of networking.
Hence I'm not going to do the cap_syslog-like "deprecated" message mess
because of this unfounded concern.
If I turn out to be wrong we will add this "deprecated mess" later.

> 
> From the previous discussion, you want to make progress toward solving
> a lot of problems with CAP_BPF.  One of them was making BPF
> firewalling more generally useful. By making CAP_BPF grant the ability
> to read kernel memory, you will make administrators much more nervous
> to grant CAP_BPF. 

Andy, was your email hacked?
I explained several times that in this proposal 
CAP_BPF _and_ CAP_TRACING _both_ are necessary to read kernel memory.
CAP_BPF alone is _not enough_.

> Similarly, and correct me if I'm wrong, most of
> these capabilities are primarily or only useful for tracing, so I
> don't see why users without CAP_TRACING should get them.
> bpf_trace_printk(), in particular, even has "trace" in its name :)
> 
> Also, if a task has CAP_TRACING, it's expected to be able to trace the
> system -- that's the whole point.  Why shouldn't it be able to use BPF
> to trace the system better?

CAP_TRACING shouldn't be able to do BPF because BPF is not tracing only.

> > For example:
> > BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
> > {
> >         int ret;
> >
> >         ret = probe_kernel_read(dst, unsafe_ptr, size);
> >         if (unlikely(ret < 0))
> >                 memset(dst, 0, size);
> >
> >         return ret;
> > }
> >
> > All of BPF (including prototype of bpf_probe_read) is controlled by CAP_BPF.
> > But the kernel primitive it's using (probe_kernel_read) is controlled by CAP_TRACING.
> > Hence a task needs _both_ CAP_BPF and CAP_TRACING to attach and run bpf program
> > that uses bpf_probe_read.
> >
> > Similar with all other kernel code that BPF helpers may call directly or indirectly.
> > If there is a way for bpf program to call into piece of code controlled by CAP_TRACING
> > such helper would need CAP_BPF and CAP_TRACING.
> > If bpf helper calls into something that may mangle networking packet
> > such helper would need both CAP_BPF and CAP_NET_ADMIN to execute.
> 
> Why do you want to require CAP_BPF to call into functions like
> bpf_probe_read()?  I understand why you want to limit access to bpf,
> but I think that CAP_TRACING should be sufficient to allow the tracing
> parts of BPF.  After all, a lot of your concerns, especially the ones
> involving speculation, don't really apply to users with CAP_TRACING --
> users with CAP_TRACING can read kernel memory with or without bpf.

Let me try again to explain the concept...

Imagine AUDI logo with 4 circles.
They partially intersect.
The first circle is CAP_TRACING. Second is CAP_BPF. Third is CAP_NET_ADMIN.
Fourth - up to your imagination :)

These capabilities subdivide different parts of root privileges.
CAP_NET_ADMIN is useful on its own.
Just as CAP_TRACING that is useful for perf, ftrace, and probably
other tracing things that don't need bpf.

'bpftrace' is using a lot of tracing and a lot of bpf features,
but not all of bpf and not all tracing.
It falls into intersection of CAP_BPF and CAP_TRACING.

probe_kernel_read is a tracing mechanism.
perf can use it without bpf.
Hence it should be controlled by CAP_TRACING.

bpf_probe_read is a wrapper of that mechanism.
It's a place where BPF and TRACING circles intersect.
A task needs to have both CAP_BPF (to load the program)
and CAP_TRACING (to read kernel memory) to execute bpf_probe_read() helper.
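
The intersecting-circles idea can be sketched with a plain bitmask. This is a toy model under stated assumptions: the DEMO_CAP_* bits and the demo_* functions are made-up names, and CAP_BPF and CAP_TRACING are only proposals in this thread. The point it shows is that a helper sitting in the overlap of two circles needs both bits, not either one.

```c
#include <stdbool.h>

/* Hypothetical capability bits, not the kernel's numbering. */
enum {
    DEMO_CAP_BPF       = 1u << 0,
    DEMO_CAP_TRACING   = 1u << 1,
    DEMO_CAP_NET_ADMIN = 1u << 2,
};

/* True only if every bit in needed is present in caps. */
static bool demo_has_all(unsigned int caps, unsigned int needed)
{
    return (caps & needed) == needed;
}

/* Overlap of the BPF and TRACING circles. */
static bool demo_may_use_probe_read(unsigned int caps)
{
    return demo_has_all(caps, DEMO_CAP_BPF | DEMO_CAP_TRACING);
}

/* Overlap of the BPF and NET_ADMIN circles. */
static bool demo_may_mangle_packets(unsigned int caps)
{
    return demo_has_all(caps, DEMO_CAP_BPF | DEMO_CAP_NET_ADMIN);
}
```

In this model a task holding only DEMO_CAP_BPF passes neither check, which mirrors the statement that CAP_BPF alone is not enough to read kernel memory.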

> > > > @@ -2080,7 +2083,10 @@ static int bpf_prog_test_run(const union bpf_attr *attr,
> > > >         struct bpf_prog *prog;
> > > >         int ret = -ENOTSUPP;
> > > >
> > > > -       if (!capable(CAP_SYS_ADMIN))
> > > > +       if (!capable(CAP_NET_ADMIN) || !capable(CAP_BPF))
> > > > +               /* test_run callback is available for networking progs only.
> > > > +                * Add cap_bpf_tracing() above when tracing progs become runable.
> > > > +                */
> > >
> > > I think test_run should probably be CAP_SYS_ADMIN forever.  test_run
> > > is the only way that one can run a bpf program and call helper
> > > functions via the program if one doesn't have permission to attach the
> > > program.
> >
> > Since CAP_BPF + CAP_NET_ADMIN allow attach. It means that a task
> > with these two permissions will have programs running anyway.
> > (traffic will flow through netdev, socket events will happen, etc)
> > Hence no reason to disallow running program via test_run.
> >
> 
> test_run allows fully controlled inputs, in a context where a program
> can trivially flush caches, mistrain branch predictors, etc first.  It
> seems to me that, if a JITted bpf program contains an exploitable
> speculation gadget (MDS, Spectre v1, RSB, or anything else), 

speaking of MDS... I already asked you to help investigate its
applicability with existing bpf exposure. Are you going to do that?

> it will
> be *much* easier to exploit it using test_run than using normal
> network traffic.  Similarly, normal network traffic will have network
> headers that are valid enough to have caused the BPF program to be
> invoked in the first place.  test_run can inject arbitrary garbage.

Please take a look at Jann's var1 exploit. Was it hard to run bpf prog
in controlled environment without test_run command ?



* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-28  4:47 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Andy Lutomirski, Alexei Starovoitov, Kees Cook,
	LSM List, James Morris, Jann Horn, Peter Zijlstra,
	David S. Miller, Daniel Borkmann, Network Development, bpf,
	kernel-team, Linux API
In-Reply-To: <20190828123041.c0c90c15865897461ee819a2@kernel.org>

On Wed, Aug 28, 2019 at 12:30:41PM +0900, Masami Hiramatsu wrote:
> > kprobes can be created in the tracefs filesystem (which is separate from
> > debugfs, tracefs just gets automatically mounted
> > in /sys/kernel/debug/tracing when debugfs is mounted) from the
> > kprobe_events file. /sys/kernel/tracing is just the tracefs
> > directory without debugfs, and was created specifically to allow
> > tracing to be accessed without opening up the can of worms in debugfs.
> 
> I like the CAP_TRACING for tracefs. Can we make the tracefs itself
> check the CAP_TRACING and call file_ops? or each tracefs file-ops
> handlers must check it?

Thanks for the feedback.
I'll hack a prototype of CAP_TRACING for perf bits that I understand
and you folks will be able to use it in ftrace when initial support lands.
imo the question above is an implementation detail that you can resolve later.
I see it as a followup to initial CAP_TRACING drop.



* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-28  4:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, Steven Rostedt, David S. Miller,
	Daniel Borkmann, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <CALCETrVVQs1s27y8fB17JtQi-VzTq1YZPTPy3k=fKhQB1X-KKA@mail.gmail.com>

On Tue, Aug 27, 2019 at 07:00:40PM -0700, Andy Lutomirski wrote:
> 
> Let me put this a bit differently. Part of the point is that
> CAP_TRACING should allow a user or program to trace without being able
> to corrupt the system. CAP_BPF as you’ve proposed it *can* likely
> crash the system.

Really? I'm still waiting for your example where bpf+kprobe crashes the system...



* Re: [PATCH v1 net-next 0/4] Add EHL and TGL PCI info and PCI ID
From: David Miller @ 2019-08-28  4:59 UTC (permalink / raw)
  To: weifeng.voon
  Cc: mcoquelin.stm32, netdev, linux-kernel, joabreu, peppe.cavallaro,
	andrew, alexandre.torgue, boon.leong.ong
In-Reply-To: <1566869891-29239-1-git-send-email-weifeng.voon@intel.com>

From: Voon Weifeng <weifeng.voon@intel.com>
Date: Tue, 27 Aug 2019 09:38:07 +0800

> In order to keep PCI info simple and neat, this patch series
> introduces a three-level hierarchy of structs. The first layer is the
> intel_mgbe_common_data struct, which keeps all Intel common configuration.
> The second layer is xxx_common_data, which keeps the configuration for the
> different Intel microarchitectures, e.g. tgl, ehl. The third layer is
> configuration that is tied to the PCI ID only, based on speed and
> RGMII/SGMII interface.
> 
> EHL and TGL will also have a higher system clock, which is 200MHz.

Series applied.


* Re: [PATCH] powerpc/kmcent2: update the ethernet devices' phy properties
From: Scott Wood @ 2019-08-28  4:19 UTC (permalink / raw)
  To: Valentin Longchamp, Madalin-cristian Bucur
  Cc: linuxppc-dev@lists.ozlabs.org, galak@kernel.crashing.org,
	netdev@vger.kernel.org
In-Reply-To: <CADYrJDxsQ3H7b_BHOfmfTNb1OuXt+vzTg4k8Goj8tKPaaOMz_g@mail.gmail.com>

On Thu, 2019-08-08 at 23:09 +0200, Valentin Longchamp wrote:
> On Tue, 30 Jul 2019 at 11:44, Madalin-cristian Bucur
> <madalin.bucur@nxp.com> wrote:
> > 
> > > -----Original Message-----
> > > 
> > > > On Sun, 14 Jul 2019 at 22:05, Valentin Longchamp
> > > > <valentin@longchamp.me> wrote:
> > > > > 
> > > > > Change all phy-connection-type properties to phy-mode that are
> > > > > better
> > > > > supported by the fman driver.
> > > > > 
> > > > > Use the more readable fixed-link node for the 2 sgmii links.
> > > > > 
> > > > > Change the RGMII link to rgmii-id as the clock delays are added by
> > > > > the
> > > > > phy.
> > > > > 
> > > > > Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
> > > 
> > > I don't see any other uses of phy-mode in arch/powerpc/boot/dts/fsl, and
> > > I see
> > > lots of phy-connection-type with fman.  Madalin, does this patch look
> > > OK?
> > > 
> > > -Scott
> > 
> > Hi,
> > 
> > we are using "phy-connection-type" not "phy-mode" for the NXP (former
> > Freescale)
> > DPAA platforms. While the two seem to be interchangeable ("phy-mode" seems
> > to be
> > more recent, looking at the device tree bindings), the driver code in
> > Linux seems
> > to use one or the other, not both so one should stick with the variant the
> > driver
> > is using. To make things more complex, there may be dependencies in
> > bootloaders,
> > I see code in u-boot using only "phy-connection-type" or only "phy-mode".
> > 
> > I'd leave "phy-connection-type" as is.
> 
> So I have finally had time to have a look and now I understand what
> happens. You are right, there are bootloader dependencies: u-boot
> calls fdt_fixup_phy_connection() that somehow in our case adds (or
> changes if already in the device tree) the phy-connection-type
> property to a wrong value ! By having a phy-mode in the device tree,
> that is not changed by u-boot and by chance picked up by the kernel
> fman driver (of_get_phy_mode() ) over phy-connection-type, the below
> patch fixes it for us.
> 
> I agree with you, it's not correct to have both phy-connection-type
> and phy-mode. Ideally, u-boot on the board should be reworked so that
> it does not perform the above wrong fixup. However, in an "unfixed"
> .dtb (I have disabled fdt_fixup_phy_connection), the device tree in
> the end only has either phy-connection-type or phy-mode, according to
> what was chosen in the .dts file. And the fman driver works well with
> both (thanks to the call to of_get_phy_mode() ). I would therefore
> argue that even if all other DPAA platforms use phy-connection-type,
> phy-mode is valid as well. (Furthermore we already have hundreds of
> such boards in the field and we don't really support "remote" u-boot
> update, so the u-boot fix is going to be difficult for us to pull).
> 
> Valentin

Madalin, are you OK with the patch given this explanation?

-Scott




* [PATCH bpf-next 0/2] nfp: bpf: add simple map op cache
From: Jakub Kicinski @ 2019-08-28  5:36 UTC (permalink / raw)
  To: alexei.starovoitov, daniel
  Cc: netdev, oss-drivers, jaco.gericke, Jakub Kicinski

Hi!

This set adds a small batching and cache mechanism to the driver.
Map dumps require two operations per element - get next, and
lookup. Each of those needs a round trip to the device, and on
a loaded system scheduling out and in of the dumping process.
This set makes the driver request a number of entries at the same
time, and if no operation which would modify the map happens
from the host side those entries are used to serve lookup
requests for up to 250us, at which point they are considered
stale.

This set has been measured to provide almost 4x dumping speed
improvement, Jaco says:

OLD dump times
    500 000 elements: 26.1s
      1 000 000 elements: 54.5s

NEW dump times
    500 000 elements: 7.6s
      1 000 000 elements: 16.5s
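
For readers skimming the cover letter, the caching idea can be sketched in userspace C as follows. This is a simplified model, not the driver code: the demo_* names and the fixed u32 key/value layout are assumptions, the caller passes the current time explicitly, and locking plus the tracking of operations in flight are left out. The batch size of 4 is taken from patch 2/2 and the 250us lifetime from the text above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CACHE_ENTRIES 4          /* entries requested per get_next */
#define DEMO_CACHE_TTL_NS  250000ULL  /* 250us until entries are stale */

struct demo_entry {
    uint32_t key;
    uint32_t value;
};

struct demo_cache {
    struct demo_entry entries[DEMO_CACHE_ENTRIES];
    unsigned int count;   /* number of valid entries */
    uint64_t expires_ns;  /* absolute expiry time */
};

/* Store a batch returned by the device and start the staleness timer. */
static void demo_cache_fill(struct demo_cache *c,
                            const struct demo_entry *batch,
                            unsigned int n, uint64_t now_ns)
{
    if (n > DEMO_CACHE_ENTRIES)
        n = DEMO_CACHE_ENTRIES;
    memcpy(c->entries, batch, n * sizeof(*batch));
    c->count = n;
    c->expires_ns = now_ns + DEMO_CACHE_TTL_NS;
}

/* Called for any operation that may modify the map (update, delete). */
static void demo_cache_invalidate(struct demo_cache *c)
{
    c->count = 0;
}

/* Serve a lookup from the cache; false means "ask the device". */
static bool demo_cache_lookup(struct demo_cache *c, uint32_t key,
                              uint64_t now_ns, uint32_t *value)
{
    unsigned int i;

    if (c->expires_ns < now_ns) {
        demo_cache_invalidate(c); /* entries are stale */
        return false;
    }
    for (i = 0; i < c->count; i++) {
        if (c->entries[i].key == key) {
            *value = c->entries[i].value;
            return true;
        }
    }
    return false;
}
```

A miss simply falls back to the normal round trip to the device, so the cache can only save work and never has to be complete.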

Jakub Kicinski (2):
  nfp: bpf: rework MTU checking
  nfp: bpf: add simple map op cache

 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 187 ++++++++++++++++--
 drivers/net/ethernet/netronome/nfp/bpf/fw.h   |   1 +
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  33 ++++
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  24 +++
 .../net/ethernet/netronome/nfp/bpf/offload.c  |   3 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h  |   2 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |   9 +-
 7 files changed, 239 insertions(+), 20 deletions(-)

-- 
2.21.0


* [PATCH bpf-next 1/2] nfp: bpf: rework MTU checking
From: Jakub Kicinski @ 2019-08-28  5:36 UTC (permalink / raw)
  To: alexei.starovoitov, daniel
  Cc: netdev, oss-drivers, jaco.gericke, Jakub Kicinski, Quentin Monnet
In-Reply-To: <20190828053629.28658-1-jakub.kicinski@netronome.com>

If control channel MTU is too low to support map operations a warning
will be printed. This is not enough, we want to make sure probe fails
in such scenario, as this would clearly be a faulty configuration.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c     | 10 +++++++---
 drivers/net/ethernet/netronome/nfp/bpf/main.c     | 15 +++++++++++++++
 drivers/net/ethernet/netronome/nfp/bpf/main.h     |  1 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h      |  2 +-
 .../net/ethernet/netronome/nfp/nfp_net_common.c   |  9 +--------
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
index bc9850e4ec5e..fcf880c82f3f 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
@@ -267,11 +267,15 @@ int nfp_bpf_ctrl_getnext_entry(struct bpf_offloaded_map *offmap,
 				     key, NULL, 0, next_key, NULL);
 }
 
+unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf)
+{
+	return max(nfp_bpf_cmsg_map_req_size(bpf, 1),
+		   nfp_bpf_cmsg_map_reply_size(bpf, 1));
+}
+
 unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf)
 {
-	return max3((unsigned int)NFP_NET_DEFAULT_MTU,
-		    nfp_bpf_cmsg_map_req_size(bpf, 1),
-		    nfp_bpf_cmsg_map_reply_size(bpf, 1));
+	return max(NFP_NET_DEFAULT_MTU, nfp_bpf_ctrl_cmsg_min_mtu(bpf));
 }
 
 void nfp_bpf_ctrl_msg_rx(struct nfp_app *app, struct sk_buff *skb)
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 1c9fb11470df..2b1773ed3de9 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -415,6 +415,20 @@ static void nfp_bpf_ndo_uninit(struct nfp_app *app, struct net_device *netdev)
 	bpf_offload_dev_netdev_unregister(bpf->bpf_dev, netdev);
 }
 
+static int nfp_bpf_start(struct nfp_app *app)
+{
+	struct nfp_app_bpf *bpf = app->priv;
+
+	if (app->ctrl->dp.mtu < nfp_bpf_ctrl_cmsg_min_mtu(bpf)) {
+		nfp_err(bpf->app->cpp,
+			"ctrl channel MTU below min required %u < %u\n",
+			app->ctrl->dp.mtu, nfp_bpf_ctrl_cmsg_min_mtu(bpf));
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int nfp_bpf_init(struct nfp_app *app)
 {
 	struct nfp_app_bpf *bpf;
@@ -488,6 +502,7 @@ const struct nfp_app_type app_bpf = {
 
 	.init		= nfp_bpf_init,
 	.clean		= nfp_bpf_clean,
+	.start		= nfp_bpf_start,
 
 	.check_mtu	= nfp_bpf_check_mtu,
 
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 57d6ff51e980..f4802036eb42 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -564,6 +564,7 @@ nfp_bpf_goto_meta(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
 
 void *nfp_bpf_relo_for_vnic(struct nfp_prog *nfp_prog, struct nfp_bpf_vnic *bv);
 
+unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf);
 unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf);
 long long int
 nfp_bpf_ctrl_alloc_map(struct nfp_app_bpf *bpf, struct bpf_map *map);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 5d6c3738b494..250f510b1d21 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -66,7 +66,7 @@
 #define NFP_NET_MAX_DMA_BITS	40
 
 /* Default size for MTU and freelist buffer sizes */
-#define NFP_NET_DEFAULT_MTU		1500
+#define NFP_NET_DEFAULT_MTU		1500U
 
 /* Maximum number of bytes prepended to a packet */
 #define NFP_NET_MAX_PREPEND		64
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 6f97b554f7da..61aabffc8888 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -4116,14 +4116,7 @@ int nfp_net_init(struct nfp_net *nn)
 
 	/* Set default MTU and Freelist buffer size */
 	if (!nfp_net_is_data_vnic(nn) && nn->app->ctrl_mtu) {
-		if (nn->app->ctrl_mtu <= nn->max_mtu) {
-			nn->dp.mtu = nn->app->ctrl_mtu;
-		} else {
-			if (nn->app->ctrl_mtu != NFP_APP_CTRL_MTU_MAX)
-				nn_warn(nn, "app requested MTU above max supported %u > %u\n",
-					nn->app->ctrl_mtu, nn->max_mtu);
-			nn->dp.mtu = nn->max_mtu;
-		}
+		nn->dp.mtu = min(nn->app->ctrl_mtu, nn->max_mtu);
 	} else if (nn->max_mtu < NFP_NET_DEFAULT_MTU) {
 		nn->dp.mtu = nn->max_mtu;
 	} else {
-- 
2.21.0



* [PATCH bpf-next 2/2] nfp: bpf: add simple map op cache
From: Jakub Kicinski @ 2019-08-28  5:36 UTC (permalink / raw)
  To: alexei.starovoitov, daniel
  Cc: netdev, oss-drivers, jaco.gericke, Jakub Kicinski, Quentin Monnet
In-Reply-To: <20190828053629.28658-1-jakub.kicinski@netronome.com>

Each get_next and lookup call requires a round trip to the device.
However, the device is capable of giving us a few entries back,
instead of just one.

In this patch we ask for a small yet reasonable number of entries
(4) on every get_next call, and on subsequent get_next/lookup calls
check this little cache for a hit. The cache is only kept for 250us,
and is invalidated on every operation which may modify the map
(e.g. delete or update call). Note that operations may be performed
simultaneously, so we have to keep track of operations in flight.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 179 +++++++++++++++++-
 drivers/net/ethernet/netronome/nfp/bpf/fw.h   |   1 +
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  18 ++
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  23 +++
 .../net/ethernet/netronome/nfp/bpf/offload.c  |   3 +
 5 files changed, 215 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
index fcf880c82f3f..0e2db6ea79e9 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
@@ -6,6 +6,7 @@
 #include <linux/bug.h>
 #include <linux/jiffies.h>
 #include <linux/skbuff.h>
+#include <linux/timekeeping.h>
 
 #include "../ccm.h"
 #include "../nfp_app.h"
@@ -175,29 +176,151 @@ nfp_bpf_ctrl_reply_val(struct nfp_app_bpf *bpf, struct cmsg_reply_map_op *reply,
 	return &reply->data[bpf->cmsg_key_sz * (n + 1) + bpf->cmsg_val_sz * n];
 }
 
+static bool nfp_bpf_ctrl_op_cache_invalidate(enum nfp_ccm_type op)
+{
+	return op == NFP_CCM_TYPE_BPF_MAP_UPDATE ||
+	       op == NFP_CCM_TYPE_BPF_MAP_DELETE;
+}
+
+static bool nfp_bpf_ctrl_op_cache_capable(enum nfp_ccm_type op)
+{
+	return op == NFP_CCM_TYPE_BPF_MAP_LOOKUP ||
+	       op == NFP_CCM_TYPE_BPF_MAP_GETNEXT;
+}
+
+static bool nfp_bpf_ctrl_op_cache_fill(enum nfp_ccm_type op)
+{
+	return op == NFP_CCM_TYPE_BPF_MAP_GETFIRST ||
+	       op == NFP_CCM_TYPE_BPF_MAP_GETNEXT;
+}
+
+static unsigned int
+nfp_bpf_ctrl_op_cache_get(struct nfp_bpf_map *nfp_map, enum nfp_ccm_type op,
+			  const u8 *key, u8 *out_key, u8 *out_value,
+			  u32 *cache_gen)
+{
+	struct bpf_map *map = &nfp_map->offmap->map;
+	struct nfp_app_bpf *bpf = nfp_map->bpf;
+	unsigned int i, count, n_entries;
+	struct cmsg_reply_map_op *reply;
+
+	n_entries = nfp_bpf_ctrl_op_cache_fill(op) ? bpf->cmsg_cache_cnt : 1;
+
+	spin_lock(&nfp_map->cache_lock);
+	*cache_gen = nfp_map->cache_gen;
+	if (nfp_map->cache_blockers)
+		n_entries = 1;
+
+	if (nfp_bpf_ctrl_op_cache_invalidate(op))
+		goto exit_block;
+	if (!nfp_bpf_ctrl_op_cache_capable(op))
+		goto exit_unlock;
+
+	if (!nfp_map->cache)
+		goto exit_unlock;
+	if (nfp_map->cache_to < ktime_get_ns())
+		goto exit_invalidate;
+
+	reply = (void *)nfp_map->cache->data;
+	count = be32_to_cpu(reply->count);
+
+	for (i = 0; i < count; i++) {
+		void *cached_key;
+
+		cached_key = nfp_bpf_ctrl_reply_key(bpf, reply, i);
+		if (memcmp(cached_key, key, map->key_size))
+			continue;
+
+		if (op == NFP_CCM_TYPE_BPF_MAP_LOOKUP)
+			memcpy(out_value, nfp_bpf_ctrl_reply_val(bpf, reply, i),
+			       map->value_size);
+		if (op == NFP_CCM_TYPE_BPF_MAP_GETNEXT) {
+			if (i + 1 == count)
+				break;
+
+			memcpy(out_key,
+			       nfp_bpf_ctrl_reply_key(bpf, reply, i + 1),
+			       map->key_size);
+		}
+
+		n_entries = 0;
+		goto exit_unlock;
+	}
+	goto exit_unlock;
+
+exit_block:
+	nfp_map->cache_blockers++;
+exit_invalidate:
+	dev_consume_skb_any(nfp_map->cache);
+	nfp_map->cache = NULL;
+exit_unlock:
+	spin_unlock(&nfp_map->cache_lock);
+	return n_entries;
+}
+
+static void
+nfp_bpf_ctrl_op_cache_put(struct nfp_bpf_map *nfp_map, enum nfp_ccm_type op,
+			  struct sk_buff *skb, u32 cache_gen)
+{
+	bool blocker, filler;
+
+	blocker = nfp_bpf_ctrl_op_cache_invalidate(op);
+	filler = nfp_bpf_ctrl_op_cache_fill(op);
+	if (blocker || filler) {
+		u64 to = 0;
+
+		if (filler)
+			to = ktime_get_ns() + NFP_BPF_MAP_CACHE_TIME_NS;
+
+		spin_lock(&nfp_map->cache_lock);
+		if (blocker) {
+			nfp_map->cache_blockers--;
+			nfp_map->cache_gen++;
+		}
+		if (filler && !nfp_map->cache_blockers &&
+		    nfp_map->cache_gen == cache_gen) {
+			nfp_map->cache_to = to;
+			swap(nfp_map->cache, skb);
+		}
+		spin_unlock(&nfp_map->cache_lock);
+	}
+
+	dev_consume_skb_any(skb);
+}
+
 static int
 nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
 		      u8 *key, u8 *value, u64 flags, u8 *out_key, u8 *out_value)
 {
 	struct nfp_bpf_map *nfp_map = offmap->dev_priv;
+	unsigned int n_entries, reply_entries, count;
 	struct nfp_app_bpf *bpf = nfp_map->bpf;
 	struct bpf_map *map = &offmap->map;
 	struct cmsg_reply_map_op *reply;
 	struct cmsg_req_map_op *req;
 	struct sk_buff *skb;
+	u32 cache_gen;
 	int err;
 
 	/* FW messages have no space for more than 32 bits of flags */
 	if (flags >> 32)
 		return -EOPNOTSUPP;
 
+	/* Handle op cache */
+	n_entries = nfp_bpf_ctrl_op_cache_get(nfp_map, op, key, out_key,
+					      out_value, &cache_gen);
+	if (!n_entries)
+		return 0;
+
 	skb = nfp_bpf_cmsg_map_req_alloc(bpf, 1);
-	if (!skb)
-		return -ENOMEM;
+	if (!skb) {
+		err = -ENOMEM;
+		goto err_cache_put;
+	}
 
 	req = (void *)skb->data;
 	req->tid = cpu_to_be32(nfp_map->tid);
-	req->count = cpu_to_be32(1);
+	req->count = cpu_to_be32(n_entries);
 	req->flags = cpu_to_be32(flags);
 
 	/* Copy inputs */
@@ -207,16 +330,38 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
 		memcpy(nfp_bpf_ctrl_req_val(bpf, req, 0), value,
 		       map->value_size);
 
-	skb = nfp_ccm_communicate(&bpf->ccm, skb, op,
-				  nfp_bpf_cmsg_map_reply_size(bpf, 1));
-	if (IS_ERR(skb))
-		return PTR_ERR(skb);
+	skb = nfp_ccm_communicate(&bpf->ccm, skb, op, 0);
+	if (IS_ERR(skb)) {
+		err = PTR_ERR(skb);
+		goto err_cache_put;
+	}
+
+	if (skb->len < sizeof(*reply)) {
+		cmsg_warn(bpf, "cmsg drop - type 0x%02x too short %d!\n",
+			  op, skb->len);
+		err = -EIO;
+		goto err_free;
+	}
 
 	reply = (void *)skb->data;
+	count = be32_to_cpu(reply->count);
 	err = nfp_bpf_ctrl_rc_to_errno(bpf, &reply->reply_hdr);
+	/* FW responds with message sized to hold the good entries,
+	 * plus one extra entry if there was an error.
+	 */
+	reply_entries = count + !!err;
+	if (n_entries > 1 && count)
+		err = 0;
 	if (err)
 		goto err_free;
 
+	if (skb->len != nfp_bpf_cmsg_map_reply_size(bpf, reply_entries)) {
+		cmsg_warn(bpf, "cmsg drop - type 0x%02x too short %d for %d entries!\n",
+			  op, skb->len, reply_entries);
+		err = -EIO;
+		goto err_free;
+	}
+
 	/* Copy outputs */
 	if (out_key)
 		memcpy(out_key, nfp_bpf_ctrl_reply_key(bpf, reply, 0),
@@ -225,11 +370,13 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
 		memcpy(out_value, nfp_bpf_ctrl_reply_val(bpf, reply, 0),
 		       map->value_size);
 
-	dev_consume_skb_any(skb);
+	nfp_bpf_ctrl_op_cache_put(nfp_map, op, skb, cache_gen);
 
 	return 0;
 err_free:
 	dev_kfree_skb_any(skb);
+err_cache_put:
+	nfp_bpf_ctrl_op_cache_put(nfp_map, op, NULL, cache_gen);
 	return err;
 }
 
@@ -275,7 +422,21 @@ unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf)
 
 unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf)
 {
-	return max(NFP_NET_DEFAULT_MTU, nfp_bpf_ctrl_cmsg_min_mtu(bpf));
+	return max3(NFP_NET_DEFAULT_MTU,
+		    nfp_bpf_cmsg_map_req_size(bpf, NFP_BPF_MAP_CACHE_CNT),
+		    nfp_bpf_cmsg_map_reply_size(bpf, NFP_BPF_MAP_CACHE_CNT));
+}
+
+unsigned int nfp_bpf_ctrl_cmsg_cache_cnt(struct nfp_app_bpf *bpf)
+{
+	unsigned int mtu, req_max, reply_max, entry_sz;
+
+	mtu = bpf->app->ctrl->dp.mtu;
+	entry_sz = bpf->cmsg_key_sz + bpf->cmsg_val_sz;
+	req_max = (mtu - sizeof(struct cmsg_req_map_op)) / entry_sz;
+	reply_max = (mtu - sizeof(struct cmsg_reply_map_op)) / entry_sz;
+
+	return min3(req_max, reply_max, NFP_BPF_MAP_CACHE_CNT);
 }
 
 void nfp_bpf_ctrl_msg_rx(struct nfp_app *app, struct sk_buff *skb)
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/fw.h b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
index 06c4286bd79e..a83a0ad5e27d 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/fw.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
@@ -24,6 +24,7 @@ enum bpf_cap_tlv_type {
 	NFP_BPF_CAP_TYPE_QUEUE_SELECT	= 5,
 	NFP_BPF_CAP_TYPE_ADJUST_TAIL	= 6,
 	NFP_BPF_CAP_TYPE_ABI_VERSION	= 7,
+	NFP_BPF_CAP_TYPE_CMSG_MULTI_ENT	= 8,
 };
 
 struct nfp_bpf_cap_tlv_func {
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 2b1773ed3de9..8f732771d3fa 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -299,6 +299,14 @@ nfp_bpf_parse_cap_adjust_tail(struct nfp_app_bpf *bpf, void __iomem *value,
 	return 0;
 }
 
+static int
+nfp_bpf_parse_cap_cmsg_multi_ent(struct nfp_app_bpf *bpf, void __iomem *value,
+				 u32 length)
+{
+	bpf->cmsg_multi_ent = true;
+	return 0;
+}
+
 static int
 nfp_bpf_parse_cap_abi_version(struct nfp_app_bpf *bpf, void __iomem *value,
 			      u32 length)
@@ -375,6 +383,11 @@ static int nfp_bpf_parse_capabilities(struct nfp_app *app)
 							  length))
 				goto err_release_free;
 			break;
+		case NFP_BPF_CAP_TYPE_CMSG_MULTI_ENT:
+			if (nfp_bpf_parse_cap_cmsg_multi_ent(app->priv, value,
+							     length))
+				goto err_release_free;
+			break;
 		default:
 			nfp_dbg(cpp, "unknown BPF capability: %d\n", type);
 			break;
@@ -426,6 +439,11 @@ static int nfp_bpf_start(struct nfp_app *app)
 		return -EINVAL;
 	}
 
+	if (bpf->cmsg_multi_ent)
+		bpf->cmsg_cache_cnt = nfp_bpf_ctrl_cmsg_cache_cnt(bpf);
+	else
+		bpf->cmsg_cache_cnt = 1;
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index f4802036eb42..fac9c6f9e197 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -99,6 +99,7 @@ enum pkt_vec {
  * @maps_neutral:	hash table of offload-neutral maps (on pointer)
  *
  * @abi_version:	global BPF ABI version
+ * @cmsg_cache_cnt:	number of entries to read for caching
  *
  * @adjust_head:	adjust head capability
  * @adjust_head.flags:		extra flags for adjust head
@@ -124,6 +125,7 @@ enum pkt_vec {
  * @pseudo_random:	FW initialized the pseudo-random machinery (CSRs)
  * @queue_select:	BPF can set the RX queue ID in packet vector
  * @adjust_tail:	BPF can simply trunc packet size for adjust tail
+ * @cmsg_multi_ent:	FW can pack multiple map entries in a single cmsg
  */
 struct nfp_app_bpf {
 	struct nfp_app *app;
@@ -134,6 +136,8 @@ struct nfp_app_bpf {
 	unsigned int cmsg_key_sz;
 	unsigned int cmsg_val_sz;
 
+	unsigned int cmsg_cache_cnt;
+
 	struct list_head map_list;
 	unsigned int maps_in_use;
 	unsigned int map_elems_in_use;
@@ -169,6 +173,7 @@ struct nfp_app_bpf {
 	bool pseudo_random;
 	bool queue_select;
 	bool adjust_tail;
+	bool cmsg_multi_ent;
 };
 
 enum nfp_bpf_map_use {
@@ -183,11 +188,21 @@ struct nfp_bpf_map_word {
 	unsigned char non_zero_update	:1;
 };
 
+#define NFP_BPF_MAP_CACHE_CNT		4U
+#define NFP_BPF_MAP_CACHE_TIME_NS	(250 * 1000)
+
 /**
  * struct nfp_bpf_map - private per-map data attached to BPF maps for offload
  * @offmap:	pointer to the offloaded BPF map
  * @bpf:	back pointer to bpf app private structure
  * @tid:	table id identifying map on datapath
+ *
+ * @cache_lock:	protects @cache_blockers, @cache_to, @cache
+ * @cache_blockers:	number of ops in flight which block caching
+ * @cache_gen:	counter incremented by every blocker on exit
+ * @cache_to:	time when cache will no longer be valid (ns)
+ * @cache:	skb with cached response
+ *
  * @l:		link on the nfp_app_bpf->map_list list
  * @use_map:	map of how the value is used (in 4B chunks)
  */
@@ -195,6 +210,13 @@ struct nfp_bpf_map {
 	struct bpf_offloaded_map *offmap;
 	struct nfp_app_bpf *bpf;
 	u32 tid;
+
+	spinlock_t cache_lock;
+	u32 cache_blockers;
+	u32 cache_gen;
+	u64 cache_to;
+	struct sk_buff *cache;
+
 	struct list_head l;
 	struct nfp_bpf_map_word use_map[];
 };
@@ -566,6 +588,7 @@ void *nfp_bpf_relo_for_vnic(struct nfp_prog *nfp_prog, struct nfp_bpf_vnic *bv);
 
 unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf);
 unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf);
+unsigned int nfp_bpf_ctrl_cmsg_cache_cnt(struct nfp_app_bpf *bpf);
 long long int
 nfp_bpf_ctrl_alloc_map(struct nfp_app_bpf *bpf, struct bpf_map *map);
 void
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 39c9fec222b4..88fab6a82acf 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -385,6 +385,7 @@ nfp_bpf_map_alloc(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap)
 	offmap->dev_priv = nfp_map;
 	nfp_map->offmap = offmap;
 	nfp_map->bpf = bpf;
+	spin_lock_init(&nfp_map->cache_lock);
 
 	res = nfp_bpf_ctrl_alloc_map(bpf, &offmap->map);
 	if (res < 0) {
@@ -407,6 +408,8 @@ nfp_bpf_map_free(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap)
 	struct nfp_bpf_map *nfp_map = offmap->dev_priv;
 
 	nfp_bpf_ctrl_free_map(bpf, nfp_map);
+	dev_consume_skb_any(nfp_map->cache);
+	WARN_ON_ONCE(nfp_map->cache_blockers);
 	list_del_init(&nfp_map->l);
 	bpf->map_elems_in_use -= offmap->map.max_entries;
 	bpf->maps_in_use--;
-- 
2.21.0


^ permalink raw reply related
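
As an illustration of the sizing logic in nfp_bpf_ctrl_cmsg_cache_cnt()
in the patch above, the number of entries per message is bounded by
what fits in the control MTU and by the compile-time cap of 4. The
sketch below is a userspace approximation: the header sizes are passed
in as parameters because the real struct sizes are not shown here, the
helper name is made up, and the underflow guard is an addition that
the patch itself does not have.

```c
#include <assert.h>

#define MAP_CACHE_CNT 4U /* mirrors NFP_BPF_MAP_CACHE_CNT from the patch */

/* Hypothetical helper: how many key/value pairs fit in one message. */
static unsigned int
cache_cnt(unsigned int mtu, unsigned int req_hdr, unsigned int reply_hdr,
	  unsigned int key_sz, unsigned int val_sz)
{
	unsigned int entry_sz = key_sz + val_sz;
	unsigned int req_max, reply_max, cnt;

	assert(entry_sz > 0);
	/* Guard against unsigned underflow (not present in the patch). */
	if (mtu < req_hdr || mtu < reply_hdr)
		return 0;

	req_max = (mtu - req_hdr) / entry_sz;
	reply_max = (mtu - reply_hdr) / entry_sz;

	cnt = req_max < reply_max ? req_max : reply_max;
	return cnt < MAP_CACHE_CNT ? cnt : MAP_CACHE_CNT;
}
```

With a 1500 byte MTU, 20 byte headers and 64 byte keys and values,
11 entries would fit, so the cap of 4 applies; with a 300 byte MTU
only 2 entries fit and the function returns 2.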

* Re: [PATCH net-next] net: phy: force phy suspend when calling phy_stop
From: Heiner Kallweit @ 2019-08-28  5:38 UTC (permalink / raw)
  To: Jian Shen, andrew, f.fainelli, davem, sergei.shtylyov
  Cc: netdev, forest.zhouchang, linuxarm
In-Reply-To: <1566956087-37096-1-git-send-email-shenjian15@huawei.com>

On 28.08.2019 03:34, Jian Shen wrote:
> Some ethernet drivers may call phy_start() and phy_stop() from
> ndo_open() and ndo_close() respectively.
> 
> When the network cable is unplugged and the interface is operated as below:
> step 1: ifconfig ethX up -> ndo_open -> phy_start -> start
> autoneg, and the phy has no link.
> step 2: ifconfig ethX down -> ndo_close -> phy_stop -> just stops
> the phy state machine (the phy is not suspended).
> 
> This patch forces phy suspend even if phydev->link is off.
> 
> Signed-off-by: Jian Shen <shenjian15@huawei.com>
> ---
>  drivers/net/phy/phy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index f3adea9..0acd5b4 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -911,8 +911,8 @@ void phy_state_machine(struct work_struct *work)
>  		if (phydev->link) {
>  			phydev->link = 0;
>  			phy_link_down(phydev, true);
> -			do_suspend = true;
>  		}
> +		do_suspend = true;
>  		break;
>  	}
>  
> 
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>

^ permalink raw reply
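
The behavioural difference in the patch quoted above can be shown with
a small userspace model. This is a sketch only: struct phy_model and
both helpers are hypothetical and reduce the PHY_HALTED branch of
phy_state_machine() to the two fields that matter here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant parts of struct phy_device. */
struct phy_model {
	bool link;      /* link state as last seen by the state machine */
	bool suspended; /* whether the phy was put into suspend */
};

/* Before the patch: suspend only if the link was still marked up. */
static void halted_old(struct phy_model *p)
{
	bool do_suspend = false;

	assert(p != NULL);
	if (p->link) {
		p->link = false;
		do_suspend = true;
	}
	if (do_suspend)
		p->suspended = true;
}

/* After the patch: always suspend when the phy is halted. */
static void halted_new(struct phy_model *p)
{
	assert(p != NULL);
	if (p->link)
		p->link = false;
	p->suspended = true;
}
```

In the cable-unplugged scenario the link is never up, so halted_old()
leaves the phy powered while halted_new() suspends it.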

* [RFC v3] vhost: introduce mdev based hardware vhost backend
From: Tiwei Bie @ 2019-08-28  5:37 UTC (permalink / raw)
  To: mst, jasowang, alex.williamson, maxime.coquelin
  Cc: linux-kernel, kvm, virtualization, netdev, dan.daly,
	cunming.liang, zhihong.wang, lingshan.zhu, tiwei.bie

Details about this can be found here:

https://lwn.net/Articles/750770/

What's new in this version
==========================

There are three choices based on the discussion [1] in RFC v2:

> #1. We expose a VFIO device, so we can reuse the VFIO container/group
>     based DMA API and potentially reuse a lot of VFIO code in QEMU.
>
>     But in this case, we have two choices for the VFIO device interface
>     (i.e. the interface on top of VFIO device fd):
>
>     A) we may invent a new vhost protocol (as demonstrated by the code
>        in this RFC) on VFIO device fd to make it work in VFIO's way,
>        i.e. regions and irqs.
>
>     B) Or as you proposed, instead of inventing a new vhost protocol,
>        we can reuse most existing vhost ioctls on the VFIO device fd
>        directly. There should be no conflicts between the VFIO ioctls
>        (type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
>
> #2. Instead of exposing a VFIO device, we may expose a VHOST device.
>     And we will introduce a new mdev driver vhost-mdev to do this.
>     It would be natural to reuse the existing kernel vhost interface
>     (ioctls) on it as much as possible. But we will need to invent
>     some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
>     choice, but it's too heavy and doesn't support vIOMMU by itself).

This version is more of a quick PoC to try Jason's proposal of
reusing vhost ioctls. The second option (#1/B) of the three choices
above was chosen in this version to demonstrate the idea quickly.

Now the userspace API looks like this:

- VFIO's container/group based IOMMU API is used to do the
  DMA programming.

- Vhost's existing ioctls are used to set up the device.

And the device will report device_api as "vfio-vhost".
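
Both ioctl families can share the one device fd because their type
bytes differ (0x3B for VFIO, 0xAF for VHOST). The sketch below only
illustrates that point: it assumes the asm-generic ioctl number layout
(some architectures use different field widths), re-creates two
command numbers with local EX_* macros rather than including the uapi
headers, and classify() is a made-up helper. The RFC code itself
switches on the full command value rather than on the type byte.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout of an ioctl number as in asm-generic/ioctl.h (assumed). */
#define IOC_NRSHIFT   0
#define IOC_TYPESHIFT 8
#define IOC_SIZESHIFT 16
#define IOC_DIRSHIFT  30
#define IOC_NONE      0U
#define IOC_WRITE     1U

#define IOC(dir, type, nr, size)				\
	(((uint32_t)(dir) << IOC_DIRSHIFT) |			\
	 ((uint32_t)(type) << IOC_TYPESHIFT) |			\
	 ((uint32_t)(nr) << IOC_NRSHIFT) |			\
	 ((uint32_t)(size) << IOC_SIZESHIFT))

#define EX_VFIO_TYPE  0x3B /* ';' */
#define EX_VHOST_TYPE 0xAF

static_assert(EX_VFIO_TYPE != EX_VHOST_TYPE, "type bytes must differ");

/* Local copies for illustration; the real definitions live in uapi. */
#define EX_VFIO_DEVICE_GET_INFO \
	IOC(IOC_NONE, EX_VFIO_TYPE, 100 + 7, 0)
#define EX_VHOST_MDEV_SET_STATE \
	IOC(IOC_WRITE, EX_VHOST_TYPE, 0x70, sizeof(uint64_t))

enum ioctl_family { FAMILY_VFIO, FAMILY_VHOST, FAMILY_UNKNOWN };

/* Extract the 8-bit type field from an ioctl command number. */
static unsigned int ioc_type(uint32_t cmd)
{
	return (cmd >> IOC_TYPESHIFT) & 0xff;
}

/* Hypothetical helper: decide which family a command belongs to. */
static enum ioctl_family classify(uint32_t cmd)
{
	switch (ioc_type(cmd)) {
	case EX_VFIO_TYPE:
		return FAMILY_VFIO;
	case EX_VHOST_TYPE:
		return FAMILY_VHOST;
	default:
		return FAMILY_UNKNOWN;
	}
}
```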

Note that there are dirty hacks in this version. If we decide to
go this way, some refactoring in vhost.c/vhost.h may be needed.

PS. The direct mapping of the notify registers isn't implemented
    in this version.

[1] https://lkml.org/lkml/2019/7/9/101

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/vhost/Kconfig      |   9 +
 drivers/vhost/Makefile     |   3 +
 drivers/vhost/mdev.c       | 382 +++++++++++++++++++++++++++++++++++++
 include/linux/vhost_mdev.h |  58 ++++++
 include/uapi/linux/vfio.h  |   2 +
 include/uapi/linux/vhost.h |   8 +
 6 files changed, 462 insertions(+)
 create mode 100644 drivers/vhost/mdev.c
 create mode 100644 include/linux/vhost_mdev.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..2ba54fcf43b7 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -34,6 +34,15 @@ config VHOST_VSOCK
 	To compile this driver as a module, choose M here: the module will be called
 	vhost_vsock.
 
+config VHOST_MDEV
+	tristate "Hardware vhost accelerator abstraction"
+	depends on EVENTFD && VFIO && VFIO_MDEV
+	select VHOST
+	default n
+	---help---
+	Say Y here to enable the vhost_mdev module
+	for use with hardware vhost accelerators
+
 config VHOST
 	tristate
 	---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..ad9c0f8c6d8c 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
 
 obj-$(CONFIG_VHOST_RING) += vringh.o
 
+obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
+vhost_mdev-y := mdev.o
+
 obj-$(CONFIG_VHOST)	+= vhost.o
diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
new file mode 100644
index 000000000000..6bef1d9ae2e6
--- /dev/null
+++ b/drivers/vhost/mdev.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018-2019 Intel Corporation.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/vfio.h>
+#include <linux/vhost.h>
+#include <linux/mdev.h>
+#include <linux/vhost_mdev.h>
+
+#include "vhost.h"
+
+struct vhost_mdev {
+	struct vhost_dev dev;
+	bool opened;
+	int nvqs;
+	u64 state;
+	u64 acked_features;
+	u64 features;
+	const struct vhost_mdev_device_ops *ops;
+	struct mdev_device *mdev;
+	void *private;
+	struct vhost_virtqueue vqs[];
+};
+
+static void handle_vq_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  poll.work);
+	struct vhost_mdev *vdpa = container_of(vq->dev, struct vhost_mdev, dev);
+
+	vdpa->ops->notify(vdpa, vq - vdpa->vqs);
+}
+
+static int vhost_set_state(struct vhost_mdev *vdpa, u64 __user *statep)
+{
+	u64 state;
+
+	if (copy_from_user(&state, statep, sizeof(state)))
+		return -EFAULT;
+
+	if (state >= VHOST_MDEV_S_MAX)
+		return -EINVAL;
+
+	if (vdpa->state == state)
+		return 0;
+
+	mutex_lock(&vdpa->dev.mutex);
+
+	vdpa->state = state;
+
+	switch (vdpa->state) {
+	case VHOST_MDEV_S_RUNNING:
+		vdpa->ops->start(vdpa);
+		break;
+	case VHOST_MDEV_S_STOPPED:
+		vdpa->ops->stop(vdpa);
+		break;
+	}
+
+	mutex_unlock(&vdpa->dev.mutex);
+
+	return 0;
+}
+
+static int vhost_set_features(struct vhost_mdev *vdpa, u64 __user *featurep)
+{
+	u64 features;
+
+	if (copy_from_user(&features, featurep, sizeof(features)))
+		return -EFAULT;
+
+	if (features & ~vdpa->features)
+		return -EINVAL;
+
+	vdpa->acked_features = features;
+	vdpa->ops->features_changed(vdpa);
+	return 0;
+}
+
+static int vhost_get_features(struct vhost_mdev *vdpa, u64 __user *featurep)
+{
+	if (copy_to_user(featurep, &vdpa->features, sizeof(vdpa->features)))
+		return -EFAULT;
+	return 0;
+}
+
+static int vhost_get_vring_base(struct vhost_mdev *vdpa, void __user *argp)
+{
+	struct vhost_virtqueue *vq;
+	u32 idx;
+	int r;
+
+	r = get_user(idx, (u32 __user *)argp);
+	if (r < 0)
+		return r;
+
+	vq = &vdpa->vqs[idx];
+	vq->last_avail_idx = vdpa->ops->get_vring_base(vdpa, idx);
+
+	return vhost_vring_ioctl(&vdpa->dev, VHOST_GET_VRING_BASE, argp);
+}
+
+/*
+ * Helpers for backend to register mdev.
+ */
+
+struct vhost_mdev *vhost_mdev_alloc(struct mdev_device *mdev, void *private,
+				    int nvqs)
+{
+	struct vhost_mdev *vdpa;
+	struct vhost_dev *dev;
+	struct vhost_virtqueue **vqs;
+	size_t size;
+	int i;
+
+	size = sizeof(struct vhost_mdev) + nvqs * sizeof(struct vhost_virtqueue);
+
+	vdpa = kzalloc(size, GFP_KERNEL);
+	if (!vdpa)
+		return NULL;
+
+	vdpa->nvqs = nvqs;
+
+	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		kfree(vdpa);
+		return NULL;
+	}
+
+	dev = &vdpa->dev;
+	for (i = 0; i < nvqs; i++) {
+		vqs[i] = &vdpa->vqs[i];
+		vqs[i]->handle_kick = handle_vq_kick;
+	}
+	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0);
+
+	vdpa->private = private;
+	vdpa->mdev = mdev;
+
+	mdev_set_drvdata(mdev, vdpa);
+
+	return vdpa;
+}
+EXPORT_SYMBOL(vhost_mdev_alloc);
+
+void vhost_mdev_free(struct vhost_mdev *vdpa)
+{
+	struct mdev_device *mdev;
+
+	mdev = vdpa->mdev;
+	mdev_set_drvdata(mdev, NULL);
+
+	vhost_dev_stop(&vdpa->dev);
+	vhost_dev_cleanup(&vdpa->dev);
+	kfree(vdpa->dev.vqs);
+	kfree(vdpa);
+}
+EXPORT_SYMBOL(vhost_mdev_free);
+
+ssize_t vhost_mdev_read(struct mdev_device *mdev, char __user *buf,
+		  size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+EXPORT_SYMBOL(vhost_mdev_read);
+
+
+ssize_t vhost_mdev_write(struct mdev_device *mdev, const char __user *buf,
+		   size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+EXPORT_SYMBOL(vhost_mdev_write);
+
+int vhost_mdev_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
+{
+	// TODO
+	return -EINVAL;
+}
+EXPORT_SYMBOL(vhost_mdev_mmap);
+
+long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
+		      unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	struct vhost_mdev *vdpa;
+	unsigned long minsz;
+	int ret = 0;
+
+	if (!mdev)
+		return -EINVAL;
+
+	vdpa = mdev_get_drvdata(mdev);
+	if (!vdpa)
+		return -ENODEV;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		if (info.argsz < minsz) {
+			ret = -EINVAL;
+			break;
+		}
+
+		info.flags = VFIO_DEVICE_FLAGS_VHOST;
+		info.num_regions = 0;
+		info.num_irqs = 0;
+
+		if (copy_to_user((void __user *)arg, &info, minsz)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		break;
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	case VFIO_DEVICE_SET_IRQS:
+	case VFIO_DEVICE_RESET:
+		ret = -EINVAL;
+		break;
+
+	case VHOST_MDEV_SET_STATE:
+		ret = vhost_set_state(vdpa, argp);
+		break;
+	case VHOST_GET_FEATURES:
+		ret = vhost_get_features(vdpa, argp);
+		break;
+	case VHOST_SET_FEATURES:
+		ret = vhost_set_features(vdpa, argp);
+		break;
+	case VHOST_GET_VRING_BASE:
+		ret = vhost_get_vring_base(vdpa, argp);
+		break;
+	default:
+		ret = vhost_dev_ioctl(&vdpa->dev, cmd, argp);
+		if (ret == -ENOIOCTLCMD)
+			ret = vhost_vring_ioctl(&vdpa->dev, cmd, argp);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(vhost_mdev_ioctl);
+
+int vhost_mdev_open(struct mdev_device *mdev)
+{
+	struct vhost_mdev *vdpa;
+	int ret = 0;
+
+	vdpa = mdev_get_drvdata(mdev);
+	if (!vdpa)
+		return -ENODEV;
+
+	mutex_lock(&vdpa->dev.mutex);
+
+	if (vdpa->opened)
+		ret = -EBUSY;
+	else
+		vdpa->opened = true;
+
+	mutex_unlock(&vdpa->dev.mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL(vhost_mdev_open);
+
+void vhost_mdev_close(struct mdev_device *mdev)
+{
+	struct vhost_mdev *vdpa;
+
+	vdpa = mdev_get_drvdata(mdev);
+
+	mutex_lock(&vdpa->dev.mutex);
+
+	vhost_dev_stop(&vdpa->dev);
+	vhost_dev_cleanup(&vdpa->dev);
+
+	vdpa->opened = false;
+	mutex_unlock(&vdpa->dev.mutex);
+}
+EXPORT_SYMBOL(vhost_mdev_close);
+
+/*
+ * Helpers for backend to set/get information.
+ */
+
+int vhost_mdev_set_device_ops(struct vhost_mdev *vdpa,
+			      const struct vhost_mdev_device_ops *ops)
+{
+	vdpa->ops = ops;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_set_device_ops);
+
+int vhost_mdev_set_features(struct vhost_mdev *vdpa, u64 features)
+{
+	vdpa->features = features;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_set_features);
+
+struct eventfd_ctx *
+vhost_mdev_get_call_ctx(struct vhost_mdev *vdpa, int queue_id)
+{
+	return vdpa->vqs[queue_id].call_ctx;
+}
+EXPORT_SYMBOL(vhost_mdev_get_call_ctx);
+
+int vhost_mdev_get_acked_features(struct vhost_mdev *vdpa, u64 *features)
+{
+	*features = vdpa->acked_features;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_acked_features);
+
+int vhost_mdev_get_vring_num(struct vhost_mdev *vdpa, int queue_id, u16 *num)
+{
+	*num = vdpa->vqs[queue_id].num;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_vring_num);
+
+int vhost_mdev_get_vring_base(struct vhost_mdev *vdpa, int queue_id, u16 *base)
+{
+	*base = vdpa->vqs[queue_id].last_avail_idx;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_vring_base);
+
+int vhost_mdev_get_vring_addr(struct vhost_mdev *vdpa, int queue_id,
+			      struct vhost_vring_addr *addr)
+{
+	struct vhost_virtqueue *vq = &vdpa->vqs[queue_id];
+
+	/*
+	 * XXX: we need userspace to pass guest physical address or
+	 *      IOVA directly.
+	 */
+	addr->flags = vq->log_used ? (0x1 << VHOST_VRING_F_LOG) : 0;
+	addr->desc_user_addr = (__u64)vq->desc;
+	addr->avail_user_addr = (__u64)vq->avail;
+	addr->used_user_addr = (__u64)vq->used;
+	addr->log_guest_addr = (__u64)vq->log_addr;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_vring_addr);
+
+int vhost_mdev_get_log_base(struct vhost_mdev *vdpa, int queue_id,
+			    void **log_base, u64 *log_size)
+{
+	// TODO
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_log_base);
+
+struct mdev_device *vhost_mdev_get_mdev(struct vhost_mdev *vdpa)
+{
+	return vdpa->mdev;
+}
+EXPORT_SYMBOL(vhost_mdev_get_mdev);
+
+void *vhost_mdev_get_private(struct vhost_mdev *vdpa)
+{
+	return vdpa->private;
+}
+EXPORT_SYMBOL(vhost_mdev_get_private);
+
+MODULE_VERSION("0.0.0");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Hardware vhost accelerator abstraction");
diff --git a/include/linux/vhost_mdev.h b/include/linux/vhost_mdev.h
new file mode 100644
index 000000000000..070787ce6b36
--- /dev/null
+++ b/include/linux/vhost_mdev.h
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018-2019 Intel Corporation.
+ */
+
+#ifndef _VHOST_MDEV_H
+#define _VHOST_MDEV_H
+
+struct mdev_device;
+struct vhost_mdev;
+
+typedef int (*vhost_mdev_start_device_t)(struct vhost_mdev *vdpa);
+typedef int (*vhost_mdev_stop_device_t)(struct vhost_mdev *vdpa);
+typedef int (*vhost_mdev_set_features_t)(struct vhost_mdev *vdpa);
+typedef void (*vhost_mdev_notify_device_t)(struct vhost_mdev *vdpa, int queue_id);
+typedef u64 (*vhost_mdev_get_notify_addr_t)(struct vhost_mdev *vdpa, int queue_id);
+typedef u16 (*vhost_mdev_get_vring_base_t)(struct vhost_mdev *vdpa, int queue_id);
+typedef void (*vhost_mdev_features_changed_t)(struct vhost_mdev *vdpa);
+
+struct vhost_mdev_device_ops {
+	vhost_mdev_start_device_t	start;
+	vhost_mdev_stop_device_t	stop;
+	vhost_mdev_notify_device_t	notify;
+	vhost_mdev_get_notify_addr_t	get_notify_addr;
+	vhost_mdev_get_vring_base_t	get_vring_base;
+	vhost_mdev_features_changed_t	features_changed;
+};
+
+struct vhost_mdev *vhost_mdev_alloc(struct mdev_device *mdev,
+		void *private, int nvqs);
+void vhost_mdev_free(struct vhost_mdev *vdpa);
+
+ssize_t vhost_mdev_read(struct mdev_device *mdev, char __user *buf,
+		size_t count, loff_t *ppos);
+ssize_t vhost_mdev_write(struct mdev_device *mdev, const char __user *buf,
+		size_t count, loff_t *ppos);
+long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
+		unsigned long arg);
+int vhost_mdev_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
+int vhost_mdev_open(struct mdev_device *mdev);
+void vhost_mdev_close(struct mdev_device *mdev);
+
+int vhost_mdev_set_device_ops(struct vhost_mdev *vdpa,
+		const struct vhost_mdev_device_ops *ops);
+int vhost_mdev_set_features(struct vhost_mdev *vdpa, u64 features);
+struct eventfd_ctx *vhost_mdev_get_call_ctx(struct vhost_mdev *vdpa,
+		int queue_id);
+int vhost_mdev_get_acked_features(struct vhost_mdev *vdpa, u64 *features);
+int vhost_mdev_get_vring_num(struct vhost_mdev *vdpa, int queue_id, u16 *num);
+int vhost_mdev_get_vring_base(struct vhost_mdev *vdpa, int queue_id, u16 *base);
+int vhost_mdev_get_vring_addr(struct vhost_mdev *vdpa, int queue_id,
+		struct vhost_vring_addr *addr);
+int vhost_mdev_get_log_base(struct vhost_mdev *vdpa, int queue_id,
+		void **log_base, u64 *log_size);
+struct mdev_device *vhost_mdev_get_mdev(struct vhost_mdev *vdpa);
+void *vhost_mdev_get_private(struct vhost_mdev *vdpa);
+
+#endif /* _VHOST_MDEV_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8f10748dac79..0300d6831cc5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -201,6 +201,7 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
 #define VFIO_DEVICE_FLAGS_CCW	(1 << 4)	/* vfio-ccw device */
 #define VFIO_DEVICE_FLAGS_AP	(1 << 5)	/* vfio-ap device */
+#define VFIO_DEVICE_FLAGS_VHOST	(1 << 6)	/* vfio-vhost device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
@@ -217,6 +218,7 @@ struct vfio_device_info {
 #define VFIO_DEVICE_API_AMBA_STRING		"vfio-amba"
 #define VFIO_DEVICE_API_CCW_STRING		"vfio-ccw"
 #define VFIO_DEVICE_API_AP_STRING		"vfio-ap"
+#define VFIO_DEVICE_API_VHOST_STRING		"vfio-vhost"
 
 /**
  * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 40d028eed645..5afbc2f08fa3 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -116,4 +116,12 @@
 #define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)
 #define VHOST_VSOCK_SET_RUNNING		_IOW(VHOST_VIRTIO, 0x61, int)
 
+/* VHOST_MDEV specific defines */
+
+#define VHOST_MDEV_SET_STATE	_IOW(VHOST_VIRTIO, 0x70, __u64)
+
+#define VHOST_MDEV_S_STOPPED	0
+#define VHOST_MDEV_S_RUNNING	1
+#define VHOST_MDEV_S_MAX	2
+
 #endif
-- 
2.17.1


^ permalink raw reply related

* [PATCH 2/2] vhost/test: fix build for vhost test
From: Tiwei Bie @ 2019-08-28  5:37 UTC (permalink / raw)
  To: mst, jasowang; +Cc: kvm, virtualization, netdev, linux-kernel, stable
In-Reply-To: <20190828053700.26022-1-tiwei.bie@intel.com>

Since vhost_exceeds_weight() was introduced, callers need to specify
the packet weight and byte weight in vhost_dev_init(). Note that the
packet weight isn't counted in this patch to keep the original behavior
unchanged.
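
Why passing 0 as the packet count keeps the old behavior can be seen
from a simplified model of the weight check. This is a sketch based on
my reading of vhost_exceeds_weight(), not a copy of it: struct weights
and exceeds_weight() are hypothetical, and the real helper also
requeues the poll work when a limit is hit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two limits stored in struct vhost_dev. */
struct weights {
	int pkt_weight;  /* max packets before requeueing */
	int byte_weight; /* max bytes before requeueing, 0 disables */
};

/* True once either the byte limit or the packet limit is reached. */
static bool exceeds_weight(const struct weights *w, int pkts, int total_len)
{
	assert(w->pkt_weight > 0);
	return (w->byte_weight && total_len >= w->byte_weight) ||
	       pkts >= w->pkt_weight;
}
```

With pkts fixed at 0 the packet limit can never trigger, so only the
byte limit (VHOST_TEST_WEIGHT) decides when the loop breaks, exactly
as the open-coded check did before.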

Fixes: e82b9b0727ff ("vhost: introduce vhost_exceeds_weight()")
Cc: stable@vger.kernel.org
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/vhost/test.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index ac4f762c4f65..7804869c6a31 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -22,6 +22,12 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_TEST_WEIGHT 0x80000
 
+/* Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving others with
+ * pkts.
+ */
+#define VHOST_TEST_PKT_WEIGHT 256
+
 enum {
 	VHOST_TEST_VQ = 0,
 	VHOST_TEST_VQ_MAX = 1,
@@ -80,10 +86,8 @@ static void handle_vq(struct vhost_test *n)
 		}
 		vhost_add_used_and_signal(&n->dev, vq, head, 0);
 		total_len += len;
-		if (unlikely(total_len >= VHOST_TEST_WEIGHT)) {
-			vhost_poll_queue(&vq->poll);
+		if (unlikely(vhost_exceeds_weight(vq, 0, total_len)))
 			break;
-		}
 	}
 
 	mutex_unlock(&vq->mutex);
@@ -115,7 +119,8 @@ static int vhost_test_open(struct inode *inode, struct file *f)
 	dev = &n->dev;
 	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
 	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV);
+	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
+		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT);
 
 	f->private_data = n;
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH 1/2] vhost/test: fix build for vhost test
From: Tiwei Bie @ 2019-08-28  5:36 UTC (permalink / raw)
  To: mst, jasowang; +Cc: kvm, virtualization, netdev, linux-kernel, stable

Since the commit below, callers need to specify the iov_limit in
vhost_dev_init() explicitly.

Fixes: b46a0bf78ad7 ("vhost: fix OOB in get_rx_bufs()")
Cc: stable@vger.kernel.org
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/vhost/test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 9e90e969af55..ac4f762c4f65 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -115,7 +115,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
 	dev = &n->dev;
 	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
 	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
+	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV);
 
 	f->private_data = n;
 
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net-next 1/4] r8169: prepare for adding RTL8125 support
From: Heiner Kallweit @ 2019-08-28  5:52 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Realtek linux nic maintainers, David Miller,
	netdev@vger.kernel.org, Chun-Hao Lin
In-Reply-To: <20190827232713.GE26248@lunn.ch>

On 28.08.2019 01:27, Andrew Lunn wrote:
> On Tue, Aug 27, 2019 at 08:41:00PM +0200, Heiner Kallweit wrote:
>> This patch prepares the driver for adding RTL8125 support:
>> - change type of interrupt mask to u32
>> - restrict rtl_is_8168evl_up to RTL8168 chip versions
>> - factor out reading MAC address from registers
>> - re-add function rtl_get_events
>> - move disabling interrupt coalescing to RTL8169/RTL8168 init
>> - read different register for PCI commit
>> - don't use bit LastFrag in tx descriptor after send, RTL8125 clears it
> 
> Hi Heiner
> 
> That is a lot of changes in one patch. Although there is no planned
> functional change, r8169 has a habit of breaking. Having lots of small
> changes would help tracking down which change caused a breakage, via a
> git bisect.
> 
> So you might want to consider splitting this up into a number of small
> patches.
> 
> 	Andrew
> 
Hi Andrew,

most of the changes are trivial, but you're right. I'll split this patch.

Heiner

^ permalink raw reply

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
From: Shmulik Ladkani @ 2019-08-28  5:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov,
	Yonghong Song, Steffen Klassert, shmulik, eyal
In-Reply-To: <88a3da53-fecc-0d8c-56dc-a4c3b0e11dfd@iogearbox.net>

On Tue, 27 Aug 2019 14:10:35 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Given first point above wrt hitting rarely, it would be good to first get a
> better understanding for writing a reproducer. Back then Yonghong added one
> to the BPF kernel test suite [0], so it would be desirable to extend it for
> the case you're hitting. Given NAT64 use-case is needed and used by multiple
> parties, we should try to (fully) fix it generically.

Thanks Daniel for the advice.

I'm working on a reproducer that resembles the input skb which triggers
this BUG_ON.

^ permalink raw reply

* [PATCH net 0/2] nfp: flower: fix bugs in merge tunnel encap code
From: Jakub Kicinski @ 2019-08-28  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, oss-drivers, Jakub Kicinski

John says:

There are a few bugs in the merge encap code that have come to light with
recent driver changes. Effectively, flow bind callbacks were being
registered twice when using internal ports (new 'busy' code triggers
this). There was also an issue with neighbour notifier messages being
ignored for internal ports.

John Hurley (2):
  nfp: flower: prevent ingress block binds on internal ports
  nfp: flower: handle neighbour events on internal ports

 drivers/net/ethernet/netronome/nfp/flower/offload.c     | 7 ++++---
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c | 8 ++++----
 2 files changed, 8 insertions(+), 7 deletions(-)

-- 
2.21.0

^ permalink raw reply

* [PATCH net 1/2] nfp: flower: prevent ingress block binds on internal ports
From: Jakub Kicinski @ 2019-08-28  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, oss-drivers, John Hurley, Jakub Kicinski
In-Reply-To: <20190828055630.17331-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Internal port TC offload is implemented through user-space applications
(such as OvS) by adding filters at egress via TC clsact qdiscs. Indirect
block offload support in the NFP driver accepts both ingress qdisc binds
and egress binds if the device is an internal port. However, clsact sends
bind notifications for both ingress and egress block binds, which can lead
to the driver registering multiple callbacks and receiving multiple
notifications of new filters.

Fix this by rejecting ingress block bind callbacks when the port is
internal and only adding filter callbacks for egress binds.

Fixes: 4d12ba42787b ("nfp: flower: allow offloading of matches on 'internal' ports")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 9917d64694c6..457bdc60f3ee 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -1409,9 +1409,10 @@ nfp_flower_setup_indr_tc_block(struct net_device *netdev, struct nfp_app *app,
 	struct nfp_flower_priv *priv = app->priv;
 	struct flow_block_cb *block_cb;
 
-	if (f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS &&
-	    !(f->binder_type == FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS &&
-	      nfp_flower_internal_port_can_offload(app, netdev)))
+	if ((f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS &&
+	     !nfp_flower_internal_port_can_offload(app, netdev)) ||
+	    (f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS &&
+	     nfp_flower_internal_port_can_offload(app, netdev)))
 		return -EOPNOTSUPP;
 
 	switch (f->command) {
-- 
2.21.0
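
The change to the bind check can be read as a small truth table. The
sketch below models the old and new conditions in plain C so they can be
compared side by side. It is an illustration only: binder_model,
old_rejects() and new_rejects() are hypothetical names, and the boolean
"internal" stands in for the result of
nfp_flower_internal_port_can_offload().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the flow block binder types. */
enum binder_model { BINDER_INGRESS, BINDER_EGRESS, BINDER_OTHER };

/* Condition before the patch: true means the bind gets -EOPNOTSUPP. */
static bool old_rejects(enum binder_model type, bool internal)
{
	return type != BINDER_INGRESS &&
	       !(type == BINDER_EGRESS && internal);
}

/* Condition after the patch: true means the bind gets -EOPNOTSUPP. */
static bool new_rejects(enum binder_model type, bool internal)
{
	return (type != BINDER_INGRESS && !internal) ||
	       (type != BINDER_EGRESS && internal);
}
```

The two conditions differ in exactly one case: an ingress bind on an
internal port was accepted before and is rejected now, so an internal
port ends up with a callback for the egress bind only.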


^ permalink raw reply related

* [PATCH net 2/2] nfp: flower: handle neighbour events on internal ports
From: Jakub Kicinski @ 2019-08-28  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, oss-drivers, John Hurley, Simon Horman, Jakub Kicinski
In-Reply-To: <20190828055630.17331-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Recent code changes to NFP allowed the offload of neighbour entries to FW
when the next hop device was an internal port. This allows for offload of
tunnel encap when the end-point IP address is applied to such a port.

Unfortunately, the neighbour event handler still rejects events that are
not associated with a repr dev and so the firmware neighbour table may get
out of sync for internal ports.

Fix this by allowing internal port neighbour events to be correctly
processed.

Fixes: 45756dfedab5 ("nfp: flower: allow tunnels to output to internal port")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index a7a80f4b722a..f0ee982eb1b5 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -328,13 +328,13 @@ nfp_tun_neigh_event_handler(struct notifier_block *nb, unsigned long event,
 
 	flow.daddr = *(__be32 *)n->primary_key;
 
-	/* Only concerned with route changes for representors. */
-	if (!nfp_netdev_is_nfp_repr(n->dev))
-		return NOTIFY_DONE;
-
 	app_priv = container_of(nb, struct nfp_flower_priv, tun.neigh_nb);
 	app = app_priv->app;
 
+	if (!nfp_netdev_is_nfp_repr(n->dev) &&
+	    !nfp_flower_internal_port_can_offload(app, n->dev))
+		return NOTIFY_DONE;
+
 	/* Only concerned with changes to routes already added to NFP. */
 	if (!nfp_tun_has_route(app, flow.daddr))
 		return NOTIFY_DONE;
-- 
2.21.0


^ permalink raw reply related
