Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net v3] ipvs: Improve robustness to the ipvs sysctl
From: Pablo Neira Ayuso @ 2019-08-03 16:40 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: hujunwei, wensong, horms, kadlec, Florian Westphal, davem,
	netdev@vger.kernel.org, lvs-devel, netfilter-devel, coreteam,
	Mingfangsen, wangxiaogang3, xuhanbing
In-Reply-To: <alpine.LFD.2.21.1907312052310.3631@ja.home.ssi.bg>

On Wed, Jul 31, 2019 at 08:53:47PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 1 Aug 2019, hujunwei wrote:
> 
> > From: Junwei Hu <hujunwei4@huawei.com>
> > 
> > The ipvs module parse the user buffer and save it to sysctl,
> > then check if the value is valid. invalid value occurs
> > over a period of time.
> > Here, I add a variable, struct ctl_table tmp, used to read
> > the value from the user buffer, and save only when it is valid.
> > I delete proc_do_sync_mode and use extra1/2 in table for the
> > proc_dointvec_minmax call.
> > 
> > Fixes: f73181c8288f ("ipvs: add support for sync threads")
> > Signed-off-by: Junwei Hu <hujunwei4@huawei.com>
> > Acked-by: Julian Anastasov <ja@ssi.bg>
> 
> 	Yep, Acked-by: Julian Anastasov <ja@ssi.bg>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] netfilter: conntrack: use shared sysctl constants
From: Pablo Neira Ayuso @ 2019-08-03 16:40 UTC (permalink / raw)
  To: Matteo Croce
  Cc: netdev, netfilter-devel, Jozsef Kadlecsik, Florian Westphal,
	linux-kernel, akpm, Tonghao Zhang
In-Reply-To: <20190723012303.2221-1-mcroce@redhat.com>

On Tue, Jul 23, 2019 at 03:23:03AM +0200, Matteo Croce wrote:
> Use shared sysctl variables for zero and one constants, as in commit
> eec4844fae7c ("proc/sysctl: add shared variables for range check")

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v3 bpf-next 02/12] libbpf: implement BPF CO-RE offset relocation algorithm
From: Andrii Nakryiko @ 2019-08-03 16:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Yonghong Song, Song Liu, Kernel Team
In-Reply-To: <20190803162556.pdbckv7yta4wigjk@ast-mbp.dhcp.thefacebook.com>

On Sat, Aug 3, 2019 at 9:26 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Aug 02, 2019 at 11:30:21PM -0700, Andrii Nakryiko wrote:
> >
> > No, not anonymous.
> >
> > struct my_struct___local {
> >     int a;
> > };
> >
> > struct my_struct___target {
> >     long long a;
> > };
> >
> > my_struct___local->a will not match my_struct___target->a, but it's
> > not a reason to stop relocation process due to error.
>
> why? It feels that this is exactly the reason to fail relocation.
> struct names matched. field names matched.
> but the types are different. Why proceed further?

Because it can be accidental name collision. Think about two different
ring_buffer structs, let's say they have some field with the same
generic name, but different types, e.g., id (I'm making everything up,
except the fact that there are two different ring_buffers in kernel):

struct ring_buffer {
    int id;
    int extra_field_a;
    /* tons of other fields */
};

struct ring_buffer {
    char id[16];
    int extra_field_b;
    /* other stuff */
};

If BPF app expects to read first ring_buffer's id (of int type), that
relocation will still be tried against both, because there is no way
to know which one was originally intended. But because type matched
for only one of them, then that one is the chosen candidate.

Now, if both ids were int, then we would have two matching candidates,
and then two situations are possible:

1. Offsets of that field match. Then it doesn't matter, we still
successfully relocate.
2. Offsets don't match - that's an error, because we can't figure out
correct relocation. Program loading will fail in that case.

But in situation 2, if there was another relocation for ring_buffer
(for another field), and that field was matched against only one of
the structs, then that one struct would be chosen only.

So in general, I try to postpone failure as much as possible, because
there could be "accidental" name matches. This is not a failure, until
we have ambiguity or failed to find any matching candidate, that
satisfies **all** field relocations.


>
> Also what about str->a followed by str->b.
> Is it possible that str->a will pick up one flavor when str->b different one?

No, if str->a picked one flavor and discarded another one, then we are
left with that chosen flavor as single possible candidate for all
subsequent relocations against structs with the same name. That's main
reason or this cache of candidates that I maintain (not just a
performance improvement).

> That will likely lead to wrong behavior?
>
> >
> > All the tests I added use non-numeric flavors. While technically I can
> > use just ___1, ___2 and so on, it will greatly reduce readability,
> > while not really solving any problem (nothing prevents someone to add
> > something like lmc___1 eventually).
> >
> > I think it's not worth it to complicate this logic just for
> > lmc___{softc,media,ctl}, but we can do 2) - try to match any struct as
> > is. If that fails, see if it's a "flavor" and match flavors.
>
> Could you please share benchmarking results of largish bpf prog
> with couple thousands relocations against typical vmlinux.h ?

I can once I have such a program converted.

> I'm concerned that this double check will be noticeable.

Well yes, it's an extra pass, essentially, which is why I wouldn't
like to do it.

> May be llvm should recognize "flavor" in the type name and
> encode them differently in BTF ?
> Or add a pre-pass to libbpf to sort out all types into flavored and not.

This seems like extreme overcomplication for extreme corner case. And
there is very simple work-around from user space.

E.g., let's take that lmc___media struct. If for user really needs to
relocate such struct, we can just define (locally) typedef:

typedef struct lmc___media lmc___media___for_real;

And now "non-flavored" name that we will try to match will be desired
"lmc___media". Given there are whole 3 structs like that right now,
I'm fine with this work around in exchange of not overcomplicating
libbpf's algorithm. WDYT?

> If flavored search is expensive may be all flavors could be a linked list
> from the base type. The typical case is one or two flavors, right?

Flavors search is no more expensive than non-flavored one. And we have
a list of candidates.

>
> > > > > > +     for (i = 0, j = 0; i < cand_ids->len; i++) {
> > > > > > +             cand_id = cand_ids->data[i];
> > > > > > +             cand_type = btf__type_by_id(targ_btf, cand_id);
> > > > > > +             cand_name = btf__name_by_offset(targ_btf, cand_type->name_off);
> > > > > > +
> > > > > > +             err = bpf_core_spec_match(&local_spec, targ_btf,
> > > > > > +                                       cand_id, &cand_spec);
> > > > > > +             if (err < 0) {
> > > > > > +                     pr_warning("prog '%s': relo #%d: failed to match spec ",
> > > > > > +                                prog_name, relo_idx);
> > > > > > +                     bpf_core_dump_spec(LIBBPF_WARN, &local_spec);
> > > > > > +                     libbpf_print(LIBBPF_WARN,
> > > > > > +                                  " to candidate #%d [%d] (%s): %d\n",
> > > > > > +                                  i, cand_id, cand_name, err);
> > > > > > +                     return err;
> > > > > > +             }
> > > > > > +             if (err == 0) {
> > > > > > +                     pr_debug("prog '%s': relo #%d: candidate #%d [%d] (%s) doesn't match spec ",
> > > > > > +                              prog_name, relo_idx, i, cand_id, cand_name);
> > > > > > +                     bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec);
> > > > > > +                     libbpf_print(LIBBPF_DEBUG, "\n");
> > > > > > +                     continue;
> > > > > > +             }
> > > > > > +
> > > > > > +             pr_debug("prog '%s': relo #%d: candidate #%d matched as spec ",
> > > > > > +                      prog_name, relo_idx, i);
> > > > >
> > > > > did you mention that you're going to make a helper for this debug dumps?
> > > >
> > > > yeah, I added bpf_core_dump_spec(), but I don't know how to shorten
> > > > this further... This output is extremely useful to understand what's
> > > > happening and will be invaluable when users will inevitably report
> > > > confusing behavior in some cases, so I still want to keep it.
> > >
> > > not sure yet. Just pointing out that this function has more debug printfs
> > > than actual code which doesn't look right.
> > > We have complex algorithms in the kernel (like verifier).
> > > Yet we don't sprinkle printfs in there to this degree.
> > >
> >
> > We do have a verbose verifier logging, though, exactly to help users
> > to debug issues, which is extremely helpful and is greatly appreciated
> > by users.
> > There is nothing worse for developer experience than getting -EINVAL
> > without any useful log message. Been there, banged my head against the
> > wall wishing for a bit more verbose log. What are we trying to
> > optimize for here?
>
> All I'm saying that three printfs in a row that essentially convey the same info
> look like clowntown. Some level of verbosity is certainly useful.

But they don't convey the same information! One is error clearly
stating "failed to match with error". Another one is informational
debug-only "we tried this candidate, it didn't match, but it's not an
error". The other one "yay, we successfully found this candidate".
Three extremely different outcomes.

Given I find it hard to understand what's the exact reason you don't
like this, it's hard to guess the approach that you'll find
acceptable, but I'll try one more time. How about this:

- emit "We are trying this spec/this candidate" before we try (at
DEBUG log level, though);
- then depending on the outcome, emit corresponding "cand #%d:
failed", "cand #%d: not a match", or "cand #%d: a match".

This is very subpar, especially given that on error we might not have
spec/candidate details, because they are DEBUG-only, but still better
than nothing.

But, honestly, at this point I'm ready to give up and remove all that,
I'll just keep adding it back locally when reproducing or debugging
some issue in the future, which is PITA, but nothing I can't live
with.
>

^ permalink raw reply

* Re: [net-next 00/11][pull request] 100GbE Intel Wired LAN Driver Updates 2019-08-01
From: David Miller @ 2019-08-03 17:40 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20190801222548.15975-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu,  1 Aug 2019 15:25:37 -0700

> This series for fm10k, by Jake Keller, reduces the scope of local variables
> where possible.
> 
> The following are changes since commit a8e600e2184f45c40025cbe4d7e8893b69378a9f:
>   Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 100GbE

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH net-next 00/15] net: Add functional tests for L3 and L4
From: David Miller @ 2019-08-03 17:42 UTC (permalink / raw)
  To: dsahern; +Cc: netdev, dsahern
In-Reply-To: <20190801185648.27653-1-dsahern@kernel.org>

From: David Ahern <dsahern@kernel.org>
Date: Thu,  1 Aug 2019 11:56:33 -0700

> From: David Ahern <dsahern@gmail.com>
> 
> This is a port the functional test cases created during the development
> of the VRF feature. It covers various permutations of icmp, tcp and udp
> for IPv4 and IPv6 including negative tests.

Series applied, thanks David.

Please work with Alexei to make the server/client startup work better in
a CI environment since frankly that's the most important situation where
all of these tests are going to run now that these are in the tree.

Thanks.

^ permalink raw reply

* Re: [net-next 0/9][pull request] 40GbE Intel Wired LAN Driver Updates 2019-08-01
From: David Miller @ 2019-08-03 17:46 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20190801205149.4114-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Thu,  1 Aug 2019 13:51:40 -0700

> This series contains updates to i40e driver only.

Patch #2 seems to need some changes, so I'll wait for the next respin
of this pull request.

^ permalink raw reply

* Re: [PATCH] net: sctp: Rename fallthrough label to unhandled
From: Joe Perches @ 2019-08-03 18:01 UTC (permalink / raw)
  To: David Miller
  Cc: nhorman, vyasevich, marcelo.leitner, linux-sctp, netdev,
	linux-kernel
In-Reply-To: <20190802.161932.1776993765494484851.davem@davemloft.net>

On Fri, 2019-08-02 at 16:19 -0700, David Miller wrote:
> From: Joe Perches <joe@perches.com>
> Date: Fri, 02 Aug 2019 10:47:34 -0700
> 
> > On Wed, 2019-07-31 at 08:16 -0400, Neil Horman wrote:
> >> On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> >> > On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> >> > > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> >> > > > fallthrough may become a pseudo reserved keyword so this only use of
> >> > > > fallthrough is better renamed to allow it.
> > 
> > Can you or any other maintainer apply this patch
> > or ack it so David Miller can apply it?
> 
> I, like others, don't like the lack of __ in the keyword.  It's kind of
> rediculous the problems it creates to pollute the global namespace like
> that and yes also inconsistent with other shorthands for builtins.

Rejected?

I think that's inappropriate.

As coded, it's nothing like a fallthrough and
the rename to unhandled is more descriptive.



^ permalink raw reply

* Re: [PATCH v3] net/socket: fix GCC8+ Wpacked-not-aligned warnings
From: David Miller @ 2019-08-03 18:03 UTC (permalink / raw)
  To: cai
  Cc: vyasevich, nhorman, marcelo.leitner, David.Laight, linux-sctp,
	netdev, linux-kernel
In-Reply-To: <1564500633-7419-1-git-send-email-cai@lca.pw>

From: Qian Cai <cai@lca.pw>
Date: Tue, 30 Jul 2019 11:30:33 -0400

> There are a lot of those warnings with GCC8+ 64-bit,
 ...
> This is because the commit 20c9c825b12f ("[SCTP] Fix SCTP socket options
> to work with 32-bit apps on 64-bit kernels.") added "packed, aligned(4)"
> GCC attributes to some structures but one of the members, i.e, "struct
> sockaddr_storage" in those structures has the attribute,
> "aligned(__alignof__ (struct sockaddr *)" which is 8-byte on 64-bit
> systems, so the commit overwrites the designed alignments for
> "sockaddr_storage".
> 
> To fix this, "struct sockaddr_storage" needs to be aligned to 4-byte as
> it is only used in those packed sctp structure which is part of UAPI,
> and "struct __kernel_sockaddr_storage" is used in some other
> places of UAPI that need not to change alignments in order to not
> breaking userspace.
> 
> Use an implicit alignment for "struct __kernel_sockaddr_storage" so it
> can keep the same alignments as a member in both packed and un-packed
> structures without breaking UAPI.
> 
> Suggested-by: David Laight <David.Laight@ACULAB.COM>
> Signed-off-by: Qian Cai <cai@lca.pw>

Ok, this just changes how the alignment is declared from using the
aligned attribute to using a union member which requires the same
alignment.

Applied.

^ permalink raw reply

* Re: [PATCH v2] net: mdio-octeon: Fix Kconfig warnings and build errors
From: Randy Dunlap @ 2019-08-03 18:34 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: andrew, broonie, davem, devel, f.fainelli, gregkh, hkallweit1,
	kernel-build-reports, linux-arm-kernel, linux-next, lkp, netdev,
	willy
In-Reply-To: <20190803060155.89548-1-natechancellor@gmail.com>

On 8/2/19 11:01 PM, Nathan Chancellor wrote:
> After commit 171a9bae68c7 ("staging/octeon: Allow test build on
> !MIPS"), the following combination of configs cause a few Kconfig
> warnings and build errors (distilled from arm allyesconfig and Randy's
> randconfig builds):
> 
>     CONFIG_NETDEVICES=y
>     CONFIG_STAGING=y
>     CONFIG_COMPILE_TEST=y
> 
> and CONFIG_OCTEON_ETHERNET as either a module or built-in.
> 
> WARNING: unmet direct dependencies detected for MDIO_OCTEON
>   Depends on [n]: NETDEVICES [=y] && MDIO_DEVICE [=y] && MDIO_BUS [=y]
> && 64BIT [=n] && HAS_IOMEM [=y] && OF_MDIO [=n]
>   Selected by [y]:
>   - OCTEON_ETHERNET [=y] && STAGING [=y] && (CAVIUM_OCTEON_SOC ||
> COMPILE_TEST [=y]) && NETDEVICES [=y]
> 
> In file included from ../drivers/net/phy/mdio-octeon.c:14:
> ../drivers/net/phy/mdio-cavium.h:111:36: error: implicit declaration of
> function ‘writeq’; did you mean ‘writel’?
> [-Werror=implicit-function-declaration]
>   111 | #define oct_mdio_writeq(val, addr) writeq(val, (void *)addr)
>       |                                    ^~~~~~
> 
> CONFIG_64BIT is not strictly necessary if the proper readq/writeq
> definitions are included from io-64-nonatomic-lo-hi.h.
> 
> CONFIG_OF_MDIO is not needed when CONFIG_COMPILE_TEST is enabled because
> of commit f9dc9ac51610 ("of/mdio: Add dummy functions in of_mdio.h.").
> 
> Fixes: 171a9bae68c7 ("staging/octeon: Allow test build on !MIPS")
> Reported-by: kbuild test robot <lkp@intel.com>
> Reported-by: Mark Brown <broonie@kernel.org>
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>

Works for me. Fixes the reported build errors.  Thanks.

Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested

> ---
> 
> v1 -> v2:
> 
> * Address Randy's reported failure here: https://lore.kernel.org/netdev/b3444283-7a77-ece8-7ac6-41756aa7dc60@infradead.org/
>   by not requiring CONFIG_OF_MDIO when CONFIG_COMPILE_TEST is set.
> 
> * Improve commit message
> 
>  drivers/net/phy/Kconfig       | 4 ++--
>  drivers/net/phy/mdio-cavium.h | 2 ++
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
> index 20f14c5fbb7e..0e3d9e3d3533 100644
> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -159,8 +159,8 @@ config MDIO_MSCC_MIIM
>  
>  config MDIO_OCTEON
>  	tristate "Octeon and some ThunderX SOCs MDIO buses"
> -	depends on 64BIT
> -	depends on HAS_IOMEM && OF_MDIO
> +	depends on (64BIT && OF_MDIO) || COMPILE_TEST
> +	depends on HAS_IOMEM
>  	select MDIO_CAVIUM
>  	help
>  	  This module provides a driver for the Octeon and ThunderX MDIO
> diff --git a/drivers/net/phy/mdio-cavium.h b/drivers/net/phy/mdio-cavium.h
> index ed5f9bb5448d..b7f89ad27465 100644
> --- a/drivers/net/phy/mdio-cavium.h
> +++ b/drivers/net/phy/mdio-cavium.h
> @@ -108,6 +108,8 @@ static inline u64 oct_mdio_readq(u64 addr)
>  	return cvmx_read_csr(addr);
>  }
>  #else
> +#include <linux/io-64-nonatomic-lo-hi.h>
> +
>  #define oct_mdio_writeq(val, addr)	writeq(val, (void *)addr)
>  #define oct_mdio_readq(addr)		readq((void *)addr)
>  #endif
> 


-- 
~Randy

^ permalink raw reply

* Re: [PATCH 06/34] drm/i915: convert put_page() to put_user_page*()
From: John Hubbard @ 2019-08-03 20:03 UTC (permalink / raw)
  To: Joonas Lahtinen, Andrew Morton, john.hubbard
  Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
	LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
	kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, Jani Nikula, Rodrigo Vivi, David Airlie
In-Reply-To: <7d9a9c57-4322-270b-b636-7214019f87e9@nvidia.com>

On 8/2/19 11:48 AM, John Hubbard wrote:
> On 8/2/19 2:19 AM, Joonas Lahtinen wrote:
>> Quoting john.hubbard@gmail.com (2019-08-02 05:19:37)
>>> From: John Hubbard <jhubbard@nvidia.com>
...
> In order to deal with the merge problem, I'll drop this patch from my series,
> and I'd recommend that the drm-intel-next take the following approach:

Actually, I just pulled the latest linux.git, and there are a few changes:

> 
> 1) For now, s/put_page/put_user_page/ in i915_gem_userptr_put_pages(),
> and fix up the set_page_dirty() --> set_page_dirty_lock() issue, like this
> (based against linux.git):
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> index 528b61678334..94721cc0093b 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> @@ -664,10 +664,10 @@ i915_gem_userptr_put_pages(struct drm_i915_gem_object *obj,
> 
>         for_each_sgt_page(page, sgt_iter, pages) {
>                 if (obj->mm.dirty)
> -                       set_page_dirty(page);
> +                       set_page_dirty_lock(page);

I see you've already applied this fix to your tree, in linux.git already.

> 
>                 mark_page_accessed(page);
> -               put_page(page);
> +               put_user_page(page);

But this conversion still needs doing. So I'll repost a patch that only does 
this (plus the other call sites). 

That can go in via either your tree, or Andrew's -mm tree, without generating
any conflicts.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-03 21:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190802172418.GB11245@ziepe.ca>

On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote:
> > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > > synchronize_rcu.
> > > > 
> > > > 
> > > > I start with synchronize_rcu() but both you and Michael raise some
> > > > concern.
> > > 
> > > I've also idly wondered if calling synchronize_rcu() under the various
> > > mm locks is a deadlock situation.
> > > 
> > > > Then I try spinlock and mutex:
> > > > 
> > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> > > > improvement.
> > > 
> > > I think the topic here is correctness not performance improvement
> > 
> > The topic is whether we should revert
> > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
> > 
> > or keep it in. The only reason to keep it is performance.
> 
> Yikes, I'm not sure you can ever win against copy_from_user using
> mmu_notifiers?

Ever since copy_from_user started playing with flags (for SMAP) and
added speculation barriers there's a chance we can win by accessing
memory through the kernel address.


Another reason would be to access it from e.g. softirq
context. copy_from_user will only work if the
correct mmu is active.


> The synchronization requirements are likely always
> more expensive unless large and scattered copies are being done..
> 
> The rcu is about the only simple approach that could be less
> expensive, and that gets back to the question if you can block an
> invalidate_start_range in synchronize_rcu or not..
> 
> So, frankly, I'd revert it until someone could prove the rcu solution is
> OK..

I have it all disabled at compile time, so reverting isn't urgent
anymore. I'll wait a couple more days to decide what's cleanest.

> BTW, how do you get copy_from_user to work outside a syscall?

By switching to the correct mm.

> 
> Also, why can't this just permanently GUP the pages? In fact, where
> does it put_page them anyhow? Worrying that 7f466 adds a get_user page
> but does not add a put_page??
> 
> Jason

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-03 21:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg,
	Paul E. McKenney
In-Reply-To: <130386548.6222676.1564646773879.JavaMail.zimbra@redhat.com>

On Thu, Aug 01, 2019 at 04:06:13AM -0400, Jason Wang wrote:
> On 2019/8/1 上午2:29, Michael S. Tsirkin wrote:
> > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> >> We used to use RCU to synchronize MMU notifier with worker. This leads
> >> calling synchronize_rcu() in invalidate_range_start(). But on a busy
> >> system, there would be many factors that may slow down the
> >> synchronize_rcu() which makes it unsuitable to be called in MMU
> >> notifier.
> >>
> >> A solution is SRCU but its overhead is obvious with the expensive full
> >> memory barrier. Another choice is to use seqlock, but it doesn't
> >> provide a synchronization method between readers and writers. The last
> >> choice is to use vq mutex, but it need to deal with the worst case
> >> that MMU notifier must be blocked and wait for the finish of swap in.
> >>
> >> So this patch switches use a counter to track whether or not the map
> >> was used. The counter was increased when vq try to start or finish
> >> uses the map. This means, when it was even, we're sure there's no
> >> readers and MMU notifier is synchronized. When it was odd, it means
> >> there's a reader we need to wait it to be even again then we are
> >> synchronized. To avoid full memory barrier, store_release +
> >> load_acquire on the counter is used.
> >
> > Unfortunately this needs a lot of review and testing, so this can't make
> > rc2, and I don't think this is the kind of patch I can merge after rc3.
> > Subtle memory barrier tricks like this can introduce new bugs while they
> > are fixing old ones.
> 
> I admit the patch is tricky. Some questions:
> 
> - Do we must address the case of e.g swap in? If not, a simple
>   vhost_work_flush() instead of synchronize_rcu() may work.
> - Having some hard thought, I think we can use seqlock, it looks
>   to me smp_wmb() is in write_segcount_begin() is sufficient, we don't
>   care vq->map read before smp_wmb(), and for the other we all have
>   good data devendency so smp_wmb() in the write_seqbegin_end() is
>   sufficient.

If we need an mb in the begin() we can switch to
dependent_ptr_mb. if you need me to fix it up
and repost, let me know.

Why isn't it a problem if the map is
accessed outside the lock?



> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index db2c81cb1e90..6d9501303258 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>  
>  static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>  {
> -	int ref = READ_ONCE(vq->ref);
> -
> -	smp_store_release(&vq->ref, ref + 1);
> -	/* Make sure ref counter is visible before accessing the map */
> -	smp_load_acquire(&vq->ref);
> +	write_seqcount_begin(&vq->seq);
>  }
>  
>  static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>  {
> -	int ref = READ_ONCE(vq->ref);
> -
> -	/* Make sure vq access is done before increasing ref counter */
> -	smp_store_release(&vq->ref, ref + 1);
> +	write_seqcount_end(&vq->seq);
>  }
>  
>  static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>  {
> -	int ref;
> +	unsigned int ret;
>  
>  	/* Make sure map change was done before checking ref counter */
>  	smp_mb();
> -
> -	ref = READ_ONCE(vq->ref);
> -	if (ref & 0x1) {
> -		/* When ref change, we are sure no reader can see
> +	ret = raw_read_seqcount(&vq->seq);
> +	if (ret & 0x1) {
> +		/* When seq changes, we are sure no reader can see
>  		 * previous map */
> -		while (READ_ONCE(vq->ref) == ref) {
> -			set_current_state(TASK_RUNNING);
> +		while (raw_read_seqcount(&vq->seq) == ret)
>  			schedule();


So why do we set state here? And should not we
check need_sched?


> -		}
>  	}
> -	/* Make sure ref counter was checked before any other
> -	 * operations that was dene on map. */
> +	/* Make sure seq was checked before any other operations that
> +	 * was dene on map. */
>  	smp_mb();
>  }
>  
> @@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->indirect = NULL;
>  		vq->heads = NULL;
>  		vq->dev = dev;
> -		vq->ref = 0;
> +		seqcount_init(&vq->seq);
>  		mutex_init(&vq->mutex);
>  		spin_lock_init(&vq->mmu_lock);
>  		vhost_vq_reset(dev, vq);
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 3d10da0ae511..1a705e181a84 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -125,7 +125,7 @@ struct vhost_virtqueue {
>  	 */
>  	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>  #endif
> -	int ref;
> +	seqcount_t seq;
>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>  
>  	struct file *kick;
> -- 
> 2.18.1
> 
> >
> >
> >
> >
> >
> >>
> >> Consider the read critical section is pretty small the synchronization
> >> should be done very fast.
> >>
> >> Note the patch lead about 3% PPS dropping.
> >
> > Sorry what do you mean by this last sentence? This degrades performance
> > compared to what?
> 
> Compare to without this patch.

OK is the feature still a performance win? or should we drop it for now?

> >
> >>
> >> Reported-by: Michael S. Tsirkin <mst@redhat.com>
> >> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
> >> Signed-off-by: Jason Wang <jasowang@redhat.com>
> >> ---
> >>  drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
> >>  drivers/vhost/vhost.h |   7 +-
> >>  2 files changed, 94 insertions(+), 58 deletions(-)
> >>
> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >> index cfc11f9ed9c9..db2c81cb1e90 100644
> >> --- a/drivers/vhost/vhost.c
> >> +++ b/drivers/vhost/vhost.c
> >> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> >>  
> >>  	spin_lock(&vq->mmu_lock);
> >>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> >> -		map[i] = rcu_dereference_protected(vq->maps[i],
> >> -				  lockdep_is_held(&vq->mmu_lock));
> >> +		map[i] = vq->maps[i];
> >>  		if (map[i]) {
> >>  			vhost_set_map_dirty(vq, map[i], i);
> >> -			rcu_assign_pointer(vq->maps[i], NULL);
> >> +			vq->maps[i] = NULL;
> >>  		}
> >>  	}
> >>  	spin_unlock(&vq->mmu_lock);
> >>  
> >> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
> >> -	 * serialized with memory accessors (e.g vq mutex held).
> >> +	/* No need for synchronization since we are serialized with
> >> +	 * memory accessors (e.g vq mutex held).
> >>  	 */
> >>  
> >>  	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> >> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> >>  	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> >>  }
> >>  
> >> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> >> +{
> >> +	int ref = READ_ONCE(vq->ref);
> >> +
> >> +	smp_store_release(&vq->ref, ref + 1);
> >> +	/* Make sure ref counter is visible before accessing the map */
> >> +	smp_load_acquire(&vq->ref);
> >
> > The map access is after this sequence, correct?
> 
> Yes.
> 
> >
> > Just going by the rules in Documentation/memory-barriers.txt,
> > I think that this pair will not order following accesses with ref store.
> >
> > Documentation/memory-barriers.txt says:
> >
> >
> > +     In addition, a RELEASE+ACQUIRE
> > +     pair is -not- guaranteed to act as a full memory barrier.
> >
> >
> >
> > The guarantee that is made is this:
> > 	after
> >      an ACQUIRE on a given variable, all memory accesses preceding any prior
> >      RELEASE on that same variable are guaranteed to be visible.
> 
> Yes, but it's not clear about the order of ACQUIRE the same location
> of previous RELEASE. And it only has a example like:
> 
> "
> 	*A = a;
> 	RELEASE M
> 	ACQUIRE N
> 	*B = b;
> 
> could occur as:
> 
> 	ACQUIRE N, STORE *B, STORE *A, RELEASE M
> "
> 
> But it doesn't explain what happen when
> 
> *A = a
> RELEASE M
> ACQUIRE M
> *B = b;
> 
> And tools/memory-model/Documentation said
> 
> "
> First, when a lock-acquire reads from a lock-release, the LKMM
> requires that every instruction po-before the lock-release must
> execute before any instruction po-after the lock-acquire.
> "
> 
> Is this a hint that I was correct?

I don't think it's correct since by this logic
memory barriers can be nops on x86.

> >
> >
> > And if we also had the reverse rule we'd end up with a full barrier,
> > won't we?
> >
> > Cc Paul in case I missed something here. And if I'm right,
> > maybe we should call this out, adding
> >
> > 	"The opposite is not true: a prior RELEASE is not
> > 	 guaranteed to be visible before memory accesses following
> > 	 the subsequent ACQUIRE".
> 
> That kinds of violates the RELEASE?
> 
> "
>      This also acts as a one-way permeable barrier.  It guarantees that all
>      memory operations before the RELEASE operation will appear to happen
>      before the RELEASE operation with respect to the other components of the
> "


yes but we are talking about RELEASE itself versus stuff
that comes after it.

> >
> >
> >
> >> +}
> >> +
> >> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> >> +{
> >> +	int ref = READ_ONCE(vq->ref);
> >> +
> >> +	/* Make sure vq access is done before increasing ref counter */
> >> +	smp_store_release(&vq->ref, ref + 1);
> >> +}
> >> +
> >> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
> >> +{
> >> +	int ref;
> >> +
> >> +	/* Make sure map change was done before checking ref counter */
> >> +	smp_mb();
> >> +
> >> +	ref = READ_ONCE(vq->ref);
> >> +	if (ref & 0x1) {
> >
> > Please document the even/odd trick here too, not just in the commit log.
> >
> 
> Ok.
> 
> >> +		/* When ref change,
> >
> > changes
> >
> >> we are sure no reader can see
> >> +		 * previous map */
> >> +		while (READ_ONCE(vq->ref) == ref) {
> >
> >
> > what is the below line in aid of?
> >
> >> +			set_current_state(TASK_RUNNING);

any answers here?

> >> +			schedule();
> >
> >                         if (need_resched())
> >                                 schedule();
> >
> > ?
> 
> Yes, better.
> 
> >
> >> +		}
> >
> > On an interruptible kernel, there's a risk here is that
> > a task got preempted with an odd ref.
> > So I suspect we'll have to disable preemption when we
> > make ref odd.
> 
> I'm not sure I get, if the odd is not the original value we read,
> we're sure it won't read the new map here I believe.

But we will spin for a very long time in this case.

> >
> >
> >> +	}
> >> +	/* Make sure ref counter was checked before any other
> >> +	 * operations that was dene on map. */
> >
> > was dene -> were done?
> >
> 
> Yes.
> 
> >> +	smp_mb();
> >> +}
> >> +
> >>  static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> >>  				      int index,
> >>  				      unsigned long start,
> >> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> >>  	spin_lock(&vq->mmu_lock);
> >>  	++vq->invalidate_count;
> >>  
> >> -	map = rcu_dereference_protected(vq->maps[index],
> >> -					lockdep_is_held(&vq->mmu_lock));
> >> +	map = vq->maps[index];
> >>  	if (map) {
> >>  		vhost_set_map_dirty(vq, map, index);
> >> -		rcu_assign_pointer(vq->maps[index], NULL);
> >> +		vq->maps[index] = NULL;
> >>  	}
> >>  	spin_unlock(&vq->mmu_lock);
> >>  
> >>  	if (map) {
> >> -		synchronize_rcu();
> >> +		vhost_vq_sync_access(vq);
> >>  		vhost_map_unprefetch(map);
> >>  	}
> >>  }
> >> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
> >>  	for (i = 0; i < dev->nvqs; ++i) {
> >>  		vq = dev->vqs[i];
> >>  		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> >> -			RCU_INIT_POINTER(vq->maps[j], NULL);
> >> +			vq->maps[j] = NULL;
> >>  	}
> >>  }
> >>  #endif
> >> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> >>  		vq->indirect = NULL;
> >>  		vq->heads = NULL;
> >>  		vq->dev = dev;
> >> +		vq->ref = 0;
> >>  		mutex_init(&vq->mutex);
> >>  		spin_lock_init(&vq->mmu_lock);
> >>  		vhost_vq_reset(dev, vq);
> >> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> >>  	map->npages = npages;
> >>  	map->pages = pages;
> >>  
> >> -	rcu_assign_pointer(vq->maps[index], map);
> >> +	vq->maps[index] = map;
> >>  	/* No need for a synchronize_rcu(). This function should be
> >>  	 * called by dev->worker so we are serialized with all
> >>  	 * readers.
> >> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> >>  	struct vring_used *used;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			*((__virtio16 *)&used->ring[vq->num]) =
> >>  				cpu_to_vhost16(vq, vq->avail_idx);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> >>  	size_t size;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			size = count * sizeof(*head);
> >>  			memcpy(used->ring + idx, head, size);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> >>  	struct vring_used *used;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> >>  	struct vring_used *used;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> >>  	struct vring_avail *avail;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> >> +		map = vq->maps[VHOST_ADDR_AVAIL];
> >>  		if (likely(map)) {
> >>  			avail = map->addr;
> >>  			*idx = avail->idx;
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> >>  	struct vring_avail *avail;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> >> +		map = vq->maps[VHOST_ADDR_AVAIL];
> >>  		if (likely(map)) {
> >>  			avail = map->addr;
> >>  			*head = avail->ring[idx & (vq->num - 1)];
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> >>  	struct vring_avail *avail;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> >> +		map = vq->maps[VHOST_ADDR_AVAIL];
> >>  		if (likely(map)) {
> >>  			avail = map->addr;
> >>  			*flags = avail->flags;
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> >>  	struct vring_avail *avail;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
> >> +		vhost_vq_access_map_begin(vq);
> >> +		map = vq->maps[VHOST_ADDR_AVAIL];
> >>  		if (likely(map)) {
> >>  			avail = map->addr;
> >>  			*event = (__virtio16)avail->ring[vq->num];
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> >>  	struct vring_used *used;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
> >> +		map = vq->maps[VHOST_ADDR_USED];
> >>  		if (likely(map)) {
> >>  			used = map->addr;
> >>  			*idx = used->idx;
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> >>  	struct vring_desc *d;
> >>  
> >>  	if (!vq->iotlb) {
> >> -		rcu_read_lock();
> >> +		vhost_vq_access_map_begin(vq);
> >>  
> >> -		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
> >> +		map = vq->maps[VHOST_ADDR_DESC];
> >>  		if (likely(map)) {
> >>  			d = map->addr;
> >>  			*desc = *(d + idx);
> >> -			rcu_read_unlock();
> >> +			vhost_vq_access_map_end(vq);
> >>  			return 0;
> >>  		}
> >>  
> >> -		rcu_read_unlock();
> >> +		vhost_vq_access_map_end(vq);
> >>  	}
> >>  #endif
> >>  
> >> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
> >>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
> >>  static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
> >>  {
> >> -	struct vhost_map __rcu *map;
> >> +	struct vhost_map *map;
> >>  	int i;
> >>  
> >>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> >> -		rcu_read_lock();
> >> -		map = rcu_dereference(vq->maps[i]);
> >> -		rcu_read_unlock();
> >> +		map = vq->maps[i];
> >>  		if (unlikely(!map))
> >>  			vhost_map_prefetch(vq, i);
> >>  	}
> >> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> >> index a9a2a93857d2..f9e9558a529d 100644
> >> --- a/drivers/vhost/vhost.h
> >> +++ b/drivers/vhost/vhost.h
> >> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
> >>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
> >>  	/* Read by memory accessors, modified by meta data
> >>  	 * prefetching, MMU notifier and vring ioctl().
> >> -	 * Synchonrized through mmu_lock (writers) and RCU (writers
> >> -	 * and readers).
> >> +	 * Synchonrized through mmu_lock (writers) and ref counters,
> >> +	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
> >>  	 */
> >> -	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
> >> +	struct vhost_map *maps[VHOST_NUM_ADDRS];
> >>  	/* Read by MMU notifier, modified by vring ioctl(),
> >>  	 * synchronized through MMU notifier
> >>  	 * registering/unregistering.
> >>  	 */
> >>  	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
> >>  #endif
> >> +	int ref;
> >
> > Is it important that this is signed? If not I'd do unsigned here:
> > even though kernel does compile with 2s complement sign overflow,
> > it seems cleaner not to depend on that.
> 
> Not a must, let me fix.
> 
> Thanks
> 
> >
> >>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
> >>  
> >>  	struct file *kick;
> >> -- 
> >> 2.18.1

^ permalink raw reply

* Re: [patch 1/1] drivers/net/ethernet/marvell/mvmdio.c: Fix non OF case
From: Andrew Lunn @ 2019-08-03 22:22 UTC (permalink / raw)
  To: Arnaud Patard; +Cc: netdev
In-Reply-To: <20190802083310.772136040@rtp-net.org>

On Fri, Aug 02, 2019 at 10:32:40AM +0200, Arnaud Patard wrote:
> Orion5.x systems are still using machine files and not device-tree.
> Commit 96cb4342382290c9 ("net: mvmdio: allow up to three clocks to be
> specified for orion-mdio") has replaced devm_clk_get() with of_clk_get(),
> leading to a oops at boot and not working network, as reported in 
> https://lists.debian.org/debian-arm/2019/07/msg00088.html and possibly in
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908712.
> 
> Link: https://lists.debian.org/debian-arm/2019/07/msg00088.html
> Fixes: 96cb4342382290c9 ("net: mvmdio: allow up to three clocks to be specified for orion-mdio")
> Signed-off-by: Arnaud Patard <arnaud.patard@rtp-net.org>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: Print error when ovs_execute_actions() fails
From: Pravin Shelar @ 2019-08-03 23:01 UTC (permalink / raw)
  To: Yifeng Sun; +Cc: Linux Kernel Network Developers
In-Reply-To: <1564694047-4859-1-git-send-email-pkusunyifeng@gmail.com>

On Thu, Aug 1, 2019 at 2:14 PM Yifeng Sun <pkusunyifeng@gmail.com> wrote:
>
> Currently in function ovs_dp_process_packet(), return values of
> ovs_execute_actions() are silently discarded. This patch prints out
> an error message when error happens so as to provide helpful hints
> for debugging.
>
> Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
> ---
>  net/openvswitch/datapath.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index 892287d..603c533 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -222,6 +222,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
>         struct dp_stats_percpu *stats;
>         u64 *stats_counter;
>         u32 n_mask_hit;
> +       int error;
>
>         stats = this_cpu_ptr(dp->stats_percpu);
>
> @@ -229,7 +230,6 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
>         flow = ovs_flow_tbl_lookup_stats(&dp->table, key, &n_mask_hit);
>         if (unlikely(!flow)) {
>                 struct dp_upcall_info upcall;
> -               int error;
>
>                 memset(&upcall, 0, sizeof(upcall));
>                 upcall.cmd = OVS_PACKET_CMD_MISS;
> @@ -246,7 +246,10 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
>
>         ovs_flow_stats_update(flow, key->tp.flags, skb);
>         sf_acts = rcu_dereference(flow->sf_acts);
> -       ovs_execute_actions(dp, skb, sf_acts, key);
> +       error = ovs_execute_actions(dp, skb, sf_acts, key);
> +       if (unlikely(error))
> +               net_err_ratelimited("ovs: action execution error on datapath %s: %d\n",
> +                                                       ovs_dp_name(dp), error);
>

I would rather add error counter for better visibility.
If you want to use current approach, can you use net_dbg_ratelimited()
since you want to use this for debugging purpose?

Thanks.

^ permalink raw reply

* [PATCH net-next 0/2] Two small fq_codel optimizations
From: Dave Taht @ 2019-08-03 23:37 UTC (permalink / raw)
  To: netdev; +Cc: Dave Taht

These two patches improve fq_codel performance 
under extreme network loads. The first patch 
more rapidly escalates the codel count under
overload, the second just kills a totally useless
statistic. 

(sent together because they'd otherwise conflict)

Signed-off-by: Dave Taht <dave.taht@gmail.com>

Dave Taht (2):
  Increase fq_codel count in the bulk dropper
  fq_codel: Kill useless per-flow dropped statistic

 net/sched/sch_fq_codel.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

-- 
2.17.1


^ permalink raw reply

* [PATCH net-next 1/2] Increase fq_codel count in the bulk dropper
From: Dave Taht @ 2019-08-03 23:37 UTC (permalink / raw)
  To: netdev; +Cc: Dave Taht
In-Reply-To: <1564875449-12122-1-git-send-email-dave.taht@gmail.com>

In the field fq_codel is often used with a smaller memory or
packet limit than the default, and when the bulk dropper is hit,
the drop pattern bifircates into one that more slowly increases
the codel drop rate and hits the bulk dropper more than it should.

The scan through the 1024 queues happens more often than it needs to.

This patch increases the codel count in the bulk dropper, but
does not change the drop rate there, relying on the next codel round
to deliver the next packet at the original drop rate
(after that burst of loss), then escalate to a higher signaling rate.

Signed-off-by: Dave Taht <dave.taht@gmail.com>

---
 net/sched/sch_fq_codel.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index d59fbcc745d1..d67b2c40e6e6 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -173,6 +173,8 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
 		__qdisc_drop(skb, to_free);
 	} while (++i < max_packets && len < threshold);
 
+	/* Tell codel to increase its signal strength also */
+	flow->cvars.count += i;
 	flow->dropped += i;
 	q->backlogs[idx] -= len;
 	q->memory_usage -= mem;
-- 
2.17.1


^ permalink raw reply related

* [PATCH net-next 2/2] fq_codel: Kill useless per-flow dropped statistic
From: Dave Taht @ 2019-08-03 23:37 UTC (permalink / raw)
  To: netdev; +Cc: Dave Taht
In-Reply-To: <1564875449-12122-1-git-send-email-dave.taht@gmail.com>

It is almost impossible to get anything other than a 0 out of
flow->dropped statistic with a tc class dump, as it resets to 0
on every round.

It also conflates ecn marks with drops.

It would have been useful had it kept a cumulative drop count, but
it doesn't. This patch doesn't change the API, it just stops
tracking a stat and state that is impossible to measure and nobody
uses.

Signed-off-by: Dave Taht <dave.taht@gmail.com>

---
 net/sched/sch_fq_codel.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index d67b2c40e6e6..9edd0f495001 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -45,7 +45,6 @@ struct fq_codel_flow {
 	struct sk_buff	  *tail;
 	struct list_head  flowchain;
 	int		  deficit;
-	u32		  dropped; /* number of drops (or ECN marks) on this flow */
 	struct codel_vars cvars;
 }; /* please try to keep this structure <= 64 bytes */
 
@@ -175,7 +174,6 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
 
 	/* Tell codel to increase its signal strength also */
 	flow->cvars.count += i;
-	flow->dropped += i;
 	q->backlogs[idx] -= len;
 	q->memory_usage -= mem;
 	sch->qstats.drops += i;
@@ -213,7 +211,6 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		list_add_tail(&flow->flowchain, &q->new_flows);
 		q->new_flow_count++;
 		flow->deficit = q->quantum;
-		flow->dropped = 0;
 	}
 	get_codel_cb(skb)->mem_usage = skb->truesize;
 	q->memory_usage += get_codel_cb(skb)->mem_usage;
@@ -312,9 +309,6 @@ static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
 			    &flow->cvars, &q->cstats, qdisc_pkt_len,
 			    codel_get_enqueue_time, drop_func, dequeue_func);
 
-	flow->dropped += q->cstats.drop_count - prev_drop_count;
-	flow->dropped += q->cstats.ecn_mark - prev_ecn_mark;
-
 	if (!skb) {
 		/* force a pass through old_flows to prevent starvation */
 		if ((head == &q->new_flows) && !list_empty(&q->old_flows))
@@ -660,7 +654,7 @@ static int fq_codel_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 			sch_tree_unlock(sch);
 		}
 		qs.backlog = q->backlogs[idx];
-		qs.drops = flow->dropped;
+		qs.drops = 0;
 	}
 	if (gnet_stats_copy_queue(d, NULL, &qs, qs.qlen) < 0)
 		return -1;
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Gunthorpe @ 2019-08-04  0:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190803172944-mutt-send-email-mst@kernel.org>

On Sat, Aug 03, 2019 at 05:36:13PM -0400, Michael S. Tsirkin wrote:
> On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > > > synchronize_rcu.
> > > > > 
> > > > > 
> > > > > I start with synchronize_rcu() but both you and Michael raise some
> > > > > concern.
> > > > 
> > > > I've also idly wondered if calling synchronize_rcu() under the various
> > > > mm locks is a deadlock situation.
> > > > 
> > > > > Then I try spinlock and mutex:
> > > > > 
> > > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> > > > > improvement.
> > > > 
> > > > I think the topic here is correctness not performance improvement
> > > 
> > > The topic is whether we should revert
> > > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
> > > 
> > > or keep it in. The only reason to keep it is performance.
> > 
> > Yikes, I'm not sure you can ever win against copy_from_user using
> > mmu_notifiers?
> 
> Ever since copy_from_user started playing with flags (for SMAP) and
> added speculation barriers there's a chance we can win by accessing
> memory through the kernel address.

You think copy_to_user will be more expensive than the minimum two
atomics required to synchronize with another thread?

> > Also, why can't this just permanently GUP the pages? In fact, where
> > does it put_page them anyhow? Worrying that 7f466 adds a get_user page
> > but does not add a put_page??

You didn't answer this.. Why not just use GUP?

Jason

^ permalink raw reply

* [PATCH 1/2] Fix a NULL-ptr-deref bug in ath6kl_usb_alloc_urb_from_pipe
From: Hui Peng @ 2019-08-04  0:29 UTC (permalink / raw)
  To: kvalo, davem
  Cc: Hui Peng, Mathias Payer, linux-wireless, netdev, linux-kernel

The `ar_usb` field of `ath6kl_usb_pipe_usb_pipe` objects
are initialized to point to the containing `ath6kl_usb` object
according to endpoint descriptors read from the device side, as shown
below in `ath6kl_usb_setup_pipe_resources`:

for (i = 0; i < iface_desc->desc.bNumEndpoints; ++i) {
	endpoint = &iface_desc->endpoint[i].desc;

	// get the address from endpoint descriptor
	pipe_num = ath6kl_usb_get_logical_pipe_num(ar_usb,
						endpoint->bEndpointAddress,
						&urbcount);
	......
	// select the pipe object
	pipe = &ar_usb->pipes[pipe_num];

	// initialize the ar_usb field
	pipe->ar_usb = ar_usb;
}

The driver assumes that the addresses reported in endpoint
descriptors from device side  to be complete. If a device is
malicious and does not report complete addresses, it may trigger
NULL-ptr-deref `ath6kl_usb_alloc_urb_from_pipe` and
`ath6kl_usb_free_urb_to_pipe`.

This patch fixes the bug by preventing potential NULL-ptr-deref.

Signed-off-by: Hui Peng <benquike@gmail.com>
Reported-by: Hui Peng <benquike@gmail.com>
Reported-by: Mathias Payer <mathias.payer@nebelwelt.net>
---
 drivers/net/wireless/ath/ath6kl/usb.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/wireless/ath/ath6kl/usb.c b/drivers/net/wireless/ath/ath6kl/usb.c
index 4defb7a0330f..53b66e9434c9 100644
--- a/drivers/net/wireless/ath/ath6kl/usb.c
+++ b/drivers/net/wireless/ath/ath6kl/usb.c
@@ -132,6 +132,10 @@ ath6kl_usb_alloc_urb_from_pipe(struct ath6kl_usb_pipe *pipe)
 	struct ath6kl_urb_context *urb_context = NULL;
 	unsigned long flags;
 
+	/* bail if this pipe is not initialized */
+	if (!pipe->ar_usb)
+		return NULL;
+
 	spin_lock_irqsave(&pipe->ar_usb->cs_lock, flags);
 	if (!list_empty(&pipe->urb_list_head)) {
 		urb_context =
@@ -150,6 +154,10 @@ static void ath6kl_usb_free_urb_to_pipe(struct ath6kl_usb_pipe *pipe,
 {
 	unsigned long flags;
 
+	/* bail if this pipe is not initialized */
+	if (!pipe->ar_usb)
+		return;
+
 	spin_lock_irqsave(&pipe->ar_usb->cs_lock, flags);
 	pipe->urb_cnt++;
 
-- 
2.22.0


^ permalink raw reply related

* [PATCH 2/2] Fix a NULL-ptr-deref bug in ath10k_usb_alloc_urb_from_pipe
From: Hui Peng @ 2019-08-04  0:31 UTC (permalink / raw)
  To: kvalo, davem
  Cc: Hui Peng, Mathias Payer, ath10k, linux-wireless, netdev,
	linux-kernel

The `ar_usb` field of `ath10k_usb_pipe_usb_pipe` objects
are initialized to point to the containing `ath10k_usb` object
according to endpoint descriptors read from the device side, as shown
below in `ath10k_usb_setup_pipe_resources`:

for (i = 0; i < iface_desc->desc.bNumEndpoints; ++i) {
        endpoint = &iface_desc->endpoint[i].desc;

        // get the address from endpoint descriptor
        pipe_num = ath10k_usb_get_logical_pipe_num(ar_usb,
                                                endpoint->bEndpointAddress,
                                                &urbcount);
        ......
        // select the pipe object
        pipe = &ar_usb->pipes[pipe_num];

        // initialize the ar_usb field
        pipe->ar_usb = ar_usb;
}

The driver assumes that the addresses reported in endpoint
descriptors from device side  to be complete. If a device is
malicious and does not report complete addresses, it may trigger
NULL-ptr-deref `ath10k_usb_alloc_urb_from_pipe` and
`ath10k_usb_free_urb_to_pipe`.

This patch fixes the bug by preventing potential NULL-ptr-deref.

Signed-off-by: Hui Peng <benquike@gmail.com>
Reported-by: Hui Peng <benquike@gmail.com>
Reported-by: Mathias Payer <mathias.payer@nebelwelt.net>
---
 drivers/net/wireless/ath/ath10k/usb.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/usb.c b/drivers/net/wireless/ath/ath10k/usb.c
index e1420f67f776..14d86627b47f 100644
--- a/drivers/net/wireless/ath/ath10k/usb.c
+++ b/drivers/net/wireless/ath/ath10k/usb.c
@@ -38,6 +38,10 @@ ath10k_usb_alloc_urb_from_pipe(struct ath10k_usb_pipe *pipe)
 	struct ath10k_urb_context *urb_context = NULL;
 	unsigned long flags;
 
+	/* bail if this pipe is not initialized */
+	if (!pipe->ar_usb)
+		return NULL;
+
 	spin_lock_irqsave(&pipe->ar_usb->cs_lock, flags);
 	if (!list_empty(&pipe->urb_list_head)) {
 		urb_context = list_first_entry(&pipe->urb_list_head,
@@ -55,6 +59,10 @@ static void ath10k_usb_free_urb_to_pipe(struct ath10k_usb_pipe *pipe,
 {
 	unsigned long flags;
 
+	/* bail if this pipe is not initialized */
+	if (!pipe->ar_usb)
+		return NULL;
+
 	spin_lock_irqsave(&pipe->ar_usb->cs_lock, flags);
 
 	pipe->urb_cnt++;
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH net-next] net/tls: prevent skb_orphan() from leaking TLS plain text with offload
From: Jakub Kicinski @ 2019-08-04  1:58 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, oss-drivers, edumazet, borisp, aviadye, davejwatson,
	john.fastabend, daniel, willemb, xiyou.wangcong, fw,
	alexei.starovoitov
In-Reply-To: <20190802.172453.1075167824005057182.davem@davemloft.net>

On Fri, 02 Aug 2019 17:24:53 -0700 (PDT), David Miller wrote:
> From: Jakub Kicinski <jakub.kicinski@netronome.com>
> Date: Tue, 30 Jul 2019 14:12:58 -0700
> 
> > I'm sending this for net-next because of lack of confidence
> > in my own abilities. It should apply cleanly to net... :)  
> 
> It looks like there will be changes to this.

Yes, sorry for going silent, I reworked this to follow Boris'es
suggestion of using flags, and wanted a little bit of extra QA.
Unfortunately 5.3-rc has broken Intel IOMMU and QA lab machines
don't boot.. :(

^ permalink raw reply

* Re: [v2,1/2] tools: bpftool: add net attach command to attach XDP on interface
From: Jakub Kicinski @ 2019-08-04  2:13 UTC (permalink / raw)
  To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <CAEKGpzg91CMvq5FnYhAxX7XoA+Sr-+AY3t34q5Q2meG93Ydq9Q@mail.gmail.com>

On Sat, 3 Aug 2019 18:39:21 +0900, Daniel T. Lee wrote:
> On Sat, Aug 3, 2019 at 3:39 AM Jakub Kicinski wrote:
> > Right, I was wondering if we want to call it force, though? force is
> > sort of a reuse of iproute2 concept. But it's kind of hard to come up
> > with names.
> >
> > Just to be sure - I mean something like:
> >
> > bpftool net attach xdp id xyz dev ethN noreplace
> >
> > Rather than:
> >
> > bpftool -f net attach xdp id xyz dev ethN
> >  
> 
> How about a word 'immutable'? 'noreplace' seems good though.
> Just suggesting an option.

Hm. Immutable kind of has a meaning in Linux (check out immutable in
extended file attributes, and CAP_LINUX_IMMUTABLE) which is different
than here.. so I'd avoid that one.

Another option I was thinking about was using the same keywords as maps
do: "noexist" - although we don't have a way of doing "exist"
currently, which is kind of breaking the equivalence.

Or maye we should go the same route as iproute2 after all, and set
noreplace by default?  That way we don't need the negation in the 
name. We could use "overwrite", "replace"?

Naming things... :)

> > > > > +}
> > > > > +
> > > > > +static int do_attach(int argc, char **argv)
> > > > > +{
> > > > > +     enum net_attach_type attach_type;
> > > > > +     int err, progfd, ifindex;
> > > > > +
> > > > > +     err = parse_attach_args(argc, argv, &progfd, &attach_type, &ifindex);
> > > > > +     if (err)
> > > > > +             return err;  
> > > >
> > > > Probably not the best idea to move this out into a helper.  
> > >
> > > Again, just trying to make consistent with 'prog.c'.
> > >
> > > But clearly it has differences with do_attach/detach from 'prog.c'.
> > > From it, it uses the same parse logic 'parse_attach_detach_args' since
> > > the two command 'bpftool prog attach/detach' uses the same argument format.
> > >
> > > However, in here, 'bpftool net' attach and detach requires different number of
> > > argument, so function for parse argument has been defined separately.
> > > The situation is little bit different, but keeping argument parse logic as an
> > > helper, I think it's better in terms of consistency.  
> >
> > Well they won't share the same arguments if you add the keyword for
> > controlling IF_NOEXIST :(
> 
> My apologies, but I think I'm not following with you.
> 
> Currently detach/attach isn't sharing same arguments.
> And even after adding the argument for IF_NOEXIST such as  [ noreplace ],
> it'll stays the same for not sharing same arguments.

Ah, my apologies I misread your message.

> I'm not sure why it is not the best idea to move arg parse logic into a helper.

Output args are kind of ugly, and having to pass each parameter through
output arg seems like something that could get out of hand as the
number grows. 

Usually command handling functions are relatively small and
straightforward in bpftool so it's quite nice to have it all in one
simple procedure.

But I'm not feeling too strongly about it. Feel free to leave the
parsing in separate functions if you prefer!

^ permalink raw reply

* Re: [PATCH net-next v3] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-08-04  4:48 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, Heiner Kallweit, David S . Miller,
	Arun Parameswaran, Justin Chen, netdev, lkml,
	openbmc@lists.ozlabs.org
In-Reply-To: <CA+h21hrOEape89MTqCUyGFt=f6ba7Q-2KcOsN_Vw2Qv8iq86jw@mail.gmail.com>

Hi Vladimir,

On 8/3/19 6:49 AM, Vladimir Oltean wrote:
> Hi Tao,
> 
> On Sat, 3 Aug 2019 at 00:56, Tao Ren <taoren@fb.com> wrote:
>>
>> genphy_read_status() cannot report correct link speed when BCM54616S PHY
>> is configured in RGMII->1000Base-KX mode (for example, on Facebook CMM
>> BMC platform), and it is because speed-related fields in MII registers
>> are assigned different meanings in 1000X register set. Actually there
>> is no speed field in 1000X register set because link speed is always
>> 1000 Mb/s.
>>
>> The patch adds "probe" callback to detect PHY's operation mode based on
>> INTERF_SEL[1:0] pins and 1000X/100FX selection bit in SerDES 100-FX
>> Control register. Besides, link speed is manually set to 1000 Mb/s in
>> "read_status" callback if PHY-switch link is 1000Base-X.
>>
>> Signed-off-by: Tao Ren <taoren@fb.com>
>> ---
>>  Changes in v3:
>>   - rename bcm5482_read_status to bcm54xx_read_status so the callback can
>>     be shared by BCM5482 and BCM54616S.
>>  Changes in v2:
>>   - Auto-detect PHY operation mode instead of passing DT node.
>>   - move PHY mode auto-detect logic from config_init to probe callback.
>>   - only set speed (not including duplex) in read_status callback.
>>   - update patch description with more background to avoid confusion.
>>   - patch #1 in the series ("net: phy: broadcom: set features explicitly
>>     for BCM54616") is dropped: the fix should go to get_features callback
>>     which may potentially depend on this patch.
>>
>>  drivers/net/phy/broadcom.c | 41 +++++++++++++++++++++++++++++++++-----
>>  include/linux/brcmphy.h    | 10 ++++++++--
>>  2 files changed, 44 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/phy/broadcom.c b/drivers/net/phy/broadcom.c
>> index 937d0059e8ac..ecad8a201a09 100644
>> --- a/drivers/net/phy/broadcom.c
>> +++ b/drivers/net/phy/broadcom.c
>> @@ -383,9 +383,9 @@ static int bcm5482_config_init(struct phy_device *phydev)
>>                 /*
>>                  * Select 1000BASE-X register set (primary SerDes)
>>                  */
>> -               reg = bcm_phy_read_shadow(phydev, BCM5482_SHD_MODE);
>> -               bcm_phy_write_shadow(phydev, BCM5482_SHD_MODE,
>> -                                    reg | BCM5482_SHD_MODE_1000BX);
>> +               reg = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
>> +               bcm_phy_write_shadow(phydev, BCM54XX_SHD_MODE,
>> +                                    reg | BCM54XX_SHD_MODE_1000BX);
>>
>>                 /*
>>                  * LED1=ACTIVITYLED, LED3=LINKSPD[2]
>> @@ -409,7 +409,7 @@ static int bcm5482_config_init(struct phy_device *phydev)
>>         return err;
>>  }
>>
>> -static int bcm5482_read_status(struct phy_device *phydev)
>> +static int bcm54xx_read_status(struct phy_device *phydev)
>>  {
>>         int err;
>>
>> @@ -464,6 +464,35 @@ static int bcm54616s_config_aneg(struct phy_device *phydev)
>>         return ret;
>>  }
>>
>> +static int bcm54616s_probe(struct phy_device *phydev)
>> +{
>> +       int val, intf_sel;
>> +
>> +       val = bcm_phy_read_shadow(phydev, BCM54XX_SHD_MODE);
>> +       if (val < 0)
>> +               return val;
>> +
>> +       /* The PHY is strapped in RGMII to fiber mode when INTERF_SEL[1:0]
>> +        * is 01b.
>> +        */
>> +       intf_sel = (val & BCM54XX_SHD_INTF_SEL_MASK) >> 1;
>> +       if (intf_sel == 1) {
>> +               val = bcm_phy_read_shadow(phydev, BCM54616S_SHD_100FX_CTRL);
>> +               if (val < 0)
>> +                       return val;
>> +
>> +               /* Bit 0 of the SerDes 100-FX Control register, when set
>> +                * to 1, sets the MII/RGMII -> 100BASE-FX configuration.
>> +                * When this bit is set to 0, it sets the GMII/RGMII ->
>> +                * 1000BASE-X configuration.
>> +                */
>> +               if (!(val & BCM54616S_100FX_MODE))
>> +                       phydev->dev_flags |= PHY_BCM_FLAGS_MODE_1000BX;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>>  static int brcm_phy_setbits(struct phy_device *phydev, int reg, int set)
>>  {
>>         int val;
>> @@ -655,6 +684,8 @@ static struct phy_driver broadcom_drivers[] = {
>>         .config_aneg    = bcm54616s_config_aneg,
>>         .ack_interrupt  = bcm_phy_ack_intr,
>>         .config_intr    = bcm_phy_config_intr,
>> +       .read_status    = bcm54xx_read_status,
>> +       .probe          = bcm54616s_probe,
>>  }, {
>>         .phy_id         = PHY_ID_BCM5464,
>>         .phy_id_mask    = 0xfffffff0,
>> @@ -689,7 +720,7 @@ static struct phy_driver broadcom_drivers[] = {
>>         .name           = "Broadcom BCM5482",
>>         /* PHY_GBIT_FEATURES */
>>         .config_init    = bcm5482_config_init,
>> -       .read_status    = bcm5482_read_status,
>> +       .read_status    = bcm54xx_read_status,
>>         .ack_interrupt  = bcm_phy_ack_intr,
>>         .config_intr    = bcm_phy_config_intr,
>>  }, {
>> diff --git a/include/linux/brcmphy.h b/include/linux/brcmphy.h
>> index 6db2d9a6e503..b475e7f20d28 100644
>> --- a/include/linux/brcmphy.h
>> +++ b/include/linux/brcmphy.h
>> @@ -200,9 +200,15 @@
>>  #define BCM5482_SHD_SSD                0x14    /* 10100: Secondary SerDes control */
>>  #define BCM5482_SHD_SSD_LEDM   0x0008  /* SSD LED Mode enable */
>>  #define BCM5482_SHD_SSD_EN     0x0001  /* SSD enable */
>> -#define BCM5482_SHD_MODE       0x1f    /* 11111: Mode Control Register */
>> -#define BCM5482_SHD_MODE_1000BX        0x0001  /* Enable 1000BASE-X registers */
>>
>> +/* 10011: SerDes 100-FX Control Register */
>> +#define BCM54616S_SHD_100FX_CTRL       0x13
>> +#define        BCM54616S_100FX_MODE            BIT(0)  /* 100-FX SerDes Enable */
>> +
>> +/* 11111: Mode Control Register */
>> +#define BCM54XX_SHD_MODE               0x1f
>> +#define BCM54XX_SHD_INTF_SEL_MASK      GENMASK(2, 1)   /* INTERF_SEL[1:0] */
>> +#define BCM54XX_SHD_MODE_1000BX                BIT(0)  /* Enable 1000-X registers */
>>
>>  /*
>>   * EXPANSION SHADOW ACCESS REGISTERS.  (PHY REG 0x15, 0x16, and 0x17)
>> --
>> 2.17.1
>>
> 
> The patchset looks better now. But is it ok, I wonder, to keep
> PHY_BCM_FLAGS_MODE_1000BX in phydev->dev_flags, considering that
> phy_attach_direct is overwriting it?

I checked ftgmac100 driver (used on my machine) and it calls phy_connect_direct which passes phydev->dev_flags when calling phy_attach_direct: that explains why the flag is not cleared in my case.

Can you give me some suggestions since dev_flags may be override in other calling paths? For example, is it good idea to move the auto-detect logic from probe to config_init? Or are there other fields in phy_device structure that can be used to store PHY-switch link type? Or maybe I should just include the auto-detect logic in read_status callback?


Thanks,

Tao

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] selftests/bpf: add loop test 4
From: Yonghong Song @ 2019-08-04  5:29 UTC (permalink / raw)
  To: Alexei Starovoitov, davem@davemloft.net
  Cc: daniel@iogearbox.net, netdev@vger.kernel.org, bpf@vger.kernel.org,
	Kernel Team
In-Reply-To: <20190802233344.863418-2-ast@kernel.org>



On 8/2/19 4:33 PM, Alexei Starovoitov wrote:
> Add a test that returns a 'random' number between [0, 2^20)
> If state pruning is not working correctly for loop body the number of
> processed insns will be 2^20 * num_of_insns_in_loop_body and the program
> will be rejected.

The maximum processed insns will be 2^20 or 2^20 * 
num_of_insns_in_loop_body? I thought the verifier will
stop processing once processed insns reach 1M?

Could you elaborate which potential issues in verifier
you try to cover with this test case? Extra tests are
always welcome. We already have scale/loop tests and some
(e.g., strobemeta tests) are more complex than this one.
Maybe you have something in mind for this particular
test? Putting in the commit message may help people understand
the concerns.

> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   .../bpf/prog_tests/bpf_verif_scale.c          |  1 +
>   tools/testing/selftests/bpf/progs/loop4.c     | 23 +++++++++++++++++++
>   2 files changed, 24 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/progs/loop4.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> index b4be96162ff4..757e39540eda 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> @@ -71,6 +71,7 @@ void test_bpf_verif_scale(void)
>   
>   		{ "loop1.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
>   		{ "loop2.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
> +		{ "loop4.o", BPF_PROG_TYPE_RAW_TRACEPOINT },

The program is more like a BPF_PROG_TYPE_SCHED_CLS type than
a BPF_PROG_TYPE_RAW_TRACEPOINT?

>   
>   		/* partial unroll. 19k insn in a loop.
>   		 * Total program size 20.8k insn.
> diff --git a/tools/testing/selftests/bpf/progs/loop4.c b/tools/testing/selftests/bpf/progs/loop4.c
> new file mode 100644
> index 000000000000..3e7ee14fddbd
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/loop4.c
> @@ -0,0 +1,23 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2019 Facebook
> +#include <linux/sched.h>
> +#include <linux/ptrace.h>

Since the program is a networking type,
the above two headers are probably unneeded.

> +#include <stdint.h>
> +#include <stddef.h>
> +#include <stdbool.h>
> +#include <linux/bpf.h>
> +#include "bpf_helpers.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("socket")
> +int combinations(volatile struct __sk_buff* skb)
> +{
> +	int ret = 0, i;
> +
> +#pragma nounroll
> +	for (i = 0; i < 20; i++)
> +		if (skb->len)
> +			ret |= 1 << i;
> +	return ret;
> +}
> 

^ permalink raw reply

* Re: [PATCH bpf-next 2/2] selftests/bpf: add loop test 5
From: Yonghong Song @ 2019-08-04  5:45 UTC (permalink / raw)
  To: Alexei Starovoitov, davem@davemloft.net
  Cc: daniel@iogearbox.net, netdev@vger.kernel.org, bpf@vger.kernel.org,
	Kernel Team
In-Reply-To: <20190802233344.863418-3-ast@kernel.org>



On 8/2/19 4:33 PM, Alexei Starovoitov wrote:
> Add a test with multiple exit conditions.
> It's not an infinite loop only when the verifier can properly track
> all math on variable 'i' through all possible ways of executing this loop.

Agreed with motivation of this test.

> 
> barrier()s are needed to disable llvm optimization that combines multiple
> branches into fewer branches.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>   .../bpf/prog_tests/bpf_verif_scale.c          |  1 +
>   tools/testing/selftests/bpf/progs/loop5.c     | 37 +++++++++++++++++++
>   2 files changed, 38 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/progs/loop5.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> index 757e39540eda..29615a4a9362 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
> @@ -72,6 +72,7 @@ void test_bpf_verif_scale(void)
>   		{ "loop1.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
>   		{ "loop2.o", BPF_PROG_TYPE_RAW_TRACEPOINT },
>   		{ "loop4.o", BPF_PROG_TYPE_RAW_TRACEPOINT }, > +		{ "loop5.o", BPF_PROG_TYPE_RAW_TRACEPOINT },

More like a BPF_PROG_TYPE_SCHED_CLS type although probably it does not 
matter as we did not attach it to anywhere?

>   
>   		/* partial unroll. 19k insn in a loop.
>   		 * Total program size 20.8k insn.
> diff --git a/tools/testing/selftests/bpf/progs/loop5.c b/tools/testing/selftests/bpf/progs/loop5.c
> new file mode 100644
> index 000000000000..9d9817efe208
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/loop5.c
> @@ -0,0 +1,37 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2019 Facebook
> +#include <linux/sched.h>
> +#include <linux/ptrace.h>

The above headers probably not needed.

> +#include <stdint.h>
> +#include <stddef.h>
> +#include <stdbool.h>
> +#include <linux/bpf.h>
> +#include "bpf_helpers.h"
> +#define barrier() __asm__ __volatile__("": : :"memory")
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("socket")
> +int while_true(volatile struct __sk_buff* skb)
> +{
> +	int i = 0;
> +
> +	while (true) {
> +		if (skb->len)
> +			i += 3;
> +		else
> +			i += 7;
> +		if (i == 9)
> +			break;
> +		barrier();
> +		if (i == 10)
> +			break;
> +		barrier();
> +		if (i == 13)
> +			break;
> +		barrier();
> +		if (i == 14)
> +			break;
> +	}
> +	return i;
> +}
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox