Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-08-05  7:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <CALCETrWtE2U4EvZVYeq8pSmQjBzF2PHH+KxYW8FSeF+W=1FYjw@mail.gmail.com>

Hi Andy, 

> On Aug 4, 2019, at 10:47 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sun, Aug 4, 2019 at 5:08 PM Andy Lutomirski <luto@kernel.org> wrote:
>> 
>> On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>> On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> Hi Andy,
>>>> 
>>>>> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
>>>>>> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
>>>>> 
>>>>> It's been done before -- it's not that hard.  IMO the main tricky bit
>>>>> would be try be somewhat careful about defining exactly what
>>>>> CAP_BPF_ADMIN does.
>>>> 
>>>> Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
>>>> Plumbers conference.
>>>> 
>>>> OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
>>>> like systemd to do sys_bpf() without root.
>>> 
>>> I don't understand the use case here.  Are you talking about systemd
>>> --user?  As far as I know, a user is expected to be able to fully
>>> control their systemd --user process, so giving it unrestricted bpf
>>> access is very close to giving it superuser access, and this doesn't
>>> sound like a good idea.  I think that, if systemd --user needs bpf(),
>>> it either needs real unprivileged bpf() or it needs a privileged
>>> helper (SUID or a daemon) to intermediate this access.
>>> 
>>>> 
>>>>> 
>>>>>>> I don't see why you need to invent a whole new mechanism for this.
>>>>>>> The entire cgroup ecosystem outside bpf() does just fine using the
>>>>>>> write permission on files in cgroupfs to control access.  Why can't
>>>>>>> bpf() do the same thing?
>>>>>> 
>>>>>> It is easier to use write permission for BPF_PROG_ATTACH. But it is
>>>>>> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
>>>>>> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
>>>>>> we should have target concept for all these commands. But that is a
>>>>>> much bigger project. OTOH, "all or nothing" model allows all these
>>>>>> commands at once.
>>>>> 
>>>>> For BPF_PROG_LOAD, I admit I've never understood why permission is
>>>>> required at all.  I think that CAP_SYS_ADMIN or similar should be
>>>>> needed to get is_priv in the verifier, but I think that should mainly
>>>>> be useful for tracing, and that requires lots of privilege anyway.
>>>>> BPF_MAP_* is probably the trickiest part.  One solution would be some
>>>>> kind of bpffs, but I'm sure other solutions are possible.
>>>> 
>>>> Improving permission management of cgroup_bpf is another good topic to
>>>> discuss. However, it is also an overkill for current use case.
>>>> 
>>> 
>>> I looked at the code some more, and I don't think this is so hard
>>> after all.  As I understand it, all of the map..by_id stuff is, to
>>> some extent, deprecated in favor of persistent maps.  As I see it, the
>>> map..by_id calls should require privilege forever, although I can
>>> imagine ways to scope that privilege to a namespace if the maps
>>> themselves were to be scoped to a namespace.
>>> 
>>> Instead, unprivileged tools would use the persistent map interface
>>> roughly like this:
>>> 
>>> $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
>>> 8 entries 64 name mapname
>>> 
>>> This would require that the caller have either CAP_DAC_OVERRIDE or
>>> that the caller have permission to create files in /sys/fs/bpf/my_dir
>>> (using the same rules as for any filesystem), and the resulting map
>>> would end up owned by the creating user and have mode 0600 (or maybe
>>> 0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
>>> the various capable() checks that are currently involved in accessing
>>> a persistent map would instead check FMODE_READ or FMODE_WRITE on the
>>> map file as appropriate.
>>> 
>>> Half of this stuff already works.  I just set my system up like this:
>>> 
>>> $ ls -l /sys/fs/bpf
>>> total 0
>>> drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto
>>> 
>>> $ mkdir /sys/fs/bpf/luto/test
>>> 
>>> $ ls -l /sys/fs/bpf/luto
>>> total 0
>>> drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test
>>> 
>>> I bet that making the bpf() syscalls work appropriately in this
>>> context without privilege would only be a couple of hours of work.
>>> The hard work, creating bpffs and making it function, is already done
>>> :)
>>> 
>>> P.S. The docs for bpftool create are less than fantastic.  The
>>> complete lack of any error message at all when the syscall returns
>>> -EACCES is also not fantastic.
>> 
>> This isn't remotely finished, but I spent a bit of time fiddling with this:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms
>> 
>> What do you think?  (It's obviously not done.  It doesn't compile, and
>> I haven't gotten to the permissions needed to do map operations.  I
>> also haven't touched the capable() checks.)
> 
> I updated the branch.  It compiles, and basic map functionality works!

Thanks a lot for trying this out. This is a very interesting direction
that we will explore. 

> 
> # mount -t bpf bpf /sys/fs/bpf
> # cd /sys/fs/bpf
> # mkdir luto
> # chown luto: luto
> # setpriv --euid=1000 --ruid=1000 bash
> $ pwd
> /sys/fs/bpf
> bash-5.0$ ls -l
> total 0
> drwxr-xr-x 2 luto luto 0 Aug  4 22:41 luto
> bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8
> value 8 entries 64 name mapname
> bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Found 0 elements
> 
> # chown root: /sys/fs/bpf/luto/filename
> 
> $ bpftool map dump pinned /sys/fs/bpf/luto/filename
> Error: bpf obj get (/sys/fs/bpf/luto): Permission denied
> 
> So I think it's possible to get a respectable subset of bpf()
> functionality working without privilege in short order :)

I think we have two key questions to answer: 
  1. What subset of bpf() functionality will the users need?
  2. Who are the users? 

Different answers to these two questions lead to different directions.


In our use case, the answers are 
  1) almost all bpf() functionality
  2) highly trusted users (sudoers)

So our initial approach of /dev/bpf allows all bpf() functionality
in one bit in task_struct. (Yes, we can just sudo. But, we would 
rather not use sudo when possible.)


"cgroup management" use case may have answers like:
  1) cgroup_bpf only
  2) users in their own containers

For this case, getting cgroup_bpf related features (cgroup_bpf progs; 
some map types, etc.) work with unprivileged users would be the right 
direction. 


"USDT tracing" use case may have answers like:
  1) uprobe, stockmap, histogram, etc.
  2) unprivileged user, w/ or w/o containers

For this case, the first step is likely hacking sys_perf_event_open(). 


I guess we will need more discussions to decide how to make bpf() 
work better for all these (and more) use cases. 

Thanks,
Song


^ permalink raw reply

* Re: [PATCH v2] net/mlx5e: Use refcount_t for refcount
From: Chuhong Yuan @ 2019-08-05  6:55 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Saeed Mahameed, David S . Miller, Netdev, linux-rdma, LKML
In-Reply-To: <20190805061320.GN4832@mtr-leonro.mtl.com>

On Mon, Aug 5, 2019 at 2:13 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Sun, Aug 04, 2019 at 10:44:47PM +0800, Chuhong Yuan wrote:
> > On Sun, Aug 4, 2019 at 8:59 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Sat, Aug 03, 2019 at 12:48:28AM +0800, Chuhong Yuan wrote:
> > > > refcount_t is better for reference counters since its
> > > > implementation can prevent overflows.
> > > > So convert atomic_t ref counters to refcount_t.
> > >
> > > I'm not thrilled to see those automatic conversion patches, especially
> > > for flows which can't overflow. There is nothing wrong in using atomic_t
> > > type of variable, do you have in mind flow which will cause to overflow?
> > >
> > > Thanks
> >
> > I have to say that these patches are not done automatically...
> > Only the detection of problems is done by a script.
> > All conversions are done manually.
>
> Even worse, you need to audit usage of atomic_t and replace there
> it can overflow.
>
> >
> > I am not sure whether the flow can cause an overflow.
>
> It can't.
>
> > But I think it is hard to ensure that a data path is impossible
> > to have problems in any cases including being attacked.
>
> It is not data path, and I doubt that such conversion will be allowed
> in data paths without proving that no performance regression is introduced.
>>
>
> > So I think it is better to do this minor revision to prevent
> > potential risk, just like we have done in mlx5/core/cq.c.
>
> mlx5/core/cq.c is a different beast, refcount there means actual users
> of CQ which are limited in SW, so in theory, they have potential
> to be overflown.
>
> It is not the case here, there your are adding new port.
> There is nothing wrong with atomic_t.
>

Thanks for your explanation!
I will pay attention to this point in similar cases.
But it seems that the semantic of refcount is not always as clear as here...

Regards,
Chuhong


> Thanks
>
> >
> > Regards,
> > Chuhong
> >
> > > >
> > > > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > > > ---
> > > > Changes in v2:
> > > >   - Add #include.
> > > >
> > > >  drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c | 9 +++++----
> > > >  1 file changed, 5 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > > index b9d4f4e19ff9..148b55c3db7a 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > > @@ -32,6 +32,7 @@
> > > >
> > > >  #include <linux/kernel.h>
> > > >  #include <linux/module.h>
> > > > +#include <linux/refcount.h>
> > > >  #include <linux/mlx5/driver.h>
> > > >  #include <net/vxlan.h>
> > > >  #include "mlx5_core.h"
> > > > @@ -48,7 +49,7 @@ struct mlx5_vxlan {
> > > >
> > > >  struct mlx5_vxlan_port {
> > > >       struct hlist_node hlist;
> > > > -     atomic_t refcount;
> > > > +     refcount_t refcount;
> > > >       u16 udp_port;
> > > >  };
> > > >
> > > > @@ -113,7 +114,7 @@ int mlx5_vxlan_add_port(struct mlx5_vxlan *vxlan, u16 port)
> > > >
> > > >       vxlanp = mlx5_vxlan_lookup_port(vxlan, port);
> > > >       if (vxlanp) {
> > > > -             atomic_inc(&vxlanp->refcount);
> > > > +             refcount_inc(&vxlanp->refcount);
> > > >               return 0;
> > > >       }
> > > >
> > > > @@ -137,7 +138,7 @@ int mlx5_vxlan_add_port(struct mlx5_vxlan *vxlan, u16 port)
> > > >       }
> > > >
> > > >       vxlanp->udp_port = port;
> > > > -     atomic_set(&vxlanp->refcount, 1);
> > > > +     refcount_set(&vxlanp->refcount, 1);
> > > >
> > > >       spin_lock_bh(&vxlan->lock);
> > > >       hash_add(vxlan->htable, &vxlanp->hlist, port);
> > > > @@ -170,7 +171,7 @@ int mlx5_vxlan_del_port(struct mlx5_vxlan *vxlan, u16 port)
> > > >               goto out_unlock;
> > > >       }
> > > >
> > > > -     if (atomic_dec_and_test(&vxlanp->refcount)) {
> > > > +     if (refcount_dec_and_test(&vxlanp->refcount)) {
> > > >               hash_del(&vxlanp->hlist);
> > > >               remove = true;
> > > >       }
> > > > --
> > > > 2.20.1
> > > >

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-05  6:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jason Gunthorpe, linux-mm, netdev, linux-kernel, kvm,
	virtualization
In-Reply-To: <494ac30d-b750-52c8-b927-16cd4b9414c4@redhat.com>

On Mon, Aug 05, 2019 at 12:41:45PM +0800, Jason Wang wrote:
> 
> On 2019/8/5 下午12:36, Jason Wang wrote:
> > 
> > On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
> > > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > > > synchronize_rcu.
> > > > > 
> > > > > I start with synchronize_rcu() but both you and Michael raise some
> > > > > concern.
> > > > I've also idly wondered if calling synchronize_rcu() under the various
> > > > mm locks is a deadlock situation.
> > > > 
> > > > > Then I try spinlock and mutex:
> > > > > 
> > > > > 1) spinlock: add lots of overhead on datapath, this leads 0
> > > > > performance
> > > > > improvement.
> > > > I think the topic here is correctness not performance improvement
> > > The topic is whether we should revert
> > > commit 7f466032dc9 ("vhost: access vq metadata through kernel
> > > virtual address")
> > > 
> > > or keep it in. The only reason to keep it is performance.
> > 
> > 
> > Maybe it's time to introduce the config option?
> 
> 
> Or does it make sense if I post a V3 with:
> 
> - introduce config option and disable the optimization by default
> 
> - switch from synchronize_rcu() to vhost_flush_work(), but the rest are the
> same
> 
> This can give us some breath to decide which way should go for next release?
> 
> Thanks

As is, with preempt enabled?  Nope I don't think blocking an invalidator
on swap IO is ok, so I don't believe this stuff is going into this
release at this point.

So it's more a question of whether it's better to revert and apply a clean
patch on top, or just keep the code around but disabled with an ifdef as is.
I'm open to both options, and would like your opinion on this.

> 
> > 
> > 
> > > 
> > > Now as long as all this code is disabled anyway, we can experiment a
> > > bit.
> > > 
> > > I personally feel we would be best served by having two code paths:
> > > 
> > > - Access to VM memory directly mapped into kernel
> > > - Access to userspace
> > > 
> > > 
> > > Having it all cleanly split will allow a bunch of optimizations, for
> > > example for years now we planned to be able to process an incoming short
> > > packet directly on softirq path, or an outgoing on directly within
> > > eventfd.
> > 
> > 
> > It's not hard consider we've already had our own accssors. But the
> > question is (as asked in another thread), do you want permanent GUP or
> > still use MMU notifiers.
> > 
> > Thanks
> > 
> > _______________________________________________
> > Virtualization mailing list
> > Virtualization@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH net-next v3] net: phy: broadcom: add 1000Base-X support for BCM54616S
From: Tao Ren @ 2019-08-05  6:38 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Vladimir Oltean, Florian Fainelli, Heiner Kallweit,
	David S . Miller, Arun Parameswaran, Justin Chen, netdev, lkml,
	openbmc@lists.ozlabs.org
In-Reply-To: <20190804145152.GA6800@lunn.ch>

Hi Andrew,

On 8/4/19 7:51 AM, Andrew Lunn wrote:
>>> The patchset looks better now. But is it ok, I wonder, to keep
>>> PHY_BCM_FLAGS_MODE_1000BX in phydev->dev_flags, considering that
>>> phy_attach_direct is overwriting it?
>>
> 
>> I checked ftgmac100 driver (used on my machine) and it calls
>> phy_connect_direct which passes phydev->dev_flags when calling
>> phy_attach_direct: that explains why the flag is not cleared in my
>> case.
> 
> Yes, that is the way it is intended to be used. The MAC driver can
> pass flags to the PHY. It is a fragile API, since the MAC needs to
> know what PHY is being used, since the flags are driver specific.
> 
> One option would be to modify the assignment in phy_attach_direct() to
> OR in the flags passed to it with flags which are already in
> phydev->dev_flags.

It sounds like a reasonable fix/enhancement to replace overriding with OR, no matter which direction we are going to (either adding 1000bx aneg in genphy or providing phy-specific aneg callback).

Do you want me to send out the patch (I feel it's better to be in a separate patch?) or someone else will take care of it?


Thanks,

Tao

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-05  6:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
	linux-mm
In-Reply-To: <e8ecb811-6653-cff4-bc11-81f4ccb0dbbf@redhat.com>

On Mon, Aug 05, 2019 at 12:36:40PM +0800, Jason Wang wrote:
> 
> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
> > On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > > > This must be a proper barrier, like a spinlock, mutex, or
> > > > > synchronize_rcu.
> > > > 
> > > > I start with synchronize_rcu() but both you and Michael raise some
> > > > concern.
> > > I've also idly wondered if calling synchronize_rcu() under the various
> > > mm locks is a deadlock situation.
> > > 
> > > > Then I try spinlock and mutex:
> > > > 
> > > > 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> > > > improvement.
> > > I think the topic here is correctness not performance improvement
> > The topic is whether we should revert
> > commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
> > 
> > or keep it in. The only reason to keep it is performance.
> 
> 
> Maybe it's time to introduce the config option?

Depending on CONFIG_BROKEN? I'm not sure it's a good idea.

> 
> > 
> > Now as long as all this code is disabled anyway, we can experiment a
> > bit.
> > 
> > I personally feel we would be best served by having two code paths:
> > 
> > - Access to VM memory directly mapped into kernel
> > - Access to userspace
> > 
> > 
> > Having it all cleanly split will allow a bunch of optimizations, for
> > example for years now we planned to be able to process an incoming short
> > packet directly on softirq path, or an outgoing on directly within
> > eventfd.
> 
> 
> It's not hard consider we've already had our own accssors. But the question
> is (as asked in another thread), do you want permanent GUP or still use MMU
> notifiers.
> 
> Thanks

We want THP and NUMA to work. Both are important for performance.

-- 
MST

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Michael S. Tsirkin @ 2019-08-05  6:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
	linux-mm
In-Reply-To: <6c3a0a1c-ce87-907b-7bc8-ec41bf9056d8@redhat.com>

On Mon, Aug 05, 2019 at 12:33:45PM +0800, Jason Wang wrote:
> 
> On 2019/8/2 下午10:03, Michael S. Tsirkin wrote:
> > On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
> > > Btw, I come up another idea, that is to disable preemption when vhost thread
> > > need to access the memory. Then register preempt notifier and if vhost
> > > thread is preempted, we're sure no one will access the memory and can do the
> > > cleanup.
> > Great, more notifiers :(
> > 
> > Maybe can live with
> > 1- disable preemption while using the cached pointer
> > 2- teach vhost to recover from memory access failures,
> >     by switching to regular from/to user path
> 
> 
> I don't get this, I believe we want to recover from regular from/to user
> path, isn't it?

That (disable copy to/from user completely) would be a nice to have
since it would reduce the attack surface of the driver, but e.g. your
code already doesn't do that.



> 
> > 
> > So if you want to try that, fine since it's a step in
> > the right direction.
> > 
> > But I think fundamentally it's not what we want to do long term.
> 
> 
> Yes.
> 
> 
> > 
> > It's always been a fundamental problem with this patch series that only
> > metadata is accessed through a direct pointer.
> > 
> > The difference in ways you handle metadata and data is what is
> > now coming and messing everything up.
> 
> 
> I do propose soemthing like this in the past:
> https://www.spinics.net/lists/linux-virtualization/msg36824.html. But looks
> like you have some concern about its locality.

Right and it doesn't go away. You'll need to come up
with a test that messes it up and triggers a worst-case
scenario, so we can measure how bad is that worst-case.

> But the problem still there, GUP can do page fault, so still need to
> synchronize it with MMU notifiers.

I think the idea was, if GUP would need a pagefault, don't
do a GUP and do to/from user instead. Hopefully that
will fault the page in and the next access will go through.

> The solution might be something like
> moving GUP to a dedicated kind of vhost work.

Right, generally GUP.

> 
> > 
> > So if continuing the direct map approach,
> > what is needed is a cache of mapped VM memory, then on a cache miss
> > we'd queue work along the lines of 1-2 above.
> > 
> > That's one direction to take. Another one is to give up on that and
> > write our own version of uaccess macros.  Add a "high security" flag to
> > the vhost module and if not active use these for userspace memory
> > access.
> 
> 
> Or using SET_BACKEND_FEATURES?

No, I don't think it's considered best practice to allow unpriveledged
userspace control over whether kernel enables security features.

> But do you mean permanent GUP as I did in
> original RFC https://lkml.org/lkml/2018/12/13/218?
> 
> Thanks

Permanent GUP breaks THP and NUMA.

> > 
> > 

^ permalink raw reply

* Re: [PATCH net-next 3/6] cls_api: add flow_indr_block_call function
From: wenxu @ 2019-08-05  6:26 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: pablo, fw, jakub.kicinski, netfilter-devel, netdev
In-Reply-To: <20190805060238.GB2349@nanopsycho.orion>

v5 contain this patch but with non-version tag,

I used --subject-prefix in git-format-patch. I am sorry to  make a mistake when modify the

commit log. So should I repost the v6?


On 8/5/2019 2:02 PM, Jiri Pirko wrote:
> Re subject. You don't have "v5" in this patch. I don't understand how
> that happened. Do you use --subject-prefix in git-format-patch?
>

^ permalink raw reply

* Re: [RFC PATCH 1/2] dt-bindings: net: macb: Add new property for PS SGMII only
From: Harini Katakam @ 2019-08-05  6:15 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Harini Katakam, Nicolas Ferre, David Miller, Claudiu Beznea,
	Rob Herring, Mark Rutland, netdev, linux-kernel, Michal Simek,
	devicetree
In-Reply-To: <20190804145633.GB6800@lunn.ch>

Hi Andrew,

On Sun, Aug 4, 2019 at 8:26 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Wed, Jul 31, 2019 at 03:10:32PM +0530, Harini Katakam wrote:
> > Add a new property to indicate when PS SGMII is used with NO
> > external PHY on board.
>
> Hi Harini
>
> What exactly is you use case? Are you connecting to a Ethernet switch?
> To an SFP cage with a copper module?

Yes, an SFP cage is the common HW target for this patch. Essentially, there
is no external PHY driver that the macb can "connect" to; if there was an
external PHY on board, its link status would also indicate when the GEM(PCS)
to external PHY link was down. But in the absence of an external PHY on
HW, PCS link status needs to be monitored and reported to users.

Regards,
Harini

^ permalink raw reply

* Re: [PATCH v2] net/mlx5e: Use refcount_t for refcount
From: Leon Romanovsky @ 2019-08-05  6:13 UTC (permalink / raw)
  To: Chuhong Yuan; +Cc: Saeed Mahameed, David S . Miller, Netdev, linux-rdma, LKML
In-Reply-To: <CANhBUQ2H5MU0m2xM0AkJGPf7+MJBZ3Eq5rR0kgeOoKRi4q1j6Q@mail.gmail.com>

On Sun, Aug 04, 2019 at 10:44:47PM +0800, Chuhong Yuan wrote:
> On Sun, Aug 4, 2019 at 8:59 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Sat, Aug 03, 2019 at 12:48:28AM +0800, Chuhong Yuan wrote:
> > > refcount_t is better for reference counters since its
> > > implementation can prevent overflows.
> > > So convert atomic_t ref counters to refcount_t.
> >
> > I'm not thrilled to see those automatic conversion patches, especially
> > for flows which can't overflow. There is nothing wrong in using atomic_t
> > type of variable, do you have in mind flow which will cause to overflow?
> >
> > Thanks
>
> I have to say that these patches are not done automatically...
> Only the detection of problems is done by a script.
> All conversions are done manually.

Even worse, you need to audit usage of atomic_t and replace there
it can overflow.

>
> I am not sure whether the flow can cause an overflow.

It can't.

> But I think it is hard to ensure that a data path is impossible
> to have problems in any cases including being attacked.

It is not data path, and I doubt that such conversion will be allowed
in data paths without proving that no performance regression is introduced.

>
> So I think it is better to do this minor revision to prevent
> potential risk, just like we have done in mlx5/core/cq.c.

mlx5/core/cq.c is a different beast, refcount there means actual users
of CQ which are limited in SW, so in theory, they have potential
to be overflown.

It is not the case here, there your are adding new port.
There is nothing wrong with atomic_t.

Thanks

>
> Regards,
> Chuhong
>
> > >
> > > Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
> > > ---
> > > Changes in v2:
> > >   - Add #include.
> > >
> > >  drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c | 9 +++++----
> > >  1 file changed, 5 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > index b9d4f4e19ff9..148b55c3db7a 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/vxlan.c
> > > @@ -32,6 +32,7 @@
> > >
> > >  #include <linux/kernel.h>
> > >  #include <linux/module.h>
> > > +#include <linux/refcount.h>
> > >  #include <linux/mlx5/driver.h>
> > >  #include <net/vxlan.h>
> > >  #include "mlx5_core.h"
> > > @@ -48,7 +49,7 @@ struct mlx5_vxlan {
> > >
> > >  struct mlx5_vxlan_port {
> > >       struct hlist_node hlist;
> > > -     atomic_t refcount;
> > > +     refcount_t refcount;
> > >       u16 udp_port;
> > >  };
> > >
> > > @@ -113,7 +114,7 @@ int mlx5_vxlan_add_port(struct mlx5_vxlan *vxlan, u16 port)
> > >
> > >       vxlanp = mlx5_vxlan_lookup_port(vxlan, port);
> > >       if (vxlanp) {
> > > -             atomic_inc(&vxlanp->refcount);
> > > +             refcount_inc(&vxlanp->refcount);
> > >               return 0;
> > >       }
> > >
> > > @@ -137,7 +138,7 @@ int mlx5_vxlan_add_port(struct mlx5_vxlan *vxlan, u16 port)
> > >       }
> > >
> > >       vxlanp->udp_port = port;
> > > -     atomic_set(&vxlanp->refcount, 1);
> > > +     refcount_set(&vxlanp->refcount, 1);
> > >
> > >       spin_lock_bh(&vxlan->lock);
> > >       hash_add(vxlan->htable, &vxlanp->hlist, port);
> > > @@ -170,7 +171,7 @@ int mlx5_vxlan_del_port(struct mlx5_vxlan *vxlan, u16 port)
> > >               goto out_unlock;
> > >       }
> > >
> > > -     if (atomic_dec_and_test(&vxlanp->refcount)) {
> > > +     if (refcount_dec_and_test(&vxlanp->refcount)) {
> > >               hash_del(&vxlanp->hlist);
> > >               remove = true;
> > >       }
> > > --
> > > 2.20.1
> > >

^ permalink raw reply

* Re: [PATCH net-next 3/6] cls_api: add flow_indr_block_call function
From: Jiri Pirko @ 2019-08-05  6:02 UTC (permalink / raw)
  To: wenxu; +Cc: pablo, fw, jakub.kicinski, netfilter-devel, netdev
In-Reply-To: <1564628627-10021-4-git-send-email-wenxu@ucloud.cn>

Re subject. You don't have "v5" in this patch. I don't understand how
that happened. Do you use --subject-prefix in git-format-patch?

^ permalink raw reply

* Re: [PATCH 0/3] Use refcount_t for refcount
From: Leon Romanovsky @ 2019-08-05  5:57 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Saeed Mahameed, David S . Miller, Doug Ledford, Jason Gunthorpe,
	linux-rdma, Netdev, LKML
In-Reply-To: <CANhBUQ0rMKHmh4ibktwRmVN6NU=HAjs-Q7PrF9yX5x5yOyOB2A@mail.gmail.com>

On Sun, Aug 04, 2019 at 10:58:19PM +0800, Chuhong Yuan wrote:
> On Sun, Aug 4, 2019 at 8:48 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Fri, Aug 02, 2019 at 08:10:35PM +0800, Chuhong Yuan wrote:
> > > Reference counters are preferred to use refcount_t instead of
> > > atomic_t.
> > > This is because the implementation of refcount_t can prevent
> > > overflows and detect possible use-after-free.
> > >
> > > First convert the refcount field to refcount_t in mlx5/driver.h.
> > > Then convert the uses to refcount_() APIs.
> >
> > You can't do it, because you need to ensure that driver compiles and
> > works between patches. By converting driver.h alone to refcount_t, you
> > simply broke mlx5 driver.
> >
>
> It is my fault... I am not clear how to send patches which cross
> several subsystems, so I sent them in series.
> Maybe I should merge these patches together?

In case of conversion patches, yes, you need to perform such change
in one shot.

>
>
> > NAK, to be clear.
> >
> > And please don't sent series of patches as standalone patches.
> >
>
> Due to the reason mentioned above, I sent them seperately.

Create patch, run ./scripts/get_maintainer.pl on it and send according
to generated output. You are not doing kernel core changes and there is
no need to worry about cross subsystem complexity as long as you will
put relevant maintainers in TO: field.

Thanks

>
> > Thanks,

^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-08-05  5:54 UTC (permalink / raw)
  To: David Ahern; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <6f05d200-49d4-4eb1-cd69-bd88cf8b0167@gmail.com>

Fri, Aug 02, 2019 at 05:45:36PM CEST, dsahern@gmail.com wrote:
>On 8/2/19 1:48 AM, Jiri Pirko wrote:
>> Wed, Jul 31, 2019 at 09:58:10PM CEST, dsahern@gmail.com wrote:
>>> On 7/31/19 1:46 PM, David Ahern wrote:
>>>> On 7/31/19 1:45 PM, Jiri Pirko wrote:
>>>>>> check. e.g., what happens if a resource controller has been configured
>>>>>> for the devlink instance and it is moved to a namespace whose existing
>>>>>> config exceeds those limits?
>>>>>
>>>>> It's moved with all the values. The whole instance is moved.
>>>>>
>>>>
>>>> The values are moved, but the FIB in a namespace could already contain
>>>> more routes than the devlink instance allows.
>>>>
>>>
>>>From a quick test your recent refactoring to netdevsim broke the
>>> resource controller. It was, and is intended to be, per network namespace.
>> 
>> unifying devlink instances with network namespace in netdevsim was
>> really odd. Netdevsim is also a device, like any other. With other
>> devices, you do not do this so I don't see why to do this with netdevsim.
>> 
>> Now you create netdevsim instance in sysfs, there is proper bus probe
>> mechanism done, there is a devlink instance created for this device,
>> there are netdevices and devlink ports created. Same as for the real
>> hardware.
>> 
>> Honestly, creating a devlink instance per-network namespace
>> automagically, no relation to netdevsim devices, that is simply wrong.
>> There should be always 1:1 relationshin between a device and devlink
>> instance.
>> 
>
>Jiri: prior to your recent change netdevsim had a fib resource
>controller per network namespace. Please return that behavior or revert
>the change.

There was implicit devlink instance creation per-namespace. No relation
any actual device. It was wrong and misuse of devlink.

Now you have 1 devlink instance per 1 device as it should be. Also, you
have fib resource control for this device, also as it is done for real
devices, like mlxsw.

Could you please describe your usecase? Perhaps we can handle
it differently.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-05  5:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <CALCETrUjh6DdgW1qSuSRd1_=0F9CqB8+sNj__e_6AHEvh_BaxQ@mail.gmail.com>

On Sun, Aug 4, 2019 at 5:08 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
> > >
> > > Hi Andy,
> > >
> >  >> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> > > >> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
> > > >
> > > > It's been done before -- it's not that hard.  IMO the main tricky bit
> > > > would be try be somewhat careful about defining exactly what
> > > > CAP_BPF_ADMIN does.
> > >
> > > Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
> > > Plumbers conference.
> > >
> > > OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
> > > like systemd to do sys_bpf() without root.
> >
> > I don't understand the use case here.  Are you talking about systemd
> > --user?  As far as I know, a user is expected to be able to fully
> > control their systemd --user process, so giving it unrestricted bpf
> > access is very close to giving it superuser access, and this doesn't
> > sound like a good idea.  I think that, if systemd --user needs bpf(),
> > it either needs real unprivileged bpf() or it needs a privileged
> > helper (SUID or a daemon) to intermediate this access.
> >
> > >
> > > >
> > > >>> I don't see why you need to invent a whole new mechanism for this.
> > > >>> The entire cgroup ecosystem outside bpf() does just fine using the
> > > >>> write permission on files in cgroupfs to control access.  Why can't
> > > >>> bpf() do the same thing?
> > > >>
> > > >> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> > > >> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> > > >> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> > > >> we should have target concept for all these commands. But that is a
> > > >> much bigger project. OTOH, "all or nothing" model allows all these
> > > >> commands at once.
> > > >
> > > > For BPF_PROG_LOAD, I admit I've never understood why permission is
> > > > required at all.  I think that CAP_SYS_ADMIN or similar should be
> > > > needed to get is_priv in the verifier, but I think that should mainly
> > > > be useful for tracing, and that requires lots of privilege anyway.
> > > > BPF_MAP_* is probably the trickiest part.  One solution would be some
> > > > kind of bpffs, but I'm sure other solutions are possible.
> > >
> > > Improving permission management of cgroup_bpf is another good topic to
> > > discuss. However, it is also an overkill for current use case.
> > >
> >
> > I looked at the code some more, and I don't think this is so hard
> > after all.  As I understand it, all of the map..by_id stuff is, to
> > some extent, deprecated in favor of persistent maps.  As I see it, the
> > map..by_id calls should require privilege forever, although I can
> > imagine ways to scope that privilege to a namespace if the maps
> > themselves were to be scoped to a namespace.
> >
> > Instead, unprivileged tools would use the persistent map interface
> > roughly like this:
> >
> > $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
> > 8 entries 64 name mapname
> >
> > This would require that the caller have either CAP_DAC_OVERRIDE or
> > that the caller have permission to create files in /sys/fs/bpf/my_dir
> > (using the same rules as for any filesystem), and the resulting map
> > would end up owned by the creating user and have mode 0600 (or maybe
> > 0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
> > the various capable() checks that are currently involved in accessing
> > a persistent map would instead check FMODE_READ or FMODE_WRITE on the
> > map file as appropriate.
> >
> > Half of this stuff already works.  I just set my system up like this:
> >
> > $ ls -l /sys/fs/bpf
> > total 0
> > drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto
> >
> > $ mkdir /sys/fs/bpf/luto/test
> >
> > $ ls -l /sys/fs/bpf/luto
> > total 0
> > drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test
> >
> > I bet that making the bpf() syscalls work appropriately in this
> > context without privilege would only be a couple of hours of work.
> > The hard work, creating bpffs and making it function, is already done
> > :)
> >
> > P.S. The docs for bpftool create are less than fantastic.  The
> > complete lack of any error message at all when the syscall returns
> > -EACCES is also not fantastic.
>
> This isn't remotely finished, but I spent a bit of time fiddling with this:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms
>
> What do you think?  (It's obviously not done.  It doesn't compile, and
> I haven't gotten to the permissions needed to do map operations.  I
> also haven't touched the capable() checks.)

I updated the branch.  It compiles, and basic map functionality works!

# mount -t bpf bpf /sys/fs/bpf
# cd /sys/fs/bpf
# mkdir luto
# chown luto: luto
# setpriv --euid=1000 --ruid=1000 bash
$ pwd
/sys/fs/bpf
bash-5.0$ ls -l
total 0
drwxr-xr-x 2 luto luto 0 Aug  4 22:41 luto
bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8
value 8 entries 64 name mapname
bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename
Found 0 elements

# chown root: /sys/fs/bpf/luto/filename

$ bpftool map dump pinned /sys/fs/bpf/luto/filename
Error: bpf obj get (/sys/fs/bpf/luto): Permission denied

So I think it's possible to get a respectable subset of bpf()
functionality working without privilege in short order :)

(FWIW, a decent fraction of this probably works even without my
patches, but it's going to have nonsensical semantics and may fail for
silly reasons.)

^ permalink raw reply

* Re: [PATCH] drivers:staging:isdn:hysdn brace same line if
From: Greg KH @ 2019-08-05  4:42 UTC (permalink / raw)
  To: Fernando Valle; +Cc: isdn, netdev, devel, linux-kernel
In-Reply-To: <CACRBQB+wx3=3vQrvnxjcoiZaZjP7EOF+f0NYa4p-XKTHW79RiA@mail.gmail.com>


A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Sun, Aug 04, 2019 at 10:04:40PM -0300, Fernando Valle wrote:
> Sorry Greg, it was my first submission.
> I followed the kernel newbies tutorial and some other found on the internet.

The best description of how to pick a subject line is in the section
entitled "The canonical patch format" in the kernel file,
Documentation/SubmittingPatches.  Please read that whole file as it
contains everything you need to know about writing good changelog texts
and everything else.

> So, the correct form of the subject would be -> "staging:isdn:hysdn open
> braces in correct location" !?

That's better than what you currently have, but I know you can do much
better than that :)

good luck!

greg k-h

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:41 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Gunthorpe
  Cc: linux-mm, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <e8ecb811-6653-cff4-bc11-81f4ccb0dbbf@redhat.com>


On 2019/8/5 下午12:36, Jason Wang wrote:
>
> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>>> synchronize_rcu.
>>>>
>>>> I start with synchronize_rcu() but both you and Michael raise some
>>>> concern.
>>> I've also idly wondered if calling synchronize_rcu() under the various
>>> mm locks is a deadlock situation.
>>>
>>>> Then I try spinlock and mutex:
>>>>
>>>> 1) spinlock: add lots of overhead on datapath, this leads 0 
>>>> performance
>>>> improvement.
>>> I think the topic here is correctness not performance improvement
>> The topic is whether we should revert
>> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual 
>> address")
>>
>> or keep it in. The only reason to keep it is performance.
>
>
> Maybe it's time to introduce the config option?


Or does it make sense if I post a V3 with:

- introduce config option and disable the optimization by default

- switch from synchronize_rcu() to vhost_flush_work(), but the rest are 
the same

This can give us some breath to decide which way should go for next release?

Thanks


>
>
>>
>> Now as long as all this code is disabled anyway, we can experiment a
>> bit.
>>
>> I personally feel we would be best served by having two code paths:
>>
>> - Access to VM memory directly mapped into kernel
>> - Access to userspace
>>
>>
>> Having it all cleanly split will allow a bunch of optimizations, for
>> example for years now we planned to be able to process an incoming short
>> packet directly on softirq path, or an outgoing on directly within
>> eventfd.
>
>
> It's not hard consider we've already had our own accssors. But the 
> question is (as asked in another thread), do you want permanent GUP or 
> still use MMU notifiers.
>
> Thanks
>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:39 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Gunthorpe
  Cc: kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190804040034-mutt-send-email-mst@kernel.org>


On 2019/8/4 下午4:07, Michael S. Tsirkin wrote:
> On Sat, Aug 03, 2019 at 09:14:00PM -0300, Jason Gunthorpe wrote:
>> On Sat, Aug 03, 2019 at 05:36:13PM -0400, Michael S. Tsirkin wrote:
>>> On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote:
>>>> On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote:
>>>>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>>>>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>>>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>>>>>> synchronize_rcu.
>>>>>>>
>>>>>>> I start with synchronize_rcu() but both you and Michael raise some
>>>>>>> concern.
>>>>>> I've also idly wondered if calling synchronize_rcu() under the various
>>>>>> mm locks is a deadlock situation.
>>>>>>
>>>>>>> Then I try spinlock and mutex:
>>>>>>>
>>>>>>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
>>>>>>> improvement.
>>>>>> I think the topic here is correctness not performance improvement
>>>>> The topic is whether we should revert
>>>>> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
>>>>>
>>>>> or keep it in. The only reason to keep it is performance.
>>>> Yikes, I'm not sure you can ever win against copy_from_user using
>>>> mmu_notifiers?
>>> Ever since copy_from_user started playing with flags (for SMAP) and
>>> added speculation barriers there's a chance we can win by accessing
>>> memory through the kernel address.
>> You think copy_to_user will be more expensive than the minimum two
>> atomics required to synchronize with another thread?
> I frankly don't know. With SMAP you flip flags twice, and with spectre
> you flush the pipeline. Is that cheaper or more expensive than an atomic
> operation? Testing is the only way to tell.


Let me test, I only did test on a non SMAP machine. Switching to 
spinlock kills all performance improvement.

Thanks


>
>>>> Also, why can't this just permanently GUP the pages? In fact, where
>>>> does it put_page them anyhow? Worrying that 7f466 adds a get_user page
>>>> but does not add a put_page??
>> You didn't answer this.. Why not just use GUP?
>>
>> Jason
> Sorry I misunderstood the question. Permanent GUP breaks lots of
> functionality we need such as THP and numa balancing.
>
> release_pages is used instead of put_page.
>
>
>
>

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:36 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Gunthorpe
  Cc: kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190802100414-mutt-send-email-mst@kernel.org>


On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>> synchronize_rcu.
>>>
>>> I start with synchronize_rcu() but both you and Michael raise some
>>> concern.
>> I've also idly wondered if calling synchronize_rcu() under the various
>> mm locks is a deadlock situation.
>>
>>> Then I try spinlock and mutex:
>>>
>>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
>>> improvement.
>> I think the topic here is correctness not performance improvement
> The topic is whether we should revert
> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
>
> or keep it in. The only reason to keep it is performance.


Maybe it's time to introduce the config option?


>
> Now as long as all this code is disabled anyway, we can experiment a
> bit.
>
> I personally feel we would be best served by having two code paths:
>
> - Access to VM memory directly mapped into kernel
> - Access to userspace
>
>
> Having it all cleanly split will allow a bunch of optimizations, for
> example for years now we planned to be able to process an incoming short
> packet directly on softirq path, or an outgoing on directly within
> eventfd.


It's not hard consider we've already had our own accssors. But the 
question is (as asked in another thread), do you want permanent GUP or 
still use MMU notifiers.

Thanks


^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
	linux-mm
In-Reply-To: <20190802094331-mutt-send-email-mst@kernel.org>


On 2019/8/2 下午10:03, Michael S. Tsirkin wrote:
> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>> Btw, I come up another idea, that is to disable preemption when vhost thread
>> need to access the memory. Then register preempt notifier and if vhost
>> thread is preempted, we're sure no one will access the memory and can do the
>> cleanup.
> Great, more notifiers :(
>
> Maybe can live with
> 1- disable preemption while using the cached pointer
> 2- teach vhost to recover from memory access failures,
>     by switching to regular from/to user path


I don't get this, I believe we want to recover from regular from/to user 
path, isn't it?


>
> So if you want to try that, fine since it's a step in
> the right direction.
>
> But I think fundamentally it's not what we want to do long term.


Yes.


>
> It's always been a fundamental problem with this patch series that only
> metadata is accessed through a direct pointer.
>
> The difference in ways you handle metadata and data is what is
> now coming and messing everything up.


I do propose soemthing like this in the past: 
https://www.spinics.net/lists/linux-virtualization/msg36824.html. But 
looks like you have some concern about its locality.

But the problem still there, GUP can do page fault, so still need to 
synchronize it with MMU notifiers. The solution might be something like 
moving GUP to a dedicated kind of vhost work.


>
> So if continuing the direct map approach,
> what is needed is a cache of mapped VM memory, then on a cache miss
> we'd queue work along the lines of 1-2 above.
>
> That's one direction to take. Another one is to give up on that and
> write our own version of uaccess macros.  Add a "high security" flag to
> the vhost module and if not active use these for userspace memory
> access.


Or using SET_BACKEND_FEATURES? But do you mean permanent GUP as I did in 
original RFC https://lkml.org/lkml/2018/12/13/218?

Thanks

>
>

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:20 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190802124613.GA11245@ziepe.ca>


On 2019/8/2 下午8:46, Jason Gunthorpe wrote:
> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>> This must be a proper barrier, like a spinlock, mutex, or
>>> synchronize_rcu.
>>
>> I start with synchronize_rcu() but both you and Michael raise some
>> concern.
> I've also idly wondered if calling synchronize_rcu() under the various
> mm locks is a deadlock situation.


Maybe, that's why I suggest to use vhost_work_flush() which is much 
lightweight can can achieve the same function. It can guarantee all 
previous work has been processed after vhost_work_flush() return.


>
>> Then I try spinlock and mutex:
>>
>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
>> improvement.
> I think the topic here is correctness not performance improvement


But the whole series is to speed up vhost.


>
>> 2) SRCU: full memory barrier requires on srcu_read_lock(), which still leads
>> little performance improvement
>   
>> 3) mutex: a possible issue is need to wait for the page to be swapped in (is
>> this unacceptable ?), another issue is that we need hold vq lock during
>> range overlap check.
> I have a feeling that mmu notififers cannot safely become dependent on
> progress of swap without causing deadlock. You probably should avoid
> this.


Yes, so that's why I try to synchronize the critical region by myself.


>>> And, again, you can't re-invent a spinlock with open coding and get
>>> something better.
>> So the question is if waiting for swap is considered to be unsuitable for
>> MMU notifiers. If not, it would simplify codes. If not, we still need to
>> figure out a possible solution.
>>
>> Btw, I come up another idea, that is to disable preemption when vhost thread
>> need to access the memory. Then register preempt notifier and if vhost
>> thread is preempted, we're sure no one will access the memory and can do the
>> cleanup.
> I think you should use the spinlock so at least the code is obviously
> functionally correct and worry about designing some properly justified
> performance change after.
>
> Jason


Spinlock is correct but make the whole series meaningless consider it 
won't bring any performance improvement.

Thanks



^ permalink raw reply

* Re: [PATCH v2 20/34] xen: convert put_page() to put_user_page*()
From: Juergen Gross @ 2019-08-05  4:15 UTC (permalink / raw)
  To: john.hubbard, Andrew Morton
  Cc: devel, Dave Chinner, Christoph Hellwig, Dan Williams, Ira Weiny,
	x86, linux-mm, Dave Hansen, amd-gfx, dri-devel, intel-gfx,
	linux-arm-kernel, linux-rpi-kernel, devel, xen-devel,
	John Hubbard, Boris Ostrovsky, rds-devel, Jérôme Glisse,
	Jan Kara, ceph-devel, kvm, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, LKML, linux-media, linux-nfs, linux-rdma,
	linux-xfs, netdev, sparclinux, Jason Gunthorpe
In-Reply-To: <20190804224915.28669-21-jhubbard@nvidia.com>

On 05.08.19 00:49, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> For pages that were retained via get_user_pages*(), release those pages
> via the new put_user_page*() routines, instead of via put_page() or
> release_pages().
> 
> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions").
> 
> This also handles pages[i] == NULL cases, thanks to an approach
> that is actually written by Juergen Gross.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> 
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: xen-devel@lists.xenproject.org
> ---
> 
> Hi Juergen,
> 
> Say, this is *exactly* what you proposed in your gup.patch, so
> I've speculatively added your Signed-off-by above, but need your
> approval before that's final. Let me know please...

Yes, that's fine with me.


Juergen

^ permalink raw reply

* Re: BPF: ETLS: RECV FLOW
From: John Fastabend @ 2019-08-05  3:43 UTC (permalink / raw)
  To: Shridhar Venkatraman, netdev
In-Reply-To: <CADJe1ZsZcWrtdJGgXeoEnG4FFUGxT-BmJvJW2xwDUF+uCUp-kA@mail.gmail.com>

Shridhar Venkatraman wrote:
> Hi,
> 
> The eTLS work has BPF integration which is great.
> However there is one spot where access to the clear text is not available.

Guessing eTLS is a typo for KTLS.

> 
> From kernel 4.20 - receiver BPF support added for KTLS.
> 
> a. receiver BPF is applied on encrypted message
> b. after applying BPF, message is decrypted
> c. BPF run logic on the decrypted plain message   - can we add this support ?
> d. then copy the decrypted message back to userspace.
> 
> code flow reference: tls receive message call flow:
> --------------------------------------------------------------
> 
> tls_sw_recvmsg
>   __tcp_bpf_recvmsg [ bpf exec function called on encrypted message ]
>   decrypt_skb_update
>   decrypt_internal
>   BPF_PROG_RUN on decrypted plain message - can we add this support ?
>   skb_copy_datagram_msg [ decrypted message copied back to userspace ]

Yes I'm aware of this I'll push patches this release cycle. At least that
is the plan. I have some internal patches I've been running for some time
but need to clean up an edge case. Hopefully should get to it this week
after fixing up a couple bugs first.

> 
> Thanks
> ps: I sent this to the bpf list as I don't know which one it should go to

sending to bpf list and CC netdev would work or just BPF list.

.John

^ permalink raw reply

* [PATCH net-next v2] openvswitch: Print error when ovs_execute_actions() fails
From: Yifeng Sun @ 2019-08-05  2:56 UTC (permalink / raw)
  To: netdev, pshelar, gvrose8192; +Cc: Yifeng Sun

Currently in function ovs_dp_process_packet(), return values of
ovs_execute_actions() are silently discarded. This patch prints out
an debug message when error happens so as to provide helpful hints
for debugging.
---
v1->v2: Fixed according to Pravin's review.

 net/openvswitch/datapath.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 892287d..12d9850 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -222,6 +222,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 	struct dp_stats_percpu *stats;
 	u64 *stats_counter;
 	u32 n_mask_hit;
+	int error;
 
 	stats = this_cpu_ptr(dp->stats_percpu);
 
@@ -229,7 +230,6 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 	flow = ovs_flow_tbl_lookup_stats(&dp->table, key, &n_mask_hit);
 	if (unlikely(!flow)) {
 		struct dp_upcall_info upcall;
-		int error;
 
 		memset(&upcall, 0, sizeof(upcall));
 		upcall.cmd = OVS_PACKET_CMD_MISS;
@@ -246,7 +246,10 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 
 	ovs_flow_stats_update(flow, key->tp.flags, skb);
 	sf_acts = rcu_dereference(flow->sf_acts);
-	ovs_execute_actions(dp, skb, sf_acts, key);
+	error = ovs_execute_actions(dp, skb, sf_acts, key);
+	if (unlikely(error))
+		net_dbg_ratelimited("ovs: action execution error on datapath %s: %d\n",
+							ovs_dp_name(dp), error);
 
 	stats_counter = &stats->n_hit;
 
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH net-next] openvswitch: Print error when ovs_execute_actions() fails
From: Yifeng Sun @ 2019-08-05  2:48 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAOrHB_CExkSymgCU5yGKg66XywJgNxihn8VQJNr8hw6cff0rOA@mail.gmail.com>

Yes, this fix is mainly for debugging purposes. If packets are
blackholed because
of errors from ovs_execute_actions(), we can got more helpful
information. Thanks
Pravin for the review, I will come up with a new version.
Yifeng

On Sat, Aug 3, 2019 at 4:00 PM Pravin Shelar <pshelar@ovn.org> wrote:
>
> On Thu, Aug 1, 2019 at 2:14 PM Yifeng Sun <pkusunyifeng@gmail.com> wrote:
> >
> > Currently in function ovs_dp_process_packet(), return values of
> > ovs_execute_actions() are silently discarded. This patch prints out
> > an error message when error happens so as to provide helpful hints
> > for debugging.
> >
> > Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
> > ---
> >  net/openvswitch/datapath.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> > index 892287d..603c533 100644
> > --- a/net/openvswitch/datapath.c
> > +++ b/net/openvswitch/datapath.c
> > @@ -222,6 +222,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> >         struct dp_stats_percpu *stats;
> >         u64 *stats_counter;
> >         u32 n_mask_hit;
> > +       int error;
> >
> >         stats = this_cpu_ptr(dp->stats_percpu);
> >
> > @@ -229,7 +230,6 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> >         flow = ovs_flow_tbl_lookup_stats(&dp->table, key, &n_mask_hit);
> >         if (unlikely(!flow)) {
> >                 struct dp_upcall_info upcall;
> > -               int error;
> >
> >                 memset(&upcall, 0, sizeof(upcall));
> >                 upcall.cmd = OVS_PACKET_CMD_MISS;
> > @@ -246,7 +246,10 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> >
> >         ovs_flow_stats_update(flow, key->tp.flags, skb);
> >         sf_acts = rcu_dereference(flow->sf_acts);
> > -       ovs_execute_actions(dp, skb, sf_acts, key);
> > +       error = ovs_execute_actions(dp, skb, sf_acts, key);
> > +       if (unlikely(error))
> > +               net_err_ratelimited("ovs: action execution error on datapath %s: %d\n",
> > +                                                       ovs_dp_name(dp), error);
> >
>
> I would rather add error counter for better visibility.
> If you want to use current approach, can you use net_dbg_ratelimited()
> since you want to use this for debugging purpose?
>
> Thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: can: Fix compiling warning
From: maowenan @ 2019-08-05  1:22 UTC (permalink / raw)
  To: socketcan, davem, netdev; +Cc: linux-kernel, kernel-janitors
In-Reply-To: <20190805012254.59869-1-maowenan@huawei.com>

please drop this mail.

On 2019/8/5 9:22, Mao Wenan wrote:
> There are two warnings in net/can, fix them by setting bcm_sock_no_ioctlcmd
> and raw_sock_no_ioctlcmd as static.
> 
> net/can/bcm.c:1683:5: warning: symbol 'bcm_sock_no_ioctlcmd' was not declared. Should it be static?
> net/can/raw.c:840:5: warning: symbol 'raw_sock_no_ioctlcmd' was not declared. Should it be static?
> 
> Fixes: 473d924d7d46 ("can: fix ioctl function removal")
> 
> Signed-off-by: Mao Wenan <maowenan@huawei.com>
> ---
>  v1->v2: change patch description typo error, 'warings' to 'warnings'.
>  net/can/bcm.c | 2 +-
>  net/can/raw.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/can/bcm.c b/net/can/bcm.c
> index bf1d0bbecec8..b8a32b4ac368 100644
> --- a/net/can/bcm.c
> +++ b/net/can/bcm.c
> @@ -1680,7 +1680,7 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>  	return size;
>  }
>  
> -int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +static int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
>  			 unsigned long arg)
>  {
>  	/* no ioctls for socket layer -> hand it down to NIC layer */
> diff --git a/net/can/raw.c b/net/can/raw.c
> index da386f1fa815..a01848ff9b12 100644
> --- a/net/can/raw.c
> +++ b/net/can/raw.c
> @@ -837,7 +837,7 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>  	return size;
>  }
>  
> -int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +static int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
>  			 unsigned long arg)
>  {
>  	/* no ioctls for socket layer -> hand it down to NIC layer */
> 


^ permalink raw reply

* [PATCH net-next v2] net: can: Fix compiling warning
From: Mao Wenan @ 2019-08-05  1:26 UTC (permalink / raw)
  To: socketcan, davem, netdev; +Cc: linux-kernel, kernel-janitors, Mao Wenan
In-Reply-To: <1d9e329a-eafc-6c32-ee2a-df3b15231a2a@huawei.com>

There are two warnings in net/can, fix them by setting bcm_sock_no_ioctlcmd
and raw_sock_no_ioctlcmd as static.

net/can/bcm.c:1683:5: warning: symbol 'bcm_sock_no_ioctlcmd' was not declared. Should it be static?
net/can/raw.c:840:5: warning: symbol 'raw_sock_no_ioctlcmd' was not declared. Should it be static?

Fixes: 473d924d7d46 ("can: fix ioctl function removal")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
---
 v1->v2: change patch description typo error, 'warings' to 'warnings'.
 net/can/bcm.c | 2 +-
 net/can/raw.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/can/bcm.c b/net/can/bcm.c
index bf1d0bbecec8..b8a32b4ac368 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1680,7 +1680,7 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+static int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
 			 unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
diff --git a/net/can/raw.c b/net/can/raw.c
index da386f1fa815..a01848ff9b12 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -837,7 +837,7 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+static int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
 			 unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
-- 
2.20.1


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox