Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 0/3] Use refcount_t for refcount
From: Leon Romanovsky @ 2019-08-05  5:57 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Saeed Mahameed, David S . Miller, Doug Ledford, Jason Gunthorpe,
	linux-rdma, Netdev, LKML
In-Reply-To: <CANhBUQ0rMKHmh4ibktwRmVN6NU=HAjs-Q7PrF9yX5x5yOyOB2A@mail.gmail.com>

On Sun, Aug 04, 2019 at 10:58:19PM +0800, Chuhong Yuan wrote:
> On Sun, Aug 4, 2019 at 8:48 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Fri, Aug 02, 2019 at 08:10:35PM +0800, Chuhong Yuan wrote:
> > > Reference counters are preferred to use refcount_t instead of
> > > atomic_t.
> > > This is because the implementation of refcount_t can prevent
> > > overflows and detect possible use-after-free.
> > >
> > > First convert the refcount field to refcount_t in mlx5/driver.h.
> > > Then convert the uses to refcount_() APIs.
> >
> > You can't do it, because you need to ensure that driver compiles and
> > works between patches. By converting driver.h alone to refcount_t, you
> > simply broke mlx5 driver.
> >
>
> It is my fault... I am not clear how to send patches which cross
> several subsystems, so I sent them in series.
> Maybe I should merge these patches together?

In case of conversion patches, yes, you need to perform such change
in one shot.

>
>
> > NAK, to be clear.
> >
> > And please don't sent series of patches as standalone patches.
> >
>
> Due to the reason mentioned above, I sent them seperately.

Create patch, run ./scripts/get_maintainer.pl on it and send according
to generated output. You are not doing kernel core changes and there is
no need to worry about cross subsystem complexity as long as you will
put relevant maintainers in TO: field.

Thanks

>
> > Thanks,

^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-08-05  5:54 UTC (permalink / raw)
  To: David Ahern; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <6f05d200-49d4-4eb1-cd69-bd88cf8b0167@gmail.com>

Fri, Aug 02, 2019 at 05:45:36PM CEST, dsahern@gmail.com wrote:
>On 8/2/19 1:48 AM, Jiri Pirko wrote:
>> Wed, Jul 31, 2019 at 09:58:10PM CEST, dsahern@gmail.com wrote:
>>> On 7/31/19 1:46 PM, David Ahern wrote:
>>>> On 7/31/19 1:45 PM, Jiri Pirko wrote:
>>>>>> check. e.g., what happens if a resource controller has been configured
>>>>>> for the devlink instance and it is moved to a namespace whose existing
>>>>>> config exceeds those limits?
>>>>>
>>>>> It's moved with all the values. The whole instance is moved.
>>>>>
>>>>
>>>> The values are moved, but the FIB in a namespace could already contain
>>>> more routes than the devlink instance allows.
>>>>
>>>
>>>From a quick test your recent refactoring to netdevsim broke the
>>> resource controller. It was, and is intended to be, per network namespace.
>> 
>> unifying devlink instances with network namespace in netdevsim was
>> really odd. Netdevsim is also a device, like any other. With other
>> devices, you do not do this so I don't see why to do this with netdevsim.
>> 
>> Now you create netdevsim instance in sysfs, there is proper bus probe
>> mechanism done, there is a devlink instance created for this device,
>> there are netdevices and devlink ports created. Same as for the real
>> hardware.
>> 
>> Honestly, creating a devlink instance per-network namespace
>> automagically, no relation to netdevsim devices, that is simply wrong.
>> There should be always 1:1 relationshin between a device and devlink
>> instance.
>> 
>
>Jiri: prior to your recent change netdevsim had a fib resource
>controller per network namespace. Please return that behavior or revert
>the change.

There was implicit devlink instance creation per-namespace. No relation
any actual device. It was wrong and misuse of devlink.

Now you have 1 devlink instance per 1 device as it should be. Also, you
have fib resource control for this device, also as it is done for real
devices, like mlxsw.

Could you please describe your usecase? Perhaps we can handle
it differently.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-05  5:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <CALCETrUjh6DdgW1qSuSRd1_=0F9CqB8+sNj__e_6AHEvh_BaxQ@mail.gmail.com>

On Sun, Aug 4, 2019 at 5:08 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
> > >
> > > Hi Andy,
> > >
> >  >> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> > > >> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
> > > >
> > > > It's been done before -- it's not that hard.  IMO the main tricky bit
> > > > would be try be somewhat careful about defining exactly what
> > > > CAP_BPF_ADMIN does.
> > >
> > > Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
> > > Plumbers conference.
> > >
> > > OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
> > > like systemd to do sys_bpf() without root.
> >
> > I don't understand the use case here.  Are you talking about systemd
> > --user?  As far as I know, a user is expected to be able to fully
> > control their systemd --user process, so giving it unrestricted bpf
> > access is very close to giving it superuser access, and this doesn't
> > sound like a good idea.  I think that, if systemd --user needs bpf(),
> > it either needs real unprivileged bpf() or it needs a privileged
> > helper (SUID or a daemon) to intermediate this access.
> >
> > >
> > > >
> > > >>> I don't see why you need to invent a whole new mechanism for this.
> > > >>> The entire cgroup ecosystem outside bpf() does just fine using the
> > > >>> write permission on files in cgroupfs to control access.  Why can't
> > > >>> bpf() do the same thing?
> > > >>
> > > >> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> > > >> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> > > >> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> > > >> we should have target concept for all these commands. But that is a
> > > >> much bigger project. OTOH, "all or nothing" model allows all these
> > > >> commands at once.
> > > >
> > > > For BPF_PROG_LOAD, I admit I've never understood why permission is
> > > > required at all.  I think that CAP_SYS_ADMIN or similar should be
> > > > needed to get is_priv in the verifier, but I think that should mainly
> > > > be useful for tracing, and that requires lots of privilege anyway.
> > > > BPF_MAP_* is probably the trickiest part.  One solution would be some
> > > > kind of bpffs, but I'm sure other solutions are possible.
> > >
> > > Improving permission management of cgroup_bpf is another good topic to
> > > discuss. However, it is also an overkill for current use case.
> > >
> >
> > I looked at the code some more, and I don't think this is so hard
> > after all.  As I understand it, all of the map..by_id stuff is, to
> > some extent, deprecated in favor of persistent maps.  As I see it, the
> > map..by_id calls should require privilege forever, although I can
> > imagine ways to scope that privilege to a namespace if the maps
> > themselves were to be scoped to a namespace.
> >
> > Instead, unprivileged tools would use the persistent map interface
> > roughly like this:
> >
> > $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
> > 8 entries 64 name mapname
> >
> > This would require that the caller have either CAP_DAC_OVERRIDE or
> > that the caller have permission to create files in /sys/fs/bpf/my_dir
> > (using the same rules as for any filesystem), and the resulting map
> > would end up owned by the creating user and have mode 0600 (or maybe
> > 0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
> > the various capable() checks that are currently involved in accessing
> > a persistent map would instead check FMODE_READ or FMODE_WRITE on the
> > map file as appropriate.
> >
> > Half of this stuff already works.  I just set my system up like this:
> >
> > $ ls -l /sys/fs/bpf
> > total 0
> > drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto
> >
> > $ mkdir /sys/fs/bpf/luto/test
> >
> > $ ls -l /sys/fs/bpf/luto
> > total 0
> > drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test
> >
> > I bet that making the bpf() syscalls work appropriately in this
> > context without privilege would only be a couple of hours of work.
> > The hard work, creating bpffs and making it function, is already done
> > :)
> >
> > P.S. The docs for bpftool create are less than fantastic.  The
> > complete lack of any error message at all when the syscall returns
> > -EACCES is also not fantastic.
>
> This isn't remotely finished, but I spent a bit of time fiddling with this:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms
>
> What do you think?  (It's obviously not done.  It doesn't compile, and
> I haven't gotten to the permissions needed to do map operations.  I
> also haven't touched the capable() checks.)

I updated the branch.  It compiles, and basic map functionality works!

# mount -t bpf bpf /sys/fs/bpf
# cd /sys/fs/bpf
# mkdir luto
# chown luto: luto
# setpriv --euid=1000 --ruid=1000 bash
$ pwd
/sys/fs/bpf
bash-5.0$ ls -l
total 0
drwxr-xr-x 2 luto luto 0 Aug  4 22:41 luto
bash-5.0$ bpftool map create /sys/fs/bpf/luto/filename type hash key 8
value 8 entries 64 name mapname
bash-5.0$ bpftool map dump pinned /sys/fs/bpf/luto/filename
Found 0 elements

# chown root: /sys/fs/bpf/luto/filename

$ bpftool map dump pinned /sys/fs/bpf/luto/filename
Error: bpf obj get (/sys/fs/bpf/luto): Permission denied

So I think it's possible to get a respectable subset of bpf()
functionality working without privilege in short order :)

(FWIW, a decent fraction of this probably works even without my
patches, but it's going to have nonsensical semantics and may fail for
silly reasons.)

^ permalink raw reply

* Re: [PATCH] drivers:staging:isdn:hysdn brace same line if
From: Greg KH @ 2019-08-05  4:42 UTC (permalink / raw)
  To: Fernando Valle; +Cc: isdn, netdev, devel, linux-kernel
In-Reply-To: <CACRBQB+wx3=3vQrvnxjcoiZaZjP7EOF+f0NYa4p-XKTHW79RiA@mail.gmail.com>


A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Sun, Aug 04, 2019 at 10:04:40PM -0300, Fernando Valle wrote:
> Sorry Greg, it was my first submission.
> I followed the kernel newbies tutorial and some other found on the internet.

The best description of how to pick a subject line is in the section
entitled "The canonical patch format" in the kernel file,
Documentation/SubmittingPatches.  Please read that whole file as it
contains everything you need to know about writing good changelog texts
and everything else.

> So, the correct form of the subject would be -> "staging:isdn:hysdn open
> braces in correct location" !?

That's better than what you currently have, but I know you can do much
better than that :)

good luck!

greg k-h

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:41 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Gunthorpe
  Cc: linux-mm, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <e8ecb811-6653-cff4-bc11-81f4ccb0dbbf@redhat.com>


On 2019/8/5 下午12:36, Jason Wang wrote:
>
> On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>>> synchronize_rcu.
>>>>
>>>> I start with synchronize_rcu() but both you and Michael raise some
>>>> concern.
>>> I've also idly wondered if calling synchronize_rcu() under the various
>>> mm locks is a deadlock situation.
>>>
>>>> Then I try spinlock and mutex:
>>>>
>>>> 1) spinlock: add lots of overhead on datapath, this leads 0 
>>>> performance
>>>> improvement.
>>> I think the topic here is correctness not performance improvement
>> The topic is whether we should revert
>> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual 
>> address")
>>
>> or keep it in. The only reason to keep it is performance.
>
>
> Maybe it's time to introduce the config option?


Or does it make sense if I post a V3 with:

- introduce config option and disable the optimization by default

- switch from synchronize_rcu() to vhost_flush_work(), but the rest are 
the same

This can give us some breath to decide which way should go for next release?

Thanks


>
>
>>
>> Now as long as all this code is disabled anyway, we can experiment a
>> bit.
>>
>> I personally feel we would be best served by having two code paths:
>>
>> - Access to VM memory directly mapped into kernel
>> - Access to userspace
>>
>>
>> Having it all cleanly split will allow a bunch of optimizations, for
>> example for years now we planned to be able to process an incoming short
>> packet directly on softirq path, or an outgoing on directly within
>> eventfd.
>
>
> It's not hard consider we've already had our own accssors. But the 
> question is (as asked in another thread), do you want permanent GUP or 
> still use MMU notifiers.
>
> Thanks
>
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:39 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Gunthorpe
  Cc: kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190804040034-mutt-send-email-mst@kernel.org>


On 2019/8/4 下午4:07, Michael S. Tsirkin wrote:
> On Sat, Aug 03, 2019 at 09:14:00PM -0300, Jason Gunthorpe wrote:
>> On Sat, Aug 03, 2019 at 05:36:13PM -0400, Michael S. Tsirkin wrote:
>>> On Fri, Aug 02, 2019 at 02:24:18PM -0300, Jason Gunthorpe wrote:
>>>> On Fri, Aug 02, 2019 at 10:27:21AM -0400, Michael S. Tsirkin wrote:
>>>>> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>>>>>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>>>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>>>>>> synchronize_rcu.
>>>>>>>
>>>>>>> I start with synchronize_rcu() but both you and Michael raise some
>>>>>>> concern.
>>>>>> I've also idly wondered if calling synchronize_rcu() under the various
>>>>>> mm locks is a deadlock situation.
>>>>>>
>>>>>>> Then I try spinlock and mutex:
>>>>>>>
>>>>>>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
>>>>>>> improvement.
>>>>>> I think the topic here is correctness not performance improvement
>>>>> The topic is whether we should revert
>>>>> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
>>>>>
>>>>> or keep it in. The only reason to keep it is performance.
>>>> Yikes, I'm not sure you can ever win against copy_from_user using
>>>> mmu_notifiers?
>>> Ever since copy_from_user started playing with flags (for SMAP) and
>>> added speculation barriers there's a chance we can win by accessing
>>> memory through the kernel address.
>> You think copy_to_user will be more expensive than the minimum two
>> atomics required to synchronize with another thread?
> I frankly don't know. With SMAP you flip flags twice, and with spectre
> you flush the pipeline. Is that cheaper or more expensive than an atomic
> operation? Testing is the only way to tell.


Let me test, I only did test on a non SMAP machine. Switching to 
spinlock kills all performance improvement.

Thanks


>
>>>> Also, why can't this just permanently GUP the pages? In fact, where
>>>> does it put_page them anyhow? Worrying that 7f466 adds a get_user page
>>>> but does not add a put_page??
>> You didn't answer this.. Why not just use GUP?
>>
>> Jason
> Sorry I misunderstood the question. Permanent GUP breaks lots of
> functionality we need such as THP and numa balancing.
>
> release_pages is used instead of put_page.
>
>
>
>

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:36 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Gunthorpe
  Cc: kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190802100414-mutt-send-email-mst@kernel.org>


On 2019/8/2 下午10:27, Michael S. Tsirkin wrote:
> On Fri, Aug 02, 2019 at 09:46:13AM -0300, Jason Gunthorpe wrote:
>> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>>> This must be a proper barrier, like a spinlock, mutex, or
>>>> synchronize_rcu.
>>>
>>> I start with synchronize_rcu() but both you and Michael raise some
>>> concern.
>> I've also idly wondered if calling synchronize_rcu() under the various
>> mm locks is a deadlock situation.
>>
>>> Then I try spinlock and mutex:
>>>
>>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
>>> improvement.
>> I think the topic here is correctness not performance improvement
> The topic is whether we should revert
> commit 7f466032dc9 ("vhost: access vq metadata through kernel virtual address")
>
> or keep it in. The only reason to keep it is performance.


Maybe it's time to introduce the config option?


>
> Now as long as all this code is disabled anyway, we can experiment a
> bit.
>
> I personally feel we would be best served by having two code paths:
>
> - Access to VM memory directly mapped into kernel
> - Access to userspace
>
>
> Having it all cleanly split will allow a bunch of optimizations, for
> example for years now we planned to be able to process an incoming short
> packet directly on softirq path, or an outgoing on directly within
> eventfd.


It's not hard consider we've already had our own accssors. But the 
question is (as asked in another thread), do you want permanent GUP or 
still use MMU notifiers.

Thanks


^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Gunthorpe, kvm, virtualization, netdev, linux-kernel,
	linux-mm
In-Reply-To: <20190802094331-mutt-send-email-mst@kernel.org>


On 2019/8/2 下午10:03, Michael S. Tsirkin wrote:
> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>> Btw, I come up another idea, that is to disable preemption when vhost thread
>> need to access the memory. Then register preempt notifier and if vhost
>> thread is preempted, we're sure no one will access the memory and can do the
>> cleanup.
> Great, more notifiers :(
>
> Maybe can live with
> 1- disable preemption while using the cached pointer
> 2- teach vhost to recover from memory access failures,
>     by switching to regular from/to user path


I don't get this, I believe we want to recover from regular from/to user 
path, isn't it?


>
> So if you want to try that, fine since it's a step in
> the right direction.
>
> But I think fundamentally it's not what we want to do long term.


Yes.


>
> It's always been a fundamental problem with this patch series that only
> metadata is accessed through a direct pointer.
>
> The difference in ways you handle metadata and data is what is
> now coming and messing everything up.


I do propose soemthing like this in the past: 
https://www.spinics.net/lists/linux-virtualization/msg36824.html. But 
looks like you have some concern about its locality.

But the problem still there, GUP can do page fault, so still need to 
synchronize it with MMU notifiers. The solution might be something like 
moving GUP to a dedicated kind of vhost work.


>
> So if continuing the direct map approach,
> what is needed is a cache of mapped VM memory, then on a cache miss
> we'd queue work along the lines of 1-2 above.
>
> That's one direction to take. Another one is to give up on that and
> write our own version of uaccess macros.  Add a "high security" flag to
> the vhost module and if not active use these for userspace memory
> access.


Or using SET_BACKEND_FEATURES? But do you mean permanent GUP as I did in 
original RFC https://lkml.org/lkml/2018/12/13/218?

Thanks

>
>

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-05  4:20 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: mst, kvm, virtualization, netdev, linux-kernel, linux-mm
In-Reply-To: <20190802124613.GA11245@ziepe.ca>


On 2019/8/2 下午8:46, Jason Gunthorpe wrote:
> On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang wrote:
>>> This must be a proper barrier, like a spinlock, mutex, or
>>> synchronize_rcu.
>>
>> I start with synchronize_rcu() but both you and Michael raise some
>> concern.
> I've also idly wondered if calling synchronize_rcu() under the various
> mm locks is a deadlock situation.


Maybe, that's why I suggest to use vhost_work_flush() which is much 
lightweight can can achieve the same function. It can guarantee all 
previous work has been processed after vhost_work_flush() return.


>
>> Then I try spinlock and mutex:
>>
>> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
>> improvement.
> I think the topic here is correctness not performance improvement


But the whole series is to speed up vhost.


>
>> 2) SRCU: full memory barrier requires on srcu_read_lock(), which still leads
>> little performance improvement
>   
>> 3) mutex: a possible issue is need to wait for the page to be swapped in (is
>> this unacceptable ?), another issue is that we need hold vq lock during
>> range overlap check.
> I have a feeling that mmu notififers cannot safely become dependent on
> progress of swap without causing deadlock. You probably should avoid
> this.


Yes, so that's why I try to synchronize the critical region by myself.


>>> And, again, you can't re-invent a spinlock with open coding and get
>>> something better.
>> So the question is if waiting for swap is considered to be unsuitable for
>> MMU notifiers. If not, it would simplify codes. If not, we still need to
>> figure out a possible solution.
>>
>> Btw, I come up another idea, that is to disable preemption when vhost thread
>> need to access the memory. Then register preempt notifier and if vhost
>> thread is preempted, we're sure no one will access the memory and can do the
>> cleanup.
> I think you should use the spinlock so at least the code is obviously
> functionally correct and worry about designing some properly justified
> performance change after.
>
> Jason


Spinlock is correct but make the whole series meaningless consider it 
won't bring any performance improvement.

Thanks



^ permalink raw reply

* Re: [PATCH v2 20/34] xen: convert put_page() to put_user_page*()
From: Juergen Gross @ 2019-08-05  4:15 UTC (permalink / raw)
  To: john.hubbard, Andrew Morton
  Cc: devel, Dave Chinner, Christoph Hellwig, Dan Williams, Ira Weiny,
	x86, linux-mm, Dave Hansen, amd-gfx, dri-devel, intel-gfx,
	linux-arm-kernel, linux-rpi-kernel, devel, xen-devel,
	John Hubbard, Boris Ostrovsky, rds-devel, Jérôme Glisse,
	Jan Kara, ceph-devel, kvm, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, LKML, linux-media, linux-nfs, linux-rdma,
	linux-xfs, netdev, sparclinux, Jason Gunthorpe
In-Reply-To: <20190804224915.28669-21-jhubbard@nvidia.com>

On 05.08.19 00:49, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> For pages that were retained via get_user_pages*(), release those pages
> via the new put_user_page*() routines, instead of via put_page() or
> release_pages().
> 
> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions").
> 
> This also handles pages[i] == NULL cases, thanks to an approach
> that is actually written by Juergen Gross.
> 
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> 
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: xen-devel@lists.xenproject.org
> ---
> 
> Hi Juergen,
> 
> Say, this is *exactly* what you proposed in your gup.patch, so
> I've speculatively added your Signed-off-by above, but need your
> approval before that's final. Let me know please...

Yes, that's fine with me.


Juergen

^ permalink raw reply

* Re: BPF: ETLS: RECV FLOW
From: John Fastabend @ 2019-08-05  3:43 UTC (permalink / raw)
  To: Shridhar Venkatraman, netdev
In-Reply-To: <CADJe1ZsZcWrtdJGgXeoEnG4FFUGxT-BmJvJW2xwDUF+uCUp-kA@mail.gmail.com>

Shridhar Venkatraman wrote:
> Hi,
> 
> The eTLS work has BPF integration which is great.
> However there is one spot where access to the clear text is not available.

Guessing eTLS is a typo for KTLS.

> 
> From kernel 4.20 - receiver BPF support added for KTLS.
> 
> a. receiver BPF is applied on encrypted message
> b. after applying BPF, message is decrypted
> c. BPF run logic on the decrypted plain message   - can we add this support ?
> d. then copy the decrypted message back to userspace.
> 
> code flow reference: tls receive message call flow:
> --------------------------------------------------------------
> 
> tls_sw_recvmsg
>   __tcp_bpf_recvmsg [ bpf exec function called on encrypted message ]
>   decrypt_skb_update
>   decrypt_internal
>   BPF_PROG_RUN on decrypted plain message - can we add this support ?
>   skb_copy_datagram_msg [ decrypted message copied back to userspace ]

Yes I'm aware of this I'll push patches this release cycle. At least that
is the plan. I have some internal patches I've been running for some time
but need to clean up an edge case. Hopefully should get to it this week
after fixing up a couple bugs first.

> 
> Thanks
> ps: I sent this to the bpf list as I don't know which one it should go to

sending to bpf list and CC netdev would work or just BPF list.

.John

^ permalink raw reply

* [PATCH net-next v2] openvswitch: Print error when ovs_execute_actions() fails
From: Yifeng Sun @ 2019-08-05  2:56 UTC (permalink / raw)
  To: netdev, pshelar, gvrose8192; +Cc: Yifeng Sun

Currently in function ovs_dp_process_packet(), return values of
ovs_execute_actions() are silently discarded. This patch prints out
an debug message when error happens so as to provide helpful hints
for debugging.
---
v1->v2: Fixed according to Pravin's review.

 net/openvswitch/datapath.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 892287d..12d9850 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -222,6 +222,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 	struct dp_stats_percpu *stats;
 	u64 *stats_counter;
 	u32 n_mask_hit;
+	int error;
 
 	stats = this_cpu_ptr(dp->stats_percpu);
 
@@ -229,7 +230,6 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 	flow = ovs_flow_tbl_lookup_stats(&dp->table, key, &n_mask_hit);
 	if (unlikely(!flow)) {
 		struct dp_upcall_info upcall;
-		int error;
 
 		memset(&upcall, 0, sizeof(upcall));
 		upcall.cmd = OVS_PACKET_CMD_MISS;
@@ -246,7 +246,10 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
 
 	ovs_flow_stats_update(flow, key->tp.flags, skb);
 	sf_acts = rcu_dereference(flow->sf_acts);
-	ovs_execute_actions(dp, skb, sf_acts, key);
+	error = ovs_execute_actions(dp, skb, sf_acts, key);
+	if (unlikely(error))
+		net_dbg_ratelimited("ovs: action execution error on datapath %s: %d\n",
+							ovs_dp_name(dp), error);
 
 	stats_counter = &stats->n_hit;
 
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH net-next] openvswitch: Print error when ovs_execute_actions() fails
From: Yifeng Sun @ 2019-08-05  2:48 UTC (permalink / raw)
  To: Pravin Shelar; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAOrHB_CExkSymgCU5yGKg66XywJgNxihn8VQJNr8hw6cff0rOA@mail.gmail.com>

Yes, this fix is mainly for debugging purposes. If packets are
blackholed because
of errors from ovs_execute_actions(), we can got more helpful
information. Thanks
Pravin for the review, I will come up with a new version.
Yifeng

On Sat, Aug 3, 2019 at 4:00 PM Pravin Shelar <pshelar@ovn.org> wrote:
>
> On Thu, Aug 1, 2019 at 2:14 PM Yifeng Sun <pkusunyifeng@gmail.com> wrote:
> >
> > Currently in function ovs_dp_process_packet(), return values of
> > ovs_execute_actions() are silently discarded. This patch prints out
> > an error message when error happens so as to provide helpful hints
> > for debugging.
> >
> > Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
> > ---
> >  net/openvswitch/datapath.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> > index 892287d..603c533 100644
> > --- a/net/openvswitch/datapath.c
> > +++ b/net/openvswitch/datapath.c
> > @@ -222,6 +222,7 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> >         struct dp_stats_percpu *stats;
> >         u64 *stats_counter;
> >         u32 n_mask_hit;
> > +       int error;
> >
> >         stats = this_cpu_ptr(dp->stats_percpu);
> >
> > @@ -229,7 +230,6 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> >         flow = ovs_flow_tbl_lookup_stats(&dp->table, key, &n_mask_hit);
> >         if (unlikely(!flow)) {
> >                 struct dp_upcall_info upcall;
> > -               int error;
> >
> >                 memset(&upcall, 0, sizeof(upcall));
> >                 upcall.cmd = OVS_PACKET_CMD_MISS;
> > @@ -246,7 +246,10 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
> >
> >         ovs_flow_stats_update(flow, key->tp.flags, skb);
> >         sf_acts = rcu_dereference(flow->sf_acts);
> > -       ovs_execute_actions(dp, skb, sf_acts, key);
> > +       error = ovs_execute_actions(dp, skb, sf_acts, key);
> > +       if (unlikely(error))
> > +               net_err_ratelimited("ovs: action execution error on datapath %s: %d\n",
> > +                                                       ovs_dp_name(dp), error);
> >
>
> I would rather add error counter for better visibility.
> If you want to use current approach, can you use net_dbg_ratelimited()
> since you want to use this for debugging purpose?
>
> Thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: can: Fix compiling warning
From: maowenan @ 2019-08-05  1:22 UTC (permalink / raw)
  To: socketcan, davem, netdev; +Cc: linux-kernel, kernel-janitors
In-Reply-To: <20190805012254.59869-1-maowenan@huawei.com>

please drop this mail.

On 2019/8/5 9:22, Mao Wenan wrote:
> There are two warnings in net/can, fix them by setting bcm_sock_no_ioctlcmd
> and raw_sock_no_ioctlcmd as static.
> 
> net/can/bcm.c:1683:5: warning: symbol 'bcm_sock_no_ioctlcmd' was not declared. Should it be static?
> net/can/raw.c:840:5: warning: symbol 'raw_sock_no_ioctlcmd' was not declared. Should it be static?
> 
> Fixes: 473d924d7d46 ("can: fix ioctl function removal")
> 
> Signed-off-by: Mao Wenan <maowenan@huawei.com>
> ---
>  v1->v2: change patch description typo error, 'warings' to 'warnings'.
>  net/can/bcm.c | 2 +-
>  net/can/raw.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/can/bcm.c b/net/can/bcm.c
> index bf1d0bbecec8..b8a32b4ac368 100644
> --- a/net/can/bcm.c
> +++ b/net/can/bcm.c
> @@ -1680,7 +1680,7 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>  	return size;
>  }
>  
> -int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +static int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
>  			 unsigned long arg)
>  {
>  	/* no ioctls for socket layer -> hand it down to NIC layer */
> diff --git a/net/can/raw.c b/net/can/raw.c
> index da386f1fa815..a01848ff9b12 100644
> --- a/net/can/raw.c
> +++ b/net/can/raw.c
> @@ -837,7 +837,7 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
>  	return size;
>  }
>  
> -int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
> +static int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
>  			 unsigned long arg)
>  {
>  	/* no ioctls for socket layer -> hand it down to NIC layer */
> 


^ permalink raw reply

* [PATCH net-next v2] net: can: Fix compiling warning
From: Mao Wenan @ 2019-08-05  1:26 UTC (permalink / raw)
  To: socketcan, davem, netdev; +Cc: linux-kernel, kernel-janitors, Mao Wenan
In-Reply-To: <1d9e329a-eafc-6c32-ee2a-df3b15231a2a@huawei.com>

There are two warnings in net/can, fix them by setting bcm_sock_no_ioctlcmd
and raw_sock_no_ioctlcmd as static.

net/can/bcm.c:1683:5: warning: symbol 'bcm_sock_no_ioctlcmd' was not declared. Should it be static?
net/can/raw.c:840:5: warning: symbol 'raw_sock_no_ioctlcmd' was not declared. Should it be static?

Fixes: 473d924d7d46 ("can: fix ioctl function removal")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
---
 v1->v2: change patch description typo error, 'warings' to 'warnings'.
 net/can/bcm.c | 2 +-
 net/can/raw.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/can/bcm.c b/net/can/bcm.c
index bf1d0bbecec8..b8a32b4ac368 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1680,7 +1680,7 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+static int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
 			 unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
diff --git a/net/can/raw.c b/net/can/raw.c
index da386f1fa815..a01848ff9b12 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -837,7 +837,7 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+static int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
 			 unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next] net: can: Fix compiling warning
From: Mao Wenan @ 2019-08-05  1:22 UTC (permalink / raw)
  To: socketcan, davem, netdev; +Cc: linux-kernel, kernel-janitors, Mao Wenan

There are two warnings in net/can, fix them by setting bcm_sock_no_ioctlcmd
and raw_sock_no_ioctlcmd as static.

net/can/bcm.c:1683:5: warning: symbol 'bcm_sock_no_ioctlcmd' was not declared. Should it be static?
net/can/raw.c:840:5: warning: symbol 'raw_sock_no_ioctlcmd' was not declared. Should it be static?

Fixes: 473d924d7d46 ("can: fix ioctl function removal")

Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 v1->v2: change patch description typo error, 'warings' to 'warnings'.
 net/can/bcm.c | 2 +-
 net/can/raw.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/can/bcm.c b/net/can/bcm.c
index bf1d0bbecec8..b8a32b4ac368 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -1680,7 +1680,7 @@ static int bcm_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+static int bcm_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
 			 unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
diff --git a/net/can/raw.c b/net/can/raw.c
index da386f1fa815..a01848ff9b12 100644
--- a/net/can/raw.c
+++ b/net/can/raw.c
@@ -837,7 +837,7 @@ static int raw_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	return size;
 }
 
-int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
+static int raw_sock_no_ioctlcmd(struct socket *sock, unsigned int cmd,
 			 unsigned long arg)
 {
 	/* no ioctls for socket layer -> hand it down to NIC layer */
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net-next] net: can: Fix compiling warning
From: maowenan @ 2019-08-05  1:13 UTC (permalink / raw)
  To: Sergei Shtylyov, socketcan, davem, netdev; +Cc: linux-kernel, kernel-janitors
In-Reply-To: <133b3357-e0a4-64c8-40b7-02d386e12cef@cogentembedded.com>



On 2019/8/2 16:59, Sergei Shtylyov wrote:
> Hello!
> 
> On 02.08.2019 6:36, Mao Wenan wrote:
> 
>> There are two warings in net/can, fix them by setting bcm_sock_no_ioctlcmd
> 
>    Warnings. :-)

Thanks, I will send v2.

> 
>> and raw_sock_no_ioctlcmd as static.
>>
>> net/can/bcm.c:1683:5: warning: symbol 'bcm_sock_no_ioctlcmd' was not declared. Should it be static?
>> net/can/raw.c:840:5: warning: symbol 'raw_sock_no_ioctlcmd' was not declared. Should it be static?
>>
>> Fixes: 473d924d7d46 ("can: fix ioctl function removal")
>>
>> Signed-off-by: Mao Wenan <maowenan@huawei.com>
> [...]
> 
> MBR, Sergei
> 
> 


^ permalink raw reply

* [PATCH] tools: bpftool: fix reading from /proc/config.gz
From: Peter Wu @ 2019-08-05  0:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: netdev, Stanislav Fomichev, Jakub Kicinski, Quentin Monnet

/proc/config has never existed as far as I can see, but /proc/config.gz
is present on Arch Linux. Execute an external gunzip program to avoid
linking to zlib and rework the option scanning code since a pipe is not
seekable. This also fixes a file handle leak on some error paths.

Fixes: 4567b983f78c ("tools: bpftool: add probes for kernel configuration options")
Cc: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
---
 tools/bpf/bpftool/feature.c | 92 +++++++++++++++++++++----------------
 1 file changed, 52 insertions(+), 40 deletions(-)

diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c
index d672d9086fff..e9e10f582047 100644
--- a/tools/bpf/bpftool/feature.c
+++ b/tools/bpf/bpftool/feature.c
@@ -284,34 +284,34 @@ static void probe_jit_limit(void)
 	}
 }
 
-static char *get_kernel_config_option(FILE *fd, const char *option)
+static bool get_kernel_config_option(FILE *fd, char **buf_p, size_t *n_p,
+				     char **value)
 {
-	size_t line_n = 0, optlen = strlen(option);
-	char *res, *strval, *line = NULL;
-	ssize_t n;
+	char *sep;
+	ssize_t linelen;
 
-	rewind(fd);
-	while ((n = getline(&line, &line_n, fd)) > 0) {
-		if (strncmp(line, option, optlen))
+	while ((linelen = getline(buf_p, n_p, fd)) > 0) {
+		char *line = *buf_p;
+		if (strncmp(line, "CONFIG_", 7))
 			continue;
-		/* Check we have at least '=', value, and '\n' */
-		if (strlen(line) < optlen + 3)
-			continue;
-		if (*(line + optlen) != '=')
+
+		sep = memchr(line, '=', linelen);
+		if (!sep)
 			continue;
 
 		/* Trim ending '\n' */
-		line[strlen(line) - 1] = '\0';
+		line[linelen - 1] = '\0';
+
+		/* Split on '=' and ensure that a value is present. */
+		*sep = '\0';
+		if (!sep[1])
+			continue;
 
-		/* Copy and return config option value */
-		strval = line + optlen + 1;
-		res = strdup(strval);
-		free(line);
-		return res;
+		*value = sep + 1;
+		return true;
 	}
-	free(line);
 
-	return NULL;
+	return false;
 }
 
 static void probe_kernel_image_config(void)
@@ -386,31 +386,34 @@ static void probe_kernel_image_config(void)
 		/* test_bpf module for BPF tests */
 		"CONFIG_TEST_BPF",
 	};
+	char *values[ARRAY_SIZE(options)] = { };
 	char *value, *buf = NULL;
 	struct utsname utsn;
 	char path[PATH_MAX];
 	size_t i, n;
 	ssize_t ret;
-	FILE *fd;
+	FILE *fd = NULL;
+	bool is_pipe = false;
 
 	if (uname(&utsn))
-		goto no_config;
+		goto end_parse;
 
 	snprintf(path, sizeof(path), "/boot/config-%s", utsn.release);
 
 	fd = fopen(path, "r");
 	if (!fd && errno == ENOENT) {
-		/* Some distributions put the config file at /proc/config, give
-		 * it a try.
-		 * Sometimes it is also at /proc/config.gz but we do not try
-		 * this one for now, it would require linking against libz.
+		/* Some distributions build with CONFIG_IKCONFIG=y and put the
+		 * config file at /proc/config.gz. We try to invoke an external
+		 * gzip program to avoid linking to libz.
+		 * Hide stderr to avoid interference with the JSON output.
 		 */
-		fd = fopen("/proc/config", "r");
+		fd = popen("gunzip -c /proc/config.gz 2>/dev/null", "r");
+		is_pipe = true;
 	}
 	if (!fd) {
 		p_info("skipping kernel config, can't open file: %s",
 		       strerror(errno));
-		goto no_config;
+		goto end_parse;
 	}
 	/* Sanity checks */
 	ret = getline(&buf, &n, fd);
@@ -418,27 +421,36 @@ static void probe_kernel_image_config(void)
 	if (!buf || !ret) {
 		p_info("skipping kernel config, can't read from file: %s",
 		       strerror(errno));
-		free(buf);
-		goto no_config;
+		goto end_parse;
 	}
 	if (strcmp(buf, "# Automatically generated file; DO NOT EDIT.\n")) {
 		p_info("skipping kernel config, can't find correct file");
-		free(buf);
-		goto no_config;
+		goto end_parse;
+	}
+
+	while (get_kernel_config_option(fd, &buf, &n, &value)) {
+		for (i = 0; i < ARRAY_SIZE(options); i++) {
+			if (values[i] || strcmp(buf, options[i]))
+				continue;
+
+			values[i] = strdup(value);
+		}
+	}
+
+end_parse:
+	if (fd) {
+		if (is_pipe) {
+			if (pclose(fd))
+				p_info("failed to read /proc/config.gz");
+		} else
+			fclose(fd);
 	}
 	free(buf);
 
 	for (i = 0; i < ARRAY_SIZE(options); i++) {
-		value = get_kernel_config_option(fd, options[i]);
-		print_kernel_option(options[i], value);
-		free(value);
+		print_kernel_option(options[i], values[i]);
+		free(values[i]);
 	}
-	fclose(fd);
-	return;
-
-no_config:
-	for (i = 0; i < ARRAY_SIZE(options); i++)
-		print_kernel_option(options[i], NULL);
 }
 
 static bool probe_bpf_syscall(const char *define_prefix)
-- 
2.22.0


^ permalink raw reply related

* Re: [PATCH v2 31/34] fs/nfs: convert put_page() to put_user_page*()
From: Calum Mackay @ 2019-08-05  0:26 UTC (permalink / raw)
  To: john.hubbard, Andrew Morton
  Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
	LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
	kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, John Hubbard, Trond Myklebust, Anna Schumaker
In-Reply-To: <20190804224915.28669-32-jhubbard@nvidia.com>

On 04/08/2019 11:49 pm, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> For pages that were retained via get_user_pages*(), release those pages
> via the new put_user_page*() routines, instead of via put_page() or
> release_pages().
> 
> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions").
> 
> Cc: Calum Mackay <calum.mackay@oracle.com>
> Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
> Cc: Anna Schumaker <anna.schumaker@netapp.com>
> Cc: linux-nfs@vger.kernel.org
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>   fs/nfs/direct.c | 11 ++---------
>   1 file changed, 2 insertions(+), 9 deletions(-)

Reviewed-by: Calum Mackay <calum.mackay@oracle.com>


> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 0cb442406168..c0c1b9f2c069 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -276,13 +276,6 @@ ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>   	return nfs_file_direct_write(iocb, iter);
>   }
>   
> -static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
> -{
> -	unsigned int i;
> -	for (i = 0; i < npages; i++)
> -		put_page(pages[i]);
> -}
> -
>   void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
>   			      struct nfs_direct_req *dreq)
>   {
> @@ -512,7 +505,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
>   			pos += req_len;
>   			dreq->bytes_left -= req_len;
>   		}
> -		nfs_direct_release_pages(pagevec, npages);
> +		put_user_pages(pagevec, npages);
>   		kvfree(pagevec);
>   		if (result < 0)
>   			break;
> @@ -935,7 +928,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
>   			pos += req_len;
>   			dreq->bytes_left -= req_len;
>   		}
> -		nfs_direct_release_pages(pagevec, npages);
> +		put_user_pages(pagevec, npages);
>   		kvfree(pagevec);
>   		if (result < 0)
>   			break;
> 

^ permalink raw reply

* Re: [PATCH] Documentation: virt: Fix broken reference to virt tree's index
From: Matthew Wilcox @ 2019-08-05  0:10 UTC (permalink / raw)
  To: Sheriff Esseson
  Cc: skhan, linux-kernel-mentees, Jonathan Corbet, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song,
	open list:DOCUMENTATION, open list, open list:RISC-V ARCHITECTURE,
	open list:BPF (Safe dynamic programs and tools),
	open list:BPF (Safe dynamic programs and tools)
In-Reply-To: <20190804154635.GA18475@localhost>

On Sun, Aug 04, 2019 at 04:46:35PM +0100, Sheriff Esseson wrote:
> Fix broken reference to virt/index.rst.
> 
> Sequel to: 2f5947dfcaec ("Documentation: move Documentation/virtual to
> Documentation/virt")

'Sequel to'?  Do you mean 'Fixes'?

> Reported-by: Sphinx

Reported-by is used for people.  See
Documentation/process/submitting-patches.rst section 13.


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-05  0:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Song Liu, Kees Cook, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <CALCETrU7NbBnXXsw1B+DvTkfTVRBFWXuJ8cZERCCNvdFG6KqRw@mail.gmail.com>

On Sun, Aug 4, 2019 at 3:16 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Fri, Aug 2, 2019 at 12:22 AM Song Liu <songliubraving@fb.com> wrote:
> >
> > Hi Andy,
> >
>  >> I actually agree CAP_BPF_ADMIN makes sense. The hard part is to make
> > >> existing tools (setcap, getcap, etc.) and libraries aware of the new CAP.
> > >
> > > It's been done before -- it's not that hard.  IMO the main tricky bit
> > > would be try be somewhat careful about defining exactly what
> > > CAP_BPF_ADMIN does.
> >
> > Agreed. I think defining CAP_BPF_ADMIN could be a good topic for the
> > Plumbers conference.
> >
> > OTOH, I don't think we have to wait for CAP_BPF_ADMIN to allow daemons
> > like systemd to do sys_bpf() without root.
>
> I don't understand the use case here.  Are you talking about systemd
> --user?  As far as I know, a user is expected to be able to fully
> control their systemd --user process, so giving it unrestricted bpf
> access is very close to giving it superuser access, and this doesn't
> sound like a good idea.  I think that, if systemd --user needs bpf(),
> it either needs real unprivileged bpf() or it needs a privileged
> helper (SUID or a daemon) to intermediate this access.
>
> >
> > >
> > >>> I don't see why you need to invent a whole new mechanism for this.
> > >>> The entire cgroup ecosystem outside bpf() does just fine using the
> > >>> write permission on files in cgroupfs to control access.  Why can't
> > >>> bpf() do the same thing?
> > >>
> > >> It is easier to use write permission for BPF_PROG_ATTACH. But it is
> > >> not easy to do the same for other bpf commands: BPF_PROG_LOAD and
> > >> BPF_MAP_*. A lot of these commands don't have target concept. Maybe
> > >> we should have target concept for all these commands. But that is a
> > >> much bigger project. OTOH, "all or nothing" model allows all these
> > >> commands at once.
> > >
> > > For BPF_PROG_LOAD, I admit I've never understood why permission is
> > > required at all.  I think that CAP_SYS_ADMIN or similar should be
> > > needed to get is_priv in the verifier, but I think that should mainly
> > > be useful for tracing, and that requires lots of privilege anyway.
> > > BPF_MAP_* is probably the trickiest part.  One solution would be some
> > > kind of bpffs, but I'm sure other solutions are possible.
> >
> > Improving permission management of cgroup_bpf is another good topic to
> > discuss. However, it is also an overkill for current use case.
> >
>
> I looked at the code some more, and I don't think this is so hard
> after all.  As I understand it, all of the map..by_id stuff is, to
> some extent, deprecated in favor of persistent maps.  As I see it, the
> map..by_id calls should require privilege forever, although I can
> imagine ways to scope that privilege to a namespace if the maps
> themselves were to be scoped to a namespace.
>
> Instead, unprivileged tools would use the persistent map interface
> roughly like this:
>
> $ bpftool map create /sys/fs/bpf/my_dir/filename type hash key 8 value
> 8 entries 64 name mapname
>
> This would require that the caller have either CAP_DAC_OVERRIDE or
> that the caller have permission to create files in /sys/fs/bpf/my_dir
> (using the same rules as for any filesystem), and the resulting map
> would end up owned by the creating user and have mode 0600 (or maybe
> 0666, or maybe a new bpf_attr parameter) modified by umask.  Then all
> the various capable() checks that are currently involved in accessing
> a persistent map would instead check FMODE_READ or FMODE_WRITE on the
> map file as appropriate.
>
> Half of this stuff already works.  I just set my system up like this:
>
> $ ls -l /sys/fs/bpf
> total 0
> drwxr-xr-x. 3 luto luto 0 Aug  4 15:10 luto
>
> $ mkdir /sys/fs/bpf/luto/test
>
> $ ls -l /sys/fs/bpf/luto
> total 0
> drwxrwxr-x. 2 luto luto 0 Aug  4 15:10 test
>
> I bet that making the bpf() syscalls work appropriately in this
> context without privilege would only be a couple of hours of work.
> The hard work, creating bpffs and making it function, is already done
> :)
>
> P.S. The docs for bpftool create are less than fantastic.  The
> complete lack of any error message at all when the syscall returns
> -EACCES is also not fantastic.

This isn't remotely finished, but I spent a bit of time fiddling with this:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=bpf/perms

What do you think?  (It's obviously not done.  It doesn't compile, and
I haven't gotten to the permissions needed to do map operations.  I
also haven't touched the capable() checks.)

^ permalink raw reply

* Re: [PATCH 31/34] nfs: convert put_page() to put_user_page*()
From: Calum Mackay @ 2019-08-04 23:28 UTC (permalink / raw)
  To: John Hubbard, john.hubbard, Andrew Morton
  Cc: calum.mackay, Christoph Hellwig, Dan Williams, Dave Chinner,
	Dave Hansen, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jérôme Glisse, LKML, amd-gfx, ceph-devel, devel, devel,
	dri-devel, intel-gfx, kvm, linux-arm-kernel, linux-block,
	linux-crypto, linux-fbdev, linux-fsdevel, linux-media, linux-mm,
	linux-nfs, linux-rdma, linux-rpi-kernel, linux-xfs, netdev,
	rds-devel, sparclinux, x86, xen-devel, Trond Myklebust,
	Anna Schumaker
In-Reply-To: <db136399-ed87-56ea-bd6e-e5d29b145eda@nvidia.com>

On 03/08/2019 2:41 am, John Hubbard wrote:
> On 8/2/19 6:27 PM, Calum Mackay wrote:
>> On 02/08/2019 3:20 am, john.hubbard@gmail.com wrote:
> ...
>> Since it's static, and only called twice, might it be better to change its two callers [nfs_direct_{read,write}_schedule_iovec()] to call put_user_pages() directly, and remove nfs_direct_release_pages() entirely?
>>
>> thanks,
>> calum.
>>
>>
>>>      void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
>>>
>   
> Hi Calum,
> 
> Absolutely! Is it OK to add your reviewed-by, with the following incremental
> patch made to this one?

Thanks John; looks good.

Reviewed-by: Calum Mackay <calum.mackay@oracle.com>

> 
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index b00b89dda3c5..c0c1b9f2c069 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -276,11 +276,6 @@ ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>          return nfs_file_direct_write(iocb, iter);
>   }
>   
> -static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
> -{
> -       put_user_pages(pages, npages);
> -}
> -
>   void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
>                                struct nfs_direct_req *dreq)
>   {
> @@ -510,7 +505,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
>                          pos += req_len;
>                          dreq->bytes_left -= req_len;
>                  }
> -               nfs_direct_release_pages(pagevec, npages);
> +               put_user_pages(pagevec, npages);
>                  kvfree(pagevec);
>                  if (result < 0)
>                          break;
> @@ -933,7 +928,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
>                          pos += req_len;
>                          dreq->bytes_left -= req_len;
>                  }
> -               nfs_direct_release_pages(pagevec, npages);
> +               put_user_pages(pagevec, npages);
>                  kvfree(pagevec);
>                  if (result < 0)
>                          break;
> 
> 
> 
> thanks,
> 

^ permalink raw reply

* Re: Slowness forming TIPC cluster with explicit node addresses
From: Chris Packham @ 2019-08-04 23:04 UTC (permalink / raw)
  To: jon.maloy@ericsson.com, tipc-discussion@lists.sourceforge.net
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <CH2PR15MB3575BF6FC4001C19B8A789559ADB0@CH2PR15MB3575.namprd15.prod.outlook.com>

On Sun, 2019-08-04 at 21:53 +0000, Jon Maloy wrote:
> 
> > 
> > -----Original Message-----
> > From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org>
> > On
> > Behalf Of Chris Packham
> > Sent: 2-Aug-19 01:11
> > To: Jon Maloy <jon.maloy@ericsson.com>; tipc-
> > discussion@lists.sourceforge.net
> > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: Re: Slowness forming TIPC cluster with explicit node
> > addresses
> > 
> > On Mon, 2019-07-29 at 09:04 +1200, Chris Packham wrote:
> > > 
> > > On Fri, 2019-07-26 at 13:31 +0000, Jon Maloy wrote:
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: netdev-owner@vger.kernel.org <netdev-
> > owner@vger.kernel.org>
> > > 
> > > > 
> > > > > 
> > > > > On Behalf Of Chris Packham
> > > > > Sent: 25-Jul-19 19:37
> > > > > To: tipc-discussion@lists.sourceforge.net
> > > > > Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org
> > > > > Subject: Slowness forming TIPC cluster with explicit node
> > > > > addresses
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > I'm having problems forming a TIPC cluster between 2 nodes.
> > > > > 
> > > > > This is the basic steps I'm going through on each node.
> > > > > 
> > > > > modprobe tipc
> > > > > ip link set eth2 up
> > > > > tipc node set addr 1.1.5 # or 1.1.6 tipc bearer enable media
> > > > > eth
> > > > > dev eth0
> > > > eth2, I assume...
> > > > 
> > > Yes sorry I keep switching between between Ethernet ports for
> > > testing
> > > so I hand edited the email.
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > Then to confirm if the cluster is formed I use tipc link list
> > > > > 
> > > > > [root@node-5 ~]# tipc link list
> > > > > broadcast-link: up
> > > > > ...
> > > > > 
> > > > > Looking at tcpdump the two nodes are sending packets
> > > > > 
> > > > > 22:30:05.782320 TIPC v2.0 1.1.5 > 0.0.0, headerlength 60
> > > > > bytes,
> > > > > MessageSize
> > > > > 76 bytes, Neighbor Detection Protocol internal, messageType
> > > > > Link
> > > > > request
> > > > > 22:30:05.863555 TIPC v2.0 1.1.6 > 0.0.0, headerlength 60
> > > > > bytes,
> > > > > MessageSize
> > > > > 76 bytes, Neighbor Detection Protocol internal, messageType
> > > > > Link
> > > > > request
> > > > > 
> > > > > Eventually (after a few minutes) the link does come up
> > > > > 
> > > > > [root@node-6 ~]# tipc link list
> > > > > broadcast-link: up
> > > > > 1001006:eth2-1001005:eth2: up
> > > > > 
> > > > > [root@node-5 ~]# tipc link list
> > > > > broadcast-link: up
> > > > > 1001005:eth2-1001006:eth2: up
> > > > > 
> > > > > When I remove the "tipc node set addr" things seem to kick
> > > > > into
> > > > > life straight away
> > > > > 
> > > > > [root@node-5 ~]# tipc link list
> > > > > broadcast-link: up
> > > > > 0050b61bd2aa:eth2-0050b61e6dfa:eth2: up
> > > > > 
> > > > > So there appears to be some difference in behaviour between
> > > > > having
> > > > > an explicit node address and using the default. Unfortunately
> > > > > our
> > > > > application relies on setting the node addresses.
> > > > I do this many times a day, without any problems. If there
> > > > would be
> > > > any time difference, I would expect the 'auto configurable'
> > > > version
> > > > to be slower, because it involves a DAD step.
> > > > Are you sure you don't have any other nodes running in your
> > > > system?
> > > > 
> > > > ///jon
> > > > 
> > > Nope the two nodes are connected back to back. Does the number of
> > > Ethernet interfaces make a difference? As you can see I've got 3
> > > on
> > > each node. One is completely disconnected, one is for booting
> > > over
> > > TFTP
> > >  (only used by U-boot) and the other is the USB Ethernet I'm
> > > using for
> > > testing.
> > > 
> > So I can still reproduce this on nodes that only have one network
> > interface and
> > are the only things connected.
> > 
> > I did find one thing that helps
> > 
> > diff --git a/net/tipc/discover.c b/net/tipc/discover.c index
> > c138d68e8a69..49921dad404a 100644
> > --- a/net/tipc/discover.c
> > +++ b/net/tipc/discover.c
> > @@ -358,10 +358,10 @@ int tipc_disc_create(struct net *net, struct
> > tipc_bearer *b,
> >         tipc_disc_init_msg(net, d->skb, DSC_REQ_MSG, b);
> > 
> >         /* Do we need an address trial period first ? */
> > -       if (!tipc_own_addr(net)) {
> > +//     if (!tipc_own_addr(net)) {
> >                 tn->addr_trial_end = jiffies +
> > msecs_to_jiffies(1000);
> >                 msg_set_type(buf_msg(d->skb), DSC_TRIAL_MSG);
> > -       }
> > +//     }
> >         memcpy(&d->dest, dest, sizeof(*dest));
> >         d->net = net;
> >         d->bearer_id = b->identity;
> > 
> > I think because with pre-configured addresses the duplicate address
> > detection
> > is skipped the shorter init phase is skipped. Would is make sense
> > to
> > unconditionally do the trial step? Or is there some better way to
> > get things to
> > transition with pre-assigned addresses.
>
> I am on vacation until the end of next-week, so I can't give you any
> good analysis right now.

Thanks for taking the time to respond.

> To do the trial step doesn’t make much sense to me, -it would only
> delay the setup unnecessarily (but with only 1 second).
> Can you check the initial value of addr_trial_end when there a pre-
> configured address?

I had the same thought. For both my devices 'addr_trial_end = 0' so I
think tipc_disc_addr_trial_msg should end up with trial == false

> 
> ///jon
> 

^ permalink raw reply

* Re: [PATCH 1/1] bpf: introduce new helper udp_flow_src_port
From: Y Song @ 2019-08-04 23:04 UTC (permalink / raw)
  To: Farid Zakaria; +Cc: Alexei Starovoitov, Daniel Borkmann, netdev, bpf
In-Reply-To: <CACCo2j=RAua1E0d6E+tVoOG=q1sSLuZpLqx32dY4mmhYNtDzvg@mail.gmail.com>

On Sun, Aug 4, 2019 at 1:43 PM Farid Zakaria <farid.m.zakaria@gmail.com> wrote:
>
> * re-sending as I've sent previously as HTML ... sorry *
>
> First off, thank you for taking the time to review this patch.

You are welcome. Also, just let you know typically people do
interleaved reply instead of
top reply.

>
> It's not clear to me the backport you'd like for libbpf, I have been following
> the documentation outlined in
> https://www.kernel.org/doc/html/latest/bpf/index.html
> I will hold off on changes to this patch as you've asked for it going forward.

The document at the above link does not specify how patches are structured.
What I suggest is what people typically do here for each review and cherry-pick
in different cases. There are no functionality changes in your patch. Just
break it into three commits instead one. The cover letter is still needed.
There are more examples in here:
https://lore.kernel.org/bpf/

>
> This patch is inspired by a MPLSoUDP (https://tools.ietf.org/html/rfc7510)]
> kernel module that was ported to eBPF -- our implementation is slightly
> modified for our custom use case.
>
> The Linux kernel provides a single abstraction for the src port for
> UDP tunneling
> via udp_flow_src_port. If it's improved eBPF filters would benefit if
> the call is the same.
>
> Exposing this function to eBPF programs would maintain feature parity
> with other kernel
> tunneling implementations.

Thanks for providing the detailed information above. Such use case
information should be in the commit message to answer the question
like why we need this feature.

If I understand correctly, you try to do MPLS over UDP in bpf program,
right? Your want to the UDP source port to be the same as the one
computed by kernel? Approximation is possible but it may introduce
different behavior at routing side (ECMP) and you do not want that to happen.

Your test case only tries to modify a tcp packet source port and checks it
is indeed changed. It would be good if your test case can be a little bit
closer to your use case.

You can submit v2 of the patch set with 3 commits, more detailed
commit messages and possibly modified test cases, so we can
continue to review.

>
> Farid Zakaria
>
> Farid Zakaria
>
>
>
> On Sun, Aug 4, 2019 at 1:41 PM Farid Zakaria <farid.m.zakaria@gmail.com> wrote:
> >
> > First off, thank you for taking the time to review this patch.
> >
> > It's not clear to me the backport you'd like for libbpf, I have been following
> > the documentation outlined in https://www.kernel.org/doc/html/latest/bpf/index.html
> > I will hold off on changes to this patch as you've asked for it going forward.
> >
> > This patch is inspired by a MPLSoUDP (https://tools.ietf.org/html/rfc7510)]
> > kernel module that was ported to eBPF -- our implementation is slightly
> > modified for our custom use case.
> >
> > The Linux kernel provides a single abstraction for the src port for UDP tunneling
> > via udp_flow_src_port. If it's improved eBPF filters would benefit if the call is the same.
> >
> > Exposing this function to eBPF programs would maintain feature parity with other kernel
> > tunneling implementations.
> >
> > Cheers,
> > Farid Zakaria
> >
> >
> >
> > On Sat, Aug 3, 2019 at 11:52 PM Y Song <ys114321@gmail.com> wrote:
> >>
> >> On Sat, Aug 3, 2019 at 8:29 PM Farid Zakaria <farid.m.zakaria@gmail.com> wrote:
> >> >
> >> > Foo over UDP uses UDP encapsulation to add additional entropy
> >> > into the packets so that they get beter distribution across EMCP
> >> > routes.
> >> >
> >> > Expose udp_flow_src_port as a bpf helper so that tunnel filters
> >> > can benefit from the helper.
> >> >
> >> > Signed-off-by: Farid Zakaria <farid.m.zakaria@gmail.com>
> >> > ---
> >> >  include/uapi/linux/bpf.h                      | 21 +++++++--
> >> >  net/core/filter.c                             | 20 ++++++++
> >> >  tools/include/uapi/linux/bpf.h                | 21 +++++++--
> >> >  tools/testing/selftests/bpf/bpf_helpers.h     |  2 +
> >> >  .../bpf/prog_tests/udp_flow_src_port.c        | 28 +++++++++++
> >> >  .../bpf/progs/test_udp_flow_src_port_kern.c   | 47 +++++++++++++++++++
> >> >  6 files changed, 131 insertions(+), 8 deletions(-)
> >> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/udp_flow_src_port.c
> >> >  create mode 100644 tools/testing/selftests/bpf/progs/test_udp_flow_src_port_kern.c
> >>
> >> First, for each review, backport and sync with libbpf repo, in the future,
> >> could you break the patch to two patches?
> >>    1. kernel changes (net/core/filter.c, include/uapi/linux/bpf.h)
> >>    2. tools/include/uapi/linux/bpf.h
> >>    3. tools/testing/ changes
> >>
> >> Second, could you explain why existing __sk_buff->hash not enough?
> >> there are corner cases where if __sk_buff->hash is 0 and the kernel did some
> >> additional hashing, but maybe you can approximate in bpf program?
> >> For case, min >= max, I suppose you can get min/max port values
> >> from the user space for a particular net device and then calculate
> >> the hash in the bpf program?
> >> What I want to know if how much accuracy you will lose if you just
> >> use __sk_buff->hash and do approximation in bpf program.
> >>
> >> >
> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> > index 4393bd4b2419..90e814153dec 100644
> >> > --- a/include/uapi/linux/bpf.h
> >> > +++ b/include/uapi/linux/bpf.h
> >> > @@ -2545,9 +2545,21 @@ union bpf_attr {
> >> >   *             *th* points to the start of the TCP header, while *th_len*
> >> >   *             contains **sizeof**\ (**struct tcphdr**).
> >> >   *
> >> > - *     Return
> >> > - *             0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
> >> > - *             error otherwise.
> >> > + *  Return
> >> > + *      0 if *iph* and *th* are a valid SYN cookie ACK, or a negative
> >> > + *      error otherwise.
> >> > + *
> >> > + * int bpf_udp_flow_src_port(struct sk_buff *skb, int min, int max, int use_eth)
> >> > + *  Description
> >> > + *      It's common to implement tunnelling inside a UDP protocol to provide
> >> > + *      additional randomness to the packet. The destination port of the UDP
> >> > + *      header indicates the inner packet type whereas the source port is used
> >> > + *      for additional entropy.
> >> > + *
> >> > + *  Return
> >> > + *      An obfuscated hash of the packet that falls within the
> >> > + *      min & max port range.
> >> > + *      If min >= max, the default port range is used
> >> >   *
> >> >   * int bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags)
> >> >   *     Description
> >> > @@ -2853,7 +2865,8 @@ union bpf_attr {
> >> >         FN(sk_storage_get),             \
> >> >         FN(sk_storage_delete),          \
> >> >         FN(send_signal),                \
> >> > -       FN(tcp_gen_syncookie),
> >> > +       FN(tcp_gen_syncookie),  \
> >> > +       FN(udp_flow_src_port),
> >> >
> >> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >> >   * function eBPF program intends to call
> >> > diff --git a/net/core/filter.c b/net/core/filter.c
> >> > index 5a2707918629..fdf0ebb8c2c8 100644
> >> > --- a/net/core/filter.c
> >> > +++ b/net/core/filter.c
> >> > @@ -2341,6 +2341,24 @@ static const struct bpf_func_proto bpf_msg_pull_data_proto = {
> >> >         .arg4_type      = ARG_ANYTHING,
> >> >  };
> >> >
> >> > +BPF_CALL_4(bpf_udp_flow_src_port, struct sk_buff *, skb, int, min,
> >> > +          int, max, int, use_eth)
> >> > +{
> >> > +       struct net *net = dev_net(skb->dev);
> >> > +
> >> > +       return udp_flow_src_port(net, skb, min, max, use_eth);
> >> > +}
> >> > +
> >> [...]

^ permalink raw reply

* [PATCH v2 04/34] x86/kvm: convert put_page() to put_user_page*()
From: john.hubbard @ 2019-08-04 22:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Dan Williams, Dave Chinner, Dave Hansen,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
	LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
	kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, John Hubbard, Joerg Roedel, Paolo Bonzini,
	Radim Krčmář, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin
In-Reply-To: <20190804224915.28669-1-jhubbard@nvidia.com>

From: John Hubbard <jhubbard@nvidia.com>

For pages that were retained via get_user_pages*(), release those pages
via the new put_user_page*() routines, instead of via put_page().

This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").

Cc: Joerg Roedel <joro@8bytes.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 arch/x86/kvm/svm.c  | 4 ++--
 virt/kvm/kvm_main.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7eafc6907861..ff93c923ed36 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1827,7 +1827,7 @@ static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
 
 err:
 	if (npinned > 0)
-		release_pages(pages, npinned);
+		put_user_pages(pages, npinned);
 
 	kvfree(pages);
 	return NULL;
@@ -1838,7 +1838,7 @@ static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
 {
 	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
 
-	release_pages(pages, npages);
+	put_user_pages(pages, npages);
 	kvfree(pages);
 	sev->pages_locked -= npages;
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 887f3b0c2b60..4b6a596ea8e9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1499,7 +1499,7 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 
 		if (__get_user_pages_fast(addr, 1, 1, &wpage) == 1) {
 			*writable = true;
-			put_page(page);
+			put_user_page(page);
 			page = wpage;
 		}
 	}
@@ -1831,7 +1831,7 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
-		put_page(pfn_to_page(pfn));
+		put_user_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
 
-- 
2.22.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox