Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH RESEND net] vhost: fix vhost_vq_access_ok() log check
From: Stefan Hajnoczi @ 2018-04-10  1:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, netdev, Linux Virtualization, kvm, syzkaller-bugs,
	linux-kernel, Stefan Hajnoczi
In-Reply-To: <20180409174731-mutt-send-email-mst@kernel.org>

On Tue, Apr 10, 2018 at 3:40 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> From: Stefan Hajnoczi <stefanha@redhat.com>
>
> Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> when IOTLB is enabled") introduced a regression.  The logic was
> originally:
>
>   if (vq->iotlb)
>       return 1;
>   return A && B;
>
> After the patch the short-circuit logic for A was inverted:
>
>   if (A || vq->iotlb)
>       return A;
>   return B;
>
> The correct logic is:
>
>   if (!A || vq->iotlb)
>       return A;
>   return B;
>
> Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
> Cc: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>
> ---
>  drivers/vhost/vhost.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

NACK

I will send a v2 with cleaner logic as suggested by Linus.

Stefan

^ permalink raw reply

* Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Alexei Starovoitov @ 2018-04-10  0:51 UTC (permalink / raw)
  To: Yonghong Song, daniel, netdev; +Cc: kernel-team
In-Reply-To: <a2e305b4-760f-6e19-6043-81d1a1f85e48@fb.com>

On 4/9/18 11:41 AM, Yonghong Song wrote:
>
>
> On 4/9/18 9:47 AM, Alexei Starovoitov wrote:
>> On 4/9/18 9:18 AM, Yonghong Song wrote:
>>> syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
>> ...
>>> @@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct
>>> perf_event *event, void __user *info)
>>>          return -EINVAL;
>>>      if (copy_from_user(&query, uquery, sizeof(query)))
>>>          return -EFAULT;
>>> -    if (query.ids_len > BPF_TRACE_MAX_PROGS)
>>> +
>>> +    ids_len = query.ids_len;
>>> +    if (ids_len > BPF_TRACE_MAX_PROGS)
>>>          return -E2BIG;
>>> +    ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
>>> +    if (!ids)
>>> +        return -ENOMEM;
>>>
>>>      mutex_lock(&bpf_event_mutex);
>>>      ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
>>> -                       uquery->ids,
>>> -                       query.ids_len,
>>> -                       &uquery->prog_cnt);
>>> +                       ids,
>>> +                       ids_len,
>>> +                       &prog_cnt);
>>>      mutex_unlock(&bpf_event_mutex);
>>>
>>> +    if (!ret || ret == -ENOSPC) {
>>> +        if (copy_to_user(&uquery->prog_cnt, &prog_cnt,
>>> sizeof(prog_cnt)) ||
>>> +            copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
>>> +            ret = -EFAULT;
>>> +            goto out;
>>> +        }
>>> +    }
>>> +
>>> +out:
>>> +    kfree(ids);
>>
>> alloc/free just to avoid this locking dependency feels suboptimal.
>
> We actually already did kcalloc/kfree in bpf_prog_array_copy_to_user.
> In that function, we did not copy_to_user one id at a time.
> We allocate a temporary array and store the result there
> and at the end, we call one copy_to_user to copy to the user buffer.
>
> The patch here just moved this allocation and associated copy_to_user
> out of the function and bpf_event_mutex. It did not introduce new
> allocations.

I see, so the patch essentially open coding
bpf_prog_array_copy_to_user()
can we share the code then?
bpf/core.c callsite used by trace/bpf_trace.c
and similar callsite in bpf/cgroup.c
should be using common helper.


>>
>> may be we can get rid of bpf_event_mutex in some cases?
>> the perf event itself is locked via perf_event_ctx_lock() when we're
>> calling perf_event_query_prog_array, perf_event_attach|detach_bpf_prog.
>> I forgot what was the motivation for us to introduce it in the
>> first place.
>
> The original motivation for the lock to make sure bpf_prog_array
> does not change during middle of attach/detach/query. it looks like
> we have:
>    . perf_event_attach|query under perf_event_ctx_lock
>    . perf_event_detach not under perf_event_ctx_lock
> Introducing perf_event_ctx_lock to perf_event_detach could still
> have the deadlock.

ahh, right, since the progs are in even->tp_event which can
be shared by multiple perf_events.
Scratch that idea.

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-09 23:31 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Andrew Lunn, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Michael S. Tsirkin, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Samudrala, Sridhar, Netdev,
	virtualization, virtio-dev
In-Reply-To: <20180409160348.12f8a6d9@xeon-e3>

On Mon, Apr 9, 2018 at 4:03 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Mon, 9 Apr 2018 15:30:42 -0700
> Siwei Liu <loseweigh@gmail.com> wrote:
>
>> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> >> No, implementation wise I'd avoid changing the class on the fly. What
>> >> I'm looking to is a means to add a secondary class or class aliasing
>> >> mechanism for netdevs that allows mapping for a kernel device
>> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> >> creating symlinks between these two namespaces as an analogy. All
>> >> userspace visible netdevs today will have both a kernel name and a
>> >> userspace visible name, having one (/class/net) referecing the other
>> >> (/class/net-kernel) in its own namespace. The newly introduced
>> >> IFF_AUTO_MANAGED device will have a kernel name only
>> >> (/class/net-kernel). As a result, the existing applications using
>> >> /class/net don't break, while we're adding the kernel namespace that
>> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> >> at all.
>> >
>> > My gut feeling is this whole scheme will not fly. You really should be
>> > talking to GregKH.
>>
>> Will do. Before spreading it out loudly I'd run it within netdev to
>> clarify the need for why not exposing the lower netdevs is critical
>> for cloud service providers in the face of introducing a new feature,
>> and we are not hiding anything but exposing it in a way that don't
>> break existing userspace applications while introducing feature is
>> possible with the limitation of keeping old userspace still.
>>
>> >
>> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
>> > A device can start out as a normal device, and will change to being
>> > automatic later, when the user on top of it probes.
>>
>> Sure. In whatever form it's still a netdev, and changing the namespace
>> should be more dynamic than changing the class.
>>
>> Thanks a lot,
>> -Siwei
>>
>> >
>> >         Andrew
>
> Also, remember for netdev's /sys is really a third class API.
> The primary API's are netlink and ioctl. Also why not use existing
> network namespaces rather than inventing a new abstraction?

Because we want to leave old userspace unmodified while making SR-IOV
live migration transparent to users. Specifically, we'd want old udevd
to skip through uevents for the lower netdevs, while also making new
udevd able to name the bypass_master interface by referencing the pci
slot information which is only present in the sysfs entry for
IFF_AUTO_MANAGED net device.

The problem of using network namespace is that, no sysfs entry will be
around for IFF_AUTO_MANAGED netdev if we isolate it out to a separate
netns, unless we generalize the scope for what netns is designed for
(isolation I mean). For auto-managed netdevs we don't neccessarily
wants strict isolation, but rather a way of sticking to 1-netdev model
for strict backward compatibility requiement of the old userspace,
while exposing the information in a way new userspace understands.

Thanks,
-Siwei

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-09 23:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180409154627.GA15157@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

> On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> 
>> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >> >>>
>> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >> >>> Over time the set of uevents that get sent into all network namespaces has
>> >> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >>>
>> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >> >>> does not match the namespace tag of the netlink socket. One example are
>> >> >>> network devices. Uevents for network devices only show up in the network
>> >> >>> namespaces these devices are moved to or created in.
>> >> >>>
>> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >> >>> into all network namespaces.
>> >> >>>
>> >> >>> The original patchset was written in 2010 before user namespaces were a
>> >> >>> thing. With the introduction of user namespaces sending out uevents became
>> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >>>
>> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >>>
>> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >>>                 return;
>> >> >>>
>> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
>> >> >>>                 return;
>> >> >>>
>> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >>>                              CAP_NET_BROADCAST))
>> >> >>>         j       return;
>> >> >>> }
>> >> >>>
>> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >> >>> namespace of interest. This check is fine in general but seems insufficient
>> >> >>> to me when paired with uevents. The reason is that devices always belong to
>> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >>> namespace tag should never be sent into another user namespace. This has
>> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >>> identity mapping is established between root on the host and root in the
>> >> >>> new user namespace. Here's a reproducer:
>> >> >>>
>> >> >>>  sudo unshare -U --map-root
>> >> >>>  udevadm monitor -k
>> >> >>>  # Now change to initial user namespace and e.g. do
>> >> >>>  modprobe kvm
>> >> >>>  # or
>> >> >>>  rmmod kvm
>> >> >>>
>> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >>> namespaces do not see any uevents and also can't really do anything useful
>> >> >>> with them.
>> >> >>>
>> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >>> namespace of the network namespace of the netlink socket) userspace process
>> >> >>> make a decision what uevents should be sent.
>> >> >>>
>> >> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >> >>>   event will always be filtered.
>> >> >>> - If the network namespace the uevent socket belongs to was created in the
>> >> >>>   initial user namespace but was opened from a non-initial user namespace
>> >> >>>   the event will be filtered as well.
>> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >> >>> always only sent to the initial user namespace. The regression potential
>> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >>> anything with interesting devices.
>> >> >>>
>> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >> >>> ---
>> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
>> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >>>
>> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >> >>> --- a/lib/kobject_uevent.c
>> >> >>> +++ b/lib/kobject_uevent.c
>> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >>>  		return sock_ns != ns;
>> >> >>>  	}
>> >> >>>  
>> >> >>> -	return 0;
>> >> >>> +	/*
>> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
>> >> >>> +	 * namespace below.
>> >> >>> +	 */
>> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>> >> >>> +		return 1;
>> >> >>> +
>> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
>> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
>> >> >>>  }
>> >> >>>  #endif
>> >> >>
>> >> >> So, this prohibits to listen events of all devices except network-related
>> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> >> > 
>> >> > No, this is not correct: As it is right now *without my patch* no
>> >> > non-initial user namespace is receiving *any uevents* but those
>> >> > specifically namespaced such as those for network devices. This patch
>> >> > doesn't change that at all. The commit message outlines this in detail
>> >> > how this comes about.
>> >> > There is only one case where this currently breaks and this is as I
>> >> > outlined explicitly in my commit message when you create a new user
>> >> > namespace and map container(0) -> host(0). This patch fixes this.
>> >> 
>> >> Could you please point the place, where non-initial user namespaces are filtered?
>> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> >> Now it will return 1 sometimes.
>> >
>> > Oh sure, it's in the commit message though. The callchain is
>> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
>> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
>> > net/netlink/af_netlink.c:do_one_broadcast():
>> >
>> > This codepiece will check whether the openened socket holds
>> > CAP_NET_BROADCAST in the user namespace of the target network namespace
>> > which it won't because we don't have device namespaces and all devices
>> > belong to the initial set of namespaces.
>> >
>> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >                              CAP_NET_BROADCAST))
>> >         j       return;
>> >
>> 
>> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
>> on their socket and has had someone with the appropriate privileges
>> assign a peerrnetid.
>> 
>> All of which is to say that file_ns_capable is not nearly as applicable
>> as it might be, and if you can pass the other two checks I think it is
>> pointless (because the peernet attributes are not generated for
>> kobj_uevents) but valid to receive events from outside your network
>> namespace.
>> 
>> 
>> I might be missing something but I don't see anything excluding network
>> namespaces owned by !init_user_ns excluded from the kobject_uevent
>> logic.
>> 
>> The uevent_sock_list has one entry per network namespace. And
>> kobject_uevent_net_broacast appears to walk each one.
>> 
>> I had a memory of filtering uevent messages and I had a memory
>> that the netlink_has_listeners had a per network namespace effect.
>> Neither seems true from my inspection of the code tonight.
>> 
>> If we are not filtering ordinary uevents at least at the user namespace
>> level that does seem to be at least a nuisance.
>> 
>> 
>> Christian can you dig a little deeper into this.  I have a feeling that
>> there are some real efficiency improvements that we could make to
>> kobject_uevent_net_broadcast if nothing else.
>> 
>> Perhaps you could see where uevents are broadcast by poking
>> the sysfs uevent of an existing device, and triggering a retransmit.
>
> @Eric, so I did some intensive testing over the weekend and forget everything I
> said before. Uevents are not filtered by the kernel at the moment. This is
> currently - apart from network devices - a pure userspace thing. Specifically,
> anyone  on the host can listen to all uevents from everywhere. It's neither
> filtered by user nor by network namespace. The fact that I used
>
> udevadm --debug monitor
>
> to test my prior hypothesis was what led me to believe that uevents are already
> correctly filtered.
> Instead, what is actually happening is that every udev implementation out there
> is discarding uevents that were send by uids != 0 in the CMSG_DATA.
> Specifically,

Yes.  I remember something of the sort.  I believe udev also filters to
ensure that the netlink port id == 0 to ensure the message came from
the kernel and was not spoofed in any way.

> - legacy standalone udevd:
>   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
> - eudevd
>   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
> - systemd-udevd
>   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
> - ueventd (Android)
>   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
>
> For all of those I could trace this behavior back to the first released
> version. (To be precise, for legacy udevd that eventually became systemd-udevd
> I could trace it back to the first version that is still available on
> git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
> trivially true that it has the same behavior from the beginning.)
> Android filters uevents in the same way but removed that check on January 8
> 2018 for what I think is invalid reasoning. The good news is that this is only
> in their master branch. It hasn't even made it into an rc release for Android 8
> yet. I filed a bug against Android and offered them a fix if they agree.
>
> In any case, userspace udev is not making use of uevents at all right now since
> any uid != 0 events are **explicitly** discarded.
> The fact that you receive uevents for
>
> sudo unshare -U --map-root -n
> udevadm --debug monitor
>
> is simply explained by the fact that container(0) <=> host(0) at which point
> the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
> discard it.
> The use case for receiving uevents in containers/user namespaces is definitely
> there but that's what the uevent injection patch series was for that we merged.
> This is a much safer and saner solution.
> The fact that all udev implementations filter uevents by uid != 0 very much
> seems like a security mechanism in userspace that we probably should provide by
> isolating uevents based on user and/or network namespaces.

So in summary.  Uevents are filtered in a user namespace (by userspace)
because the received uid != 0.  It instead == 65534 == "nobody" because
the global root uid is not mapped.

Which means that we can modify the kernel to not send to all network
namespaces whose user_ns != &init_user_ns because we know that userspace
will ignore the message because of the uid anyway.  Which means when
net-next reopens you can send that patch.  But please base it on just
not including network namespaces in the list, as that is much more
efficient than adding more conditions to the filter.

Thank you for tracking down what is going on.

Eric

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Stephen Hemminger @ 2018-04-09 23:03 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Andrew Lunn, David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Michael S. Tsirkin, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Samudrala, Sridhar, Netdev,
	virtualization, virtio-dev
In-Reply-To: <CADGSJ23RnioMdi0U0yeOBKuM-9QUZ3S33as693xzs0VWQbuVcA@mail.gmail.com>

On Mon, 9 Apr 2018 15:30:42 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> >> No, implementation wise I'd avoid changing the class on the fly. What
> >> I'm looking to is a means to add a secondary class or class aliasing
> >> mechanism for netdevs that allows mapping for a kernel device
> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> >> creating symlinks between these two namespaces as an analogy. All
> >> userspace visible netdevs today will have both a kernel name and a
> >> userspace visible name, having one (/class/net) referecing the other
> >> (/class/net-kernel) in its own namespace. The newly introduced
> >> IFF_AUTO_MANAGED device will have a kernel name only
> >> (/class/net-kernel). As a result, the existing applications using
> >> /class/net don't break, while we're adding the kernel namespace that
> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> >> at all.  
> >
> > My gut feeling is this whole scheme will not fly. You really should be
> > talking to GregKH.  
> 
> Will do. Before spreading it out loudly I'd run it within netdev to
> clarify the need for why not exposing the lower netdevs is critical
> for cloud service providers in the face of introducing a new feature,
> and we are not hiding anything but exposing it in a way that don't
> break existing userspace applications while introducing feature is
> possible with the limitation of keeping old userspace still.
> 
> >
> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> > A device can start out as a normal device, and will change to being
> > automatic later, when the user on top of it probes.  
> 
> Sure. In whatever form it's still a netdev, and changing the namespace
> should be more dynamic than changing the class.
> 
> Thanks a lot,
> -Siwei
> 
> >
> >         Andrew  

Also, remember for netdev's /sys is really a third class API.
The primary API's are netlink and ioctl. Also why not use existing
network namespaces rather than inventing a new abstraction?

^ permalink raw reply

* Re: [PATCH] iscsi: respond to netlink with unicast when appropriate
From: Lee Duncan @ 2018-04-09 23:03 UTC (permalink / raw)
  To: Chris Leech, linux-scsi, netdev, open-iscsi
In-Reply-To: <20180409221528.67880-1-cleech@redhat.com>

On 04/09/2018 03:15 PM, Chris Leech wrote:
> Instead of always multicasting responses, send a unicast netlink message
> directed at the correct pid.  This will be needed if we ever want to
> support multiple userspace processes interacting with the kernel over
> iSCSI netlink simultaneously.  Limitations can currently be seen if you
> attempt to run multiple iscsistart commands in parallel.
> 
> We've fixed up the userspace issues in iscsistart that prevented
> multiple instances from running, so now attempts to speed up booting by
> bringing up multiple iscsi sessions at once in the initramfs are just
> running into misrouted responses that this fixes.

As you may know, I disagree with running multiple iscsistart-s at the
same time, since that's what iscsid is for.

Never the less, I believe we _should_ be able to have multiple processes
talking to the kernel target code, so I agree with these changes.

> 
> Signed-off-by: Chris Leech <cleech@redhat.com>
> ---
>  drivers/scsi/scsi_transport_iscsi.c | 29 ++++++++++++++++++-----------
>  1 file changed, 18 insertions(+), 11 deletions(-)
> 
> ... (diffs removed to save electrons)
> 

Reviewed-by: Lee Duncan <lduncan@suse.com>
-- 
Lee Duncan
SUSE Labs

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-09 22:30 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Netdev,
	David Ahern, si-wei liu, David Miller
In-Reply-To: <20180409221523.GH562@lunn.ch>

On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>> No, implementation wise I'd avoid changing the class on the fly. What
>> I'm looking to is a means to add a secondary class or class aliasing
>> mechanism for netdevs that allows mapping for a kernel device
>> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> creating symlinks between these two namespaces as an analogy. All
>> userspace visible netdevs today will have both a kernel name and a
>> userspace visible name, having one (/class/net) referecing the other
>> (/class/net-kernel) in its own namespace. The newly introduced
>> IFF_AUTO_MANAGED device will have a kernel name only
>> (/class/net-kernel). As a result, the existing applications using
>> /class/net don't break, while we're adding the kernel namespace that
>> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> at all.
>
> My gut feeling is this whole scheme will not fly. You really should be
> talking to GregKH.

Will do. Before spreading it out loudly I'd run it within netdev to
clarify the need for why not exposing the lower netdevs is critical
for cloud service providers in the face of introducing a new feature,
and we are not hiding anything but exposing it in a way that don't
break existing userspace applications while introducing feature is
possible with the limitation of keeping old userspace still.

>
> Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> A device can start out as a normal device, and will change to being
> automatic later, when the user on top of it probes.

Sure. In whatever form it's still a netdev, and changing the namespace
should be more dynamic than changing the class.

Thanks a lot,
-Siwei

>
>         Andrew

^ permalink raw reply

* [PATCH] iscsi: respond to netlink with unicast when appropriate
From: Chris Leech @ 2018-04-09 22:15 UTC (permalink / raw)
  To: linux-scsi, netdev, Lee Duncan, open-iscsi; +Cc: Chris Leech

Instead of always multicasting responses, send a unicast netlink message
directed at the correct pid.  This will be needed if we ever want to
support multiple userspace processes interacting with the kernel over
iSCSI netlink simultaneously.  Limitations can currently be seen if you
attempt to run multiple iscsistart commands in parallel.

We've fixed up the userspace issues in iscsistart that prevented
multiple instances from running, so now attempts to speed up booting by
bringing up multiple iscsi sessions at once in the initramfs are just
running into misrouted responses that this fixes.

Signed-off-by: Chris Leech <cleech@redhat.com>
---
 drivers/scsi/scsi_transport_iscsi.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index f4b52b44b966..65f6c94f2e9b 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -2322,6 +2322,12 @@ iscsi_multicast_skb(struct sk_buff *skb, uint32_t group, gfp_t gfp)
 	return nlmsg_multicast(nls, skb, 0, group, gfp);
 }
 
+static int
+iscsi_unicast_skb(struct sk_buff *skb, u32 portid)
+{
+	return nlmsg_unicast(nls, skb, portid);
+}
+
 int iscsi_recv_pdu(struct iscsi_cls_conn *conn, struct iscsi_hdr *hdr,
 		   char *data, uint32_t data_size)
 {
@@ -2524,14 +2530,11 @@ void iscsi_ping_comp_event(uint32_t host_no, struct iscsi_transport *transport,
 EXPORT_SYMBOL_GPL(iscsi_ping_comp_event);
 
 static int
-iscsi_if_send_reply(uint32_t group, int seq, int type, int done, int multi,
-		    void *payload, int size)
+iscsi_if_send_reply(u32 portid, int type, void *payload, int size)
 {
 	struct sk_buff	*skb;
 	struct nlmsghdr	*nlh;
 	int len = nlmsg_total_size(size);
-	int flags = multi ? NLM_F_MULTI : 0;
-	int t = done ? NLMSG_DONE : type;
 
 	skb = alloc_skb(len, GFP_ATOMIC);
 	if (!skb) {
@@ -2539,10 +2542,9 @@ iscsi_if_send_reply(uint32_t group, int seq, int type, int done, int multi,
 		return -ENOMEM;
 	}
 
-	nlh = __nlmsg_put(skb, 0, 0, t, (len - sizeof(*nlh)), 0);
-	nlh->nlmsg_flags = flags;
+	nlh = __nlmsg_put(skb, 0, 0, type, (len - sizeof(*nlh)), 0);
 	memcpy(nlmsg_data(nlh), payload, size);
-	return iscsi_multicast_skb(skb, group, GFP_ATOMIC);
+	return iscsi_unicast_skb(skb, portid);
 }
 
 static int
@@ -3470,6 +3472,7 @@ static int
 iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
 {
 	int err = 0;
+	u32 portid;
 	struct iscsi_uevent *ev = nlmsg_data(nlh);
 	struct iscsi_transport *transport = NULL;
 	struct iscsi_internal *priv;
@@ -3490,10 +3493,12 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
 	if (!try_module_get(transport->owner))
 		return -EINVAL;
 
+	portid = NETLINK_CB(skb).portid;
+
 	switch (nlh->nlmsg_type) {
 	case ISCSI_UEVENT_CREATE_SESSION:
 		err = iscsi_if_create_session(priv, ep, ev,
-					      NETLINK_CB(skb).portid,
+					      portid,
 					      ev->u.c_session.initial_cmdsn,
 					      ev->u.c_session.cmds_max,
 					      ev->u.c_session.queue_depth);
@@ -3506,7 +3511,7 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
 		}
 
 		err = iscsi_if_create_session(priv, ep, ev,
-					NETLINK_CB(skb).portid,
+					portid,
 					ev->u.c_bound_session.initial_cmdsn,
 					ev->u.c_bound_session.cmds_max,
 					ev->u.c_bound_session.queue_depth);
@@ -3664,6 +3669,8 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
 static void
 iscsi_if_rx(struct sk_buff *skb)
 {
+	u32 portid = NETLINK_CB(skb).portid;
+
 	mutex_lock(&rx_queue_mutex);
 	while (skb->len >= NLMSG_HDRLEN) {
 		int err;
@@ -3699,8 +3706,8 @@ iscsi_if_rx(struct sk_buff *skb)
 				break;
 			if (ev->type == ISCSI_UEVENT_GET_CHAP && !err)
 				break;
-			err = iscsi_if_send_reply(group, nlh->nlmsg_seq,
-				nlh->nlmsg_type, 0, 0, ev, sizeof(*ev));
+			err = iscsi_if_send_reply(portid, nlh->nlmsg_type,
+						  ev, sizeof(*ev));
 		} while (err < 0 && err != -ECONNREFUSED && err != -ESRCH);
 		skb_pull(skb, rlen);
 	}
-- 
2.14.3

^ permalink raw reply related

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Andrew Lunn @ 2018-04-09 22:15 UTC (permalink / raw)
  To: Siwei Liu
  Cc: David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Michael S. Tsirkin, Stephen Hemminger, Alexander Duyck,
	Brandeburg, Jesse, Jakub Kicinski, Jason Wang, Samudrala, Sridhar,
	Netdev, virtualization, virtio-dev
In-Reply-To: <CADGSJ211kA5ktADbwv2f4ma_ooBmYW4r7YwzGD2G_Gw5+wv5rw@mail.gmail.com>

> No, implementation wise I'd avoid changing the class on the fly. What
> I'm looking to is a means to add a secondary class or class aliasing
> mechanism for netdevs that allows mapping for a kernel device
> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> creating symlinks between these two namespaces as an analogy. All
> userspace visible netdevs today will have both a kernel name and a
> userspace visible name, having one (/class/net) referecing the other
> (/class/net-kernel) in its own namespace. The newly introduced
> IFF_AUTO_MANAGED device will have a kernel name only
> (/class/net-kernel). As a result, the existing applications using
> /class/net don't break, while we're adding the kernel namespace that
> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> at all.

My gut feeling is this whole scheme will not fly. You really should be
talking to GregKH.

Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
A device can start out as a normal device, and will change to being
automatic later, when the user on top of it probes.

	Andrew

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-09 22:07 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: David Miller, David Ahern, Jiri Pirko, si-wei liu,
	Michael S. Tsirkin, Stephen Hemminger, Alexander Duyck,
	Brandeburg, Jesse, Jakub Kicinski, Jason Wang, Samudrala, Sridhar,
	Netdev, virtualization, virtio-dev
In-Reply-To: <20180407031919.GB11029@lunn.ch>

On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunn <andrew@lunn.ch> wrote:
> Hi Siwei
>
>> I think everyone seems to agree not to fiddle with the ":" prefix, but
>> rather have a new class of network subsystem under /sys/class thus a
>> separate device namespace e.g. /sys/class/net-kernel for those
>> auto-managed lower netdevs is needed.
>
> How do you get a device into this new class? I don't know the Linux
> driver model too well, but to get a device out of one class and into
> another, i think you need to device_del(dev). modify dev->class and
> then device_add(dev). However, device_add() says you are not allowed
> to do this.

No, implementation wise I'd avoid changing the class on the fly. What
I'm looking to is a means to add a secondary class or class aliasing
mechanism for netdevs that allows mapping for a kernel device
namespace (/class/net-kernel) to userspace (/class/net). Imagine
creating symlinks between these two namespaces as an analogy. All
userspace visible netdevs today will have both a kernel name and a
userspace visible name, having one (/class/net) referecing the other
(/class/net-kernel) in its own namespace. The newly introduced
IFF_AUTO_MANAGED device will have a kernel name only
(/class/net-kernel). As a result, the existing applications using
/class/net don't break, while we're adding the kernel namespace that
allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
at all.

As this requires changing the internals of driver model core it's a
rather big hammer approach I'd think. If there exists a better
implementation than this to allow adding a separate layer of in-kernel
device namespace, I'd more than welcome to hear about.

>
> And i don't even see how this helps. Are you also not going to call
> list_netdevice()? Are you going to add some other list for these
> devices in a different class?

list_netdevice() is still called. I think with the current RFC patch,
I've added two lists for netdevs under the kernel namespace:
dev_cmpl_list and name_cmpl_hlist. As a result of that, all userspace
netdevs get registered will be added to two types of lists: the
userspace list for e.g. dev_list, and also the kernelspace list e.g.
dev_cmpl_list (I can rename it to something more accurate). The
IFF_AUTO_MANAGED device will be only added to kernelspace list e.g.
dev_cmpl_list.

Hope all your questions are answered.

Thanks,
-Siwei


>
>    Andrew

^ permalink raw reply

* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
From: William Tu @ 2018-04-09 21:51 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	Willem de Bruijn, Daniel Borkmann,
	Linux Kernel Network Developers, Björn Töpel,
	michael.lundkvist, Brandeburg, Jesse, Anjali Singhai Jain,
	qi.z.zhang, ravineet.singh
In-Reply-To: <20180327165919.17933-1-bjorn.topel@gmail.com>

On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work is
> already on the mailing list for review. We will publish our zero-copy
> support for RX and TX on top of his patch sets at a later point in
> time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists of
> a number of equally size frames and each frame has a unique frame
> id. A descriptor in one of the queues references a frame by
> referencing its frame id. The user space allocates memory for this
> UMEM using whatever means it feels is most appropriate (malloc, mmap,
> huge pages, etc). This memory area is then registered with the kernel
> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
> the FILL queue and the COMPLETION queue. The fill queue is used by the
> application to send down frame ids for the kernel to fill in with RX
> packet data. References to these frames will then appear in the RX
> queue of the XSK once they have been received. The completion queue,
> on the other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>
Can we register a UMEM to multiple device's queue?

So far the l2fwd sample code is sending/receiving from the same
queue. I'm thinking about forwarding packets from one device to another.
Now I'm copying packets from one device's RX desc to another device's TX
completion queue. But this introduces one extra copy.

One way I can do is to call bpf_redirect helper function, but sometimes
I still need to process the packet in userspace.

I like this work!
Thanks a lot.
William

^ permalink raw reply

* Re: [net-next PATCH v3 01/11] soc: ti: K2G: enhancement to support QMSS in K2G NAVSS
From: Rob Herring @ 2018-04-09 21:12 UTC (permalink / raw)
  To: Murali Karicheri
  Cc: mark.rutland, ssantosh, malat, w-kwok2, devicetree, linux-kernel,
	linux-arm-kernel, davem, netdev
In-Reply-To: <1522679881-25643-2-git-send-email-m-karicheri2@ti.com>

On Mon, Apr 02, 2018 at 10:37:51AM -0400, Murali Karicheri wrote:
> Navigator Subsystem (NAVSS) available on K2G SoC has a cut down
> version of QMSS with less number of queues, internal linking ram
> with lesser number of buffers etc.  It doesn't have status and
> explicit push register space as in QMSS available on other K2 SoCs.
> So define reg indices specific to QMSS on K2G. This patch introduces
> "ti,66ak2g-navss-qm" compatibility to identify QMSS on K2G NAVSS
> and to customize the dts handling code. Per Device manual,
> descriptors with index less than or equal to regions0_size is in region 0
> in the case of K2 QMSS where as for QMSS on K2G, descriptors with index
> less than regions0_size is in region 0. So update the size accordingly in
> the regions0_size bits of the linking ram size 0 register.
> 
> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
> Signed-off-by: WingMan Kwok <w-kwok2@ti.com>
> ---
>  .../bindings/soc/ti/keystone-navigator-qmss.txt    |  9 ++-

Reviewed-by: Rob Herring <robh@kernel.org>

>  drivers/soc/ti/knav_qmss.h                         |  6 ++
>  drivers/soc/ti/knav_qmss_queue.c                   | 90 ++++++++++++++++------
>  3 files changed, 82 insertions(+), 23 deletions(-)

^ permalink raw reply

* [PATCH] mISDN: Remove VLAs
From: Laura Abbott @ 2018-04-09 21:07 UTC (permalink / raw)
  To: Karsten Keil, David S. Miller
  Cc: Laura Abbott, Kees Cook, netdev, linux-kernel, kernel-hardening


There's an ongoing effort to remove VLAs[1] from the kernel to eventually
turn on -Wvla. Remove the VLAs from the mISDN code by switching to using
kstrdup in one place and just using the upper bound in another.

[1] https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Laura Abbott <labbott@redhat.com>
---
 drivers/isdn/mISDN/dsp_hwec.c   | 8 +++++---
 drivers/isdn/mISDN/l1oip_core.c | 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/isdn/mISDN/dsp_hwec.c b/drivers/isdn/mISDN/dsp_hwec.c
index a6e87076acc2..5336bbdbfdc5 100644
--- a/drivers/isdn/mISDN/dsp_hwec.c
+++ b/drivers/isdn/mISDN/dsp_hwec.c
@@ -68,12 +68,12 @@ void dsp_hwec_enable(struct dsp *dsp, const char *arg)
 		goto _do;
 
 	{
-		char _dup[len + 1];
 		char *dup, *tok, *name, *val;
 		int tmp;
 
-		strcpy(_dup, arg);
-		dup = _dup;
+		dup = kstrdup(arg, GFP_ATOMIC);
+		if (!dup)
+			return;
 
 		while ((tok = strsep(&dup, ","))) {
 			if (!strlen(tok))
@@ -89,6 +89,8 @@ void dsp_hwec_enable(struct dsp *dsp, const char *arg)
 					deftaps = tmp;
 			}
 		}
+
+		kfree(dup);
 	}
 
 _do:
diff --git a/drivers/isdn/mISDN/l1oip_core.c b/drivers/isdn/mISDN/l1oip_core.c
index 21d50e4cc5e1..df594245deca 100644
--- a/drivers/isdn/mISDN/l1oip_core.c
+++ b/drivers/isdn/mISDN/l1oip_core.c
@@ -279,7 +279,7 @@ l1oip_socket_send(struct l1oip *hc, u8 localcodec, u8 channel, u32 chanmask,
 		  u16 timebase, u8 *buf, int len)
 {
 	u8 *p;
-	u8 frame[len + 32];
+	u8 frame[L1OIP_MAX_PERFRAME + 32];
 	struct socket *socket = NULL;
 
 	if (debug & DEBUG_L1OIP_MSG)
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH AUTOSEL for 4.9 160/293] MIPS: Give __secure_computing() access to syscall arguments.
From: James Hogan @ 2018-04-09 20:26 UTC (permalink / raw)
  To: Sasha Levin
  Cc: stable@vger.kernel.org, linux-kernel@vger.kernel.org, David Daney,
	Alexei Starovoitov, Daniel Borkmann, Matt Redfearn,
	netdev@vger.kernel.org, linux-mips@linux-mips.org, Ralf Baechle
In-Reply-To: <20180409002239.163177-160-alexander.levin@microsoft.com>

[-- Attachment #1: Type: text/plain, Size: 581 bytes --]

On Mon, Apr 09, 2018 at 12:24:58AM +0000, Sasha Levin wrote:
> From: David Daney <david.daney@cavium.com>
> 
> [ Upstream commit 669c4092225f0ed5df12ebee654581b558a5e3ed ]
> 
> KProbes of __seccomp_filter() are not very useful without access to
> the syscall arguments.
> 
> Do what x86 does, and populate a struct seccomp_data to be passed to
> __secure_computing().  This allows samples/bpf/tracex5 to extract a
> sensible trace.

This broke o32 indirect syscalls, and was fixed by commit 3d729deaf287
("MIPS: seccomp: Fix indirect syscall args").

Cheers
James

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Patch "x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()" has been added to the 4.9-stable tree
From: gregkh @ 2018-04-09 20:00 UTC (permalink / raw)
  To: jpoimboe, alexander.levin, andreyknvl, davem, dvyukov, edumazet,
	gregkh, kcc, marcelo.leitner, mingo, netdev, nhorman, peterz,
	syzkaller, tglx, torvalds, vyasevich, xiyou.wangcong
  Cc: stable, stable-commits


This is a note to let you know that I've just added the patch titled

    x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()

to the 4.9-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch
and it can be found in the queue-4.9 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@vger.kernel.org> know about it.


>From foo@baz Mon Apr  9 17:09:24 CEST 2018
From: Josh Poimboeuf <jpoimboe@redhat.com>
Date: Thu, 4 May 2017 09:51:40 -0500
Subject: x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()

From: Josh Poimboeuf <jpoimboe@redhat.com>


[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]

Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:

  WARNING: kernel stack regs at ffff8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0

The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().

That function saves RBP on the stack and then overwrites it, using it as
a scratch register.  That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.

Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: netdev <netdev@vger.kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 arch/x86/lib/csum-copy_64.S |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
 	movq  %r12, 3*8(%rsp)
 	movq  %r14, 4*8(%rsp)
 	movq  %r13, 5*8(%rsp)
-	movq  %rbp, 6*8(%rsp)
+	movq  %r15, 6*8(%rsp)
 
 	movq  %r8, (%rsp)
 	movq  %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
 	/* main loop. clear in 64 byte blocks */
 	/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
 	/* r11:	temp3, rdx: temp4, r12 loopcnt */
-	/* r10:	temp5, rbp: temp6, r14 temp7, r13 temp8 */
+	/* r10:	temp5, r15: temp6, r14 temp7, r13 temp8 */
 	.p2align 4
 .Lloop:
 	source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
 	source
 	movq  32(%rdi), %r10
 	source
-	movq  40(%rdi), %rbp
+	movq  40(%rdi), %r15
 	source
 	movq  48(%rdi), %r14
 	source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
 	adcq  %r11, %rax
 	adcq  %rdx, %rax
 	adcq  %r10, %rax
-	adcq  %rbp, %rax
+	adcq  %r15, %rax
 	adcq  %r14, %rax
 	adcq  %r13, %rax
 
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
 	dest
 	movq %r10, 32(%rsi)
 	dest
-	movq %rbp, 40(%rsi)
+	movq %r15, 40(%rsi)
 	dest
 	movq %r14, 48(%rsi)
 	dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
 	movq 3*8(%rsp), %r12
 	movq 4*8(%rsp), %r14
 	movq 5*8(%rsp), %r13
-	movq 6*8(%rsp), %rbp
+	movq 6*8(%rsp), %r15
 	addq $7*8, %rsp
 	ret
 


Patches currently in stable-queue which might be from jpoimboe@redhat.com are

queue-4.9/x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch

^ permalink raw reply

* RFC: ti_hecc: don't repoll but interrupt again
From: Jeroen Hofstee @ 2018-04-09 19:49 UTC (permalink / raw)
  To: netdev

Hello,

I posted below RFC to the linux-can ML, but it might be actually
more appropriate for netdev (since it involves napi). This driver
is lying a bit about the amount of work done. Changing that (always
look at work done instead of the hw state) seems to solve the issue,
but I have no clear understanding why that is (and why it takes down
the ethernet stack as well).

If someone has suggestions / ideas on what to check, please let me
know.

Regards,
Jeroen



The mail send to linux-can:


While updating to Linux 4.9.93, the ti_hecc causes several problems. One
of them is that under high load on the CAN-bus and ethernet, the ethernet
connection completely stalls and won't recover. The patch below seems to
fix that (as in read as much as we can and always re-enable interrupts,
it will poll again thereafter).

It would be appreciated if someone more familiar with this code can
comment on it.

Thanks in advance,

Jeroen

--- a/drivers/net/can/ti_hecc.c
+++ b/drivers/net/can/ti_hecc.c
@@ -625,20 +625,16 @@ static int ti_hecc_rx_poll(struct napi_struct
*napi, int quota)
spin_unlock_irqrestore(&priv->mbx_lock, flags);
                  } else if (priv->rx_next == HECC_MAX_TX_MBOX - 1) {
                          priv->rx_next = HECC_RX_FIRST_MBOX;
-                       break;
                  }
          }

          /* Enable packet interrupt if all pkts are handled */
-       if (hecc_read(priv, HECC_CANRMP) == 0) {
+       if (num_pkts < quota) {
                  napi_complete(napi);
                  /* Re-enable RX mailbox interrupts */
                  mbx_mask = hecc_read(priv, HECC_CANMIM);
                  mbx_mask |= HECC_TX_MBOX_MASK;
                  hecc_write(priv, HECC_CANMIM, mbx_mask);
-       } else {
-               /* repoll is done only if whole budget is used */
-               num_pkts = quota;
          }

^ permalink raw reply

* [PATCH RESEND net] vhost: fix vhost_vq_access_ok() log check
From: Michael S. Tsirkin @ 2018-04-09 19:40 UTC (permalink / raw)
  To: David Miller, netdev, virtualization
  Cc: linux-kernel, jasowang, kvm, mst, syzkaller-bugs, Stefan Hajnoczi
In-Reply-To: <20180409131021.5132-1-stefanha@redhat.com>

From: Stefan Hajnoczi <stefanha@redhat.com>

Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression.  The logic was
originally:

  if (vq->iotlb)
      return 1;
  return A && B;

After the patch the short-circuit logic for A was inverted:

  if (A || vq->iotlb)
      return A;
  return B;

The correct logic is:

  if (!A || vq->iotlb)
      return A;
  return B;

Reported-by: syzbot+65a84dde0214b0387ccd@syzkaller.appspotmail.com
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>

---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5320039671b7..f6af4210679a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1246,7 +1246,7 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
 	int ret = vq_log_access_ok(vq, vq->log_base);
 
-	if (ret || vq->iotlb)
+	if (!ret || vq->iotlb)
 		return ret;
 
 	return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
-- 
2.14.3

^ permalink raw reply related

* Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-04-09 18:47 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <20180409080751.GE19345@nanopsycho>

On 4/9/2018 1:07 AM, Jiri Pirko wrote:
> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudrala@intel.com wrote:
> [...]
>
>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>> +				     struct net_device *child_netdev)
>>>> +{
>>>> +	struct virtnet_bypass_info *vbi;
>>>> +	bool backup;
>>>> +
>>>> +	vbi = netdev_priv(bypass_netdev);
>>>> +	backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>> +			rtnl_dereference(vbi->active_netdev)) {
>>>> +		netdev_info(bypass_netdev,
>>>> +			    "%s attempting to join bypass dev when %s already present\n",
>>>> +			    child_netdev->name, backup ? "backup" : "active");
>>> Bypass module should check if there is already some other netdev
>>> enslaved and refuse right there.
>> This will work for virtio-net with 3 netdev model, but this check has to be done by netvsc
>> as its bypass_netdev is same as the backup_netdev.
>> Will add a flag while registering with the bypass module to indicate if the driver is doing
>> a 2 netdev or 3 netdev model and based on that flag this check can be done in bypass module
>> for 3 netdev scenario.
> Just let me undestand it clearly. What I expect the difference would be
> between 2netdev and3 netdev model is this:
> 2netdev:
>    bypass_master
>       /
>      /
> VF_slave
>
> 3netdev:
>    bypass_master
>       /     \
>      /       \
> VF_slave   backup_slave
>
> Is that correct? If not, how does it look like?
>
>
Looks correct.
VF_slave and backup_slave are the original netdevs and are present in both the models.
In the 3 netdev model, bypass_master netdev is created and VF_slave and backup_slave are
marked as the 2 slaves of this new netdev.

In the 2 netdev model, backup_slave acts as bypass_master and the bypass module doesn't
have access to netdev_priv of the backup_slave.

Once i moved all the ndo_ops of the master netdev to bypass module, i realized that we can
move the create/destroy of the upper netdev also to bypass.c.
That way the changes to virtio_net become very minimal.

With these updates, bypass module now supports both the models by exporting 2 sets of
functions.
3 netdev:
    int bypass_master_create(struct net_device *backup_netdev,
                             struct bypass_master **pbypass_master);
    void bypass_master_destroy(struct bypass_master *bypass_master);


2 netdev:
    int bypass_master_register(struct net_device *backup_netdev, struct bypass_ops *ops,
                               struct bypass_master **pbypass_master);
    void bypass_master_unregister(struct bypass_master *bypass_master);

Will send the next revision in a day or two.

Thanks
Sridhar

^ permalink raw reply

* Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Yonghong Song @ 2018-04-09 18:41 UTC (permalink / raw)
  To: Alexei Starovoitov, daniel, netdev; +Cc: kernel-team
In-Reply-To: <88effd87-1356-38b4-96e5-b90ab869b2a8@fb.com>



On 4/9/18 9:47 AM, Alexei Starovoitov wrote:
> On 4/9/18 9:18 AM, Yonghong Song wrote:
>> syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
> ...
>> @@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct 
>> perf_event *event, void __user *info)
>>          return -EINVAL;
>>      if (copy_from_user(&query, uquery, sizeof(query)))
>>          return -EFAULT;
>> -    if (query.ids_len > BPF_TRACE_MAX_PROGS)
>> +
>> +    ids_len = query.ids_len;
>> +    if (ids_len > BPF_TRACE_MAX_PROGS)
>>          return -E2BIG;
>> +    ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
>> +    if (!ids)
>> +        return -ENOMEM;
>>
>>      mutex_lock(&bpf_event_mutex);
>>      ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
>> -                       uquery->ids,
>> -                       query.ids_len,
>> -                       &uquery->prog_cnt);
>> +                       ids,
>> +                       ids_len,
>> +                       &prog_cnt);
>>      mutex_unlock(&bpf_event_mutex);
>>
>> +    if (!ret || ret == -ENOSPC) {
>> +        if (copy_to_user(&uquery->prog_cnt, &prog_cnt, 
>> sizeof(prog_cnt)) ||
>> +            copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
>> +            ret = -EFAULT;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +out:
>> +    kfree(ids);
> 
> alloc/free just to avoid this locking dependency feels suboptimal.

We actually already did kcalloc/kfree in bpf_prog_array_copy_to_user.
In that function, we did not copy_to_user one id at a time.
We allocate a temporary array and store the result there
and at the end, we call one copy_to_user to copy to the user buffer.

The patch here just moved this allocation and associated copy_to_user
out of the function and bpf_event_mutex. It did not introduce new
allocations.

> 
> may be we can get rid of bpf_event_mutex in some cases?
> the perf event itself is locked via perf_event_ctx_lock() when we're
> calling perf_event_query_prog_array, perf_event_attach|detach_bpf_prog.
> I forgot what was the motivation for us to introduce it in the
> first place.

The original motivation for the lock to make sure bpf_prog_array
does not change during middle of attach/detach/query. it looks like
we have:
    . perf_event_attach|query under perf_event_ctx_lock
    . perf_event_detach not under perf_event_ctx_lock
Introducing perf_event_ctx_lock to perf_event_detach could still
have the deadlock.

> 

^ permalink raw reply

* Re: [PATCH iproute2-next 1/1] tc: jsonify skbedit action
From: Roman Mashak @ 2018-04-09 17:57 UTC (permalink / raw)
  To: David Ahern; +Cc: stephen, netdev, kernel, jhs, xiyou.wangcong, jiri
In-Reply-To: <f38705ad-38d0-9e65-510c-c71095e5afd8@gmail.com>

David Ahern <dsahern@gmail.com> writes:

> On 4/3/18 1:24 PM, Roman Mashak wrote:
>>  	if (tb[TCA_SKBEDIT_PTYPE] != NULL) {
>> -		ptype = RTA_DATA(tb[TCA_SKBEDIT_PTYPE]);
>> -		if (*ptype == PACKET_HOST)
>> -			fprintf(f, " ptype host");
>> -		else if (*ptype == PACKET_BROADCAST)
>> -			fprintf(f, " ptype broadcast");
>> -		else if (*ptype == PACKET_MULTICAST)
>> -			fprintf(f, " ptype multicast");
>> -		else if (*ptype == PACKET_OTHERHOST)
>> -			fprintf(f, " ptype otherhost");
>> +		ptype = rta_getattr_u16(tb[TCA_SKBEDIT_PTYPE]);
>> +		if (ptype == PACKET_HOST)
>> +			print_string(PRINT_ANY, "ptype", " %s", "ptype host");
>> +		else if (ptype == PACKET_BROADCAST)
>> +			print_string(PRINT_ANY, "ptype", " %s",
>> +				     "ptype broadcast");
>> +		else if (ptype == PACKET_MULTICAST)
>> +			print_string(PRINT_ANY, "ptype", " %s",
>> +				     "ptype multicast");
>> +		else if (ptype == PACKET_OTHERHOST)
>> +			print_string(PRINT_ANY, "ptype", " %s",
>> +				     "ptype otherhost");
>
> Shouldn't that be:
>     print_string(PRINT_ANY, "ptype", "ptype %s", "otherhost");
>
> And ditto for the other strings.
>
>>  		else
>> -			fprintf(f, " ptype %d", *ptype);
>> +			print_uint(PRINT_ANY, "ptype", " %u", ptype);
>
> And then this one needs 'ptype' before %u

OK. I will send v2.

^ permalink raw reply

* RE: [Intel-wired-lan] [next-queue PATCH v6 10/10] igb: Add support for adding offloaded clsflower filters
From: Vinicius Costa Gomes @ 2018-04-09 17:36 UTC (permalink / raw)
  To: Brown, Aaron F, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, Sanchez-Palencia, Jesus
In-Reply-To: <309B89C4C689E141A5FF6A0C5FB2118B8C82598F@ORSMSX101.amr.corp.intel.com>

Hi,

"Brown, Aaron F" <aaron.f.brown@intel.com> writes:

>> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
>> Behalf Of Vinicius Costa Gomes
>> Sent: Thursday, March 29, 2018 2:08 PM
>> To: intel-wired-lan@lists.osuosl.org
>> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus <jesus.sanchez-
>> palencia@intel.com>
>> Subject: [Intel-wired-lan] [next-queue PATCH v6 10/10] igb: Add support for
>> adding offloaded clsflower filters
>> 
>> This allows filters added by tc-flower and specifying MAC addresses,
>> Ethernet types, and the VLAN priority field, to be offloaded to the
>> controller.
>
> Can I get a brief explanation for enabling this?  I'm currently happy
> with this patch series from a regression perspective, but am
> personally a bit, umm, challenged with tc in general but would like to
> run it through the paces a bit.  If it can be done in a one or two
> liner I think it would be a good addition to the patch description.


Of course, my bad for not adding those examples since the begining.



Cheers,

^ permalink raw reply

* RE: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC address support for ethtool nftuple filters
From: Vinicius Costa Gomes @ 2018-04-09 17:33 UTC (permalink / raw)
  To: Brown, Aaron F, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, Sanchez-Palencia, Jesus
In-Reply-To: <309B89C4C689E141A5FF6A0C5FB2118B8C825976@ORSMSX101.amr.corp.intel.com>

Hi,

"Brown, Aaron F" <aaron.f.brown@intel.com> writes:

[...]

>
>> 
>> I added that note in the hope that someone else would have an stronger
>> opinion about what to do.
>
> I don't have a strong opinion beyond my preference for an ideal world
> where everything works :)  If the part simply cannot filter on the src
> address as a whole without the protocol I would ideally prefer an
> attempt in ethtool to set the filter on src address as a whole to
> return an error WHILE still allowing the filter to be set on an
> ethertype when the proto keyword is issued.  If ethtool does not allow
> that fine grain of control then I think the way it is now is good, I'd
> rather have the annoyance of being able to set a filter that does
> nothing then not be able to set the more specific filter at all.

We are agreed. The next version of this series implements just that:
specifying the src address alone is an error, but if the classifier has
a src address and any other filter, it works.


Cheers,

^ permalink raw reply

* [PATCH RFC iptables] iptables: Per-net ns lock
From: Kirill Tkhai @ 2018-04-09 16:55 UTC (permalink / raw)
  To: netdev, pablo, rstoyanov1, ptikhomirov, rstoyanov1, avagin,
	ktkhai

In CRIU and LXC-restore we met the situation,
when iptables in container can't be restored
because of permission denied:

https://github.com/checkpoint-restore/criu/issues/469

Containers want to restore their own net ns,
while they may have no their own mnt ns.
This case they share host's /run/xtables.lock
file, but they may not have permission to open
it.

Patch makes /run/xtables.lock to be per-namespace,
i.e., to refer to the caller task's net ns.

What you think?

Thanks,
Kirill

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 iptables/xshared.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/iptables/xshared.c b/iptables/xshared.c
index 06db72d4..b6dbe4e7 100644
--- a/iptables/xshared.c
+++ b/iptables/xshared.c
@@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval *wait_interval)
 	time_left.tv_sec = wait;
 	time_left.tv_usec = 0;
 
-	fd = open(XT_LOCK_NAME, O_CREAT, 0600);
+	if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&
+	    errno != EEXIST) {
+		fprintf(stderr, "Fatal: can't create lock file\n");
+		return XT_LOCK_FAILED;
+	}
+	fd = open(XT_LOCK_NAME, O_RDONLY);
 	if (fd < 0) {
 		fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
 			XT_LOCK_NAME, strerror(errno));

^ permalink raw reply related

* Re: [RFC PATCH bpf-next 2/6] bpf: add bpf_get_stack helper
From: Yonghong Song @ 2018-04-09 16:52 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov, netdev; +Cc: kernel-team
In-Reply-To: <181ba790-db7c-943c-8b77-aabe10f1a1de@iogearbox.net>



On 4/9/18 3:01 AM, Daniel Borkmann wrote:
> On 04/09/2018 07:02 AM, Alexei Starovoitov wrote:
>> On 4/8/18 9:53 PM, Yonghong Song wrote:
>>>>> @@ -1004,7 +1007,8 @@ static void __bpf_prog_put(struct bpf_prog
>>>>> *prog, bool do_idr_lock)
>>>>>               bpf_prog_kallsyms_del(prog->aux->func[i]);
>>>>>           bpf_prog_kallsyms_del(prog);
>>>>>
>>>>> -        call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
>>>>> +        synchronize_rcu();
>>>>> +        __bpf_prog_put_rcu(&prog->aux->rcu);
>>>>
>>>> there should have been lockdep splat.
>>>> We cannot call synchronize_rcu here, since we cannot sleep
>>>> in some cases.
>>>
>>> Let me double check this. The following is the reason
>>> why I am using synchronize_rcu().
>>>
>>> With call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu) and
>>> _bpf_prog_put_rcu calls put_callchain_buffers() which
>>> calls mutex_lock(), the runtime with CONFIG_DEBUG_ATOMIC_SLEEP=y
>>> will complains since potential sleep inside the call_rcu is not
>>> allowed.
>>
>> I see. Indeed. We cannot call put_callchain_buffers() from rcu callback,
>> but doing synchronize_rcu() here is also not possible.
>> How about moving put_callchain into bpf_prog_free_deferred() ?
> 
> +1, the assumption is that you can call bpf_prog_put() and also the
> bpf_map_put() from any context. Sleeping here for a long time might
> subtly break things badly.

Thanks for the suggestion! This should work.

^ permalink raw reply

* Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
From: Alexei Starovoitov @ 2018-04-09 16:47 UTC (permalink / raw)
  To: Yonghong Song, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20180409161819.4024793-1-yhs@fb.com>

On 4/9/18 9:18 AM, Yonghong Song wrote:
> syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
...
> @@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
>  		return -EINVAL;
>  	if (copy_from_user(&query, uquery, sizeof(query)))
>  		return -EFAULT;
> -	if (query.ids_len > BPF_TRACE_MAX_PROGS)
> +
> +	ids_len = query.ids_len;
> +	if (ids_len > BPF_TRACE_MAX_PROGS)
>  		return -E2BIG;
> +	ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
> +	if (!ids)
> +		return -ENOMEM;
>
>  	mutex_lock(&bpf_event_mutex);
>  	ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
> -				       uquery->ids,
> -				       query.ids_len,
> -				       &uquery->prog_cnt);
> +				       ids,
> +				       ids_len,
> +				       &prog_cnt);
>  	mutex_unlock(&bpf_event_mutex);
>
> +	if (!ret || ret == -ENOSPC) {
> +		if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
> +		    copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +	}
> +
> +out:
> +	kfree(ids);

alloc/free just to avoid this locking dependency feels suboptimal.

may be we can get rid of bpf_event_mutex in some cases?
the perf event itself is locked via perf_event_ctx_lock() when we're
calling perf_event_query_prog_array, perf_event_attach|detach_bpf_prog.
I forgot what was the motivation for us to introduce it in the
first place.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox