* Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Stephen Hemminger @ 2018-04-11 16:14 UTC (permalink / raw)
To: Stephen Suryaputra; +Cc: netdev
In-Reply-To: <1523415335-17154-1-git-send-email-ssuryaextr@gmail.com>
On Tue, 10 Apr 2018 22:55:35 -0400
Stephen Suryaputra <ssuryaextr@gmail.com> wrote:
> This is enhanced from the proposed patch by Igor Maravic in 2011 to
> support per interface IPv4 stats. The enhancement is mainly adding a
> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
>
> Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Net-next is closed. http://vger.kernel.org/~davem/net-next.html
Also, these kind of statistics have a negative performance impact.
^ permalink raw reply
* Re: [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init
From: James Bottomley @ 2018-04-11 16:16 UTC (permalink / raw)
To: Jia-Ju Bai, davem, stephen, johannes.berg, arvind.yadav.cs,
dhowells
Cc: netdev, linux-parisc, linux-kernel
In-Reply-To: <1523461182-5897-1-git-send-email-baijiaju1990@gmail.com>
On Wed, 2018-04-11 at 23:39 +0800, Jia-Ju Bai wrote:
> de4x5_hw_init() is never called in atomic context.
>
> de4x5_hw_init() is only called by de4x5_pci_probe(), which is only
> set as ".probe" in struct pci_driver.
>
> Despite never getting called from atomic context, de4x5_hw_init()
> calls mdelay() to busily wait. This is not necessary and can be
> replaced with usleep_range() to avoid busy waiting.
>
> This is found by a static analysis tool named DCNS written by myself.
> And I also manually check it.
Did you actually test this? The usual reason for wanting m/udelay is
that the timing must be exact. The driver is filled with mdelay()s for
this reason. The one you've picked on is in the init path so it won't
affect the runtime in any way. I also don't think we have the hrtimer
machinery for usleep_range() to work properly on parisc, so I don't
think the replacement works.
James
^ permalink raw reply
* Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr
From: Xin Long @ 2018-04-11 16:16 UTC (permalink / raw)
To: Marcelo Ricardo Leitner; +Cc: Neil Horman, network dev, linux-sctp, davem
In-Reply-To: <20180411145910.GC3711@localhost.localdomain>
On Wed, Apr 11, 2018 at 10:59 PM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Wed, Apr 11, 2018 at 10:36:07AM -0400, Neil Horman wrote:
>> On Wed, Apr 11, 2018 at 08:58:05PM +0800, Xin Long wrote:
>> > pf->cmp_addr() is called before binding a v6 address to the sock. It
>> > should not check ports, like in sctp_inet_cmp_addr.
>> >
>> > But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
>> > sctp_v6_cmp_addr where it also compares the ports.
>> >
>> > This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind
>> > multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp:
>> > lack the check for ports in sctp_v6_cmp_addr").
>> >
>> > This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr,
>> > but do the proper check for both v6 addrs and v4mapped addrs.
>> >
>> > Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr")
>> > Reported-by: Jianwen Ji <jiji@redhat.com>
>> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> > ---
>> > net/sctp/ipv6.c | 27 ++++++++++++++++++++++++---
>> > 1 file changed, 24 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
>> > index f1fc48e..be4b72c 100644
>> > --- a/net/sctp/ipv6.c
>> > +++ b/net/sctp/ipv6.c
>> > @@ -846,8 +846,8 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
>> > const union sctp_addr *addr2,
>> > struct sctp_sock *opt)
>> > {
>> > - struct sctp_af *af1, *af2;
>> > struct sock *sk = sctp_opt2sk(opt);
>> > + struct sctp_af *af1, *af2;
>> >
>> > af1 = sctp_get_af_specific(addr1->sa.sa_family);
>> > af2 = sctp_get_af_specific(addr2->sa.sa_family);
>> > @@ -863,10 +863,31 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
>> > if (sctp_is_any(sk, addr1) || sctp_is_any(sk, addr2))
>> > return 1;
>> >
>> > - if (addr1->sa.sa_family != addr2->sa.sa_family)
>> > + if (addr1->sa.sa_family != addr2->sa.sa_family) {
>> > + if (addr1->sa.sa_family == AF_INET &&
>> > + addr2->sa.sa_family == AF_INET6 &&
>> > + ipv6_addr_v4mapped(&addr2->v6.sin6_addr))
>> > + if (addr2->v6.sin6_addr.s6_addr32[3] ==
>> > + addr1->v4.sin_addr.s_addr)
>> > + return 1;
>> > + if (addr2->sa.sa_family == AF_INET &&
>> > + addr1->sa.sa_family == AF_INET6 &&
>> > + ipv6_addr_v4mapped(&addr1->v6.sin6_addr))
>> > + if (addr1->v6.sin6_addr.s6_addr32[3] ==
>> > + addr2->v4.sin_addr.s_addr)
>> > + return 1;
>> > + return 0;
>> > + }
>> > +
>> > + if (!ipv6_addr_equal(&addr1->v6.sin6_addr, &addr2->v6.sin6_addr))
>> > + return 0;
>> > +
>> > + if ((ipv6_addr_type(&addr1->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) &&
>> > + addr1->v6.sin6_scope_id && addr2->v6.sin6_scope_id &&
>> > + addr1->v6.sin6_scope_id != addr2->v6.sin6_scope_id)
>> > return 0;
>> >
>> > - return af1->cmp_addr(addr1, addr2);
>> > + return 1;
>> > }
>> >
>> > /* Verify that the provided sockaddr looks bindable. Common verification,
>> > --
>> > 2.1.0
>> >
>> This looks correct to me, but is it worth duplicating the comparison code like
>> this from the cmp_addr function? It might be more worthwhile to add a flag to
>> the cmp_addr method to direct weather it needs to check port values or not.
>> That way you could continue to use the cmp_addr function here.
>
> Adding a flag into sctp_v6_cmp_addr will get us a terrible code to
> read. It's already not one of the best looking part of it. Maybe
> still duplicate part of it it, but at 'af' level? As in:
> - af->cmp_addr
> - af->cmp_addr_port
>
What do you think of:
static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
const union sctp_addr *addr2)
{
return __sctp_v6_cmp_addr(addr1, addr2) &&
addr1->v6.sin_port == addr2->v6.sin_port;
}
(v6.sin_port and v4.sin_port have the same offset in union sctp_addr,
we've exploited this in many places in SCTP)
^ permalink raw reply
* Re: [PATCH] net: dsa: b53: Replace mdelay with msleep in b53_switch_reset_gpio
From: Florian Fainelli @ 2018-04-11 16:19 UTC (permalink / raw)
To: Jia-Ju Bai, Phil Reid, andrew, vivien.didelot; +Cc: netdev, linux-kernel
In-Reply-To: <7c9dffb7-d55a-98f5-4024-7e2f0c704a56@gmail.com>
On 04/11/2018 12:14 AM, Jia-Ju Bai wrote:
>
>
> On 2018/4/11 13:30, Phil Reid wrote:
>> On 11/04/2018 09:51, Jia-Ju Bai wrote:
>>> b53_switch_reset_gpio() is never called in atomic context.
>>>
>>> The call chain ending up at b53_switch_reset_gpio() is:
>>> [1] b53_switch_reset_gpio() <- b53_switch_reset() <-
>>> b53_reset_switch() <- b53_setup()
>>>
>>> b53_switch_reset_gpio() is set as ".setup" in struct dsa_switch_ops.
>>> This function is not called in atomic context.
>>>
>>> Despite never getting called from atomic context,
>>> b53_switch_reset_gpio()
>>> calls mdelay() to busily wait.
>>> This is not necessary and can be replaced with msleep() to
>>> avoid busy waiting.
>>>
>>> This is found by a static analysis tool named DCNS written by myself.
>>> And I also manually check it.
>>>
>>> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
>>> ---
>>> drivers/net/dsa/b53/b53_common.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/dsa/b53/b53_common.c
>>> b/drivers/net/dsa/b53/b53_common.c
>>> index 274f367..e070ff6 100644
>>> --- a/drivers/net/dsa/b53/b53_common.c
>>> +++ b/drivers/net/dsa/b53/b53_common.c
>>> @@ -597,10 +597,10 @@ static void b53_switch_reset_gpio(struct
>>> b53_device *dev)
>>> /* Reset sequence: RESET low(50ms)->high(20ms)
>>> */
>>> gpio_set_value(gpio, 0);
>>> - mdelay(50);
>>> + msleep(50);
>>> gpio_set_value(gpio, 1);
>>> - mdelay(20);
>>> + msleep(20);
>>> dev->current_page = 0xff;
>>> }
>>>
>> Would that also imply gpio_set_value could be gpio_set_value_cansleep?
>>
>
> Yes, I think gpio_set_value_cansleep() is okay here?
> Do I need to send a V2 patch to replace gpio_set_value()?
Yes, I would lump these two changes in the same patch since this is
effectively about solving sleeping vs. non sleeping operations.
Thanks!
--
Florian
^ permalink raw reply
* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-11 16:40 UTC (permalink / raw)
To: Christian Brauner
Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180411090907.GA20340@gmail.com>
Christian Brauner <christian.brauner@canonical.com> writes:
> On Tue, Apr 10, 2018 at 10:04:46AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>>
>> > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >>
>> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >> >>
>> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >> >> >> >>>
>> >> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
>> >> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >> >> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >> >> >>>
>> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >> >> >> >>> does not match the namespace tag of the netlink socket. One example are
>> >> >> >> >>> network devices. Uevents for network devices only show up in the network
>> >> >> >> >>> namespaces these devices are moved to or created in.
>> >> >> >> >>>
>> >> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >> >> >> >>> into all network namespaces.
>> >> >> >> >>>
>> >> >> >> >>> The original patchset was written in 2010 before user namespaces were a
>> >> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
>> >> >> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >> >> >>>
>> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >> >> >>>
>> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >> >> >>> return;
>> >> >> >> >>>
>> >> >> >> >>> if (!peernet_has_id(sock_net(sk), p->net))
>> >> >> >> >>> return;
>> >> >> >> >>>
>> >> >> >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >> >> >>> CAP_NET_BROADCAST))
>> >> >> >> >>> j return;
>> >> >> >> >>> }
>> >> >> >> >>>
>> >> >> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
>> >> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
>> >> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >> >> >>> namespace tag should never be sent into another user namespace. This has
>> >> >> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >> >> >>> identity mapping is established between root on the host and root in the
>> >> >> >> >>> new user namespace. Here's a reproducer:
>> >> >> >> >>>
>> >> >> >> >>> sudo unshare -U --map-root
>> >> >> >> >>> udevadm monitor -k
>> >> >> >> >>> # Now change to initial user namespace and e.g. do
>> >> >> >> >>> modprobe kvm
>> >> >> >> >>> # or
>> >> >> >> >>> rmmod kvm
>> >> >> >> >>>
>> >> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >> >> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
>> >> >> >> >>> with them.
>> >> >> >> >>>
>> >> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
>> >> >> >> >>> make a decision what uevents should be sent.
>> >> >> >> >>>
>> >> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >> >> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >> >> >> >>> event will always be filtered.
>> >> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
>> >> >> >> >>> initial user namespace but was opened from a non-initial user namespace
>> >> >> >> >>> the event will be filtered as well.
>> >> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >> >> >> >>> always only sent to the initial user namespace. The regression potential
>> >> >> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >> >> >>> anything with interesting devices.
>> >> >> >> >>>
>> >> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >> >> >> >>> ---
>> >> >> >> >>> lib/kobject_uevent.c | 10 +++++++++-
>> >> >> >> >>> 1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >> >> >>>
>> >> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >> >> >> >>> --- a/lib/kobject_uevent.c
>> >> >> >> >>> +++ b/lib/kobject_uevent.c
>> >> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >> >> >>> return sock_ns != ns;
>> >> >> >> >>> }
>> >> >> >> >>>
>> >> >> >> >>> - return 0;
>> >> >> >> >>> + /*
>> >> >> >> >>> + * The kobject does not carry a namespace tag so filter by user
>> >> >> >> >>> + * namespace below.
>> >> >> >> >>> + */
>> >> >> >> >>> + if (sock_net(dsk)->user_ns != &init_user_ns)
>> >> >> >> >>> + return 1;
>> >> >> >> >>> +
>> >> >> >> >>> + /* Check if socket was opened from non-initial user namespace. */
>> >> >> >> >>> + return sk_user_ns(dsk) != &init_user_ns;
>> >> >> >> >>> }
>> >> >> >> >>> #endif
>> >> >> >> >>
>> >> >> >> >> So, this prohibits to listen events of all devices except network-related
>> >> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> >> >> >> >
>> >> >> >> > No, this is not correct: As it is right now *without my patch* no
>> >> >> >> > non-initial user namespace is receiving *any uevents* but those
>> >> >> >> > specifically namespaced such as those for network devices. This patch
>> >> >> >> > doesn't change that at all. The commit message outlines this in detail
>> >> >> >> > how this comes about.
>> >> >> >> > There is only one case where this currently breaks and this is as I
>> >> >> >> > outlined explicitly in my commit message when you create a new user
>> >> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
>> >> >> >>
>> >> >> >> Could you please point the place, where non-initial user namespaces are filtered?
>> >> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> >> >> >> Now it will return 1 sometimes.
>> >> >> >
>> >> >> > Oh sure, it's in the commit message though. The callchain is
>> >> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
>> >> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
>> >> >> > net/netlink/af_netlink.c:do_one_broadcast():
>> >> >> >
>> >> >> > This codepiece will check whether the openened socket holds
>> >> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
>> >> >> > which it won't because we don't have device namespaces and all devices
>> >> >> > belong to the initial set of namespaces.
>> >> >> >
>> >> >> > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >> > CAP_NET_BROADCAST))
>> >> >> > j return;
>> >> >> >
>> >> >>
>> >> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
>> >> >> on their socket and has had someone with the appropriate privileges
>> >> >> assign a peerrnetid.
>> >> >>
>> >> >> All of which is to say that file_ns_capable is not nearly as applicable
>> >> >> as it might be, and if you can pass the other two checks I think it is
>> >> >> pointless (because the peernet attributes are not generated for
>> >> >> kobj_uevents) but valid to receive events from outside your network
>> >> >> namespace.
>> >> >>
>> >> >>
>> >> >> I might be missing something but I don't see anything excluding network
>> >> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
>> >> >> logic.
>> >> >>
>> >> >> The uevent_sock_list has one entry per network namespace. And
>> >> >> kobject_uevent_net_broacast appears to walk each one.
>> >> >>
>> >> >> I had a memory of filtering uevent messages and I had a memory
>> >> >> that the netlink_has_listeners had a per network namespace effect.
>> >> >> Neither seems true from my inspection of the code tonight.
>> >> >>
>> >> >> If we are not filtering ordinary uevents at least at the user namespace
>> >> >> level that does seem to be at least a nuisance.
>> >> >>
>> >> >>
>> >> >> Christian can you dig a little deeper into this. I have a feeling that
>> >> >> there are some real efficiency improvements that we could make to
>> >> >> kobject_uevent_net_broadcast if nothing else.
>> >> >>
>> >> >> Perhaps you could see where uevents are broadcast by poking
>> >> >> the sysfs uevent of an existing device, and triggering a retransmit.
>> >> >
>> >> > @Eric, so I did some intensive testing over the weekend and forget everything I
>> >> > said before. Uevents are not filtered by the kernel at the moment. This is
>> >> > currently - apart from network devices - a pure userspace thing. Specifically,
>> >> > anyone on the host can listen to all uevents from everywhere. It's neither
>> >> > filtered by user nor by network namespace. The fact that I used
>> >> >
>> >> > udevadm --debug monitor
>> >> >
>> >> > to test my prior hypothesis was what led me to believe that uevents are already
>> >> > correctly filtered.
>> >> > Instead, what is actually happening is that every udev implementation out there
>> >> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
>> >> > Specifically,
>> >>
>> >> Yes. I remember something of the sort. I believe udev also filters to
>> >> ensure that the netlink port id == 0 to ensure the message came from
>> >> the kernel and was not spoofed in any way.
>> >>
>> >> > - legacy standalone udevd:
>> >> > https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
>> >> > - eudevd
>> >> > https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
>> >> > - systemd-udevd
>> >> > https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
>> >> > - ueventd (Android)
>> >> > https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
>> >> >
>> >> > For all of those I could trace this behavior back to the first released
>> >> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
>> >> > I could trace it back to the first version that is still available on
>> >> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
>> >> > trivially true that it has the same behavior from the beginning.)
>> >> >
>> >> > In any case, userspace udev is not making use of uevents at all right now since
>> >> > any uid != 0 events are **explicitly** discarded.
>> >> > The fact that you receive uevents for
>> >> >
>> >> > sudo unshare -U --map-root -n
>> >> > udevadm --debug monitor
>> >> >
>> >> > is simply explained by the fact that container(0) <=> host(0) at which point
>> >> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
>> >> > discard it.
>> >> > The use case for receiving uevents in containers/user namespaces is definitely
>> >> > there but that's what the uevent injection patch series was for that we merged.
>> >> > This is a much safer and saner solution.
>> >> > The fact that all udev implementations filter uevents by uid != 0 very much
>> >> > seems like a security mechanism in userspace that we probably should provide by
>> >> > isolating uevents based on user and/or network namespaces.
>> >>
>> >> So in summary. Uevents are filtered in a user namespace (by userspace)
>> >> because the received uid != 0. It instead == 65534 == "nobody" because
>> >> the global root uid is not mapped.
>> >
>> > Exactly.
>> >
>> >>
>> >> Which means that we can modify the kernel to not send to all network
>> >> namespaces whose user_ns != &init_user_ns because we know that userspace
>> >> will ignore the message because of the uid anyway. Which means when
>> >
>> > Yes.
>> >
>> >> net-next reopens you can send that patch. But please base it on just
>> >> not including network namespaces in the list, as that is much more
>> >> efficient than adding more conditions to the filter.
>> >
>> > I'll send a patch out once net-next reopens. I'll also make sure to
>> > inform all udev userspace implementations of the change. It won't affect
>> > them but it is nice for them to know that they're safer now.
>>
>> The real danger is in a user namespace or in a container really is too
>> many daemons responding to events will generate a thundering hurd of
>> activity when there is really nothing to do.
>>
>> > Something like this (Proper commit message and so on will be added once
>> > I sent this out.):
>>
>> Exactly.
>>
>> I would make the comment say something like: "ignore all but the initial
>> user namespace".
>
> Yeah, agreed.
> But I think the patch is not complete. To guarantee that no non-initial
> user namespace actually receives uevents we need to:
> 1. only sent uevents to uevent sockets that are located in network
> namespaces that are owned by init_user_ns
> 2. filter uevents that are sent to sockets in mc_list that have opened a
> uevent socket that is owned by init_user_ns *from* a
> non-init_user_ns
>
> We account for 1. by only recording uevent sockets in the global uevent
> socket list who are owned by init_user_ns.
> But to account for 2. we need to filter by the user namespace who owns
> the socket in mc_list. So in addition to that we also need to slightly
> change the filter logic in kobj_bcast_filter() I think:
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 22a2c1a98b8f..064d7d29ace5 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> return sock_ns != ns;
> }
>
> - return 0;
> + /* Check if socket was opened from non-initial user namespace. */
> + return sk_user_ns(dsk) != &init_user_ns;
> }
> #endif
>
>
> But correct me if I'm wrong.
You are worrying about NETLINK_LISTEN_ALL_NSID sockets. That has
permissions and an explicit opt-in to receiving packets from multiple
network namespaces.
There is a netlink CMSG that tells you which network namespace the
packet comes from. Something any sane person will use if they set
NETLINK_LISTEN_ALL_NSID.
There is no problem of confusion. There is no problem of permissions.
So we don't need to worry about preventing NETLINK_LISTEN_ALL_NSID
listeners from seeing the uevents.
Eric
^ permalink raw reply
* Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr
From: David Miller @ 2018-04-11 16:40 UTC (permalink / raw)
To: lucien.xin; +Cc: marcelo.leitner, nhorman, netdev, linux-sctp
In-Reply-To: <CADvbK_egknPZCHRkZoLZ2MuGhRyaYKcMCaqG9Apt=QJ3-o-D6A@mail.gmail.com>
From: Xin Long <lucien.xin@gmail.com>
Date: Thu, 12 Apr 2018 00:16:58 +0800
> What do you think of:
>
> static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
> const union sctp_addr *addr2)
> {
> return __sctp_v6_cmp_addr(addr1, addr2) &&
> addr1->v6.sin_port == addr2->v6.sin_port;
> }
>
> (v6.sin_port and v4.sin_port have the same offset in union sctp_addr,
> we've exploited this in many places in SCTP)
>From my perspective this is OK.
^ permalink raw reply
* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-11 17:03 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <87tvshlms1.fsf@xmission.com>
On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
>
> > On Tue, Apr 10, 2018 at 10:04:46AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >>
> >> > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >>
> >> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> >>
> >> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >> >> >> >>>
> >> >> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >> >> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >> >> >> >>>
> >> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >> >> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >> >> >> >>> network devices. Uevents for network devices only show up in the network
> >> >> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >> >> >>>
> >> >> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >> >> >> >>> into all network namespaces.
> >> >> >> >> >>>
> >> >> >> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >> >> >>>
> >> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >> >> >>>
> >> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >> >> >>> return;
> >> >> >> >> >>>
> >> >> >> >> >>> if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >> >> >>> return;
> >> >> >> >> >>>
> >> >> >> >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >> >> >>> CAP_NET_BROADCAST))
> >> >> >> >> >>> j return;
> >> >> >> >> >>> }
> >> >> >> >> >>>
> >> >> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >> >> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >> >> >> >>> been the intention all along. But there's one case where this breaks,
> >> >> >> >> >>> namely if a new user namespace is created by root on the host and an
> >> >> >> >> >>> identity mapping is established between root on the host and root in the
> >> >> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >> >> >>>
> >> >> >> >> >>> sudo unshare -U --map-root
> >> >> >> >> >>> udevadm monitor -k
> >> >> >> >> >>> # Now change to initial user namespace and e.g. do
> >> >> >> >> >>> modprobe kvm
> >> >> >> >> >>> # or
> >> >> >> >> >>> rmmod kvm
> >> >> >> >> >>>
> >> >> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >> >> >> >>> host. This seems very anecdotal given that in the general case user
> >> >> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >> >> >> >>> with them.
> >> >> >> >> >>>
> >> >> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >> >> >> >>> make a decision what uevents should be sent.
> >> >> >> >> >>>
> >> >> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >> >> >> >>> in kobj_bcast_filter(). Specifically:
> >> >> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >> >> >> >>> event will always be filtered.
> >> >> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >> >> >> >>> initial user namespace but was opened from a non-initial user namespace
> >> >> >> >> >>> the event will be filtered as well.
> >> >> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >> >> >> >>> always only sent to the initial user namespace. The regression potential
> >> >> >> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >> >> >> >>> anything with interesting devices.
> >> >> >> >> >>>
> >> >> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >> >> >> >>> ---
> >> >> >> >> >>> lib/kobject_uevent.c | 10 +++++++++-
> >> >> >> >> >>> 1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> >> >> >>>
> >> >> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >> >> >> >>> --- a/lib/kobject_uevent.c
> >> >> >> >> >>> +++ b/lib/kobject_uevent.c
> >> >> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >> >> >>> return sock_ns != ns;
> >> >> >> >> >>> }
> >> >> >> >> >>>
> >> >> >> >> >>> - return 0;
> >> >> >> >> >>> + /*
> >> >> >> >> >>> + * The kobject does not carry a namespace tag so filter by user
> >> >> >> >> >>> + * namespace below.
> >> >> >> >> >>> + */
> >> >> >> >> >>> + if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >> >> >> >>> + return 1;
> >> >> >> >> >>> +
> >> >> >> >> >>> + /* Check if socket was opened from non-initial user namespace. */
> >> >> >> >> >>> + return sk_user_ns(dsk) != &init_user_ns;
> >> >> >> >> >>> }
> >> >> >> >> >>> #endif
> >> >> >> >> >>
> >> >> >> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> >> >> >> >
> >> >> >> >> > No, this is not correct: As it is right now *without my patch* no
> >> >> >> >> > non-initial user namespace is receiving *any uevents* but those
> >> >> >> >> > specifically namespaced such as those for network devices. This patch
> >> >> >> >> > doesn't change that at all. The commit message outlines this in detail
> >> >> >> >> > how this comes about.
> >> >> >> >> > There is only one case where this currently breaks and this is as I
> >> >> >> >> > outlined explicitly in my commit message when you create a new user
> >> >> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> >> >> >>
> >> >> >> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> >> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> >> >> >> Now it will return 1 sometimes.
> >> >> >> >
> >> >> >> > Oh sure, it's in the commit message though. The callchain is
> >> >> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> >> >> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> >> >> >> > net/netlink/af_netlink.c:do_one_broadcast():
> >> >> >> >
> >> >> >> > This codepiece will check whether the openened socket holds
> >> >> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> >> >> >> > which it won't because we don't have device namespaces and all devices
> >> >> >> > belong to the initial set of namespaces.
> >> >> >> >
> >> >> >> > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >> > CAP_NET_BROADCAST))
> >> >> >> > j return;
> >> >> >> >
> >> >> >>
> >> >> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> >> >> >> on their socket and has had someone with the appropriate privileges
> >> >> >> assign a peerrnetid.
> >> >> >>
> >> >> >> All of which is to say that file_ns_capable is not nearly as applicable
> >> >> >> as it might be, and if you can pass the other two checks I think it is
> >> >> >> pointless (because the peernet attributes are not generated for
> >> >> >> kobj_uevents) but valid to receive events from outside your network
> >> >> >> namespace.
> >> >> >>
> >> >> >>
> >> >> >> I might be missing something but I don't see anything excluding network
> >> >> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
> >> >> >> logic.
> >> >> >>
> >> >> >> The uevent_sock_list has one entry per network namespace. And
> >> >> >> kobject_uevent_net_broacast appears to walk each one.
> >> >> >>
> >> >> >> I had a memory of filtering uevent messages and I had a memory
> >> >> >> that the netlink_has_listeners had a per network namespace effect.
> >> >> >> Neither seems true from my inspection of the code tonight.
> >> >> >>
> >> >> >> If we are not filtering ordinary uevents at least at the user namespace
> >> >> >> level that does seem to be at least a nuisance.
> >> >> >>
> >> >> >>
> >> >> >> Christian can you dig a little deeper into this. I have a feeling that
> >> >> >> there are some real efficiency improvements that we could make to
> >> >> >> kobject_uevent_net_broadcast if nothing else.
> >> >> >>
> >> >> >> Perhaps you could see where uevents are broadcast by poking
> >> >> >> the sysfs uevent of an existing device, and triggering a retransmit.
> >> >> >
> >> >> > @Eric, so I did some intensive testing over the weekend and forget everything I
> >> >> > said before. Uevents are not filtered by the kernel at the moment. This is
> >> >> > currently - apart from network devices - a pure userspace thing. Specifically,
> >> >> > anyone on the host can listen to all uevents from everywhere. It's neither
> >> >> > filtered by user nor by network namespace. The fact that I used
> >> >> >
> >> >> > udevadm --debug monitor
> >> >> >
> >> >> > to test my prior hypothesis was what led me to believe that uevents are already
> >> >> > correctly filtered.
> >> >> > Instead, what is actually happening is that every udev implementation out there
> >> >> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
> >> >> > Specifically,
> >> >>
> >> >> Yes. I remember something of the sort. I believe udev also filters to
> >> >> ensure that the netlink port id == 0 to ensure the message came from
> >> >> the kernel and was not spoofed in any way.
> >> >>
> >> >> > - legacy standalone udevd:
> >> >> > https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
> >> >> > - eudevd
> >> >> > https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
> >> >> > - systemd-udevd
> >> >> > https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
> >> >> > - ueventd (Android)
> >> >> > https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
> >> >> >
> >> >> > For all of those I could trace this behavior back to the first released
> >> >> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
> >> >> > I could trace it back to the first version that is still available on
> >> >> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
> >> >> > trivially true that it has the same behavior from the beginning.)
> >> >> >
> >> >> > In any case, userspace udev is not making use of uevents at all right now since
> >> >> > any uid != 0 events are **explicitly** discarded.
> >> >> > The fact that you receive uevents for
> >> >> >
> >> >> > sudo unshare -U --map-root -n
> >> >> > udevadm --debug monitor
> >> >> >
> >> >> > is simply explained by the fact that container(0) <=> host(0) at which point
> >> >> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
> >> >> > discard it.
> >> >> > The use case for receiving uevents in containers/user namespaces is definitely
> >> >> > there but that's what the uevent injection patch series was for that we merged.
> >> >> > This is a much safer and saner solution.
> >> >> > The fact that all udev implementations filter uevents by uid != 0 very much
> >> >> > seems like a security mechanism in userspace that we probably should provide by
> >> >> > isolating uevents based on user and/or network namespaces.
> >> >>
> >> >> So in summary. Uevents are filtered in a user namespace (by userspace)
> >> >> because the received uid != 0. It instead == 65534 == "nobody" because
> >> >> the global root uid is not mapped.
> >> >
> >> > Exactly.
> >> >
> >> >>
> >> >> Which means that we can modify the kernel to not send to all network
> >> >> namespaces whose user_ns != &init_user_ns because we know that userspace
> >> >> will ignore the message because of the uid anyway. Which means when
> >> >
> >> > Yes.
> >> >
> >> >> net-next reopens you can send that patch. But please base it on just
> >> >> not including network namespaces in the list, as that is much more
> >> >> efficient than adding more conditions to the filter.
> >> >
> >> > I'll send a patch out once net-next reopens. I'll also make sure to
> >> > inform all udev userspace implementations of the change. It won't affect
> >> > them but it is nice for them to know that they're safer now.
> >>
> >> The real danger is in a user namespace or in a container really is too
> >> many daemons responding to events will generate a thundering hurd of
> >> activity when there is really nothing to do.
> >>
> >> > Something like this (Proper commit message and so on will be added once
> >> > I sent this out.):
> >>
> >> Exactly.
> >>
> >> I would make the comment say something like: "ignore all but the initial
> >> user namespace".
> >
> > Yeah, agreed.
> > But I think the patch is not complete. To guarantee that no non-initial
> > user namespace actually receives uevents we need to:
> > 1. only sent uevents to uevent sockets that are located in network
> > namespaces that are owned by init_user_ns
> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> > uevent socket that is owned by init_user_ns *from* a
> > non-init_user_ns
> >
> > We account for 1. by only recording uevent sockets in the global uevent
> > socket list who are owned by init_user_ns.
> > But to account for 2. we need to filter by the user namespace who owns
> > the socket in mc_list. So in addition to that we also need to slightly
> > change the filter logic in kobj_bcast_filter() I think:
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 22a2c1a98b8f..064d7d29ace5 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> > return sock_ns != ns;
> > }
> >
> > - return 0;
> > + /* Check if socket was opened from non-initial user namespace. */
> > + return sk_user_ns(dsk) != &init_user_ns;
> > }
> > #endif
> >
> >
> > But correct me if I'm wrong.
>
> You are worrying about NETLINK_LISTEN_ALL_NSID sockets. That has
> permissions and an explicit opt-in to receiving packets from multiple
> network namespaces.
I don't think that's what I'm talking about unless that is somehow the
default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
doing
unshare -U --map-root
then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
uevents. Imho, this should not be possible because I'm in a
non-init_user_ns. But currently I'm able to - even with the patch to
come - since the uevent socket in the kernel was created when init_net
was created and hence is *owned* by the init_user_ns which means it is
in the list of uevent sockets. Here's a demo of what I mean:
https://asciinema.org/a/175632
Christian
>
> There is a netlink CMSG that tells you which network namespace the
> packet comes from. Something any sane person will use if they set
> NETLINK_LISTEN_ALL_NSID.
>
> There is no problem of confusion. There is no problem of permissions.
> So we don't need to worry about preventing NETLINK_LISTEN_ALL_NSID
> listeners from seeing the uevents.
>
> Eric
^ permalink raw reply
* Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Stephen Suryaputra @ 2018-04-11 17:05 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20180411091434.38161a30@xeon-e3>
I read the thread and understand some of the concerns, but these are
useful for network equipments. That's why in this patch, I made it as
configuration option.
If you have feedbacks, I'll appreciate them.
I can resend when net-next is opened again.
Thanks.
On Wed, Apr 11, 2018 at 12:14 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 10 Apr 2018 22:55:35 -0400
> Stephen Suryaputra <ssuryaextr@gmail.com> wrote:
>
>> This is enhanced from the proposed patch by Igor Maravic in 2011 to
>> support per interface IPv4 stats. The enhancement is mainly adding a
>> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
>>
>> Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
>
>
> Net-next is closed. http://vger.kernel.org/~davem/net-next.html
>
> Also, these kind of statistics have a negative performance impact.
^ permalink raw reply
* Re: [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 18:01 UTC (permalink / raw)
To: Yonghong Song, daniel, ast
Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
Josef Bacik, Andrey Ignatov
In-Reply-To: <7e388b10-ccea-a2b0-e776-5420c8e7f521@netronome.com>
2018-04-11 16:44 UTC+0100 ~ Quentin Monnet <quentin.monnet@netronome.com>
> 2018-04-10 09:58 UTC-0700 ~ Yonghong Song <yhs@fb.com>
>> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>>> Add documentation for eBPF helper functions to bpf.h user header file.
>>> This documentation can be parsed with the Python script provided in
>>> another commit of the patch series, in order to provide a RST document
>>> that can later be converted into a man page.
>>>
>>> The objective is to make the documentation easily understandable and
>>> accessible to all eBPF developers, including beginners.
>>>
>>> This patch contains descriptions for the following helper functions:
>>>
>>> Helpers from Lawrence:
>>> - bpf_setsockopt()
>>> - bpf_getsockopt()
>>> - bpf_sock_ops_cb_flags_set()
>>>
>>> Helpers from Yonghong:
>>> - bpf_perf_event_read_value()
>>> - bpf_perf_prog_read_value()
>>>
>>> Helper from Josef:
>>> - bpf_override_return()
>>>
>>> Helper from Andrey:
>>> - bpf_bind()
>>>
>>> Cc: Lawrence Brakmo <brakmo@fb.com>
>>> Cc: Yonghong Song <yhs@fb.com>
>>> Cc: Josef Bacik <jbacik@fb.com>
>>> Cc: Andrey Ignatov <rdna@fb.com>
>>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>>> ---
> [...]
>
> Thanks Yonghong for the review!
>
> I have a favor to ask of you. I got a bounce for Kaixu Xia's email
> address, and I don't know what alternative email address I could use. I
> CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
> this series), which is rather close to bpf_perf_event_read_value().
> Would you mind having a look at that one too, please? The description is
> not long.
Well I read again the description I wrote, and actually the one for
bpf_perf_evnet_read() is nearly a subset of the one for
perf_event_read_value(). So the same comments that you raised earlier
apply, there's probably nothing more to review. But if you notice that
some important info is missing for bpf_perf_event_read(), I'm interested
too!
Quentin
^ permalink raw reply
* Re: [PATCH] net/mlx5: remove some extraneous spaces in indentations
From: Saeed Mahameed @ 2018-04-11 18:23 UTC (permalink / raw)
To: colin.king@canonical.com, jgg@ziepe.ca
Cc: Matan Barak, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-kernel@vger.kernel.org, leon@kernel.org,
kernel-janitors@vger.kernel.org
In-Reply-To: <20180410232719.qzaltu7hu57aqps3@ziepe.ca>
On Tue, 2018-04-10 at 17:27 -0600, Jason Gunthorpe wrote:
> On Tue, Apr 10, 2018 at 05:22:54PM -0600, Jason Gunthorpe wrote:
> > On Mon, Apr 09, 2018 at 01:43:36PM +0100, Colin Ian King wrote:
> > > From: Colin Ian King <colin.king@canonical.com>
> > >
> > > There are several lines where there is an extraneous space
> > > causing
> > > indentation misalignment. Remove them.
> > >
> > > Cleans up Cocconelle warnings:
> > >
> > > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:409:3-18: code
> > > aligned
> > > with following code on line 410
> > > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:415:3-18: code
> > > aligned
> > > with following code on line 416
> > > ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:421:3-18: code
> > > aligned
> > > with following code on line 422
> > >
> > > Signed-off-by: Colin Ian King <colin.king@canonical.com>
> > > drivers/net/ethernet/mellanox/mlx5/core/qp.c | 18 +++++++++-----
> > > ----
> > > 1 file changed, 9 insertions(+), 9 deletions(-)
> >
> > applied to for-next thanks
>
> Oh wait, this is for netdev, not rdma sorry. Never mind, DaveM should
> pick it up.
>
net-next is currently closed.
Dave, if it is ok with you I would like to apply this into mlx5-next
branch so it won't cause any issues with our next pull requests to rdma
and netdev.
> Jason
^ permalink raw reply
* [PATCH 4.9 005/310] x86/asm: Dont use RBP as a temporary register in csum_partial_copy_generic()
From: Greg Kroah-Hartman @ 2018-04-11 18:32 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Andrey Konovalov, Josh Poimboeuf,
Cong Wang, David S . Miller, Dmitry Vyukov, Eric Dumazet,
Kostya Serebryany, Linus Torvalds, Marcelo Ricardo Leitner,
Neil Horman, Peter Zijlstra, Thomas Gleixner, Vlad Yasevich,
linux-sctp, netdev, syzkaller, Ingo Molnar, Sasha Levin
In-Reply-To: <20180411183622.305902791@linuxfoundation.org>
4.9-stable review patch. If anyone has any objections, please let me know.
------------------
From: Josh Poimboeuf <jpoimboe@redhat.com>
[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]
Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:
WARNING: kernel stack regs at ffff8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0
The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().
That function saves RBP on the stack and then overwrites it, using it as
a scratch register. That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.
Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: netdev <netdev@vger.kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
arch/x86/lib/csum-copy_64.S | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
movq %r12, 3*8(%rsp)
movq %r14, 4*8(%rsp)
movq %r13, 5*8(%rsp)
- movq %rbp, 6*8(%rsp)
+ movq %r15, 6*8(%rsp)
movq %r8, (%rsp)
movq %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
/* main loop. clear in 64 byte blocks */
/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
/* r11: temp3, rdx: temp4, r12 loopcnt */
- /* r10: temp5, rbp: temp6, r14 temp7, r13 temp8 */
+ /* r10: temp5, r15: temp6, r14 temp7, r13 temp8 */
.p2align 4
.Lloop:
source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
source
movq 32(%rdi), %r10
source
- movq 40(%rdi), %rbp
+ movq 40(%rdi), %r15
source
movq 48(%rdi), %r14
source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
adcq %r11, %rax
adcq %rdx, %rax
adcq %r10, %rax
- adcq %rbp, %rax
+ adcq %r15, %rax
adcq %r14, %rax
adcq %r13, %rax
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
dest
movq %r10, 32(%rsi)
dest
- movq %rbp, 40(%rsi)
+ movq %r15, 40(%rsi)
dest
movq %r14, 48(%rsi)
dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
movq 3*8(%rsp), %r12
movq 4*8(%rsp), %r14
movq 5*8(%rsp), %r13
- movq 6*8(%rsp), %rbp
+ movq 6*8(%rsp), %r15
addq $7*8, %rsp
ret
^ permalink raw reply
* [PATCH 4.4 004/190] x86/asm: Dont use RBP as a temporary register in csum_partial_copy_generic()
From: Greg Kroah-Hartman @ 2018-04-11 18:34 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Andrey Konovalov, Josh Poimboeuf,
Cong Wang, David S . Miller, Dmitry Vyukov, Eric Dumazet,
Kostya Serebryany, Linus Torvalds, Marcelo Ricardo Leitner,
Neil Horman, Peter Zijlstra, Thomas Gleixner, Vlad Yasevich,
linux-sctp, netdev, syzkaller, Ingo Molnar, Sasha Levin
In-Reply-To: <20180411183550.114495991@linuxfoundation.org>
4.4-stable review patch. If anyone has any objections, please let me know.
------------------
From: Josh Poimboeuf <jpoimboe@redhat.com>
[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]
Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:
WARNING: kernel stack regs at ffff8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0
The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().
That function saves RBP on the stack and then overwrites it, using it as
a scratch register. That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.
Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: David S . Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: netdev <netdev@vger.kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
arch/x86/lib/csum-copy_64.S | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
movq %r12, 3*8(%rsp)
movq %r14, 4*8(%rsp)
movq %r13, 5*8(%rsp)
- movq %rbp, 6*8(%rsp)
+ movq %r15, 6*8(%rsp)
movq %r8, (%rsp)
movq %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
/* main loop. clear in 64 byte blocks */
/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
/* r11: temp3, rdx: temp4, r12 loopcnt */
- /* r10: temp5, rbp: temp6, r14 temp7, r13 temp8 */
+ /* r10: temp5, r15: temp6, r14 temp7, r13 temp8 */
.p2align 4
.Lloop:
source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
source
movq 32(%rdi), %r10
source
- movq 40(%rdi), %rbp
+ movq 40(%rdi), %r15
source
movq 48(%rdi), %r14
source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
adcq %r11, %rax
adcq %rdx, %rax
adcq %r10, %rax
- adcq %rbp, %rax
+ adcq %r15, %rax
adcq %r14, %rax
adcq %r13, %rax
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
dest
movq %r10, 32(%rsi)
dest
- movq %rbp, 40(%rsi)
+ movq %r15, 40(%rsi)
dest
movq %r14, 48(%rsi)
dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
movq 3*8(%rsp), %r12
movq 4*8(%rsp), %r14
movq 5*8(%rsp), %r13
- movq 6*8(%rsp), %rbp
+ movq 6*8(%rsp), %r15
addq $7*8, %rsp
ret
^ permalink raw reply
* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-11 18:37 UTC (permalink / raw)
To: Christian Brauner
Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180411170333.GA4319@gmail.com>
Christian Brauner <christian.brauner@canonical.com> writes:
> On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> > Yeah, agreed.
>> > But I think the patch is not complete. To guarantee that no non-initial
>> > user namespace actually receives uevents we need to:
>> > 1. only sent uevents to uevent sockets that are located in network
>> > namespaces that are owned by init_user_ns
>> > 2. filter uevents that are sent to sockets in mc_list that have opened a
>> > uevent socket that is owned by init_user_ns *from* a
>> > non-init_user_ns
>> >
>> > We account for 1. by only recording uevent sockets in the global uevent
>> > socket list who are owned by init_user_ns.
>> > But to account for 2. we need to filter by the user namespace who owns
>> > the socket in mc_list. So in addition to that we also need to slightly
>> > change the filter logic in kobj_bcast_filter() I think:
>> >
>> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> > index 22a2c1a98b8f..064d7d29ace5 100644
>> > --- a/lib/kobject_uevent.c
>> > +++ b/lib/kobject_uevent.c
>> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> > return sock_ns != ns;
>> > }
>> >
>> > - return 0;
>> > + /* Check if socket was opened from non-initial user namespace. */
>> > + return sk_user_ns(dsk) != &init_user_ns;
>> > }
>> > #endif
>> >
>> >
>> > But correct me if I'm wrong.
>>
>> You are worrying about NETLINK_LISTEN_ALL_NSID sockets. That has
>> permissions and an explicit opt-in to receiving packets from multiple
>> network namespaces.
>
> I don't think that's what I'm talking about unless that is somehow the
> default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> doing
>
> unshare -U --map-root
>
> then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> uevents. Imho, this should not be possible because I'm in a
> non-init_user_ns. But currently I'm able to - even with the patch to
> come - since the uevent socket in the kernel was created when init_net
> was created and hence is *owned* by the init_user_ns which means it is
> in the list of uevent sockets. Here's a demo of what I mean:
>
> https://asciinema.org/a/175632
Why do you care about this case?
Everyone is allowed to listen to the uevent netlink sockets with or
without user namespaces. So there are no permission issues, and
this is not even a data information leak.
If you don't want programs in your user namespace to have access you
will be able to unshare the network namespace.
Eric
^ permalink raw reply
* Re: [PATCH net v2 0/6] bnxt_en: Fixes for net.
From: David Miller @ 2018-04-11 18:42 UTC (permalink / raw)
To: michael.chan; +Cc: netdev
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>
From: Michael Chan <michael.chan@broadcom.com>
Date: Wed, 11 Apr 2018 11:50:12 -0400
> This bug fix series include NULL pointer fixes in ethtool -x code path
> and in the error clean up path when freeing IRQs, a ring accounting bug
> that missed rings used by the RDMA driver, and 3 bug fixes related to TC
> Flower and VF-reps.
>
> v2: Fixed commit message of patch 4. Changed the pound sign to $ sign
> in front of the ip command.
Yep, that looks better.
Series applied, thanks Michael.
^ permalink raw reply
* Re: [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Jakub Kicinski @ 2018-04-11 18:43 UTC (permalink / raw)
To: Michael Chan, Andy Gospodarek; +Cc: davem, netdev
In-Reply-To: <1523461818-15774-3-git-send-email-michael.chan@broadcom.com>
On Wed, 11 Apr 2018 11:50:14 -0400, Michael Chan wrote:
> @@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
> return false;
> }
>
> + /* Currently source/dest MAC cannot be partial wildcard */
> + if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
> + !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
> + netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
This wouldn't be something to do in net, but how do you feel about
using extack for messages like this?
> + return false;
> + }
^ permalink raw reply
* Re: [PATCH] net/mlx5: remove some extraneous spaces in indentations
From: David Miller @ 2018-04-11 18:43 UTC (permalink / raw)
To: saeedm
Cc: colin.king, jgg, matanb, netdev, linux-rdma, linux-kernel, leon,
kernel-janitors
In-Reply-To: <1523471018.3402.42.camel@mellanox.com>
From: Saeed Mahameed <saeedm@mellanox.com>
Date: Wed, 11 Apr 2018 18:23:41 +0000
> Dave, if it is ok with you I would like to apply this into mlx5-next
> branch so it won't cause any issues with our next pull requests to rdma
> and netdev.
Ok.
^ permalink raw reply
* Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
From: Alexei Starovoitov @ 2018-04-11 18:43 UTC (permalink / raw)
To: Björn Töpel, William Tu
Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
John Fastabend, Jesper Dangaard Brouer, Willem de Bruijn,
Daniel Borkmann, Linux Kernel Network Developers,
Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
Anjali Singhai Jain, Zhang, Qi Z, ravineet.singh
In-Reply-To: <CAJ+HfNiJ+tS6cG4_cCj1syQC8m8OZWM44-NX-hbdRYWQj=98TA@mail.gmail.com>
On 4/11/18 5:17 AM, Björn Töpel wrote:
>
> In the current RFC you are required to create both an Rx and Tx queue
> to bind the socket, which is just weird for your "Rx on one device, Tx
> to another" scenario. I'll fix that in the next RFC.
I would defer on adding new features until the key functionality lands.
imo it's in good shape and I would submit it without RFC tag as soon as
net-next reopens.
^ permalink raw reply
* Re: [PATCH] lan78xx: Don't reset the interface on open
From: David Miller @ 2018-04-11 18:45 UTC (permalink / raw)
To: phil
Cc: woojung.huh, UNGLinuxDriver, agraf, tbogendoerfer, netdev,
linux-usb, linux-kernel
In-Reply-To: <1523362705-30032-1-git-send-email-phil@raspberrypi.org>
From: Phil Elwell <phil@raspberrypi.org>
Date: Tue, 10 Apr 2018 13:18:25 +0100
> Commit 92571a1aae40 ("lan78xx: Connect phy early") moves the PHY
> initialisation into lan78xx_probe, but lan78xx_open subsequently calls
> lan78xx_reset. As well as forcing a second round of link negotiation,
> this reset frequently prevents the phy interrupt from being generated
> (even though the link is up), rendering the interface unusable.
>
> Fix this issue by removing the lan78xx_reset call from lan78xx_open.
>
> Fixes: 92571a1aae40 ("lan78xx: Connect phy early")
> Signed-off-by: Phil Elwell <phil@raspberrypi.org>
Applied and queued up for -stable, thanks.
^ permalink raw reply
* Re: [PATCH net 1/2] tun: set the flags before registering the netdevice
From: David Miller @ 2018-04-11 18:48 UTC (permalink / raw)
To: sd; +Cc: netdev, thaller, sbrivio
In-Reply-To: <0b61a3298ffa3a3fbc4ae0cae1573b368301a211.1523370272.git.sd@queasysnail.net>
From: Sabrina Dubroca <sd@queasysnail.net>
Date: Tue, 10 Apr 2018 16:28:55 +0200
> Otherwise, register_netdevice advertises the creation of the device with
> the default flags, instead of what the user requested.
>
> Reported-by: Thomas Haller <thaller@redhat.com>
> Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over netlink")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Applied and queued up for -stable.
^ permalink raw reply
* Re: [PATCH net 2/2] tun: send netlink notification when the device is modified
From: David Miller @ 2018-04-11 18:48 UTC (permalink / raw)
To: sd; +Cc: netdev, thaller, sbrivio
In-Reply-To: <d88d26513396435befee50bdc740549a4ea546a2.1523370272.git.sd@queasysnail.net>
From: Sabrina Dubroca <sd@queasysnail.net>
Date: Tue, 10 Apr 2018 16:28:56 +0200
> I added dumping of link information about tun devices over netlink in
> commit 1ec010e70593 ("tun: export flags, uid, gid, queue information
> over netlink"), but didn't add the missing netlink notifications when
> the device's exported properties change.
>
> This patch adds notifications when owner/group or flags are modified,
> when queues are attached/detached, and when a tun fd is closed.
>
> Reported-by: Thomas Haller <thaller@redhat.com>
> Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over netlink")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Applied and queued up for -stable.
^ permalink raw reply
* [RFC PATCH] lan78xx: lan78xx_regs[] can be static
From: kbuild test robot @ 2018-04-11 18:51 UTC (permalink / raw)
To: Raghuram Chary J
Cc: kbuild-all, davem, netdev, unglinuxdriver, woojung.huh,
raghuramchary.jallipalli
In-Reply-To: <20180411072450.9809-3-raghuramchary.jallipalli@microchip.com>
Fixes: 1373bfc60bd8 ("lan78xx: Add support to dump lan78xx registers")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
lan78xx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index cbeec78..df5c3eb 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -278,7 +278,7 @@ struct lan78xx_statstage64 {
u64 eee_tx_lpi_time;
};
-u32 lan78xx_regs[] = {
+static u32 lan78xx_regs[] = {
ID_REV,
INT_STS,
HW_CFG,
^ permalink raw reply related
* Re: [PATCH v1 net 2/3] lan78xx: Add support to dump lan78xx registers
From: kbuild test robot @ 2018-04-11 18:51 UTC (permalink / raw)
To: Raghuram Chary J
Cc: kbuild-all, davem, netdev, unglinuxdriver, woojung.huh,
raghuramchary.jallipalli
In-Reply-To: <20180411072450.9809-3-raghuramchary.jallipalli@microchip.com>
Hi Raghuram,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on net-next/master]
[also build test WARNING on v4.16 next-20180411]
[cannot apply to net/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Raghuram-Chary-J/lan78xx-Fixes-and-enhancements/20180411-231134
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
sparse warnings: (new ones prefixed by >>)
>> drivers/net/usb/lan78xx.c:281:5: sparse: symbol 'lan78xx_regs' was not declared. Should it be static?
drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
drivers/net/usb/lan78xx.c:2992:29: sparse: cast to restricted __be16
drivers/net/usb/lan78xx.c:2992:27: sparse: incorrect type in assignment (different base types) @@ expected restricted __wsum [usertype] csum @@ got e] csum @@
drivers/net/usb/lan78xx.c:2992:27: expected restricted __wsum [usertype] csum
drivers/net/usb/lan78xx.c:2992:27: got int
Please review and possibly fold the followup patch.
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
^ permalink raw reply
* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-11 18:57 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <87efjllhcx.fsf@xmission.com>
On Wed, Apr 11, 2018 at 01:37:18PM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
>
> > On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> > Yeah, agreed.
> >> > But I think the patch is not complete. To guarantee that no non-initial
> >> > user namespace actually receives uevents we need to:
> >> > 1. only sent uevents to uevent sockets that are located in network
> >> > namespaces that are owned by init_user_ns
> >> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> >> > uevent socket that is owned by init_user_ns *from* a
> >> > non-init_user_ns
> >> >
> >> > We account for 1. by only recording uevent sockets in the global uevent
> >> > socket list who are owned by init_user_ns.
> >> > But to account for 2. we need to filter by the user namespace who owns
> >> > the socket in mc_list. So in addition to that we also need to slightly
> >> > change the filter logic in kobj_bcast_filter() I think:
> >> >
> >> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> > index 22a2c1a98b8f..064d7d29ace5 100644
> >> > --- a/lib/kobject_uevent.c
> >> > +++ b/lib/kobject_uevent.c
> >> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> > return sock_ns != ns;
> >> > }
> >> >
> >> > - return 0;
> >> > + /* Check if socket was opened from non-initial user namespace. */
> >> > + return sk_user_ns(dsk) != &init_user_ns;
> >> > }
> >> > #endif
> >> >
> >> >
> >> > But correct me if I'm wrong.
> >>
> >> You are worrying about NETLINK_LISTEN_ALL_NSID sockets. That has
> >> permissions and an explicit opt-in to receiving packets from multiple
> >> network namespaces.
> >
> > I don't think that's what I'm talking about unless that is somehow the
> > default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
> > doing
> >
> > unshare -U --map-root
> >
> > then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
> > uevents. Imho, this should not be possible because I'm in a
> > non-init_user_ns. But currently I'm able to - even with the patch to
> > come - since the uevent socket in the kernel was created when init_net
> > was created and hence is *owned* by the init_user_ns which means it is
> > in the list of uevent sockets. Here's a demo of what I mean:
> >
> > https://asciinema.org/a/175632
>
> Why do you care about this case?
It's not so much that I care about this case since any workload that
wants to run a separate udevd will have to unshare the network namespace
and the user namespace for it to make complete sense.
What I do care about is that the two of us are absolutely in the clear
about what semantics we are going to expose to userspace and it seems
that we were talking past each other wrt to this "corner case".
For userspace, it needs to be very clear that the intention is to filter
by *owning user namespace of the network namespace a given task resides
in* and not by user namespace of the task per se. This is what this
corner case basically shows, I think.
Christian
>
> Everyone is allowed to listen to the uevent netlink sockets with or
> without user namespaces. So there are no permission issues, and
> this is not even a data information leak.
>
> If you don't want programs in your user namespace to have access you
> will be able to unshare the network namespace.
>
> Eric
^ permalink raw reply
* Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section
From: Saeed Mahameed @ 2018-04-11 18:59 UTC (permalink / raw)
To: eric.dumazet@gmail.com, davem@davemloft.net; +Cc: netdev@vger.kernel.org
In-Reply-To: <75137dbf-4608-127e-1601-10a3c13e3a32@gmail.com>
On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
>
> On 04/10/2018 10:16 AM, David Miller wrote:
> >
> > The tradeoff here is that now you are doing two unnecessary atomic
> > operations per stats dump.
> >
> > That is what the RCU lock allows us to avoid.
> >
>
> dev_hold() and dev_put() are actually per cpu increment and
> decrements,
> pretty cheap these days.
>
Yes, i am adding only 2 cpu instructions here.
I think the trade-off here is too small and the price to finally have
get_stats64 called from non atomic context is really worth it.
It looks really odd to me that the device chain locks are held for
such long periods, while we already have the means to avoid this, same
goes for rtnl_lock, same trick can work here for many use cases and
many ndos, we are just over protective for no reasons.
> Problem here is that any preemption of the process holding device
> reference
> might trigger warnings in device unregister.
>
This is true for any other place dev_hold is used,
as explained in the commit message dev_hold is used for a very brief
moment before calling the stats ndo and released directly after.
looking at
netdev_wait_allrefs(...)
[...]
msleep(250);
refcnt = netdev_refcnt_read(dev);
if (time_after(jiffies, warning_time + 10 * HZ)) {
pr_emerg("unregister_netdevice: waiting for %s to
become free. Usage count = %d\n",
dev->name, refcnt);
warning_time = jiffies;
}
The holder will get enough time to release the netdev way before the
warning is triggered.
The warning is triggered only if someone holds the dev for more than 10
seconds which is impossible for the stats ndo to take more than this,
in fact i just did a quick measurement and it seems that in average
get_stats64 ndo takes 0.5us !
^ permalink raw reply
* [PULL] virtio: feature pull
From: Michael S. Tsirkin @ 2018-04-11 19:10 UTC (permalink / raw)
To: Linus Torvalds
Cc: kvm, virtualization, netdev, linux-kernel, jonathan.helman, mst
The following changes since commit dc32bb678e103afbcfa4d814489af0566307f528:
vhost: add vsock compat ioctl (2018-03-20 03:17:42 +0200)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to 6c64fe7f2adcee21d7c4247f1ec021fd18428fc4:
virtio_balloon: export hugetlb page allocation counts (2018-04-10 21:23:55 +0300)
----------------------------------------------------------------
virtio: feature
This adds reporting hugepage stats to virtio-balloon.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
----------------------------------------------------------------
Jonathan Helman (1):
virtio_balloon: export hugetlb page allocation counts
drivers/virtio/virtio_balloon.c | 6 ++++++
include/uapi/linux/virtio_balloon.h | 4 +++-
2 files changed, 9 insertions(+), 1 deletion(-)
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox