Netdev List
 help / color / mirror / Atom feed
* Re: [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 18:01 UTC (permalink / raw)
  To: Yonghong Song, daniel, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
	Josef Bacik, Andrey Ignatov
In-Reply-To: <7e388b10-ccea-a2b0-e776-5420c8e7f521@netronome.com>

2018-04-11 16:44 UTC+0100 ~ Quentin Monnet <quentin.monnet@netronome.com>
> 2018-04-10 09:58 UTC-0700 ~ Yonghong Song <yhs@fb.com>
>> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>>> Add documentation for eBPF helper functions to bpf.h user header file.
>>> This documentation can be parsed with the Python script provided in
>>> another commit of the patch series, in order to provide a RST document
>>> that can later be converted into a man page.
>>>
>>> The objective is to make the documentation easily understandable and
>>> accessible to all eBPF developers, including beginners.
>>>
>>> This patch contains descriptions for the following helper functions:
>>>
>>> Helpers from Lawrence:
>>> - bpf_setsockopt()
>>> - bpf_getsockopt()
>>> - bpf_sock_ops_cb_flags_set()
>>>
>>> Helpers from Yonghong:
>>> - bpf_perf_event_read_value()
>>> - bpf_perf_prog_read_value()
>>>
>>> Helper from Josef:
>>> - bpf_override_return()
>>>
>>> Helper from Andrey:
>>> - bpf_bind()
>>>
>>> Cc: Lawrence Brakmo <brakmo@fb.com>
>>> Cc: Yonghong Song <yhs@fb.com>
>>> Cc: Josef Bacik <jbacik@fb.com>
>>> Cc: Andrey Ignatov <rdna@fb.com>
>>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>>> ---

> [...]
> 
> Thanks Yonghong for the review!
> 
> I have a favor to ask of you. I got a bounce for Kaixu Xia's email
> address, and I don't know what alternative email address I could use. I
> CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
> this series), which is rather close to bpf_perf_event_read_value().
> Would you mind having a look at that one too, please? The description is
> not long.

Well I read again the description I wrote, and actually the one for
bpf_perf_evnet_read() is nearly a subset of the one for
perf_event_read_value(). So the same comments that you raised earlier
apply, there's probably nothing more to review. But if you notice that
some important info is missing for bpf_perf_event_read(), I'm interested
too!

Quentin

^ permalink raw reply

* Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Stephen Suryaputra @ 2018-04-11 17:05 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20180411091434.38161a30@xeon-e3>

I read the thread and understand some of the concerns, but these are
useful for network equipments. That's why in this patch, I made it as
configuration option.
If you have feedbacks, I'll appreciate them.

I can resend when net-next is opened again.

Thanks.

On Wed, Apr 11, 2018 at 12:14 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Tue, 10 Apr 2018 22:55:35 -0400
> Stephen Suryaputra <ssuryaextr@gmail.com> wrote:
>
>> This is enhanced from the proposed patch by Igor Maravic in 2011 to
>> support per interface IPv4 stats. The enhancement is mainly adding a
>> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
>>
>> Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
>
>
> Net-next is closed. http://vger.kernel.org/~davem/net-next.html
>
> Also, these kind of statistics have a negative performance impact.

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Christian Brauner @ 2018-04-11 17:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <87tvshlms1.fsf@xmission.com>

On Wed, Apr 11, 2018 at 11:40:14AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@canonical.com> writes:
> 
> > On Tue, Apr 10, 2018 at 10:04:46AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> 
> >> > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> 
> >> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> >> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
> >> >> >> 
> >> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> >> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >> >> >> >>>
> >> >> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >> >> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >> >> >> >>>
> >> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >> >> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >> >> >> >>> network devices. Uevents for network devices only show up in the network
> >> >> >> >> >>> namespaces these devices are moved to or created in.
> >> >> >> >> >>>
> >> >> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >> >> >> >>> into all network namespaces.
> >> >> >> >> >>>
> >> >> >> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >> >> >> >>> partially isolated as they were filtered by user namespaces:
> >> >> >> >> >>>
> >> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >> >> >> >>>
> >> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >> >> >> >>>                 return;
> >> >> >> >> >>>
> >> >> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
> >> >> >> >> >>>                 return;
> >> >> >> >> >>>
> >> >> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >> >> >>>                              CAP_NET_BROADCAST))
> >> >> >> >> >>>         j       return;
> >> >> >> >> >>> }
> >> >> >> >> >>>
> >> >> >> >> >>> The file_ns_capable() check will check whether the caller had
> >> >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >> >> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >> >> >> >>> been the intention all along. But there's one case where this breaks,
> >> >> >> >> >>> namely if a new user namespace is created by root on the host and an
> >> >> >> >> >>> identity mapping is established between root on the host and root in the
> >> >> >> >> >>> new user namespace. Here's a reproducer:
> >> >> >> >> >>>
> >> >> >> >> >>>  sudo unshare -U --map-root
> >> >> >> >> >>>  udevadm monitor -k
> >> >> >> >> >>>  # Now change to initial user namespace and e.g. do
> >> >> >> >> >>>  modprobe kvm
> >> >> >> >> >>>  # or
> >> >> >> >> >>>  rmmod kvm
> >> >> >> >> >>>
> >> >> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >> >> >> >>> host. This seems very anecdotal given that in the general case user
> >> >> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >> >> >> >>> with them.
> >> >> >> >> >>>
> >> >> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >> >> >> >>> make a decision what uevents should be sent.
> >> >> >> >> >>>
> >> >> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >> >> >> >>> in kobj_bcast_filter(). Specifically:
> >> >> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >> >> >> >>>   event will always be filtered.
> >> >> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >> >> >> >>>   the event will be filtered as well.
> >> >> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >> >> >> >>> always only sent to the initial user namespace. The regression potential
> >> >> >> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >> >> >> >>> anything with interesting devices.
> >> >> >> >> >>>
> >> >> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> >> >> >> >>> ---
> >> >> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
> >> >> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >> >> >> >>>
> >> >> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
> >> >> >> >> >>> --- a/lib/kobject_uevent.c
> >> >> >> >> >>> +++ b/lib/kobject_uevent.c
> >> >> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >> >> >> >> >>>  		return sock_ns != ns;
> >> >> >> >> >>>  	}
> >> >> >> >> >>>  
> >> >> >> >> >>> -	return 0;
> >> >> >> >> >>> +	/*
> >> >> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
> >> >> >> >> >>> +	 * namespace below.
> >> >> >> >> >>> +	 */
> >> >> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
> >> >> >> >> >>> +		return 1;
> >> >> >> >> >>> +
> >> >> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
> >> >> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
> >> >> >> >> >>>  }
> >> >> >> >> >>>  #endif
> >> >> >> >> >>
> >> >> >> >> >> So, this prohibits to listen events of all devices except network-related
> >> >> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
> >> >> >> >> > 
> >> >> >> >> > No, this is not correct: As it is right now *without my patch* no
> >> >> >> >> > non-initial user namespace is receiving *any uevents* but those
> >> >> >> >> > specifically namespaced such as those for network devices. This patch
> >> >> >> >> > doesn't change that at all. The commit message outlines this in detail
> >> >> >> >> > how this comes about.
> >> >> >> >> > There is only one case where this currently breaks and this is as I
> >> >> >> >> > outlined explicitly in my commit message when you create a new user
> >> >> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
> >> >> >> >> 
> >> >> >> >> Could you please point the place, where non-initial user namespaces are filtered?
> >> >> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
> >> >> >> >> Now it will return 1 sometimes.
> >> >> >> >
> >> >> >> > Oh sure, it's in the commit message though. The callchain is
> >> >> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
> >> >> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
> >> >> >> > net/netlink/af_netlink.c:do_one_broadcast():
> >> >> >> >
> >> >> >> > This codepiece will check whether the openened socket holds
> >> >> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
> >> >> >> > which it won't because we don't have device namespaces and all devices
> >> >> >> > belong to the initial set of namespaces.
> >> >> >> >
> >> >> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >> >> >                              CAP_NET_BROADCAST))
> >> >> >> >         j       return;
> >> >> >> >
> >> >> >> 
> >> >> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
> >> >> >> on their socket and has had someone with the appropriate privileges
> >> >> >> assign a peerrnetid.
> >> >> >> 
> >> >> >> All of which is to say that file_ns_capable is not nearly as applicable
> >> >> >> as it might be, and if you can pass the other two checks I think it is
> >> >> >> pointless (because the peernet attributes are not generated for
> >> >> >> kobj_uevents) but valid to receive events from outside your network
> >> >> >> namespace.
> >> >> >> 
> >> >> >> 
> >> >> >> I might be missing something but I don't see anything excluding network
> >> >> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
> >> >> >> logic.
> >> >> >> 
> >> >> >> The uevent_sock_list has one entry per network namespace. And
> >> >> >> kobject_uevent_net_broacast appears to walk each one.
> >> >> >> 
> >> >> >> I had a memory of filtering uevent messages and I had a memory
> >> >> >> that the netlink_has_listeners had a per network namespace effect.
> >> >> >> Neither seems true from my inspection of the code tonight.
> >> >> >> 
> >> >> >> If we are not filtering ordinary uevents at least at the user namespace
> >> >> >> level that does seem to be at least a nuisance.
> >> >> >> 
> >> >> >> 
> >> >> >> Christian can you dig a little deeper into this.  I have a feeling that
> >> >> >> there are some real efficiency improvements that we could make to
> >> >> >> kobject_uevent_net_broadcast if nothing else.
> >> >> >> 
> >> >> >> Perhaps you could see where uevents are broadcast by poking
> >> >> >> the sysfs uevent of an existing device, and triggering a retransmit.
> >> >> >
> >> >> > @Eric, so I did some intensive testing over the weekend and forget everything I
> >> >> > said before. Uevents are not filtered by the kernel at the moment. This is
> >> >> > currently - apart from network devices - a pure userspace thing. Specifically,
> >> >> > anyone  on the host can listen to all uevents from everywhere. It's neither
> >> >> > filtered by user nor by network namespace. The fact that I used
> >> >> >
> >> >> > udevadm --debug monitor
> >> >> >
> >> >> > to test my prior hypothesis was what led me to believe that uevents are already
> >> >> > correctly filtered.
> >> >> > Instead, what is actually happening is that every udev implementation out there
> >> >> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
> >> >> > Specifically,
> >> >> 
> >> >> Yes.  I remember something of the sort.  I believe udev also filters to
> >> >> ensure that the netlink port id == 0 to ensure the message came from
> >> >> the kernel and was not spoofed in any way.
> >> >> 
> >> >> > - legacy standalone udevd:
> >> >> >   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
> >> >> > - eudevd
> >> >> >   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
> >> >> > - systemd-udevd
> >> >> >   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
> >> >> > - ueventd (Android)
> >> >> >   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
> >> >> >
> >> >> > For all of those I could trace this behavior back to the first released
> >> >> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
> >> >> > I could trace it back to the first version that is still available on
> >> >> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
> >> >> > trivially true that it has the same behavior from the beginning.)
> >> >> >
> >> >> > In any case, userspace udev is not making use of uevents at all right now since
> >> >> > any uid != 0 events are **explicitly** discarded.
> >> >> > The fact that you receive uevents for
> >> >> >
> >> >> > sudo unshare -U --map-root -n
> >> >> > udevadm --debug monitor
> >> >> >
> >> >> > is simply explained by the fact that container(0) <=> host(0) at which point
> >> >> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
> >> >> > discard it.
> >> >> > The use case for receiving uevents in containers/user namespaces is definitely
> >> >> > there but that's what the uevent injection patch series was for that we merged.
> >> >> > This is a much safer and saner solution.
> >> >> > The fact that all udev implementations filter uevents by uid != 0 very much
> >> >> > seems like a security mechanism in userspace that we probably should provide by
> >> >> > isolating uevents based on user and/or network namespaces.
> >> >> 
> >> >> So in summary.  Uevents are filtered in a user namespace (by userspace)
> >> >> because the received uid != 0.  It instead == 65534 == "nobody" because
> >> >> the global root uid is not mapped.
> >> >
> >> > Exactly.
> >> >
> >> >> 
> >> >> Which means that we can modify the kernel to not send to all network
> >> >> namespaces whose user_ns != &init_user_ns because we know that userspace
> >> >> will ignore the message because of the uid anyway.  Which means when
> >> >
> >> > Yes.
> >> >
> >> >> net-next reopens you can send that patch.  But please base it on just
> >> >> not including network namespaces in the list, as that is much more
> >> >> efficient than adding more conditions to the filter.
> >> >
> >> > I'll send a patch out once net-next reopens. I'll also make sure to
> >> > inform all udev userspace implementations of the change. It won't affect
> >> > them but it is nice for them to know that they're safer now.
> >> 
> >> The real danger is in a user namespace or in a container really is too
> >> many daemons responding to events will generate a thundering hurd of
> >> activity when there is really nothing to do.
> >> 
> >> > Something like this (Proper commit message and so on will be added once
> >> > I sent this out.):
> >> 
> >> Exactly.
> >> 
> >> I would make the comment say something like: "ignore all but the initial
> >> user namespace".
> >
> > Yeah, agreed.
> > But I think the patch is not complete. To guarantee that no non-initial
> > user namespace actually receives uevents we need to:
> > 1. only sent uevents to uevent sockets that are located in network
> >    namespaces that are owned by init_user_ns
> > 2. filter uevents that are sent to sockets in mc_list that have opened a
> >    uevent socket that is owned by init_user_ns *from* a
> >    non-init_user_ns
> >
> > We account for 1. by only recording uevent sockets in the global uevent
> > socket list who are owned by init_user_ns.
> > But to account for 2. we need to filter by the user namespace who owns
> > the socket in mc_list. So in addition to that we also need to slightly
> > change the filter logic in kobj_bcast_filter() I think:
> >
> > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> > index 22a2c1a98b8f..064d7d29ace5 100644
> > --- a/lib/kobject_uevent.c
> > +++ b/lib/kobject_uevent.c
> > @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
> >  		return sock_ns != ns;
> >  	}
> >  
> > -	return 0;
> > + 	/* Check if socket was opened from non-initial user namespace. */
> > + 	return sk_user_ns(dsk) != &init_user_ns;
> >  }
> >  #endif
> >  
> >
> > But correct me if I'm wrong.
> 
> You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
> permissions and an explicit opt-in to receiving packets from multiple
> network namespaces.

I don't think that's what I'm talking about unless that is somehow the
default for NETLINK_KOBJECT_UEVENT sockets. What I'm worried about is
doing

unshare -U --map-root

then opening a NETLINK_KOBJECT_UEVENT socket and starting to listen to
uevents. Imho, this should not be possible because I'm in a
non-init_user_ns. But currently I'm able to - even with the patch to
come - since the uevent socket in the kernel was created when init_net
was created and hence is *owned* by the init_user_ns which means it is
in the list of uevent sockets. Here's a demo of what I mean:

https://asciinema.org/a/175632

Christian

> 
> There is a netlink CMSG that tells you which network namespace the
> packet comes from.  Something any sane person will use if they set
> NETLINK_LISTEN_ALL_NSID.
> 
> There is no problem of confusion.  There is no problem of permissions.
> So we don't need to worry about preventing NETLINK_LISTEN_ALL_NSID
> listeners from seeing the uevents.
> 
> Eric

^ permalink raw reply

* Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr
From: David Miller @ 2018-04-11 16:40 UTC (permalink / raw)
  To: lucien.xin; +Cc: marcelo.leitner, nhorman, netdev, linux-sctp
In-Reply-To: <CADvbK_egknPZCHRkZoLZ2MuGhRyaYKcMCaqG9Apt=QJ3-o-D6A@mail.gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Thu, 12 Apr 2018 00:16:58 +0800

> What do you think of:
> 
> static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
>                             const union sctp_addr *addr2)
> {
>         return __sctp_v6_cmp_addr(addr1, addr2) &&
>                addr1->v6.sin_port == addr2->v6.sin_port;
> }
> 
> (v6.sin_port and v4.sin_port have the same offset in union sctp_addr,
>  we've exploited this in many places in SCTP)

>From my perspective this is OK.

^ permalink raw reply

* Re: [PATCH net-next] netns: filter uevents correctly
From: Eric W. Biederman @ 2018-04-11 16:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kirill Tkhai, davem, gregkh, netdev, linux-kernel, avagin, serge
In-Reply-To: <20180411090907.GA20340@gmail.com>

Christian Brauner <christian.brauner@canonical.com> writes:

> On Tue, Apr 10, 2018 at 10:04:46AM -0500, Eric W. Biederman wrote:
>> Christian Brauner <christian.brauner@canonical.com> writes:
>> 
>> > On Mon, Apr 09, 2018 at 06:21:31PM -0500, Eric W. Biederman wrote:
>> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >> 
>> >> > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> >> >> Christian Brauner <christian.brauner@canonical.com> writes:
>> >> >> 
>> >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> >> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>> >> >> >> >>>
>> >> >> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
>> >> >> >> >>> Over time the set of uevents that get sent into all network namespaces has
>> >> >> >> >>> shrunk. We have now reached the point where hotplug events for all devices
>> >> >> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >> >> >>>
>> >> >> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
>> >> >> >> >>> does not match the namespace tag of the netlink socket. One example are
>> >> >> >> >>> network devices. Uevents for network devices only show up in the network
>> >> >> >> >>> namespaces these devices are moved to or created in.
>> >> >> >> >>>
>> >> >> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >> >> >>> associated with it will not be filtered and we will *try* to broadcast it
>> >> >> >> >>> into all network namespaces.
>> >> >> >> >>>
>> >> >> >> >>> The original patchset was written in 2010 before user namespaces were a
>> >> >> >> >>> thing. With the introduction of user namespaces sending out uevents became
>> >> >> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >> >> >>>
>> >> >> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >> >> >>>
>> >> >> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >> >> >>>         if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >> >> >>>                 return;
>> >> >> >> >>>
>> >> >> >> >>>         if (!peernet_has_id(sock_net(sk), p->net))
>> >> >> >> >>>                 return;
>> >> >> >> >>>
>> >> >> >> >>>         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >> >> >>>                              CAP_NET_BROADCAST))
>> >> >> >> >>>         j       return;
>> >> >> >> >>> }
>> >> >> >> >>>
>> >> >> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>> >> >> >> >>> namespace of interest. This check is fine in general but seems insufficient
>> >> >> >> >>> to me when paired with uevents. The reason is that devices always belong to
>> >> >> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >> >> >>> namespace tag should never be sent into another user namespace. This has
>> >> >> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >> >> >>> identity mapping is established between root on the host and root in the
>> >> >> >> >>> new user namespace. Here's a reproducer:
>> >> >> >> >>>
>> >> >> >> >>>  sudo unshare -U --map-root
>> >> >> >> >>>  udevadm monitor -k
>> >> >> >> >>>  # Now change to initial user namespace and e.g. do
>> >> >> >> >>>  modprobe kvm
>> >> >> >> >>>  # or
>> >> >> >> >>>  rmmod kvm
>> >> >> >> >>>
>> >> >> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
>> >> >> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >> >> >>> namespaces do not see any uevents and also can't really do anything useful
>> >> >> >> >>> with them.
>> >> >> >> >>>
>> >> >> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
>> >> >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >> >> >>> namespace of the network namespace of the netlink socket) userspace process
>> >> >> >> >>> make a decision what uevents should be sent.
>> >> >> >> >>>
>> >> >> >> >>> This makes me think that we should simply ensure that uevents for kobjects
>> >> >> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
>> >> >> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
>> >> >> >> >>>   event will always be filtered.
>> >> >> >> >>> - If the network namespace the uevent socket belongs to was created in the
>> >> >> >> >>>   initial user namespace but was opened from a non-initial user namespace
>> >> >> >> >>>   the event will be filtered as well.
>> >> >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
>> >> >> >> >>> always only sent to the initial user namespace. The regression potential
>> >> >> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >> >> >>> anything with interesting devices.
>> >> >> >> >>>
>> >> >> >> >>> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
>> >> >> >> >>> ---
>> >> >> >> >>>  lib/kobject_uevent.c | 10 +++++++++-
>> >> >> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >> >> >>>
>> >> >> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>> >> >> >> >>> index 15ea216a67ce..cb98cddb6e3b 100644
>> >> >> >> >>> --- a/lib/kobject_uevent.c
>> >> >> >> >>> +++ b/lib/kobject_uevent.c
>> >> >> >> >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>> >> >> >> >>>  		return sock_ns != ns;
>> >> >> >> >>>  	}
>> >> >> >> >>>  
>> >> >> >> >>> -	return 0;
>> >> >> >> >>> +	/*
>> >> >> >> >>> +	 * The kobject does not carry a namespace tag so filter by user
>> >> >> >> >>> +	 * namespace below.
>> >> >> >> >>> +	 */
>> >> >> >> >>> +	if (sock_net(dsk)->user_ns != &init_user_ns)
>> >> >> >> >>> +		return 1;
>> >> >> >> >>> +
>> >> >> >> >>> +	/* Check if socket was opened from non-initial user namespace. */
>> >> >> >> >>> +	return sk_user_ns(dsk) != &init_user_ns;
>> >> >> >> >>>  }
>> >> >> >> >>>  #endif
>> >> >> >> >>
>> >> >> >> >> So, this prohibits to listen events of all devices except network-related
>> >> >> >> >> in containers? If it's so, I don't think it's a good solution. Uevents is not
>> >> >> >> > 
>> >> >> >> > No, this is not correct: As it is right now *without my patch* no
>> >> >> >> > non-initial user namespace is receiving *any uevents* but those
>> >> >> >> > specifically namespaced such as those for network devices. This patch
>> >> >> >> > doesn't change that at all. The commit message outlines this in detail
>> >> >> >> > how this comes about.
>> >> >> >> > There is only one case where this currently breaks and this is as I
>> >> >> >> > outlined explicitly in my commit message when you create a new user
>> >> >> >> > namespace and map container(0) -> host(0). This patch fixes this.
>> >> >> >> 
>> >> >> >> Could you please point the place, where non-initial user namespaces are filtered?
>> >> >> >> I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted".
>> >> >> >> Now it will return 1 sometimes.
>> >> >> >
>> >> >> > Oh sure, it's in the commit message though. The callchain is
>> >> >> > lib/kobject_uevent.c:kobject_uevent_net_broadcast() ->
>> >> >> > nnet/netlink/af_netlink.c:netlink_broadcast_filtered() ->
>> >> >> > net/netlink/af_netlink.c:do_one_broadcast():
>> >> >> >
>> >> >> > This codepiece will check whether the openened socket holds
>> >> >> > CAP_NET_BROADCAST in the user namespace of the target network namespace
>> >> >> > which it won't because we don't have device namespaces and all devices
>> >> >> > belong to the initial set of namespaces.
>> >> >> >
>> >> >> >         if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >> >                              CAP_NET_BROADCAST))
>> >> >> >         j       return;
>> >> >> >
>> >> >> 
>> >> >> The above that only applies if someone has set NETLINK_F_LISTEN_ALL_NSID
>> >> >> on their socket and has had someone with the appropriate privileges
>> >> >> assign a peerrnetid.
>> >> >> 
>> >> >> All of which is to say that file_ns_capable is not nearly as applicable
>> >> >> as it might be, and if you can pass the other two checks I think it is
>> >> >> pointless (because the peernet attributes are not generated for
>> >> >> kobj_uevents) but valid to receive events from outside your network
>> >> >> namespace.
>> >> >> 
>> >> >> 
>> >> >> I might be missing something but I don't see anything excluding network
>> >> >> namespaces owned by !init_user_ns excluded from the kobject_uevent
>> >> >> logic.
>> >> >> 
>> >> >> The uevent_sock_list has one entry per network namespace. And
>> >> >> kobject_uevent_net_broacast appears to walk each one.
>> >> >> 
>> >> >> I had a memory of filtering uevent messages and I had a memory
>> >> >> that the netlink_has_listeners had a per network namespace effect.
>> >> >> Neither seems true from my inspection of the code tonight.
>> >> >> 
>> >> >> If we are not filtering ordinary uevents at least at the user namespace
>> >> >> level that does seem to be at least a nuisance.
>> >> >> 
>> >> >> 
>> >> >> Christian can you dig a little deeper into this.  I have a feeling that
>> >> >> there are some real efficiency improvements that we could make to
>> >> >> kobject_uevent_net_broadcast if nothing else.
>> >> >> 
>> >> >> Perhaps you could see where uevents are broadcast by poking
>> >> >> the sysfs uevent of an existing device, and triggering a retransmit.
>> >> >
>> >> > @Eric, so I did some intensive testing over the weekend and forget everything I
>> >> > said before. Uevents are not filtered by the kernel at the moment. This is
>> >> > currently - apart from network devices - a pure userspace thing. Specifically,
>> >> > anyone  on the host can listen to all uevents from everywhere. It's neither
>> >> > filtered by user nor by network namespace. The fact that I used
>> >> >
>> >> > udevadm --debug monitor
>> >> >
>> >> > to test my prior hypothesis was what led me to believe that uevents are already
>> >> > correctly filtered.
>> >> > Instead, what is actually happening is that every udev implementation out there
>> >> > is discarding uevents that were send by uids != 0 in the CMSG_DATA.
>> >> > Specifically,
>> >> 
>> >> Yes.  I remember something of the sort.  I believe udev also filters to
>> >> ensure that the netlink port id == 0 to ensure the message came from
>> >> the kernel and was not spoofed in any way.
>> >> 
>> >> > - legacy standalone udevd:
>> >> >   https://git.kernel.org/pub/scm/linux/hotplug/udev.git/snapshot/udev-062.tar.gz
>> >> > - eudevd
>> >> >   https://github.com/gentoo/eudev/blob/6f630d32bf494a457171b3f99e329592497bf271/src/libudev/libudev-monitor.c#L645
>> >> > - systemd-udevd
>> >> >   https://github.com/systemd/systemd/blob/e89ab7f219a399ab719c78cf43c07c0da60bd151/src/libudev/libudev-monitor.c#L656
>> >> > - ueventd (Android)
>> >> >   https://android.googlesource.com/platform/system/core.git/+/android-8.1.0_r22/libcutils/uevent.c#81
>> >> >
>> >> > For all of those I could trace this behavior back to the first released
>> >> > version. (To be precise, for legacy udevd that eventually became systemd-udevd
>> >> > I could trace it back to the first version that is still available on
>> >> > git.kernel.org which is 062. Since eudevd is a fork of systemd-udevd it is
>> >> > trivially true that it has the same behavior from the beginning.)
>> >> >
>> >> > In any case, userspace udev is not making use of uevents at all right now since
>> >> > any uid != 0 events are **explicitly** discarded.
>> >> > The fact that you receive uevents for
>> >> >
>> >> > sudo unshare -U --map-root -n
>> >> > udevadm --debug monitor
>> >> >
>> >> > is simply explained by the fact that container(0) <=> host(0) at which point
>> >> > the uid in CMSG_DATA will be 0 in the new user namespace and udev will not
>> >> > discard it.
>> >> > The use case for receiving uevents in containers/user namespaces is definitely
>> >> > there but that's what the uevent injection patch series was for that we merged.
>> >> > This is a much safer and saner solution.
>> >> > The fact that all udev implementations filter uevents by uid != 0 very much
>> >> > seems like a security mechanism in userspace that we probably should provide by
>> >> > isolating uevents based on user and/or network namespaces.
>> >> 
>> >> So in summary.  Uevents are filtered in a user namespace (by userspace)
>> >> because the received uid != 0.  It instead == 65534 == "nobody" because
>> >> the global root uid is not mapped.
>> >
>> > Exactly.
>> >
>> >> 
>> >> Which means that we can modify the kernel to not send to all network
>> >> namespaces whose user_ns != &init_user_ns because we know that userspace
>> >> will ignore the message because of the uid anyway.  Which means when
>> >
>> > Yes.
>> >
>> >> net-next reopens you can send that patch.  But please base it on just
>> >> not including network namespaces in the list, as that is much more
>> >> efficient than adding more conditions to the filter.
>> >
>> > I'll send a patch out once net-next reopens. I'll also make sure to
>> > inform all udev userspace implementations of the change. It won't affect
>> > them but it is nice for them to know that they're safer now.
>> 
>> The real danger is in a user namespace or in a container really is too
>> many daemons responding to events will generate a thundering hurd of
>> activity when there is really nothing to do.
>> 
>> > Something like this (Proper commit message and so on will be added once
>> > I sent this out.):
>> 
>> Exactly.
>> 
>> I would make the comment say something like: "ignore all but the initial
>> user namespace".
>
> Yeah, agreed.
> But I think the patch is not complete. To guarantee that no non-initial
> user namespace actually receives uevents we need to:
> 1. only sent uevents to uevent sockets that are located in network
>    namespaces that are owned by init_user_ns
> 2. filter uevents that are sent to sockets in mc_list that have opened a
>    uevent socket that is owned by init_user_ns *from* a
>    non-init_user_ns
>
> We account for 1. by only recording uevent sockets in the global uevent
> socket list who are owned by init_user_ns.
> But to account for 2. we need to filter by the user namespace who owns
> the socket in mc_list. So in addition to that we also need to slightly
> change the filter logic in kobj_bcast_filter() I think:
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 22a2c1a98b8f..064d7d29ace5 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -251,7 +251,8 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
>  		return sock_ns != ns;
>  	}
>  
> -	return 0;
> + 	/* Check if socket was opened from non-initial user namespace. */
> + 	return sk_user_ns(dsk) != &init_user_ns;
>  }
>  #endif
>  
>
> But correct me if I'm wrong.

You are worrying about NETLINK_LISTEN_ALL_NSID sockets.  That has
permissions and an explicit opt-in to receiving packets from multiple
network namespaces.

There is a netlink CMSG that tells you which network namespace the
packet comes from.  Something any sane person will use if they set
NETLINK_LISTEN_ALL_NSID.

There is no problem of confusion.  There is no problem of permissions.
So we don't need to worry about preventing NETLINK_LISTEN_ALL_NSID
listeners from seeing the uevents.

Eric

^ permalink raw reply

* Re: [PATCH] net: dsa: b53: Replace mdelay with msleep in b53_switch_reset_gpio
From: Florian Fainelli @ 2018-04-11 16:19 UTC (permalink / raw)
  To: Jia-Ju Bai, Phil Reid, andrew, vivien.didelot; +Cc: netdev, linux-kernel
In-Reply-To: <7c9dffb7-d55a-98f5-4024-7e2f0c704a56@gmail.com>

On 04/11/2018 12:14 AM, Jia-Ju Bai wrote:
> 
> 
> On 2018/4/11 13:30, Phil Reid wrote:
>> On 11/04/2018 09:51, Jia-Ju Bai wrote:
>>> b53_switch_reset_gpio() is never called in atomic context.
>>>
>>> The call chain ending up at b53_switch_reset_gpio() is:
>>> [1] b53_switch_reset_gpio() <- b53_switch_reset() <-
>>>     b53_reset_switch() <- b53_setup()
>>>
>>> b53_switch_reset_gpio() is set as ".setup" in struct dsa_switch_ops.
>>> This function is not called in atomic context.
>>>
>>> Despite never getting called from atomic context,
>>> b53_switch_reset_gpio()
>>> calls mdelay() to busily wait.
>>> This is not necessary and can be replaced with msleep() to
>>> avoid busy waiting.
>>>
>>> This is found by a static analysis tool named DCNS written by myself.
>>> And I also manually check it.
>>>
>>> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
>>> ---
>>>   drivers/net/dsa/b53/b53_common.c | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/dsa/b53/b53_common.c
>>> b/drivers/net/dsa/b53/b53_common.c
>>> index 274f367..e070ff6 100644
>>> --- a/drivers/net/dsa/b53/b53_common.c
>>> +++ b/drivers/net/dsa/b53/b53_common.c
>>> @@ -597,10 +597,10 @@ static void b53_switch_reset_gpio(struct
>>> b53_device *dev)
>>>       /* Reset sequence: RESET low(50ms)->high(20ms)
>>>        */
>>>       gpio_set_value(gpio, 0);
>>> -    mdelay(50);
>>> +    msleep(50);
>>>         gpio_set_value(gpio, 1);
>>> -    mdelay(20);
>>> +    msleep(20);
>>>         dev->current_page = 0xff;
>>>   }
>>>
>> Would that also imply gpio_set_value could be gpio_set_value_cansleep?
>>
> 
> Yes, I think gpio_set_value_cansleep() is okay here?
> Do I need to send a V2 patch to replace gpio_set_value()?

Yes, I would lump these two changes in the same patch since this is
effectively about solving sleeping vs. non sleeping operations.

Thanks!
-- 
Florian

^ permalink raw reply

* Re: [PATCH net] sctp: do not check port in sctp_inet6_cmp_addr
From: Xin Long @ 2018-04-11 16:16 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: Neil Horman, network dev, linux-sctp, davem
In-Reply-To: <20180411145910.GC3711@localhost.localdomain>

On Wed, Apr 11, 2018 at 10:59 PM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Wed, Apr 11, 2018 at 10:36:07AM -0400, Neil Horman wrote:
>> On Wed, Apr 11, 2018 at 08:58:05PM +0800, Xin Long wrote:
>> > pf->cmp_addr() is called before binding a v6 address to the sock. It
>> > should not check ports, like in sctp_inet_cmp_addr.
>> >
>> > But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
>> > sctp_v6_cmp_addr where it also compares the ports.
>> >
>> > This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind
>> > multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp:
>> > lack the check for ports in sctp_v6_cmp_addr").
>> >
>> > This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr,
>> > but do the proper check for both v6 addrs and v4mapped addrs.
>> >
>> > Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr")
>> > Reported-by: Jianwen Ji <jiji@redhat.com>
>> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> > ---
>> >  net/sctp/ipv6.c | 27 ++++++++++++++++++++++++---
>> >  1 file changed, 24 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
>> > index f1fc48e..be4b72c 100644
>> > --- a/net/sctp/ipv6.c
>> > +++ b/net/sctp/ipv6.c
>> > @@ -846,8 +846,8 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
>> >                            const union sctp_addr *addr2,
>> >                            struct sctp_sock *opt)
>> >  {
>> > -   struct sctp_af *af1, *af2;
>> >     struct sock *sk = sctp_opt2sk(opt);
>> > +   struct sctp_af *af1, *af2;
>> >
>> >     af1 = sctp_get_af_specific(addr1->sa.sa_family);
>> >     af2 = sctp_get_af_specific(addr2->sa.sa_family);
>> > @@ -863,10 +863,31 @@ static int sctp_inet6_cmp_addr(const union sctp_addr *addr1,
>> >     if (sctp_is_any(sk, addr1) || sctp_is_any(sk, addr2))
>> >             return 1;
>> >
>> > -   if (addr1->sa.sa_family != addr2->sa.sa_family)
>> > +   if (addr1->sa.sa_family != addr2->sa.sa_family) {
>> > +           if (addr1->sa.sa_family == AF_INET &&
>> > +               addr2->sa.sa_family == AF_INET6 &&
>> > +               ipv6_addr_v4mapped(&addr2->v6.sin6_addr))
>> > +                   if (addr2->v6.sin6_addr.s6_addr32[3] ==
>> > +                       addr1->v4.sin_addr.s_addr)
>> > +                           return 1;
>> > +           if (addr2->sa.sa_family == AF_INET &&
>> > +               addr1->sa.sa_family == AF_INET6 &&
>> > +               ipv6_addr_v4mapped(&addr1->v6.sin6_addr))
>> > +                   if (addr1->v6.sin6_addr.s6_addr32[3] ==
>> > +                       addr2->v4.sin_addr.s_addr)
>> > +                           return 1;
>> > +           return 0;
>> > +   }
>> > +
>> > +   if (!ipv6_addr_equal(&addr1->v6.sin6_addr, &addr2->v6.sin6_addr))
>> > +           return 0;
>> > +
>> > +   if ((ipv6_addr_type(&addr1->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) &&
>> > +       addr1->v6.sin6_scope_id && addr2->v6.sin6_scope_id &&
>> > +       addr1->v6.sin6_scope_id != addr2->v6.sin6_scope_id)
>> >             return 0;
>> >
>> > -   return af1->cmp_addr(addr1, addr2);
>> > +   return 1;
>> >  }
>> >
>> >  /* Verify that the provided sockaddr looks bindable.   Common verification,
>> > --
>> > 2.1.0
>> >
>> This looks correct to me, but is it worth duplicating the comparison code like
>> this from the cmp_addr function?  It might be more worthwhile to add a flag to
>> the cmp_addr method to direct weather it needs to check port values or not.
>> That way you could continue to use the cmp_addr function here.
>
> Adding a flag into sctp_v6_cmp_addr will get us a terrible code to
> read. It's already not one of the best looking part of it. Maybe
> still duplicate part of it it, but at 'af' level? As in:
> - af->cmp_addr
> - af->cmp_addr_port
>
What do you think of:

static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
                            const union sctp_addr *addr2)
{
        return __sctp_v6_cmp_addr(addr1, addr2) &&
               addr1->v6.sin_port == addr2->v6.sin_port;
}

(v6.sin_port and v4.sin_port have the same offset in union sctp_addr,
 we've exploited this in many places in SCTP)

^ permalink raw reply

* Re: [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init
From: James Bottomley @ 2018-04-11 16:16 UTC (permalink / raw)
  To: Jia-Ju Bai, davem, stephen, johannes.berg, arvind.yadav.cs,
	dhowells
  Cc: netdev, linux-parisc, linux-kernel
In-Reply-To: <1523461182-5897-1-git-send-email-baijiaju1990@gmail.com>

On Wed, 2018-04-11 at 23:39 +0800, Jia-Ju Bai wrote:
> de4x5_hw_init() is never called in atomic context.
> 
> de4x5_hw_init() is only called by de4x5_pci_probe(), which is only 
> set as ".probe" in struct pci_driver.
> 
> Despite never getting called from atomic context, de4x5_hw_init() 
> calls mdelay() to busily wait. This is not necessary and can be
> replaced with usleep_range() to  avoid busy waiting.
> 
> This is found by a static analysis tool named DCNS written by myself.
> And I also manually check it.

Did you actually test this?  The usual reason for wanting m/udelay is
that the timing must be exact.  The driver is filled with mdelay()s for
this reason.  The one you've picked on is in the init path so it won't
affect the runtime in any way.  I also don't think we have the hrtimer
machinery for usleep_range() to work properly on parisc, so I don't
think the replacement works.

James

^ permalink raw reply

* Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Stephen Hemminger @ 2018-04-11 16:14 UTC (permalink / raw)
  To: Stephen Suryaputra; +Cc: netdev
In-Reply-To: <1523415335-17154-1-git-send-email-ssuryaextr@gmail.com>

On Tue, 10 Apr 2018 22:55:35 -0400
Stephen Suryaputra <ssuryaextr@gmail.com> wrote:

> This is enhanced from the proposed patch by Igor Maravic in 2011 to
> support per interface IPv4 stats. The enhancement is mainly adding a
> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
> 
> Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>


Net-next is closed. http://vger.kernel.org/~davem/net-next.html

Also, these kind of statistics have a negative performance impact.

^ permalink raw reply

* 4.15.13 kernel panic, ip_rcv_finish, nf_xfrm_me_harder warnings continue to fill dmesg
From: Denys Fedoryshchenko @ 2018-04-11 15:51 UTC (permalink / raw)
  To: Linux Kernel Network Developers

Apr 11 18:01:34[99194.935520] general protection fault: 0000 [#1] SMP
Apr 11 18:01:34[99194.935998] Modules linked in: pppoe pppox ppp_generic 
slhc ip_set_hash_net xt_nat xt_string xt_connmark xt_TCPMSS xt_mark 
xt_CT xt_set xt_tcpudp ip_set_bitmap_port ip_set nfnetlink
iptable_raw iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 
netconsole configfs 8021q garp mrp stp llc ixgbe dca ipv6
Apr 11 18:01:34[99194.938313] CPU: 23 PID: 150 Comm: ksoftirqd/23 
Tainted: G        W        4.15.13-build-0135 #4
Apr 11 18:01:34[99194.939258] Hardware name: Intel Corporation 
S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
Apr 11 18:01:34[99194.940189] RIP: 0010:ip_rcv_finish+0x2b5/0x2e5
Apr 11 18:01:34[99194.940716] RSP: 0018:ffffc9000cad7cf8 EFLAGS: 
00010286
Apr 11 18:01:34[99194.941214] RAX: ffff00e2ffff476d RBX: 
ffff88178f944400 RCX: ffff88179a32a800
Apr 11 18:01:34[99194.941741] RDX: ffff88178f944400 RSI: 
0000000000000000 RDI: ffff88178f944400
Apr 11 18:01:34[99194.942234] RBP: ffff882fd580d000 R08: 
0000000000000001 R09: ffff882fd034ee00
Apr 11 18:01:34[99194.942771] R10: ffffc9000cad7b58 R11: 
00000000e92316b9 R12: ffff88179a32a8d6
Apr 11 18:01:34[99194.943286] R13: ffff882fd580d000 R14: 
0000000000ea0008 R15: ffff882fd580d078
Apr 11 18:01:34[99194.943821] FS:  0000000000000000(0000) 
GS:ffff88303fcc0000(0000) knlGS:0000000000000000
Apr 11 18:01:34[99194.944779] CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Apr 11 18:01:34[99194.945287] CR2: 00007f8bb37888f0 CR3: 
000000303e209003 CR4: 00000000001606e0
Apr 11 18:01:34[99194.945808] Call Trace:
Apr 11 18:01:34[99194.946307]  ip_rcv+0x2f2/0x325
Apr 11 18:01:34[99194.946816]  ? ip_local_deliver_finish+0x187/0x187
Apr 11 18:01:34[99194.947331]  __netif_receive_skb_core+0x81c/0x89c
Apr 11 18:01:34[99194.947872]  ? napi_complete_done+0xb4/0xba
Apr 11 18:01:34[99194.948391]  ? ixgbe_poll+0xf96/0x104d [ixgbe]
Apr 11 18:01:34[99194.948931]  ? process_backlog+0x8b/0x10d
Apr 11 18:01:34[99194.949424]  process_backlog+0x8b/0x10d
Apr 11 18:01:34[99194.949953]  net_rx_action+0x127/0x2b5
Apr 11 18:01:34[99194.950445]  __do_softirq+0xc1/0x1b1
Apr 11 18:01:34[99194.950951]  ? sort_range+0x17/0x17
Apr 11 18:01:34[99194.951442]  run_ksoftirqd+0x11/0x22
Apr 11 18:01:34[99194.951972]  smpboot_thread_fn+0x121/0x136
Apr 11 18:01:34[99194.952489]  kthread+0xfd/0x105
Apr 11 18:01:34[99194.953018]  ? kthread_create_on_node+0x3a/0x3a
Apr 11 18:01:34[99194.953528]  ret_from_fork+0x1f/0x30
Apr 11 18:01:34[99194.954047] Code: 15 77 9e 99 00 83 7a 7c 00 75 37 83 
b8 2c 01 00 00 00 75 2e 48 8b 43 58 48 89 df 5b 5d 48 83 e0 fe 41 5c 41 
5d 41 5e 48 8b 40 50 <ff> e0 83 f8 ee 75 10 49 8b 84 24
90 01 00 00 65 48 ff 80 40 02
Apr 11 18:01:34[99194.955449] RIP: ip_rcv_finish+0x2b5/0x2e5 RSP: 
ffffc9000cad7cf8
Apr 11 18:01:34[99194.956008] ---[ end trace 312b0bf537b4709a ]---
Apr 11 18:01:34[99195.007900] Kernel panic - not syncing: Fatal 
exception in interrupt
Apr 11 18:01:34[99195.008400] Kernel Offset: disabled
Apr 11 18:01:34[99195.013950] Rebooting in 5 seconds..
--------------


and i reported before about warnings in nf_frm_me_harder, but probably 
nobody have interest to take a look, and it is seems plaguing 4.15.x and 
nearby versions kernels . Here is one of such warnings.

-----------------------
Apr 11 00:00:17[34320.802349] dst_release: dst:00000000b32dca17 
refcnt:-2
Apr 11 00:00:19[34323.018468] WARNING: CPU: 7 PID: 0 at 
./include/net/dst.h:256 nf_xfrm_me_harder+0x62/0xfe [nf_nat]
Apr 11 00:00:19[34323.019357] Modules linked in: pppoe pppox ppp_generic 
slhc ip_set_hash_net xt_nat xt_string xt_connmark xt_TCPMSS xt_mark 
xt_CT xt_set xt_tcpudp ip_set_bitmap_port ip_set nfnetlink
iptable_raw iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 
netconsole configfs 8021q garp mrp stp llc ixgbe dca ipv6
Apr 11 00:00:19[34323.021503] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   
      W        4.15.13-build-0135 #4
Apr 11 00:00:19[34323.022380] Hardware name: Intel Corporation 
S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
Apr 11 00:00:19[34323.023261] RIP: 0010:nf_xfrm_me_harder+0x62/0xfe 
[nf_nat]
Apr 11 00:00:19[34323.023737] RSP: 0018:ffff88303fa43c90 EFLAGS: 
00010246
Apr 11 00:00:19[34323.024218] RAX: 0000000000000000 RBX: 
ffff8817b2c35200 RCX: 0000000000000000
Apr 11 00:00:19[34323.024703] RDX: 0000000000000002 RSI: 
ffff88178fab3700 RDI: ffff88303fa43cd0
Apr 11 00:00:19[34323.025214] RBP: ffffffff822a6180 R08: 
0000000000000005 R09: 0000000000000001
Apr 11 00:00:19[34323.025717] R10: 00000000000000d6 R11: 
ffff8817c945bca0 R12: 0000000000000001
Apr 11 00:00:19[34323.026214] R13: ffff88303fa43d60 R14: 
0000000000ce0008 R15: ffff8817b7477078
Apr 11 00:00:19[34323.026736] FS:  0000000000000000(0000) 
GS:ffff88303fa40000(0000) knlGS:0000000000000000
Apr 11 00:00:19[34323.027680] CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Apr 11 00:00:19[34323.028209] CR2: 00007f3557e13490 CR3: 
000000303e209002 CR4: 00000000001606e0
Apr 11 00:00:19[34323.028709] Call Trace:
Apr 11 00:00:19[34323.029181]  <IRQ>
Apr 11 00:00:19[34323.029670]  nf_nat_ipv4_out+0xa7/0xbb [nf_nat_ipv4]
Apr 11 00:00:19[34323.030161]  nf_hook_slow+0x31/0x90
Apr 11 00:00:19[34323.030681]  ip_output+0x93/0xaf
Apr 11 00:00:19[34323.031158]  ? ip_fragment.constprop.5+0x6e/0x6e
Apr 11 00:00:19[34323.031668]  ip_forward+0x327/0x36a
Apr 11 00:00:19[34323.032173]  ? ip_frag_mem+0x7/0x7
Apr 11 00:00:19[34323.032689]  ip_rcv+0x2f2/0x325
Apr 11 00:00:19[34323.033179]  ? ip_local_deliver_finish+0x187/0x187
Apr 11 00:00:19[34323.033680]  __netif_receive_skb_core+0x81c/0x89c
Apr 11 00:00:19[34323.034179]  ? process_backlog+0x8b/0x10d
Apr 11 00:00:19[34323.034701]  process_backlog+0x8b/0x10d
Apr 11 00:00:19[34323.035177]  net_rx_action+0x127/0x2b5
Apr 11 00:00:19[34323.035694]  __do_softirq+0xc1/0x1b1
Apr 11 00:00:19[34323.036202]  irq_exit+0x49/0x88
Apr 11 00:00:19[34323.036728]  call_function_single_interrupt+0x77/0x80
Apr 11 00:00:19[34323.037213]  </IRQ>
Apr 11 00:00:19[34323.037698] RIP: 0010:mwait_idle+0x52/0x64
Apr 11 00:00:19[34323.038273] RSP: 0018:ffffc9000c607f00 EFLAGS: 
00000246 ORIG_RAX: ffffffffffffff04
Apr 11 00:00:19[34323.039183] RAX: 0000000000000000 RBX: 
0000000000000000 RCX: 0000000000000000
Apr 11 00:00:19[34323.039701] RDX: 0000000000000000 RSI: 
0000000000000000 RDI: 0000000000000000
Apr 11 00:00:19[34323.040211] RBP: 0000000000000000 R08: 
012105aacb441779 R09: ffffffff82244640
Apr 11 00:00:19[34323.040742] R10: ffffc9000c607e90 R11: 
0000000000000046 R12: 0000000000000007
Apr 11 00:00:19[34323.041236] R13: 0000000000000000 R14: 
0000000000000000 R15: 0000000000000000
Apr 11 00:00:19[34323.041777]  do_idle+0xae/0x137
Apr 11 00:00:19[34323.042292]  cpu_startup_entry+0x57/0x59
Apr 11 00:00:19[34323.042814]  start_secondary+0x151/0x154
Apr 11 00:00:19[34323.043305]  secondary_startup_64+0xa5/0xb0
Apr 11 00:00:19[34323.043819] Code: 48 83 e6 fe 48 83 7e 48 00 74 07 48 
8b b6 80 01 00 00 8b 86 80 00 00 00 85 c0 74 0f 8d 50 01 f0 0f b1 96 80 
00 00 00 74 04 eb ed <0f> 0b 48 8b 4b 18 48 8d 54 24 08
45 31 c0 48 89 ef e8 c5 ff 83
Apr 11 00:00:19[34323.045208] ---[ end trace 312b0bf537b46b33 ]---
Apr 11 00:00:19[34323.045739] dst_release: dst:00000000156ec4ca 
refcnt:-1

^ permalink raw reply

* [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Stephen Suryaputra @ 2018-04-11  2:55 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Suryaputra

This is enhanced from the proposed patch by Igor Maravic in 2011 to
support per interface IPv4 stats. The enhancement is mainly adding a
kernel configuration option CONFIG_IP_IFSTATS_TABLE.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
---
 drivers/net/vrf.c               |   2 +-
 include/linux/inetdevice.h      |  22 +++++++
 include/net/icmp.h              |  44 ++++++++++++--
 include/net/ip.h                |  72 +++++++++++++++++++++--
 include/net/ipv6.h              |  62 ++++----------------
 include/net/netns/mib.h         |   3 +
 include/net/snmp.h              |  95 ++++++++++++++++++++++++++++++
 net/bridge/br_netfilter_hooks.c |  10 ++--
 net/dccp/ipv4.c                 |   4 +-
 net/ipv4/Kconfig                |   8 +++
 net/ipv4/af_inet.c              |   7 ++-
 net/ipv4/datagram.c             |   2 +-
 net/ipv4/devinet.c              |  85 ++++++++++++++++++++++++++-
 net/ipv4/icmp.c                 |  32 +++++-----
 net/ipv4/inet_connection_sock.c |   8 ++-
 net/ipv4/ip_forward.c           |   8 +--
 net/ipv4/ip_fragment.c          |  20 ++++---
 net/ipv4/ip_input.c             |  29 ++++-----
 net/ipv4/ip_output.c            |  40 ++++++++-----
 net/ipv4/ipmr.c                 |   6 +-
 net/ipv4/ping.c                 |   9 ++-
 net/ipv4/proc.c                 | 126 ++++++++++++++++++++++++++++++++++++++++
 net/ipv4/raw.c                  |   4 +-
 net/ipv4/route.c                |   6 +-
 net/ipv4/tcp_ipv4.c             |   4 +-
 net/ipv4/udp.c                  |   4 +-
 net/l2tp/l2tp_ip.c              |   4 +-
 net/l2tp/l2tp_ip6.c             |   2 +-
 net/mpls/af_mpls.c              |   2 +-
 net/netfilter/ipvs/ip_vs_xmit.c |   2 +-
 net/sctp/input.c                |   2 +-
 net/sctp/output.c               |   2 +-
 32 files changed, 570 insertions(+), 156 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 0a2b180..509c2ca 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -592,7 +592,7 @@ static int vrf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index e16fe7d..3d120cb 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -22,6 +22,15 @@ struct ipv4_devconf {
 
 #define MC_HASH_SZ_LOG 9
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+struct ipv4_devstat {
+	struct proc_dir_entry *proc_dir_entry;
+	DEFINE_SNMP_STAT(struct ipstats_mib, ip);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmp_mib_device, icmpdev);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmpmsg_mib_device, icmpmsgdev);
+};
+#endif
+
 struct in_device {
 	struct net_device	*dev;
 	refcount_t		refcnt;
@@ -45,6 +54,9 @@ struct in_device {
 
 	struct neigh_parms	*arp_parms;
 	struct ipv4_devconf	cnf;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct ipv4_devstat stats;
+#endif
 	struct rcu_head		rcu_head;
 };
 
@@ -216,6 +228,16 @@ static inline struct in_device *__in_dev_get_rcu(const struct net_device *dev)
 	return rcu_dereference(dev->ip_ptr);
 }
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static inline struct in_device *__in_dev_get_rcu_safely(const struct net_device *dev)
+{
+	if (likely(dev))
+		return rcu_dereference(dev->ip_ptr);
+	else
+		return NULL;
+}
+#endif
+
 static inline struct in_device *in_dev_get(const struct net_device *dev)
 {
 	struct in_device *in_dev;
diff --git a/include/net/icmp.h b/include/net/icmp.h
index 3ef2743..fdfbc0f 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -29,10 +29,44 @@ struct icmp_err {
 };
 
 extern const struct icmp_err icmp_err_convert[];
-#define ICMP_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define __ICMP_INC_STATS(net, field)	__SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define ICMPMSGOUT_INC_STATS(net, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field+256)
-#define ICMPMSGIN_INC_STATS(net, field)		SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field)
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#define ICMP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINCATOMIC(net, icmp, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define __ICMP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		___DEVINCATOMIC(net, icmp, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define ICMPMSGOUT_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC_ATOMIC_ATOMIC(net, icmpmsg, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field+256);	\
+		rcu_read_unlock();	\
+	})
+
+#define ICMPMSGIN_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC_ATOMIC_ATOMIC(net, icmpmsg, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+#else
+#define ICMP_INC_STATS(net, dev, field)	SNMP_INC_STATS((net)->mib.icmp_statistics, field)
+#define __ICMP_INC_STATS(net, dev, field)	__SNMP_INC_STATS((net)->mib.icmp_statistics, field)
+#define ICMPMSGOUT_INC_STATS(net, dev, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field+256)
+#define ICMPMSGIN_INC_STATS(net, dev, field)		SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field)
+#endif
 
 struct dst_entry;
 struct net_proto_family;
@@ -43,6 +77,6 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
 int icmp_rcv(struct sk_buff *skb);
 void icmp_err(struct sk_buff *skb, u32 info);
 int icmp_init(void);
-void icmp_out_count(struct net *net, unsigned char type);
+void icmp_out_count(struct net_device *dev, unsigned char type);
 
 #endif	/* _ICMP_H */
diff --git a/include/net/ip.h b/include/net/ip.h
index ecffd84..aa8a55b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -27,6 +27,7 @@
 #include <linux/in.h>
 #include <linux/skbuff.h>
 #include <linux/jhash.h>
+#include <linux/inetdevice.h>
 
 #include <net/inet_sock.h>
 #include <net/route.h>
@@ -218,12 +219,62 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			   const struct ip_reply_arg *arg,
 			   unsigned int len);
 
-#define IP_INC_STATS(net, field)	SNMP_INC_STATS64((net)->mib.ip_statistics, field)
-#define __IP_INC_STATS(net, field)	__SNMP_INC_STATS64((net)->mib.ip_statistics, field)
-#define IP_ADD_STATS(net, field, val)	SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
-#define __IP_ADD_STATS(net, field, val) __SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
-#define IP_UPD_PO_STATS(net, field, val) SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
-#define __IP_UPD_PO_STATS(net, field, val) __SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#define IP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define __IP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		___DEVINC(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define IP_ADD_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVADD(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field, val);	\
+		rcu_read_unlock();	\
+	})
+
+#define __IP_ADD_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		___DEVADD(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field, val);	\
+		rcu_read_unlock();	\
+	})
+
+#define IP_UPD_PO_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVUPD(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field, val);	\
+		rcu_read_unlock();	\
+	})
+
+#define __IP_UPD_PO_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		___DEVUPD(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field, val);	\
+		rcu_read_unlock();	\
+	})
+#else
+#define IP_INC_STATS(net, dev, field)	SNMP_INC_STATS64((net)->mib.ip_statistics, field)
+#define __IP_INC_STATS(net, dev, field)	__SNMP_INC_STATS64((net)->mib.ip_statistics, field)
+#define IP_ADD_STATS(net, dev, field, val)	SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
+#define __IP_ADD_STATS(net, dev, field, val) __SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
+#define IP_UPD_PO_STATS(net, dev, field, val) SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#define __IP_UPD_PO_STATS(net, dev, field, val) __SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#endif
 #define NET_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.net_statistics, field)
 #define __NET_INC_STATS(net, field)	__SNMP_INC_STATS((net)->mib.net_statistics, field)
 #define NET_ADD_STATS(net, field, adnd)	SNMP_ADD_STATS((net)->mib.net_statistics, field, adnd)
@@ -660,4 +711,13 @@ extern int sysctl_icmp_msgs_burst;
 int ip_misc_proc_init(void);
 #endif
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#ifdef CONFIG_PROC_FS
+extern int snmp_register_dev(struct in_device *idev);
+extern int snmp_unregister_dev(struct in_device *idev);
+#else
+extern int snmp_register_dev(struct in_device *idev) { return 0; }
+extern int snmp_unregister_dev(struct in_device *idev) { return 0; }
+#endif
+#endif
 #endif	/* _IP_H */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 9b6e7f5..c9be5e1 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -163,71 +163,29 @@ struct frag_hdr {
 extern int sysctl_mld_max_msf;
 extern int sysctl_mld_qrv;
 
-#define _DEVINC(net, statname, mod, idev, field)			\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		mod##SNMP_INC_STATS64((_idev)->stats.statname, (field));\
-	mod##SNMP_INC_STATS64((net)->mib.statname##_statistics, (field));\
-})
-
-/* per device counters are atomic_long_t */
-#define _DEVINCATOMIC(net, statname, mod, idev, field)			\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field)); \
-	mod##SNMP_INC_STATS((net)->mib.statname##_statistics, (field));\
-})
-
-/* per device and per net counters are atomic_long_t */
-#define _DEVINC_ATOMIC_ATOMIC(net, statname, idev, field)		\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field)); \
-	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field));\
-})
-
-#define _DEVADD(net, statname, mod, idev, field, val)			\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		mod##SNMP_ADD_STATS((_idev)->stats.statname, (field), (val)); \
-	mod##SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));\
-})
-
-#define _DEVUPD(net, statname, mod, idev, field, val)			\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		mod##SNMP_UPD_PO_STATS((_idev)->stats.statname, field, (val)); \
-	mod##SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));\
-})
-
 /* MIBs */
 
 #define IP6_INC_STATS(net, idev,field)		\
-		_DEVINC(net, ipv6, , idev, field)
+		_DEVINC(net, ipv6, struct inet6_dev, idev, field)
 #define __IP6_INC_STATS(net, idev,field)	\
-		_DEVINC(net, ipv6, __, idev, field)
+		___DEVINC(net, ipv6, struct inet6_dev, idev, field)
 #define IP6_ADD_STATS(net, idev,field,val)	\
-		_DEVADD(net, ipv6, , idev, field, val)
+		_DEVADD(net, ipv6, struct inet6_dev, idev, field, val)
 #define __IP6_ADD_STATS(net, idev,field,val)	\
-		_DEVADD(net, ipv6, __, idev, field, val)
+		___DEVADD(net, ipv6, struct inet6_dev, idev, field, val)
 #define IP6_UPD_PO_STATS(net, idev,field,val)   \
-		_DEVUPD(net, ipv6, , idev, field, val)
+		_DEVUPD(net, ipv6, struct inet6_dev, idev, field, val)
 #define __IP6_UPD_PO_STATS(net, idev,field,val)   \
-		_DEVUPD(net, ipv6, __, idev, field, val)
+		___DEVUPD(net, ipv6, struct inet6_dev, idev, field, val)
 #define ICMP6_INC_STATS(net, idev, field)	\
-		_DEVINCATOMIC(net, icmpv6, , idev, field)
+		_DEVINCATOMIC(net, icmpv6, struct inet6_dev, idev, field)
 #define __ICMP6_INC_STATS(net, idev, field)	\
-		_DEVINCATOMIC(net, icmpv6, __, idev, field)
+		___DEVINCATOMIC(net, icmpv6, struct inet6_dev, idev, field)
 
 #define ICMP6MSGOUT_INC_STATS(net, idev, field)		\
-	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, idev, field +256)
+	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, struct inet6_dev, idev, field+256)
 #define ICMP6MSGIN_INC_STATS(net, idev, field)	\
-	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, idev, field)
+	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, struct inet6_dev, idev, field)
 
 struct ip6_ra_chain {
 	struct ip6_ra_chain	*next;
diff --git a/include/net/netns/mib.h b/include/net/netns/mib.h
index 830bdf3..798bbc2 100644
--- a/include/net/netns/mib.h
+++ b/include/net/netns/mib.h
@@ -5,6 +5,9 @@
 #include <net/snmp.h>
 
 struct netns_mib {
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct proc_dir_entry *proc_net_devsnmp;
+#endif
 	DEFINE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 	DEFINE_SNMP_STAT(struct ipstats_mib, ip_statistics);
 	DEFINE_SNMP_STAT(struct linux_mib, net_statistics);
diff --git a/include/net/snmp.h b/include/net/snmp.h
index c9228ad..56e1f98 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -61,14 +61,26 @@ struct ipstats_mib {
 
 /* ICMP */
 #define ICMP_MIB_MAX	__ICMP_MIB_MAX
+/* per network ns counters */
 struct icmp_mib {
 	unsigned long	mibs[ICMP_MIB_MAX];
 };
 
 #define ICMPMSG_MIB_MAX	__ICMPMSG_MIB_MAX
+/* per network ns counters */
 struct icmpmsg_mib {
 	atomic_long_t	mibs[ICMPMSG_MIB_MAX];
 };
+#ifdef CONFIG_IP_IFSTATS_TABLE
+/* per device counters, (shared on all cpus) */
+struct icmp_mib_device {
+	atomic_long_t	mibs[ICMP_MIB_MAX];
+};
+/* per device counters, (shared on all cpus) */
+struct icmpmsg_mib_device {
+	atomic_long_t	mibs[ICMPMSG_MIB_MAX];
+};
+#endif
 
 /* ICMP6 (IPv6-ICMP) */
 #define ICMP6_MIB_MAX	__ICMP6_MIB_MAX
@@ -198,4 +210,87 @@ struct linux_xfrm_mib {
 #define __SNMP_UPD_PO_STATS64(mib, basefield, addend) __SNMP_UPD_PO_STATS(mib, basefield, addend)
 #endif
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+/* Macros for enabling per device statistics */
+#define _DEVINC(net, statname, type, idev, field)			\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS((_idev)->stats.statname, (field));	\
+	SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+#define ___DEVINC(net, statname, type, idev, field)			\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		__SNMP_INC_STATS((_idev)->stats.statname, (field));	\
+	__SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+
+/* per device counters are atomic_long_t */
+#define _DEVINCATOMIC(net, statname, type, idev, field)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+#define ___DEVINCATOMIC(net, statname, type, idev, field)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	__SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+
+/* per device and per net counters are atomic_long_t */
+#define _DEVINC_ATOMIC_ATOMIC(net, statname, type, idev, field)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field));	\
+})
+
+#define _DEVADD(net, statname, type, idev, field, val)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_ADD_STATS((_idev)->stats.statname, (field), (val));	\
+	SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));	\
+})
+#define ___DEVADD(net, statname, type, idev, field, val)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		__SNMP_ADD_STATS((_idev)->stats.statname, (field), (val));	\
+	__SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));	\
+})
+
+#define _DEVUPD(net, statname, type, idev, field, val)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_UPD_PO_STATS((_idev)->stats.statname, field, (val));	\
+	SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));	\
+})
+#define ___DEVUPD(net, statname, type, idev, field, val)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		__SNMP_UPD_PO_STATS((_idev)->stats.statname, field, (val));	\
+	__SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));	\
+})
+#else
+#define _DEVINC(net, statname, type, idev, field) SNMP_INC_STATS((net)->mib.statname##_statistics, (field))
+#define ___DEVINC(net, statname, type, idev, field) __SNMP_INC_STATS((net)->mib.statname##_statistics, (field))
+#define _DEVINCATOMIC(net, statname, type, idev, field) SNMP_INC_STATS((net)->mib.statname##_statistics, (field))
+#define ___DEVINCATOMIC(net, statname, type, idev, field) __SNMP_INC_STATS((net)->mib.statname##_statistics, (field))
+#define _DEVINC_ATOMIC_ATOMIC(net, statname, type, idev, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field))
+#define _DEVADD(net, statname, type, idev, field, val)	SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val))
+#define ___DEVADD(net, statname, type, idev, field, val)	__SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val))
+#define _DEVUPD(net, statname, type, idev, field, val)	SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val))
+#define ___DEVUPD(net, statname, type, idev, field, val)	__SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val))
+#endif
+
 #endif
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 9b16eaf..f9576b7 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -218,13 +218,13 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	} else if (len < (iph->ihl*4))
 		goto inhdr_error;
 
 	if (pskb_trim_rcsum(skb, len)) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -237,9 +237,9 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 	return 0;
 
 csum_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INHDRERRORS);
 drop:
 	return -1;
 }
@@ -691,7 +691,7 @@ br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_FRAGFAILS);
 		kfree_skb(skb);
 		return -EMSGSIZE;
 	}
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index e65fcb4..11e0d4c 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -258,7 +258,7 @@ static void dccp_v4_err(struct sk_buff *skb, u32 info)
 				       iph->saddr, ntohs(dh->dccph_sport),
 				       inet_iif(skb), 0);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 
@@ -468,7 +468,7 @@ static struct dst_entry* dccp_v4_route_skb(struct net *net, struct sock *sk,
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_flow(net, &fl4, sk);
 	if (IS_ERR(rt)) {
-		IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+		IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 		return NULL;
 	}
 
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 80dad30..6470b95 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -52,6 +52,14 @@ config IP_ADVANCED_ROUTER
 
 	  If unsure, say N here.
 
+config IP_IFSTATS_TABLE
+	def_bool n
+	depends on IP_ADVANCED_ROUTER
+	prompt "IP: interface statistics"
+	help
+	  This option enables per interface statistics for IPv4. Refer to
+	  RFC 4293.
+
 config IP_FIB_TRIE_STATS
 	bool "FIB TRIE statistics"
 	depends on IP_ADVANCED_ROUTER
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eaed036..c99da7a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1938,6 +1938,11 @@ static int __init inet_init(void)
 		inet_register_protosw(q);
 
 	/*
+	 *	Init proc fs before initializing devinet
+	 */
+	ipv4_proc_init();
+
+	/*
 	 *	Set the ARP module up
 	 */
 
@@ -1984,8 +1989,6 @@ static int __init inet_init(void)
 	if (init_ipv4_mibs())
 		pr_crit("%s: Cannot init ipv4 mibs\n", __func__);
 
-	ipv4_proc_init();
-
 	ipfrag_init();
 
 	dev_add_pack(&ip_packet_type);
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..7acaadb 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -55,7 +55,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		goto out;
 	}
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 40f0017..7600e1e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -218,6 +218,49 @@ static void inet_free_ifa(struct in_ifaddr *ifa)
 	call_rcu(&ifa->rcu_head, inet_rcu_free_ifa);
 }
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static int snmp_alloc_dev(struct in_device *idev)
+{
+	int i;
+
+	idev->stats.ip = alloc_percpu(struct ipstats_mib);
+	if (!idev->stats.ip)
+		goto err_ip;
+
+	for_each_possible_cpu(i) {
+		struct ipstats_mib *addrconf_stats;
+
+		addrconf_stats = per_cpu_ptr(idev->stats.ip, i);
+		u64_stats_init(&addrconf_stats->syncp);
+	}
+
+	idev->stats.icmpdev = kzalloc(sizeof(*idev->stats.icmpdev),
+				      GFP_KERNEL);
+	if (!idev->stats.icmpdev)
+		goto err_icmp;
+	idev->stats.icmpmsgdev = kzalloc(sizeof(*idev->stats.icmpmsgdev),
+					 GFP_KERNEL);
+	if (!idev->stats.icmpmsgdev)
+		goto err_icmpmsg;
+
+	return 0;
+
+err_icmpmsg:
+	kfree(idev->stats.icmpdev);
+err_icmp:
+	free_percpu(idev->stats.ip);
+err_ip:
+	return -ENOMEM;
+}
+
+static void snmp_free_dev(struct in_device *idev)
+{
+	kfree(idev->stats.icmpmsgdev);
+	kfree(idev->stats.icmpdev);
+	free_percpu(idev->stats.ip);
+}
+#endif
+
 void in_dev_finish_destroy(struct in_device *idev)
 {
 	struct net_device *dev = idev->dev;
@@ -229,10 +272,14 @@ void in_dev_finish_destroy(struct in_device *idev)
 	pr_debug("%s: %p=%s\n", __func__, idev, dev ? dev->name : "NIL");
 #endif
 	dev_put(dev);
-	if (!idev->dead)
+	if (!idev->dead) {
 		pr_err("Freeing alive in_device %p\n", idev);
-	else
+	} else {
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		snmp_free_dev(idev);
+#endif
 		kfree(idev);
+	}
 }
 EXPORT_SYMBOL(in_dev_finish_destroy);
 
@@ -257,6 +304,25 @@ static struct in_device *inetdev_init(struct net_device *dev)
 		dev_disable_lro(dev);
 	/* Reference in_dev->dev */
 	dev_hold(dev);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	err = snmp_alloc_dev(in_dev);
+	if (err < 0) {
+		netdev_crit(dev,
+			    "%s(): cannot allocate memory for statistics; dev=%s err=%d.\n",
+			    __func__, dev->name, err);
+		neigh_parms_release(&arp_tbl, in_dev->arp_parms);
+		dev_put(dev);
+		kfree(in_dev);
+		return NULL;
+	}
+
+	err = snmp_register_dev(in_dev);
+	if (err < 0)
+		netdev_warn(dev,
+			    "%s(): cannot create /proc/net/dev_snmp/%s err=%d\n",
+			    __func__, dev->name, err);
+#endif
+
 	/* Account for reference dev->ip_ptr (below) */
 	refcount_set(&in_dev->refcnt, 1);
 
@@ -306,6 +372,9 @@ static void inetdev_destroy(struct in_device *in_dev)
 	}
 
 	RCU_INIT_POINTER(dev->ip_ptr, NULL);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	snmp_unregister_dev(in_dev);
+#endif
 
 	devinet_sysctl_unregister(in_dev);
 	neigh_parms_release(&arp_tbl, in_dev->arp_parms);
@@ -1529,8 +1598,20 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
 		 */
 		inetdev_changename(dev, in_dev);
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		snmp_unregister_dev(in_dev);
+#endif
 		devinet_sysctl_unregister(in_dev);
 		devinet_sysctl_register(in_dev);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		{
+			int err = snmp_register_dev(in_dev);
+
+			if (err)
+				return notifier_from_errno(err);
+		}
+#endif
+		/* CONFIG_EXTREME: End */
 		break;
 	}
 out:
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1617604..4d5c092 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -338,10 +338,10 @@ static bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt,
 /*
  *	Maintain the counters used in the SNMP statistics for outgoing ICMP
  */
-void icmp_out_count(struct net *net, unsigned char type)
+void icmp_out_count(struct net_device *dev, unsigned char type)
 {
-	ICMPMSGOUT_INC_STATS(net, type);
-	ICMP_INC_STATS(net, ICMP_MIB_OUTMSGS);
+	ICMPMSGOUT_INC_STATS(dev_net(dev), dev, type);
+	ICMP_INC_STATS(dev_net(dev), dev, ICMP_MIB_OUTMSGS);
 }
 
 /*
@@ -370,13 +370,14 @@ static void icmp_push_reply(struct icmp_bxm *icmp_param,
 {
 	struct sock *sk;
 	struct sk_buff *skb;
+	struct net_device *dev = (*rt)->dst.dev;
 
-	sk = icmp_sk(dev_net((*rt)->dst.dev));
+	sk = icmp_sk(dev_net(dev));
 	if (ip_append_data(sk, fl4, icmp_glue_bits, icmp_param,
 			   icmp_param->data_len+icmp_param->head_len,
 			   icmp_param->head_len,
 			   ipc, rt, MSG_DONTWAIT) < 0) {
-		__ICMP_INC_STATS(sock_net(sk), ICMP_MIB_OUTERRORS);
+		__ICMP_INC_STATS(sock_net(sk), dev, ICMP_MIB_OUTERRORS);
 		ip_flush_pending_frames(sk);
 	} else if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
 		struct icmphdr *icmph = icmp_hdr(skb);
@@ -760,7 +761,7 @@ static void icmp_socket_deliver(struct sk_buff *skb, u32 info)
 	 * avoid additional coding at protocol handlers.
 	 */
 	if (!pskb_may_pull(skb, iph->ihl * 4 + 8)) {
-		__ICMP_INC_STATS(dev_net(skb->dev), ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(dev_net(skb->dev), skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 
@@ -792,8 +793,9 @@ static bool icmp_unreach(struct sk_buff *skb)
 	struct icmphdr *icmph;
 	struct net *net;
 	u32 info = 0;
+	struct net_device *dev = skb_dst(skb)->dev;
 
-	net = dev_net(skb_dst(skb)->dev);
+	net = dev_net(dev);
 
 	/*
 	 *	Incomplete header ?
@@ -852,7 +854,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 		info = ntohl(icmph->un.gateway) >> 24;
 		break;
 	case ICMP_TIME_EXCEEDED:
-		__ICMP_INC_STATS(net, ICMP_MIB_INTIMEEXCDS);
+		__ICMP_INC_STATS(net, dev, ICMP_MIB_INTIMEEXCDS);
 		if (icmph->code == ICMP_EXC_FRAGTIME)
 			goto out;
 		break;
@@ -890,7 +892,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 out:
 	return true;
 out_err:
-	__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(net, dev, ICMP_MIB_INERRORS);
 	return false;
 }
 
@@ -902,7 +904,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 static bool icmp_redirect(struct sk_buff *skb)
 {
 	if (skb->len < sizeof(struct iphdr)) {
-		__ICMP_INC_STATS(dev_net(skb->dev), ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(dev_net(skb->dev), skb->dev, ICMP_MIB_INERRORS);
 		return false;
 	}
 
@@ -982,7 +984,7 @@ static bool icmp_timestamp(struct sk_buff *skb)
 	return true;
 
 out_err:
-	__ICMP_INC_STATS(dev_net(skb_dst(skb)->dev), ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(dev_net(skb_dst(skb)->dev), skb_dst(skb)->dev, ICMP_MIB_INERRORS);
 	return false;
 }
 
@@ -1022,7 +1024,7 @@ int icmp_rcv(struct sk_buff *skb)
 		skb_set_network_header(skb, nh);
 	}
 
-	__ICMP_INC_STATS(net, ICMP_MIB_INMSGS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INMSGS);
 
 	if (skb_checksum_simple_validate(skb))
 		goto csum_error;
@@ -1032,7 +1034,7 @@ int icmp_rcv(struct sk_buff *skb)
 
 	icmph = icmp_hdr(skb);
 
-	ICMPMSGIN_INC_STATS(net, icmph->type);
+	ICMPMSGIN_INC_STATS(net, skb->dev, icmph->type);
 	/*
 	 *	18 is the highest 'known' ICMP type. Anything else is a mystery
 	 *
@@ -1078,9 +1080,9 @@ int icmp_rcv(struct sk_buff *skb)
 	kfree_skb(skb);
 	return NET_RX_DROP;
 csum_error:
-	__ICMP_INC_STATS(net, ICMP_MIB_CSUMERRORS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_CSUMERRORS);
 error:
-	__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 	goto drop;
 }
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 881ac6d..30afff4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -539,6 +539,7 @@ struct dst_entry *inet_csk_route_req(const struct sock *sk,
 	struct net *net = read_pnet(&ireq->ireq_net);
 	struct ip_options_rcu *opt;
 	struct rtable *rt;
+	struct net_device *dev = NULL;
 
 	opt = ireq_opt_deref(ireq);
 
@@ -557,9 +558,10 @@ struct dst_entry *inet_csk_route_req(const struct sock *sk,
 	return &rt->dst;
 
 route_err:
+	dev = rt->dst.dev;
 	ip_rt_put(rt);
 no_route:
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_OUTNOROUTES);
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(inet_csk_route_req);
@@ -574,6 +576,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 	struct ip_options_rcu *opt;
 	struct flowi4 *fl4;
 	struct rtable *rt;
+	struct net_device *dev = NULL;
 
 	opt = rcu_dereference(ireq->ireq_opt);
 	fl4 = &newinet->cork.fl.u.ip4;
@@ -593,9 +596,10 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 	return &rt->dst;
 
 route_err:
+	dev = rt->dst.dev;
 	ip_rt_put(rt);
 no_route:
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_OUTNOROUTES);
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index b54b948..bb9be11 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -66,8 +66,8 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 {
 	struct ip_options *opt	= &(IPCB(skb)->opt);
 
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
-	__IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
+	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	__IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -121,7 +121,7 @@ int ip_forward(struct sk_buff *skb)
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
 	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
 	if (ip_exceeds_mtu(skb, mtu)) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		goto drop;
@@ -158,7 +158,7 @@ int ip_forward(struct sk_buff *skb)
 
 too_many_hops:
 	/* Tell the sender its packet died... */
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);
 	icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 drop:
 	kfree_skb(skb);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 994fa70..40aadd4 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,6 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	const struct iphdr *iph;
+	struct net_device *dev;
 	struct sk_buff *head;
 	struct net *net;
 	struct ipq *qp;
@@ -151,16 +152,17 @@ static void ip_expire(struct timer_list *t)
 		goto out;
 
 	ipq_kill(qp);
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	dev = dev_get_by_index_rcu(net, qp->iif);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 
 	head = qp->q.fragments;
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMTIMEOUT);
 
 	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
 		goto out;
 
-	head->dev = dev_get_by_index_rcu(net, qp->iif);
+	head->dev = dev;
 	if (!head->dev)
 		goto out;
 
@@ -237,7 +239,9 @@ static int ip_frag_too_far(struct ipq *qp)
 		struct net *net;
 
 		net = container_of(qp->q.net, struct net, ipv4.frags);
-		__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+		rcu_read_lock();
+		__IP_INC_STATS(net, dev_get_by_index_rcu(net, qp->iif), IPSTATS_MIB_REASMFAILS);
+		rcu_read_unlock();
 	}
 
 	return rc;
@@ -582,7 +586,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 
 	ip_send_check(iph);
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;
 	return 0;
@@ -594,7 +598,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 out_oversize:
 	net_info_ratelimited("Oversized IP packet from %pI4\n", &qp->q.key.v4.saddr);
 out_fail:
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 	return err;
 }
 
@@ -605,7 +609,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 	int vif = l3mdev_master_ifindex_rcu(dev);
 	struct ipq *qp;
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMREQDS);
 	skb_orphan(skb);
 
 	/* Lookup (or create) queue header */
@@ -622,7 +626,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 		return ret;
 	}
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 	kfree_skb(skb);
 	return -ENOMEM;
 }
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7582713..1143fe2 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -197,6 +197,7 @@ static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_b
 		int protocol = ip_hdr(skb)->protocol;
 		const struct net_protocol *ipprot;
 		int raw;
+		struct net_device *dev = skb->dev;
 
 	resubmit:
 		raw = raw_local_deliver(skb, protocol);
@@ -217,17 +218,17 @@ static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_b
 				protocol = -ret;
 				goto resubmit;
 			}
-			__IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+			__IP_INC_STATS(net, dev, IPSTATS_MIB_INDELIVERS);
 		} else {
 			if (!raw) {
 				if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-					__IP_INC_STATS(net, IPSTATS_MIB_INUNKNOWNPROTOS);
+					__IP_INC_STATS(net, dev, IPSTATS_MIB_INUNKNOWNPROTOS);
 					icmp_send(skb, ICMP_DEST_UNREACH,
 						  ICMP_PROT_UNREACH, 0);
 				}
 				kfree_skb(skb);
 			} else {
-				__IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+				__IP_INC_STATS(net, dev, IPSTATS_MIB_INDELIVERS);
 				consume_skb(skb);
 			}
 		}
@@ -272,7 +273,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 					      --ANK (980813)
 	*/
 	if (skb_cow(skb, skb_headroom(skb))) {
-		__IP_INC_STATS(dev_net(dev), IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(dev_net(dev), dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -281,7 +282,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
 
 	if (ip_options_compile(dev_net(dev), opt, skb)) {
-		__IP_INC_STATS(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
+		__IP_INC_STATS(dev_net(dev), dev, IPSTATS_MIB_INHDRERRORS);
 		goto drop;
 	}
 
@@ -366,9 +367,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 
 	rt = skb_rtable(skb);
 	if (rt->rt_type == RTN_MULTICAST) {
-		__IP_UPD_PO_STATS(net, IPSTATS_MIB_INMCAST, skb->len);
+		__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_INMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST) {
-		__IP_UPD_PO_STATS(net, IPSTATS_MIB_INBCAST, skb->len);
+		__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_INBCAST, skb->len);
 	} else if (skb->pkt_type == PACKET_BROADCAST ||
 		   skb->pkt_type == PACKET_MULTICAST) {
 		struct in_device *in_dev = __in_dev_get_rcu(dev);
@@ -422,11 +423,11 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 
 	net = dev_net(dev);
-	__IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
+	__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
 	if (!skb) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INDISCARDS);
 		goto out;
 	}
 
@@ -452,7 +453,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	BUILD_BUG_ON(IPSTATS_MIB_ECT1PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_1);
 	BUILD_BUG_ON(IPSTATS_MIB_ECT0PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_0);
 	BUILD_BUG_ON(IPSTATS_MIB_CEPKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_CE);
-	__IP_ADD_STATS(net,
+	__IP_ADD_STATS(net, dev,
 		       IPSTATS_MIB_NOECTPKTS + (iph->tos & INET_ECN_MASK),
 		       max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
 
@@ -466,7 +467,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	} else if (len < (iph->ihl*4))
 		goto inhdr_error;
@@ -476,7 +477,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	 * Note this now means skb->len holds ntohs(iph->tot_len).
 	 */
 	if (pskb_trim_rcsum(skb, len)) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -494,9 +495,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 
 csum_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_INHDRERRORS);
 drop:
 	kfree_skb(skb);
 out:
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 94cacae..a0cf93d 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -191,9 +191,9 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	u32 nexthop;
 
 	if (rt->rt_type == RTN_MULTICAST) {
-		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len);
+		IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUTMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST)
-		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
+		IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUTBCAST, skb->len);
 
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
@@ -339,7 +339,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	/*
 	 *	If the indicated interface is up and running, send the packet.
 	 */
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
@@ -397,7 +397,7 @@ int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
@@ -507,7 +507,7 @@ int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
 
 no_route:
 	rcu_read_unlock();
-	IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EHOSTUNREACH;
 }
@@ -548,7 +548,7 @@ static int ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	if (unlikely(!skb->ignore_df ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, skb->dev, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		kfree_skb(skb);
@@ -575,8 +575,11 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	int offset;
 	__be16 not_last_frag;
 	struct rtable *rt = skb_rtable(skb);
+	struct net_device *dev = rt->dst.dev;
 	int err = 0;
 
+	dev_hold(dev);
+
 	/* for offloaded checksums cleanup checksum before fragmentation */
 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
 	    (err = skb_checksum_help(skb)))
@@ -675,7 +678,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			err = output(net, sk, skb);
 
 			if (!err)
-				IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
+				IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGCREATES);
 			if (err || !frag)
 				break;
 
@@ -685,7 +688,8 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		}
 
 		if (err == 0) {
-			IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
+			IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGOKS);
+			dev_put(dev);
 			return 0;
 		}
 
@@ -694,7 +698,8 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			kfree_skb(frag);
 			frag = skb;
 		}
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
+		dev_put(dev);
 		return err;
 
 slow_path_clean:
@@ -811,15 +816,17 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		if (err)
 			goto fail;
 
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGCREATES);
 	}
 	consume_skb(skb);
-	IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
+	IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGOKS);
+	dev_put(dev);
 	return err;
 
 fail:
 	kfree_skb(skb);
-	IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+	IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
+	dev_put(dev);
 	return err;
 }
 EXPORT_SYMBOL(ip_do_fragment);
@@ -1097,7 +1104,7 @@ static int __ip_append_data(struct sock *sk,
 	err = -EFAULT;
 error:
 	cork->length -= length;
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(sock_net(sk), rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
 	return err;
 }
@@ -1305,7 +1312,7 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 
 error:
 	cork->length -= size;
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(sock_net(sk), rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	return err;
 }
 
@@ -1406,7 +1413,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 	skb_dst_set(skb, &rt->dst);
 
 	if (iph->protocol == IPPROTO_ICMP)
-		icmp_out_count(net, ((struct icmphdr *)
+		icmp_out_count(skb_dst(skb)->dev, ((struct icmphdr *)
 			skb_transport_header(skb))->type);
 
 	ip_cork_release(cork);
@@ -1417,13 +1424,14 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 int ip_send_skb(struct net *net, struct sk_buff *skb)
 {
 	int err;
+	struct net_device *dev = skb_dst(skb)->dev;
 
 	err = ip_local_out(net, skb->sk, skb);
 	if (err) {
 		if (err > 0)
 			err = net_xmit_errno(err);
 		if (err)
-			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+			IP_INC_STATS(net, dev, IPSTATS_MIB_OUTDISCARDS);
 	}
 
 	return err;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 2fb4de3..67ec987 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1778,8 +1778,8 @@ static inline int ipmr_forward_finish(struct net *net, struct sock *sk,
 {
 	struct ip_options *opt = &(IPCB(skb)->opt);
 
-	IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
-	IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
+	IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -1862,7 +1862,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 		 * allow to send ICMP, so that packets will disappear
 		 * to blackhole.
 		 */
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
 		ip_rt_put(rt);
 		goto out_free;
 	}
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 05e47d7..1b4db11 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -712,6 +712,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	__be32 saddr, daddr, faddr;
 	u8  tos;
 	int err;
+	struct net_device *dev;
 
 	pr_debug("ping_v4_sendmsg(sk=%p,sk->num=%u)\n", inet, inet->inet_num);
 
@@ -805,7 +806,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		err = PTR_ERR(rt);
 		rt = NULL;
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 		goto out;
 	}
 
@@ -841,13 +842,17 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	release_sock(sk);
 
 out:
+	dev = rt->dst.dev;
+	dev_hold(dev);
 	ip_rt_put(rt);
 	if (free)
 		kfree(ipc.opt);
 	if (!err) {
-		icmp_out_count(sock_net(sk), user_icmph.type);
+		icmp_out_count(dev, user_icmph.type);
+		dev_put(dev);
 		return len;
 	}
+	dev_put(dev);
 	return err;
 
 do_confirm:
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index a058de6..c0c822f 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -134,6 +134,17 @@ static const struct snmp_mib snmp4_ipextstats_list[] = {
 	SNMP_MIB_SENTINEL
 };
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static const struct snmp_mib snmp4_icmp_list[] = {
+	SNMP_MIB_ITEM("InMsgs", ICMP_MIB_INMSGS),
+	SNMP_MIB_ITEM("InErrors", ICMP_MIB_INERRORS),
+	SNMP_MIB_ITEM("OutMsgs", ICMP_MIB_OUTMSGS),
+	SNMP_MIB_ITEM("OutErrors", ICMP_MIB_OUTERRORS),
+	SNMP_MIB_ITEM("InCsumErrors", ICMP_MIB_CSUMERRORS),
+	SNMP_MIB_SENTINEL
+};
+#endif
+
 static const struct {
 	const char *name;
 	int index;
@@ -473,6 +484,109 @@ static const struct file_operations snmp_seq_fops = {
 };
 
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static void snmp_seq_show_item(struct seq_file *seq, void __percpu **pcpumib,
+			       atomic_long_t *smib,
+			       const struct snmp_mib *itemlist,
+			       char *prefix)
+{
+	char name[32];
+	int i;
+	unsigned long val;
+
+	for (i = 0; itemlist[i].name; i++) {
+		val = pcpumib ?
+			snmp_fold_field64(pcpumib, itemlist[i].entry,
+					  offsetof(struct ipstats_mib, syncp)) :
+			atomic_long_read(smib + itemlist[i].entry);
+		snprintf(name, sizeof(name), "%s%s",
+			 prefix, itemlist[i].name);
+		seq_printf(seq, "%-32s\t%lu\n", name, val);
+	}
+}
+
+static void snmp_seq_show_icmpmsg(struct seq_file *seq, atomic_long_t *smib)
+{
+	char name[32];
+	int i;
+	unsigned long val;
+
+	for (i = 0; i < ICMPMSG_MIB_MAX; i++) {
+		val = atomic_long_read(smib + i);
+		if (val) {
+			snprintf(name, sizeof(name), "Icmp%sType%u",
+				 i & 0x100 ? "Out" : "In", i & 0xff);
+			seq_printf(seq, "%-32s\t%lu\n", name, val);
+		}
+	}
+}
+
+static int snmp_dev_seq_show(struct seq_file *seq, void *v)
+{
+	struct in_device *idev = (struct in_device *)seq->private;
+
+	seq_printf(seq, "%-32s\t%u\n", "ifIndex", idev->dev->ifindex);
+
+	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
+
+	snmp_seq_show_item(seq, (void __percpu **)idev->stats.ip, NULL,
+			   snmp4_ipstats_list, "Ip");
+	snmp_seq_show_item(seq, (void __percpu **)idev->stats.ip, NULL,
+			   snmp4_ipextstats_list, "Ip");
+	snmp_seq_show_item(seq, NULL, idev->stats.icmpdev->mibs,
+			   snmp4_icmp_list, "Icmp");
+	snmp_seq_show_icmpmsg(seq, idev->stats.icmpmsgdev->mibs);
+	return 0;
+}
+
+static int snmp_dev_seq_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, snmp_dev_seq_show, PDE_DATA(inode));
+}
+
+static const struct file_operations snmp_dev_seq_fops = {
+	.owner   = THIS_MODULE,
+	.open    = snmp_dev_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = single_release,
+};
+
+int snmp_register_dev(struct in_device *idev)
+{
+	struct proc_dir_entry *p;
+	struct net *net;
+
+	if (!idev || !idev->dev)
+		return -EINVAL;
+
+	net = dev_net(idev->dev);
+	if (!net->mib.proc_net_devsnmp)
+		return -ENOENT;
+
+	p = proc_create_data(idev->dev->name, 0444,
+			     net->mib.proc_net_devsnmp,
+			     &snmp_dev_seq_fops, idev);
+	if (!p)
+		return -ENOMEM;
+
+	idev->stats.proc_dir_entry = p;
+	return 0;
+}
+
+int snmp_unregister_dev(struct in_device *idev)
+{
+	struct net *net = dev_net(idev->dev);
+
+	if (!net->mib.proc_net_devsnmp)
+		return -ENOENT;
+	if (!idev->stats.proc_dir_entry)
+		return -EINVAL;
+	proc_remove(idev->stats.proc_dir_entry);
+	idev->stats.proc_dir_entry = NULL;
+	return 0;
+}
+#endif
 
 /*
  *	Output /proc/net/netstat
@@ -528,9 +642,18 @@ static __net_init int ip_proc_init_net(struct net *net)
 		goto out_netstat;
 	if (!proc_create("snmp", 0444, net->proc_net, &snmp_seq_fops))
 		goto out_snmp;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	net->mib.proc_net_devsnmp = proc_mkdir("dev_snmp", net->proc_net);
+	if (!net->mib.proc_net_devsnmp)
+		goto out_dev_snmp;
+#endif
 
 	return 0;
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+out_dev_snmp:
+	remove_proc_entry("snmp", net->proc_net);
+#endif
 out_snmp:
 	remove_proc_entry("netstat", net->proc_net);
 out_netstat:
@@ -544,6 +667,9 @@ static __net_exit void ip_proc_exit_net(struct net *net)
 	remove_proc_entry("snmp", net->proc_net);
 	remove_proc_entry("netstat", net->proc_net);
 	remove_proc_entry("sockstat", net->proc_net);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	remove_proc_entry("dev_snmp", net->proc_net);
+#endif
 }
 
 static __net_initdata struct pernet_operations ip_proc_ops = {
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 1b4d335..f39f87a 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -425,7 +425,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 		skb->transport_header += iphlen;
 		if (iph->protocol == IPPROTO_ICMP &&
 		    length >= iphlen + sizeof(struct icmphdr))
-			icmp_out_count(net, ((struct icmphdr *)
+			icmp_out_count(rt->dst.dev, ((struct icmphdr *)
 				skb_transport_header(skb))->type);
 	}
 
@@ -442,7 +442,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 error_free:
 	kfree_skb(skb);
 error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	if (err == -ENOBUFS && !inet->recverr)
 		err = 0;
 	return err;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8322e47..5115f0d3 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -961,11 +961,11 @@ static int ip_error(struct sk_buff *skb)
 	if (!IN_DEV_FORWARD(in_dev)) {
 		switch (rt->dst.error) {
 		case EHOSTUNREACH:
-			__IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS);
+			__IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_INADDRERRORS);
 			break;
 
 		case ENETUNREACH:
-			__IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
+			__IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_INNOROUTES);
 			break;
 		}
 		goto out;
@@ -980,7 +980,7 @@ static int ip_error(struct sk_buff *skb)
 		break;
 	case ENETUNREACH:
 		code = ICMP_NET_UNREACH;
-		__IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
+		__IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_INNOROUTES);
 		break;
 	case EACCES:
 		code = ICMP_PKT_FILTERED;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b..df1b989 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -194,7 +194,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		return err;
 	}
 
@@ -402,7 +402,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 				       th->dest, iph->saddr, ntohs(th->source),
 				       inet_iif(icmp_skb), 0);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, icmp_skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 	if (sk->sk_state == TCP_TIME_WAIT) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 24b5c59..a8b6567 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -609,7 +609,7 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 			       iph->saddr, uh->source, skb->dev->ifindex, 0,
 			       udptable, NULL);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;	/* No socket for error */
 	}
 
@@ -1008,7 +1008,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			err = PTR_ERR(rt);
 			rt = NULL;
 			if (err == -ENETUNREACH)
-				IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+				IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 			goto out;
 		}
 
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index a9c05b2..b52b2e3 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -381,7 +381,7 @@ static int l2tp_ip_backlog_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 
 drop:
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_INDISCARDS);
+	IP_INC_STATS(sock_net(sk), skb->dev, IPSTATS_MIB_INDISCARDS);
 	kfree_skb(skb);
 	return 0;
 }
@@ -504,7 +504,7 @@ static int l2tp_ip_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 no_route:
 	rcu_read_unlock();
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+	IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	rc = -EHOSTUNREACH;
 	goto out;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 9573691..cf2172d 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -462,7 +462,7 @@ static int l2tp_ip6_backlog_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 
 drop:
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_INDISCARDS);
+	IP_INC_STATS(sock_net(sk), skb->dev, IPSTATS_MIB_INDISCARDS);
 	kfree_skb(skb);
 	return -1;
 }
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 7a4de6d..c629239 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -141,7 +141,7 @@ void mpls_stats_inc_outucastpkts(struct net_device *dev,
 					   tx_packets,
 					   tx_bytes);
 	} else if (skb->protocol == htons(ETH_P_IP)) {
-		IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+		IP_UPD_PO_STATS(dev_net(dev), dev, IPSTATS_MIB_OUT, skb->len);
 #if IS_ENABLED(CONFIG_IPV6)
 	} else if (skb->protocol == htons(ETH_P_IPV6)) {
 		struct inet6_dev *in6dev = __in6_dev_get(dev);
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 4527921..32bd3af 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 	{
 		if (ip_hdr(skb)->ttl <= 1) {
 			/* Tell the sender its packet died... */
-			__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+			__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);
 			icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 			return false;
 		}
diff --git a/net/sctp/input.c b/net/sctp/input.c
index ba8a6e6..fef625a 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -596,7 +596,7 @@ void sctp_v4_err(struct sk_buff *skb, __u32 info)
 	skb->network_header = saveip;
 	skb->transport_header = savesctp;
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 	/* Warning:  The sock lock is held.  Remember to call
diff --git a/net/sctp/output.c b/net/sctp/output.c
index d6e1c90..a2a27fa 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -600,7 +600,7 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
 	/* drop packet if no dst */
 	dst = dst_clone(tp->dst);
 	if (!dst) {
-		IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+		IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		kfree_skb(head);
 		goto out;
 	}
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH] net: samsung: sxgbe: Replace mdelay with usleep_range in sxgbe_sw_reset
From: kbuild test robot @ 2018-04-11 15:52 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: kbuild-all, bh74.an, ks.giri, vipul.pandya, netdev, linux-kernel,
	Jia-Ju Bai
In-Reply-To: <1523413280-2336-1-git-send-email-baijiaju1990@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1639 bytes --]

Hi Jia-Ju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.16 next-20180411]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jia-Ju-Bai/net-samsung-sxgbe-Replace-mdelay-with-usleep_range-in-sxgbe_sw_reset/20180411-225900
config: i386-randconfig-s1-201814 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c: In function 'sxgbe_sw_reset':
>> drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c:2039:3: error: implicit declaration of function 'usleep' [-Werror=implicit-function-declaration]
      usleep(10000, 11000);
      ^~~~~~
   cc1: some warnings being treated as errors

vim +/usleep +2039 drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c

  2029	
  2030	static int sxgbe_sw_reset(void __iomem *addr)
  2031	{
  2032		int retry_count = 10;
  2033	
  2034		writel(SXGBE_DMA_SOFT_RESET, addr + SXGBE_DMA_MODE_REG);
  2035		while (retry_count--) {
  2036			if (!(readl(addr + SXGBE_DMA_MODE_REG) &
  2037			      SXGBE_DMA_SOFT_RESET))
  2038				break;
> 2039			usleep(10000, 11000);
  2040		}
  2041	
  2042		if (retry_count < 0)
  2043			return -EBUSY;
  2044	
  2045		return 0;
  2046	}
  2047	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24019 bytes --]

^ permalink raw reply

* [PATCH v2] net: samsung: sxgbe: Replace mdelay with usleep_range in sxgbe_sw_reset
From: Jia-Ju Bai @ 2018-04-11 15:51 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya; +Cc: netdev, linux-kernel, Jia-Ju Bai

sxgbe_sw_reset() is never called in atomic context.

sxgbe_sw_reset() is only called by sxgbe_drv_probe(), which is 
only called by sxgbe_platform_probe().
sxgbe_platform_probe() is set as ".probe" in struct platform_driver.
This function is not called in atomic context.

Despite never getting called from atomic context, sxgbe_sw_reset()
calls mdelay() to busily wait.
This is not necessary and can be replaced with usleep_range() to
avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
v2:
* Use usleep_range() to correct usleep() in v1.

---	
 drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
index 89831ad..99cd586 100644
--- a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
+++ b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
@@ -2038,7 +2038,7 @@ static int sxgbe_sw_reset(void __iomem *addr)
 		if (!(readl(addr + SXGBE_DMA_MODE_REG) &
 		      SXGBE_DMA_SOFT_RESET))
 			break;
-		mdelay(10);
+		usleep_range(10000, 11000);
 	}
 
 	if (retry_count < 0)
-- 
1.9.1

^ permalink raw reply related

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-11 15:51 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <1523386790-12396-3-git-send-email-sridhar.samudrala@intel.com>

Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. existing netvsc driver that uses 2 netdev model. In this model, no
>master netdev is created. The paravirtual driver registers each bypass
>instance along with a set of ops to manage the slave events.
>     bypass_master_register()
>     bypass_master_unregister()
>2. new virtio_net based solution that uses 3 netdev model. In this model,
>the bypass module provides interfaces to create/destroy additional master
>netdev and all the slave events are managed internally.
>      bypass_master_create()
>      bypass_master_destroy()
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> include/linux/netdevice.h |  14 +
> include/net/bypass.h      |  96 ++++++
> net/Kconfig               |  18 +
> net/core/Makefile         |   1 +
> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 973 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index cf44503ea81a..587293728f70 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> 	IFF_PHONY_HEADROOM		= 1<<24,
> 	IFF_MACSEC			= 1<<25,
> 	IFF_NO_RX_HANDLER		= 1<<26,
>+	IFF_BYPASS			= 1 << 27,
>+	IFF_BYPASS_SLAVE		= 1 << 28,

I wonder, why you don't follow the existing coding style... Also, please
add these to into the comment above.


> };
> 
> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
> #define IFF_MACSEC			IFF_MACSEC
> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>+#define IFF_BYPASS			IFF_BYPASS
>+#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
> 
> /**
>  *	struct net_device - The DEVICE structure.
>@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
> }
> 
>+static inline bool netif_is_bypass_master(const struct net_device *dev)
>+{
>+	return dev->priv_flags & IFF_BYPASS;
>+}
>+
>+static inline bool netif_is_bypass_slave(const struct net_device *dev)
>+{
>+	return dev->priv_flags & IFF_BYPASS_SLAVE;
>+}
>+
> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> static inline void netif_keep_dst(struct net_device *dev)
> {
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index 000000000000..86b02cb894cf
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,96 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include <linux/netdevice.h>
>+
>+struct bypass_ops {
>+	int (*slave_pre_register)(struct net_device *slave_netdev,
>+				  struct net_device *bypass_netdev);
>+	int (*slave_join)(struct net_device *slave_netdev,
>+			  struct net_device *bypass_netdev);
>+	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>+				    struct net_device *bypass_netdev);
>+	int (*slave_release)(struct net_device *slave_netdev,
>+			     struct net_device *bypass_netdev);
>+	int (*slave_link_change)(struct net_device *slave_netdev,
>+				 struct net_device *bypass_netdev);
>+	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_master {
>+	struct list_head list;
>+	struct net_device __rcu *bypass_netdev;
>+	struct bypass_ops __rcu *ops;
>+};
>+
>+/* bypass state */
>+struct bypass_info {
>+	/* passthru netdev with same MAC */
>+	struct net_device __rcu *active_netdev;

You still use "active"/"backup" names which is highly misleading as
it has completely different meaning that in bond for example.
I noted that in my previous review already. Please change it.


>+
>+	/* virtio_net netdev */
>+	struct net_device __rcu *backup_netdev;
>+
>+	/* active netdev stats */
>+	struct rtnl_link_stats64 active_stats;
>+
>+	/* backup netdev stats */
>+	struct rtnl_link_stats64 backup_stats;
>+
>+	/* aggregated stats */
>+	struct rtnl_link_stats64 bypass_stats;
>+
>+	/* spinlock while updating stats */
>+	spinlock_t stats_lock;
>+};
>+
>+#if IS_ENABLED(CONFIG_NET_BYPASS)
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master);
>+void bypass_master_destroy(struct bypass_master *bypass_master);
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct bypass_master **pbypass_master);
>+void bypass_master_unregister(struct bypass_master *bypass_master);
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev);
>+
>+#else
>+
>+static inline
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master);
>+{
>+	return 0;
>+}
>+
>+static inline
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct pbypass_master **pbypass_master);
>+{
>+	return 0;
>+}
>+
>+static inline
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+	return 0;
>+}
>+
>+#endif
>+
>+#endif /* _NET_BYPASS_H */
>diff --git a/net/Kconfig b/net/Kconfig
>index 0428f12c25c2..994445f4a96a 100644
>--- a/net/Kconfig
>+++ b/net/Kconfig
>@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
> 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
> 	  devlink is a loadable module and the driver using it is built-in.
> 
>+config NET_BYPASS
>+	tristate "Bypass interface"
>+	---help---
>+	  This provides a generic interface for paravirtual drivers to listen
>+	  for netdev register/unregister/link change events from pci ethernet
>+	  devices with the same MAC and takeover their datapath. This also
>+	  enables live migration of a VM with direct attached VF by failing
>+	  over to the paravirtual datapath when the VF is unplugged.
>+
>+config MAY_USE_BYPASS
>+	tristate
>+	default m if NET_BYPASS=m
>+	default y if NET_BYPASS=y || NET_BYPASS=n
>+	help
>+	  Drivers using the bypass infrastructure should have a dependency
>+	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>+	  bypass is a loadable module and the driver using it is built-in.
>+
> endif   # if NET
> 
> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>diff --git a/net/core/Makefile b/net/core/Makefile
>index 6dbbba8c57ae..a9727ed1c8fc 100644
>--- a/net/core/Makefile
>+++ b/net/core/Makefile
>@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
> obj-$(CONFIG_HWBM) += hwbm.o
> obj-$(CONFIG_NET_DEVLINK) += devlink.o
> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>+obj-$(CONFIG_NET_BYPASS) += bypass.o
>diff --git a/net/core/bypass.c b/net/core/bypass.c
>new file mode 100644
>index 000000000000..b5b9cb554c3f
>--- /dev/null
>+++ b/net/core/bypass.c
>@@ -0,0 +1,844 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+/* A common module to handle registrations and notifications for paravirtual
>+ * drivers to enable accelerated datapath and support VF live migration.
>+ *
>+ * The notifier and event handling code is based on netvsc driver.
>+ */
>+
>+#include <linux/netdevice.h>
>+#include <linux/etherdevice.h>
>+#include <linux/ethtool.h>
>+#include <linux/module.h>
>+#include <linux/slab.h>
>+#include <linux/netdevice.h>
>+#include <linux/netpoll.h>
>+#include <linux/rtnetlink.h>
>+#include <linux/if_vlan.h>
>+#include <linux/pci.h>
>+#include <net/sch_generic.h>
>+#include <uapi/linux/if_arp.h>
>+#include <net/bypass.h>
>+
>+static LIST_HEAD(bypass_master_list);
>+static DEFINE_SPINLOCK(bypass_lock);
>+
>+static int bypass_slave_pre_register(struct net_device *slave_netdev,
>+				     struct net_device *bypass_netdev,
>+				     struct bypass_ops *bypass_ops)
>+{
>+	struct bypass_info *bi;
>+	bool backup;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_pre_register)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_pre_register(slave_netdev,
>+						      bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+	if (backup ? rtnl_dereference(bi->backup_netdev) :
>+			rtnl_dereference(bi->active_netdev)) {
>+		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>+			   slave_netdev->name, backup ? "backup" : "active");
>+		return -EEXIST;
>+	}
>+
>+	/* Avoid non pci devices as active netdev */
>+	if (!backup && (!slave_netdev->dev.parent ||
>+			!dev_is_pci(slave_netdev->dev.parent)))
>+		return -EINVAL;
>+
>+	return 0;
>+}
>+
>+static int bypass_slave_join(struct net_device *slave_netdev,
>+			     struct net_device *bypass_netdev,
>+			     struct bypass_ops *bypass_ops)
>+{
>+	struct bypass_info *bi;
>+	bool backup;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_join)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+
>+	dev_hold(slave_netdev);
>+
>+	if (backup) {
>+		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>+		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>+	} else {
>+		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>+		dev_get_stats(bi->active_netdev, &bi->active_stats);
>+		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>+		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>+	}
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>+		    slave_netdev->name);
>+
>+	return 0;
>+}
>+
>+/* Called when slave dev is injecting data into network stack.
>+ * Change the associated network device from lower dev to virtio.
>+ * note: already called with rcu_read_lock
>+ */
>+static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>+{
>+	struct sk_buff *skb = *pskb;
>+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>+
>+	skb->dev = ndev;
>+
>+	return RX_HANDLER_ANOTHER;
>+}
>+
>+static struct net_device *bypass_master_get_bymac(u8 *mac,
>+						  struct bypass_ops **ops)
>+{
>+	struct bypass_master *bypass_master;
>+	struct net_device *bypass_netdev;
>+
>+	spin_lock(&bypass_lock);
>+	list_for_each_entry(bypass_master, &bypass_master_list, list) {

As I wrote the last time, you don't need this list, spinlock.
You can do just something like:
        for_each_net(net) {
                for_each_netdev(net, dev) {
			if (netif_is_bypass_master(dev)) {




>+		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>+			*ops = rcu_dereference(bypass_master->ops);

I don't see how rcu_dereference is ok here.
1) I don't see rcu_read_lock taken
2) Looks like bypass_master->ops has the same value across the whole
   existence.


>+			spin_unlock(&bypass_lock);
>+			return bypass_netdev;
>+		}
>+	}
>+	spin_unlock(&bypass_lock);
>+	return NULL;
>+}
>+
>+static int bypass_slave_register(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev;
>+	struct bypass_ops *bypass_ops;
>+	int ret, orig_mtu;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);

For master, could you use word "master" in the variables so it is clear?
Also, "dev" is fine instead of "netdev".
Something like "bpmaster_dev"


>+	if (!bypass_netdev)
>+		goto done;
>+
>+	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>+					bypass_ops);
>+	if (ret != 0)

	Just "if (ret)" will do. You have this on more places.


>+		goto done;
>+
>+	ret = netdev_rx_handler_register(slave_netdev,
>+					 bypass_ops ? bypass_ops->handle_frame :
>+					 bypass_handle_frame, bypass_netdev);
>+	if (ret != 0) {
>+		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>+			   ret);
>+		goto done;
>+	}
>+
>+	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>+	if (ret != 0) {
>+		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>+			   bypass_netdev->name, ret);
>+		goto upper_link_failed;
>+	}
>+
>+	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>+
>+	if (netif_running(bypass_netdev)) {
>+		ret = dev_open(slave_netdev);
>+		if (ret && (ret != -EBUSY)) {
>+			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>+				   slave_netdev->name, ret);
>+			goto err_interface_up;
>+		}
>+	}
>+
>+	/* Align MTU of slave with master */
>+	orig_mtu = slave_netdev->mtu;
>+	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>+	if (ret != 0) {
>+		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>+			   slave_netdev->name, bypass_netdev->mtu);
>+		goto err_set_mtu;
>+	}
>+
>+	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>+	if (ret != 0)
>+		goto err_join;
>+
>+	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>+		    slave_netdev->name);
>+
>+	goto done;
>+
>+err_join:
>+	dev_set_mtu(slave_netdev, orig_mtu);
>+err_set_mtu:
>+	dev_close(slave_netdev);
>+err_interface_up:
>+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+upper_link_failed:
>+	netdev_rx_handler_unregister(slave_netdev);
>+done:
>+	return NOTIFY_DONE;
>+}
>+
>+static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>+				       struct net_device *bypass_netdev,
>+				       struct bypass_ops *bypass_ops)
>+{
>+	struct net_device *backup_netdev, *active_netdev;
>+	struct bypass_info *bi;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_pre_unregister)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_pre_unregister(slave_netdev,
>+							bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+		return -EINVAL;
>+
>+	return 0;
>+}
>+
>+static int bypass_slave_release(struct net_device *slave_netdev,
>+				struct net_device *bypass_netdev,
>+				struct bypass_ops *bypass_ops)
>+{
>+	struct net_device *backup_netdev, *active_netdev;
>+	struct bypass_info *bi;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_release)
>+			return -EINVAL;

I think it would be good to make the API to the driver more strict and
have a separate set of ops for "active" and "backup" netdevices.
That should stop people thinking about extending this to more slaves in
the future.



>+
>+		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev == backup_netdev) {
>+		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>+	} else {
>+		RCU_INIT_POINTER(bi->active_netdev, NULL);
>+		if (backup_netdev) {
>+			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+		}
>+	}
>+
>+	dev_put(slave_netdev);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>+		    slave_netdev->name);
>+
>+	return 0;
>+}
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev;
>+	struct bypass_ops *bypass_ops;
>+	int ret;
>+
>+	if (!netif_is_bypass_slave(slave_netdev))
>+		goto done;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);
>+	if (!bypass_netdev)
>+		goto done;
>+
>+	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>+					  bypass_ops);
>+	if (ret != 0)
>+		goto done;
>+
>+	netdev_rx_handler_unregister(slave_netdev);
>+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+
>+	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>+		    slave_netdev->name);
>+
>+done:
>+	return NOTIFY_DONE;
>+}
>+EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>+
>+static bool bypass_xmit_ready(struct net_device *dev)
>+{
>+	return netif_running(dev) && netif_carrier_ok(dev);
>+}
>+
>+static int bypass_slave_link_change(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>+	struct bypass_ops *bypass_ops;
>+	struct bypass_info *bi;
>+
>+	if (!netif_is_bypass_slave(slave_netdev))
>+		goto done;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);
>+	if (!bypass_netdev)
>+		goto done;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_link_change)
>+			goto done;
>+
>+		return bypass_ops->slave_link_change(slave_netdev,
>+						     bypass_netdev);
>+	}
>+
>+	if (!netif_running(bypass_netdev))
>+		return 0;
>+
>+	bi = netdev_priv(bypass_netdev);
>+
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+		goto done;

You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
above is enough.


>+
>+	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>+	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>+		netif_carrier_on(bypass_netdev);
>+		netif_tx_wake_all_queues(bypass_netdev);
>+	} else {
>+		netif_carrier_off(bypass_netdev);
>+		netif_tx_stop_all_queues(bypass_netdev);
>+	}
>+
>+done:
>+	return NOTIFY_DONE;
>+}
>+
>+static bool bypass_validate_event_dev(struct net_device *dev)
>+{
>+	/* Skip parent events */
>+	if (netif_is_bypass_master(dev))
>+		return false;
>+
>+	/* Avoid non-Ethernet type devices */
>+	if (dev->type != ARPHRD_ETHER)
>+		return false;
>+
>+	/* Avoid Vlan dev with same MAC registering as VF */
>+	if (is_vlan_dev(dev))
>+		return false;
>+
>+	/* Avoid Bonding master dev with same MAC registering as slave dev */
>+	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))

Yeah, this is certainly incorrect. One thing is, you should be using the
helpers netif_is_bond_master().
But what about the rest? macsec, macvlan, team, bridge, ovs and others?

You need to do it not by blacklisting, but with whitelisting. You need
to whitelist VF devices. My port flavours patchset might help with this.


>+		return false;
>+
>+	return true;
>+}
>+
>+static int
>+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>+{
>+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>+
>+	if (!bypass_validate_event_dev(event_dev))
>+		return NOTIFY_DONE;
>+
>+	switch (event) {
>+	case NETDEV_REGISTER:
>+		return bypass_slave_register(event_dev);
>+	case NETDEV_UNREGISTER:
>+		return bypass_slave_unregister(event_dev);
>+	case NETDEV_UP:
>+	case NETDEV_DOWN:
>+	case NETDEV_CHANGE:
>+		return bypass_slave_link_change(event_dev);
>+	default:
>+		return NOTIFY_DONE;
>+	}
>+}
>+
>+static struct notifier_block bypass_notifier = {
>+	.notifier_call = bypass_event,
>+};
>+
>+int bypass_open(struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *active_netdev, *backup_netdev;
>+	int err;
>+
>+	netif_carrier_off(dev);
>+	netif_tx_wake_all_queues(dev);
>+
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	if (active_netdev) {
>+		err = dev_open(active_netdev);
>+		if (err)
>+			goto err_active_open;
>+	}
>+
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+	if (backup_netdev) {
>+		err = dev_open(backup_netdev);
>+		if (err)
>+			goto err_backup_open;
>+	}
>+
>+	return 0;
>+
>+err_backup_open:
>+	dev_close(active_netdev);
>+err_active_open:
>+	netif_tx_disable(dev);
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_open);
>+
>+int bypass_close(struct net_device *dev)
>+{
>+	struct bypass_info *vi = netdev_priv(dev);

This should be probably "bi"


>+	struct net_device *slave_netdev;
>+
>+	netif_tx_disable(dev);
>+
>+	slave_netdev = rtnl_dereference(vi->active_netdev);
>+	if (slave_netdev)
>+		dev_close(slave_netdev);
>+
>+	slave_netdev = rtnl_dereference(vi->backup_netdev);
>+	if (slave_netdev)
>+		dev_close(slave_netdev);
>+
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_close);
>+
>+static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+	atomic_long_inc(&dev->tx_dropped);
>+	dev_kfree_skb_any(skb);
>+	return NETDEV_TX_OK;
>+}
>+
>+netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);

If you rename the other variable to "bpmaster_dev", it would be nice to
rename this to bpinfo or something more descriptive. "bi" is too short
to know what that is right away.


>+	struct net_device *xmit_dev;

Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.



>+
>+	/* Try xmit via active netdev followed by backup netdev */
>+	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>+	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>+		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>+		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>+			return bypass_drop_xmit(skb, dev);
>+	}
>+
>+	skb->dev = xmit_dev;
>+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>+
>+	return dev_queue_xmit(skb);
>+}
>+EXPORT_SYMBOL_GPL(bypass_start_xmit);
>+
>+u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>+			void *accel_priv, select_queue_fallback_t fallback)
>+{
>+	/* This helper function exists to help dev_pick_tx get the correct
>+	 * destination queue.  Using a helper function skips a call to
>+	 * skb_tx_hash and will put the skbs in the queue we expect on their
>+	 * way down to the bonding driver.
>+	 */
>+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>+
>+	/* Save the original txq to restore before passing to the driver */
>+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>+
>+	if (unlikely(txq >= dev->real_num_tx_queues)) {
>+		do {
>+			txq -= dev->real_num_tx_queues;
>+		} while (txq >= dev->real_num_tx_queues);
>+	}
>+
>+	return txq;
>+}
>+EXPORT_SYMBOL_GPL(bypass_select_queue);
>+
>+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>+ * that some drivers can provide 32bit values only.
>+ */
>+static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>+			      const struct rtnl_link_stats64 *_new,
>+			      const struct rtnl_link_stats64 *_old)
>+{
>+	const u64 *new = (const u64 *)_new;
>+	const u64 *old = (const u64 *)_old;
>+	u64 *res = (u64 *)_res;
>+	int i;
>+
>+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>+		u64 nv = new[i];
>+		u64 ov = old[i];
>+		s64 delta = nv - ov;
>+
>+		/* detects if this particular field is 32bit only */
>+		if (((nv | ov) >> 32) == 0)
>+			delta = (s64)(s32)((u32)nv - (u32)ov);
>+
>+		/* filter anomalies, some drivers reset their stats
>+		 * at down/up events.
>+		 */
>+		if (delta > 0)
>+			res[i] += delta;
>+	}
>+}
>+
>+void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);

You can WARN_ON and return in case the dev is not bypass master, just
to catch buggy drivers. Same with other helpers.


>+	const struct rtnl_link_stats64 *new;
>+	struct rtnl_link_stats64 temp;
>+	struct net_device *slave_netdev;
>+
>+	spin_lock(&bi->stats_lock);
>+	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>+
>+	rcu_read_lock();
>+
>+	slave_netdev = rcu_dereference(bi->active_netdev);
>+	if (slave_netdev) {
>+		new = dev_get_stats(slave_netdev, &temp);
>+		bypass_fold_stats(stats, new, &bi->active_stats);
>+		memcpy(&bi->active_stats, new, sizeof(*new));
>+	}
>+
>+	slave_netdev = rcu_dereference(bi->backup_netdev);
>+	if (slave_netdev) {
>+		new = dev_get_stats(slave_netdev, &temp);
>+		bypass_fold_stats(stats, new, &bi->backup_stats);
>+		memcpy(&bi->backup_stats, new, sizeof(*new));
>+	}
>+
>+	rcu_read_unlock();
>+
>+	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>+	spin_unlock(&bi->stats_lock);
>+}
>+EXPORT_SYMBOL_GPL(bypass_get_stats);
>+
>+int bypass_change_mtu(struct net_device *dev, int new_mtu)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *active_netdev, *backup_netdev;
>+	int ret = 0;

Pointless initialization.


>+
>+	active_netdev = rcu_dereference(bi->active_netdev);
>+	if (active_netdev) {
>+		ret = dev_set_mtu(active_netdev, new_mtu);
>+		if (ret)
>+			return ret;
>+	}
>+
>+	backup_netdev = rcu_dereference(bi->backup_netdev);
>+	if (backup_netdev) {
>+		ret = dev_set_mtu(backup_netdev, new_mtu);
>+		if (ret) {
>+			dev_set_mtu(active_netdev, dev->mtu);
>+			return ret;
>+		}
>+	}
>+
>+	dev->mtu = new_mtu;
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_change_mtu);
>+
>+void bypass_set_rx_mode(struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *slave_netdev;
>+
>+	rcu_read_lock();
>+
>+	slave_netdev = rcu_dereference(bi->active_netdev);
>+	if (slave_netdev) {
>+		dev_uc_sync_multiple(slave_netdev, dev);
>+		dev_mc_sync_multiple(slave_netdev, dev);
>+	}
>+
>+	slave_netdev = rcu_dereference(bi->backup_netdev);
>+	if (slave_netdev) {
>+		dev_uc_sync_multiple(slave_netdev, dev);
>+		dev_mc_sync_multiple(slave_netdev, dev);
>+	}
>+
>+	rcu_read_unlock();
>+}
>+EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>+
>+static const struct net_device_ops bypass_netdev_ops = {
>+	.ndo_open		= bypass_open,
>+	.ndo_stop		= bypass_close,
>+	.ndo_start_xmit		= bypass_start_xmit,
>+	.ndo_select_queue	= bypass_select_queue,
>+	.ndo_get_stats64	= bypass_get_stats,
>+	.ndo_change_mtu		= bypass_change_mtu,
>+	.ndo_set_rx_mode	= bypass_set_rx_mode,
>+	.ndo_validate_addr	= eth_validate_addr,
>+	.ndo_features_check	= passthru_features_check,
>+};
>+
>+#define BYPASS_DRV_NAME "bypass"
>+#define BYPASS_DRV_VERSION "0.1"
>+
>+static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>+				       struct ethtool_drvinfo *drvinfo)
>+{
>+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>+}
>+
>+int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>+				      struct ethtool_link_ksettings *cmd)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *slave_netdev;
>+
>+	slave_netdev = rtnl_dereference(bi->active_netdev);
>+	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+		slave_netdev = rtnl_dereference(bi->backup_netdev);
>+		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+			cmd->base.duplex = DUPLEX_UNKNOWN;
>+			cmd->base.port = PORT_OTHER;
>+			cmd->base.speed = SPEED_UNKNOWN;
>+
>+			return 0;
>+		}
>+	}
>+
>+	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>+}
>+EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>+
>+static const struct ethtool_ops bypass_ethtool_ops = {
>+	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>+	.get_link               = ethtool_op_get_link,
>+	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>+};
>+
>+static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>+{
>+	struct net *net = dev_net(bypass_netdev);
>+	struct net_device *dev;
>+
>+	rtnl_lock();
>+	for_each_netdev(net, dev) {
>+		if (dev == bypass_netdev)
>+			continue;
>+		if (!bypass_validate_event_dev(dev))
>+			continue;
>+		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>+			bypass_slave_register(dev);
>+	}
>+	rtnl_unlock();
>+}
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct bypass_master **pbypass_master)
>+{
>+	struct bypass_master *bypass_master;
>+
>+	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>+	if (!bypass_master)
>+		return -ENOMEM;
>+
>+	rcu_assign_pointer(bypass_master->ops, ops);
>+	dev_hold(dev);
>+	dev->priv_flags |= IFF_BYPASS;
>+	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>+
>+	spin_lock(&bypass_lock);
>+	list_add_tail(&bypass_master->list, &bypass_master_list);
>+	spin_unlock(&bypass_lock);
>+
>+	bypass_register_existing_slave(dev);
>+
>+	*pbypass_master = bypass_master;
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_register);
>+
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+	struct net_device *bypass_netdev;
>+
>+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+
>+	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>+	dev_put(bypass_netdev);
>+
>+	spin_lock(&bypass_lock);
>+	list_del(&bypass_master->list);
>+	spin_unlock(&bypass_lock);
>+
>+	kfree(bypass_master);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_unregister);
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master)
>+{
>+	struct device *dev = backup_netdev->dev.parent;
>+	struct net_device *bypass_netdev;
>+	int err;
>+
>+	/* Alloc at least 2 queues, for now we are going with 16 assuming
>+	 * that most devices being bonded won't have too many queues.
>+	 */
>+	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>+	if (!bypass_netdev) {
>+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>+		return -ENOMEM;
>+	}
>+
>+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>+	SET_NETDEV_DEV(bypass_netdev, dev);
>+
>+	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>+	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>+
>+	/* Initialize the device options */
>+	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>+				       IFF_TX_SKB_SHARING);
>+
>+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>+	bypass_netdev->features |= NETIF_F_LLTX;
>+
>+	/* Don't allow bypass devices to change network namespaces. */
>+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>+
>+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>+
>+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>+	bypass_netdev->features |= bypass_netdev->hw_features;
>+
>+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>+	       bypass_netdev->addr_len);
>+
>+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+
>+	err = register_netdev(bypass_netdev);
>+	if (err < 0) {
>+		dev_err(dev, "Unable to register bypass_netdev!\n");
>+		goto err_register_netdev;
>+	}
>+
>+	netif_carrier_off(bypass_netdev);
>+
>+	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>+	if (err < 0)

just "if (err)" would do.


>+		goto err_bypass;
>+
>+	return 0;
>+
>+err_bypass:
>+	unregister_netdev(bypass_netdev);
>+err_register_netdev:
>+	free_netdev(bypass_netdev);
>+
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_create);
>+
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+	struct net_device *bypass_netdev;
>+	struct net_device *slave_netdev;
>+	struct bypass_info *bi;
>+
>+	if (!bypass_master)
>+		return;
>+
>+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+	bi = netdev_priv(bypass_netdev);
>+
>+	netif_device_detach(bypass_netdev);
>+
>+	rtnl_lock();
>+
>+	slave_netdev = rtnl_dereference(bi->active_netdev);
>+	if (slave_netdev)
>+		bypass_slave_unregister(slave_netdev);
>+
>+	slave_netdev = rtnl_dereference(bi->backup_netdev);
>+	if (slave_netdev)
>+		bypass_slave_unregister(slave_netdev);
>+
>+	bypass_master_unregister(bypass_master);
>+
>+	unregister_netdevice(bypass_netdev);
>+
>+	rtnl_unlock();
>+
>+	free_netdev(bypass_netdev);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_destroy);
>+
>+static __init int
>+bypass_init(void)
>+{
>+	register_netdevice_notifier(&bypass_notifier);
>+
>+	return 0;
>+}
>+module_init(bypass_init);
>+
>+static __exit
>+void bypass_exit(void)
>+{
>+	unregister_netdevice_notifier(&bypass_notifier);
>+}
>+module_exit(bypass_exit);
>+
>+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>+MODULE_LICENSE("GPL v2");
>-- 
>2.14.3
>

^ permalink raw reply

* [PATCH net v2 6/6] bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

When open fails during ethtool -L ring change, for example, the driver
may crash at bnxt_free_irq() because bp->bnapi is NULL.

If we fail to allocate all the new rings, bnxt_open_nic() will free
all the memory including bp->bnapi.  Subsequent call to bnxt_close_nic()
will try to dereference bp->bnapi in bnxt_free_irq().

Fix it by checking for !bp->bnapi in bnxt_free_irq().

Fixes: e5811b8c09df ("bnxt_en: Add IRQ remapping logic.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9cb8b4b..f83769d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6090,7 +6090,7 @@ static void bnxt_free_irq(struct bnxt *bp)
 	free_irq_cpu_rmap(bp->dev->rx_cpu_rmap);
 	bp->dev->rx_cpu_rmap = NULL;
 #endif
-	if (!bp->irq_tbl)
+	if (!bp->irq_tbl || !bp->bnapi)
 		return;
 
 	for (i = 0; i < bp->cp_nr_rings; i++) {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 5/6] bnxt_en: Need to include RDMA rings in bnxt_check_rings().
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

With recent changes to reserve both L2 and RDMA rings, we need to include
the RDMA rings in bnxt_check_rings().  Otherwise we will under-estimate
the rings we need during ethtool -L and may lead to failure.

Fixes: fbcfc8e46741 ("bnxt_en: Reserve completion rings and MSIX for bnxt_re RDMA driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1991f0c..9cb8b4b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7686,6 +7686,8 @@ int bnxt_check_rings(struct bnxt *bp, int tx, int rx, bool sh, int tcs,
 	if (bp->flags & BNXT_FLAG_AGG_RINGS)
 		rx_rings <<= 1;
 	cp = sh ? max_t(int, tx_rings_needed, rx) : tx_rings_needed + rx;
+	if (bp->flags & BNXT_FLAG_NEW_RM)
+		cp += bnxt_get_ulp_msix_num(bp);
 	return bnxt_hwrm_check_rings(bp, tx_rings_needed, rx_rings, rx, cp,
 				     vnics);
 }
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 4/6] bnxt_en: Support max-mtu with VF-reps
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, Sriharsha Basavapatna
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

From: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>

While a VF is configured with a bigger mtu (> 1500), any packets that
are punted to the VF-rep (slow-path) get dropped by OVS kernel-datapath
with the following message: "dropped over-mtu packet". Fix this by
returning the max-mtu value for a VF-rep derived from its corresponding VF.
VF-rep's mtu can be changed using 'ip' command as shown in this example:

	$ ip link set bnxt0_pf0vf0 mtu 9000

Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 30 +++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
index 2629040..38f635c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
@@ -64,6 +64,31 @@ static int hwrm_cfa_vfr_free(struct bnxt *bp, u16 vf_idx)
 	return rc;
 }
 
+static int bnxt_hwrm_vfr_qcfg(struct bnxt *bp, struct bnxt_vf_rep *vf_rep,
+			      u16 *max_mtu)
+{
+	struct hwrm_func_qcfg_output *resp = bp->hwrm_cmd_resp_addr;
+	struct hwrm_func_qcfg_input req = {0};
+	u16 mtu;
+	int rc;
+
+	bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_QCFG, -1, -1);
+	req.fid = cpu_to_le16(bp->pf.vf[vf_rep->vf_idx].fw_fid);
+
+	mutex_lock(&bp->hwrm_cmd_lock);
+
+	rc = _hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+	if (!rc) {
+		mtu = le16_to_cpu(resp->max_mtu_configured);
+		if (!mtu)
+			*max_mtu = BNXT_MAX_MTU;
+		else
+			*max_mtu = mtu;
+	}
+	mutex_unlock(&bp->hwrm_cmd_lock);
+	return rc;
+}
+
 static int bnxt_vf_rep_open(struct net_device *dev)
 {
 	struct bnxt_vf_rep *vf_rep = netdev_priv(dev);
@@ -365,6 +390,7 @@ static void bnxt_vf_rep_netdev_init(struct bnxt *bp, struct bnxt_vf_rep *vf_rep,
 				    struct net_device *dev)
 {
 	struct net_device *pf_dev = bp->dev;
+	u16 max_mtu;
 
 	dev->netdev_ops = &bnxt_vf_rep_netdev_ops;
 	dev->ethtool_ops = &bnxt_vf_rep_ethtool_ops;
@@ -380,6 +406,10 @@ static void bnxt_vf_rep_netdev_init(struct bnxt *bp, struct bnxt_vf_rep *vf_rep,
 	bnxt_vf_rep_eth_addr_gen(bp->pf.mac_addr, vf_rep->vf_idx,
 				 dev->perm_addr);
 	ether_addr_copy(dev->dev_addr, dev->perm_addr);
+	/* Set VF-Rep's max-mtu to the corresponding VF's max-mtu */
+	if (!bnxt_hwrm_vfr_qcfg(bp, vf_rep, &max_mtu))
+		dev->max_mtu = max_mtu;
+	dev->min_mtu = ETH_ZLEN;
 }
 
 static int bnxt_pcie_dsn_get(struct bnxt *bp, u8 dsn[])
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 3/6] bnxt_en: Ignore src port field in decap filter nodes
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, Sriharsha Basavapatna
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

From: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>

The driver currently uses src port field (along with other fields) in the
decap tunnel key, while looking up and adding tunnel nodes. This leads to
redundant cfa_decap_filter_alloc() requests to the FW and flow-miss in the
flow engine. Fix this by ignoring the src port field in decap tunnel nodes.

Fixes: f484f6782e01 ("bnxt_en: add hwrm FW cmds for cfa_encap_record and decap_filter")
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index ac193408..795f450 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -1051,8 +1051,10 @@ static int bnxt_tc_get_decap_handle(struct bnxt *bp, struct bnxt_tc_flow *flow,
 
 	/* Check if there's another flow using the same tunnel decap.
 	 * If not, add this tunnel to the table and resolve the other
-	 * tunnel header fileds
+	 * tunnel header fileds. Ignore src_port in the tunnel_key,
+	 * since it is not required for decap filters.
 	 */
+	decap_key->tp_src = 0;
 	decap_node = bnxt_tc_get_tunnel_node(bp, &tc_info->decap_table,
 					     &tc_info->decap_ht_params,
 					     decap_key);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, Andy Gospodarek
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

From: Andy Gospodarek <gospo@broadcom.com>

Before this patch the following commands would succeed as far as the
user was concerned:

$ tc qdisc add dev p1p1 ingress
$ tc filter add dev p1p1 parent ffff: protocol all \
	flower skip_sw action drop
$ tc filter add dev p1p1 parent ffff: protocol ipv4 \
	flower skip_sw src_mac 00:02:00:00:00:01/44 action drop

The current flow offload infrastructure used does not support wildcard
matching for ethernet headers, so do not allow the second or third
commands to succeed.  If a user wants to drop traffic on that interface
the protocol and MAC addresses need to be specified explicitly:

$ tc qdisc add dev p1p1 ingress
$ tc filter add dev p1p1 parent ffff: protocol arp \
	flower skip_sw action drop
$ tc filter add dev p1p1 parent ffff: protocol ipv4 \
	flower skip_sw action drop
...
$ tc filter add dev p1p1 parent ffff: protocol ipv4 \
	flower skip_sw src_mac 00:02:00:00:00:01 action drop
$ tc filter add dev p1p1 parent ffff: protocol ipv4 \
	flower skip_sw src_mac 00:02:00:00:00:02 action drop
...

There are also checks for VLAN parameters in this patch as other callers
may wildcard those parameters even if tc does not.  Using different
flow infrastructure could allow this to work in the future for L2 flows,
but for now it does not.

Fixes: 2ae7408fedfe ("bnxt_en: bnxt: add TC flower filter offload support")
Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 59 ++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 65c2cee..ac193408 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -377,6 +377,30 @@ static bool is_wildcard(void *mask, int len)
 	return true;
 }
 
+static bool is_exactmatch(void *mask, int len)
+{
+	const u8 *p = mask;
+	int i;
+
+	for (i = 0; i < len; i++)
+		if (p[i] != 0xff)
+			return false;
+
+	return true;
+}
+
+static bool bits_set(void *key, int len)
+{
+	const u8 *p = key;
+	int i;
+
+	for (i = 0; i < len; i++)
+		if (p[i] != 0)
+			return true;
+
+	return false;
+}
+
 static int bnxt_hwrm_cfa_flow_alloc(struct bnxt *bp, struct bnxt_tc_flow *flow,
 				    __le16 ref_flow_handle,
 				    __le32 tunnel_handle, __le16 *flow_handle)
@@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
 		return false;
 	}
 
+	/* Currently source/dest MAC cannot be partial wildcard  */
+	if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
+	    !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
+		return false;
+	}
+	if (bits_set(&flow->l2_key.dmac, sizeof(flow->l2_key.dmac)) &&
+	    !is_exactmatch(&flow->l2_mask.dmac, sizeof(flow->l2_mask.dmac))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for Dest MAC\n");
+		return false;
+	}
+
+	/* Currently VLAN fields cannot be partial wildcard */
+	if (bits_set(&flow->l2_key.inner_vlan_tci,
+		     sizeof(flow->l2_key.inner_vlan_tci)) &&
+	    !is_exactmatch(&flow->l2_mask.inner_vlan_tci,
+			   sizeof(flow->l2_mask.inner_vlan_tci))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for VLAN TCI\n");
+		return false;
+	}
+	if (bits_set(&flow->l2_key.inner_vlan_tpid,
+		     sizeof(flow->l2_key.inner_vlan_tpid)) &&
+	    !is_exactmatch(&flow->l2_mask.inner_vlan_tpid,
+			   sizeof(flow->l2_mask.inner_vlan_tpid))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for VLAN TPID\n");
+		return false;
+	}
+
+	/* Currently Ethertype must be set */
+	if (!is_exactmatch(&flow->l2_mask.ether_type,
+			   sizeof(flow->l2_mask.ether_type))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for Ethertype\n");
+		return false;
+	}
+
 	return true;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 1/6] bnxt_en: Fix ethtool -x crash when device is down.
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

Fix ethtool .get_rxfh() crash by checking for valid indirection table
address before copying the data.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 8d8ccd6..1f622ca 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -870,17 +870,22 @@ static int bnxt_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
 			 u8 *hfunc)
 {
 	struct bnxt *bp = netdev_priv(dev);
-	struct bnxt_vnic_info *vnic = &bp->vnic_info[0];
+	struct bnxt_vnic_info *vnic;
 	int i = 0;
 
 	if (hfunc)
 		*hfunc = ETH_RSS_HASH_TOP;
 
-	if (indir)
+	if (!bp->vnic_info)
+		return 0;
+
+	vnic = &bp->vnic_info[0];
+	if (indir && vnic->rss_table) {
 		for (i = 0; i < HW_HASH_INDEX_SIZE; i++)
 			indir[i] = le16_to_cpu(vnic->rss_table[i]);
+	}
 
-	if (key)
+	if (key && vnic->rss_hash_key)
 		memcpy(key, vnic->rss_hash_key, HW_HASH_KEY_SIZE);
 
 	return 0;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 0/6] bnxt_en: Fixes for net.
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev

This bug fix series include NULL pointer fixes in ethtool -x code path
and in the error clean up path when freeing IRQs, a ring accounting bug
that missed rings used by the RDMA driver, and 3 bug fixes related to TC
Flower and VF-reps.

v2: Fixed commit message of patch 4.  Changed the pound sign to $ sign
in front of the ip command.
 
Andy Gospodarek (1):
  bnxt_en: do not allow wildcard matches for L2 flows

Michael Chan (3):
  bnxt_en: Fix ethtool -x crash when device is down.
  bnxt_en: Need to include RDMA rings in bnxt_check_rings().
  bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().

Sriharsha Basavapatna (2):
  bnxt_en: Ignore src port field in decap filter nodes
  bnxt_en: Support max-mtu with VF-reps

 drivers/net/ethernet/broadcom/bnxt/bnxt.c         |  4 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 11 ++--
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c      | 63 ++++++++++++++++++++++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c     | 30 +++++++++++
 4 files changed, 103 insertions(+), 5 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [PATCH] net: samsung: sxgbe: Replace mdelay with usleep_range in sxgbe_sw_reset
From: kbuild test robot @ 2018-04-11 15:50 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: kbuild-all, bh74.an, ks.giri, vipul.pandya, netdev, linux-kernel,
	Jia-Ju Bai
In-Reply-To: <1523413280-2336-1-git-send-email-baijiaju1990@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1660 bytes --]

Hi Jia-Ju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.16 next-20180411]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jia-Ju-Bai/net-samsung-sxgbe-Replace-mdelay-with-usleep_range-in-sxgbe_sw_reset/20180411-225900
config: i386-randconfig-x019-201814 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c: In function 'sxgbe_sw_reset':
>> drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c:2039:3: error: implicit declaration of function 'usleep'; did you mean 'ssleep'? [-Werror=implicit-function-declaration]
      usleep(10000, 11000);
      ^~~~~~
      ssleep
   cc1: some warnings being treated as errors

vim +2039 drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c

  2029	
  2030	static int sxgbe_sw_reset(void __iomem *addr)
  2031	{
  2032		int retry_count = 10;
  2033	
  2034		writel(SXGBE_DMA_SOFT_RESET, addr + SXGBE_DMA_MODE_REG);
  2035		while (retry_count--) {
  2036			if (!(readl(addr + SXGBE_DMA_MODE_REG) &
  2037			      SXGBE_DMA_SOFT_RESET))
  2038				break;
> 2039			usleep(10000, 11000);
  2040		}
  2041	
  2042		if (retry_count < 0)
  2043			return -EBUSY;
  2044	
  2045		return 0;
  2046	}
  2047	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32699 bytes --]

^ permalink raw reply

* [RFC bpf-next v2 8/8] bpf: add documentation for eBPF helpers (58-64)
From: Quentin Monnet @ 2018-04-11 15:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	John Fastabend
In-Reply-To: <20180411121759.4191e267@redhat.com>

2018-04-11 12:17 UTC+0200 ~ Jesper Dangaard Brouer <brouer@redhat.com>
> On Tue, 10 Apr 2018 15:41:57 +0100
> Quentin Monnet <quentin.monnet@netronome.com> wrote:
> 
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 7343af4196c8..db090ad03626 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1250,6 +1250,51 @@ union bpf_attr {
>>   * 	Return
>>   * 		0 on success, or a negative error in case of failure.
>>   *
>> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
>> + * 	Description
>> + * 		Redirect the packet to the endpoint referenced by *map* at
>> + * 		index *key*. Depending on its type, his *map* can contain
>> + * 		references to net devices (for forwarding packets through other
>> + * 		ports), or to CPUs (for redirecting XDP frames to another CPU;
>> + * 		but this is not fully implemented as of this writing).
> 
> Stating that CPUMAP redirect "is not fully implemented" is confusing.
> The issue is that CPUMAP only works for "native" XDP.
> 
> What about saying:
> 
> "[...] or to CPUs (for redirecting XDP frames to another CPU;
>  but this is only implemented for native XDP as of this writing)"
> 

Fine by me, I will change it. Thank you Jesper for the review!

Quentin

^ permalink raw reply

* [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 15:45 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	Lawrence Brakmo, Yonghong Song, Josef Bacik
In-Reply-To: <20180410175015.GA6762@rdna-mbp.dhcp.thefacebook.com>

2018-04-10 10:50 UTC-0700 ~ Andrey Ignatov <rdna@fb.com>
> Quentin Monnet <quentin.monnet@netronome.com> [Tue, 2018-04-10 07:43 -0700]:
>> + * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int addr_len)
>> + * 	Description
>> + * 		Bind the socket associated to *ctx* to the address pointed by
>> + * 		*addr*, of length *addr_len*. This allows for making outgoing
>> + * 		connection from the desired IP address, which can be useful for
>> + * 		example when all processes inside a cgroup should use one
>> + * 		single IP address on a host that has multiple IP configured.
>> + *
>> + * 		This helper works for IPv4 and IPv6, TCP and UDP sockets. The
>> + * 		domain (*addr*\ **->sa_family**) must be **AF_INET** (or
>> + * 		**AF_INET6**). Looking for a free port to bind to can be
>> + * 		expensive, therefore binding to port is not permitted by the
>> + * 		helper: *addr*\ **->sin_port** (or **sin6_port**, respectively)
>> + * 		must be set to zero.
>> + *
>> + * 		As for the remote end, both parts of it can be overridden,
>> + * 		remote IP and remote port. This can be useful if an application
>> + * 		inside a cgroup wants to connect to another application inside
>> + * 		the same cgroup or to itself, but knows nothing about the IP
>> + * 		address assigned to the cgroup.
> 
> The last paragraph ("As for the remote end ...") is not relevant to
> bpf_bind() and should be removed. It's about sys_connect hook itself
> that can call to bpf_bind() but also has other functionality (and that
> other functionality is described by this paragraph).

Thanks Andrey, I will remove this paragraph.

Quentin

^ permalink raw reply

* [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 15:44 UTC (permalink / raw)
  To: Yonghong Song, daniel, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
	Josef Bacik, Andrey Ignatov
In-Reply-To: <cc54b41e-3f2f-e87f-042f-842c96308626@fb.com>

2018-04-10 09:58 UTC-0700 ~ Yonghong Song <yhs@fb.com>
> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions:
>>
>> Helpers from Lawrence:
>> - bpf_setsockopt()
>> - bpf_getsockopt()
>> - bpf_sock_ops_cb_flags_set()
>>
>> Helpers from Yonghong:
>> - bpf_perf_event_read_value()
>> - bpf_perf_prog_read_value()
>>
>> Helper from Josef:
>> - bpf_override_return()
>>
>> Helper from Andrey:
>> - bpf_bind()
>>
>> Cc: Lawrence Brakmo <brakmo@fb.com>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Josef Bacik <jbacik@fb.com>
>> Cc: Andrey Ignatov <rdna@fb.com>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>   include/uapi/linux/bpf.h | 184
>> +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 184 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 15d9ccafebbe..7343af4196c8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h

[...]

>> @@ -1255,6 +1277,168 @@ union bpf_attr {
>>    *         performed again.
>>    *     Return
>>    *         0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags,
>> struct bpf_perf_event_value *buf, u32 buf_size)
>> + *     Description
>> + *         Read the value of a perf event counter, and store it into
>> *buf*
>> + *         of size *buf_size*. This helper relies on a *map* of type
>> + *         **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf
>> + *         event counter is selected at the creation of the *map*. The
> 
> The nature of the perf event counter is selected when *map* is updated
> with perf_event fd's.
> 

Thanks, I will fix it.

>> + *         *map* is an array whose size is the number of available CPU
>> + *         cores, and each cell contains a value relative to one
>> core. The
> 
> It is confusing to mix core/cpu here. Maybe just use perf_event
> convention, always using cpu?
> 

Right, I'll remove occurrences of "core".

>> + *         value to retrieve is indicated by *flags*, that contains the
>> + *         index of the core to look up, masked with
>> + *         **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
>> + *         **BPF_F_CURRENT_CPU** to indicate that the value for the
>> + *         current CPU core should be retrieved.
>> + *
>> + *         This helper behaves in a way close to
>> + *         **bpf_perf_event_read**\ () helper, save that instead of
>> + *         just returning the value observed, it fills the *buf*
>> + *         structure. This allows for additional data to be
>> retrieved: in
>> + *         particular, the enabled and running times (in *buf*\
>> + *         **->enabled** and *buf*\ **->running**, respectively) are
>> + *         copied.
>> + *
>> + *         These values are interesting, because hardware PMU
>> (Performance
>> + *         Monitoring Unit) counters are limited resources. When
>> there are
>> + *         more PMU based perf events opened than available counters,
>> + *         kernel will multiplex these events so each event gets certain
>> + *         percentage (but not all) of the PMU time. In case that
>> + *         multiplexing happens, the number of samples or counter value
>> + *         will not reflect the case compared to when no multiplexing
>> + *         occurs. This makes comparison between different runs
>> difficult.
>> + *         Typically, the counter value should be normalized before
>> + *         comparing to other experiments. The usual normalization is
>> done
>> + *         as follows.
>> + *
>> + *         ::
>> + *
>> + *             normalized_counter = counter * t_enabled / t_running
>> + *
>> + *         Where t_enabled is the time enabled for event and
>> t_running is
>> + *         the time running for event since last normalization. The
>> + *         enabled and running times are accumulated since the perf
>> event
>> + *         open. To achieve scaling factor between two invocations of an
>> + *         eBPF program, users can can use CPU id as the key (which is
>> + *         typical for perf array usage model) to remember the previous
>> + *         value and do the calculation inside the eBPF program.
>> + *     Return
>> + *         0 on success, or a negative error in case of failure.
>> + *

[...]

Thanks Yonghong for the review!

I have a favor to ask of you. I got a bounce for Kaixu Xia's email
address, and I don't know what alternative email address I could use. I
CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
this series), which is rather close to bpf_perf_event_read_value().
Would you mind having a look at that one too, please? The description is
not long.

Quentin

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox