Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net 1/2] ibmvnic: Handle all login error conditions
From: Nathan Fontenot @ 2018-04-11 15:24 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, jallen, tlfalcon
In-Reply-To: <20180411.110731.507616117263900808.davem@davemloft.net>

On 04/11/2018 10:07 AM, David Miller wrote:
> From: Nathan Fontenot <nfont@linux.vnet.ibm.com>
> Date: Wed, 11 Apr 2018 09:37:21 -0500
> 
>> There is a bug in handling the possible return codes from sending the
>> login CRQ. The current code treats any non-success return value,
>> minus failure to send the crq and a timeout waiting for a login response,
>> as a need to re-send the login CRQ. This can put the drive in an
>                                                        ^^^^^
> 
> "driver"
> 
>> inifinite loop of trying to login when getting return values other
>   ^^^^^^^^^
> 
> "infinite"
> 
>> that a partial success such as a return code of aborted. For these
>> scenarios the login will not ever succeed at this point and the
>> driver would need to be reset again.
>>
>> To resolve this loop trying to login is updated to only retry the
>> login if the driver gets a return code of a partial success. Other
>> return coes are treated as an error and the driver returns an error
>          ^^^^
> 
> "codes"
> 
>> from ibmvnic_login().
>>
>> To avoid infinite looping in the partial success return cases, the
>> number of retries is capped at the maximum number of supported
>> queues. This value was chosen because the driver does a renegatiation
>                                                           ^^^^^^^^^^^^^
> 
> "renegotiation"
> 
>> of capabilities which sets the number of queus possible and allows
>                                            ^^^^^
> 
> "queues"
> 

Oh man, complete spelling failure. resending.

-Nathan

^ permalink raw reply

* [PATCH v2 net 0/2] ibmvnic: Fix parameter change request handling
From: Nathan Fontenot @ 2018-04-11 15:09 UTC (permalink / raw)
  To: netdev; +Cc: jallen, tlfalcon

When updating parameters for the ibmvnic driver there is a possibility
of entering an infinite loop if a return value other that a partial
success is received from sending the login CRQ.

Also, a deadlock can occur on the rtnl lock if netdev_notify_peers()
is called during driver reset for a parameter change reset.

This patch set corrects both of these issues by updating the return
code handling in ibmvnic_login() nand gaurding against calling
netdev_notify_peers() for parameter change requests.

-Nathan

Updates for V2: Correct spelling mistakes in commit messages.
---

Nathan Fontenot (2):
      ibmvnic: Handle all login error conditions
      ibmvnic: Do not notify peers on parameter change resets

 drivers/net/ethernet/ibm/ibmvnic.c |   58 +++++++++++++++++++++++-------------
 drivers/net/ethernet/ibm/ibmvnic.h |    1 -
 2 files changed, 37 insertions(+), 22 deletions(-)

^ permalink raw reply

* [PATCH v2 net 1/2] ibmvnic: Handle all login error conditions
From: Nathan Fontenot @ 2018-04-11 15:09 UTC (permalink / raw)
  To: netdev; +Cc: jallen, tlfalcon
In-Reply-To: <152345930946.41899.8546547036273488202.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com>

There is a bug in handling the possible return codes from sending the
login CRQ. The current code treats any non-success return value,
minus failure to send the crq and a timeout waiting for a login response,
as a need to re-send the login CRQ. This can put the drive in an
infinite loop of trying to login when getting return values other
that a partial success such as a return code of aborted. For these
scenarios the login will not ever succeed at this point and the
driver would need to be reset again.

To resolve this loop trying to login is updated to only retry the
login if the driver gets a return code of a partial success. Other
return codes are treated as an error and the driver returns an error
from ibmvnic_login().

To avoid infinite looping in the partial success return cases, the
number of retries is capped at the maximum number of supported
queues. This value was chosen because the driver does a renegotiation
of capabilities which sets the number of queues possible and allows
the driver to attempt a login for possible value for the number
of queues supported.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c |   55 +++++++++++++++++++++++-------------
 drivers/net/ethernet/ibm/ibmvnic.h |    1 -
 2 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index aad5658d79d5..c6d5e9448eef 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -794,46 +794,61 @@ static int ibmvnic_login(struct net_device *netdev)
 {
 	struct ibmvnic_adapter *adapter = netdev_priv(netdev);
 	unsigned long timeout = msecs_to_jiffies(30000);
-	struct device *dev = &adapter->vdev->dev;
+	int retry_count = 0;
 	int rc;
 
 	do {
-		if (adapter->renegotiate) {
-			adapter->renegotiate = false;
+		if (retry_count > IBMVNIC_MAX_QUEUES) {
+			netdev_warn(netdev, "Login attempts exceeded\n");
+			return -1;
+		}
+
+		adapter->init_done_rc = 0;
+		reinit_completion(&adapter->init_done);
+		rc = send_login(adapter);
+		if (rc) {
+			netdev_warn(netdev, "Unable to login\n");
+			return rc;
+		}
+
+		if (!wait_for_completion_timeout(&adapter->init_done,
+						 timeout)) {
+			netdev_warn(netdev, "Login timed out\n");
+			return -1;
+		}
+
+		if (adapter->init_done_rc == PARTIALSUCCESS) {
+			retry_count++;
 			release_sub_crqs(adapter, 1);
 
+			adapter->init_done_rc = 0;
 			reinit_completion(&adapter->init_done);
 			send_cap_queries(adapter);
 			if (!wait_for_completion_timeout(&adapter->init_done,
 							 timeout)) {
-				dev_err(dev, "Capabilities query timeout\n");
+				netdev_warn(netdev,
+					    "Capabilities query timed out\n");
 				return -1;
 			}
+
 			rc = init_sub_crqs(adapter);
 			if (rc) {
-				dev_err(dev,
-					"Initialization of SCRQ's failed\n");
+				netdev_warn(netdev,
+					    "SCRQ initialization failed\n");
 				return -1;
 			}
+
 			rc = init_sub_crq_irqs(adapter);
 			if (rc) {
-				dev_err(dev,
-					"Initialization of SCRQ's irqs failed\n");
+				netdev_warn(netdev,
+					    "SCRQ irq initialization failed\n");
 				return -1;
 			}
-		}
-
-		reinit_completion(&adapter->init_done);
-		rc = send_login(adapter);
-		if (rc) {
-			dev_err(dev, "Unable to attempt device login\n");
-			return rc;
-		} else if (!wait_for_completion_timeout(&adapter->init_done,
-						 timeout)) {
-			dev_err(dev, "Login timeout\n");
+		} else if (adapter->init_done_rc) {
+			netdev_warn(netdev, "Adapter login failed\n");
 			return -1;
 		}
-	} while (adapter->renegotiate);
+	} while (adapter->init_done_rc == PARTIALSUCCESS);
 
 	/* handle pending MAC address changes after successful login */
 	if (adapter->mac_change_pending) {
@@ -3942,7 +3957,7 @@ static int handle_login_rsp(union ibmvnic_crq *login_rsp_crq,
 	 * to resend the login buffer with fewer queues requested.
 	 */
 	if (login_rsp_crq->generic.rc.code) {
-		adapter->renegotiate = true;
+		adapter->init_done_rc = login_rsp_crq->generic.rc.code;
 		complete(&adapter->init_done);
 		return 0;
 	}
diff --git a/drivers/net/ethernet/ibm/ibmvnic.h b/drivers/net/ethernet/ibm/ibmvnic.h
index 99c0b58c2c39..22391e8805f6 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -1035,7 +1035,6 @@ struct ibmvnic_adapter {
 
 	struct ibmvnic_sub_crq_queue **tx_scrq;
 	struct ibmvnic_sub_crq_queue **rx_scrq;
-	bool renegotiate;
 
 	/* rx structs */
 	struct napi_struct *napi;

^ permalink raw reply related

* [PATCH v2 net 2/2] ibmvnic: Do not notify peers on parameter change resets
From: Nathan Fontenot @ 2018-04-11 15:09 UTC (permalink / raw)
  To: netdev; +Cc: jallen, tlfalcon
In-Reply-To: <152345930946.41899.8546547036273488202.stgit@ltcalpine2-lp14.aus.stglabs.ibm.com>

When attempting to change the driver parameters, such as the MTU
value or number of queues, do not call netdev_notify_peers().
Doing so will deadlock on the rtnl_lock.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index c6d5e9448eef..fdca1ea5bbf9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1843,7 +1843,8 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 	for (i = 0; i < adapter->req_rx_queues; i++)
 		napi_schedule(&adapter->napi[i]);
 
-	if (adapter->reset_reason != VNIC_RESET_FAILOVER)
+	if (adapter->reset_reason != VNIC_RESET_FAILOVER &&
+	    adapter->reset_reason != VNIC_RESET_CHANGE_PARAM)
 		netdev_notify_peers(netdev);
 
 	netif_carrier_on(netdev);

^ permalink raw reply related

* [PATCH v2] dec: tulip: de4x5: Replace mdelay with usleep_range in de4x5_hw_init
From: Jia-Ju Bai @ 2018-04-11 15:39 UTC (permalink / raw)
  To: davem, stephen, johannes.berg, arvind.yadav.cs, dhowells
  Cc: netdev, linux-parisc, linux-kernel, Jia-Ju Bai

de4x5_hw_init() is never called in atomic context.

de4x5_hw_init() is only called by de4x5_pci_probe(), which is only 
set as ".probe" in struct pci_driver.

Despite never getting called from atomic context, de4x5_hw_init() 
calls mdelay() to busily wait.
This is not necessary and can be replaced with usleep_range() to 
avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
v2:
* Use usleep_range() to correct usleep() in v1.

---
 drivers/net/ethernet/dec/tulip/de4x5.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/dec/tulip/de4x5.c b/drivers/net/ethernet/dec/tulip/de4x5.c
index 0affee9..3fb0119 100644
--- a/drivers/net/ethernet/dec/tulip/de4x5.c
+++ b/drivers/net/ethernet/dec/tulip/de4x5.c
@@ -1107,7 +1107,7 @@ static int (*dc_infoblock[])(struct net_device *dev, u_char, u_char *) = {
 	pdev = to_pci_dev (gendev);
 	pci_write_config_byte(pdev, PCI_CFDA_PSM, WAKEUP);
     }
-    mdelay(10);
+    usleep_range(10000, 11000);

     RESET_DE4X5;

-- 
1.9.1

^ permalink raw reply related

* [RFC bpf-next v2 1/8] bpf: add script and prepare bpf.h for new helpers documentation
From: Quentin Monnet @ 2018-04-11 15:41 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410181620.axzpwh2iawg2kgh3@ast-mbp.dhcp.thefacebook.com>

2018-04-10 11:16 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:50PM +0100, Quentin Monnet wrote:
>> Remove previous "overview" of eBPF helpers from user bpf.h header.
>> Replace it by a comment explaining how to process the new documentation
>> (to come in following patches) with a Python script to produce RST, then
>> man page documentation.
>>
>> Also add the aforementioned Python script under scripts/. It is used to
>> process include/uapi/linux/bpf.h and to extract helper descriptions, to
>> turn it into a RST document that can further be processed with rst2man
>> to produce a man page. The script takes one "--filename <path/to/file>"
>> option. If the script is launched from scripts/ in the kernel root
>> directory, it should be able to find the location of the header to
>> parse, and "--filename <path/to/file>" is then optional. If it cannot
>> find the file, then the option becomes mandatory. RST-formatted
>> documentation is printed to standard output.
>>
>> Typical workflow for producing the final man page would be:
>>
>>     $ ./scripts/bpf_helpers_doc.py \
>>             --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
>>     $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
>>     $ man /tmp/bpf-helpers.7
>>
>> Note that the tool kernel-doc cannot be used to document eBPF helpers,
>> whose signatures are not available directly in the header files
>> (pre-processor directives are used to produce them at the beginning of
>> the compilation process).
>>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>  include/uapi/linux/bpf.h   | 406 ++------------------------------------------
>>  scripts/bpf_helpers_doc.py | 414 +++++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 430 insertions(+), 390 deletions(-)
>>  create mode 100755 scripts/bpf_helpers_doc.py
>>

[...]

>> diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
>> new file mode 100755
>> index 000000000000..3a15ba3f0a83
>> --- /dev/null
>> +++ b/scripts/bpf_helpers_doc.py
>> @@ -0,0 +1,414 @@
>> +#!/usr/bin/python3
>> +#
>> +# Copyright (C) 2018 Netronome Systems, Inc.
>> +#
>> +# This software is licensed under the GNU General License Version 2,
>> +# June 1991 as shown in the file COPYING in the top-level directory of this
>> +# source tree.
> 
> please use SPDX instead.
> 

Same as for bpftool, our layers remain a bit cautious about it. I'd be
happy to change it for SPDX as a follow-up when I get the green light.

>> +#
>> +# THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS"
>> +# WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
>> +# BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
>> +# FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE
>> +# OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME
>> +# THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
>> +

[...]

>> +class PrinterRST(Printer):
>> +    """
>> +    A printer for dumping collected information about helpers as a ReStructured
>> +    Text page compatible with the rst2man program, which can be used to
>> +    generate a manual page for the helpers.
>> +    @helpers: array of Helper objects to print to standard output
>> +    """
>> +    def print_header(self):
>> +        header = '''\
>> +.. Copyright (C) 2018 Netronome Systems, Inc.
> 
> I think would be good to capture copyrights of all authors that added
> the helpers being documented. Since a lot of text was copied from commit
> logs it's only fair to preserve the copyrights.
> Such man page file is automatically generated by the python script
> and script itself is copyrighted by Netronome. That's fine, but the text
> of man page is not netronome only.
> I'm not sure what would be the solution. May be something like:
> "
> Copyright (C) All BPF authors and contributors from 2011 to present
> See git log include/uapi/linux/bpf.h for details
> "
> ?

Seems fair indeed. I do not have a better suggestion myself, so I will
stick to yours.

Out of curiosity, why 2011 for the year? I thought you introduced eBPF
in the kernel in 2014 (bd4cf0ed331a), and I do not believe helpers have
any link with cBPF?

>> +.. 
>> +.. %%%LICENSE_START(VERBATIM)
>> +.. Permission is granted to make and distribute verbatim copies of this
>> +.. manual provided the copyright notice and this permission notice are
>> +.. preserved on all copies.
>> +.. 
>> +.. Permission is granted to copy and distribute modified versions of this
>> +.. manual under the conditions for verbatim copying, provided that the
>> +.. entire resulting derived work is distributed under the terms of a
>> +.. permission notice identical to this one.
>> +.. 
>> +.. Since the Linux kernel and libraries are constantly changing, this
>> +.. manual page may be incorrect or out-of-date.  The author(s) assume no
>> +.. responsibility for errors or omissions, or for damages resulting from
>> +.. the use of the information contained herein.  The author(s) may not
>> +.. have taken the same level of care in the production of this manual,
>> +.. which is licensed free of charge, as they might when working
>> +.. professionally.
>> +.. 
>> +.. Formatted or processed versions of this manual, if unaccompanied by
>> +.. the source, must acknowledge the copyright and authors of this work.
>> +.. %%%LICENSE_END
>> +.. 
>> +.. Please do not edit this file. It was generated from the documentation
>> +.. located in file include/uapi/linux/bpf.h of the Linux kernel sources
>> +.. (helpers description), and from scripts/bpf_helpers_doc.py in the same
>> +.. repository (header and footer).
>> +
>> +===========
>> +BPF-HELPERS
>> +===========
>> +-------------------------------------------------------------------------------
>> +list of eBPF helper functions
>> +-------------------------------------------------------------------------------
>> +
>> +:Manual section: 7
>> +
>> +DESCRIPTION
>> +===========
>> +
>> +The extended Berkeley Packet Filter (eBPF) subsystem consists in programs
>> +written in a pseudo-assembly language, then attached to one of the several
>> +kernel hooks and run in reaction of specific events. This framework differs
>> +from the older, "classic" BPF (or "cBPF") in several aspects, one of them being
>> +the ability to call special functions (or "helpers") from within a program. For
>> +security reasons, these functions are restricted to a white-list of helpers
>> +defined in the kernel.
> 
> 'for security reasons' sounds a bit odd. May be 'for safety reasons' ?
> Or drop that part.

I'll drop it and keep "These functions are restricted to a white-list of
helpers defined in the kernel."

>> +
>> +These helpers are used by eBPF programs to interact with the system, or with
>> +the context in which they work. For instance, they can be used to print
>> +debugging messages, to get the time since the system was booted, to interact
>> +with eBPF maps, or to manipulate network packets metadata. Since there are
> 
> s/packets metadata/packets/

Ok.

>> +several eBPF program types, and that they do not run in the same context, each
>> +program type can only call a subset of those helpers.
>> +
>> +Due to eBPF conventions, a helper can not have more than five arguments.
>> +
>> +This document is an attempt to list and document the helpers available to eBPF
>> +developers. They are sorted by chronological order (the oldest helpers in the
>> +kernel at the top).
>> +
>> +HELPERS
>> +=======
>> +'''
>> +        print(header)
>> +
>> +    def print_footer(self):
>> +        footer = '''
>> +NOTES
>> +=====
>> +
>> +On the performance side, eBPF programs move to the stack all arguments to pass
>> +to the helpers, and call directly into the compiled helper functions without
> 
> "move to the stack all arguments" ?! I'm not sure what you're trying to say.
> The arguments stay in registers for the call.
> 
>> +requiring any foreign-function interface. As a result, calling helpers
>> +introduce very little overhead.
> 
> not true. it's zero overhead. Literally. Very little is not the same as zero.

Not the same indeed :). I will fix with "no overhead".

I'm not too sure either what I meant when I wrote the thing about moving
arguments to the stack... In fact this "NOTES" section is short and not
really relevant. I will probably delete it and add a line in page header
about helpers being called with no overhead.

>> +
>> +EXAMPLES
>> +========
>> +
>> +Example usage for most of the eBPF helpers listed in this manual page are
>> +available within the Linux kernel sources, at the following locations:
>> +
>> +* *samples/bpf/*
>> +* *tools/testing/selftests/bpf/*
>> +
>> +IMPLEMENTATION
>> +==============
>> +
>> +This manual page is an effort to document the existing eBPF helper functions.
>> +But as of this writing, the BPF sub-system is under heavy development. New eBPF
>> +program or map types are added, along with new helper functions. Some helpers
>> +are occasionally made available for additional program types. So in spite of
>> +the efforts of the community, this page might not be up-to-date. If you want to
>> +check by yourself what helper functions exist in your kernel, or what types of
>> +programs they can support, here are some files among the kernel tree that you
>> +may be interested in:
>> +
>> +* *include/uapi/linux/bpf.h* contains the full list of all helper functions.
>> +* *net/core/filter.c* contains the definition of most network-related helper
>> +  functions, and the list of program types from which they can be used.
>> +* *kernel/trace/bpf_trace.c* is the equivalent for most tracing program-related
>> +  helpers.
>> +* *kernel/bpf/verifier.c* contains the functions used to check that valid types
>> +  of eBPF maps are used with a given helper function.
>> +* *kernel/bpf/* directory contains other files in which additional helpers are
>> +  defined (for cgroups, sockmaps, etc.).
>> +
>> +Compatibility between helper functions and program types can generally be found
>> +in the files where helper functions are defined. Look for the **struct
>> +bpf_func_proto** objects and for functions returning them: these functions
>> +contain a list of helpers that a given program type can call. Note that the
>> +**default:** label of the **switch ... case** used to filter helpers can call
>> +other functions, themselves allowing access to additional helpers. The
>> +requirement for GPL license is also in those **struct bpf_func_proto**.
> 
> I think here would be good to add that most networking helpers are non-GPL
> because they operate on packets which are abstract bytes on the wire,
> whereas most tracing helpers are GPL, since they inspect the guts of
> the linux kernel which is GPL itself.
> That's the main reason why adding extra 'gpl=yes/no' for each helper
> description is redundant.
> 

I removed information from the page header about licensing since v1, I
may reintroduce some of it and tell about the difference between
networking and tracing programs, as you suggest.

I understand this difference and I see that specifying GPL requirement
for each individual helper is redundant. And yet I still believe that
for newcomers, it remains easier to have the indication for the specific
helper they are reading about in the man page rather than to take a
guess ("Is this helper for networking only?"). But I do not intend to
add it back to this set anyway, so let's keep this for future
discussions :).

Thanks for the review!
Quentin

^ permalink raw reply

* [RFC bpf-next v2 2/8] bpf: add documentation for eBPF helpers (01-11)
From: Quentin Monnet @ 2018-04-11 15:42 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410175605.2wqhaqx34a4o3gdi@ast-mbp.dhcp.thefacebook.com>

2018-04-10 10:56 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:51PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> written by Alexei:
>>
>> - bpf_map_lookup_elem()
>> - bpf_map_update_elem()
>> - bpf_map_delete_elem()
>> - bpf_probe_read()
>> - bpf_ktime_get_ns()
>> - bpf_trace_printk()
>> - bpf_skb_store_bytes()
>> - bpf_l3_csum_replace()
>> - bpf_l4_csum_replace()
>> - bpf_tail_call()
>> - bpf_clone_redirect()
>>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>  include/uapi/linux/bpf.h | 199 +++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 199 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 45f77f01e672..2bc653a3a20f 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -381,6 +381,205 @@ union bpf_attr {
>>   * intentional, removing them would break paragraphs for rst2man.
>>   *
>>   * Start of BPF helper function descriptions:
>> + *
>> + * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)
>> + * 	Description
>> + * 		Perform a lookup in *map* for an entry associated to *key*.
>> + * 	Return
>> + * 		Map value associated to *key*, or **NULL** if no entry was
>> + * 		found.
>> + *
>> + * int bpf_map_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
>> + * 	Description
>> + * 		Add or update the value of the entry associated to *key* in
>> + * 		*map* with *value*. *flags* is one of:
>> + *
>> + * 		**BPF_NOEXIST**
>> + * 			The entry for *key* must not exist in the map.
>> + * 		**BPF_EXIST**
>> + * 			The entry for *key* must already exist in the map.
>> + * 		**BPF_ANY**
>> + * 			No condition on the existence of the entry for *key*.
>> + *
>> + * 		These flags are only useful for maps of type
>> + * 		**BPF_MAP_TYPE_HASH**. For all other map types, **BPF_ANY**
>> + * 		should be used.
> 
> I think that's not entirely accurate.
> The flags work as expected for all other map types as well
> and for lru map, sockmap, map in map the flags have practical use cases.
> 

Ok, I missed that. I have to go back and check how the flags are used
for those maps. I will cook up something cleaner for the next version of
the set.

>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *

[...]

>> + *
>> + * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
>> + * 	Description
>> + * 		This helper is a "printk()-like" facility for debugging. It
>> + * 		prints a message defined by format *fmt* (of size *fmt_size*)
>> + * 		to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
>> + * 		available. It can take up to three additional **u64**
>> + * 		arguments (as an eBPF helpers, the total number of arguments is
>> + * 		limited to five). Each time the helper is called, it appends a
>> + * 		line that looks like the following:
>> + *
>> + * 		::
>> + *
>> + * 			telnet-470   [001] .N.. 419421.045894: 0x00000001: BPF command: 2
>> + *
>> + * 		In the above:
>> + *
>> + * 			* ``telnet`` is the name of the current task.
>> + * 			* ``470`` is the PID of the current task.
>> + * 			* ``001`` is the CPU number on which the task is
>> + * 			  running.
>> + * 			* In ``.N..``, each character refers to a set of
>> + * 			  options (whether irqs are enabled, scheduling
>> + * 			  options, whether hard/softirqs are running, level of
>> + * 			  preempt_disabled respectively). **N** means that
>> + * 			  **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED**
>> + * 			  are set.
>> + * 			* ``419421.045894`` is a timestamp.
>> + * 			* ``0x00000001`` is a fake value used by BPF for the
>> + * 			  instruction pointer register.
>> + * 			* ``BPF command: 2`` is the message formatted with
>> + * 			  *fmt*.
> 
> the above depends on how trace_pipe was configured. It's a default
> configuration for many, but would be good to explain this a bit better.
> 

I did not know about that. Would you have a pointer about how to
configure trace_pipe, please?

>> + *
>> + * 		The conversion specifiers supported by *fmt* are similar, but
>> + * 		more limited than for printk(). They are **%d**, **%i**,
>> + * 		**%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**,
>> + * 		**%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size
>> + * 		of field, padding with zeroes, etc.) is available, and the
>> + * 		helper will silently fail if it encounters an unknown
>> + * 		specifier.
> 
> This is not true. bpf_trace_printk will return -EINVAL for unknown specifier.
> 

Correct, sorry about that. I never check the return value of
bpf_trace_printk(), and it's hard to realise it failed without resorting
to another bpf_trace_printk() :). I'll fix it, what about:

"No modifier (size of field, padding with zeroes, etc.) is available,
and the helper will return **-EINVAL** (but print nothing) if it
encounters an unknown specifier."

(I would like to keep the "print nothing" idea, at the beginning I spent
some time myself trying to figure out why my bpf_trace_prink() seemed to
be never called--I was simply trying to print with "%#x".)

>> + *
>> + * 		Also, note that **bpf_trace_printk**\ () is slow, and should
>> + * 		only be used for debugging purposes. For passing values to user
>> + * 		space, perf events should be preferred.
> 
> please mention the giant dmesg warning that people will definitely
> notice when they try to use this helper.

This is a good idea, I will mention it.

>> + * 	Return
>> + * 		The number of bytes written to the buffer, or a negative error
>> + * 		in case of failure.
>> + *

[...]

>> + * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
>> + * 	Description
>> + * 		This special helper is used to trigger a "tail call", or in
>> + * 		other words, to jump into another eBPF program. The contents of
>> + * 		eBPF registers and stack are not modified, the new program
>> + * 		"inherits" them from the caller. This mechanism allows for
> 
> "inherits" is a technically correct, but misleading statement,
> since callee program cannot access caller's registers and stack.
> 

I can replace this sentence by:

"The same stack frame is used (but values on stack and in registers for
the caller are not accessible to the callee)."

>> + * 		program chaining, either for raising the maximum number of
>> + * 		available eBPF instructions, or to execute given programs in
>> + * 		conditional blocks. For security reasons, there is an upper
>> + * 		limit to the number of successive tail calls that can be
>> + * 		performed.
>> + *
>> + * 		Upon call of this helper, the program attempts to jump into a
>> + * 		program referenced at index *index* in *prog_array_map*, a
>> + * 		special map of type **BPF_MAP_TYPE_PROG_ARRAY**, and passes
>> + * 		*ctx*, a pointer to the context.
>> + *
>> + * 		If the call succeeds, the kernel immediately runs the first
>> + * 		instruction of the new program. This is not a function call,
>> + * 		and it never goes back to the previous program. If the call
>> + * 		fails, then the helper has no effect, and the caller continues
>> + * 		to run its own instructions. A call can fail if the destination
>> + * 		program for the jump does not exist (i.e. *index* is superior
>> + * 		to the number of entries in *prog_array_map*), or if the
>> + * 		maximum number of tail calls has been reached for this chain of
>> + * 		programs. This limit is defined in the kernel by the macro
>> + * 		**MAX_TAIL_CALL_CNT** (not accessible to user space), which
>> + * 		is currently set to 32.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
>> + * 	Description
>> + * 		Clone and redirect the packet associated to *skb* to another
>> + * 		net device of index *ifindex*. The only flag supported for now
>> + * 		is **BPF_F_INGRESS**, which indicates the packet is to be
>> + * 		redirected to the ingress interface instead of (by default)
>> + * 		egress.
> 
> imo the above sentence is prone to misinterpretation.
> Can you rephrase it to say that both redirect to ingress and redirect to egress
> are supported and flag is used to indicate which path to take ?
> 

I could replace with the following:

"Clone and redirect the packet associated to *skb* to another net device
of index *ifindex*. Both ingress and egress interfaces can be used for
redirection. The **BPF_F_INGRESS** value in *flags* is used to make the
distinction (ingress path is selected if the flag is present, egress
path otherwise). This is the only flag supported for now."

I think I wrote similar things about other helpers using BPF_F_INGRESS
flag, I will also update them accordingly.

>> + *
>> + * 		A call to this helper is susceptible to change data from the
>> + * 		packet. Therefore, at load time, all checks on pointers
>> + * 		previously done by the verifier are invalidated and must be
>> + * 		performed again.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>>   */
>>  #define __BPF_FUNC_MAPPER(FN)		\
>>  	FN(unspec),			\
>> -- 
>> 2.14.1
>>

^ permalink raw reply

* [RFC bpf-next v2 3/8] bpf: add documentation for eBPF helpers (12-22)
From: Quentin Monnet @ 2018-04-11 15:43 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man
In-Reply-To: <20180410224259.v5p2t2dc5s27mw3z@ast-mbp.dhcp.thefacebook.com>

2018-04-10 15:43 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Tue, Apr 10, 2018 at 03:41:52PM +0100, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions, all
>> writter by Alexei:
>>
>> - bpf_get_current_pid_tgid()
>> - bpf_get_current_uid_gid()
>> - bpf_get_current_comm()
>> - bpf_skb_vlan_push()
>> - bpf_skb_vlan_pop()
>> - bpf_skb_get_tunnel_key()
>> - bpf_skb_set_tunnel_key()
>> - bpf_redirect()
>> - bpf_perf_event_output()
>> - bpf_get_stackid()
>> - bpf_get_current_task()
>>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>  include/uapi/linux/bpf.h | 237 +++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 237 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 2bc653a3a20f..f3ea8824efbc 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -580,6 +580,243 @@ union bpf_attr {
>>   * 		performed again.
>>   * 	Return
>>   * 		0 on success, or a negative error in case of failure.
>> + *
>> + * u64 bpf_get_current_pid_tgid(void)
>> + * 	Return
>> + * 		A 64-bit integer containing the current tgid and pid, and
>> + * 		created as such:
>> + * 		*current_task*\ **->tgid << 32 \|**
>> + * 		*current_task*\ **->pid**.
>> + *
>> + * u64 bpf_get_current_uid_gid(void)
>> + * 	Return
>> + * 		A 64-bit integer containing the current GID and UID, and
>> + * 		created as such: *current_gid* **<< 32 \|** *current_uid*.
>> + *
>> + * int bpf_get_current_comm(char *buf, u32 size_of_buf)
>> + * 	Description
>> + * 		Copy the **comm** attribute of the current task into *buf* of
>> + * 		*size_of_buf*. The **comm** attribute contains the name of
>> + * 		the executable (excluding the path) for the current task. The
>> + * 		*size_of_buf* must be strictly positive. On success, the
> 
> that reminds me that we probably should relax it to ARG_CONST_SIZE_OR_ZERO.
> The programs won't be passing an actual zero into it, but it helps
> a lot to tell verifier that zero is also valid, since programs
> become much simpler.
> 

Ok. No change to helper description for now, we will update here when
your patch lands.

>> + * 		helper makes sure that the *buf* is NUL-terminated. On failure,
>> + * 		it is filled with zeroes.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
>> + * 	Description
>> + * 		Push a *vlan_tci* (VLAN tag control information) of protocol
>> + * 		*vlan_proto* to the packet associated to *skb*, then update
>> + * 		the checksum. Note that if *vlan_proto* is different from
>> + * 		**ETH_P_8021Q** and **ETH_P_8021AD**, it is considered to
>> + * 		be **ETH_P_8021Q**.
>> + *
>> + * 		A call to this helper is susceptible to change data from the
>> + * 		packet. Therefore, at load time, all checks on pointers
>> + * 		previously done by the verifier are invalidated and must be
>> + * 		performed again.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_vlan_pop(struct sk_buff *skb)
>> + * 	Description
>> + * 		Pop a VLAN header from the packet associated to *skb*.
>> + *
>> + * 		A call to this helper is susceptible to change data from the
>> + * 		packet. Therefore, at load time, all checks on pointers
>> + * 		previously done by the verifier are invalidated and must be
>> + * 		performed again.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
>> + * 	Description
>> + * 		Get tunnel metadata. This helper takes a pointer *key* to an
>> + * 		empty **struct bpf_tunnel_key** of **size**, that will be
>> + * 		filled with tunnel metadata for the packet associated to *skb*.
>> + * 		The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which
>> + * 		indicates that the tunnel is based on IPv6 protocol instead of
>> + * 		IPv4.
>> + *
>> + * 		This is typically used on the receive path to perform a lookup
>> + * 		or a packet redirection based on the value of *key*:
> 
> above is correct, but feels a bit cryptic.
> May be give more concrete example for particular tunneling protocol like gre
> and say that tunnel_key.remote_ip[46] is essential part of the encap and
> bpf prog will make decisions based on the contents of the encap header
> where bpf_tunnel_key is a single structure that generalizes parameters of
> various tunneling protocols into one struct.
> 

I will try to do this.

>> + *
>> + * 		::
>> + *
>> + * 			struct bpf_tunnel_key key = {};
>> + * 			bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
>> + * 			     lookup or redirect based on key ...
>> + *
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags)
>> + * 	Description
>> + * 		Populate tunnel metadata for packet associated to *skb.* The
>> + * 		tunnel metadata is set to the contents of *key*, of *size*. The
>> + * 		*flags* can be set to a combination of the following values:
>> + *
>> + * 		**BPF_F_TUNINFO_IPV6**
>> + * 			Indicate that the tunnel is based on IPv6 protocol
>> + * 			instead of IPv4.
>> + * 		**BPF_F_ZERO_CSUM_TX**
>> + * 			For IPv4 packets, add a flag to tunnel metadata
>> + * 			indicating that checksum computation should be skipped
>> + * 			and checksum set to zeroes.
>> + * 		**BPF_F_DONT_FRAGMENT**
>> + * 			Add a flag to tunnel metadata indicating that the
>> + * 			packet should not be fragmented.
>> + * 		**BPF_F_SEQ_NUMBER**
>> + * 			Add a flag to tunnel metadata indicating that a
>> + * 			sequence number should be added to tunnel header before
>> + * 			sending the packet. This flag was added for GRE
>> + * 			encapsulation, but might be used with other protocols
>> + * 			as well in the future.
>> + *
>> + * 		Here is a typical usage on the transmit path:
>> + *
>> + * 		::
>> + *
>> + * 			struct bpf_tunnel_key key;
>> + * 			     populate key ...
>> + * 			bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0);
>> + * 			bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
>> + *
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_redirect(u32 ifindex, u64 flags)
>> + * 	Description
>> + * 		Redirect the packet to another net device of index *ifindex*.
>> + * 		This helper is somewhat similar to **bpf_clone_redirect**\
>> + * 		(), except that the packet is not cloned, which provides
>> + * 		increased performance.
>> + *
>> + * 		For hooks other than XDP, *flags* can be set to
>> + * 		**BPF_F_INGRESS**, which indicates the packet is to be
>> + * 		redirected to the ingress interface instead of (by default)
>> + * 		egress. Currently, XDP does not support any flag.
>> + * 	Return
>> + * 		For XDP, the helper returns **XDP_REDIRECT** on success or
>> + * 		**XDP_ABORT** on error. For other program types, the values
>> + * 		are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
>> + * 		error.
>> + *
>> + * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
>> + * 	Description
>> + * 		Write perf raw sample into a perf event held by *map* of type
> 
> I'd say:
> Write raw *data* blob into special bpf perf event held by ...
> 

Yes it sounds better, I will follow the suggestion.

>> + * 		**BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf event must
>> + * 		have the following attributes: **PERF_SAMPLE_RAW** as
>> + * 		**sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and
>> + * 		**PERF_COUNT_SW_BPF_OUTPUT** as **config**.
>> + *
>> + * 		The *flags* are used to indicate the index in *map* for which
>> + * 		the value must be put, masked with **BPF_F_INDEX_MASK**.
>> + * 		Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU**
>> + * 		to indicate that the index of the current CPU core should be
>> + * 		used.
>> + *
>> + * 		The value to write, of *size*, is passed through eBPF stack and
>> + * 		pointed by *data*.
>> + *
>> + * 		The context of the program *ctx* needs also be passed to the
>> + * 		helper, and will get interpreted as a pointer to a **struct
>> + * 		pt_reg**.
> 
> Not quite correct.
> Initially bpf_perf_event_output() was only used with 'struct pt_reg *ctx',
> but then later it was generalized for all other tracing prog types,
> for clsact and even for XDP.
> So 'ctx' can be any of the context used by these program types.
> 

Right, I suppose I only looked at bpf_perf_event_output_tp() for this
one :(. I can simply trim it to:

"The context of the program *ctx* needs also be passed to the helper."

>> + *
>> + * 		On user space, a program willing to read the values needs to
>> + * 		call **perf_event_open**\ () on the perf event (either for
>> + * 		one or for all CPUs) and to store the file descriptor into the
>> + * 		*map*. This must be done before the eBPF program can send data
>> + * 		into it. An example is available in file
>> + * 		*samples/bpf/trace_output_user.c* in the Linux kernel source
>> + * 		tree (the eBPF program counterpart is in
>> + * 		*samples/bpf/trace_output_kern.c*). It looks like the
>> + * 		following snippet:
>> + *
>> + * 		::
>> + *
>> + * 			volatile struct perf_event_mmap_page *header;
>> + * 			struct perf_event_attr attr = {
>> + * 			        .sample_type = PERF_SAMPLE_RAW,
>> + * 			        .type = PERF_TYPE_SOFTWARE,
>> + * 			        .config = PERF_COUNT_SW_BPF_OUTPUT,
>> + * 			};
>> + * 			int page_size;
>> + * 			int mmap_size;
>> + * 			int key = 0;
>> + * 			int pmu_fd;
>> + * 			void *base;
>> + * 			
>> + * 			if (load_bpf_file(filename))
>> + * 			        return -1;
>> + * 			
>> + * 			pmu_fd = sys_perf_event_open(&attr,
>> + * 			                             -1, // pid
>> + * 			                              0, // cpu
>> + * 			                             -1, // group_fd
>> + * 			                              0);
>> + * 			
>> + * 			assert(pmu_fd >= 0);
>> + * 			assert(bpf_map_update_elem(map_fd[0], &key,
>> + * 			                           &pmu_fd, BPF_ANY) == 0);
>> + * 			assert(ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0) == 0);
>> + * 			
>> + * 			page_size = getpagesize();
>> + * 			mmap_size = page_size * (page_cnt + 1);
>> + * 			
>> + * 			base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
>> + * 			            MAP_SHARED, fd, 0);
>> + * 			if (base == MAP_FAILED)
>> + * 			        return -1;
>> + * 			
>> + * 			header = base;
> 
> I think that is too much for the man page, especially above is far from
> complete example.
> 

Yeah, I was unsure about keeping it. I will remove the snippet.

>> + *
>> + * 		**bpf_perf_event_output**\ () achieves better performance
>> + * 		than **bpf_trace_printk**\ () for sharing data with user
>> + * 		space, and is much better suitable for streaming data from eBPF
>> + * 		programs.
>> + * 	Return
>> + * 		0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
>> + * 	Description
>> + * 		Walk a user or a kernel stack and return its id. To achieve
>> + * 		this, the helper needs *ctx*, which is a pointer to the context
>> + * 		on which the tracing program is executed, and a pointer to a
>> + * 		*map* of type **BPF_MAP_TYPE_STACK_TRACE**.
>> + *
>> + * 		The last argument, *flags*, holds the number of stack frames to
>> + * 		skip (from 0 to 255), masked with
>> + * 		**BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
>> + * 		a combination of the following flags:
>> + *
>> + * 		**BPF_F_USER_STACK**
>> + * 			Collect a user space stack instead of a kernel stack.
>> + * 		**BPF_F_FAST_STACK_CMP**
>> + * 			Compare stacks by hash only.
>> + * 		**BPF_F_REUSE_STACKID**
>> + * 			If two different stacks hash into the same *stackid*,
>> + * 			discard the old one.
> 
> we have an annoying bug here that we will be sending a patch to fix soon,
> since right now there is no way for the program to know that stackid
> got replaced.
> 

Understood. Same as for bpf_get_current_comm(), I will leave the
description untouched until the patch lands.

>> + *
>> + * 		The stack id retrieved is a 32 bit long integer handle which
>> + * 		can be further combined with other data (including other stack
>> + * 		ids) and used as a key into maps. This can be useful for
>> + * 		generating a variety of graphs (such as flame graphs or off-cpu
>> + * 		graphs).
>> + *
>> + * 		For walking a stack, this helper is an improvement over
>> + * 		**bpf_probe_read**\ (), which can be used with unrolled loops
>> + * 		but is not efficient and consumes a lot of eBPF instructions.
>> + * 		Instead, **bpf_get_stackid**\ () can collect up to
>> + * 		**PERF_MAX_STACK_DEPTH** both kernel and user frames.
> 
> PERF_MAX_STACK_DEPTH is now controlled by sysctl knob.
> Would be good to mention that this limit can and should be increased
> for profiling long user stacks like java.
> 

Good idea, I will add it.

Thanks a lot Alexei for the thorough reviews!
Quentin

^ permalink raw reply

* [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 15:44 UTC (permalink / raw)
  To: Yonghong Song, daniel, ast
  Cc: netdev, oss-drivers, linux-doc, linux-man, Lawrence Brakmo,
	Josef Bacik, Andrey Ignatov
In-Reply-To: <cc54b41e-3f2f-e87f-042f-842c96308626@fb.com>

2018-04-10 09:58 UTC-0700 ~ Yonghong Song <yhs@fb.com>
> On 4/10/18 7:41 AM, Quentin Monnet wrote:
>> Add documentation for eBPF helper functions to bpf.h user header file.
>> This documentation can be parsed with the Python script provided in
>> another commit of the patch series, in order to provide a RST document
>> that can later be converted into a man page.
>>
>> The objective is to make the documentation easily understandable and
>> accessible to all eBPF developers, including beginners.
>>
>> This patch contains descriptions for the following helper functions:
>>
>> Helpers from Lawrence:
>> - bpf_setsockopt()
>> - bpf_getsockopt()
>> - bpf_sock_ops_cb_flags_set()
>>
>> Helpers from Yonghong:
>> - bpf_perf_event_read_value()
>> - bpf_perf_prog_read_value()
>>
>> Helper from Josef:
>> - bpf_override_return()
>>
>> Helper from Andrey:
>> - bpf_bind()
>>
>> Cc: Lawrence Brakmo <brakmo@fb.com>
>> Cc: Yonghong Song <yhs@fb.com>
>> Cc: Josef Bacik <jbacik@fb.com>
>> Cc: Andrey Ignatov <rdna@fb.com>
>> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
>> ---
>>   include/uapi/linux/bpf.h | 184
>> +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 184 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 15d9ccafebbe..7343af4196c8 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h

[...]

>> @@ -1255,6 +1277,168 @@ union bpf_attr {
>>    *         performed again.
>>    *     Return
>>    *         0 on success, or a negative error in case of failure.
>> + *
>> + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags,
>> struct bpf_perf_event_value *buf, u32 buf_size)
>> + *     Description
>> + *         Read the value of a perf event counter, and store it into
>> *buf*
>> + *         of size *buf_size*. This helper relies on a *map* of type
>> + *         **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf
>> + *         event counter is selected at the creation of the *map*. The
> 
> The nature of the perf event counter is selected when *map* is updated
> with perf_event fd's.
> 

Thanks, I will fix it.

>> + *         *map* is an array whose size is the number of available CPU
>> + *         cores, and each cell contains a value relative to one
>> core. The
> 
> It is confusing to mix core/cpu here. Maybe just use perf_event
> convention, always using cpu?
> 

Right, I'll remove occurrences of "core".

>> + *         value to retrieve is indicated by *flags*, that contains the
>> + *         index of the core to look up, masked with
>> + *         **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to
>> + *         **BPF_F_CURRENT_CPU** to indicate that the value for the
>> + *         current CPU core should be retrieved.
>> + *
>> + *         This helper behaves in a way close to
>> + *         **bpf_perf_event_read**\ () helper, save that instead of
>> + *         just returning the value observed, it fills the *buf*
>> + *         structure. This allows for additional data to be
>> retrieved: in
>> + *         particular, the enabled and running times (in *buf*\
>> + *         **->enabled** and *buf*\ **->running**, respectively) are
>> + *         copied.
>> + *
>> + *         These values are interesting, because hardware PMU
>> (Performance
>> + *         Monitoring Unit) counters are limited resources. When
>> there are
>> + *         more PMU based perf events opened than available counters,
>> + *         kernel will multiplex these events so each event gets certain
>> + *         percentage (but not all) of the PMU time. In case that
>> + *         multiplexing happens, the number of samples or counter value
>> + *         will not reflect the case compared to when no multiplexing
>> + *         occurs. This makes comparison between different runs
>> difficult.
>> + *         Typically, the counter value should be normalized before
>> + *         comparing to other experiments. The usual normalization is
>> done
>> + *         as follows.
>> + *
>> + *         ::
>> + *
>> + *             normalized_counter = counter * t_enabled / t_running
>> + *
>> + *         Where t_enabled is the time enabled for event and
>> t_running is
>> + *         the time running for event since last normalization. The
>> + *         enabled and running times are accumulated since the perf
>> event
>> + *         open. To achieve scaling factor between two invocations of an
>> + *         eBPF program, users can can use CPU id as the key (which is
>> + *         typical for perf array usage model) to remember the previous
>> + *         value and do the calculation inside the eBPF program.
>> + *     Return
>> + *         0 on success, or a negative error in case of failure.
>> + *

[...]

Thanks Yonghong for the review!

I have a favor to ask of you. I got a bounce for Kaixu Xia's email
address, and I don't know what alternative email address I could use. I
CC-ed to have a review for helper bpf_perf_event_read() (in patch 6 of
this series), which is rather close to bpf_perf_event_read_value().
Would you mind having a look at that one too, please? The description is
not long.

Quentin

^ permalink raw reply

* [RFC bpf-next v2 7/8] bpf: add documentation for eBPF helpers (51-57)
From: Quentin Monnet @ 2018-04-11 15:45 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	Lawrence Brakmo, Yonghong Song, Josef Bacik
In-Reply-To: <20180410175015.GA6762@rdna-mbp.dhcp.thefacebook.com>

2018-04-10 10:50 UTC-0700 ~ Andrey Ignatov <rdna@fb.com>
> Quentin Monnet <quentin.monnet@netronome.com> [Tue, 2018-04-10 07:43 -0700]:
>> + * int bpf_bind(struct bpf_sock_addr_kern *ctx, struct sockaddr *addr, int addr_len)
>> + * 	Description
>> + * 		Bind the socket associated to *ctx* to the address pointed by
>> + * 		*addr*, of length *addr_len*. This allows for making outgoing
>> + * 		connection from the desired IP address, which can be useful for
>> + * 		example when all processes inside a cgroup should use one
>> + * 		single IP address on a host that has multiple IP configured.
>> + *
>> + * 		This helper works for IPv4 and IPv6, TCP and UDP sockets. The
>> + * 		domain (*addr*\ **->sa_family**) must be **AF_INET** (or
>> + * 		**AF_INET6**). Looking for a free port to bind to can be
>> + * 		expensive, therefore binding to port is not permitted by the
>> + * 		helper: *addr*\ **->sin_port** (or **sin6_port**, respectively)
>> + * 		must be set to zero.
>> + *
>> + * 		As for the remote end, both parts of it can be overridden,
>> + * 		remote IP and remote port. This can be useful if an application
>> + * 		inside a cgroup wants to connect to another application inside
>> + * 		the same cgroup or to itself, but knows nothing about the IP
>> + * 		address assigned to the cgroup.
> 
> The last paragraph ("As for the remote end ...") is not relevant to
> bpf_bind() and should be removed. It's about sys_connect hook itself
> that can call to bpf_bind() but also has other functionality (and that
> other functionality is described by this paragraph).

Thanks Andrey, I will remove this paragraph.

Quentin

^ permalink raw reply

* [RFC bpf-next v2 8/8] bpf: add documentation for eBPF helpers (58-64)
From: Quentin Monnet @ 2018-04-11 15:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: daniel, ast, netdev, oss-drivers, linux-doc, linux-man,
	John Fastabend
In-Reply-To: <20180411121759.4191e267@redhat.com>

2018-04-11 12:17 UTC+0200 ~ Jesper Dangaard Brouer <brouer@redhat.com>
> On Tue, 10 Apr 2018 15:41:57 +0100
> Quentin Monnet <quentin.monnet@netronome.com> wrote:
> 
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 7343af4196c8..db090ad03626 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1250,6 +1250,51 @@ union bpf_attr {
>>   * 	Return
>>   * 		0 on success, or a negative error in case of failure.
>>   *
>> + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
>> + * 	Description
>> + * 		Redirect the packet to the endpoint referenced by *map* at
>> + * 		index *key*. Depending on its type, his *map* can contain
>> + * 		references to net devices (for forwarding packets through other
>> + * 		ports), or to CPUs (for redirecting XDP frames to another CPU;
>> + * 		but this is not fully implemented as of this writing).
> 
> Stating that CPUMAP redirect "is not fully implemented" is confusing.
> The issue is that CPUMAP only works for "native" XDP.
> 
> What about saying:
> 
> "[...] or to CPUs (for redirecting XDP frames to another CPU;
>  but this is only implemented for native XDP as of this writing)"
> 

Fine by me, I will change it. Thank you Jesper for the review!

Quentin

^ permalink raw reply

* Re: [PATCH] net: samsung: sxgbe: Replace mdelay with usleep_range in sxgbe_sw_reset
From: kbuild test robot @ 2018-04-11 15:50 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: kbuild-all, bh74.an, ks.giri, vipul.pandya, netdev, linux-kernel,
	Jia-Ju Bai
In-Reply-To: <1523413280-2336-1-git-send-email-baijiaju1990@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1660 bytes --]

Hi Jia-Ju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.16 next-20180411]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jia-Ju-Bai/net-samsung-sxgbe-Replace-mdelay-with-usleep_range-in-sxgbe_sw_reset/20180411-225900
config: i386-randconfig-x019-201814 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c: In function 'sxgbe_sw_reset':
>> drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c:2039:3: error: implicit declaration of function 'usleep'; did you mean 'ssleep'? [-Werror=implicit-function-declaration]
      usleep(10000, 11000);
      ^~~~~~
      ssleep
   cc1: some warnings being treated as errors

vim +2039 drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c

  2029	
  2030	static int sxgbe_sw_reset(void __iomem *addr)
  2031	{
  2032		int retry_count = 10;
  2033	
  2034		writel(SXGBE_DMA_SOFT_RESET, addr + SXGBE_DMA_MODE_REG);
  2035		while (retry_count--) {
  2036			if (!(readl(addr + SXGBE_DMA_MODE_REG) &
  2037			      SXGBE_DMA_SOFT_RESET))
  2038				break;
> 2039			usleep(10000, 11000);
  2040		}
  2041	
  2042		if (retry_count < 0)
  2043			return -EBUSY;
  2044	
  2045		return 0;
  2046	}
  2047	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32699 bytes --]

^ permalink raw reply

* [PATCH net v2 0/6] bnxt_en: Fixes for net.
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev

This bug fix series include NULL pointer fixes in ethtool -x code path
and in the error clean up path when freeing IRQs, a ring accounting bug
that missed rings used by the RDMA driver, and 3 bug fixes related to TC
Flower and VF-reps.

v2: Fixed commit message of patch 4.  Changed the pound sign to $ sign
in front of the ip command.

Andy Gospodarek (1):
  bnxt_en: do not allow wildcard matches for L2 flows

Michael Chan (3):
  bnxt_en: Fix ethtool -x crash when device is down.
  bnxt_en: Need to include RDMA rings in bnxt_check_rings().
  bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().

Sriharsha Basavapatna (2):
  bnxt_en: Ignore src port field in decap filter nodes
  bnxt_en: Support max-mtu with VF-reps

 drivers/net/ethernet/broadcom/bnxt/bnxt.c         |  4 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 11 ++--
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c      | 63 ++++++++++++++++++++++-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c     | 30 +++++++++++
 4 files changed, 103 insertions(+), 5 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* [PATCH net v2 1/6] bnxt_en: Fix ethtool -x crash when device is down.
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

Fix ethtool .get_rxfh() crash by checking for valid indirection table
address before copying the data.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 8d8ccd6..1f622ca 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -870,17 +870,22 @@ static int bnxt_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
 			 u8 *hfunc)
 {
 	struct bnxt *bp = netdev_priv(dev);
-	struct bnxt_vnic_info *vnic = &bp->vnic_info[0];
+	struct bnxt_vnic_info *vnic;
 	int i = 0;
 
 	if (hfunc)
 		*hfunc = ETH_RSS_HASH_TOP;
 
-	if (indir)
+	if (!bp->vnic_info)
+		return 0;
+
+	vnic = &bp->vnic_info[0];
+	if (indir && vnic->rss_table) {
 		for (i = 0; i < HW_HASH_INDEX_SIZE; i++)
 			indir[i] = le16_to_cpu(vnic->rss_table[i]);
+	}
 
-	if (key)
+	if (key && vnic->rss_hash_key)
 		memcpy(key, vnic->rss_hash_key, HW_HASH_KEY_SIZE);
 
 	return 0;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 2/6] bnxt_en: do not allow wildcard matches for L2 flows
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, Andy Gospodarek
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

From: Andy Gospodarek <gospo@broadcom.com>

Before this patch the following commands would succeed as far as the
user was concerned:

$ tc qdisc add dev p1p1 ingress
$ tc filter add dev p1p1 parent ffff: protocol all \
	flower skip_sw action drop
$ tc filter add dev p1p1 parent ffff: protocol ipv4 \
	flower skip_sw src_mac 00:02:00:00:00:01/44 action drop

The current flow offload infrastructure used does not support wildcard
matching for ethernet headers, so do not allow the second or third
commands to succeed.  If a user wants to drop traffic on that interface
the protocol and MAC addresses need to be specified explicitly:

$ tc qdisc add dev p1p1 ingress
$ tc filter add dev p1p1 parent ffff: protocol arp \
	flower skip_sw action drop
$ tc filter add dev p1p1 parent ffff: protocol ipv4 \
	flower skip_sw action drop
...
$ tc filter add dev p1p1 parent ffff: protocol ipv4 \
	flower skip_sw src_mac 00:02:00:00:00:01 action drop
$ tc filter add dev p1p1 parent ffff: protocol ipv4 \
	flower skip_sw src_mac 00:02:00:00:00:02 action drop
...

There are also checks for VLAN parameters in this patch as other callers
may wildcard those parameters even if tc does not.  Using different
flow infrastructure could allow this to work in the future for L2 flows,
but for now it does not.

Fixes: 2ae7408fedfe ("bnxt_en: bnxt: add TC flower filter offload support")
Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 59 ++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 65c2cee..ac193408 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -377,6 +377,30 @@ static bool is_wildcard(void *mask, int len)
 	return true;
 }
 
+static bool is_exactmatch(void *mask, int len)
+{
+	const u8 *p = mask;
+	int i;
+
+	for (i = 0; i < len; i++)
+		if (p[i] != 0xff)
+			return false;
+
+	return true;
+}
+
+static bool bits_set(void *key, int len)
+{
+	const u8 *p = key;
+	int i;
+
+	for (i = 0; i < len; i++)
+		if (p[i] != 0)
+			return true;
+
+	return false;
+}
+
 static int bnxt_hwrm_cfa_flow_alloc(struct bnxt *bp, struct bnxt_tc_flow *flow,
 				    __le16 ref_flow_handle,
 				    __le32 tunnel_handle, __le16 *flow_handle)
@@ -764,6 +788,41 @@ static bool bnxt_tc_can_offload(struct bnxt *bp, struct bnxt_tc_flow *flow)
 		return false;
 	}
 
+	/* Currently source/dest MAC cannot be partial wildcard  */
+	if (bits_set(&flow->l2_key.smac, sizeof(flow->l2_key.smac)) &&
+	    !is_exactmatch(flow->l2_mask.smac, sizeof(flow->l2_mask.smac))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for Source MAC\n");
+		return false;
+	}
+	if (bits_set(&flow->l2_key.dmac, sizeof(flow->l2_key.dmac)) &&
+	    !is_exactmatch(&flow->l2_mask.dmac, sizeof(flow->l2_mask.dmac))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for Dest MAC\n");
+		return false;
+	}
+
+	/* Currently VLAN fields cannot be partial wildcard */
+	if (bits_set(&flow->l2_key.inner_vlan_tci,
+		     sizeof(flow->l2_key.inner_vlan_tci)) &&
+	    !is_exactmatch(&flow->l2_mask.inner_vlan_tci,
+			   sizeof(flow->l2_mask.inner_vlan_tci))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for VLAN TCI\n");
+		return false;
+	}
+	if (bits_set(&flow->l2_key.inner_vlan_tpid,
+		     sizeof(flow->l2_key.inner_vlan_tpid)) &&
+	    !is_exactmatch(&flow->l2_mask.inner_vlan_tpid,
+			   sizeof(flow->l2_mask.inner_vlan_tpid))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for VLAN TPID\n");
+		return false;
+	}
+
+	/* Currently Ethertype must be set */
+	if (!is_exactmatch(&flow->l2_mask.ether_type,
+			   sizeof(flow->l2_mask.ether_type))) {
+		netdev_info(bp->dev, "Wildcard match unsupported for Ethertype\n");
+		return false;
+	}
+
 	return true;
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 3/6] bnxt_en: Ignore src port field in decap filter nodes
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, Sriharsha Basavapatna
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

From: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>

The driver currently uses src port field (along with other fields) in the
decap tunnel key, while looking up and adding tunnel nodes. This leads to
redundant cfa_decap_filter_alloc() requests to the FW and flow-miss in the
flow engine. Fix this by ignoring the src port field in decap tunnel nodes.

Fixes: f484f6782e01 ("bnxt_en: add hwrm FW cmds for cfa_encap_record and decap_filter")
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index ac193408..795f450 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -1051,8 +1051,10 @@ static int bnxt_tc_get_decap_handle(struct bnxt *bp, struct bnxt_tc_flow *flow,
 
 	/* Check if there's another flow using the same tunnel decap.
 	 * If not, add this tunnel to the table and resolve the other
-	 * tunnel header fileds
+	 * tunnel header fileds. Ignore src_port in the tunnel_key,
+	 * since it is not required for decap filters.
 	 */
+	decap_key->tp_src = 0;
 	decap_node = bnxt_tc_get_tunnel_node(bp, &tc_info->decap_table,
 					     &tc_info->decap_ht_params,
 					     decap_key);
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 4/6] bnxt_en: Support max-mtu with VF-reps
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev, Sriharsha Basavapatna
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

From: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>

While a VF is configured with a bigger mtu (> 1500), any packets that
are punted to the VF-rep (slow-path) get dropped by OVS kernel-datapath
with the following message: "dropped over-mtu packet". Fix this by
returning the max-mtu value for a VF-rep derived from its corresponding VF.
VF-rep's mtu can be changed using 'ip' command as shown in this example:

	$ ip link set bnxt0_pf0vf0 mtu 9000

Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c | 30 +++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
index 2629040..38f635c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
@@ -64,6 +64,31 @@ static int hwrm_cfa_vfr_free(struct bnxt *bp, u16 vf_idx)
 	return rc;
 }
 
+static int bnxt_hwrm_vfr_qcfg(struct bnxt *bp, struct bnxt_vf_rep *vf_rep,
+			      u16 *max_mtu)
+{
+	struct hwrm_func_qcfg_output *resp = bp->hwrm_cmd_resp_addr;
+	struct hwrm_func_qcfg_input req = {0};
+	u16 mtu;
+	int rc;
+
+	bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_QCFG, -1, -1);
+	req.fid = cpu_to_le16(bp->pf.vf[vf_rep->vf_idx].fw_fid);
+
+	mutex_lock(&bp->hwrm_cmd_lock);
+
+	rc = _hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
+	if (!rc) {
+		mtu = le16_to_cpu(resp->max_mtu_configured);
+		if (!mtu)
+			*max_mtu = BNXT_MAX_MTU;
+		else
+			*max_mtu = mtu;
+	}
+	mutex_unlock(&bp->hwrm_cmd_lock);
+	return rc;
+}
+
 static int bnxt_vf_rep_open(struct net_device *dev)
 {
 	struct bnxt_vf_rep *vf_rep = netdev_priv(dev);
@@ -365,6 +390,7 @@ static void bnxt_vf_rep_netdev_init(struct bnxt *bp, struct bnxt_vf_rep *vf_rep,
 				    struct net_device *dev)
 {
 	struct net_device *pf_dev = bp->dev;
+	u16 max_mtu;
 
 	dev->netdev_ops = &bnxt_vf_rep_netdev_ops;
 	dev->ethtool_ops = &bnxt_vf_rep_ethtool_ops;
@@ -380,6 +406,10 @@ static void bnxt_vf_rep_netdev_init(struct bnxt *bp, struct bnxt_vf_rep *vf_rep,
 	bnxt_vf_rep_eth_addr_gen(bp->pf.mac_addr, vf_rep->vf_idx,
 				 dev->perm_addr);
 	ether_addr_copy(dev->dev_addr, dev->perm_addr);
+	/* Set VF-Rep's max-mtu to the corresponding VF's max-mtu */
+	if (!bnxt_hwrm_vfr_qcfg(bp, vf_rep, &max_mtu))
+		dev->max_mtu = max_mtu;
+	dev->min_mtu = ETH_ZLEN;
 }
 
 static int bnxt_pcie_dsn_get(struct bnxt *bp, u8 dsn[])
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 5/6] bnxt_en: Need to include RDMA rings in bnxt_check_rings().
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

With recent changes to reserve both L2 and RDMA rings, we need to include
the RDMA rings in bnxt_check_rings().  Otherwise we will under-estimate
the rings we need during ethtool -L and may lead to failure.

Fixes: fbcfc8e46741 ("bnxt_en: Reserve completion rings and MSIX for bnxt_re RDMA driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1991f0c..9cb8b4b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7686,6 +7686,8 @@ int bnxt_check_rings(struct bnxt *bp, int tx, int rx, bool sh, int tcs,
 	if (bp->flags & BNXT_FLAG_AGG_RINGS)
 		rx_rings <<= 1;
 	cp = sh ? max_t(int, tx_rings_needed, rx) : tx_rings_needed + rx;
+	if (bp->flags & BNXT_FLAG_NEW_RM)
+		cp += bnxt_get_ulp_msix_num(bp);
 	return bnxt_hwrm_check_rings(bp, tx_rings_needed, rx_rings, rx, cp,
 				     vnics);
 }
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net v2 6/6] bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().
From: Michael Chan @ 2018-04-11 15:50 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1523461818-15774-1-git-send-email-michael.chan@broadcom.com>

When open fails during ethtool -L ring change, for example, the driver
may crash at bnxt_free_irq() because bp->bnapi is NULL.

If we fail to allocate all the new rings, bnxt_open_nic() will free
all the memory including bp->bnapi.  Subsequent call to bnxt_close_nic()
will try to dereference bp->bnapi in bnxt_free_irq().

Fix it by checking for !bp->bnapi in bnxt_free_irq().

Fixes: e5811b8c09df ("bnxt_en: Add IRQ remapping logic.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 9cb8b4b..f83769d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -6090,7 +6090,7 @@ static void bnxt_free_irq(struct bnxt *bp)
 	free_irq_cpu_rmap(bp->dev->rx_cpu_rmap);
 	bp->dev->rx_cpu_rmap = NULL;
 #endif
-	if (!bp->irq_tbl)
+	if (!bp->irq_tbl || !bp->bnapi)
 		return;

 	for (i = 0; i < bp->cp_nr_rings; i++) {
-- 
1.8.3.1

^ permalink raw reply related

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-11 15:51 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <1523386790-12396-3-git-send-email-sridhar.samudrala@intel.com>

Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation.
>
>It exposes 2 sets of interfaces to the paravirtual drivers.
>1. existing netvsc driver that uses 2 netdev model. In this model, no
>master netdev is created. The paravirtual driver registers each bypass
>instance along with a set of ops to manage the slave events.
>     bypass_master_register()
>     bypass_master_unregister()
>2. new virtio_net based solution that uses 3 netdev model. In this model,
>the bypass module provides interfaces to create/destroy additional master
>netdev and all the slave events are managed internally.
>      bypass_master_create()
>      bypass_master_destroy()
>
>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>---
> include/linux/netdevice.h |  14 +
> include/net/bypass.h      |  96 ++++++
> net/Kconfig               |  18 +
> net/core/Makefile         |   1 +
> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> 5 files changed, 973 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index cf44503ea81a..587293728f70 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> 	IFF_PHONY_HEADROOM		= 1<<24,
> 	IFF_MACSEC			= 1<<25,
> 	IFF_NO_RX_HANDLER		= 1<<26,
>+	IFF_BYPASS			= 1 << 27,
>+	IFF_BYPASS_SLAVE		= 1 << 28,

I wonder, why you don't follow the existing coding style... Also, please
add these to into the comment above.


> };
> 
> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
> #define IFF_MACSEC			IFF_MACSEC
> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>+#define IFF_BYPASS			IFF_BYPASS
>+#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
> 
> /**
>  *	struct net_device - The DEVICE structure.
>@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
> }
> 
>+static inline bool netif_is_bypass_master(const struct net_device *dev)
>+{
>+	return dev->priv_flags & IFF_BYPASS;
>+}
>+
>+static inline bool netif_is_bypass_slave(const struct net_device *dev)
>+{
>+	return dev->priv_flags & IFF_BYPASS_SLAVE;
>+}
>+
> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> static inline void netif_keep_dst(struct net_device *dev)
> {
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index 000000000000..86b02cb894cf
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,96 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include <linux/netdevice.h>
>+
>+struct bypass_ops {
>+	int (*slave_pre_register)(struct net_device *slave_netdev,
>+				  struct net_device *bypass_netdev);
>+	int (*slave_join)(struct net_device *slave_netdev,
>+			  struct net_device *bypass_netdev);
>+	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>+				    struct net_device *bypass_netdev);
>+	int (*slave_release)(struct net_device *slave_netdev,
>+			     struct net_device *bypass_netdev);
>+	int (*slave_link_change)(struct net_device *slave_netdev,
>+				 struct net_device *bypass_netdev);
>+	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_master {
>+	struct list_head list;
>+	struct net_device __rcu *bypass_netdev;
>+	struct bypass_ops __rcu *ops;
>+};
>+
>+/* bypass state */
>+struct bypass_info {
>+	/* passthru netdev with same MAC */
>+	struct net_device __rcu *active_netdev;

You still use "active"/"backup" names which is highly misleading as
it has completely different meaning that in bond for example.
I noted that in my previous review already. Please change it.


>+
>+	/* virtio_net netdev */
>+	struct net_device __rcu *backup_netdev;
>+
>+	/* active netdev stats */
>+	struct rtnl_link_stats64 active_stats;
>+
>+	/* backup netdev stats */
>+	struct rtnl_link_stats64 backup_stats;
>+
>+	/* aggregated stats */
>+	struct rtnl_link_stats64 bypass_stats;
>+
>+	/* spinlock while updating stats */
>+	spinlock_t stats_lock;
>+};
>+
>+#if IS_ENABLED(CONFIG_NET_BYPASS)
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master);
>+void bypass_master_destroy(struct bypass_master *bypass_master);
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct bypass_master **pbypass_master);
>+void bypass_master_unregister(struct bypass_master *bypass_master);
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev);
>+
>+#else
>+
>+static inline
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master);
>+{
>+	return 0;
>+}
>+
>+static inline
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct pbypass_master **pbypass_master);
>+{
>+	return 0;
>+}
>+
>+static inline
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+}
>+
>+static inline
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+	return 0;
>+}
>+
>+#endif
>+
>+#endif /* _NET_BYPASS_H */
>diff --git a/net/Kconfig b/net/Kconfig
>index 0428f12c25c2..994445f4a96a 100644
>--- a/net/Kconfig
>+++ b/net/Kconfig
>@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
> 	  on MAY_USE_DEVLINK to ensure they do not cause link errors when
> 	  devlink is a loadable module and the driver using it is built-in.
> 
>+config NET_BYPASS
>+	tristate "Bypass interface"
>+	---help---
>+	  This provides a generic interface for paravirtual drivers to listen
>+	  for netdev register/unregister/link change events from pci ethernet
>+	  devices with the same MAC and takeover their datapath. This also
>+	  enables live migration of a VM with direct attached VF by failing
>+	  over to the paravirtual datapath when the VF is unplugged.
>+
>+config MAY_USE_BYPASS
>+	tristate
>+	default m if NET_BYPASS=m
>+	default y if NET_BYPASS=y || NET_BYPASS=n
>+	help
>+	  Drivers using the bypass infrastructure should have a dependency
>+	  on MAY_USE_BYPASS to ensure they do not cause link errors when
>+	  bypass is a loadable module and the driver using it is built-in.
>+
> endif   # if NET
> 
> # Used by archs to tell that they support BPF JIT compiler plus which flavour.
>diff --git a/net/core/Makefile b/net/core/Makefile
>index 6dbbba8c57ae..a9727ed1c8fc 100644
>--- a/net/core/Makefile
>+++ b/net/core/Makefile
>@@ -30,3 +30,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
> obj-$(CONFIG_HWBM) += hwbm.o
> obj-$(CONFIG_NET_DEVLINK) += devlink.o
> obj-$(CONFIG_GRO_CELLS) += gro_cells.o
>+obj-$(CONFIG_NET_BYPASS) += bypass.o
>diff --git a/net/core/bypass.c b/net/core/bypass.c
>new file mode 100644
>index 000000000000..b5b9cb554c3f
>--- /dev/null
>+++ b/net/core/bypass.c
>@@ -0,0 +1,844 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+/* A common module to handle registrations and notifications for paravirtual
>+ * drivers to enable accelerated datapath and support VF live migration.
>+ *
>+ * The notifier and event handling code is based on netvsc driver.
>+ */
>+
>+#include <linux/netdevice.h>
>+#include <linux/etherdevice.h>
>+#include <linux/ethtool.h>
>+#include <linux/module.h>
>+#include <linux/slab.h>
>+#include <linux/netdevice.h>
>+#include <linux/netpoll.h>
>+#include <linux/rtnetlink.h>
>+#include <linux/if_vlan.h>
>+#include <linux/pci.h>
>+#include <net/sch_generic.h>
>+#include <uapi/linux/if_arp.h>
>+#include <net/bypass.h>
>+
>+static LIST_HEAD(bypass_master_list);
>+static DEFINE_SPINLOCK(bypass_lock);
>+
>+static int bypass_slave_pre_register(struct net_device *slave_netdev,
>+				     struct net_device *bypass_netdev,
>+				     struct bypass_ops *bypass_ops)
>+{
>+	struct bypass_info *bi;
>+	bool backup;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_pre_register)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_pre_register(slave_netdev,
>+						      bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+	if (backup ? rtnl_dereference(bi->backup_netdev) :
>+			rtnl_dereference(bi->active_netdev)) {
>+		netdev_err(bypass_netdev, "%s attempting to register as slave dev when %s already present\n",
>+			   slave_netdev->name, backup ? "backup" : "active");
>+		return -EEXIST;
>+	}
>+
>+	/* Avoid non pci devices as active netdev */
>+	if (!backup && (!slave_netdev->dev.parent ||
>+			!dev_is_pci(slave_netdev->dev.parent)))
>+		return -EINVAL;
>+
>+	return 0;
>+}
>+
>+static int bypass_slave_join(struct net_device *slave_netdev,
>+			     struct net_device *bypass_netdev,
>+			     struct bypass_ops *bypass_ops)
>+{
>+	struct bypass_info *bi;
>+	bool backup;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_join)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_join(slave_netdev, bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	backup = (slave_netdev->dev.parent == bypass_netdev->dev.parent);
>+
>+	dev_hold(slave_netdev);
>+
>+	if (backup) {
>+		rcu_assign_pointer(bi->backup_netdev, slave_netdev);
>+		dev_get_stats(bi->backup_netdev, &bi->backup_stats);
>+	} else {
>+		rcu_assign_pointer(bi->active_netdev, slave_netdev);
>+		dev_get_stats(bi->active_netdev, &bi->active_stats);
>+		bypass_netdev->min_mtu = slave_netdev->min_mtu;
>+		bypass_netdev->max_mtu = slave_netdev->max_mtu;
>+	}
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s joined\n",
>+		    slave_netdev->name);
>+
>+	return 0;
>+}
>+
>+/* Called when slave dev is injecting data into network stack.
>+ * Change the associated network device from lower dev to virtio.
>+ * note: already called with rcu_read_lock
>+ */
>+static rx_handler_result_t bypass_handle_frame(struct sk_buff **pskb)
>+{
>+	struct sk_buff *skb = *pskb;
>+	struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
>+
>+	skb->dev = ndev;
>+
>+	return RX_HANDLER_ANOTHER;
>+}
>+
>+static struct net_device *bypass_master_get_bymac(u8 *mac,
>+						  struct bypass_ops **ops)
>+{
>+	struct bypass_master *bypass_master;
>+	struct net_device *bypass_netdev;
>+
>+	spin_lock(&bypass_lock);
>+	list_for_each_entry(bypass_master, &bypass_master_list, list) {

As I wrote the last time, you don't need this list, spinlock.
You can do just something like:
        for_each_net(net) {
                for_each_netdev(net, dev) {
			if (netif_is_bypass_master(dev)) {




>+		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>+			*ops = rcu_dereference(bypass_master->ops);

I don't see how rcu_dereference is ok here.
1) I don't see rcu_read_lock taken
2) Looks like bypass_master->ops has the same value across the whole
   existence.


>+			spin_unlock(&bypass_lock);
>+			return bypass_netdev;
>+		}
>+	}
>+	spin_unlock(&bypass_lock);
>+	return NULL;
>+}
>+
>+static int bypass_slave_register(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev;
>+	struct bypass_ops *bypass_ops;
>+	int ret, orig_mtu;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);

For master, could you use word "master" in the variables so it is clear?
Also, "dev" is fine instead of "netdev".
Something like "bpmaster_dev"


>+	if (!bypass_netdev)
>+		goto done;
>+
>+	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>+					bypass_ops);
>+	if (ret != 0)

	Just "if (ret)" will do. You have this on more places.


>+		goto done;
>+
>+	ret = netdev_rx_handler_register(slave_netdev,
>+					 bypass_ops ? bypass_ops->handle_frame :
>+					 bypass_handle_frame, bypass_netdev);
>+	if (ret != 0) {
>+		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>+			   ret);
>+		goto done;
>+	}
>+
>+	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>+	if (ret != 0) {
>+		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>+			   bypass_netdev->name, ret);
>+		goto upper_link_failed;
>+	}
>+
>+	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>+
>+	if (netif_running(bypass_netdev)) {
>+		ret = dev_open(slave_netdev);
>+		if (ret && (ret != -EBUSY)) {
>+			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>+				   slave_netdev->name, ret);
>+			goto err_interface_up;
>+		}
>+	}
>+
>+	/* Align MTU of slave with master */
>+	orig_mtu = slave_netdev->mtu;
>+	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>+	if (ret != 0) {
>+		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>+			   slave_netdev->name, bypass_netdev->mtu);
>+		goto err_set_mtu;
>+	}
>+
>+	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>+	if (ret != 0)
>+		goto err_join;
>+
>+	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>+		    slave_netdev->name);
>+
>+	goto done;
>+
>+err_join:
>+	dev_set_mtu(slave_netdev, orig_mtu);
>+err_set_mtu:
>+	dev_close(slave_netdev);
>+err_interface_up:
>+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+upper_link_failed:
>+	netdev_rx_handler_unregister(slave_netdev);
>+done:
>+	return NOTIFY_DONE;
>+}
>+
>+static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>+				       struct net_device *bypass_netdev,
>+				       struct bypass_ops *bypass_ops)
>+{
>+	struct net_device *backup_netdev, *active_netdev;
>+	struct bypass_info *bi;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_pre_unregister)
>+			return -EINVAL;
>+
>+		return bypass_ops->slave_pre_unregister(slave_netdev,
>+							bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+		return -EINVAL;
>+
>+	return 0;
>+}
>+
>+static int bypass_slave_release(struct net_device *slave_netdev,
>+				struct net_device *bypass_netdev,
>+				struct bypass_ops *bypass_ops)
>+{
>+	struct net_device *backup_netdev, *active_netdev;
>+	struct bypass_info *bi;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_release)
>+			return -EINVAL;

I think it would be good to make the API to the driver more strict and
have a separate set of ops for "active" and "backup" netdevices.
That should stop people thinking about extending this to more slaves in
the future.



>+
>+		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>+	}
>+
>+	bi = netdev_priv(bypass_netdev);
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev == backup_netdev) {
>+		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>+	} else {
>+		RCU_INIT_POINTER(bi->active_netdev, NULL);
>+		if (backup_netdev) {
>+			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+		}
>+	}
>+
>+	dev_put(slave_netdev);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>+		    slave_netdev->name);
>+
>+	return 0;
>+}
>+
>+int bypass_slave_unregister(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev;
>+	struct bypass_ops *bypass_ops;
>+	int ret;
>+
>+	if (!netif_is_bypass_slave(slave_netdev))
>+		goto done;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);
>+	if (!bypass_netdev)
>+		goto done;
>+
>+	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>+					  bypass_ops);
>+	if (ret != 0)
>+		goto done;
>+
>+	netdev_rx_handler_unregister(slave_netdev);
>+	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>+	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>+
>+	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>+
>+	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>+		    slave_netdev->name);
>+
>+done:
>+	return NOTIFY_DONE;
>+}
>+EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>+
>+static bool bypass_xmit_ready(struct net_device *dev)
>+{
>+	return netif_running(dev) && netif_carrier_ok(dev);
>+}
>+
>+static int bypass_slave_link_change(struct net_device *slave_netdev)
>+{
>+	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>+	struct bypass_ops *bypass_ops;
>+	struct bypass_info *bi;
>+
>+	if (!netif_is_bypass_slave(slave_netdev))
>+		goto done;
>+
>+	ASSERT_RTNL();
>+
>+	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>+						&bypass_ops);
>+	if (!bypass_netdev)
>+		goto done;
>+
>+	if (bypass_ops) {
>+		if (!bypass_ops->slave_link_change)
>+			goto done;
>+
>+		return bypass_ops->slave_link_change(slave_netdev,
>+						     bypass_netdev);
>+	}
>+
>+	if (!netif_running(bypass_netdev))
>+		return 0;
>+
>+	bi = netdev_priv(bypass_netdev);
>+
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+
>+	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>+		goto done;

You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
above is enough.


>+
>+	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>+	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>+		netif_carrier_on(bypass_netdev);
>+		netif_tx_wake_all_queues(bypass_netdev);
>+	} else {
>+		netif_carrier_off(bypass_netdev);
>+		netif_tx_stop_all_queues(bypass_netdev);
>+	}
>+
>+done:
>+	return NOTIFY_DONE;
>+}
>+
>+static bool bypass_validate_event_dev(struct net_device *dev)
>+{
>+	/* Skip parent events */
>+	if (netif_is_bypass_master(dev))
>+		return false;
>+
>+	/* Avoid non-Ethernet type devices */
>+	if (dev->type != ARPHRD_ETHER)
>+		return false;
>+
>+	/* Avoid Vlan dev with same MAC registering as VF */
>+	if (is_vlan_dev(dev))
>+		return false;
>+
>+	/* Avoid Bonding master dev with same MAC registering as slave dev */
>+	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))

Yeah, this is certainly incorrect. One thing is, you should be using the
helpers netif_is_bond_master().
But what about the rest? macsec, macvlan, team, bridge, ovs and others?

You need to do it not by blacklisting, but with whitelisting. You need
to whitelist VF devices. My port flavours patchset might help with this.


>+		return false;
>+
>+	return true;
>+}
>+
>+static int
>+bypass_event(struct notifier_block *this, unsigned long event, void *ptr)
>+{
>+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>+
>+	if (!bypass_validate_event_dev(event_dev))
>+		return NOTIFY_DONE;
>+
>+	switch (event) {
>+	case NETDEV_REGISTER:
>+		return bypass_slave_register(event_dev);
>+	case NETDEV_UNREGISTER:
>+		return bypass_slave_unregister(event_dev);
>+	case NETDEV_UP:
>+	case NETDEV_DOWN:
>+	case NETDEV_CHANGE:
>+		return bypass_slave_link_change(event_dev);
>+	default:
>+		return NOTIFY_DONE;
>+	}
>+}
>+
>+static struct notifier_block bypass_notifier = {
>+	.notifier_call = bypass_event,
>+};
>+
>+int bypass_open(struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *active_netdev, *backup_netdev;
>+	int err;
>+
>+	netif_carrier_off(dev);
>+	netif_tx_wake_all_queues(dev);
>+
>+	active_netdev = rtnl_dereference(bi->active_netdev);
>+	if (active_netdev) {
>+		err = dev_open(active_netdev);
>+		if (err)
>+			goto err_active_open;
>+	}
>+
>+	backup_netdev = rtnl_dereference(bi->backup_netdev);
>+	if (backup_netdev) {
>+		err = dev_open(backup_netdev);
>+		if (err)
>+			goto err_backup_open;
>+	}
>+
>+	return 0;
>+
>+err_backup_open:
>+	dev_close(active_netdev);
>+err_active_open:
>+	netif_tx_disable(dev);
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_open);
>+
>+int bypass_close(struct net_device *dev)
>+{
>+	struct bypass_info *vi = netdev_priv(dev);

This should be probably "bi"


>+	struct net_device *slave_netdev;
>+
>+	netif_tx_disable(dev);
>+
>+	slave_netdev = rtnl_dereference(vi->active_netdev);
>+	if (slave_netdev)
>+		dev_close(slave_netdev);
>+
>+	slave_netdev = rtnl_dereference(vi->backup_netdev);
>+	if (slave_netdev)
>+		dev_close(slave_netdev);
>+
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_close);
>+
>+static netdev_tx_t bypass_drop_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+	atomic_long_inc(&dev->tx_dropped);
>+	dev_kfree_skb_any(skb);
>+	return NETDEV_TX_OK;
>+}
>+
>+netdev_tx_t bypass_start_xmit(struct sk_buff *skb, struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);

If you rename the other variable to "bpmaster_dev", it would be nice to
rename this to bpinfo or something more descriptive. "bi" is too short
to know what that is right away.


>+	struct net_device *xmit_dev;

Don't mix "dev" and "netdev" in one .c file. Just use "dev" for all.



>+
>+	/* Try xmit via active netdev followed by backup netdev */
>+	xmit_dev = rcu_dereference_bh(bi->active_netdev);
>+	if (!xmit_dev || !bypass_xmit_ready(xmit_dev)) {
>+		xmit_dev = rcu_dereference_bh(bi->backup_netdev);
>+		if (!xmit_dev || !bypass_xmit_ready(xmit_dev))
>+			return bypass_drop_xmit(skb, dev);
>+	}
>+
>+	skb->dev = xmit_dev;
>+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
>+
>+	return dev_queue_xmit(skb);
>+}
>+EXPORT_SYMBOL_GPL(bypass_start_xmit);
>+
>+u16 bypass_select_queue(struct net_device *dev, struct sk_buff *skb,
>+			void *accel_priv, select_queue_fallback_t fallback)
>+{
>+	/* This helper function exists to help dev_pick_tx get the correct
>+	 * destination queue.  Using a helper function skips a call to
>+	 * skb_tx_hash and will put the skbs in the queue we expect on their
>+	 * way down to the bonding driver.
>+	 */
>+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
>+
>+	/* Save the original txq to restore before passing to the driver */
>+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
>+
>+	if (unlikely(txq >= dev->real_num_tx_queues)) {
>+		do {
>+			txq -= dev->real_num_tx_queues;
>+		} while (txq >= dev->real_num_tx_queues);
>+	}
>+
>+	return txq;
>+}
>+EXPORT_SYMBOL_GPL(bypass_select_queue);
>+
>+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
>+ * that some drivers can provide 32bit values only.
>+ */
>+static void bypass_fold_stats(struct rtnl_link_stats64 *_res,
>+			      const struct rtnl_link_stats64 *_new,
>+			      const struct rtnl_link_stats64 *_old)
>+{
>+	const u64 *new = (const u64 *)_new;
>+	const u64 *old = (const u64 *)_old;
>+	u64 *res = (u64 *)_res;
>+	int i;
>+
>+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
>+		u64 nv = new[i];
>+		u64 ov = old[i];
>+		s64 delta = nv - ov;
>+
>+		/* detects if this particular field is 32bit only */
>+		if (((nv | ov) >> 32) == 0)
>+			delta = (s64)(s32)((u32)nv - (u32)ov);
>+
>+		/* filter anomalies, some drivers reset their stats
>+		 * at down/up events.
>+		 */
>+		if (delta > 0)
>+			res[i] += delta;
>+	}
>+}
>+
>+void bypass_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);

You can WARN_ON and return in case the dev is not bypass master, just
to catch buggy drivers. Same with other helpers.


>+	const struct rtnl_link_stats64 *new;
>+	struct rtnl_link_stats64 temp;
>+	struct net_device *slave_netdev;
>+
>+	spin_lock(&bi->stats_lock);
>+	memcpy(stats, &bi->bypass_stats, sizeof(*stats));
>+
>+	rcu_read_lock();
>+
>+	slave_netdev = rcu_dereference(bi->active_netdev);
>+	if (slave_netdev) {
>+		new = dev_get_stats(slave_netdev, &temp);
>+		bypass_fold_stats(stats, new, &bi->active_stats);
>+		memcpy(&bi->active_stats, new, sizeof(*new));
>+	}
>+
>+	slave_netdev = rcu_dereference(bi->backup_netdev);
>+	if (slave_netdev) {
>+		new = dev_get_stats(slave_netdev, &temp);
>+		bypass_fold_stats(stats, new, &bi->backup_stats);
>+		memcpy(&bi->backup_stats, new, sizeof(*new));
>+	}
>+
>+	rcu_read_unlock();
>+
>+	memcpy(&bi->bypass_stats, stats, sizeof(*stats));
>+	spin_unlock(&bi->stats_lock);
>+}
>+EXPORT_SYMBOL_GPL(bypass_get_stats);
>+
>+int bypass_change_mtu(struct net_device *dev, int new_mtu)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *active_netdev, *backup_netdev;
>+	int ret = 0;

Pointless initialization.


>+
>+	active_netdev = rcu_dereference(bi->active_netdev);
>+	if (active_netdev) {
>+		ret = dev_set_mtu(active_netdev, new_mtu);
>+		if (ret)
>+			return ret;
>+	}
>+
>+	backup_netdev = rcu_dereference(bi->backup_netdev);
>+	if (backup_netdev) {
>+		ret = dev_set_mtu(backup_netdev, new_mtu);
>+		if (ret) {
>+			dev_set_mtu(active_netdev, dev->mtu);
>+			return ret;
>+		}
>+	}
>+
>+	dev->mtu = new_mtu;
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_change_mtu);
>+
>+void bypass_set_rx_mode(struct net_device *dev)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *slave_netdev;
>+
>+	rcu_read_lock();
>+
>+	slave_netdev = rcu_dereference(bi->active_netdev);
>+	if (slave_netdev) {
>+		dev_uc_sync_multiple(slave_netdev, dev);
>+		dev_mc_sync_multiple(slave_netdev, dev);
>+	}
>+
>+	slave_netdev = rcu_dereference(bi->backup_netdev);
>+	if (slave_netdev) {
>+		dev_uc_sync_multiple(slave_netdev, dev);
>+		dev_mc_sync_multiple(slave_netdev, dev);
>+	}
>+
>+	rcu_read_unlock();
>+}
>+EXPORT_SYMBOL_GPL(bypass_set_rx_mode);
>+
>+static const struct net_device_ops bypass_netdev_ops = {
>+	.ndo_open		= bypass_open,
>+	.ndo_stop		= bypass_close,
>+	.ndo_start_xmit		= bypass_start_xmit,
>+	.ndo_select_queue	= bypass_select_queue,
>+	.ndo_get_stats64	= bypass_get_stats,
>+	.ndo_change_mtu		= bypass_change_mtu,
>+	.ndo_set_rx_mode	= bypass_set_rx_mode,
>+	.ndo_validate_addr	= eth_validate_addr,
>+	.ndo_features_check	= passthru_features_check,
>+};
>+
>+#define BYPASS_DRV_NAME "bypass"
>+#define BYPASS_DRV_VERSION "0.1"
>+
>+static void bypass_ethtool_get_drvinfo(struct net_device *dev,
>+				       struct ethtool_drvinfo *drvinfo)
>+{
>+	strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
>+	strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
>+}
>+
>+int bypass_ethtool_get_link_ksettings(struct net_device *dev,
>+				      struct ethtool_link_ksettings *cmd)
>+{
>+	struct bypass_info *bi = netdev_priv(dev);
>+	struct net_device *slave_netdev;
>+
>+	slave_netdev = rtnl_dereference(bi->active_netdev);
>+	if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+		slave_netdev = rtnl_dereference(bi->backup_netdev);
>+		if (!slave_netdev || !bypass_xmit_ready(slave_netdev)) {
>+			cmd->base.duplex = DUPLEX_UNKNOWN;
>+			cmd->base.port = PORT_OTHER;
>+			cmd->base.speed = SPEED_UNKNOWN;
>+
>+			return 0;
>+		}
>+	}
>+
>+	return __ethtool_get_link_ksettings(slave_netdev, cmd);
>+}
>+EXPORT_SYMBOL_GPL(bypass_ethtool_get_link_ksettings);
>+
>+static const struct ethtool_ops bypass_ethtool_ops = {
>+	.get_drvinfo            = bypass_ethtool_get_drvinfo,
>+	.get_link               = ethtool_op_get_link,
>+	.get_link_ksettings     = bypass_ethtool_get_link_ksettings,
>+};
>+
>+static void bypass_register_existing_slave(struct net_device *bypass_netdev)
>+{
>+	struct net *net = dev_net(bypass_netdev);
>+	struct net_device *dev;
>+
>+	rtnl_lock();
>+	for_each_netdev(net, dev) {
>+		if (dev == bypass_netdev)
>+			continue;
>+		if (!bypass_validate_event_dev(dev))
>+			continue;
>+		if (ether_addr_equal(bypass_netdev->perm_addr, dev->perm_addr))
>+			bypass_slave_register(dev);
>+	}
>+	rtnl_unlock();
>+}
>+
>+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
>+			   struct bypass_master **pbypass_master)
>+{
>+	struct bypass_master *bypass_master;
>+
>+	bypass_master = kzalloc(sizeof(*bypass_master), GFP_KERNEL);
>+	if (!bypass_master)
>+		return -ENOMEM;
>+
>+	rcu_assign_pointer(bypass_master->ops, ops);
>+	dev_hold(dev);
>+	dev->priv_flags |= IFF_BYPASS;
>+	rcu_assign_pointer(bypass_master->bypass_netdev, dev);
>+
>+	spin_lock(&bypass_lock);
>+	list_add_tail(&bypass_master->list, &bypass_master_list);
>+	spin_unlock(&bypass_lock);
>+
>+	bypass_register_existing_slave(dev);
>+
>+	*pbypass_master = bypass_master;
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_register);
>+
>+void bypass_master_unregister(struct bypass_master *bypass_master)
>+{
>+	struct net_device *bypass_netdev;
>+
>+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+
>+	bypass_netdev->priv_flags &= ~IFF_BYPASS;
>+	dev_put(bypass_netdev);
>+
>+	spin_lock(&bypass_lock);
>+	list_del(&bypass_master->list);
>+	spin_unlock(&bypass_lock);
>+
>+	kfree(bypass_master);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_unregister);
>+
>+int bypass_master_create(struct net_device *backup_netdev,
>+			 struct bypass_master **pbypass_master)
>+{
>+	struct device *dev = backup_netdev->dev.parent;
>+	struct net_device *bypass_netdev;
>+	int err;
>+
>+	/* Alloc at least 2 queues, for now we are going with 16 assuming
>+	 * that most devices being bonded won't have too many queues.
>+	 */
>+	bypass_netdev = alloc_etherdev_mq(sizeof(struct bypass_info), 16);
>+	if (!bypass_netdev) {
>+		dev_err(dev, "Unable to allocate bypass_netdev!\n");
>+		return -ENOMEM;
>+	}
>+
>+	dev_net_set(bypass_netdev, dev_net(backup_netdev));
>+	SET_NETDEV_DEV(bypass_netdev, dev);
>+
>+	bypass_netdev->netdev_ops = &bypass_netdev_ops;
>+	bypass_netdev->ethtool_ops = &bypass_ethtool_ops;
>+
>+	/* Initialize the device options */
>+	bypass_netdev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
>+	bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
>+				       IFF_TX_SKB_SHARING);
>+
>+	/* don't acquire bypass netdev's netif_tx_lock when transmitting */
>+	bypass_netdev->features |= NETIF_F_LLTX;
>+
>+	/* Don't allow bypass devices to change network namespaces. */
>+	bypass_netdev->features |= NETIF_F_NETNS_LOCAL;
>+
>+	bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
>+				     NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
>+				     NETIF_F_HIGHDMA | NETIF_F_LRO;
>+
>+	bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>+	bypass_netdev->features |= bypass_netdev->hw_features;
>+
>+	memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr,
>+	       bypass_netdev->addr_len);
>+
>+	bypass_netdev->min_mtu = backup_netdev->min_mtu;
>+	bypass_netdev->max_mtu = backup_netdev->max_mtu;
>+
>+	err = register_netdev(bypass_netdev);
>+	if (err < 0) {
>+		dev_err(dev, "Unable to register bypass_netdev!\n");
>+		goto err_register_netdev;
>+	}
>+
>+	netif_carrier_off(bypass_netdev);
>+
>+	err = bypass_master_register(bypass_netdev, NULL, pbypass_master);
>+	if (err < 0)

just "if (err)" would do.


>+		goto err_bypass;
>+
>+	return 0;
>+
>+err_bypass:
>+	unregister_netdev(bypass_netdev);
>+err_register_netdev:
>+	free_netdev(bypass_netdev);
>+
>+	return err;
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_create);
>+
>+void bypass_master_destroy(struct bypass_master *bypass_master)
>+{
>+	struct net_device *bypass_netdev;
>+	struct net_device *slave_netdev;
>+	struct bypass_info *bi;
>+
>+	if (!bypass_master)
>+		return;
>+
>+	bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>+	bi = netdev_priv(bypass_netdev);
>+
>+	netif_device_detach(bypass_netdev);
>+
>+	rtnl_lock();
>+
>+	slave_netdev = rtnl_dereference(bi->active_netdev);
>+	if (slave_netdev)
>+		bypass_slave_unregister(slave_netdev);
>+
>+	slave_netdev = rtnl_dereference(bi->backup_netdev);
>+	if (slave_netdev)
>+		bypass_slave_unregister(slave_netdev);
>+
>+	bypass_master_unregister(bypass_master);
>+
>+	unregister_netdevice(bypass_netdev);
>+
>+	rtnl_unlock();
>+
>+	free_netdev(bypass_netdev);
>+}
>+EXPORT_SYMBOL_GPL(bypass_master_destroy);
>+
>+static __init int
>+bypass_init(void)
>+{
>+	register_netdevice_notifier(&bypass_notifier);
>+
>+	return 0;
>+}
>+module_init(bypass_init);
>+
>+static __exit
>+void bypass_exit(void)
>+{
>+	unregister_netdevice_notifier(&bypass_notifier);
>+}
>+module_exit(bypass_exit);
>+
>+MODULE_DESCRIPTION("Bypass infrastructure/interface for Paravirtual drivers");
>+MODULE_LICENSE("GPL v2");
>-- 
>2.14.3
>

^ permalink raw reply

* [PATCH v2] net: samsung: sxgbe: Replace mdelay with usleep_range in sxgbe_sw_reset
From: Jia-Ju Bai @ 2018-04-11 15:51 UTC (permalink / raw)
  To: bh74.an, ks.giri, vipul.pandya; +Cc: netdev, linux-kernel, Jia-Ju Bai

sxgbe_sw_reset() is never called in atomic context.

sxgbe_sw_reset() is only called by sxgbe_drv_probe(), which is 
only called by sxgbe_platform_probe().
sxgbe_platform_probe() is set as ".probe" in struct platform_driver.
This function is not called in atomic context.

Despite never getting called from atomic context, sxgbe_sw_reset()
calls mdelay() to busily wait.
This is not necessary and can be replaced with usleep_range() to
avoid busy waiting.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
---
v2:
* Use usleep_range() to correct usleep() in v1.

---	
 drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
index 89831ad..99cd586 100644
--- a/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
+++ b/drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c
@@ -2038,7 +2038,7 @@ static int sxgbe_sw_reset(void __iomem *addr)
 		if (!(readl(addr + SXGBE_DMA_MODE_REG) &
 		      SXGBE_DMA_SOFT_RESET))
 			break;
-		mdelay(10);
+		usleep_range(10000, 11000);
 	}

 	if (retry_count < 0)
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH] net: samsung: sxgbe: Replace mdelay with usleep_range in sxgbe_sw_reset
From: kbuild test robot @ 2018-04-11 15:52 UTC (permalink / raw)
  To: Jia-Ju Bai
  Cc: kbuild-all, bh74.an, ks.giri, vipul.pandya, netdev, linux-kernel,
	Jia-Ju Bai
In-Reply-To: <1523413280-2336-1-git-send-email-baijiaju1990@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1639 bytes --]

Hi Jia-Ju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.16 next-20180411]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Jia-Ju-Bai/net-samsung-sxgbe-Replace-mdelay-with-usleep_range-in-sxgbe_sw_reset/20180411-225900
config: i386-randconfig-s1-201814 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c: In function 'sxgbe_sw_reset':
>> drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c:2039:3: error: implicit declaration of function 'usleep' [-Werror=implicit-function-declaration]
      usleep(10000, 11000);
      ^~~~~~
   cc1: some warnings being treated as errors

vim +/usleep +2039 drivers/net//ethernet/samsung/sxgbe/sxgbe_main.c

  2029	
  2030	static int sxgbe_sw_reset(void __iomem *addr)
  2031	{
  2032		int retry_count = 10;
  2033	
  2034		writel(SXGBE_DMA_SOFT_RESET, addr + SXGBE_DMA_MODE_REG);
  2035		while (retry_count--) {
  2036			if (!(readl(addr + SXGBE_DMA_MODE_REG) &
  2037			      SXGBE_DMA_SOFT_RESET))
  2038				break;
> 2039			usleep(10000, 11000);
  2040		}
  2041	
  2042		if (retry_count < 0)
  2043			return -EBUSY;
  2044	
  2045		return 0;
  2046	}
  2047	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24019 bytes --]

^ permalink raw reply

* [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Stephen Suryaputra @ 2018-04-11  2:55 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Suryaputra

This is enhanced from the proposed patch by Igor Maravic in 2011 to
support per interface IPv4 stats. The enhancement is mainly adding a
kernel configuration option CONFIG_IP_IFSTATS_TABLE.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
---
 drivers/net/vrf.c               |   2 +-
 include/linux/inetdevice.h      |  22 +++++++
 include/net/icmp.h              |  44 ++++++++++++--
 include/net/ip.h                |  72 +++++++++++++++++++++--
 include/net/ipv6.h              |  62 ++++----------------
 include/net/netns/mib.h         |   3 +
 include/net/snmp.h              |  95 ++++++++++++++++++++++++++++++
 net/bridge/br_netfilter_hooks.c |  10 ++--
 net/dccp/ipv4.c                 |   4 +-
 net/ipv4/Kconfig                |   8 +++
 net/ipv4/af_inet.c              |   7 ++-
 net/ipv4/datagram.c             |   2 +-
 net/ipv4/devinet.c              |  85 ++++++++++++++++++++++++++-
 net/ipv4/icmp.c                 |  32 +++++-----
 net/ipv4/inet_connection_sock.c |   8 ++-
 net/ipv4/ip_forward.c           |   8 +--
 net/ipv4/ip_fragment.c          |  20 ++++---
 net/ipv4/ip_input.c             |  29 ++++-----
 net/ipv4/ip_output.c            |  40 ++++++++-----
 net/ipv4/ipmr.c                 |   6 +-
 net/ipv4/ping.c                 |   9 ++-
 net/ipv4/proc.c                 | 126 ++++++++++++++++++++++++++++++++++++++++
 net/ipv4/raw.c                  |   4 +-
 net/ipv4/route.c                |   6 +-
 net/ipv4/tcp_ipv4.c             |   4 +-
 net/ipv4/udp.c                  |   4 +-
 net/l2tp/l2tp_ip.c              |   4 +-
 net/l2tp/l2tp_ip6.c             |   2 +-
 net/mpls/af_mpls.c              |   2 +-
 net/netfilter/ipvs/ip_vs_xmit.c |   2 +-
 net/sctp/input.c                |   2 +-
 net/sctp/output.c               |   2 +-
 32 files changed, 570 insertions(+), 156 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 0a2b180..509c2ca 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -592,7 +592,7 @@ static int vrf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index e16fe7d..3d120cb 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -22,6 +22,15 @@ struct ipv4_devconf {
 
 #define MC_HASH_SZ_LOG 9
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+struct ipv4_devstat {
+	struct proc_dir_entry *proc_dir_entry;
+	DEFINE_SNMP_STAT(struct ipstats_mib, ip);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmp_mib_device, icmpdev);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmpmsg_mib_device, icmpmsgdev);
+};
+#endif
+
 struct in_device {
 	struct net_device	*dev;
 	refcount_t		refcnt;
@@ -45,6 +54,9 @@ struct in_device {
 
 	struct neigh_parms	*arp_parms;
 	struct ipv4_devconf	cnf;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct ipv4_devstat stats;
+#endif
 	struct rcu_head		rcu_head;
 };
 
@@ -216,6 +228,16 @@ static inline struct in_device *__in_dev_get_rcu(const struct net_device *dev)
 	return rcu_dereference(dev->ip_ptr);
 }
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static inline struct in_device *__in_dev_get_rcu_safely(const struct net_device *dev)
+{
+	if (likely(dev))
+		return rcu_dereference(dev->ip_ptr);
+	else
+		return NULL;
+}
+#endif
+
 static inline struct in_device *in_dev_get(const struct net_device *dev)
 {
 	struct in_device *in_dev;
diff --git a/include/net/icmp.h b/include/net/icmp.h
index 3ef2743..fdfbc0f 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -29,10 +29,44 @@ struct icmp_err {
 };
 
 extern const struct icmp_err icmp_err_convert[];
-#define ICMP_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define __ICMP_INC_STATS(net, field)	__SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define ICMPMSGOUT_INC_STATS(net, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field+256)
-#define ICMPMSGIN_INC_STATS(net, field)		SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field)
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#define ICMP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINCATOMIC(net, icmp, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define __ICMP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		___DEVINCATOMIC(net, icmp, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define ICMPMSGOUT_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC_ATOMIC_ATOMIC(net, icmpmsg, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field+256);	\
+		rcu_read_unlock();	\
+	})
+
+#define ICMPMSGIN_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC_ATOMIC_ATOMIC(net, icmpmsg, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+#else
+#define ICMP_INC_STATS(net, dev, field)	SNMP_INC_STATS((net)->mib.icmp_statistics, field)
+#define __ICMP_INC_STATS(net, dev, field)	__SNMP_INC_STATS((net)->mib.icmp_statistics, field)
+#define ICMPMSGOUT_INC_STATS(net, dev, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field+256)
+#define ICMPMSGIN_INC_STATS(net, dev, field)		SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field)
+#endif
 
 struct dst_entry;
 struct net_proto_family;
@@ -43,6 +77,6 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
 int icmp_rcv(struct sk_buff *skb);
 void icmp_err(struct sk_buff *skb, u32 info);
 int icmp_init(void);
-void icmp_out_count(struct net *net, unsigned char type);
+void icmp_out_count(struct net_device *dev, unsigned char type);
 
 #endif	/* _ICMP_H */
diff --git a/include/net/ip.h b/include/net/ip.h
index ecffd84..aa8a55b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -27,6 +27,7 @@
 #include <linux/in.h>
 #include <linux/skbuff.h>
 #include <linux/jhash.h>
+#include <linux/inetdevice.h>
 
 #include <net/inet_sock.h>
 #include <net/route.h>
@@ -218,12 +219,62 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			   const struct ip_reply_arg *arg,
 			   unsigned int len);
 
-#define IP_INC_STATS(net, field)	SNMP_INC_STATS64((net)->mib.ip_statistics, field)
-#define __IP_INC_STATS(net, field)	__SNMP_INC_STATS64((net)->mib.ip_statistics, field)
-#define IP_ADD_STATS(net, field, val)	SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
-#define __IP_ADD_STATS(net, field, val) __SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
-#define IP_UPD_PO_STATS(net, field, val) SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
-#define __IP_UPD_PO_STATS(net, field, val) __SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#define IP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define __IP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		___DEVINC(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define IP_ADD_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVADD(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field, val);	\
+		rcu_read_unlock();	\
+	})
+
+#define __IP_ADD_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		___DEVADD(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field, val);	\
+		rcu_read_unlock();	\
+	})
+
+#define IP_UPD_PO_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVUPD(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field, val);	\
+		rcu_read_unlock();	\
+	})
+
+#define __IP_UPD_PO_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		___DEVUPD(net, ip, struct in_device,	\
+			__in_dev_get_rcu_safely(dev), field, val);	\
+		rcu_read_unlock();	\
+	})
+#else
+#define IP_INC_STATS(net, dev, field)	SNMP_INC_STATS64((net)->mib.ip_statistics, field)
+#define __IP_INC_STATS(net, dev, field)	__SNMP_INC_STATS64((net)->mib.ip_statistics, field)
+#define IP_ADD_STATS(net, dev, field, val)	SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
+#define __IP_ADD_STATS(net, dev, field, val) __SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
+#define IP_UPD_PO_STATS(net, dev, field, val) SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#define __IP_UPD_PO_STATS(net, dev, field, val) __SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#endif
 #define NET_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.net_statistics, field)
 #define __NET_INC_STATS(net, field)	__SNMP_INC_STATS((net)->mib.net_statistics, field)
 #define NET_ADD_STATS(net, field, adnd)	SNMP_ADD_STATS((net)->mib.net_statistics, field, adnd)
@@ -660,4 +711,13 @@ extern int sysctl_icmp_msgs_burst;
 int ip_misc_proc_init(void);
 #endif
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#ifdef CONFIG_PROC_FS
+extern int snmp_register_dev(struct in_device *idev);
+extern int snmp_unregister_dev(struct in_device *idev);
+#else
+extern int snmp_register_dev(struct in_device *idev) { return 0; }
+extern int snmp_unregister_dev(struct in_device *idev) { return 0; }
+#endif
+#endif
 #endif	/* _IP_H */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 9b6e7f5..c9be5e1 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -163,71 +163,29 @@ struct frag_hdr {
 extern int sysctl_mld_max_msf;
 extern int sysctl_mld_qrv;
 
-#define _DEVINC(net, statname, mod, idev, field)			\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		mod##SNMP_INC_STATS64((_idev)->stats.statname, (field));\
-	mod##SNMP_INC_STATS64((net)->mib.statname##_statistics, (field));\
-})
-
-/* per device counters are atomic_long_t */
-#define _DEVINCATOMIC(net, statname, mod, idev, field)			\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field)); \
-	mod##SNMP_INC_STATS((net)->mib.statname##_statistics, (field));\
-})
-
-/* per device and per net counters are atomic_long_t */
-#define _DEVINC_ATOMIC_ATOMIC(net, statname, idev, field)		\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field)); \
-	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field));\
-})
-
-#define _DEVADD(net, statname, mod, idev, field, val)			\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		mod##SNMP_ADD_STATS((_idev)->stats.statname, (field), (val)); \
-	mod##SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));\
-})
-
-#define _DEVUPD(net, statname, mod, idev, field, val)			\
-({									\
-	struct inet6_dev *_idev = (idev);				\
-	if (likely(_idev != NULL))					\
-		mod##SNMP_UPD_PO_STATS((_idev)->stats.statname, field, (val)); \
-	mod##SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));\
-})
-
 /* MIBs */
 
 #define IP6_INC_STATS(net, idev,field)		\
-		_DEVINC(net, ipv6, , idev, field)
+		_DEVINC(net, ipv6, struct inet6_dev, idev, field)
 #define __IP6_INC_STATS(net, idev,field)	\
-		_DEVINC(net, ipv6, __, idev, field)
+		___DEVINC(net, ipv6, struct inet6_dev, idev, field)
 #define IP6_ADD_STATS(net, idev,field,val)	\
-		_DEVADD(net, ipv6, , idev, field, val)
+		_DEVADD(net, ipv6, struct inet6_dev, idev, field, val)
 #define __IP6_ADD_STATS(net, idev,field,val)	\
-		_DEVADD(net, ipv6, __, idev, field, val)
+		___DEVADD(net, ipv6, struct inet6_dev, idev, field, val)
 #define IP6_UPD_PO_STATS(net, idev,field,val)   \
-		_DEVUPD(net, ipv6, , idev, field, val)
+		_DEVUPD(net, ipv6, struct inet6_dev, idev, field, val)
 #define __IP6_UPD_PO_STATS(net, idev,field,val)   \
-		_DEVUPD(net, ipv6, __, idev, field, val)
+		___DEVUPD(net, ipv6, struct inet6_dev, idev, field, val)
 #define ICMP6_INC_STATS(net, idev, field)	\
-		_DEVINCATOMIC(net, icmpv6, , idev, field)
+		_DEVINCATOMIC(net, icmpv6, struct inet6_dev, idev, field)
 #define __ICMP6_INC_STATS(net, idev, field)	\
-		_DEVINCATOMIC(net, icmpv6, __, idev, field)
+		___DEVINCATOMIC(net, icmpv6, struct inet6_dev, idev, field)
 
 #define ICMP6MSGOUT_INC_STATS(net, idev, field)		\
-	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, idev, field +256)
+	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, struct inet6_dev, idev, field+256)
 #define ICMP6MSGIN_INC_STATS(net, idev, field)	\
-	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, idev, field)
+	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, struct inet6_dev, idev, field)
 
 struct ip6_ra_chain {
 	struct ip6_ra_chain	*next;
diff --git a/include/net/netns/mib.h b/include/net/netns/mib.h
index 830bdf3..798bbc2 100644
--- a/include/net/netns/mib.h
+++ b/include/net/netns/mib.h
@@ -5,6 +5,9 @@
 #include <net/snmp.h>
 
 struct netns_mib {
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct proc_dir_entry *proc_net_devsnmp;
+#endif
 	DEFINE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 	DEFINE_SNMP_STAT(struct ipstats_mib, ip_statistics);
 	DEFINE_SNMP_STAT(struct linux_mib, net_statistics);
diff --git a/include/net/snmp.h b/include/net/snmp.h
index c9228ad..56e1f98 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -61,14 +61,26 @@ struct ipstats_mib {
 
 /* ICMP */
 #define ICMP_MIB_MAX	__ICMP_MIB_MAX
+/* per network ns counters */
 struct icmp_mib {
 	unsigned long	mibs[ICMP_MIB_MAX];
 };
 
 #define ICMPMSG_MIB_MAX	__ICMPMSG_MIB_MAX
+/* per network ns counters */
 struct icmpmsg_mib {
 	atomic_long_t	mibs[ICMPMSG_MIB_MAX];
 };
+#ifdef CONFIG_IP_IFSTATS_TABLE
+/* per device counters, (shared on all cpus) */
+struct icmp_mib_device {
+	atomic_long_t	mibs[ICMP_MIB_MAX];
+};
+/* per device counters, (shared on all cpus) */
+struct icmpmsg_mib_device {
+	atomic_long_t	mibs[ICMPMSG_MIB_MAX];
+};
+#endif
 
 /* ICMP6 (IPv6-ICMP) */
 #define ICMP6_MIB_MAX	__ICMP6_MIB_MAX
@@ -198,4 +210,87 @@ struct linux_xfrm_mib {
 #define __SNMP_UPD_PO_STATS64(mib, basefield, addend) __SNMP_UPD_PO_STATS(mib, basefield, addend)
 #endif
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+/* Macros for enabling per device statistics */
+#define _DEVINC(net, statname, type, idev, field)			\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS((_idev)->stats.statname, (field));	\
+	SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+#define ___DEVINC(net, statname, type, idev, field)			\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		__SNMP_INC_STATS((_idev)->stats.statname, (field));	\
+	__SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+
+/* per device counters are atomic_long_t */
+#define _DEVINCATOMIC(net, statname, type, idev, field)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+#define ___DEVINCATOMIC(net, statname, type, idev, field)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	__SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+
+/* per device and per net counters are atomic_long_t */
+#define _DEVINC_ATOMIC_ATOMIC(net, statname, type, idev, field)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field));	\
+})
+
+#define _DEVADD(net, statname, type, idev, field, val)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_ADD_STATS((_idev)->stats.statname, (field), (val));	\
+	SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));	\
+})
+#define ___DEVADD(net, statname, type, idev, field, val)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		__SNMP_ADD_STATS((_idev)->stats.statname, (field), (val));	\
+	__SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));	\
+})
+
+#define _DEVUPD(net, statname, type, idev, field, val)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_UPD_PO_STATS((_idev)->stats.statname, field, (val));	\
+	SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));	\
+})
+#define ___DEVUPD(net, statname, type, idev, field, val)	\
+({	\
+	__typeof__(type) *_idev = (idev);	\
+	if (likely(_idev))	\
+		__SNMP_UPD_PO_STATS((_idev)->stats.statname, field, (val));	\
+	__SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));	\
+})
+#else
+#define _DEVINC(net, statname, type, idev, field) SNMP_INC_STATS((net)->mib.statname##_statistics, (field))
+#define ___DEVINC(net, statname, type, idev, field) __SNMP_INC_STATS((net)->mib.statname##_statistics, (field))
+#define _DEVINCATOMIC(net, statname, type, idev, field) SNMP_INC_STATS((net)->mib.statname##_statistics, (field))
+#define ___DEVINCATOMIC(net, statname, type, idev, field) __SNMP_INC_STATS((net)->mib.statname##_statistics, (field))
+#define _DEVINC_ATOMIC_ATOMIC(net, statname, type, idev, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field))
+#define _DEVADD(net, statname, type, idev, field, val)	SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val))
+#define ___DEVADD(net, statname, type, idev, field, val)	__SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val))
+#define _DEVUPD(net, statname, type, idev, field, val)	SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val))
+#define ___DEVUPD(net, statname, type, idev, field, val)	__SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val))
+#endif
+
 #endif
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 9b16eaf..f9576b7 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -218,13 +218,13 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	} else if (len < (iph->ihl*4))
 		goto inhdr_error;
 
 	if (pskb_trim_rcsum(skb, len)) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -237,9 +237,9 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 	return 0;
 
 csum_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INHDRERRORS);
 drop:
 	return -1;
 }
@@ -691,7 +691,7 @@ br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_FRAGFAILS);
 		kfree_skb(skb);
 		return -EMSGSIZE;
 	}
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index e65fcb4..11e0d4c 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -258,7 +258,7 @@ static void dccp_v4_err(struct sk_buff *skb, u32 info)
 				       iph->saddr, ntohs(dh->dccph_sport),
 				       inet_iif(skb), 0);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 
@@ -468,7 +468,7 @@ static struct dst_entry* dccp_v4_route_skb(struct net *net, struct sock *sk,
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_flow(net, &fl4, sk);
 	if (IS_ERR(rt)) {
-		IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+		IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 		return NULL;
 	}
 
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 80dad30..6470b95 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -52,6 +52,14 @@ config IP_ADVANCED_ROUTER
 
 	  If unsure, say N here.
 
+config IP_IFSTATS_TABLE
+	def_bool n
+	depends on IP_ADVANCED_ROUTER
+	prompt "IP: interface statistics"
+	help
+	  This option enables per interface statistics for IPv4. Refer to
+	  RFC 4293.
+
 config IP_FIB_TRIE_STATS
 	bool "FIB TRIE statistics"
 	depends on IP_ADVANCED_ROUTER
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eaed036..c99da7a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1938,6 +1938,11 @@ static int __init inet_init(void)
 		inet_register_protosw(q);
 
 	/*
+	 *	Init proc fs before initializing devinet
+	 */
+	ipv4_proc_init();
+
+	/*
 	 *	Set the ARP module up
 	 */
 
@@ -1984,8 +1989,6 @@ static int __init inet_init(void)
 	if (init_ipv4_mibs())
 		pr_crit("%s: Cannot init ipv4 mibs\n", __func__);
 
-	ipv4_proc_init();
-
 	ipfrag_init();
 
 	dev_add_pack(&ip_packet_type);
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..7acaadb 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -55,7 +55,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		goto out;
 	}
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 40f0017..7600e1e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -218,6 +218,49 @@ static void inet_free_ifa(struct in_ifaddr *ifa)
 	call_rcu(&ifa->rcu_head, inet_rcu_free_ifa);
 }
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static int snmp_alloc_dev(struct in_device *idev)
+{
+	int i;
+
+	idev->stats.ip = alloc_percpu(struct ipstats_mib);
+	if (!idev->stats.ip)
+		goto err_ip;
+
+	for_each_possible_cpu(i) {
+		struct ipstats_mib *addrconf_stats;
+
+		addrconf_stats = per_cpu_ptr(idev->stats.ip, i);
+		u64_stats_init(&addrconf_stats->syncp);
+	}
+
+	idev->stats.icmpdev = kzalloc(sizeof(*idev->stats.icmpdev),
+				      GFP_KERNEL);
+	if (!idev->stats.icmpdev)
+		goto err_icmp;
+	idev->stats.icmpmsgdev = kzalloc(sizeof(*idev->stats.icmpmsgdev),
+					 GFP_KERNEL);
+	if (!idev->stats.icmpmsgdev)
+		goto err_icmpmsg;
+
+	return 0;
+
+err_icmpmsg:
+	kfree(idev->stats.icmpdev);
+err_icmp:
+	free_percpu(idev->stats.ip);
+err_ip:
+	return -ENOMEM;
+}
+
+static void snmp_free_dev(struct in_device *idev)
+{
+	kfree(idev->stats.icmpmsgdev);
+	kfree(idev->stats.icmpdev);
+	free_percpu(idev->stats.ip);
+}
+#endif
+
 void in_dev_finish_destroy(struct in_device *idev)
 {
 	struct net_device *dev = idev->dev;
@@ -229,10 +272,14 @@ void in_dev_finish_destroy(struct in_device *idev)
 	pr_debug("%s: %p=%s\n", __func__, idev, dev ? dev->name : "NIL");
 #endif
 	dev_put(dev);
-	if (!idev->dead)
+	if (!idev->dead) {
 		pr_err("Freeing alive in_device %p\n", idev);
-	else
+	} else {
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		snmp_free_dev(idev);
+#endif
 		kfree(idev);
+	}
 }
 EXPORT_SYMBOL(in_dev_finish_destroy);
 
@@ -257,6 +304,25 @@ static struct in_device *inetdev_init(struct net_device *dev)
 		dev_disable_lro(dev);
 	/* Reference in_dev->dev */
 	dev_hold(dev);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	err = snmp_alloc_dev(in_dev);
+	if (err < 0) {
+		netdev_crit(dev,
+			    "%s(): cannot allocate memory for statistics; dev=%s err=%d.\n",
+			    __func__, dev->name, err);
+		neigh_parms_release(&arp_tbl, in_dev->arp_parms);
+		dev_put(dev);
+		kfree(in_dev);
+		return NULL;
+	}
+
+	err = snmp_register_dev(in_dev);
+	if (err < 0)
+		netdev_warn(dev,
+			    "%s(): cannot create /proc/net/dev_snmp/%s err=%d\n",
+			    __func__, dev->name, err);
+#endif
+
 	/* Account for reference dev->ip_ptr (below) */
 	refcount_set(&in_dev->refcnt, 1);
 
@@ -306,6 +372,9 @@ static void inetdev_destroy(struct in_device *in_dev)
 	}
 
 	RCU_INIT_POINTER(dev->ip_ptr, NULL);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	snmp_unregister_dev(in_dev);
+#endif
 
 	devinet_sysctl_unregister(in_dev);
 	neigh_parms_release(&arp_tbl, in_dev->arp_parms);
@@ -1529,8 +1598,20 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
 		 */
 		inetdev_changename(dev, in_dev);
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		snmp_unregister_dev(in_dev);
+#endif
 		devinet_sysctl_unregister(in_dev);
 		devinet_sysctl_register(in_dev);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		{
+			int err = snmp_register_dev(in_dev);
+
+			if (err)
+				return notifier_from_errno(err);
+		}
+#endif
+		/* CONFIG_EXTREME: End */
 		break;
 	}
 out:
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1617604..4d5c092 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -338,10 +338,10 @@ static bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt,
 /*
  *	Maintain the counters used in the SNMP statistics for outgoing ICMP
  */
-void icmp_out_count(struct net *net, unsigned char type)
+void icmp_out_count(struct net_device *dev, unsigned char type)
 {
-	ICMPMSGOUT_INC_STATS(net, type);
-	ICMP_INC_STATS(net, ICMP_MIB_OUTMSGS);
+	ICMPMSGOUT_INC_STATS(dev_net(dev), dev, type);
+	ICMP_INC_STATS(dev_net(dev), dev, ICMP_MIB_OUTMSGS);
 }
 
 /*
@@ -370,13 +370,14 @@ static void icmp_push_reply(struct icmp_bxm *icmp_param,
 {
 	struct sock *sk;
 	struct sk_buff *skb;
+	struct net_device *dev = (*rt)->dst.dev;
 
-	sk = icmp_sk(dev_net((*rt)->dst.dev));
+	sk = icmp_sk(dev_net(dev));
 	if (ip_append_data(sk, fl4, icmp_glue_bits, icmp_param,
 			   icmp_param->data_len+icmp_param->head_len,
 			   icmp_param->head_len,
 			   ipc, rt, MSG_DONTWAIT) < 0) {
-		__ICMP_INC_STATS(sock_net(sk), ICMP_MIB_OUTERRORS);
+		__ICMP_INC_STATS(sock_net(sk), dev, ICMP_MIB_OUTERRORS);
 		ip_flush_pending_frames(sk);
 	} else if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
 		struct icmphdr *icmph = icmp_hdr(skb);
@@ -760,7 +761,7 @@ static void icmp_socket_deliver(struct sk_buff *skb, u32 info)
 	 * avoid additional coding at protocol handlers.
 	 */
 	if (!pskb_may_pull(skb, iph->ihl * 4 + 8)) {
-		__ICMP_INC_STATS(dev_net(skb->dev), ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(dev_net(skb->dev), skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 
@@ -792,8 +793,9 @@ static bool icmp_unreach(struct sk_buff *skb)
 	struct icmphdr *icmph;
 	struct net *net;
 	u32 info = 0;
+	struct net_device *dev = skb_dst(skb)->dev;
 
-	net = dev_net(skb_dst(skb)->dev);
+	net = dev_net(dev);
 
 	/*
 	 *	Incomplete header ?
@@ -852,7 +854,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 		info = ntohl(icmph->un.gateway) >> 24;
 		break;
 	case ICMP_TIME_EXCEEDED:
-		__ICMP_INC_STATS(net, ICMP_MIB_INTIMEEXCDS);
+		__ICMP_INC_STATS(net, dev, ICMP_MIB_INTIMEEXCDS);
 		if (icmph->code == ICMP_EXC_FRAGTIME)
 			goto out;
 		break;
@@ -890,7 +892,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 out:
 	return true;
 out_err:
-	__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(net, dev, ICMP_MIB_INERRORS);
 	return false;
 }
 
@@ -902,7 +904,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 static bool icmp_redirect(struct sk_buff *skb)
 {
 	if (skb->len < sizeof(struct iphdr)) {
-		__ICMP_INC_STATS(dev_net(skb->dev), ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(dev_net(skb->dev), skb->dev, ICMP_MIB_INERRORS);
 		return false;
 	}
 
@@ -982,7 +984,7 @@ static bool icmp_timestamp(struct sk_buff *skb)
 	return true;
 
 out_err:
-	__ICMP_INC_STATS(dev_net(skb_dst(skb)->dev), ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(dev_net(skb_dst(skb)->dev), skb_dst(skb)->dev, ICMP_MIB_INERRORS);
 	return false;
 }
 
@@ -1022,7 +1024,7 @@ int icmp_rcv(struct sk_buff *skb)
 		skb_set_network_header(skb, nh);
 	}
 
-	__ICMP_INC_STATS(net, ICMP_MIB_INMSGS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INMSGS);
 
 	if (skb_checksum_simple_validate(skb))
 		goto csum_error;
@@ -1032,7 +1034,7 @@ int icmp_rcv(struct sk_buff *skb)
 
 	icmph = icmp_hdr(skb);
 
-	ICMPMSGIN_INC_STATS(net, icmph->type);
+	ICMPMSGIN_INC_STATS(net, skb->dev, icmph->type);
 	/*
 	 *	18 is the highest 'known' ICMP type. Anything else is a mystery
 	 *
@@ -1078,9 +1080,9 @@ int icmp_rcv(struct sk_buff *skb)
 	kfree_skb(skb);
 	return NET_RX_DROP;
 csum_error:
-	__ICMP_INC_STATS(net, ICMP_MIB_CSUMERRORS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_CSUMERRORS);
 error:
-	__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 	goto drop;
 }
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 881ac6d..30afff4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -539,6 +539,7 @@ struct dst_entry *inet_csk_route_req(const struct sock *sk,
 	struct net *net = read_pnet(&ireq->ireq_net);
 	struct ip_options_rcu *opt;
 	struct rtable *rt;
+	struct net_device *dev = NULL;
 
 	opt = ireq_opt_deref(ireq);
 
@@ -557,9 +558,10 @@ struct dst_entry *inet_csk_route_req(const struct sock *sk,
 	return &rt->dst;
 
 route_err:
+	dev = rt->dst.dev;
 	ip_rt_put(rt);
 no_route:
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_OUTNOROUTES);
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(inet_csk_route_req);
@@ -574,6 +576,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 	struct ip_options_rcu *opt;
 	struct flowi4 *fl4;
 	struct rtable *rt;
+	struct net_device *dev = NULL;
 
 	opt = rcu_dereference(ireq->ireq_opt);
 	fl4 = &newinet->cork.fl.u.ip4;
@@ -593,9 +596,10 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 	return &rt->dst;
 
 route_err:
+	dev = rt->dst.dev;
 	ip_rt_put(rt);
 no_route:
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_OUTNOROUTES);
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index b54b948..bb9be11 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -66,8 +66,8 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 {
 	struct ip_options *opt	= &(IPCB(skb)->opt);
 
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
-	__IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
+	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	__IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -121,7 +121,7 @@ int ip_forward(struct sk_buff *skb)
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
 	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
 	if (ip_exceeds_mtu(skb, mtu)) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		goto drop;
@@ -158,7 +158,7 @@ int ip_forward(struct sk_buff *skb)
 
 too_many_hops:
 	/* Tell the sender its packet died... */
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);
 	icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 drop:
 	kfree_skb(skb);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 994fa70..40aadd4 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,6 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	const struct iphdr *iph;
+	struct net_device *dev;
 	struct sk_buff *head;
 	struct net *net;
 	struct ipq *qp;
@@ -151,16 +152,17 @@ static void ip_expire(struct timer_list *t)
 		goto out;
 
 	ipq_kill(qp);
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	dev = dev_get_by_index_rcu(net, qp->iif);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 
 	head = qp->q.fragments;
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMTIMEOUT);
 
 	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
 		goto out;
 
-	head->dev = dev_get_by_index_rcu(net, qp->iif);
+	head->dev = dev;
 	if (!head->dev)
 		goto out;
 
@@ -237,7 +239,9 @@ static int ip_frag_too_far(struct ipq *qp)
 		struct net *net;
 
 		net = container_of(qp->q.net, struct net, ipv4.frags);
-		__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+		rcu_read_lock();
+		__IP_INC_STATS(net, dev_get_by_index_rcu(net, qp->iif), IPSTATS_MIB_REASMFAILS);
+		rcu_read_unlock();
 	}
 
 	return rc;
@@ -582,7 +586,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 
 	ip_send_check(iph);
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;
 	return 0;
@@ -594,7 +598,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 out_oversize:
 	net_info_ratelimited("Oversized IP packet from %pI4\n", &qp->q.key.v4.saddr);
 out_fail:
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 	return err;
 }
 
@@ -605,7 +609,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 	int vif = l3mdev_master_ifindex_rcu(dev);
 	struct ipq *qp;
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMREQDS);
 	skb_orphan(skb);
 
 	/* Lookup (or create) queue header */
@@ -622,7 +626,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 		return ret;
 	}
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 	kfree_skb(skb);
 	return -ENOMEM;
 }
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7582713..1143fe2 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -197,6 +197,7 @@ static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_b
 		int protocol = ip_hdr(skb)->protocol;
 		const struct net_protocol *ipprot;
 		int raw;
+		struct net_device *dev = skb->dev;
 
 	resubmit:
 		raw = raw_local_deliver(skb, protocol);
@@ -217,17 +218,17 @@ static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_b
 				protocol = -ret;
 				goto resubmit;
 			}
-			__IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+			__IP_INC_STATS(net, dev, IPSTATS_MIB_INDELIVERS);
 		} else {
 			if (!raw) {
 				if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-					__IP_INC_STATS(net, IPSTATS_MIB_INUNKNOWNPROTOS);
+					__IP_INC_STATS(net, dev, IPSTATS_MIB_INUNKNOWNPROTOS);
 					icmp_send(skb, ICMP_DEST_UNREACH,
 						  ICMP_PROT_UNREACH, 0);
 				}
 				kfree_skb(skb);
 			} else {
-				__IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+				__IP_INC_STATS(net, dev, IPSTATS_MIB_INDELIVERS);
 				consume_skb(skb);
 			}
 		}
@@ -272,7 +273,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 					      --ANK (980813)
 	*/
 	if (skb_cow(skb, skb_headroom(skb))) {
-		__IP_INC_STATS(dev_net(dev), IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(dev_net(dev), dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -281,7 +282,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
 
 	if (ip_options_compile(dev_net(dev), opt, skb)) {
-		__IP_INC_STATS(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
+		__IP_INC_STATS(dev_net(dev), dev, IPSTATS_MIB_INHDRERRORS);
 		goto drop;
 	}
 
@@ -366,9 +367,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 
 	rt = skb_rtable(skb);
 	if (rt->rt_type == RTN_MULTICAST) {
-		__IP_UPD_PO_STATS(net, IPSTATS_MIB_INMCAST, skb->len);
+		__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_INMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST) {
-		__IP_UPD_PO_STATS(net, IPSTATS_MIB_INBCAST, skb->len);
+		__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_INBCAST, skb->len);
 	} else if (skb->pkt_type == PACKET_BROADCAST ||
 		   skb->pkt_type == PACKET_MULTICAST) {
 		struct in_device *in_dev = __in_dev_get_rcu(dev);
@@ -422,11 +423,11 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 
 	net = dev_net(dev);
-	__IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
+	__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
 	if (!skb) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INDISCARDS);
 		goto out;
 	}
 
@@ -452,7 +453,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	BUILD_BUG_ON(IPSTATS_MIB_ECT1PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_1);
 	BUILD_BUG_ON(IPSTATS_MIB_ECT0PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_0);
 	BUILD_BUG_ON(IPSTATS_MIB_CEPKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_CE);
-	__IP_ADD_STATS(net,
+	__IP_ADD_STATS(net, dev,
 		       IPSTATS_MIB_NOECTPKTS + (iph->tos & INET_ECN_MASK),
 		       max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
 
@@ -466,7 +467,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	} else if (len < (iph->ihl*4))
 		goto inhdr_error;
@@ -476,7 +477,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	 * Note this now means skb->len holds ntohs(iph->tot_len).
 	 */
 	if (pskb_trim_rcsum(skb, len)) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -494,9 +495,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 
 csum_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_INHDRERRORS);
 drop:
 	kfree_skb(skb);
 out:
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 94cacae..a0cf93d 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -191,9 +191,9 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	u32 nexthop;
 
 	if (rt->rt_type == RTN_MULTICAST) {
-		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len);
+		IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUTMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST)
-		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
+		IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUTBCAST, skb->len);
 
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
@@ -339,7 +339,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	/*
 	 *	If the indicated interface is up and running, send the packet.
 	 */
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
@@ -397,7 +397,7 @@ int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
@@ -507,7 +507,7 @@ int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
 
 no_route:
 	rcu_read_unlock();
-	IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EHOSTUNREACH;
 }
@@ -548,7 +548,7 @@ static int ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	if (unlikely(!skb->ignore_df ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, skb->dev, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		kfree_skb(skb);
@@ -575,8 +575,11 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	int offset;
 	__be16 not_last_frag;
 	struct rtable *rt = skb_rtable(skb);
+	struct net_device *dev = rt->dst.dev;
 	int err = 0;
 
+	dev_hold(dev);
+
 	/* for offloaded checksums cleanup checksum before fragmentation */
 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
 	    (err = skb_checksum_help(skb)))
@@ -675,7 +678,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			err = output(net, sk, skb);
 
 			if (!err)
-				IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
+				IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGCREATES);
 			if (err || !frag)
 				break;
 
@@ -685,7 +688,8 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		}
 
 		if (err == 0) {
-			IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
+			IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGOKS);
+			dev_put(dev);
 			return 0;
 		}
 
@@ -694,7 +698,8 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			kfree_skb(frag);
 			frag = skb;
 		}
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
+		dev_put(dev);
 		return err;
 
 slow_path_clean:
@@ -811,15 +816,17 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		if (err)
 			goto fail;
 
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGCREATES);
 	}
 	consume_skb(skb);
-	IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
+	IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGOKS);
+	dev_put(dev);
 	return err;
 
 fail:
 	kfree_skb(skb);
-	IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+	IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
+	dev_put(dev);
 	return err;
 }
 EXPORT_SYMBOL(ip_do_fragment);
@@ -1097,7 +1104,7 @@ static int __ip_append_data(struct sock *sk,
 	err = -EFAULT;
 error:
 	cork->length -= length;
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(sock_net(sk), rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
 	return err;
 }
@@ -1305,7 +1312,7 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 
 error:
 	cork->length -= size;
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(sock_net(sk), rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	return err;
 }
 
@@ -1406,7 +1413,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 	skb_dst_set(skb, &rt->dst);
 
 	if (iph->protocol == IPPROTO_ICMP)
-		icmp_out_count(net, ((struct icmphdr *)
+		icmp_out_count(skb_dst(skb)->dev, ((struct icmphdr *)
 			skb_transport_header(skb))->type);
 
 	ip_cork_release(cork);
@@ -1417,13 +1424,14 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 int ip_send_skb(struct net *net, struct sk_buff *skb)
 {
 	int err;
+	struct net_device *dev = skb_dst(skb)->dev;
 
 	err = ip_local_out(net, skb->sk, skb);
 	if (err) {
 		if (err > 0)
 			err = net_xmit_errno(err);
 		if (err)
-			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+			IP_INC_STATS(net, dev, IPSTATS_MIB_OUTDISCARDS);
 	}
 
 	return err;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 2fb4de3..67ec987 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1778,8 +1778,8 @@ static inline int ipmr_forward_finish(struct net *net, struct sock *sk,
 {
 	struct ip_options *opt = &(IPCB(skb)->opt);
 
-	IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
-	IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
+	IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -1862,7 +1862,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 		 * allow to send ICMP, so that packets will disappear
 		 * to blackhole.
 		 */
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
 		ip_rt_put(rt);
 		goto out_free;
 	}
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 05e47d7..1b4db11 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -712,6 +712,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	__be32 saddr, daddr, faddr;
 	u8  tos;
 	int err;
+	struct net_device *dev;
 
 	pr_debug("ping_v4_sendmsg(sk=%p,sk->num=%u)\n", inet, inet->inet_num);
 
@@ -805,7 +806,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		err = PTR_ERR(rt);
 		rt = NULL;
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 		goto out;
 	}
 
@@ -841,13 +842,17 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	release_sock(sk);
 
 out:
+	dev = rt->dst.dev;
+	dev_hold(dev);
 	ip_rt_put(rt);
 	if (free)
 		kfree(ipc.opt);
 	if (!err) {
-		icmp_out_count(sock_net(sk), user_icmph.type);
+		icmp_out_count(dev, user_icmph.type);
+		dev_put(dev);
 		return len;
 	}
+	dev_put(dev);
 	return err;
 
 do_confirm:
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index a058de6..c0c822f 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -134,6 +134,17 @@ static const struct snmp_mib snmp4_ipextstats_list[] = {
 	SNMP_MIB_SENTINEL
 };
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static const struct snmp_mib snmp4_icmp_list[] = {
+	SNMP_MIB_ITEM("InMsgs", ICMP_MIB_INMSGS),
+	SNMP_MIB_ITEM("InErrors", ICMP_MIB_INERRORS),
+	SNMP_MIB_ITEM("OutMsgs", ICMP_MIB_OUTMSGS),
+	SNMP_MIB_ITEM("OutErrors", ICMP_MIB_OUTERRORS),
+	SNMP_MIB_ITEM("InCsumErrors", ICMP_MIB_CSUMERRORS),
+	SNMP_MIB_SENTINEL
+};
+#endif
+
 static const struct {
 	const char *name;
 	int index;
@@ -473,6 +484,109 @@ static const struct file_operations snmp_seq_fops = {
 };
 
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static void snmp_seq_show_item(struct seq_file *seq, void __percpu **pcpumib,
+			       atomic_long_t *smib,
+			       const struct snmp_mib *itemlist,
+			       char *prefix)
+{
+	char name[32];
+	int i;
+	unsigned long val;
+
+	for (i = 0; itemlist[i].name; i++) {
+		val = pcpumib ?
+			snmp_fold_field64(pcpumib, itemlist[i].entry,
+					  offsetof(struct ipstats_mib, syncp)) :
+			atomic_long_read(smib + itemlist[i].entry);
+		snprintf(name, sizeof(name), "%s%s",
+			 prefix, itemlist[i].name);
+		seq_printf(seq, "%-32s\t%lu\n", name, val);
+	}
+}
+
+static void snmp_seq_show_icmpmsg(struct seq_file *seq, atomic_long_t *smib)
+{
+	char name[32];
+	int i;
+	unsigned long val;
+
+	for (i = 0; i < ICMPMSG_MIB_MAX; i++) {
+		val = atomic_long_read(smib + i);
+		if (val) {
+			snprintf(name, sizeof(name), "Icmp%sType%u",
+				 i & 0x100 ? "Out" : "In", i & 0xff);
+			seq_printf(seq, "%-32s\t%lu\n", name, val);
+		}
+	}
+}
+
+static int snmp_dev_seq_show(struct seq_file *seq, void *v)
+{
+	struct in_device *idev = (struct in_device *)seq->private;
+
+	seq_printf(seq, "%-32s\t%u\n", "ifIndex", idev->dev->ifindex);
+
+	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
+
+	snmp_seq_show_item(seq, (void __percpu **)idev->stats.ip, NULL,
+			   snmp4_ipstats_list, "Ip");
+	snmp_seq_show_item(seq, (void __percpu **)idev->stats.ip, NULL,
+			   snmp4_ipextstats_list, "Ip");
+	snmp_seq_show_item(seq, NULL, idev->stats.icmpdev->mibs,
+			   snmp4_icmp_list, "Icmp");
+	snmp_seq_show_icmpmsg(seq, idev->stats.icmpmsgdev->mibs);
+	return 0;
+}
+
+static int snmp_dev_seq_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, snmp_dev_seq_show, PDE_DATA(inode));
+}
+
+static const struct file_operations snmp_dev_seq_fops = {
+	.owner   = THIS_MODULE,
+	.open    = snmp_dev_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = single_release,
+};
+
+int snmp_register_dev(struct in_device *idev)
+{
+	struct proc_dir_entry *p;
+	struct net *net;
+
+	if (!idev || !idev->dev)
+		return -EINVAL;
+
+	net = dev_net(idev->dev);
+	if (!net->mib.proc_net_devsnmp)
+		return -ENOENT;
+
+	p = proc_create_data(idev->dev->name, 0444,
+			     net->mib.proc_net_devsnmp,
+			     &snmp_dev_seq_fops, idev);
+	if (!p)
+		return -ENOMEM;
+
+	idev->stats.proc_dir_entry = p;
+	return 0;
+}
+
+int snmp_unregister_dev(struct in_device *idev)
+{
+	struct net *net = dev_net(idev->dev);
+
+	if (!net->mib.proc_net_devsnmp)
+		return -ENOENT;
+	if (!idev->stats.proc_dir_entry)
+		return -EINVAL;
+	proc_remove(idev->stats.proc_dir_entry);
+	idev->stats.proc_dir_entry = NULL;
+	return 0;
+}
+#endif
 
 /*
  *	Output /proc/net/netstat
@@ -528,9 +642,18 @@ static __net_init int ip_proc_init_net(struct net *net)
 		goto out_netstat;
 	if (!proc_create("snmp", 0444, net->proc_net, &snmp_seq_fops))
 		goto out_snmp;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	net->mib.proc_net_devsnmp = proc_mkdir("dev_snmp", net->proc_net);
+	if (!net->mib.proc_net_devsnmp)
+		goto out_dev_snmp;
+#endif
 
 	return 0;
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+out_dev_snmp:
+	remove_proc_entry("snmp", net->proc_net);
+#endif
 out_snmp:
 	remove_proc_entry("netstat", net->proc_net);
 out_netstat:
@@ -544,6 +667,9 @@ static __net_exit void ip_proc_exit_net(struct net *net)
 	remove_proc_entry("snmp", net->proc_net);
 	remove_proc_entry("netstat", net->proc_net);
 	remove_proc_entry("sockstat", net->proc_net);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	remove_proc_entry("dev_snmp", net->proc_net);
+#endif
 }
 
 static __net_initdata struct pernet_operations ip_proc_ops = {
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 1b4d335..f39f87a 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -425,7 +425,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 		skb->transport_header += iphlen;
 		if (iph->protocol == IPPROTO_ICMP &&
 		    length >= iphlen + sizeof(struct icmphdr))
-			icmp_out_count(net, ((struct icmphdr *)
+			icmp_out_count(rt->dst.dev, ((struct icmphdr *)
 				skb_transport_header(skb))->type);
 	}
 
@@ -442,7 +442,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 error_free:
 	kfree_skb(skb);
 error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	if (err == -ENOBUFS && !inet->recverr)
 		err = 0;
 	return err;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8322e47..5115f0d3 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -961,11 +961,11 @@ static int ip_error(struct sk_buff *skb)
 	if (!IN_DEV_FORWARD(in_dev)) {
 		switch (rt->dst.error) {
 		case EHOSTUNREACH:
-			__IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS);
+			__IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_INADDRERRORS);
 			break;
 
 		case ENETUNREACH:
-			__IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
+			__IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_INNOROUTES);
 			break;
 		}
 		goto out;
@@ -980,7 +980,7 @@ static int ip_error(struct sk_buff *skb)
 		break;
 	case ENETUNREACH:
 		code = ICMP_NET_UNREACH;
-		__IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
+		__IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_INNOROUTES);
 		break;
 	case EACCES:
 		code = ICMP_PKT_FILTERED;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b..df1b989 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -194,7 +194,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		return err;
 	}
 
@@ -402,7 +402,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 				       th->dest, iph->saddr, ntohs(th->source),
 				       inet_iif(icmp_skb), 0);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, icmp_skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 	if (sk->sk_state == TCP_TIME_WAIT) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 24b5c59..a8b6567 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -609,7 +609,7 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 			       iph->saddr, uh->source, skb->dev->ifindex, 0,
 			       udptable, NULL);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;	/* No socket for error */
 	}
 
@@ -1008,7 +1008,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			err = PTR_ERR(rt);
 			rt = NULL;
 			if (err == -ENETUNREACH)
-				IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+				IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 			goto out;
 		}
 
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index a9c05b2..b52b2e3 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -381,7 +381,7 @@ static int l2tp_ip_backlog_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 
 drop:
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_INDISCARDS);
+	IP_INC_STATS(sock_net(sk), skb->dev, IPSTATS_MIB_INDISCARDS);
 	kfree_skb(skb);
 	return 0;
 }
@@ -504,7 +504,7 @@ static int l2tp_ip_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 no_route:
 	rcu_read_unlock();
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+	IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	rc = -EHOSTUNREACH;
 	goto out;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 9573691..cf2172d 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -462,7 +462,7 @@ static int l2tp_ip6_backlog_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 
 drop:
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_INDISCARDS);
+	IP_INC_STATS(sock_net(sk), skb->dev, IPSTATS_MIB_INDISCARDS);
 	kfree_skb(skb);
 	return -1;
 }
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 7a4de6d..c629239 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -141,7 +141,7 @@ void mpls_stats_inc_outucastpkts(struct net_device *dev,
 					   tx_packets,
 					   tx_bytes);
 	} else if (skb->protocol == htons(ETH_P_IP)) {
-		IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+		IP_UPD_PO_STATS(dev_net(dev), dev, IPSTATS_MIB_OUT, skb->len);
 #if IS_ENABLED(CONFIG_IPV6)
 	} else if (skb->protocol == htons(ETH_P_IPV6)) {
 		struct inet6_dev *in6dev = __in6_dev_get(dev);
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 4527921..32bd3af 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 	{
 		if (ip_hdr(skb)->ttl <= 1) {
 			/* Tell the sender its packet died... */
-			__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+			__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);
 			icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 			return false;
 		}
diff --git a/net/sctp/input.c b/net/sctp/input.c
index ba8a6e6..fef625a 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -596,7 +596,7 @@ void sctp_v4_err(struct sk_buff *skb, __u32 info)
 	skb->network_header = saveip;
 	skb->transport_header = savesctp;
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 	/* Warning:  The sock lock is held.  Remember to call
diff --git a/net/sctp/output.c b/net/sctp/output.c
index d6e1c90..a2a27fa 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -600,7 +600,7 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
 	/* drop packet if no dst */
 	dst = dst_clone(tp->dst);
 	if (!dst) {
-		IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+		IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		kfree_skb(head);
 		goto out;
 	}
-- 
2.7.4

^ permalink raw reply related

* 4.15.13 kernel panic, ip_rcv_finish, nf_xfrm_me_harder warnings continue to fill dmesg
From: Denys Fedoryshchenko @ 2018-04-11 15:51 UTC (permalink / raw)
  To: Linux Kernel Network Developers

Apr 11 18:01:34[99194.935520] general protection fault: 0000 [#1] SMP
Apr 11 18:01:34[99194.935998] Modules linked in: pppoe pppox ppp_generic 
slhc ip_set_hash_net xt_nat xt_string xt_connmark xt_TCPMSS xt_mark 
xt_CT xt_set xt_tcpudp ip_set_bitmap_port ip_set nfnetlink
iptable_raw iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 
netconsole configfs 8021q garp mrp stp llc ixgbe dca ipv6
Apr 11 18:01:34[99194.938313] CPU: 23 PID: 150 Comm: ksoftirqd/23 
Tainted: G        W        4.15.13-build-0135 #4
Apr 11 18:01:34[99194.939258] Hardware name: Intel Corporation 
S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
Apr 11 18:01:34[99194.940189] RIP: 0010:ip_rcv_finish+0x2b5/0x2e5
Apr 11 18:01:34[99194.940716] RSP: 0018:ffffc9000cad7cf8 EFLAGS: 
00010286
Apr 11 18:01:34[99194.941214] RAX: ffff00e2ffff476d RBX: 
ffff88178f944400 RCX: ffff88179a32a800
Apr 11 18:01:34[99194.941741] RDX: ffff88178f944400 RSI: 
0000000000000000 RDI: ffff88178f944400
Apr 11 18:01:34[99194.942234] RBP: ffff882fd580d000 R08: 
0000000000000001 R09: ffff882fd034ee00
Apr 11 18:01:34[99194.942771] R10: ffffc9000cad7b58 R11: 
00000000e92316b9 R12: ffff88179a32a8d6
Apr 11 18:01:34[99194.943286] R13: ffff882fd580d000 R14: 
0000000000ea0008 R15: ffff882fd580d078
Apr 11 18:01:34[99194.943821] FS:  0000000000000000(0000) 
GS:ffff88303fcc0000(0000) knlGS:0000000000000000
Apr 11 18:01:34[99194.944779] CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Apr 11 18:01:34[99194.945287] CR2: 00007f8bb37888f0 CR3: 
000000303e209003 CR4: 00000000001606e0
Apr 11 18:01:34[99194.945808] Call Trace:
Apr 11 18:01:34[99194.946307]  ip_rcv+0x2f2/0x325
Apr 11 18:01:34[99194.946816]  ? ip_local_deliver_finish+0x187/0x187
Apr 11 18:01:34[99194.947331]  __netif_receive_skb_core+0x81c/0x89c
Apr 11 18:01:34[99194.947872]  ? napi_complete_done+0xb4/0xba
Apr 11 18:01:34[99194.948391]  ? ixgbe_poll+0xf96/0x104d [ixgbe]
Apr 11 18:01:34[99194.948931]  ? process_backlog+0x8b/0x10d
Apr 11 18:01:34[99194.949424]  process_backlog+0x8b/0x10d
Apr 11 18:01:34[99194.949953]  net_rx_action+0x127/0x2b5
Apr 11 18:01:34[99194.950445]  __do_softirq+0xc1/0x1b1
Apr 11 18:01:34[99194.950951]  ? sort_range+0x17/0x17
Apr 11 18:01:34[99194.951442]  run_ksoftirqd+0x11/0x22
Apr 11 18:01:34[99194.951972]  smpboot_thread_fn+0x121/0x136
Apr 11 18:01:34[99194.952489]  kthread+0xfd/0x105
Apr 11 18:01:34[99194.953018]  ? kthread_create_on_node+0x3a/0x3a
Apr 11 18:01:34[99194.953528]  ret_from_fork+0x1f/0x30
Apr 11 18:01:34[99194.954047] Code: 15 77 9e 99 00 83 7a 7c 00 75 37 83 
b8 2c 01 00 00 00 75 2e 48 8b 43 58 48 89 df 5b 5d 48 83 e0 fe 41 5c 41 
5d 41 5e 48 8b 40 50 <ff> e0 83 f8 ee 75 10 49 8b 84 24
90 01 00 00 65 48 ff 80 40 02
Apr 11 18:01:34[99194.955449] RIP: ip_rcv_finish+0x2b5/0x2e5 RSP: 
ffffc9000cad7cf8
Apr 11 18:01:34[99194.956008] ---[ end trace 312b0bf537b4709a ]---
Apr 11 18:01:34[99195.007900] Kernel panic - not syncing: Fatal 
exception in interrupt
Apr 11 18:01:34[99195.008400] Kernel Offset: disabled
Apr 11 18:01:34[99195.013950] Rebooting in 5 seconds..
--------------


and i reported before about warnings in nf_frm_me_harder, but probably 
nobody have interest to take a look, and it is seems plaguing 4.15.x and 
nearby versions kernels . Here is one of such warnings.

-----------------------
Apr 11 00:00:17[34320.802349] dst_release: dst:00000000b32dca17 
refcnt:-2
Apr 11 00:00:19[34323.018468] WARNING: CPU: 7 PID: 0 at 
./include/net/dst.h:256 nf_xfrm_me_harder+0x62/0xfe [nf_nat]
Apr 11 00:00:19[34323.019357] Modules linked in: pppoe pppox ppp_generic 
slhc ip_set_hash_net xt_nat xt_string xt_connmark xt_TCPMSS xt_mark 
xt_CT xt_set xt_tcpudp ip_set_bitmap_port ip_set nfnetlink
iptable_raw iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip_tables x_tables 
netconsole configfs 8021q garp mrp stp llc ixgbe dca ipv6
Apr 11 00:00:19[34323.021503] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G   
      W        4.15.13-build-0135 #4
Apr 11 00:00:19[34323.022380] Hardware name: Intel Corporation 
S2600GZ/S2600GZ, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
Apr 11 00:00:19[34323.023261] RIP: 0010:nf_xfrm_me_harder+0x62/0xfe 
[nf_nat]
Apr 11 00:00:19[34323.023737] RSP: 0018:ffff88303fa43c90 EFLAGS: 
00010246
Apr 11 00:00:19[34323.024218] RAX: 0000000000000000 RBX: 
ffff8817b2c35200 RCX: 0000000000000000
Apr 11 00:00:19[34323.024703] RDX: 0000000000000002 RSI: 
ffff88178fab3700 RDI: ffff88303fa43cd0
Apr 11 00:00:19[34323.025214] RBP: ffffffff822a6180 R08: 
0000000000000005 R09: 0000000000000001
Apr 11 00:00:19[34323.025717] R10: 00000000000000d6 R11: 
ffff8817c945bca0 R12: 0000000000000001
Apr 11 00:00:19[34323.026214] R13: ffff88303fa43d60 R14: 
0000000000ce0008 R15: ffff8817b7477078
Apr 11 00:00:19[34323.026736] FS:  0000000000000000(0000) 
GS:ffff88303fa40000(0000) knlGS:0000000000000000
Apr 11 00:00:19[34323.027680] CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Apr 11 00:00:19[34323.028209] CR2: 00007f3557e13490 CR3: 
000000303e209002 CR4: 00000000001606e0
Apr 11 00:00:19[34323.028709] Call Trace:
Apr 11 00:00:19[34323.029181]  <IRQ>
Apr 11 00:00:19[34323.029670]  nf_nat_ipv4_out+0xa7/0xbb [nf_nat_ipv4]
Apr 11 00:00:19[34323.030161]  nf_hook_slow+0x31/0x90
Apr 11 00:00:19[34323.030681]  ip_output+0x93/0xaf
Apr 11 00:00:19[34323.031158]  ? ip_fragment.constprop.5+0x6e/0x6e
Apr 11 00:00:19[34323.031668]  ip_forward+0x327/0x36a
Apr 11 00:00:19[34323.032173]  ? ip_frag_mem+0x7/0x7
Apr 11 00:00:19[34323.032689]  ip_rcv+0x2f2/0x325
Apr 11 00:00:19[34323.033179]  ? ip_local_deliver_finish+0x187/0x187
Apr 11 00:00:19[34323.033680]  __netif_receive_skb_core+0x81c/0x89c
Apr 11 00:00:19[34323.034179]  ? process_backlog+0x8b/0x10d
Apr 11 00:00:19[34323.034701]  process_backlog+0x8b/0x10d
Apr 11 00:00:19[34323.035177]  net_rx_action+0x127/0x2b5
Apr 11 00:00:19[34323.035694]  __do_softirq+0xc1/0x1b1
Apr 11 00:00:19[34323.036202]  irq_exit+0x49/0x88
Apr 11 00:00:19[34323.036728]  call_function_single_interrupt+0x77/0x80
Apr 11 00:00:19[34323.037213]  </IRQ>
Apr 11 00:00:19[34323.037698] RIP: 0010:mwait_idle+0x52/0x64
Apr 11 00:00:19[34323.038273] RSP: 0018:ffffc9000c607f00 EFLAGS: 
00000246 ORIG_RAX: ffffffffffffff04
Apr 11 00:00:19[34323.039183] RAX: 0000000000000000 RBX: 
0000000000000000 RCX: 0000000000000000
Apr 11 00:00:19[34323.039701] RDX: 0000000000000000 RSI: 
0000000000000000 RDI: 0000000000000000
Apr 11 00:00:19[34323.040211] RBP: 0000000000000000 R08: 
012105aacb441779 R09: ffffffff82244640
Apr 11 00:00:19[34323.040742] R10: ffffc9000c607e90 R11: 
0000000000000046 R12: 0000000000000007
Apr 11 00:00:19[34323.041236] R13: 0000000000000000 R14: 
0000000000000000 R15: 0000000000000000
Apr 11 00:00:19[34323.041777]  do_idle+0xae/0x137
Apr 11 00:00:19[34323.042292]  cpu_startup_entry+0x57/0x59
Apr 11 00:00:19[34323.042814]  start_secondary+0x151/0x154
Apr 11 00:00:19[34323.043305]  secondary_startup_64+0xa5/0xb0
Apr 11 00:00:19[34323.043819] Code: 48 83 e6 fe 48 83 7e 48 00 74 07 48 
8b b6 80 01 00 00 8b 86 80 00 00 00 85 c0 74 0f 8d 50 01 f0 0f b1 96 80 
00 00 00 74 04 eb ed <0f> 0b 48 8b 4b 18 48 8d 54 24 08
45 31 c0 48 89 ef e8 c5 ff 83
Apr 11 00:00:19[34323.045208] ---[ end trace 312b0bf537b46b33 ]---
Apr 11 00:00:19[34323.045739] dst_release: dst:00000000156ec4ca 
refcnt:-1

^ permalink raw reply

* Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Stephen Hemminger @ 2018-04-11 16:14 UTC (permalink / raw)
  To: Stephen Suryaputra; +Cc: netdev
In-Reply-To: <1523415335-17154-1-git-send-email-ssuryaextr@gmail.com>

On Tue, 10 Apr 2018 22:55:35 -0400
Stephen Suryaputra <ssuryaextr@gmail.com> wrote:

> This is enhanced from the proposed patch by Igor Maravic in 2011 to
> support per interface IPv4 stats. The enhancement is mainly adding a
> kernel configuration option CONFIG_IP_IFSTATS_TABLE.
> 
> Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>


Net-next is closed. http://vger.kernel.org/~davem/net-next.html

Also, these kind of statistics have a negative performance impact.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox