Netdev List
 help / color / mirror / Atom feed
* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Richard Guy Briggs @ 2018-04-21 14:34 UTC (permalink / raw)
  To: Paul Moore
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <CAHC9VhTyvxxj2e2Gn+iyW6iLLeYB7hp8a+JvfeMmJ2nUPqtEaw@mail.gmail.com>

On 2018-04-18 19:47, Paul Moore wrote:
> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
> > Implement the proc fs write to set the audit container ID of a process,
> > emitting an AUDIT_CONTAINER record to document the event.
> >
> > This is a write from the container orchestrator task to a proc entry of
> > the form /proc/PID/containerid where PID is the process ID of the newly
> > created task that is to become the first task in a container, or an
> > additional task added to a container.
> >
> > The write expects up to a u64 value (unset: 18446744073709551615).
> >
> > This will produce a record such as this:
> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> >
> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
> > the orchestrator while the "opid" field is the object's PID, the process
> > being "contained".  Old and new container ID values are given in the
> > "contid" fields, while res indicates its success.
> >
> > It is not permitted to self-set, unset or re-set the container ID.  A
> > child inherits its parent's container ID, but then can be set only once
> > after.
> >
> > See: https://github.com/linux-audit/audit-kernel/issues/32
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> >  fs/proc/base.c             | 37 ++++++++++++++++++++
> >  include/linux/audit.h      | 16 +++++++++
> >  include/linux/init_task.h  |  4 ++-
> >  include/linux/sched.h      |  1 +
> >  include/uapi/linux/audit.h |  2 ++
> >  kernel/auditsc.c           | 84 ++++++++++++++++++++++++++++++++++++++++++++++
> >  6 files changed, 143 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/proc/base.c b/fs/proc/base.c
> > index 60316b5..6ce4fbe 100644
> > --- a/fs/proc/base.c
> > +++ b/fs/proc/base.c
> > @@ -1299,6 +1299,41 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
> >         .read           = proc_sessionid_read,
> >         .llseek         = generic_file_llseek,
> >  };
> > +
> > +static ssize_t proc_containerid_write(struct file *file, const char __user *buf,
> > +                                  size_t count, loff_t *ppos)
> > +{
> > +       struct inode *inode = file_inode(file);
> > +       u64 containerid;
> > +       int rv;
> > +       struct task_struct *task = get_proc_task(inode);
> > +
> > +       if (!task)
> > +               return -ESRCH;
> > +       if (*ppos != 0) {
> > +               /* No partial writes. */
> > +               put_task_struct(task);
> > +               return -EINVAL;
> > +       }
> > +
> > +       rv = kstrtou64_from_user(buf, count, 10, &containerid);
> > +       if (rv < 0) {
> > +               put_task_struct(task);
> > +               return rv;
> > +       }
> > +
> > +       rv = audit_set_containerid(task, containerid);
> > +       put_task_struct(task);
> > +       if (rv < 0)
> > +               return rv;
> > +       return count;
> > +}
> > +
> > +static const struct file_operations proc_containerid_operations = {
> > +       .write          = proc_containerid_write,
> > +       .llseek         = generic_file_llseek,
> > +};
> > +
> >  #endif
> >
> >  #ifdef CONFIG_FAULT_INJECTION
> > @@ -2961,6 +2996,7 @@ static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
> >  #ifdef CONFIG_AUDITSYSCALL
> >         REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
> >         REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> > +       REG("containerid", S_IWUSR, proc_containerid_operations),
> >  #endif
> >  #ifdef CONFIG_FAULT_INJECTION
> >         REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> > @@ -3355,6 +3391,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
> >  #ifdef CONFIG_AUDITSYSCALL
> >         REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
> >         REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> > +       REG("containerid", S_IWUSR, proc_containerid_operations),
> >  #endif
> >  #ifdef CONFIG_FAULT_INJECTION
> >         REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> > diff --git a/include/linux/audit.h b/include/linux/audit.h
> > index af410d9..fe4ba3f 100644
> > --- a/include/linux/audit.h
> > +++ b/include/linux/audit.h
> > @@ -29,6 +29,7 @@
> >
> >  #define AUDIT_INO_UNSET ((unsigned long)-1)
> >  #define AUDIT_DEV_UNSET ((dev_t)-1)
> > +#define INVALID_CID AUDIT_CID_UNSET
> 
> Why can't we just use AUDIT_CID_UNSET?  Is there an important
> distinction?  If so, they shouldn't they have different values?

One was intended as user-facing and the other was intended for kernel
internal.  As you point out, this does not appear to be necessary since
they are both the same type.  This was to mirror loginuid due to UID
namespace practice to seperate the two to make things very clear that a
userspace view of a UID needed to be translated from the user's user
namespace to the kernel's absolute view of UIDs from the init user
namespace.  Since container ID meanings do not depend on any namespace
context, I agree we can use just one and I'd go with AUDIT_CID_UNSET.

> If we do need to keep INVALID_CID, let's rename it to
> AUDIT_CID_INVALID so we have some consistency to the naming patterns
> and we stress that it is an *audit* container ID.
> 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d258826..1b82191 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -796,6 +796,7 @@ struct task_struct {
> >  #ifdef CONFIG_AUDITSYSCALL
> >         kuid_t                          loginuid;
> >         unsigned int                    sessionid;
> > +       u64                             containerid;
> 
> This one line addition to the task_struct scares me the most of
> anything in this patchset.  Why?  It's a field named "containerid" in
> a perhaps one of the most widely used core kernel structures; the
> possibilities for abuse are endless, and it's foolish to think we
> would ever be able to adequately police this.

Fair enough.

> Unfortunately, we can't add the field to audit_context as things
> currently stand because we don't always allocate an audit_context,
> it's dependent on the system's configuration, and we need to track the
> audit container ID for a given process, regardless of the audit
> configuration.  Pretty much the same reason why loginuid and sessionid
> are located directly in task_struct now.  As I stressed during the
> design phase, I really want to keep this as an *audit* container ID
> and not a general purpose kernel wide container ID.  If the kernel
> ever grows a general purpose container ID token, I'll be the first in
> line to convert the audit code, but I don't want audit to be that
> general purpose mechanism ... audit is hated enough as-is ;)

When would we need an audit container ID when audit is not enabled
enough to have an audit_context?

If it is only used for audit, and audit is the only consumer, and audit
can only use it when it is enabled, then we can just return success to
any write to the proc filehandle, or not even present it.  Nothing will
be able to know that value wasn't used.

When are loginuid and sessionid used now when audit is not enabled (or
should I say, explicitly disabled)?

> I think the right solution to this is to create another new struct,
> audit_task_info (or similar, the name really isn't that important),
> which would be stored as a pointer in task_struct and would replace
> the audit_context pointer, loginuid, sessionid, and the newly proposed
> containerid.  The new audit_task_info would always be allocated in the
> audit_alloc() function (please use kmem_cache), and the audit_context
> pointer included inside would continue to be allocated based on the
> existing conditions.  By keeping audit_task_info as a pointer inside
> task_struct we could hide the structure definition inside
> kernel/audit*.c and make it much more difficult for other subsystems
> to abuse it.[1]
> 
>   struct audit_task_info {
>     kuid_t loginuid;
>     unsigned int sessionid;
>     u64 containerid;
>     struct audit_context *ctx;
>   }

I agree this looks like a good change.

> Actually, we might even want to consider storing audit_context in
> audit_task_info (no pointer), or making it a zero length array
> (ctx[0]) and going with a variable sized allocation of audit_task_info
> ... but all that could be done as a follow up optimization once we get
> the basic idea sorted.
> 
> [1] If for some reason allocating audit_task_info becomes too much
> overhead to bear (somewhat doubtful since we would only do it at task
> creation), we could do some ugly tricks to directly include an
> audit_task_struct chunk in task_struct but I'd like to avoid that if
> possible (and I think we can).
> 
> >  #endif
> >         struct seccomp                  seccomp;
> 
> ...
> 
> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> > index 4e61a9e..921a71f 100644
> > --- a/include/uapi/linux/audit.h
> > +++ b/include/uapi/linux/audit.h
> > @@ -71,6 +71,7 @@
> >  #define AUDIT_TTY_SET          1017    /* Set TTY auditing status */
> >  #define AUDIT_SET_FEATURE      1018    /* Turn an audit feature on or off */
> >  #define AUDIT_GET_FEATURE      1019    /* Get which features are enabled */
> > +#define AUDIT_CONTAINER                1020    /* Define the container id and information */
> >
> >  #define AUDIT_FIRST_USER_MSG   1100    /* Userspace messages mostly uninteresting to kernel */
> >  #define AUDIT_USER_AVC         1107    /* We filter this differently */
> > @@ -465,6 +466,7 @@ struct audit_tty_status {
> >  };
> >
> >  #define AUDIT_UID_UNSET (unsigned int)-1
> > +#define AUDIT_CID_UNSET ((u64)-1)
> 
> I think we need to decide if we want to distinguish between the "host"
> (e.g. init ns) and "unset".  Looking at this patch (I've only quickly
> skimmed the others so far) it would appear that you don't think we
> need to worry about this distinction; that's fine, but let's make it
> explicit with a comment in the code that AUDIT_CID_UNSET means "unset"
> as well as "host".

I don't see any reason to distinguish between "host" and "unset".  Since
a container doesn't have a concrete definition based in namespaces, the
initial namespace set is meaningless here.

Is there value in having a container orchestrator process have a
reserved container ID that has a policy distinct from any other
container?  If so, then I could see the value in making the distinction.
For example, I've heard of interest in systemd acting as a container
orchestrator, so if it took on that role as PID 1, then every process in
the system would inherit that ID and none would be unset.

I can't picture how having seperate "host" and "unset" values helps us.

> If we do need to make a distinction, let's add a constant/macro for "host".

Currently "unset" is -1 which fits the convention used for sessionid and
loginuid and a number of others, so I think it makes sense to stick with
that.  If we decide we need a "host" flag, would it make sense to use 0
or (u64)-2?

> >  /* audit_rule_data supports filter rules with both integer and string
> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index 4e0a4ac..29c8482 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
> >         return rc;
> >  }
> >
> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
> > +{
> > +       struct task_struct *parent;
> > +       u64 pcontainerid, ccontainerid;
> > +
> > +       /* Don't allow to set our own containerid */
> > +       if (current == task)
> > +               return -EPERM;
> 
> Why not?  Is there some obvious security concern that I missing?

We then lose the distinction in the AUDIT_CONTAINER record between the
initiating PID and the target PID.  This was outlined in the proposal.

Having said that, I'm still not sure we have protected sufficiently from
a child turning around and setting it's parent's as yet unset or
inherited audit container ID.

> I ask because I suppose it might be possible for some container
> runtime to do a fork, setup some of the environment and them exec the
> container (before you answer the obvious "namespaces!" please remember
> we're not trying to define containers).

I don't think namespaces have any bearing on this concern since none are
required.

> > +       /* Don't allow the containerid to be unset */
> > +       if (!cid_valid(containerid))
> > +               return -EINVAL;
> > +       /* if we don't have caps, reject */
> > +       if (!capable(CAP_AUDIT_CONTROL))
> > +               return -EPERM;
> > +       /* if containerid is unset, allow */
> > +       if (!audit_containerid_set(task))
> > +               return 0;
> > +       /* it is already set, and not inherited from the parent, reject */
> > +       ccontainerid = audit_get_containerid(task);
> > +       rcu_read_lock();
> > +       parent = rcu_dereference(task->real_parent);
> > +       rcu_read_unlock();
> > +       task_lock(parent);
> > +       pcontainerid = audit_get_containerid(parent);
> > +       task_unlock(parent);
> > +       if (ccontainerid != pcontainerid)
> > +               return -EPERM;
> > +       return 0;
> > +}
> > +
> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid,
> > +                                     u64 containerid, int rc)
> > +{
> > +       struct audit_buffer *ab;
> > +       uid_t uid;
> > +       struct tty_struct *tty;
> > +
> > +       if (!audit_enabled)
> > +               return;
> > +
> > +       ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
> > +       if (!ab)
> > +               return;
> > +
> > +       uid = from_kuid(&init_user_ns, task_uid(current));
> > +       tty = audit_get_tty(current);
> > +
> > +       audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid);
> > +       audit_log_task_context(ab);
> > +       audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d",
> > +                        from_kuid(&init_user_ns, audit_get_loginuid(current)),
> > +                        tty ? tty_name(tty) : "(none)", audit_get_sessionid(current),
> > +                        task_tgid_nr(task), oldcontainerid, containerid, !rc);
> > +
> > +       audit_put_tty(tty);
> > +       audit_log_end(ab);
> > +}
> > +
> > +/**
> > + * audit_set_containerid - set current task's audit_context containerid
> > + * @containerid: containerid value
> > + *
> > + * Returns 0 on success, -EPERM on permission failure.
> > + *
> > + * Called (set) from fs/proc/base.c::proc_containerid_write().
> > + */
> > +int audit_set_containerid(struct task_struct *task, u64 containerid)
> > +{
> > +       u64 oldcontainerid;
> > +       int rc;
> > +
> > +       oldcontainerid = audit_get_containerid(task);
> > +
> > +       rc = audit_set_containerid_perm(task, containerid);
> > +       if (!rc) {
> > +               task_lock(task);
> > +               task->containerid = containerid;
> > +               task_unlock(task);
> > +       }
> > +
> > +       audit_log_set_containerid(task, oldcontainerid, containerid, rc);
> > +       return rc;
> 
> Why are audit_set_containerid_perm() and audit_log_containerid()
> separate functions?

(I assume you mean audit_log_set_containerid()?)
It seemed clearer that all the permission checking was in one function
and its return code could be used to report the outcome when logging the
(attempted) action.  This is the same structure as audit_set_loginuid()
and it made sense.

This would be the time to connect it to a syscall if that seems like a
good idea and remove pid, uid, auid, tty, ses fields.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 11/13] audit: add support for containerid to network namespaces
From: Paul Moore @ 2018-04-21 12:10 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, luto-DgEjT+Ai2ygdnm+yROfE0A,
	jlayton-H+wXaHxf7aLQT0dZR+AlfA, carlos-H+wXaHxf7aLQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, LKML,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	dhowells-H+wXaHxf7aLQT0dZR+AlfA, Linux-Audit Mailing List,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, simo-H+wXaHxf7aLQT0dZR+AlfA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eric Paris
In-Reply-To: <20180420204225.iik2lgtj6gx2ep4w-bcJWsdo4jJjeVoXN4CMphl7TgLCtbB0G@public.gmane.org>

On April 20, 2018 4:48:34 PM Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
On 2018-04-20 16:22, Paul Moore wrote:
On Fri, Apr 20, 2018 at 4:02 PM, Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
On 2018-04-18 21:46, Paul Moore wrote:
On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
Audit events could happen in a network namespace outside of a task
context due to packets received from the net that trigger an auditing
rule prior to being associated with a running task.  The network
namespace could in use by multiple containers by association to the
tasks in that network namespace.  We still want a way to attribute
these events to any potential containers.  Keep a list per network
namespace to track these container identifiiers.

Add/increment the container identifier on:
- initial setting of the container id via /proc
- clone/fork call that inherits a container identifier
- unshare call that inherits a container identifier
- setns call that inherits a container identifier
Delete/decrement the container identifier on:
- an inherited container id dropped when child set
- process exit
- unshare call that drops a net namespace
- setns call that drops a net namespace

See: https://github.com/linux-audit/audit-kernel/issues/32
See: https://github.com/linux-audit/audit-testsuite/issues/64
Signed-off-by: Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
include/linux/audit.h       |  7 +++++++
include/net/net_namespace.h | 12 ++++++++++++
kernel/auditsc.c            |  9 ++++++---
kernel/nsproxy.c            |  6 ++++++
net/core/net_namespace.c    | 45 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 76 insertions(+), 3 deletions(-)

...

diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index f6c5d33..d9f1090 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -140,6 +140,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *old_ns = tsk->nsproxy;
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;
+       u64 containerid = audit_get_containerid(tsk);

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
@@ -167,6 +168,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
return  PTR_ERR(new_ns);

tsk->nsproxy = new_ns;
+       net_add_audit_containerid(new_ns->net_ns, containerid);
return 0;
}

Hopefully we can handle this in audit_net_init(), we just need to
figure out where we can get the correct task_struct for the audit
container ID (some backpointer in the net struct?).

I don't follow.  This needs to happen on every task startup.
audit_net_init() is only called when a new network namespace starts up.

Yep, sorry, my mistake.  I must have confused myself when I was
looking at the code.

I'm thinking out loud here, bear with me ...

Assuming we move the netns/audit-container-ID tracking to audit_net,
and considering we already have an audit hook in copy_process() (it
calls audit_alloc()), would this be better handled by the
copy_process() hook?  This ignores naming, audit_alloc() reuse, etc.;
those can be easily fixed.  I'm just thinking of ways to limit our
impact on the core kernel and leverage our existing interaction
points.

The new namespace hasn't been cloned yet and this is the only function
where we have access to both namespaces, so I don't see how that could
work...

I'll take another, closer look, with v3.


paul moore

- RGB

--
Richard Guy Briggs <rgb-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


--
paul moore
www.paul-moore.com

^ permalink raw reply related

* Re: general protection fault in smc_getname
From: syzbot @ 2018-04-21 11:55 UTC (permalink / raw)
  To: davem, linux-kernel, linux-s390, netdev, syzkaller-bugs, ubraun,
	ubraun
In-Reply-To: <001a1145ecc48d1438056655dcd9@google.com>

syzbot has found reproducer for the following crash on upstream commit
83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of  
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=9605e6cace1b5efd4a0a

So far this crash happened 6 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4803108223844352
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=6277384739225600
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5836548759093248
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+9605e6cace1b5efd4a0a@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4548 Comm: syzkaller769662 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:smc_getname+0x124/0x1c0 net/smc/af_smc.c:1089
RSP: 0018:ffff8801ad1d7c20 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff873e3a58
RDX: 0000000000000005 RSI: ffffffff873e3af6 RDI: 0000000000000028
RBP: ffff8801ad1d7c48 R08: ffff8801b10ae280 R09: ffffed0036321370
R10: ffffed0036321370 R11: ffff8801b1909b83 R12: 0000000000000000
R13: ffff8801ad1d7d10 R14: ffff8801a866e0c0 R15: dffffc0000000000
FS:  0000000001a42880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000080 CR3: 00000001ac8b3000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __sys_getsockname+0x184/0x380 net/socket.c:1699
  __do_sys_getsockname net/socket.c:1714 [inline]
  __se_sys_getsockname net/socket.c:1711 [inline]
  __x64_sys_getsockname+0x73/0xb0 net/socket.c:1711
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43fce9
RSP: 002b:00007fff2eba2418 EFLAGS: 00000217 ORIG_RAX: 0000000000000033
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fce9
RDX: 0000000020000080 RSI: 0000000020000000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
R10: 00000000004002c8 R11: 0000000000000217 R12: 0000000000401610
R13: 00000000004016a0 R14: 0000000000000000 R15: 0000000000000000
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 99 00 00 00 48 8b 9b 50 04 00 00 48  
b8 00 00 00 00 00 fc ff df 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00  
75 70 48 b8 00 00 00 00 00 fc ff df 4c 8b 73 28 49
RIP: smc_getname+0x124/0x1c0 net/smc/af_smc.c:1089 RSP: ffff8801ad1d7c20
---[ end trace 9f5c3169466d9443 ]---

^ permalink raw reply

* Re: WARNING: refcount bug in should_fail
From: Tetsuo Handa @ 2018-04-21 10:26 UTC (permalink / raw)
  To: ebiederm, dvyukov
  Cc: syzbot+84371b6062cb639d797e, syzkaller-bugs, viro, linux-fsdevel,
	linux-mm, netdev
In-Reply-To: <877epnj7bq.fsf@xmission.com>

Eric W. Biederman wrote:
> Al Viro <viro@ZenIV.linux.org.uk> writes:
> 
> > On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote:
> >
> >> FWIW, I'm going through the ->kill_sb() instances, fixing that sort
> >> of bugs (most of them preexisting, but I should've checked instead
> >> of assuming that everything's fine).  Will push out later tonight.
> >
> > OK, see vfs.git#for-linus.  Caught: 4 old bugs (allocation failure
> > in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp.
> > and double-dput in late failure exit in rpc_fill_super())
> > and 5 regressions from register_shrinker() failure recovery.
> 
> One issue with your vfs.git#for-linus branch.
> 
> It is missing Fixes tags and  Cc: stable on those patches.
> As the bug came in v4.15 those tags would really help the stable
> maintainers get the recent regression fixes applied.

OK. The patch was sent to linux.git as commit 8e04944f0ea8b838.

#syz fix: mm,vmscan: Allow preallocating memory for register_shrinker().

^ permalink raw reply

* Re: Grant
From: M. M. Fridman @ 2018-04-21  5:41 UTC (permalink / raw)
  To: Recipients

I Mikhail Fridman. has selected you specially as one of my beneficiaries
for my Charitable Donation, Just as I have declared on May 23, 2016 to give
my fortune as charity.

Check the link below for confirmation:

http://www.ibtimes.co.uk/russias-second-wealthiest-man-mikhail-fridman-plans-leaving-14-2bn-fortune-charity-1561604

Reply as soon as possible with further directives.

Best Regards,
Mikhail Fridman.

^ permalink raw reply

* GREETING  FROM MISS QADESA,,F
From: Miss Qadesa AbdulAziz @ 2018-04-21  9:22 UTC (permalink / raw)
  To: netdev

Friend,
My name is Miss Qadesa AbdulAziz and I am 17 years old girl from Syria.  There is serious war crisis here in Syria, and I have lost my parents and
my two brothers in this war. I want you to help me and receive ($7.md)  which my late father deposited with my name in a bank in London. I want
to come to your country and start a new life  and invest with you, because am the only survival
in my family.

I wait to hear from you, Please do not let me die here, I begging you the name of Almighty. please respond here my pirate email    qadesa@protonmail.com

Regards
Miss Qadesa AbdulAziz

^ permalink raw reply

* Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
From: Christoph Hellwig @ 2018-04-21  9:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, linux-kernel, Soheil Hassas Yeganeh,
	Eric Dumazet, linux-mm, linux-fsdevel
In-Reply-To: <20180420155542.122183-1-edumazet@google.com>

On Fri, Apr 20, 2018 at 08:55:38AM -0700, Eric Dumazet wrote:
> This patch series provide a new mmap_hook to fs willing to grab
> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
> 
> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
> is held), and improve multi-threading scalability. 

Missing CC to linu-fsdevel and linux-mm that will have to decide.

We've rejected this approach multiple times before, so you better
make a really good argument for it.

introducing a multiplexer that overloads a single method certainly
doesn't help making that case.

^ permalink raw reply

* Re: Page allocator bottleneck
From: Aaron Lu @ 2018-04-21  8:15 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Linux Kernel Network Developers, linux-mm, Mel Gorman,
	David Miller, Jesper Dangaard Brouer, Eric Dumazet,
	Alexei Starovoitov, Saeed Mahameed, Eran Ben Elisha,
	Andrew Morton, Michal Hocko
In-Reply-To: <1c218381-067e-7757-ccc2-4e5befd2bfc3@mellanox.com>

Sorry to bring up an old thread...

On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
> 
> 
> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
> > 
> > 
> > On 15/09/2017 1:23 PM, Mel Gorman wrote:
> > > On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
> > > > Insights: Major degradation between #1 and #2, not getting any
> > > > close to linerate! Degradation is fixed between #2 and #3. This is
> > > > because page allocator cannot stand the higher allocation rate. In
> > > > #2, we also see that the addition of rings (cores) reduces BW (!!),
> > > > as result of increasing congestion over shared resources.
> > > > 
> > > 
> > > Unfortunately, no surprises there.
> > > 
> > > > Congestion in this case is very clear. When monitored in perf
> > > > top: 85.58% [kernel] [k] queued_spin_lock_slowpath
> > > > 
> > > 
> > > While it's not proven, the most likely candidate is the zone lock
> > > and that should be confirmed using a call-graph profile. If so, then
> > > the suggestion to tune to the size of the per-cpu allocator would
> > > mitigate the problem.
> > > 
> > Indeed, I tuned the per-cpu allocator and bottleneck is released.
> > 
> 
> Hi all,
> 
> After leaving this task for a while doing other tasks, I got back to it now
> and see that the good behavior I observed earlier was not stable.

I posted a patchset to improve zone->lock contention for order-0 pages
recently, it can almost eliminate 80% zone->lock contention for
will-it-scale/page_fault1 testcase when tested on a 2 sockets Intel
Skylake server and it doesn't require PCP size tune, so should have
some effects on your workload where one CPU does allocation while
another does free.

It did this by some disruptive changes:
1 on free path, it skipped doing merge(so could be bad for mixed
  workloads where both 4K and high order pages are needed);
2 on allocation path, it avoided touching multiple cachelines.

RFC v2 patchset:
https://lkml.org/lkml/2018/3/20/171

repo:
https://github.com/aaronlu/linux zone_lock_rfc_v2

 
> Recall: I work with a modified driver that allocates a page (4K) per packet
> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
> NICs.
> 
> Performance is good as long as pages are available in the allocating cores's
> PCP.
> Issue is that pages are allocated in one core, then free'd in another,
> making it's hard for the PCP to work efficiently, and both the allocator
> core and the freeing core need to access the buddy allocator very often.
> 
> I'd like to share with you some testing numbers:
> 
> Test: ./super_netperf 128 -H 24.134.0.51 -l 1000
> 
> 100% cpu on all cores, top func in perf:
>    84.98%  [kernel]             [k] queued_spin_lock_slowpath
> 
> system wide (all cores)
>            1135941      kmem:mm_page_alloc
> 
>            2606629      kmem:mm_page_free
> 
>                  0      kmem:mm_page_alloc_extfrag
>            4784616      kmem:mm_page_alloc_zone_locked
> 
>               1337      kmem:mm_page_free_batched
> 
>            6488213      kmem:mm_page_pcpu_drain
> 
>            8925503      net:napi_gro_receive_entry
> 
> 
> Two types of cores:
> A core mostly running napi (8 such cores):
>             221875      kmem:mm_page_alloc
> 
>              17100      kmem:mm_page_free
> 
>                  0      kmem:mm_page_alloc_extfrag
>             766584      kmem:mm_page_alloc_zone_locked
> 
>                 16      kmem:mm_page_free_batched
> 
>                 35      kmem:mm_page_pcpu_drain
> 
>            1340139      net:napi_gro_receive_entry
> 
> 
> Other core, mostly running user application (40 such):
>                  2      kmem:mm_page_alloc
> 
>              38922      kmem:mm_page_free
> 
>                  0      kmem:mm_page_alloc_extfrag
>                  1      kmem:mm_page_alloc_zone_locked
> 
>                  8      kmem:mm_page_free_batched
> 
>             107289      kmem:mm_page_pcpu_drain
> 
>                 34      net:napi_gro_receive_entry
> 
> 
> As you can see, sync overhead is enormous.
> 
> PCP-wise, a key improvement in such scenarios would be reached if we could
> (1) keep and handle the allocated page on same cpu, or (2) somehow get the
> page back to the allocating core's PCP in a fast-path, without going through
> the regular buddy allocator paths.

^ permalink raw reply

* Re: [PATCH 3/8] [media] v4l: rcar_fdp1: Change platform dependency to ARCH_RENESAS
From: Laurent Pinchart @ 2018-04-21  8:07 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Simon Horman, Magnus Damm, Russell King, Catalin Marinas,
	Will Deacon, Dan Williams, Vinod Koul, Mauro Carvalho Chehab,
	Sergei Shtylyov, David S . Miller, Greg Kroah-Hartman,
	Liam Girdwood, Mark Brown, Jaroslav Kysela, Takashi Iwai,
	Arnd Bergmann, Kuninori Morimoto, linux-renesas-soc,
	linux-arm-kernel, dmaengine
In-Reply-To: <1524230914-10175-4-git-send-email-geert+renesas@glider.be>

Hi Geert,

Thank you for the patch.

On Friday, 20 April 2018 16:28:29 EEST Geert Uytterhoeven wrote:
> The Renesas Fine Display Processor driver is used on Renesas R-Car SoCs
> only.  Since commit 9b5ba0df4ea4f940 ("ARM: shmobile: Introduce
> ARCH_RENESAS") is ARCH_RENESAS a more appropriate platform dependency
> than the legacy ARCH_SHMOBILE, hence use the former.
> 
> This will allow to drop ARCH_SHMOBILE on ARM and ARM64 in the near
> future.
> 
> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>

Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>

How would you like to get this merged ?

> ---
>  drivers/media/platform/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/media/platform/Kconfig b/drivers/media/platform/Kconfig
> index f9235e8f8e962d2e..7ad4725f9d1f9627 100644
> --- a/drivers/media/platform/Kconfig
> +++ b/drivers/media/platform/Kconfig
> @@ -396,7 +396,7 @@ config VIDEO_SH_VEU
>  config VIDEO_RENESAS_FDP1
>  	tristate "Renesas Fine Display Processor"
>  	depends on VIDEO_DEV && VIDEO_V4L2 && HAS_DMA
> -	depends on ARCH_SHMOBILE || COMPILE_TEST
> +	depends on ARCH_RENESAS || COMPILE_TEST
>  	depends on (!ARCH_RENESAS && !VIDEO_RENESAS_FCP) || VIDEO_RENESAS_FCP
>  	select VIDEOBUF2_DMA_CONTIG
>  	select V4L2_MEM2MEM_DEV

-- 
Regards,

Laurent Pinchart

^ permalink raw reply

* Re: [pci PATCH v8 4/4] pci-pf-stub: Add PF driver stub for PFs that function only to enable VFs
From: Christoph Hellwig @ 2018-04-21  7:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, ddutile, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420163109.46077.60334.stgit@ahduyck-green-test.jf.intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [pci PATCH v8 3/4] nvme: Migrate over to unmanaged SR-IOV support
From: Christoph Hellwig @ 2018-04-21  7:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, ddutile, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420163027.46077.93925.stgit@ahduyck-green-test.jf.intel.com>

On Fri, Apr 20, 2018 at 12:31:03PM -0400, Alexander Duyck wrote:
> Instead of implementing our own version of a SR-IOV configuration stub in
> the nvme driver we can just reuse the existing
> pci_sriov_configure_simple function.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [pci PATCH v8 2/4] ena: Migrate over to unmanaged SR-IOV support
From: Christoph Hellwig @ 2018-04-21  7:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, ddutile, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420162841.46077.17967.stgit@ahduyck-green-test.jf.intel.com>

On Fri, Apr 20, 2018 at 12:30:21PM -0400, Alexander Duyck wrote:
> Instead of implementing our own version of a SR-IOV configuration stub in
> the ena driver we can just reuse the existing
> pci_sriov_configure_simple function.
> 
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [pci PATCH v8 1/4] pci: Add pci_sriov_configure_simple for PFs that don't manage VF resources
From: Christoph Hellwig @ 2018-04-21  7:05 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: bhelgaas, linux-pci, virtio-dev, kvm, netdev, dan.daly,
	linux-kernel, linux-nvme, keith.busch, netanel, ddutile, mheyne,
	liang-min.wang, mark.d.rustad, dwmw2, hch, dwmw
In-Reply-To: <20180420162814.46077.11939.stgit@ahduyck-green-test.jf.intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply

* Re: [virtio-dev] [pci PATCH v7 2/5] virtio_pci: Add support for unmanaged SR-IOV on virtio_pci devices
From: Christoph Hellwig @ 2018-04-21  7:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alexander Duyck, Daly, Dan, Bjorn Helgaas, Duyck, Alexander H,
	linux-pci, virtio-dev, kvm, Netdev, LKML, linux-nvme, Keith Busch,
	netanel, Don Dutile, Maximilian Heyne, Wang, Liang-min,
	Rustad, Mark D, David Woodhouse, Christoph Hellwig, dwmw
In-Reply-To: <20180420180839-mutt-send-email-mst@kernel.org>

On Fri, Apr 20, 2018 at 06:28:50PM +0300, Michael S. Tsirkin wrote:
> But maybe it's not needed here.  I am not making the decisions myself.
> Not too late: post to the TC list and let's see what the response is.
> Without a feature bit you are making a change affecting all future
> implementations without exception so the bar is a bit higher: you need
> to actually post a spec text proposal not just a patch showing how to
> use the feature, and TC needs to vote on it. Voting takes a week,
> review a week or two depending on change complexity.

Also IFF the hardware already is out we can quirk it in the PCI ID
table to manually set the feature in the driver as a workaround.

^ permalink raw reply

* Re: general protection fault in smc_getsockopt
From: syzbot @ 2018-04-21  6:40 UTC (permalink / raw)
  To: davem, linux-kernel, linux-s390, netdev, syzkaller-bugs, ubraun,
	ubraun
In-Reply-To: <94eb2c030cc4d2be530566a62109@google.com>

syzbot has found reproducer for the following crash on upstream commit
83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of  
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=28a2c86cf19c81d871fa

So far this crash happened 59 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6375334488309760
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=6112997885870080
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5942131738804224
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+28a2c86cf19c81d871fa@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4492 Comm: syzkaller771634 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:smc_getsockopt+0x8b/0x120 net/smc/af_smc.c:1298
RSP: 0018:ffff8801d90b7cc0 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000020000000
RDX: 0000000000000005 RSI: ffffffff873e3d16 RDI: 0000000000000028
RBP: ffff8801d90b7cf0 R08: 0000000020000040 R09: ffffed0036a05800
R10: ffffed0036a05800 R11: ffff8801b502c003 R12: ffff8801d90b7d40
R13: 0000000000000000 R14: 0000000000000008 R15: 0000000020000000
FS:  0000000002017880(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000040 CR3: 00000001ac552000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __sys_getsockopt+0x1a5/0x370 net/socket.c:1940
  __do_sys_getsockopt net/socket.c:1951 [inline]
  __se_sys_getsockopt net/socket.c:1948 [inline]
  __x64_sys_getsockopt+0xbe/0x150 net/socket.c:1948
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43fcf9
RSP: 002b:00007ffcb16b9c58 EFLAGS: 00000217 ORIG_RAX: 0000000000000037
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fcf9
RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000020000040 R09: 00000000004002c8
R10: 0000000020000000 R11: 0000000000000217 R12: 0000000000401620
R13: 00000000004016b0 R14: 0000000000000000 R15: 0000000000000000
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 93 00 00 00 48 8b 9b 50 04 00 00 48  
b8 00 00 00 00 00 fc ff df 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00  
75 62 48 b8 00 00 00 00 00 fc ff df 4c 8b 63 28 49
RIP: smc_getsockopt+0x8b/0x120 net/smc/af_smc.c:1298 RSP: ffff8801d90b7cc0
---[ end trace 7e67761582d7c7ee ]---

^ permalink raw reply

* Re: [PATCH iproute2] iplink_geneve: correct size of message to avoid spurious errors
From: William Tu @ 2018-04-21  4:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stephen Hemminger, David Ahern, oss-drivers,
	Linux Kernel Network Developers
In-Reply-To: <20180418180607.11843-1-jakub.kicinski@netronome.com>

On Wed, Apr 18, 2018 at 11:06 AM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> Commit 6c4b672738ac ("iplink_geneve: Get rid of inet_get_addr()")
> inadvertently changed the parameter to addattr_l() resulting in:
>
> addattr_l ERROR: message exceeded bound of 4
>
> when remote is specified.
>
> Fixes: 6c4b672738ac ("iplink_geneve: Get rid of inet_get_addr()")
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---

Thanks. We also hit this issue when creating geneve tunnel.

Acked-by: William Tu <u9012063@gmail.com>

^ permalink raw reply

* [PATCH net-next 1/1] tc-testing: updated ife test cases
From: Roman Mashak @ 2018-04-21  3:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, kernel, jhs, xiyou.wangcong, jiri, Roman Mashak

Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
 .../selftests/tc-testing/tc-tests/actions/ife.json | 1036 +++++++++++++++++++-
 1 file changed, 1024 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/ife.json b/tools/testing/selftests/tc-testing/tc-tests/actions/ife.json
index 9f34f0753969..03777730ef29 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/ife.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/ife.json
@@ -1,7 +1,7 @@
 [
     {
-        "id": "a568",
-        "name": "Add action with ife type",
+        "id": "7682",
+        "name": "Create valid ife encode action with mark and pass control",
         "category": [
             "actions",
             "ife"
@@ -12,21 +12,878 @@
                 0,
                 1,
                 255
-            ],
-            "$TC actions add action ife encode type 0xDEAD index 1"
+            ]
         ],
-        "cmdUnderTest": "$TC actions get action ife index 1",
+        "cmdUnderTest": "$TC actions add action ife encode allow mark pass index 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 2",
+        "matchPattern": "action order [0-9]*: ife encode action pass.*type 0xED3E.*allow mark.*index 2",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "ef47",
+        "name": "Create valid ife encode action with mark and pipe control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use mark 10 pipe index 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 2",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xED3E.*use mark.*index 2",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "df43",
+        "name": "Create valid ife encode action with mark and continue control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow mark continue index 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 2",
+        "matchPattern": "action order [0-9]*: ife encode action continue.*type 0xED3E.*allow mark.*index 2",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "e4cf",
+        "name": "Create valid ife encode action with mark and drop control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use mark 789 drop index 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 2",
+        "matchPattern": "action order [0-9]*: ife encode action drop.*type 0xED3E.*use mark 789.*index 2",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "ccba",
+        "name": "Create valid ife encode action with mark and reclassify control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use mark 656768 reclassify index 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 2",
+        "matchPattern": "action order [0-9]*: ife encode action reclassify.*type 0xED3E.*use mark 656768.*index 2",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "a1cf",
+        "name": "Create valid ife encode action with mark and jump control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use mark 65 jump 1 index 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 2",
+        "matchPattern": "action order [0-9]*: ife encode action jump 1.*type 0xED3E.*use mark 65.*index 2",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "cb3d",
+        "name": "Create valid ife encode action with mark value at 32-bit maximum",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use mark 4294967295 reclassify index 90",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 90",
+        "matchPattern": "action order [0-9]*: ife encode action reclassify.*type 0xED3E.*use mark 4294967295.*index 90",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "1efb",
+        "name": "Create ife encode action with mark value exceeding 32-bit maximum",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use mark 4294967295999 pipe index 90",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action ife index 90",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xED3E.*use mark 4294967295999.*index 90",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "95ed",
+        "name": "Create valid ife encode action with prio and pass control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow prio pass index 9",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 9",
+        "matchPattern": "action order [0-9]*: ife encode action pass.*type 0xED3E.*allow prio.*index 9",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "aa17",
+        "name": "Create valid ife encode action with prio and pipe control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use prio 7 pipe index 9",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 9",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xED3E.*use prio 7.*index 9",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "74c7",
+        "name": "Create valid ife encode action with prio and continue control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use prio 3 continue index 9",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 9",
+        "matchPattern": "action order [0-9]*: ife encode action continue.*type 0xED3E.*use prio 3.*index 9",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "7a97",
+        "name": "Create valid ife encode action with prio and drop control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow prio drop index 9",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 9",
+        "matchPattern": "action order [0-9]*: ife encode action drop.*type 0xED3E.*allow prio.*index 9",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "f66b",
+        "name": "Create valid ife encode action with prio and reclassify control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use prio 998877 reclassify index 9",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 9",
+        "matchPattern": "action order [0-9]*: ife encode action reclassify.*type 0xED3E.*use prio 998877.*index 9",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "3056",
+        "name": "Create valid ife encode action with prio and jump control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use prio 998877 jump 10 index 9",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 9",
+        "matchPattern": "action order [0-9]*: ife encode action jump 10.*type 0xED3E.*use prio 998877.*index 9",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "7dd3",
+        "name": "Create valid ife encode action with prio value at 32-bit maximum",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use prio 4294967295 reclassify index 99",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 99",
+        "matchPattern": "action order [0-9]*: ife encode action reclassify.*type 0xED3E.*use prio 4294967295.*index 99",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "2ca1",
+        "name": "Create ife encode action with prio value exceeding 32-bit maximum",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use prio 4294967298 pipe index 99",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action ife index 99",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xED3E.*use prio 4294967298.*index 99",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "05bb",
+        "name": "Create valid ife encode action with tcindex and pass control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow tcindex pass index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife encode action pass.*type 0xED3E.*allow tcindex.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "ce65",
+        "name": "Create valid ife encode action with tcindex and pipe control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use tcindex 111 pipe index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xED3E.*use tcindex 111.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "09cd",
+        "name": "Create valid ife encode action with tcindex and continue control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use tcindex 1 continue index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife encode action continue.*type 0xED3E.*use tcindex 1.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "8eb5",
+        "name": "Create valid ife encode action with tcindex and continue control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use tcindex 1 continue index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife encode action continue.*type 0xED3E.*use tcindex 1.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "451a",
+        "name": "Create valid ife encode action with tcindex and drop control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow tcindex drop index 77",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 77",
+        "matchPattern": "action order [0-9]*: ife encode action drop.*type 0xED3E.*allow tcindex.*index 77",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "d76c",
+        "name": "Create valid ife encode action with tcindex and reclassify control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow tcindex reclassify index 77",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 77",
+        "matchPattern": "action order [0-9]*: ife encode action reclassify.*type 0xED3E.*allow tcindex.*index 77",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "e731",
+        "name": "Create valid ife encode action with tcindex and jump control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow tcindex jump 999 index 77",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 77",
+        "matchPattern": "action order [0-9]*: ife encode action jump 999.*type 0xED3E.*allow tcindex.*index 77",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "b7b8",
+        "name": "Create valid ife encode action with tcindex value at 16-bit maximum",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use tcindex 65535 pass index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife encode action pass.*type 0xED3E.*use tcindex 65535.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action skbedit"
+        ]
+    },
+    {
+        "id": "d0d8",
+        "name": "Create ife encode action with tcindex value exceeding 16-bit maximum",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use tcindex 65539 pipe index 1",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xED3E.*use tcindex 65539.*index 1",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "2a9c",
+        "name": "Create valid ife encode action with mac src parameter",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow mark src 00:11:22:33:44:55 pipe index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xED3E.*allow mark src 00:11:22:33:44:55.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "cf5c",
+        "name": "Create valid ife encode action with mac dst parameter",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use prio 9876 dst 00:11:22:33:44:55 reclassify index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife encode action reclassify.*type 0xED3E.*use prio 9876 dst 00:11:22:33:44:55.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "2353",
+        "name": "Create valid ife encode action with mac src and mac dst parameters",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow tcindex src 00:aa:bb:cc:dd:ee dst 00:11:22:33:44:55 pass index 11",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 11",
+        "matchPattern": "action order [0-9]*: ife encode action pass.*type 0xED3E.*allow tcindex dst 00:11:22:33:44:55 src 00:aa:bb:cc:dd:ee .*index 11",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "552c",
+        "name": "Create valid ife encode action with mark and type parameters",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use mark 7 type 0xfefe pass index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife encode action pass.*type 0xFEFE.*use mark 7.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "0421",
+        "name": "Create valid ife encode action with prio and type parameters",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use prio 444 type 0xabba pipe index 21",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 21",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xABBA.*use prio 444.*index 21",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "4017",
+        "name": "Create valid ife encode action with tcindex and type parameters",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode use tcindex 5000 type 0xabcd reclassify index 21",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 21",
+        "matchPattern": "action order [0-9]*: ife encode action reclassify.*type 0xABCD.*use tcindex 5000.*index 21",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "fac3",
+        "name": "Create valid ife encode action with index at 32-bit maximnum",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow mark pass index 4294967295",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 4294967295",
+        "matchPattern": "action order [0-9]*: ife encode action pass.*type 0xED3E.*allow mark.*index 4294967295",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "7c25",
+        "name": "Create valid ife decode action with pass control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife decode pass index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife decode action pass.*type 0x0.*allow mark allow tcindex allow prio.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "dccb",
+        "name": "Create valid ife decode action with pipe control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife decode pipe index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife decode action pipe.*type 0x0.*allow mark allow tcindex allow prio.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "7bb9",
+        "name": "Create valid ife decode action with continue control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife decode continue index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife decode action continue.*type 0x0.*allow mark allow tcindex allow prio.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "d9ad",
+        "name": "Create valid ife decode action with drop control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife decode drop index 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 1",
+        "matchPattern": "action order [0-9]*: ife decode action drop.*type 0x0.*allow mark allow tcindex allow prio.*index 1",
+        "matchCount": "1",
+        "teardown": [
+            "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "219f",
+        "name": "Create valid ife decode action with reclassify control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife decode reclassify index 1",
         "expExitCode": "0",
         "verifyCmd": "$TC actions get action ife index 1",
-        "matchPattern": "type 0xDEAD",
+        "matchPattern": "action order [0-9]*: ife decode action reclassify.*type 0x0.*allow mark allow tcindex allow prio.*index 1",
         "matchCount": "1",
         "teardown": [
             "$TC actions flush action ife"
         ]
     },
     {
-        "id": "b983",
-        "name": "Add action without ife type",
+        "id": "8f44",
+        "name": "Create valid ife decode action with jump control",
         "category": [
             "actions",
             "ife"
@@ -37,16 +894,171 @@
                 0,
                 1,
                 255
-            ],
-            "$TC actions add action ife encode index 1"
+            ]
         ],
-        "cmdUnderTest": "$TC actions get action ife index 1",
+        "cmdUnderTest": "$TC actions add action ife decode jump 10 index 1",
         "expExitCode": "0",
         "verifyCmd": "$TC actions get action ife index 1",
-        "matchPattern": "type 0xED3E",
+        "matchPattern": "action order [0-9]*: ife decode action jump 10.*type 0x0.*allow mark allow tcindex allow prio.*index 1",
         "matchCount": "1",
         "teardown": [
             "$TC actions flush action ife"
         ]
+    },
+    {
+        "id": "56cf",
+        "name": "Create ife encode action with index exceeding 32-bit maximum",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow mark pass index 4294967295999",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action ife index 4294967295999",
+        "matchPattern": "action order [0-9]*: ife encode action pass.*type 0xED3E.*allow mark.*index 4294967295999",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "ee94",
+        "name": "Create ife encode action with invalid control",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow mark kuka index 4",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action ife index 4",
+        "matchPattern": "action order [0-9]*: ife encode action kuka.*type 0xED3E.*allow mark.*index 4",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "b330",
+        "name": "Create ife encode action with cookie",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow prio pipe index 4 cookie aabbccddeeff112233445566778800a1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions get action ife index 4",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xED3E.*allow prio.*index 4.*cookie aabbccddeeff112233445566778800a1",
+        "matchCount": "1",
+        "teardown": [
+           "$TC actions flush action ife"
+        ]
+    },
+    {
+        "id": "bbc0",
+        "name": "Create ife encode action with invalid argument",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow foo pipe index 4",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action ife index 4",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0xED3E.*allow foo.*index 4",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "d54a",
+        "name": "Create ife encode action with invalid type argument",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow prio type 70000 pipe index 4",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action ife index 4",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*type 0x11170.*allow prio.*index 4",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "7ee0",
+        "name": "Create ife encode action with invalid mac src argument",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow prio src 00:11:22:33:44:pp pipe index 4",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action ife index 4",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*allow prio.*index 4",
+        "matchCount": "0",
+        "teardown": []
+    },
+    {
+        "id": "0a7d",
+        "name": "Create ife encode action with invalid mac dst argument",
+        "category": [
+            "actions",
+            "ife"
+        ],
+        "setup": [
+            [
+                "$TC actions flush action ife",
+                0,
+                1,
+                255
+            ]
+        ],
+        "cmdUnderTest": "$TC actions add action ife encode allow prio dst 00.111-22:33:44:aa pipe index 4",
+        "expExitCode": "255",
+        "verifyCmd": "$TC actions get action ife index 4",
+        "matchPattern": "action order [0-9]*: ife encode action pipe.*allow prio.*index 4",
+        "matchCount": "0",
+        "teardown": []
     }
 ]
-- 
2.7.4

^ permalink raw reply related

* Re: general protection fault in smc_setsockopt
From: syzbot @ 2018-04-21  3:49 UTC (permalink / raw)
  To: davem, linux-kernel, linux-s390, netdev, syzkaller-bugs, ubraun,
	ubraun
In-Reply-To: <001a11426eae20adb5056655e0f1@google.com>

syzbot has found reproducer for the following crash on upstream commit
83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +0000)
Merge branch 'fixes' of  
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=9045fc589fcd196ef522

So far this crash happened 124 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6522155797839872
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5566093930266624
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6661555940753408
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+9045fc589fcd196ef522@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4520 Comm: syzkaller696326 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:smc_setsockopt+0x8b/0x120 net/smc/af_smc.c:1287
RSP: 0018:ffff8801b433fcc8 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000020000000
RDX: 0000000000000005 RSI: ffffffff873e3bf6 RDI: 0000000000000028
RBP: ffff8801b433fcf8 R08: 0000000000000000 R09: ffffed00359a3780
R10: ffffed00359a3780 R11: ffff8801acd1bc03 R12: ffff8801b433fd40
R13: 0000000000000021 R14: 000000000000000d R15: 0000000020000000
FS:  0000000000b01880(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000562e1fa410d0 CR3: 00000001acf1e000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
  __do_sys_setsockopt net/socket.c:1914 [inline]
  __se_sys_setsockopt net/socket.c:1911 [inline]
  __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43fd19
RSP: 002b:00007ffccc4960f8 EFLAGS: 00000217 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 000000000043fd19
RDX: 000000000000000d RSI: 0000000000000021 RDI: 0000000000000004
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000020000000 R11: 0000000000000217 R12: 0000000000401640
R13: 00000000004016d0 R14: 0000000000000000 R15: 0000000000000000
Code: fa 48 c1 ea 03 80 3c 02 00 0f 85 93 00 00 00 48 8b 9b 50 04 00 00 48  
b8 00 00 00 00 00 fc ff df 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00  
75 62 48 b8 00 00 00 00 00 fc ff df 4c 8b 63 28 49
RIP: smc_setsockopt+0x8b/0x120 net/smc/af_smc.c:1287 RSP: ffff8801b433fcc8
---[ end trace 3858d0cd9ce5e4d4 ]---

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Alexander Duyck @ 2018-04-21  2:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: David Miller, Sowmini Varadhan, Samudrala, Sridhar, Netdev,
	Willem de Bruijn
In-Reply-To: <CAF=yD-JX--1m8owpS+EqNAhU6rP-8KCDDXC51HNpRss4px6yig@mail.gmail.com>

On Fri, Apr 20, 2018 at 2:58 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>>>> Also any plans for HW offload support for this? I vaguely recall that
>>>> the igb and ixgbe parts had support for something like this in
>>>> hardware. I would have to double check to see what exactly is
>>>> supported.
>>>
>>> I hadn't given that much thought until the request yesterday to
>>> expose the NETIF_F_GSO_UDP_L4 flag through ethtool. By
>>> virtue of having only a single fixed segmentation length, it
>>> appears reasonably straightforward to offload.
>>
>> Actually I just got a chance to start on a review of things. Do we
>> need to have to use both GSO_UDP and and GSO_UDP_L4? It might be
>> better if we could split these up and specifically call out GSO_UDP as
>> UFO and GSO_UDP_L4 as being UDP segmentation.
>
> Thanks for taking a look, Alex.
>
> Agreed, I'll revise that. My initial thought was that both gso skbs need
> to take the same udp gso special case branches in places like act_csum
> and ovs. But on rereading that seems an unsafe approach, as some
> branches are fragmentation specific. I'll review them all and add
> separate SKB_GSO_UDP_L4 cases where needed, instead.

Sounds good. Keep me in the loop on these patches. I will see if we
can get support for ixgbe, ixgbevf, and igb to at least offload this
since it shouldn't be much of a lift to get that. Once we have that I
can also take a look and see if there would be much work needed to add
gso_partial support so we could deal with tunnel encapsulated UDP
segmentation offload support on those devices.

Thanks.

- Alex

^ permalink raw reply

* pull-request: bpf-next 2018-04-21
From: Daniel Borkmann @ 2018-04-21  1:07 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Initial work on BPF Type Format (BTF) is added, which is a meta
   data format which describes the data types of BPF programs / maps.
   BTF has its roots from CTF (Compact C-Type format) with a number
   of changes to it. First use case is to provide a generic pretty
   print capability for BPF maps inspection, later work will also
   add BTF to bpftool. pahole support to convert dwarf to BTF will
   be upstreamed as well (https://github.com/iamkafai/pahole/tree/btf),
   from Martin.

2) Add a new xdp_bpf_adjust_tail() BPF helper for XDP that allows
   for changing the data_end pointer. Only shrinking is currently
   supported which helps for crafting ICMP control messages. Minor
   changes in drivers have been added where needed so they recalc
   the packet's length also when data_end was adjusted, from Nikita.

3) Improve bpftool to make it easier to feed hex bytes via cmdline
   for map operations, from Quentin.

4) Add support for various missing BPF prog types and attach types
   that have been added to kernel recently but neither to bpftool
   nor libbpf yet. Doc and bash completion updates have been added
   as well for bpftool, from Andrey.

5) Proper fix for avoiding to leak info stored in frame data on page
   reuse for the two bpf_xdp_adjust_{head,meta} helpers by disallowing
   to move the pointers into struct xdp_frame area, from Jesper.

6) Follow-up compile fix from BTF in order to include stdbool.h in
   libbpf, from Björn.

7) Few fixes in BPF sample code, that is, a typo on the netdevice
   in a comment and fixup proper dump of XDP action code in the
   tracepoint exception, from Wang and Jesper.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit a2d481b326c98b6b67eea8a378c858d57ca5ff3d:

  ipv6: send netlink notifications for manually configured addresses (2018-04-17 14:03:56 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to 878a4d328104fed86ea321c48a5245eed44fbe14:

  libbpf: fixed build error for samples/bpf/ (2018-04-20 10:12:19 +0200)

----------------------------------------------------------------
Andrey Ignatov (3):
      bpftool: Support new prog types and attach types
      libbpf: Support guessing post_bind{4,6} progs
      libbpf: Type functions for raw tracepoints

Björn Töpel (1):
      libbpf: fixed build error for samples/bpf/

Daniel Borkmann (3):
      Merge branch 'bpf-libbpf-add-types'
      Merge branch 'bpf-xdp-adjust-tail'
      Merge branch 'bpf-type-format'

Jesper Dangaard Brouer (2):
      samples/bpf: fix xdp_monitor user output for tracepoint exception
      bpf: reserve xdp_frame size in xdp headroom

Martin KaFai Lau (10):
      bpf: btf: Introduce BPF Type Format (BTF)
      bpf: btf: Validate type reference
      bpf: btf: Check members of struct/union
      bpf: btf: Add pretty print capability for data with BTF type info
      bpf: btf: Add BPF_BTF_LOAD command
      bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd
      bpf: btf: Add pretty print support to the basic arraymap
      bpf: btf: Sync bpf.h and btf.h to tools/
      bpf: btf: Add BTF support to libbpf
      bpf: btf: Add BTF tests

Nikita V. Shirokov (11):
      bpf: adding bpf_xdp_adjust_tail helper
      bpf: make generic xdp compatible w/ bpf_xdp_adjust_tail
      bpf: make mlx4 compatible w/ bpf_xdp_adjust_tail
      bpf: make bnxt compatible w/ bpf_xdp_adjust_tail
      bpf: make cavium thunder compatible w/ bpf_xdp_adjust_tail
      bpf: make netronome nfp compatible w/ bpf_xdp_adjust_tail
      bpf: make tun compatible w/ bpf_xdp_adjust_tail
      bpf: make virtio compatible w/ bpf_xdp_adjust_tail
      bpf: making bpf_prog_test run aware of possible data_end ptr change
      bpf: adding tests for bpf_xdp_adjust_tail
      bpf: add bpf_xdp_adjust_tail sample prog

Quentin Monnet (1):
      tools: bpftool: make it easier to feed hex bytes to bpftool

Wang Sheng-Hui (1):
      samples/bpf: correct comment in sock_example.c

 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c      |    2 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |    2 +-
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |    2 +-
 drivers/net/tun.c                                  |    3 +-
 drivers/net/virtio_net.c                           |    7 +-
 include/linux/bpf.h                                |   20 +-
 include/linux/btf.h                                |   48 +
 include/uapi/linux/bpf.h                           |   22 +-
 include/uapi/linux/btf.h                           |  130 ++
 kernel/bpf/Makefile                                |    1 +
 kernel/bpf/arraymap.c                              |   50 +
 kernel/bpf/btf.c                                   | 2064 ++++++++++++++++++++
 kernel/bpf/inode.c                                 |  156 +-
 kernel/bpf/syscall.c                               |   51 +-
 net/bpf/test_run.c                                 |    3 +-
 net/core/dev.c                                     |   10 +-
 net/core/filter.c                                  |   41 +-
 samples/bpf/Makefile                               |    4 +
 samples/bpf/sock_example.c                         |    4 +-
 samples/bpf/xdp_adjust_tail_kern.c                 |  152 ++
 samples/bpf/xdp_adjust_tail_user.c                 |  142 ++
 samples/bpf/xdp_monitor_user.c                     |    2 +-
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst |   11 +-
 tools/bpf/bpftool/Documentation/bpftool-map.rst    |   29 +-
 tools/bpf/bpftool/bash-completion/bpftool          |   14 +-
 tools/bpf/bpftool/cgroup.c                         |   15 +-
 tools/bpf/bpftool/map.c                            |   17 +-
 tools/bpf/bpftool/prog.c                           |    3 +
 tools/include/uapi/linux/bpf.h                     |   22 +-
 tools/include/uapi/linux/btf.h                     |  130 ++
 tools/lib/bpf/Build                                |    2 +-
 tools/lib/bpf/bpf.c                                |   92 +-
 tools/lib/bpf/bpf.h                                |   17 +
 tools/lib/bpf/btf.c                                |  374 ++++
 tools/lib/bpf/btf.h                                |   22 +
 tools/lib/bpf/libbpf.c                             |  156 +-
 tools/lib/bpf/libbpf.h                             |    5 +
 tools/testing/selftests/bpf/Makefile               |   26 +-
 tools/testing/selftests/bpf/bpf_helpers.h          |    5 +
 tools/testing/selftests/bpf/test_adjust_tail.c     |   30 +
 tools/testing/selftests/bpf/test_btf.c             | 1669 ++++++++++++++++
 tools/testing/selftests/bpf/test_btf_haskv.c       |   48 +
 tools/testing/selftests/bpf/test_btf_nokv.c        |   43 +
 tools/testing/selftests/bpf/test_progs.c           |   32 +
 45 files changed, 5591 insertions(+), 89 deletions(-)
 create mode 100644 include/linux/btf.h
 create mode 100644 include/uapi/linux/btf.h
 create mode 100644 kernel/bpf/btf.c
 create mode 100644 samples/bpf/xdp_adjust_tail_kern.c
 create mode 100644 samples/bpf/xdp_adjust_tail_user.c
 create mode 100644 tools/include/uapi/linux/btf.h
 create mode 100644 tools/lib/bpf/btf.c
 create mode 100644 tools/lib/bpf/btf.h
 create mode 100644 tools/testing/selftests/bpf/test_adjust_tail.c
 create mode 100644 tools/testing/selftests/bpf/test_btf.c
 create mode 100644 tools/testing/selftests/bpf/test_btf_haskv.c
 create mode 100644 tools/testing/selftests/bpf/test_btf_nokv.c

^ permalink raw reply

* pull-request: bpf 2018-04-21
From: Daniel Borkmann @ 2018-04-21  0:22 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a deadlock between mm->mmap_sem and bpf_event_mutex when
   one task is detaching a BPF prog via perf_event_detach_bpf_prog()
   and another one dumping through bpf_prog_array_copy_info(). For
   the latter we move the copy_to_user() out of the bpf_event_mutex
   lock to fix it, from Yonghong.

2) Fix test_sock and test_sock_addr.sh failures. The former was
   hitting rlimit issues and the latter required ping to specify
   the address family, from Yonghong.

3) Remove a dead check in sockmap's sock_map_alloc(), from Jann.

4) Add generated files to BPF kselftests gitignore that were previously
   missed, from Anders.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 1cc5954f44150bb70cac07c3cc5df7cf0dfb61ec:

  ip_gre: clear feature flags when incompatible o_flags are set (2018-04-10 11:03:32 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 6ab690aa439803347743c0d899ac422774fdd5e7:

  bpf: sockmap remove dead check (2018-04-20 22:09:51 +0200)

----------------------------------------------------------------
Anders Roxell (1):
      selftests: bpf: update .gitignore with missing generated files

Jann Horn (1):
      bpf: sockmap remove dead check

Yonghong Song (2):
      bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
      tools/bpf: fix test_sock and test_sock_addr.sh failure

 include/linux/bpf.h                           |  4 +--
 kernel/bpf/core.c                             | 45 +++++++++++++++++----------
 kernel/bpf/sockmap.c                          |  3 --
 kernel/trace/bpf_trace.c                      | 25 ++++++++++++---
 tools/testing/selftests/bpf/.gitignore        |  3 ++
 tools/testing/selftests/bpf/test_sock.c       |  1 +
 tools/testing/selftests/bpf/test_sock_addr.c  |  1 +
 tools/testing/selftests/bpf/test_sock_addr.sh |  4 +--
 8 files changed, 59 insertions(+), 27 deletions(-)

^ permalink raw reply

* [PATCH net-next, v2] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
From: Stephen Suryaputra @ 2018-04-20 23:52 UTC (permalink / raw)
  To: netdev, ja; +Cc: Stephen Suryaputra

This is enhanced from the proposed patch by Igor Maravic in 2011 to
support per interface IPv4 stats. The enhancement is mainly adding a
kernel configuration option CONFIG_IP_IFSTATS_TABLE.

Changes from v1:
- Count input statistics in the input device (per Julian Anastasov).
- Changes so that the existing per interface IPv6 stats aren't affected
  when the option isn't enabled.
- Restore the order of calling ipv4_proc_init().

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
---
 drivers/net/vrf.c               |   2 +-
 include/linux/inetdevice.h      |  22 ++++++
 include/net/icmp.h              |   6 +-
 include/net/ip.h                | 144 ++++++++++++++++++++++++++++++++++++++--
 include/net/ipv6.h              |  30 ++++-----
 include/net/netns/mib.h         |   3 +
 include/net/snmp.h              |  12 ++++
 net/bridge/br_netfilter_hooks.c |  10 +--
 net/dccp/ipv4.c                 |   4 +-
 net/ipv4/Kconfig                |   8 +++
 net/ipv4/datagram.c             |   2 +-
 net/ipv4/devinet.c              |  84 ++++++++++++++++++++++-
 net/ipv4/icmp.c                 |  32 ++++-----
 net/ipv4/inet_connection_sock.c |   8 ++-
 net/ipv4/ip_forward.c           |   8 +--
 net/ipv4/ip_fragment.c          |  20 +++---
 net/ipv4/ip_input.c             |  31 +++++----
 net/ipv4/ip_output.c            |  42 +++++++-----
 net/ipv4/ipmr.c                 |   6 +-
 net/ipv4/ping.c                 |   9 ++-
 net/ipv4/proc.c                 | 126 +++++++++++++++++++++++++++++++++++
 net/ipv4/raw.c                  |   4 +-
 net/ipv4/route.c                |   6 +-
 net/ipv4/tcp_ipv4.c             |   4 +-
 net/ipv4/udp.c                  |   4 +-
 net/l2tp/l2tp_ip.c              |   4 +-
 net/l2tp/l2tp_ip6.c             |   2 +-
 net/mpls/af_mpls.c              |   2 +-
 net/netfilter/ipvs/ip_vs_xmit.c |   6 +-
 net/sctp/input.c                |   2 +-
 net/sctp/output.c               |   2 +-
 31 files changed, 528 insertions(+), 117 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 90b5f39..2b17ead 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -593,7 +593,7 @@ static int vrf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index e16fe7d..3d120cb 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -22,6 +22,15 @@ struct ipv4_devconf {
 
 #define MC_HASH_SZ_LOG 9
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+struct ipv4_devstat {
+	struct proc_dir_entry *proc_dir_entry;
+	DEFINE_SNMP_STAT(struct ipstats_mib, ip);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmp_mib_device, icmpdev);
+	DEFINE_SNMP_STAT_ATOMIC(struct icmpmsg_mib_device, icmpmsgdev);
+};
+#endif
+
 struct in_device {
 	struct net_device	*dev;
 	refcount_t		refcnt;
@@ -45,6 +54,9 @@ struct in_device {
 
 	struct neigh_parms	*arp_parms;
 	struct ipv4_devconf	cnf;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct ipv4_devstat stats;
+#endif
 	struct rcu_head		rcu_head;
 };
 
@@ -216,6 +228,16 @@ static inline struct in_device *__in_dev_get_rcu(const struct net_device *dev)
 	return rcu_dereference(dev->ip_ptr);
 }
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static inline struct in_device *__in_dev_get_rcu_safely(const struct net_device *dev)
+{
+	if (likely(dev))
+		return rcu_dereference(dev->ip_ptr);
+	else
+		return NULL;
+}
+#endif
+
 static inline struct in_device *in_dev_get(const struct net_device *dev)
 {
 	struct in_device *in_dev;
diff --git a/include/net/icmp.h b/include/net/icmp.h
index 3ef2743..70612ca 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -29,10 +29,6 @@ struct icmp_err {
 };
 
 extern const struct icmp_err icmp_err_convert[];
-#define ICMP_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define __ICMP_INC_STATS(net, field)	__SNMP_INC_STATS((net)->mib.icmp_statistics, field)
-#define ICMPMSGOUT_INC_STATS(net, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field+256)
-#define ICMPMSGIN_INC_STATS(net, field)		SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field)
 
 struct dst_entry;
 struct net_proto_family;
@@ -43,6 +39,6 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
 int icmp_rcv(struct sk_buff *skb);
 void icmp_err(struct sk_buff *skb, u32 info);
 int icmp_init(void);
-void icmp_out_count(struct net *net, unsigned char type);
+void icmp_out_count(struct net_device *dev, unsigned char type);
 
 #endif	/* _ICMP_H */
diff --git a/include/net/ip.h b/include/net/ip.h
index dc4a2d6..ada33da 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -27,6 +27,7 @@
 #include <linux/in.h>
 #include <linux/skbuff.h>
 #include <linux/jhash.h>
+#include <linux/inetdevice.h>
 
 #include <net/inet_sock.h>
 #include <net/route.h>
@@ -218,12 +219,134 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 			   const struct ip_reply_arg *arg,
 			   unsigned int len);
 
-#define IP_INC_STATS(net, field)	SNMP_INC_STATS64((net)->mib.ip_statistics, field)
-#define __IP_INC_STATS(net, field)	__SNMP_INC_STATS64((net)->mib.ip_statistics, field)
-#define IP_ADD_STATS(net, field, val)	SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
-#define __IP_ADD_STATS(net, field, val) __SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
-#define IP_UPD_PO_STATS(net, field, val) SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
-#define __IP_UPD_PO_STATS(net, field, val) __SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#define _DEVINC(net, statname, mod, idev, field)			\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		mod##SNMP_INC_STATS((_idev)->stats.statname, (field));	\
+	mod##SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+
+/* per device counters are atomic_long_t */
+#define _DEVINCATOMIC(net, statname, mod, idev, field)	\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	mod##SNMP_INC_STATS((net)->mib.statname##_statistics, (field));	\
+})
+
+/* per device and per net counters are atomic_long_t */
+#define _DEVINC_ATOMIC_ATOMIC(net, statname, idev, field)	\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		SNMP_INC_STATS_ATOMIC_LONG((_idev)->stats.statname##dev, (field));	\
+	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field));	\
+})
+
+#define _DEVADD(net, statname, mod, idev, field, val)	\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		mod##SNMP_ADD_STATS((_idev)->stats.statname, (field), (val));	\
+	mod##SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));	\
+})
+
+#define _DEVUPD(net, statname, mod, idev, field, val)	\
+({	\
+	struct in_device *_idev = (idev);	\
+	if (likely(_idev))	\
+		mod##SNMP_UPD_PO_STATS((_idev)->stats.statname, field, (val));	\
+	mod##SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));	\
+})
+
+#define IP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC(net, ip, , __in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define __IP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC(net, ip, __, __in_dev_get_rcu_safely(dev), field); \
+		rcu_read_unlock();	\
+	})
+
+#define IP_ADD_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVADD(net, ip, , __in_dev_get_rcu_safely(dev), field, val); \
+		rcu_read_unlock();	\
+	})
+
+#define __IP_ADD_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVADD(net, ip, __, __in_dev_get_rcu_safely(dev), field, val); \
+		rcu_read_unlock();	\
+	})
+
+#define IP_UPD_PO_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVUPD(net, ip, , __in_dev_get_rcu_safely(dev), field, val); \
+		rcu_read_unlock();	\
+	})
+
+#define __IP_UPD_PO_STATS(net, dev, field, val)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVUPD(net, ip, __, __in_dev_get_rcu_safely(dev), field, val); \
+		rcu_read_unlock();	\
+	})
+
+#define ICMP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINCATOMIC(net, icmp, , __in_dev_get_rcu_safely(dev), field); \
+		rcu_read_unlock();	\
+	})
+
+#define __ICMP_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINCATOMIC(net, icmp, __,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+
+#define ICMPMSGOUT_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC_ATOMIC_ATOMIC(net, icmpmsg,	\
+			__in_dev_get_rcu_safely(dev), field+256);	\
+		rcu_read_unlock();	\
+	})
+
+#define ICMPMSGIN_INC_STATS(net, dev, field)	\
+	({	\
+		rcu_read_lock();	\
+		_DEVINC_ATOMIC_ATOMIC(net, icmpmsg,	\
+			__in_dev_get_rcu_safely(dev), field);	\
+		rcu_read_unlock();	\
+	})
+#else
+#define IP_INC_STATS(net, dev, field)	SNMP_INC_STATS64((net)->mib.ip_statistics, field)
+#define __IP_INC_STATS(net, dev, field)	__SNMP_INC_STATS64((net)->mib.ip_statistics, field)
+#define IP_ADD_STATS(net, dev, field, val)	SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
+#define __IP_ADD_STATS(net, dev, field, val) __SNMP_ADD_STATS64((net)->mib.ip_statistics, field, val)
+#define IP_UPD_PO_STATS(net, dev, field, val) SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+#define __IP_UPD_PO_STATS(net, dev, field, val) __SNMP_UPD_PO_STATS64((net)->mib.ip_statistics, field, val)
+
+#define ICMP_INC_STATS(net, dev, field)	SNMP_INC_STATS((net)->mib.icmp_statistics, field)
+#define __ICMP_INC_STATS(net, dev, field)	__SNMP_INC_STATS((net)->mib.icmp_statistics, field)
+#define ICMPMSGOUT_INC_STATS(net, dev, field)	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field+256)
+#define ICMPMSGIN_INC_STATS(net, dev, field)		SNMP_INC_STATS_ATOMIC_LONG((net)->mib.icmpmsg_statistics, field)
+#endif
 #define NET_INC_STATS(net, field)	SNMP_INC_STATS((net)->mib.net_statistics, field)
 #define __NET_INC_STATS(net, field)	__SNMP_INC_STATS((net)->mib.net_statistics, field)
 #define NET_ADD_STATS(net, field, adnd)	SNMP_ADD_STATS((net)->mib.net_statistics, field, adnd)
@@ -663,4 +786,13 @@ extern int sysctl_icmp_msgs_burst;
 int ip_misc_proc_init(void);
 #endif
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+#ifdef CONFIG_PROC_FS
+extern int snmp_register_dev(struct in_device *idev);
+extern int snmp_unregister_dev(struct in_device *idev);
+#else
+extern int snmp_register_dev(struct in_device *idev) { return 0; }
+extern int snmp_unregister_dev(struct in_device *idev) { return 0; }
+#endif
+#endif
 #endif	/* _IP_H */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 68b167d..a26ffc9 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -163,7 +163,7 @@ struct frag_hdr {
 extern int sysctl_mld_max_msf;
 extern int sysctl_mld_qrv;
 
-#define _DEVINC(net, statname, mod, idev, field)			\
+#define _DEVINC6(net, statname, mod, idev, field)			\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -172,7 +172,7 @@ extern int sysctl_mld_qrv;
 })
 
 /* per device counters are atomic_long_t */
-#define _DEVINCATOMIC(net, statname, mod, idev, field)			\
+#define _DEVINCATOMIC6(net, statname, mod, idev, field)			\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -181,7 +181,7 @@ extern int sysctl_mld_qrv;
 })
 
 /* per device and per net counters are atomic_long_t */
-#define _DEVINC_ATOMIC_ATOMIC(net, statname, idev, field)		\
+#define _DEVINC_ATOMIC_ATOMIC6(net, statname, idev, field)		\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -189,7 +189,7 @@ extern int sysctl_mld_qrv;
 	SNMP_INC_STATS_ATOMIC_LONG((net)->mib.statname##_statistics, (field));\
 })
 
-#define _DEVADD(net, statname, mod, idev, field, val)			\
+#define _DEVADD6(net, statname, mod, idev, field, val)			\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -197,7 +197,7 @@ extern int sysctl_mld_qrv;
 	mod##SNMP_ADD_STATS((net)->mib.statname##_statistics, (field), (val));\
 })
 
-#define _DEVUPD(net, statname, mod, idev, field, val)			\
+#define _DEVUPD6(net, statname, mod, idev, field, val)			\
 ({									\
 	struct inet6_dev *_idev = (idev);				\
 	if (likely(_idev != NULL))					\
@@ -208,26 +208,26 @@ extern int sysctl_mld_qrv;
 /* MIBs */
 
 #define IP6_INC_STATS(net, idev,field)		\
-		_DEVINC(net, ipv6, , idev, field)
+		_DEVINC6(net, ipv6, , idev, field)
 #define __IP6_INC_STATS(net, idev,field)	\
-		_DEVINC(net, ipv6, __, idev, field)
+		_DEVINC6(net, ipv6, __, idev, field)
 #define IP6_ADD_STATS(net, idev,field,val)	\
-		_DEVADD(net, ipv6, , idev, field, val)
+		_DEVADD6(net, ipv6, , idev, field, val)
 #define __IP6_ADD_STATS(net, idev,field,val)	\
-		_DEVADD(net, ipv6, __, idev, field, val)
+		_DEVADD6(net, ipv6, __, idev, field, val)
 #define IP6_UPD_PO_STATS(net, idev,field,val)   \
-		_DEVUPD(net, ipv6, , idev, field, val)
+		_DEVUPD6(net, ipv6, , idev, field, val)
 #define __IP6_UPD_PO_STATS(net, idev,field,val)   \
-		_DEVUPD(net, ipv6, __, idev, field, val)
+		_DEVUPD6(net, ipv6, __, idev, field, val)
 #define ICMP6_INC_STATS(net, idev, field)	\
-		_DEVINCATOMIC(net, icmpv6, , idev, field)
+		_DEVINCATOMIC6(net, icmpv6, , idev, field)
 #define __ICMP6_INC_STATS(net, idev, field)	\
-		_DEVINCATOMIC(net, icmpv6, __, idev, field)
+		_DEVINCATOMIC6(net, icmpv6, __, idev, field)
 
 #define ICMP6MSGOUT_INC_STATS(net, idev, field)		\
-	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, idev, field +256)
+	_DEVINC_ATOMIC_ATOMIC6(net, icmpv6msg, idev, field +256)
 #define ICMP6MSGIN_INC_STATS(net, idev, field)	\
-	_DEVINC_ATOMIC_ATOMIC(net, icmpv6msg, idev, field)
+	_DEVINC_ATOMIC_ATOMIC6(net, icmpv6msg, idev, field)
 
 struct ip6_ra_chain {
 	struct ip6_ra_chain	*next;
diff --git a/include/net/netns/mib.h b/include/net/netns/mib.h
index 830bdf3..798bbc2 100644
--- a/include/net/netns/mib.h
+++ b/include/net/netns/mib.h
@@ -5,6 +5,9 @@
 #include <net/snmp.h>
 
 struct netns_mib {
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct proc_dir_entry *proc_net_devsnmp;
+#endif
 	DEFINE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 	DEFINE_SNMP_STAT(struct ipstats_mib, ip_statistics);
 	DEFINE_SNMP_STAT(struct linux_mib, net_statistics);
diff --git a/include/net/snmp.h b/include/net/snmp.h
index c9228ad..b7c99a3 100644
--- a/include/net/snmp.h
+++ b/include/net/snmp.h
@@ -61,14 +61,26 @@ struct ipstats_mib {
 
 /* ICMP */
 #define ICMP_MIB_MAX	__ICMP_MIB_MAX
+/* per network ns counters */
 struct icmp_mib {
 	unsigned long	mibs[ICMP_MIB_MAX];
 };
 
 #define ICMPMSG_MIB_MAX	__ICMPMSG_MIB_MAX
+/* per network ns counters */
 struct icmpmsg_mib {
 	atomic_long_t	mibs[ICMPMSG_MIB_MAX];
 };
+#ifdef CONFIG_IP_IFSTATS_TABLE
+/* per device counters, (shared on all cpus) */
+struct icmp_mib_device {
+	atomic_long_t	mibs[ICMP_MIB_MAX];
+};
+/* per device counters, (shared on all cpus) */
+struct icmpmsg_mib_device {
+	atomic_long_t	mibs[ICMPMSG_MIB_MAX];
+};
+#endif
 
 /* ICMP6 (IPv6-ICMP) */
 #define ICMP6_MIB_MAX	__ICMP6_MIB_MAX
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 9b16eaf..f9576b7 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -218,13 +218,13 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	} else if (len < (iph->ihl*4))
 		goto inhdr_error;
 
 	if (pskb_trim_rcsum(skb, len)) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -237,9 +237,9 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 	return 0;
 
 csum_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INHDRERRORS);
 drop:
 	return -1;
 }
@@ -691,7 +691,7 @@ br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_FRAGFAILS);
 		kfree_skb(skb);
 		return -EMSGSIZE;
 	}
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index b08feb2..40e4c00 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -258,7 +258,7 @@ static void dccp_v4_err(struct sk_buff *skb, u32 info)
 				       iph->saddr, ntohs(dh->dccph_sport),
 				       inet_iif(skb), 0);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 
@@ -468,7 +468,7 @@ static struct dst_entry* dccp_v4_route_skb(struct net *net, struct sock *sk,
 	security_skb_classify_flow(skb, flowi4_to_flowi(&fl4));
 	rt = ip_route_output_flow(net, &fl4, sk);
 	if (IS_ERR(rt)) {
-		IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+		IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 		return NULL;
 	}
 
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 80dad30..6470b95 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -52,6 +52,14 @@ config IP_ADVANCED_ROUTER
 
 	  If unsure, say N here.
 
+config IP_IFSTATS_TABLE
+	def_bool n
+	depends on IP_ADVANCED_ROUTER
+	prompt "IP: interface statistics"
+	help
+	  This option enables per interface statistics for IPv4. Refer to
+	  RFC 4293.
+
 config IP_FIB_TRIE_STATS
 	bool "FIB TRIE statistics"
 	depends on IP_ADVANCED_ROUTER
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f915abf..7acaadb 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -55,7 +55,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		goto out;
 	}
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 40f0017..c2809f3 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -218,6 +218,49 @@ static void inet_free_ifa(struct in_ifaddr *ifa)
 	call_rcu(&ifa->rcu_head, inet_rcu_free_ifa);
 }
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static int snmp_alloc_dev(struct in_device *idev)
+{
+	int i;
+
+	idev->stats.ip = alloc_percpu(struct ipstats_mib);
+	if (!idev->stats.ip)
+		goto err_ip;
+
+	for_each_possible_cpu(i) {
+		struct ipstats_mib *addrconf_stats;
+
+		addrconf_stats = per_cpu_ptr(idev->stats.ip, i);
+		u64_stats_init(&addrconf_stats->syncp);
+	}
+
+	idev->stats.icmpdev = kzalloc(sizeof(*idev->stats.icmpdev),
+				      GFP_KERNEL);
+	if (!idev->stats.icmpdev)
+		goto err_icmp;
+	idev->stats.icmpmsgdev = kzalloc(sizeof(*idev->stats.icmpmsgdev),
+					 GFP_KERNEL);
+	if (!idev->stats.icmpmsgdev)
+		goto err_icmpmsg;
+
+	return 0;
+
+err_icmpmsg:
+	kfree(idev->stats.icmpdev);
+err_icmp:
+	free_percpu(idev->stats.ip);
+err_ip:
+	return -ENOMEM;
+}
+
+static void snmp_free_dev(struct in_device *idev)
+{
+	kfree(idev->stats.icmpmsgdev);
+	kfree(idev->stats.icmpdev);
+	free_percpu(idev->stats.ip);
+}
+#endif
+
 void in_dev_finish_destroy(struct in_device *idev)
 {
 	struct net_device *dev = idev->dev;
@@ -229,10 +272,14 @@ void in_dev_finish_destroy(struct in_device *idev)
 	pr_debug("%s: %p=%s\n", __func__, idev, dev ? dev->name : "NIL");
 #endif
 	dev_put(dev);
-	if (!idev->dead)
+	if (!idev->dead) {
 		pr_err("Freeing alive in_device %p\n", idev);
-	else
+	} else {
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		snmp_free_dev(idev);
+#endif
 		kfree(idev);
+	}
 }
 EXPORT_SYMBOL(in_dev_finish_destroy);
 
@@ -257,6 +304,25 @@ static struct in_device *inetdev_init(struct net_device *dev)
 		dev_disable_lro(dev);
 	/* Reference in_dev->dev */
 	dev_hold(dev);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	err = snmp_alloc_dev(in_dev);
+	if (err < 0) {
+		netdev_crit(dev,
+			    "%s(): cannot allocate memory for statistics; dev=%s err=%d.\n",
+			    __func__, dev->name, err);
+		neigh_parms_release(&arp_tbl, in_dev->arp_parms);
+		dev_put(dev);
+		kfree(in_dev);
+		return NULL;
+	}
+
+	err = snmp_register_dev(in_dev);
+	if (err < 0)
+		netdev_warn(dev,
+			    "%s(): cannot create /proc/net/dev_snmp/%s err=%d\n",
+			    __func__, dev->name, err);
+#endif
+
 	/* Account for reference dev->ip_ptr (below) */
 	refcount_set(&in_dev->refcnt, 1);
 
@@ -306,6 +372,9 @@ static void inetdev_destroy(struct in_device *in_dev)
 	}
 
 	RCU_INIT_POINTER(dev->ip_ptr, NULL);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	snmp_unregister_dev(in_dev);
+#endif
 
 	devinet_sysctl_unregister(in_dev);
 	neigh_parms_release(&arp_tbl, in_dev->arp_parms);
@@ -1529,8 +1598,19 @@ static int inetdev_event(struct notifier_block *this, unsigned long event,
 		 */
 		inetdev_changename(dev, in_dev);
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		snmp_unregister_dev(in_dev);
+#endif
 		devinet_sysctl_unregister(in_dev);
 		devinet_sysctl_register(in_dev);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		{
+			int err = snmp_register_dev(in_dev);
+
+			if (err)
+				return notifier_from_errno(err);
+		}
+#endif
 		break;
 	}
 out:
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 1617604..4d5c092 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -338,10 +338,10 @@ static bool icmpv4_xrlim_allow(struct net *net, struct rtable *rt,
 /*
  *	Maintain the counters used in the SNMP statistics for outgoing ICMP
  */
-void icmp_out_count(struct net *net, unsigned char type)
+void icmp_out_count(struct net_device *dev, unsigned char type)
 {
-	ICMPMSGOUT_INC_STATS(net, type);
-	ICMP_INC_STATS(net, ICMP_MIB_OUTMSGS);
+	ICMPMSGOUT_INC_STATS(dev_net(dev), dev, type);
+	ICMP_INC_STATS(dev_net(dev), dev, ICMP_MIB_OUTMSGS);
 }
 
 /*
@@ -370,13 +370,14 @@ static void icmp_push_reply(struct icmp_bxm *icmp_param,
 {
 	struct sock *sk;
 	struct sk_buff *skb;
+	struct net_device *dev = (*rt)->dst.dev;
 
-	sk = icmp_sk(dev_net((*rt)->dst.dev));
+	sk = icmp_sk(dev_net(dev));
 	if (ip_append_data(sk, fl4, icmp_glue_bits, icmp_param,
 			   icmp_param->data_len+icmp_param->head_len,
 			   icmp_param->head_len,
 			   ipc, rt, MSG_DONTWAIT) < 0) {
-		__ICMP_INC_STATS(sock_net(sk), ICMP_MIB_OUTERRORS);
+		__ICMP_INC_STATS(sock_net(sk), dev, ICMP_MIB_OUTERRORS);
 		ip_flush_pending_frames(sk);
 	} else if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
 		struct icmphdr *icmph = icmp_hdr(skb);
@@ -760,7 +761,7 @@ static void icmp_socket_deliver(struct sk_buff *skb, u32 info)
 	 * avoid additional coding at protocol handlers.
 	 */
 	if (!pskb_may_pull(skb, iph->ihl * 4 + 8)) {
-		__ICMP_INC_STATS(dev_net(skb->dev), ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(dev_net(skb->dev), skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 
@@ -792,8 +793,9 @@ static bool icmp_unreach(struct sk_buff *skb)
 	struct icmphdr *icmph;
 	struct net *net;
 	u32 info = 0;
+	struct net_device *dev = skb_dst(skb)->dev;
 
-	net = dev_net(skb_dst(skb)->dev);
+	net = dev_net(dev);
 
 	/*
 	 *	Incomplete header ?
@@ -852,7 +854,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 		info = ntohl(icmph->un.gateway) >> 24;
 		break;
 	case ICMP_TIME_EXCEEDED:
-		__ICMP_INC_STATS(net, ICMP_MIB_INTIMEEXCDS);
+		__ICMP_INC_STATS(net, dev, ICMP_MIB_INTIMEEXCDS);
 		if (icmph->code == ICMP_EXC_FRAGTIME)
 			goto out;
 		break;
@@ -890,7 +892,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 out:
 	return true;
 out_err:
-	__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(net, dev, ICMP_MIB_INERRORS);
 	return false;
 }
 
@@ -902,7 +904,7 @@ static bool icmp_unreach(struct sk_buff *skb)
 static bool icmp_redirect(struct sk_buff *skb)
 {
 	if (skb->len < sizeof(struct iphdr)) {
-		__ICMP_INC_STATS(dev_net(skb->dev), ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(dev_net(skb->dev), skb->dev, ICMP_MIB_INERRORS);
 		return false;
 	}
 
@@ -982,7 +984,7 @@ static bool icmp_timestamp(struct sk_buff *skb)
 	return true;
 
 out_err:
-	__ICMP_INC_STATS(dev_net(skb_dst(skb)->dev), ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(dev_net(skb_dst(skb)->dev), skb_dst(skb)->dev, ICMP_MIB_INERRORS);
 	return false;
 }
 
@@ -1022,7 +1024,7 @@ int icmp_rcv(struct sk_buff *skb)
 		skb_set_network_header(skb, nh);
 	}
 
-	__ICMP_INC_STATS(net, ICMP_MIB_INMSGS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INMSGS);
 
 	if (skb_checksum_simple_validate(skb))
 		goto csum_error;
@@ -1032,7 +1034,7 @@ int icmp_rcv(struct sk_buff *skb)
 
 	icmph = icmp_hdr(skb);
 
-	ICMPMSGIN_INC_STATS(net, icmph->type);
+	ICMPMSGIN_INC_STATS(net, skb->dev, icmph->type);
 	/*
 	 *	18 is the highest 'known' ICMP type. Anything else is a mystery
 	 *
@@ -1078,9 +1080,9 @@ int icmp_rcv(struct sk_buff *skb)
 	kfree_skb(skb);
 	return NET_RX_DROP;
 csum_error:
-	__ICMP_INC_STATS(net, ICMP_MIB_CSUMERRORS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_CSUMERRORS);
 error:
-	__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+	__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 	goto drop;
 }
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 881ac6d..30afff4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -539,6 +539,7 @@ struct dst_entry *inet_csk_route_req(const struct sock *sk,
 	struct net *net = read_pnet(&ireq->ireq_net);
 	struct ip_options_rcu *opt;
 	struct rtable *rt;
+	struct net_device *dev = NULL;
 
 	opt = ireq_opt_deref(ireq);
 
@@ -557,9 +558,10 @@ struct dst_entry *inet_csk_route_req(const struct sock *sk,
 	return &rt->dst;
 
 route_err:
+	dev = rt->dst.dev;
 	ip_rt_put(rt);
 no_route:
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_OUTNOROUTES);
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(inet_csk_route_req);
@@ -574,6 +576,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 	struct ip_options_rcu *opt;
 	struct flowi4 *fl4;
 	struct rtable *rt;
+	struct net_device *dev = NULL;
 
 	opt = rcu_dereference(ireq->ireq_opt);
 	fl4 = &newinet->cork.fl.u.ip4;
@@ -593,9 +596,10 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 	return &rt->dst;
 
 route_err:
+	dev = rt->dst.dev;
 	ip_rt_put(rt);
 no_route:
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_OUTNOROUTES);
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(inet_csk_route_child_sock);
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index b54b948..c84e177 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -66,8 +66,8 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 {
 	struct ip_options *opt	= &(IPCB(skb)->opt);
 
-	__IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
-	__IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
+	__IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	__IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -121,7 +121,7 @@ int ip_forward(struct sk_buff *skb)
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
 	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
 	if (ip_exceeds_mtu(skb, mtu)) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		goto drop;
@@ -158,7 +158,7 @@ int ip_forward(struct sk_buff *skb)
 
 too_many_hops:
 	/* Tell the sender its packet died... */
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INHDRERRORS);
 	icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 drop:
 	kfree_skb(skb);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 8e9528e..6324254 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,6 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	const struct iphdr *iph;
+	struct net_device *dev;
 	struct sk_buff *head;
 	struct net *net;
 	struct ipq *qp;
@@ -151,16 +152,17 @@ static void ip_expire(struct timer_list *t)
 		goto out;
 
 	ipq_kill(qp);
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	dev = dev_get_by_index_rcu(net, qp->iif);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 
 	head = qp->q.fragments;
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMTIMEOUT);
 
 	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
 		goto out;
 
-	head->dev = dev_get_by_index_rcu(net, qp->iif);
+	head->dev = dev;
 	if (!head->dev)
 		goto out;
 
@@ -237,7 +239,9 @@ static int ip_frag_too_far(struct ipq *qp)
 		struct net *net;
 
 		net = container_of(qp->q.net, struct net, ipv4.frags);
-		__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+		rcu_read_lock();
+		__IP_INC_STATS(net, dev_get_by_index_rcu(net, qp->iif), IPSTATS_MIB_REASMFAILS);
+		rcu_read_unlock();
 	}
 
 	return rc;
@@ -582,7 +586,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 
 	ip_send_check(iph);
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;
 	return 0;
@@ -594,7 +598,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 out_oversize:
 	net_info_ratelimited("Oversized IP packet from %pI4\n", &qp->q.key.v4.saddr);
 out_fail:
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 	return err;
 }
 
@@ -605,7 +609,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 	int vif = l3mdev_master_ifindex_rcu(dev);
 	struct ipq *qp;
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMREQDS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMREQDS);
 	skb_orphan(skb);
 
 	/* Lookup (or create) queue header */
@@ -622,7 +626,7 @@ int ip_defrag(struct net *net, struct sk_buff *skb, u32 user)
 		return ret;
 	}
 
-	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_REASMFAILS);
 	kfree_skb(skb);
 	return -ENOMEM;
 }
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 7582713..2d3e073 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -197,6 +197,9 @@ static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_b
 		int protocol = ip_hdr(skb)->protocol;
 		const struct net_protocol *ipprot;
 		int raw;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+		struct net_device *dev = skb->dev;
+#endif
 
 	resubmit:
 		raw = raw_local_deliver(skb, protocol);
@@ -217,17 +220,17 @@ static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_b
 				protocol = -ret;
 				goto resubmit;
 			}
-			__IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+			__IP_INC_STATS(net, dev, IPSTATS_MIB_INDELIVERS);
 		} else {
 			if (!raw) {
 				if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-					__IP_INC_STATS(net, IPSTATS_MIB_INUNKNOWNPROTOS);
+					__IP_INC_STATS(net, dev, IPSTATS_MIB_INUNKNOWNPROTOS);
 					icmp_send(skb, ICMP_DEST_UNREACH,
 						  ICMP_PROT_UNREACH, 0);
 				}
 				kfree_skb(skb);
 			} else {
-				__IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+				__IP_INC_STATS(net, dev, IPSTATS_MIB_INDELIVERS);
 				consume_skb(skb);
 			}
 		}
@@ -272,7 +275,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 					      --ANK (980813)
 	*/
 	if (skb_cow(skb, skb_headroom(skb))) {
-		__IP_INC_STATS(dev_net(dev), IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(dev_net(dev), dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -281,7 +284,7 @@ static inline bool ip_rcv_options(struct sk_buff *skb)
 	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
 
 	if (ip_options_compile(dev_net(dev), opt, skb)) {
-		__IP_INC_STATS(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
+		__IP_INC_STATS(dev_net(dev), dev, IPSTATS_MIB_INHDRERRORS);
 		goto drop;
 	}
 
@@ -366,9 +369,9 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 
 	rt = skb_rtable(skb);
 	if (rt->rt_type == RTN_MULTICAST) {
-		__IP_UPD_PO_STATS(net, IPSTATS_MIB_INMCAST, skb->len);
+		__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_INMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST) {
-		__IP_UPD_PO_STATS(net, IPSTATS_MIB_INBCAST, skb->len);
+		__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_INBCAST, skb->len);
 	} else if (skb->pkt_type == PACKET_BROADCAST ||
 		   skb->pkt_type == PACKET_MULTICAST) {
 		struct in_device *in_dev = __in_dev_get_rcu(dev);
@@ -422,11 +425,11 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 
 	net = dev_net(dev);
-	__IP_UPD_PO_STATS(net, IPSTATS_MIB_IN, skb->len);
+	__IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_IN, skb->len);
 
 	skb = skb_share_check(skb, GFP_ATOMIC);
 	if (!skb) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INDISCARDS);
 		goto out;
 	}
 
@@ -452,7 +455,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	BUILD_BUG_ON(IPSTATS_MIB_ECT1PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_1);
 	BUILD_BUG_ON(IPSTATS_MIB_ECT0PKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_ECT_0);
 	BUILD_BUG_ON(IPSTATS_MIB_CEPKTS != IPSTATS_MIB_NOECTPKTS + INET_ECN_CE);
-	__IP_ADD_STATS(net,
+	__IP_ADD_STATS(net, dev,
 		       IPSTATS_MIB_NOECTPKTS + (iph->tos & INET_ECN_MASK),
 		       max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
 
@@ -466,7 +469,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 
 	len = ntohs(iph->tot_len);
 	if (skb->len < len) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INTRUNCATEDPKTS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INTRUNCATEDPKTS);
 		goto drop;
 	} else if (len < (iph->ihl*4))
 		goto inhdr_error;
@@ -476,7 +479,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	 * Note this now means skb->len holds ntohs(iph->tot_len).
 	 */
 	if (pskb_trim_rcsum(skb, len)) {
-		__IP_INC_STATS(net, IPSTATS_MIB_INDISCARDS);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INDISCARDS);
 		goto drop;
 	}
 
@@ -494,9 +497,9 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 
 csum_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_CSUMERRORS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_CSUMERRORS);
 inhdr_error:
-	__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+	__IP_INC_STATS(net, dev, IPSTATS_MIB_INHDRERRORS);
 drop:
 	kfree_skb(skb);
 out:
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4c11b81..54194a9 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -191,9 +191,9 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	u32 nexthop;
 
 	if (rt->rt_type == RTN_MULTICAST) {
-		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTMCAST, skb->len);
+		IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUTMCAST, skb->len);
 	} else if (rt->rt_type == RTN_BROADCAST)
-		IP_UPD_PO_STATS(net, IPSTATS_MIB_OUTBCAST, skb->len);
+		IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUTBCAST, skb->len);
 
 	/* Be paranoid, rather than too clever. */
 	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
@@ -339,7 +339,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	/*
 	 *	If the indicated interface is up and running, send the packet.
 	 */
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
@@ -397,7 +397,7 @@ int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
 
-	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
+	IP_UPD_PO_STATS(net, dev, IPSTATS_MIB_OUT, skb->len);
 
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
@@ -507,7 +507,7 @@ int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
 
 no_route:
 	rcu_read_unlock();
-	IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+	IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	return -EHOSTUNREACH;
 }
@@ -548,7 +548,7 @@ static int ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	if (unlikely(!skb->ignore_df ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, skb->dev, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
 		kfree_skb(skb);
@@ -575,8 +575,11 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 	int offset;
 	__be16 not_last_frag;
 	struct rtable *rt = skb_rtable(skb);
+	struct net_device *dev = rt->dst.dev;
 	int err = 0;
 
+	dev_hold(dev);
+
 	/* for offloaded checksums cleanup checksum before fragmentation */
 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
 	    (err = skb_checksum_help(skb)))
@@ -675,7 +678,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			err = output(net, sk, skb);
 
 			if (!err)
-				IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
+				IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGCREATES);
 			if (err || !frag)
 				break;
 
@@ -685,7 +688,8 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		}
 
 		if (err == 0) {
-			IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
+			IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGOKS);
+			dev_put(dev);
 			return 0;
 		}
 
@@ -694,7 +698,8 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 			kfree_skb(frag);
 			frag = skb;
 		}
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
+		dev_put(dev);
 		return err;
 
 slow_path_clean:
@@ -811,15 +816,17 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 		if (err)
 			goto fail;
 
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGCREATES);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGCREATES);
 	}
 	consume_skb(skb);
-	IP_INC_STATS(net, IPSTATS_MIB_FRAGOKS);
+	IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGOKS);
+	dev_put(dev);
 	return err;
 
 fail:
 	kfree_skb(skb);
-	IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+	IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
+	dev_put(dev);
 	return err;
 }
 EXPORT_SYMBOL(ip_do_fragment);
@@ -1098,7 +1105,7 @@ static int __ip_append_data(struct sock *sk,
 	err = -EFAULT;
 error:
 	cork->length -= length;
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(sock_net(sk), rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
 	return err;
 }
@@ -1306,7 +1313,7 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 
 error:
 	cork->length -= size;
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(sock_net(sk), rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	return err;
 }
 
@@ -1407,7 +1414,7 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 	skb_dst_set(skb, &rt->dst);
 
 	if (iph->protocol == IPPROTO_ICMP)
-		icmp_out_count(net, ((struct icmphdr *)
+		icmp_out_count(skb_dst(skb)->dev, ((struct icmphdr *)
 			skb_transport_header(skb))->type);
 
 	ip_cork_release(cork);
@@ -1418,13 +1425,16 @@ struct sk_buff *__ip_make_skb(struct sock *sk,
 int ip_send_skb(struct net *net, struct sk_buff *skb)
 {
 	int err;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	struct net_device *dev = skb_dst(skb)->dev;
+#endif
 
 	err = ip_local_out(net, skb->sk, skb);
 	if (err) {
 		if (err > 0)
 			err = net_xmit_errno(err);
 		if (err)
-			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+			IP_INC_STATS(net, dev, IPSTATS_MIB_OUTDISCARDS);
 	}
 
 	return err;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 2fb4de3..67ec987 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1778,8 +1778,8 @@ static inline int ipmr_forward_finish(struct net *net, struct sock *sk,
 {
 	struct ip_options *opt = &(IPCB(skb)->opt);
 
-	IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
-	IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
+	IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS);
+	IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len);
 
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);
@@ -1862,7 +1862,7 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
 		 * allow to send ICMP, so that packets will disappear
 		 * to blackhole.
 		 */
-		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
+		IP_INC_STATS(net, dev, IPSTATS_MIB_FRAGFAILS);
 		ip_rt_put(rt);
 		goto out_free;
 	}
diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 05e47d7..1b4db11 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -712,6 +712,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	__be32 saddr, daddr, faddr;
 	u8  tos;
 	int err;
+	struct net_device *dev;
 
 	pr_debug("ping_v4_sendmsg(sk=%p,sk->num=%u)\n", inet, inet->inet_num);
 
@@ -805,7 +806,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		err = PTR_ERR(rt);
 		rt = NULL;
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 		goto out;
 	}
 
@@ -841,13 +842,17 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	release_sock(sk);
 
 out:
+	dev = rt->dst.dev;
+	dev_hold(dev);
 	ip_rt_put(rt);
 	if (free)
 		kfree(ipc.opt);
 	if (!err) {
-		icmp_out_count(sock_net(sk), user_icmph.type);
+		icmp_out_count(dev, user_icmph.type);
+		dev_put(dev);
 		return len;
 	}
+	dev_put(dev);
 	return err;
 
 do_confirm:
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index a058de6..c0c822f 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -134,6 +134,17 @@ static const struct snmp_mib snmp4_ipextstats_list[] = {
 	SNMP_MIB_SENTINEL
 };
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static const struct snmp_mib snmp4_icmp_list[] = {
+	SNMP_MIB_ITEM("InMsgs", ICMP_MIB_INMSGS),
+	SNMP_MIB_ITEM("InErrors", ICMP_MIB_INERRORS),
+	SNMP_MIB_ITEM("OutMsgs", ICMP_MIB_OUTMSGS),
+	SNMP_MIB_ITEM("OutErrors", ICMP_MIB_OUTERRORS),
+	SNMP_MIB_ITEM("InCsumErrors", ICMP_MIB_CSUMERRORS),
+	SNMP_MIB_SENTINEL
+};
+#endif
+
 static const struct {
 	const char *name;
 	int index;
@@ -473,6 +484,109 @@ static const struct file_operations snmp_seq_fops = {
 };
 
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+static void snmp_seq_show_item(struct seq_file *seq, void __percpu **pcpumib,
+			       atomic_long_t *smib,
+			       const struct snmp_mib *itemlist,
+			       char *prefix)
+{
+	char name[32];
+	int i;
+	unsigned long val;
+
+	for (i = 0; itemlist[i].name; i++) {
+		val = pcpumib ?
+			snmp_fold_field64(pcpumib, itemlist[i].entry,
+					  offsetof(struct ipstats_mib, syncp)) :
+			atomic_long_read(smib + itemlist[i].entry);
+		snprintf(name, sizeof(name), "%s%s",
+			 prefix, itemlist[i].name);
+		seq_printf(seq, "%-32s\t%lu\n", name, val);
+	}
+}
+
+static void snmp_seq_show_icmpmsg(struct seq_file *seq, atomic_long_t *smib)
+{
+	char name[32];
+	int i;
+	unsigned long val;
+
+	for (i = 0; i < ICMPMSG_MIB_MAX; i++) {
+		val = atomic_long_read(smib + i);
+		if (val) {
+			snprintf(name, sizeof(name), "Icmp%sType%u",
+				 i & 0x100 ? "Out" : "In", i & 0xff);
+			seq_printf(seq, "%-32s\t%lu\n", name, val);
+		}
+	}
+}
+
+static int snmp_dev_seq_show(struct seq_file *seq, void *v)
+{
+	struct in_device *idev = (struct in_device *)seq->private;
+
+	seq_printf(seq, "%-32s\t%u\n", "ifIndex", idev->dev->ifindex);
+
+	BUILD_BUG_ON(offsetof(struct ipstats_mib, mibs) != 0);
+
+	snmp_seq_show_item(seq, (void __percpu **)idev->stats.ip, NULL,
+			   snmp4_ipstats_list, "Ip");
+	snmp_seq_show_item(seq, (void __percpu **)idev->stats.ip, NULL,
+			   snmp4_ipextstats_list, "Ip");
+	snmp_seq_show_item(seq, NULL, idev->stats.icmpdev->mibs,
+			   snmp4_icmp_list, "Icmp");
+	snmp_seq_show_icmpmsg(seq, idev->stats.icmpmsgdev->mibs);
+	return 0;
+}
+
+static int snmp_dev_seq_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, snmp_dev_seq_show, PDE_DATA(inode));
+}
+
+static const struct file_operations snmp_dev_seq_fops = {
+	.owner   = THIS_MODULE,
+	.open    = snmp_dev_seq_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = single_release,
+};
+
+int snmp_register_dev(struct in_device *idev)
+{
+	struct proc_dir_entry *p;
+	struct net *net;
+
+	if (!idev || !idev->dev)
+		return -EINVAL;
+
+	net = dev_net(idev->dev);
+	if (!net->mib.proc_net_devsnmp)
+		return -ENOENT;
+
+	p = proc_create_data(idev->dev->name, 0444,
+			     net->mib.proc_net_devsnmp,
+			     &snmp_dev_seq_fops, idev);
+	if (!p)
+		return -ENOMEM;
+
+	idev->stats.proc_dir_entry = p;
+	return 0;
+}
+
+int snmp_unregister_dev(struct in_device *idev)
+{
+	struct net *net = dev_net(idev->dev);
+
+	if (!net->mib.proc_net_devsnmp)
+		return -ENOENT;
+	if (!idev->stats.proc_dir_entry)
+		return -EINVAL;
+	proc_remove(idev->stats.proc_dir_entry);
+	idev->stats.proc_dir_entry = NULL;
+	return 0;
+}
+#endif
 
 /*
  *	Output /proc/net/netstat
@@ -528,9 +642,18 @@ static __net_init int ip_proc_init_net(struct net *net)
 		goto out_netstat;
 	if (!proc_create("snmp", 0444, net->proc_net, &snmp_seq_fops))
 		goto out_snmp;
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	net->mib.proc_net_devsnmp = proc_mkdir("dev_snmp", net->proc_net);
+	if (!net->mib.proc_net_devsnmp)
+		goto out_dev_snmp;
+#endif
 
 	return 0;
 
+#ifdef CONFIG_IP_IFSTATS_TABLE
+out_dev_snmp:
+	remove_proc_entry("snmp", net->proc_net);
+#endif
 out_snmp:
 	remove_proc_entry("netstat", net->proc_net);
 out_netstat:
@@ -544,6 +667,9 @@ static __net_exit void ip_proc_exit_net(struct net *net)
 	remove_proc_entry("snmp", net->proc_net);
 	remove_proc_entry("netstat", net->proc_net);
 	remove_proc_entry("sockstat", net->proc_net);
+#ifdef CONFIG_IP_IFSTATS_TABLE
+	remove_proc_entry("dev_snmp", net->proc_net);
+#endif
 }
 
 static __net_initdata struct pernet_operations ip_proc_ops = {
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 1b4d335..f39f87a 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -425,7 +425,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 		skb->transport_header += iphlen;
 		if (iph->protocol == IPPROTO_ICMP &&
 		    length >= iphlen + sizeof(struct icmphdr))
-			icmp_out_count(net, ((struct icmphdr *)
+			icmp_out_count(rt->dst.dev, ((struct icmphdr *)
 				skb_transport_header(skb))->type);
 	}
 
@@ -442,7 +442,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 error_free:
 	kfree_skb(skb);
 error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+	IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_OUTDISCARDS);
 	if (err == -ENOBUFS && !inet->recverr)
 		err = 0;
 	return err;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ccb25d8..8fc87b4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -960,11 +960,11 @@ static int ip_error(struct sk_buff *skb)
 	if (!IN_DEV_FORWARD(in_dev)) {
 		switch (rt->dst.error) {
 		case EHOSTUNREACH:
-			__IP_INC_STATS(net, IPSTATS_MIB_INADDRERRORS);
+			__IP_INC_STATS(net, dev, IPSTATS_MIB_INADDRERRORS);
 			break;
 
 		case ENETUNREACH:
-			__IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
+			__IP_INC_STATS(net, dev, IPSTATS_MIB_INNOROUTES);
 			break;
 		}
 		goto out;
@@ -979,7 +979,7 @@ static int ip_error(struct sk_buff *skb)
 		break;
 	case ENETUNREACH:
 		code = ICMP_NET_UNREACH;
-		__IP_INC_STATS(net, IPSTATS_MIB_INNOROUTES);
+		__IP_INC_STATS(net, dev, IPSTATS_MIB_INNOROUTES);
 		break;
 	case EACCES:
 		code = ICMP_PKT_FILTERED;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b..df1b989 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -194,7 +194,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
-			IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+			IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		return err;
 	}
 
@@ -402,7 +402,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 				       th->dest, iph->saddr, ntohs(th->source),
 				       inet_iif(icmp_skb), 0);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, icmp_skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 	if (sk->sk_state == TCP_TIME_WAIT) {
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 24b5c59..a8b6567 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -609,7 +609,7 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable)
 			       iph->saddr, uh->source, skb->dev->ifindex, 0,
 			       udptable, NULL);
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;	/* No socket for error */
 	}
 
@@ -1008,7 +1008,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			err = PTR_ERR(rt);
 			rt = NULL;
 			if (err == -ENETUNREACH)
-				IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
+				IP_INC_STATS(net, NULL, IPSTATS_MIB_OUTNOROUTES);
 			goto out;
 		}
 
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index a9c05b2..b52b2e3 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -381,7 +381,7 @@ static int l2tp_ip_backlog_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 
 drop:
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_INDISCARDS);
+	IP_INC_STATS(sock_net(sk), skb->dev, IPSTATS_MIB_INDISCARDS);
 	kfree_skb(skb);
 	return 0;
 }
@@ -504,7 +504,7 @@ static int l2tp_ip_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
 no_route:
 	rcu_read_unlock();
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+	IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 	kfree_skb(skb);
 	rc = -EHOSTUNREACH;
 	goto out;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 9573691..cf2172d 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -462,7 +462,7 @@ static int l2tp_ip6_backlog_recv(struct sock *sk, struct sk_buff *skb)
 	return 0;
 
 drop:
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_INDISCARDS);
+	IP_INC_STATS(sock_net(sk), skb->dev, IPSTATS_MIB_INDISCARDS);
 	kfree_skb(skb);
 	return -1;
 }
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 7a4de6d..c629239 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -141,7 +141,7 @@ void mpls_stats_inc_outucastpkts(struct net_device *dev,
 					   tx_packets,
 					   tx_bytes);
 	} else if (skb->protocol == htons(ETH_P_IP)) {
-		IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+		IP_UPD_PO_STATS(dev_net(dev), dev, IPSTATS_MIB_OUT, skb->len);
 #if IS_ENABLED(CONFIG_IPV6)
 	} else if (skb->protocol == htons(ETH_P_IPV6)) {
 		struct inet6_dev *in6dev = __in6_dev_get(dev);
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index ba0a0fd..9875463 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -287,7 +287,11 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
 	{
 		if (ip_hdr(skb)->ttl <= 1) {
 			/* Tell the sender its packet died... */
-			__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
+			/* at LOCAL_IN the stat is incremented for the input
+			 * dev, at LOCAL_OUT the global is incremented since
+			 * skb->dev is NULL
+			 */
+			__IP_INC_STATS(net, skb->dev, IPSTATS_MIB_INHDRERRORS);
 			icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
 			return false;
 		}
diff --git a/net/sctp/input.c b/net/sctp/input.c
index ba8a6e6..fef625a 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -596,7 +596,7 @@ void sctp_v4_err(struct sk_buff *skb, __u32 info)
 	skb->network_header = saveip;
 	skb->transport_header = savesctp;
 	if (!sk) {
-		__ICMP_INC_STATS(net, ICMP_MIB_INERRORS);
+		__ICMP_INC_STATS(net, skb->dev, ICMP_MIB_INERRORS);
 		return;
 	}
 	/* Warning:  The sock lock is held.  Remember to call
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 690d855..8ae6dc0 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -608,7 +608,7 @@ int sctp_packet_transmit(struct sctp_packet *packet, gfp_t gfp)
 	/* drop packet if no dst */
 	dst = dst_clone(tp->dst);
 	if (!dst) {
-		IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTNOROUTES);
+		IP_INC_STATS(sock_net(sk), NULL, IPSTATS_MIB_OUTNOROUTES);
 		kfree_skb(head);
 		goto out;
 	}
-- 
2.7.4

^ permalink raw reply related

* Re: unregister_netdevice: waiting for DEV to become free
From: syzbot @ 2018-04-20 23:59 UTC (permalink / raw)
  To: ddstreet, dvyukov, linux-kernel, netdev, syzkaller-bugs
In-Reply-To: <000000000000d71bf6056a34b628@google.com>

syzbot has found reproducer for the following crash on  
https://github.com/google/kmsan.git/master commit
48c6a2b0ab1b752451cdc40b5392471ed1a2a329 (Mon Apr 16 08:42:26 2018 +0000)
mm/kmsan: fix origin calculation in kmsan_internal_check_memory
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=2dfb68e639f0621b19fb

So far this crash happened 180 times on  
https://github.com/google/kmsan.git/master, net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4936564132020224
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5817131010621440
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6313498770407424
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=6627248707860932248
compiler: clang version 7.0.0 (trunk 329391)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+2dfb68e639f0621b19fb@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
device bridge_slave_1 left promiscuous mode
bridge0: port 2(bridge_slave_1) entered disabled state
device bridge_slave_0 left promiscuous mode
bridge0: port 1(bridge_slave_0) entered disabled state
unregister_netdevice: waiting for lo to become free. Usage count = 3

^ permalink raw reply

* Query: eBPF socket filter
From: Tushar Dave @ 2018-04-21  6:18 UTC (permalink / raw)
  To: netdev

Hi,

I am dealing with Reliable Datagram Socket (RDS) protocol and exploring
ways to filter RDS traffic using eBPF.

FYI, RDS sits on top of transport layer and works with
transport such as TCP and IB/RDMA. So RDS messages arrive in form of skb
(over TCP)and scatterlist (over IB/RDMA).

For RDS traffic over TCP, I am using ebpf program of type
BPF_PROG_TYPE_SOCKET_FILTER and with that my filter program receives
__sk_buff as bpf context. This works.
However, I cannot use BPF_PROG_TYPE_SOCKET_FILTER with struct
scatterlist. I am not sure if it would be appropriate to extend BPF
socket filter so that it also work with struct scatterlist?

Ideas/suggestions or any other alternatives please?

Thanks in advance,
-Tushar

^ permalink raw reply

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Martin KaFai Lau @ 2018-04-20 23:13 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team
In-Reply-To: <alpine.LFD.2.20.1804202205290.2906@ja.home.ssi.bg>

On Fri, Apr 20, 2018 at 11:43:59PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
> 
> > This patch is not a proper fix and mainly serves for discussion purpose.
> > It is based on net-next which I have been using to debug the issue.
> > 
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > Other changes are the rippling effect in function arg.
> > 
> > This bug was uncovered by one of our legacy service that
> > are still using ipvs for load balancing.  In that setup,
> > the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> > before tx it out to eth0.
> > 
> > The problem is the kernel stack could pass a skb (which was
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU.  In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > stalled for tens of seconds which is how it got uncovered.
> > (On the NIC side, the NIC firmware and driver have been fixed
> > to avoid this tx queue stall after seeing this skb).
> > 
> > On the kernel side, based on the commit log, this bug should have
> > been exposed after commit 815d22e55b0e ("ip6ip6: Support for GSO/GRO").
> 
> 	Before I go to deeply analyze the GSO code, here
> are some preliminary observations:
> 
> - IPVS started to use iptunnel_handle_offloads() around 2014,
> commit ea1d5d7755a3
> 
> - later (2016) the "ip6ip6: Support for GSO/GRO" commits started
> to use skb_set_inner_ipproto(). But it is missing in the IPVS code.
> I'm not sure if such call can help. At least, I don't see what
> different happens in IPVS compared to ip6ip6_tnl_xmit(), for example.
I don't see skb->inner_ipproto used in
ipv6_gso_segment() -> ipv6_gso_segment() -> tcp6_gso_segment()

One thing may worth to highlight is that this problem sort of exist
even before commit 815d22e55b0e, it just happened that the ipv6_gso_segment()
error out (-EPROTONOSUPPORT) which then stops sending it down to the driver.

> 
> > Before commit 815d22e55b0e, ipv6_gso_segment() would just error
> > out (-EPROTONOSUPPORT) because the tx-ing packet is an ip6ip6.
> > Due to this error out, it avoid passing it to the driver.  The TCP
> > stack then timeout and the TCP mtu probing eventually kicked in to
> > lower the skb->len enough to avoid gso_segment.
> > 
> > After commit 815d22e55b0e, ipv6_gso_segment() -> ipv6_gso_segment()
> > -> tcp6_gso_segment() which segment the packet based on a mss
> > that does not account for the extra IPv6 hdr.
> > 
> > Here is a stack from the WARN_ON() that we added to the driver to
> > capture the issue:
> > [ 1128.611875] WARNING: CPU: 40 PID: 31495 at drivers/net/ethernet/mellanox/mlx5/core/en_tx.c:424 mlx5e_xmit+0x814
> > ...
> > [ 1129.016536] Call Trace:
> > [ 1129.021412]  ? skb_release_data+0xfc/0x120
> > [ 1129.029587]  ? kfree_skbmem+0x64/0x70
> > [ 1129.036905]  dev_hard_start_xmit+0xa4/0x200
> > [ 1129.045262]  sch_direct_xmit+0x10f/0x280
> > [ 1129.053111]  __qdisc_run+0x223/0x5a0
> > [ 1129.060251]  __dev_queue_xmit+0x245/0x7d0
> > [ 1129.068268]  dev_queue_xmit+0x10/0x20
> > [ 1129.075573]  ? dev_queue_xmit+0x10/0x20
> > [ 1129.083218]  ip6_finish_output2+0x2db/0x490
> > [ 1129.091573]  ip6_finish_output+0x125/0x190
> > [ 1129.099754]  ip6_output+0x5f/0x100
> > [ 1129.106548]  ? ip6_fragment+0x9f0/0x9f0
> > [ 1129.114212]  ip6_local_out+0x35/0x40
> > [ 1129.121356]  ip_vs_tunnel_xmit_v6+0x267/0x290 [ip_vs]
> > [ 1129.131443]  ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.140837]  ? ip_vs_in.part.24+0x302/0x710 [ip_vs]
> > [ 1129.150578]  ? ip_vs_conn_out_get+0x17/0x140 [ip_vs]
> > [ 1129.160493]  ? ip_vs_conn_out_get_proto+0x25/0x30 [ip_vs]
> > [ 1129.171273]  ip_vs_in+0x43/0x130 [ip_vs]
> > [ 1129.179109]  ip_vs_local_request6+0x26/0x30 [ip_vs]
> > [ 1129.188849]  nf_hook_slow+0x3e/0xc0
> > [ 1129.195800]  ip6_xmit+0x30b/0x540
> > [ 1129.202421]  ? ac6_proc_exit+0x20/0x20
> > [ 1129.209909]  inet6_csk_xmit+0x82/0xd0
> > [ 1129.217207]  ? lock_timer_base+0x76/0xa0
> > [ 1129.225043]  tcp_transmit_skb+0x56f/0xa40
> > [ 1129.233051]  tcp_write_xmit+0x2b2/0x11b0
> > [ 1129.240885]  __tcp_push_pending_frames+0x33/0xa0
> > [ 1129.250106]  tcp_push+0xde/0x100
> > [ 1129.256554]  tcp_sendmsg_locked+0x9ca/0xca0
> > [ 1129.264910]  tcp_sendmsg+0x2c/0x50
> > [ 1129.271703]  inet_sendmsg+0x31/0xb0
> > [ 1129.278672]  sock_write_iter+0xf8/0x110
> > [ 1129.286335]  new_sync_write+0xd9/0x120
> > [ 1129.293823]  vfs_write+0x18d/0x1e0
> > [ 1129.300614]  SyS_write+0x48/0xa0
> > [ 1129.307045]  do_syscall_64+0x69/0x1e0
> > [ 1129.314361]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > ...
> > [ 1129.648183] ---[ end trace 635061c9c300799e ]---
> > [ 1129.657407] skb->len:1554 MTU:1522
> > 
> > The tcp flow is connecting from the address ending ':27:0' to the ':85'.
> > 
> > [host-a] > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 advmss 1440 pref medium
> > 
> > [host-a] > ip -6 a
> > 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
> >     inet6 2401:db00:1011:1f01:face:b00c:0:85/128 scope global
> >        valid_lft forever preferred_lft forever
> > 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
> >     inet6 2401:db00:1011:10af:face:0:27:0/64 scope global
> >        valid_lft forever preferred_lft forever
> > 
> > [host-a] > cat /proc/net/ip_vs
> > TCP  [2401:db00:1011:1f01:face:b00c:0000:0085]:01BB rr
> >   -> [2401:db00:1011:10cc:face:0000:0091:0000]:01BB      Tunnel  6772   9          6
> >   -> [2401:db00:1011:10d8:face:0000:0091:0000]:01BB      Tunnel  6772   8          6
> >   -> [2401:db00:1011:10d2:face:0000:0091:0000]:01BB      Tunnel  6772   19         7
> > 
> > [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443
> > send-something-long-here-to-trigger-the-bug
> > 
> > Changing the local route mtu to 1460 to account for the extra ipv6 tunnel header
> > can also side step the issue.  Like this:
> 
> 	Yes, reducing the MTU was a solution recommended from
> long time ago in the IPVS HOWTO.
> 
> > 
> > > ip -6 r show table local
> > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium
> > 
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > ---
> >  net/netfilter/ipvs/ip_vs_xmit.c | 49 +++++++++++++++++++++++++++--------------
> >  1 file changed, 33 insertions(+), 16 deletions(-)
> > 
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 11c416f3d6e3..88cc0d53ebce 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
> >  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
> >  }
> >  
> > -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> > +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp,
> >  					  int rt_mode,
> >  					  struct ip_vs_iphdr *ipvsh,
> >  					  struct sk_buff *skb, int mtu)
> >  {
> > +	struct netns_ipvs *ipvs = cp->ipvs;
> > +
> >  #ifdef CONFIG_IP_VS_IPV6
> > -	if (skb_af == AF_INET6) {
> > +	if (cp->af == AF_INET6) {
> >  		struct net *net = ipvs->net;
> >  
> >  		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
> > @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
> >  		}
> >  	}
> >  
> > +	if (skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) {
> > +		const struct tcphdr *th = (struct tcphdr *)skb_transport_header(skb);
> > +		unsigned short hdr_len = (th->doff << 2) +
> > +			skb_network_header_len(skb);
> > +
> > +		if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size)
> > +			skb_decrease_gso_size(skb_shinfo(skb),
> > +					      skb_shinfo(skb)->gso_size -
> > +					      (mtu - hdr_len));
> 
> 	So, software segmentation happens and we want the
> tunnel header to be accounted immediately and not after PMTU
> probing period?
I think accounting tunnel header correctly early on is the expected
behavior.  Otherwise, if we choose to drop the big skb instead (either
gso_segment error out, driver error out or NIC drops it from tx queue),
the first few packets of a TCP connection will guarantee to
experience timeout.

> Is this a problem only for IPVS tunnels?
> Do we see such delays with other tunnels?
Based on what I understand, a tun dev (setup by iproute2) should
not experience this issue because the MTU of that tun dev has
already accounted for the extra header.

Our current work around using a smaller MTU in the local route has
similar effect (in the sense that introducing a smaller MTU betewen
the phyical eth0 and the TCP stack).

Thanks for looking into it!

Martin

> May be this should
> be solved for all protocols (not just TCP) and for all tunnels?
> Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to
> local sender, so we should really alter the gso_size for proper
> segmentation?
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox