Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Eric Dumazet @ 2018-04-18 19:21 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: David S. Miller
In-Reply-To: <1524071712.2599.60.camel@redhat.com>



On 04/18/2018 10:15 AM, Paolo Abeni wrote:
is not appealing to me :/
> 
> Thank you for the feedback.
> Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
> the above tests are leveraging it.
> 
> That 5% is on top of that 300%.

Then there is something wrong.

Adding copies should not increase performance.

If it does, there is certainly another way, reaching 10% instead of 5%

^ permalink raw reply

* Re: [RFC PATCH V1 01/12] audit: add container id
From: Richard Guy Briggs @ 2018-04-18 19:23 UTC (permalink / raw)
  To: Stefan Berger
  Cc: mszeredi, ebiederm, simo, jlayton, carlos, linux-api, containers,
	LKML, eparis, dhowells, Linux-Audit Mailing List, viro, luto,
	netdev, linux-fsdevel, cgroups, serge, trondmy
In-Reply-To: <f966fa52-da4b-3d74-0848-1f0b08e57fd9@linux.vnet.ibm.com>

On 2018-04-18 14:45, Stefan Berger wrote:
> On 03/15/2018 11:58 PM, Richard Guy Briggs wrote:
> > On 2018-03-15 16:27, Stefan Berger wrote:
> > > On 03/01/2018 02:41 PM, Richard Guy Briggs wrote:
> > > > Implement the proc fs write to set the audit container ID of a process,
> > > > emitting an AUDIT_CONTAINER record to document the event.
> > > > 
> > > > This is a write from the container orchestrator task to a proc entry of
> > > > the form /proc/PID/containerid where PID is the process ID of the newly
> > > > created task that is to become the first task in a container, or an
> > > > additional task added to a container.
> > > > 
> > > > The write expects up to a u64 value (unset: 18446744073709551615).
> > > > 
> > > > This will produce a record such as this:
> > > > type=UNKNOWN[1333] msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> > > > 
> > > > The "op" field indicates an initial set.  The "pid" to "ses" fields are
> > > > the orchestrator while the "opid" field is the object's PID, the process
> > > > being "contained".  Old and new container ID values are given in the
> > > > "contid" fields, while res indicates its success.
> > > > 
> > > > It is not permitted to self-set, unset or re-set the container ID.  A
> > > > child inherits its parent's container ID, but then can be set only once
> > > > after.
> > > > 
> > > > See: https://github.com/linux-audit/audit-kernel/issues/32
> > > > 
> > > > 
> > > >    /* audit_rule_data supports filter rules with both integer and string
> > > >     * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > > index 4e0a4ac..0ee1e59 100644
> > > > --- a/kernel/auditsc.c
> > > > +++ b/kernel/auditsc.c
> > > > @@ -2073,6 +2073,92 @@ int audit_set_loginuid(kuid_t loginuid)
> > > >    	return rc;
> > > >    }
> > > > 
> > > > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
> > > > +{
> > > > +	struct task_struct *parent;
> > > > +	u64 pcontainerid, ccontainerid;
> > > > +	pid_t ppid;
> > > > +
> > > > +	/* Don't allow to set our own containerid */
> > > > +	if (current == task)
> > > > +		return -EPERM;
> > > > +	/* Don't allow the containerid to be unset */
> > > > +	if (!cid_valid(containerid))
> > > > +		return -EINVAL;
> > > > +	/* if we don't have caps, reject */
> > > > +	if (!capable(CAP_AUDIT_CONTROL))
> > > > +		return -EPERM;
> > > > +	/* if containerid is unset, allow */
> > > > +	if (!audit_containerid_set(task))
> > > > +		return 0;
> > > I am wondering whether there should be a check for the target process that
> > > will receive the containerid to not have CAP_SYS_ADMIN that would otherwise
> > > allow it to arbitrarily unshare()/clone() and leave the set of namespaces
> > > that may make up the container whose containerid we assign here?
> > This is a reasonable question.  This has been debated and I understood
> > the conclusion was that without a clear definition of a "container", the
> > task still remains in that container that just now has more
> > sub-namespaces (in the case of hierarchical namespaces), we don't want
> > to restrict it in such a way and that allows it to create nested
> > containers.  I see setns being more problematic if it could switch to
> > another existing namespace that was set up by the orchestrator for a
> > different container.  The coming v2 patchset acknowledges this situation
> > with the network namespace being potentially shared by multiple
> > containers.
> 
> Are you going to post v2 soon? We would like to build on top of it for IMA
> namespacing and auditing inside of IMA namespaces.

I don't know if it addresses your specific needs, but V2 was posted on
March 16th along with userspace patches:
	https://www.redhat.com/archives/linux-audit/2018-March/msg00110.html
	https://www.redhat.com/archives/linux-audit/2018-March/msg00124.html

V3 is pending.

>    Stefan

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: David Miller @ 2018-04-18 19:33 UTC (permalink / raw)
  To: alexander.duyck
  Cc: sowmini.varadhan, willemdebruijn.kernel, sridhar.samudrala,
	netdev, willemb
In-Reply-To: <CAKgT0UcefCpG2j7RQN8aUUgu2bud_RTFs_s+nFPY2GhKgE8L+A@mail.gmail.com>

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Wed, 18 Apr 2018 11:12:06 -0700

> My only concern with the patch set is verifying what mitigations are
> in case so that we aren't trying to set an MSS size that results in a
> frame larger than MTU. I'm still digging through the code and trying
> to grok it, but I figured I might just put the question out there to
> may my reviewing easier.

It signals an error if a too large segment size is requested.

^ permalink raw reply

* Re: [RFC PATCH V1 01/12] audit: add container id
From: Stefan Berger @ 2018-04-18 19:39 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: mszeredi, ebiederm, simo, jlayton, carlos, linux-api, containers,
	LKML, eparis, dhowells, Linux-Audit Mailing List, viro, luto,
	netdev, linux-fsdevel, cgroups, serge, trondmy
In-Reply-To: <20180418192359.n4q53bvsdhrjftjg@madcap2.tricolour.ca>

On 04/18/2018 03:23 PM, Richard Guy Briggs wrote:
> On 2018-04-18 14:45, Stefan Berger wrote:
>> On 03/15/2018 11:58 PM, Richard Guy Briggs wrote:
>>> On 2018-03-15 16:27, Stefan Berger wrote:
>>>> On 03/01/2018 02:41 PM, Richard Guy Briggs wrote:
>>>>> Implement the proc fs write to set the audit container ID of a process,
>>>>> emitting an AUDIT_CONTAINER record to document the event.
>>>>>
>>>>> This is a write from the container orchestrator task to a proc entry of
>>>>> the form /proc/PID/containerid where PID is the process ID of the newly
>>>>> created task that is to become the first task in a container, or an
>>>>> additional task added to a container.
>>>>>
>>>>> The write expects up to a u64 value (unset: 18446744073709551615).
>>>>>
>>>>> This will produce a record such as this:
>>>>> type=UNKNOWN[1333] msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>>>>>
>>>>> The "op" field indicates an initial set.  The "pid" to "ses" fields are
>>>>> the orchestrator while the "opid" field is the object's PID, the process
>>>>> being "contained".  Old and new container ID values are given in the
>>>>> "contid" fields, while res indicates its success.
>>>>>
>>>>> It is not permitted to self-set, unset or re-set the container ID.  A
>>>>> child inherits its parent's container ID, but then can be set only once
>>>>> after.
>>>>>
>>>>> See: https://github.com/linux-audit/audit-kernel/issues/32
>>>>>
>>>>>
>>>>>     /* audit_rule_data supports filter rules with both integer and string
>>>>>      * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>>>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>>>>> index 4e0a4ac..0ee1e59 100644
>>>>> --- a/kernel/auditsc.c
>>>>> +++ b/kernel/auditsc.c
>>>>> @@ -2073,6 +2073,92 @@ int audit_set_loginuid(kuid_t loginuid)
>>>>>     	return rc;
>>>>>     }
>>>>>
>>>>> +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
>>>>> +{
>>>>> +	struct task_struct *parent;
>>>>> +	u64 pcontainerid, ccontainerid;
>>>>> +	pid_t ppid;
>>>>> +
>>>>> +	/* Don't allow to set our own containerid */
>>>>> +	if (current == task)
>>>>> +		return -EPERM;
>>>>> +	/* Don't allow the containerid to be unset */
>>>>> +	if (!cid_valid(containerid))
>>>>> +		return -EINVAL;
>>>>> +	/* if we don't have caps, reject */
>>>>> +	if (!capable(CAP_AUDIT_CONTROL))
>>>>> +		return -EPERM;
>>>>> +	/* if containerid is unset, allow */
>>>>> +	if (!audit_containerid_set(task))
>>>>> +		return 0;
>>>> I am wondering whether there should be a check for the target process that
>>>> will receive the containerid to not have CAP_SYS_ADMIN that would otherwise
>>>> allow it to arbitrarily unshare()/clone() and leave the set of namespaces
>>>> that may make up the container whose containerid we assign here?
>>> This is a reasonable question.  This has been debated and I understood
>>> the conclusion was that without a clear definition of a "container", the
>>> task still remains in that container that just now has more
>>> sub-namespaces (in the case of hierarchical namespaces), we don't want
>>> to restrict it in such a way and that allows it to create nested
>>> containers.  I see setns being more problematic if it could switch to
>>> another existing namespace that was set up by the orchestrator for a
>>> different container.  The coming v2 patchset acknowledges this situation
>>> with the network namespace being potentially shared by multiple
>>> containers.
>> Are you going to post v2 soon? We would like to build on top of it for IMA
>> namespacing and auditing inside of IMA namespaces.
> I don't know if it addresses your specific needs, but V2 was posted on
> March 16th along with userspace patches:
> 	https://www.redhat.com/archives/linux-audit/2018-March/msg00110.html
> 	https://www.redhat.com/archives/linux-audit/2018-March/msg00124.html
>
> V3 is pending.
Thanks. I hadn't actually looked at primarily due to the ghak and ghau 
in the title. Whatever these may mean.

Does V2 or will V3 prevent a privileged process to setns() to a whole 
different set of namespaces and still be audited with that initial 
container id ?

^ permalink raw reply

* net namespaces kernel stack overflow
From: Alexander Aring @ 2018-04-18 19:45 UTC (permalink / raw)
  To: ktkhai; +Cc: netdev, Jamal Hadi Salim

Hi,

I currently can crash my net/master kernel by execute the following script:

--- snip

modprobe dummy

#mkdir /var/run/netns
#touch /var/run/netns/init_net
#mount --bind /proc/1/ns/net /var/run/netns/init_net

while true
do
    mkdir /var/run/netns
    touch /var/run/netns/init_net
    mount --bind /proc/1/ns/net /var/run/netns/init_net

    ip netns add foo
    ip netns exec foo ip link add dummy0 type dummy
    ip netns delete foo
done

--- snap

After max ~1 minute the kernel will crash.
Doing my hack of saving init_net outside the loop it will run fine...
So the mount bind is necessary.

The last message which I see is:

BUG: stack guard page was hit at 00000000f0751759 (stack is
0000000069363195..0000000073ddc474)
kernel stack overflow (double-fault): 0000 [#1] SMP PTI
Modules linked in:
CPU: 0 PID: 13917 Comm: ip Not tainted 4.16.0-11878-gef9d066f6808 #32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:validate_chain.isra.23+0x44/0xc40
RSP: 0018:ffffc900002cbff8 EFLAGS: 00010002
RAX: 0000000000040000 RBX: 0e58b88e1d4d15da RCX: 0e58b88e1d4d15da
RDX: 0000000000000000 RSI: ffff8802b25ee2a0 RDI: ffff8802b25edb00
RBP: 0e58b88e1d4d15da R08: 0000000000000000 R09: 0000000000000004
R10: ffffc900002cc050 R11: ffff8802b1054be8 R12: 0000000000000001
R13: ffff8802b25ee268 R14: ffff8802b25edb00 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8802bfc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc900002cbfe8 CR3: 0000000002024000 CR4: 00000000000006f0
Call Trace:
 ? get_max_files+0x10/0x10
 __lock_acquire+0x332/0x710
 lock_acquire+0x67/0xb0
 ? lockref_put_or_lock+0x9/0x30
 ? dput.part.7+0x17/0x2d0
 _raw_spin_lock+0x2b/0x60
 ? lockref_put_or_lock+0x9/0x30
 lockref_put_or_lock+0x9/0x30
 dput.part.7+0x1ec/0x2d0
 drop_mountpoint+0x10/0x40
 pin_kill+0x9b/0x3a0
 ? wait_woken+0x90/0x90
 ? mnt_pin_kill+0x2d/0x100
 mnt_pin_kill+0x2d/0x100
 cleanup_mnt+0x66/0x70
 pin_kill+0x9b/0x3a0
 ? wait_woken+0x90/0x90
 ? mnt_pin_kill+0x2d/0x100
 mnt_pin_kill+0x2d/0x100
 cleanup_mnt+0x66/0x70
...

I guess maybe it has something to do with recently switching to
migrate per-net ops to async.

- Alex

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Michael S. Tsirkin @ 2018-04-18 19:46 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
	virtualization, loseweigh, netdev, davem
In-Reply-To: <20180418191315.GA1922@nanopsycho>

On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > This provides a generic interface for paravirtual drivers to listen
> >> > > > for netdev register/unregister/link change events from pci ethernet
> >> > > > devices with the same MAC and takeover their datapath. The notifier and
> >> > > > event handling code is based on the existing netvsc implementation.
> >> > > > 
> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
> >> > > > master netdev is created. The paravirtual driver registers each bypass
> >> > > > instance along with a set of ops to manage the slave events.
> >> > > >       bypass_master_register()
> >> > > >       bypass_master_unregister()
> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
> >> > > > the bypass module provides interfaces to create/destroy additional master
> >> > > > netdev and all the slave events are managed internally.
> >> > > >        bypass_master_create()
> >> > > >        bypass_master_destroy()
> >> > > > 
> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > > > ---
> >> > > > include/linux/netdevice.h |  14 +
> >> > > > include/net/bypass.h      |  96 ++++++
> >> > > > net/Kconfig               |  18 +
> >> > > > net/core/Makefile         |   1 +
> >> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> >> > > > 5 files changed, 973 insertions(+)
> >> > > > create mode 100644 include/net/bypass.h
> >> > > > create mode 100644 net/core/bypass.c
> >> > > > 
> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> > > > index cf44503ea81a..587293728f70 100644
> >> > > > --- a/include/linux/netdevice.h
> >> > > > +++ b/include/linux/netdevice.h
> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> >> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
> >> > > > 	IFF_MACSEC			= 1<<25,
> >> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
> >> > > > +	IFF_BYPASS			= 1 << 27,
> >> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
> >> > > I wonder, why you don't follow the existing coding style... Also, please
> >> > > add these to into the comment above.
> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
> >> > to the existing coding style to be consistent.
> >> Please do.
> >> 
> >> 
> >> > > 
> >> > > > };
> >> > > > 
> >> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> >> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
> >> > > > #define IFF_MACSEC			IFF_MACSEC
> >> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
> >> > > > +#define IFF_BYPASS			IFF_BYPASS
> >> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
> >> > > > 
> >> > > > /**
> >> > > >    *	struct net_device - The DEVICE structure.
> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> >> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
> >> > > > }
> >> > > > 
> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
> >> > > > +{
> >> > > > +	return dev->priv_flags & IFF_BYPASS;
> >> > > > +}
> >> > > > +
> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
> >> > > > +{
> >> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
> >> > > > +}
> >> > > > +
> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> >> > > > static inline void netif_keep_dst(struct net_device *dev)
> >> > > > {
> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
> >> > > > new file mode 100644
> >> > > > index 000000000000..86b02cb894cf
> >> > > > --- /dev/null
> >> > > > +++ b/include/net/bypass.h
> >> > > > @@ -0,0 +1,96 @@
> >> > > > +// SPDX-License-Identifier: GPL-2.0
> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
> >> > > > +
> >> > > > +#ifndef _NET_BYPASS_H
> >> > > > +#define _NET_BYPASS_H
> >> > > > +
> >> > > > +#include <linux/netdevice.h>
> >> > > > +
> >> > > > +struct bypass_ops {
> >> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
> >> > > > +				  struct net_device *bypass_netdev);
> >> > > > +	int (*slave_join)(struct net_device *slave_netdev,
> >> > > > +			  struct net_device *bypass_netdev);
> >> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
> >> > > > +				    struct net_device *bypass_netdev);
> >> > > > +	int (*slave_release)(struct net_device *slave_netdev,
> >> > > > +			     struct net_device *bypass_netdev);
> >> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
> >> > > > +				 struct net_device *bypass_netdev);
> >> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
> >> > > > +};
> >> > > > +
> >> > > > +struct bypass_master {
> >> > > > +	struct list_head list;
> >> > > > +	struct net_device __rcu *bypass_netdev;
> >> > > > +	struct bypass_ops __rcu *ops;
> >> > > > +};
> >> > > > +
> >> > > > +/* bypass state */
> >> > > > +struct bypass_info {
> >> > > > +	/* passthru netdev with same MAC */
> >> > > > +	struct net_device __rcu *active_netdev;
> >> > > You still use "active"/"backup" names which is highly misleading as
> >> > > it has completely different meaning that in bond for example.
> >> > > I noted that in my previous review already. Please change it.
> >> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
> >> > matches with the BACKUP feature bit we are adding to virtio_net.
> >> I think that "backup" is also misleading. Both "active" and "backup"
> >> mean a *state* of slaves. This should be named differently.
> >> 
> >> 
> >> 
> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
> >> > am not too happy with it.
> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> >> No. The netdev could be any netdevice. It does not have to be a "VF".
> >> I think "stolen" is quite appropriate since it describes the modus
> >> operandi. The bypass master steals some netdevice according to some
> >> match.
> >> 
> >> But I don't insist on "stolen". Just sounds right.
> >
> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
> >'backup' name is consistent.
> 
> It perhaps makes sense from the view of virtio device. However, as I
> described couple of times, for master/slave device the name "backup" is
> highly misleading.

virtio is the backup. You are supposed to use another
(typically passthrough) device, if that fails use virtio.
It does seem appropriate to me. If you like, we can
change that to "standby".  Active I don't like either. "main"?

In fact would failover be better than bypass?


> 
> >
> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
> >
> >Will look for any suggestions in the next day or two. If i don't get any, i will go
> >with 'stolen'
> >
> ><snip>
> >
> >
> >> +
> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> >> +						  struct bypass_ops **ops)
> >> +{
> >> +	struct bypass_master *bypass_master;
> >> +	struct net_device *bypass_netdev;
> >> +
> >> +	spin_lock(&bypass_lock);
> >> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
> >> > > As I wrote the last time, you don't need this list, spinlock.
> >> > > You can do just something like:
> >> > >           for_each_net(net) {
> >> > >                   for_each_netdev(net, dev) {
> >> > > 			if (netif_is_bypass_master(dev)) {
> >> > This function returns the upper netdev as well as the ops associated
> >> > with that netdev.
> >> > bypass_master_list is a list of 'struct bypass_master' that associates
> >> Well, can't you have it in netdev priv?
> >
> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
> 
> Howcome? You have no master? I don't understand..
> 
> 
> 
> >
> >> 
> >> 
> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
> >> > NULL for 3-netdev model.
> >> I see :(
> >> 
> >> 
> >> > 
> >> > > 
> >> > > 
> >> > > 
> >> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
> >> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
> >> > > > +			*ops = rcu_dereference(bypass_master->ops);
> >> > > I don't see how rcu_dereference is ok here.
> >> > > 1) I don't see rcu_read_lock taken
> >> > > 2) Looks like bypass_master->ops has the same value across the whole
> >> > >      existence.
> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
> >> > Yes. ops doesn't change.
> >> If it does not change, you can just access it directly.
> >> 
> >> 
> >> > > 
> >> > > > +			spin_unlock(&bypass_lock);
> >> > > > +			return bypass_netdev;
> >> > > > +		}
> >> > > > +	}
> >> > > > +	spin_unlock(&bypass_lock);
> >> > > > +	return NULL;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	int ret, orig_mtu;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > For master, could you use word "master" in the variables so it is clear?
> >> > > Also, "dev" is fine instead of "netdev".
> >> > > Something like "bpmaster_dev"
> >> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
> >> I was trying to point out, that "bypass_netdev" represents a "master"
> >> netdev, yet it does not say master. That is why I suggested
> >> "bpmaster_dev"
> >> 
> >> 
> >> > I can change all _netdev suffixes to _dev to make the names shorter.
> >> ok.
> >> 
> >> 
> >> > 
> >> > > 
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
> >> > > > +					bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > 	Just "if (ret)" will do. You have this on more places.
> >> > OK.
> >> > 
> >> > 
> >> > > 
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = netdev_rx_handler_register(slave_netdev,
> >> > > > +					 bypass_ops ? bypass_ops->handle_frame :
> >> > > > +					 bypass_handle_frame, bypass_netdev);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
> >> > > > +			   ret);
> >> > > > +		goto done;
> >> > > > +	}
> >> > > > +
> >> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
> >> > > > +			   bypass_netdev->name, ret);
> >> > > > +		goto upper_link_failed;
> >> > > > +	}
> >> > > > +
> >> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > +	if (netif_running(bypass_netdev)) {
> >> > > > +		ret = dev_open(slave_netdev);
> >> > > > +		if (ret && (ret != -EBUSY)) {
> >> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
> >> > > > +				   slave_netdev->name, ret);
> >> > > > +			goto err_interface_up;
> >> > > > +		}
> >> > > > +	}
> >> > > > +
> >> > > > +	/* Align MTU of slave with master */
> >> > > > +	orig_mtu = slave_netdev->mtu;
> >> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
> >> > > > +			   slave_netdev->name, bypass_netdev->mtu);
> >> > > > +		goto err_set_mtu;
> >> > > > +	}
> >> > > > +
> >> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > > +		goto err_join;
> >> > > > +
> >> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +	goto done;
> >> > > > +
> >> > > > +err_join:
> >> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
> >> > > > +err_set_mtu:
> >> > > > +	dev_close(slave_netdev);
> >> > > > +err_interface_up:
> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +upper_link_failed:
> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
> >> > > > +				       struct net_device *bypass_netdev,
> >> > > > +				       struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > +	struct net_device *backup_netdev, *active_netdev;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_pre_unregister)
> >> > > > +			return -EINVAL;
> >> > > > +
> >> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
> >> > > > +							bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > +		return -EINVAL;
> >> > > > +
> >> > > > +	return 0;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
> >> > > > +				struct net_device *bypass_netdev,
> >> > > > +				struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > +	struct net_device *backup_netdev, *active_netdev;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_release)
> >> > > > +			return -EINVAL;
> >> > > I think it would be good to make the API to the driver more strict and
> >> > > have a separate set of ops for "active" and "backup" netdevices.
> >> > > That should stop people thinking about extending this to more slaves in
> >> > > the future.
> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
> >> > 'active' slave.
> >> I'm very well aware of that. I just thought that explicit ops for the
> >> two slaves would make this more clear.
> >> 
> >> 
> >> > 
> >> > > 
> >> > > 
> >> > > > +
> >> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev == backup_netdev) {
> >> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
> >> > > > +	} else {
> >> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
> >> > > > +		if (backup_netdev) {
> >> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
> >> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
> >> > > > +		}
> >> > > > +	}
> >> > > > +
> >> > > > +	dev_put(slave_netdev);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +	return 0;
> >> > > > +}
> >> > > > +
> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	int ret;
> >> > > > +
> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
> >> > > > +					  bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
> >> > > > +
> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
> >> > > > +{
> >> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_link_change)
> >> > > > +			goto done;
> >> > > > +
> >> > > > +		return bypass_ops->slave_link_change(slave_netdev,
> >> > > > +						     bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	if (!netif_running(bypass_netdev))
> >> > > > +		return 0;
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > +		goto done;
> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> >> > > above is enough.
> >> > I think we need this check to not allow events from a slave that is not
> >> > attached to this master but has the same MAC.
> >> Why do we need such events? Seems wrong to me.
> >
> >We want to avoid events from a netdev that is mis-configured with the same MAC as
> >a bypass setup.
> >
> >>   Consider:
> >> 
> >> bp1      bp2
> >> a1 b1    a2 b2
> >> 
> >> 
> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
> >
> >We should not have 2 bypass configs with the same MAC.
> >I need to add a check in the bypass_master_register() to prevent this.
> 
> Mac can change, you would have to check in change as well. Feels odd
> thought. 
> 
> 
> >
> >The above check is to avoid cases where we have
> >bp1(a1, b1) with mac1
> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
> >
> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> >> the order of creation.
> >> Let's say it will return bp1. Then when we have event for a2, the
> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
> >> 
> >> 
> >> You cannot use bypass_master_get_bymac() here.
> >> 
> >> 
> >> 
> >> > > 
> >> > > > +
> >> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
> >> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
> >> > > > +		netif_carrier_on(bypass_netdev);
> >> > > > +		netif_tx_wake_all_queues(bypass_netdev);
> >> > > > +	} else {
> >> > > > +		netif_carrier_off(bypass_netdev);
> >> > > > +		netif_tx_stop_all_queues(bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
> >> > > > +{
> >> > > > +	/* Skip parent events */
> >> > > > +	if (netif_is_bypass_master(dev))
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid non-Ethernet type devices */
> >> > > > +	if (dev->type != ARPHRD_ETHER)
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
> >> > > > +	if (is_vlan_dev(dev))
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
> >> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
> >> > > helpers netif_is_bond_master().
> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
> >> > > 
> >> > > You need to do it not by blacklisting, but with whitelisting. You need
> >> > > to whitelist VF devices. My port flavours patchset might help with this.
> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
> >> I don't see such function in the code.
> >
> >It is netdev_has_any_lower_dev(). I need to export it.
> 
> Come on, you cannot use that. That would allow bonding without slaves,
> but the slaves could be added later on.
> 
> What exactly you are trying to achieve by this?
> 
> 
> >
> >> 
> >> 
> >> > device is not an upper dev.
> >> > Can you point to your port flavours patchset? Is it upstream?
> >> I sent rfc couple of weeks ago:
> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
> >
> >
> >

^ permalink raw reply

* [PATCH net] vmxnet3: fix incorrect dereference when rxvlan is disabled
From: Ronak Doshi @ 2018-04-18 19:48 UTC (permalink / raw)
  To: netdev; +Cc: Ronak Doshi, VMware, Inc., open list

vmxnet3_get_hdr_len() is used to calculate the header length which in
turn is used to calculate the gso_size for skb. When rxvlan offload is
disabled, vlan tag is present in the header and the function references
ip header from sizeof(ethhdr) and leads to incorrect pointer reference.

This patch fixes this issue by taking sizeof(vlan_ethhdr) into account
if vlan tag is present and correctly references the ip hdr.

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Acked-by: Louis Luo <llouis@vmware.com>
---
 drivers/net/vmxnet3/vmxnet3_drv.c | 17 +++++++++++++----
 drivers/net/vmxnet3/vmxnet3_int.h |  4 ++--
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index e04937f44f33..9ebe2a689966 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -1218,6 +1218,7 @@ vmxnet3_get_hdr_len(struct vmxnet3_adapter *adapter, struct sk_buff *skb,
 	union {
 		void *ptr;
 		struct ethhdr *eth;
+		struct vlan_ethhdr *veth;
 		struct iphdr *ipv4;
 		struct ipv6hdr *ipv6;
 		struct tcphdr *tcp;
@@ -1228,16 +1229,24 @@ vmxnet3_get_hdr_len(struct vmxnet3_adapter *adapter, struct sk_buff *skb,
 	if (unlikely(sizeof(struct iphdr) + sizeof(struct tcphdr) > maplen))
 		return 0;
 
+	if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
+	    skb->protocol == cpu_to_be16(ETH_P_8021AD))
+		hlen = sizeof(struct vlan_ethhdr);
+	else
+		hlen = sizeof(struct ethhdr);
+
 	hdr.eth = eth_hdr(skb);
 	if (gdesc->rcd.v4) {
-		BUG_ON(hdr.eth->h_proto != htons(ETH_P_IP));
-		hdr.ptr += sizeof(struct ethhdr);
+		BUG_ON(hdr.eth->h_proto != htons(ETH_P_IP) &&
+		       hdr.veth->h_vlan_encapsulated_proto != htons(ETH_P_IP));
+		hdr.ptr += hlen;
 		BUG_ON(hdr.ipv4->protocol != IPPROTO_TCP);
 		hlen = hdr.ipv4->ihl << 2;
 		hdr.ptr += hdr.ipv4->ihl << 2;
 	} else if (gdesc->rcd.v6) {
-		BUG_ON(hdr.eth->h_proto != htons(ETH_P_IPV6));
-		hdr.ptr += sizeof(struct ethhdr);
+		BUG_ON(hdr.eth->h_proto != htons(ETH_P_IPV6) &&
+		       hdr.veth->h_vlan_encapsulated_proto != htons(ETH_P_IPV6));
+		hdr.ptr += hlen;
 		/* Use an estimated value, since we also need to handle
 		 * TSO case.
 		 */
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index 59ec34052a65..a3326463b71f 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -69,10 +69,10 @@
 /*
  * Version numbers
  */
-#define VMXNET3_DRIVER_VERSION_STRING   "1.4.13.0-k"
+#define VMXNET3_DRIVER_VERSION_STRING   "1.4.14.0-k"
 
 /* a 32-bit int, each byte encode a verion number in VMXNET3_DRIVER_VERSION */
-#define VMXNET3_DRIVER_VERSION_NUM      0x01040d00
+#define VMXNET3_DRIVER_VERSION_NUM      0x01040e00
 
 #if defined(CONFIG_PCI_MSI)
 	/* RSS only makes sense if MSI-X is supported. */
-- 
2.11.0

^ permalink raw reply related

* Re: [RFC PATCH V1 01/12] audit: add container id
From: Richard Guy Briggs @ 2018-04-18 19:51 UTC (permalink / raw)
  To: Stefan Berger
  Cc: mszeredi, ebiederm, simo, jlayton, carlos, linux-api, containers,
	LKML, eparis, dhowells, Linux-Audit Mailing List, viro, luto,
	netdev, linux-fsdevel, cgroups, serge, trondmy
In-Reply-To: <c1ec93a2-b398-373c-55da-b2be8e60c6b6@linux.vnet.ibm.com>

On 2018-04-18 15:39, Stefan Berger wrote:
> On 04/18/2018 03:23 PM, Richard Guy Briggs wrote:
> > On 2018-04-18 14:45, Stefan Berger wrote:
> > > On 03/15/2018 11:58 PM, Richard Guy Briggs wrote:
> > > > On 2018-03-15 16:27, Stefan Berger wrote:
> > > > > On 03/01/2018 02:41 PM, Richard Guy Briggs wrote:
> > > > > > Implement the proc fs write to set the audit container ID of a process,
> > > > > > emitting an AUDIT_CONTAINER record to document the event.
> > > > > > 
> > > > > > This is a write from the container orchestrator task to a proc entry of
> > > > > > the form /proc/PID/containerid where PID is the process ID of the newly
> > > > > > created task that is to become the first task in a container, or an
> > > > > > additional task added to a container.
> > > > > > 
> > > > > > The write expects up to a u64 value (unset: 18446744073709551615).
> > > > > > 
> > > > > > This will produce a record such as this:
> > > > > > type=UNKNOWN[1333] msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> > > > > > 
> > > > > > The "op" field indicates an initial set.  The "pid" to "ses" fields are
> > > > > > the orchestrator while the "opid" field is the object's PID, the process
> > > > > > being "contained".  Old and new container ID values are given in the
> > > > > > "contid" fields, while res indicates its success.
> > > > > > 
> > > > > > It is not permitted to self-set, unset or re-set the container ID.  A
> > > > > > child inherits its parent's container ID, but then can be set only once
> > > > > > after.
> > > > > > 
> > > > > > See: https://github.com/linux-audit/audit-kernel/issues/32
> > > > > > 
> > > > > > 
> > > > > >     /* audit_rule_data supports filter rules with both integer and string
> > > > > >      * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> > > > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > > > > index 4e0a4ac..0ee1e59 100644
> > > > > > --- a/kernel/auditsc.c
> > > > > > +++ b/kernel/auditsc.c
> > > > > > @@ -2073,6 +2073,92 @@ int audit_set_loginuid(kuid_t loginuid)
> > > > > >     	return rc;
> > > > > >     }
> > > > > > 
> > > > > > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
> > > > > > +{
> > > > > > +	struct task_struct *parent;
> > > > > > +	u64 pcontainerid, ccontainerid;
> > > > > > +	pid_t ppid;
> > > > > > +
> > > > > > +	/* Don't allow to set our own containerid */
> > > > > > +	if (current == task)
> > > > > > +		return -EPERM;
> > > > > > +	/* Don't allow the containerid to be unset */
> > > > > > +	if (!cid_valid(containerid))
> > > > > > +		return -EINVAL;
> > > > > > +	/* if we don't have caps, reject */
> > > > > > +	if (!capable(CAP_AUDIT_CONTROL))
> > > > > > +		return -EPERM;
> > > > > > +	/* if containerid is unset, allow */
> > > > > > +	if (!audit_containerid_set(task))
> > > > > > +		return 0;
> > > > > I am wondering whether there should be a check for the target process that
> > > > > will receive the containerid to not have CAP_SYS_ADMIN that would otherwise
> > > > > allow it to arbitrarily unshare()/clone() and leave the set of namespaces
> > > > > that may make up the container whose containerid we assign here?
> > > > This is a reasonable question.  This has been debated and I understood
> > > > the conclusion was that without a clear definition of a "container", the
> > > > task still remains in that container that just now has more
> > > > sub-namespaces (in the case of hierarchical namespaces), we don't want
> > > > to restrict it in such a way and that allows it to create nested
> > > > containers.  I see setns being more problematic if it could switch to
> > > > another existing namespace that was set up by the orchestrator for a
> > > > different container.  The coming v2 patchset acknowledges this situation
> > > > with the network namespace being potentially shared by multiple
> > > > containers.
> > > Are you going to post v2 soon? We would like to build on top of it for IMA
> > > namespacing and auditing inside of IMA namespaces.
> > I don't know if it addresses your specific needs, but V2 was posted on
> > March 16th along with userspace patches:
> > 	https://www.redhat.com/archives/linux-audit/2018-March/msg00110.html
> > 	https://www.redhat.com/archives/linux-audit/2018-March/msg00124.html
> > 
> > V3 is pending.
> Thanks. I hadn't actually looked at primarily due to the ghak and ghau in
> the title. Whatever these may mean.

They are Github issue numbers:
GHAK: GitHub Audit Kernel
GHAU: GitHub Audit Userspace
GHAD: GitHub Audit Documentation
GHAT: GitHub Audit Testsuite

> Does V2 or will V3 prevent a privileged process to setns() to a whole
> different set of namespaces and still be audited with that initial container
> id ?

No, not significantly different from V1 in that respect.

It does not prevent setns(), but will maintain its containerid.

It will prevent games by blocking a child and parent from setting each
other's containerids.

It does check that the task being conainered does not yet have any
children or peer threads.

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* [PATCH] docs: ip-sysctl.txt: fix name of some ipv6 variables
From: Olivier Gayot @ 2018-04-18 20:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Olivier Gayot

The name of the following proc/sysctl entries were incorrectly
documented:

    /proc/sys/net/ipv6/conf/<interface>/max_dst_opts_number
    /proc/sys/net/ipv6/conf/<interface>/max_hbt_opts_number
    /proc/sys/net/ipv6/conf/<interface>/max_dst_opts_length
    /proc/sys/net/ipv6/conf/<interface>/max_hbt_length

Their name was set to the name of the symbol in the .data field of the
control table instead of their .proc name.

Signed-off-by: Olivier Gayot <olivier.gayot@sigexec.com>
---
 Documentation/networking/ip-sysctl.txt | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5dc1a040a2f1..b583a73cf95f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1390,26 +1390,26 @@ mld_qrv - INTEGER
 	Default: 2 (as specified by RFC3810 9.1)
 	Minimum: 1 (as specified by RFC6636 4.5)
 
-max_dst_opts_cnt - INTEGER
+max_dst_opts_number - INTEGER
 	Maximum number of non-padding TLVs allowed in a Destination
 	options extension header. If this value is less than zero
 	then unknown options are disallowed and the number of known
 	TLVs allowed is the absolute value of this number.
 	Default: 8
 
-max_hbh_opts_cnt - INTEGER
+max_hbh_opts_number - INTEGER
 	Maximum number of non-padding TLVs allowed in a Hop-by-Hop
 	options extension header. If this value is less than zero
 	then unknown options are disallowed and the number of known
 	TLVs allowed is the absolute value of this number.
 	Default: 8
 
-max dst_opts_len - INTEGER
+max_dst_opts_length - INTEGER
 	Maximum length allowed for a Destination options extension
 	header.
 	Default: INT_MAX (unlimited)
 
-max hbh_opts_len - INTEGER
+max_hbh_length - INTEGER
 	Maximum length allowed for a Hop-by-Hop options extension
 	header.
 	Default: INT_MAX (unlimited)
-- 
2.17.0

^ permalink raw reply related

* Re: [RFC net-next PATCH 2/2] bpf: disallow XDP data_meta to overlap with xdp_frame area
From: Daniel Borkmann @ 2018-04-18 20:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20180418195322.381b041c@redhat.com>

On 04/18/2018 07:53 PM, Jesper Dangaard Brouer wrote:
> On Wed, 18 Apr 2018 18:21:21 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
>> On 04/18/2018 02:10 PM, Jesper Dangaard Brouer wrote:
>>> If combining xdp_adjust_head and xdp_adjust_meta, then it is possible
>>> to make data_meta overlap with area used by xdp_frame.  And another
>>> invocation of xdp_adjust_head can then clear that area, due to
>>> clearing of xdp_frame area.
>>>
>>> The easiest solution I found was to simply not allow
>>> xdp_buff->data_meta to overlap with area used by xdp_frame.  
>>
>> Thanks Jesper! Trying to answer both emails in one. :) More below.
>>
>>> Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")
>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>> ---
>>>  net/core/filter.c |   11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 15e9b5477360..e3623e741181 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -2701,6 +2701,11 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>>>  		     data > xdp->data_end - ETH_HLEN))
>>>  		return -EINVAL;
>>>  
>>> +	/* Disallow data_meta to use xdp_frame area */
>>> +	if (metalen > 0 &&
>>> +	    unlikely((data - metalen) < xdp_frame_end))
>>> +		return -EINVAL;
>>> +
>>>  	/* Avoid info leak, when reusing area prev used by xdp_frame */
>>>  	if (data < xdp_frame_end) {  
>>
>> Effectively, when metalen > 0, then data_meta < data pointer, so above test
>> on new data_meta might be better, but feels like a bit of a workaround to
>> handle moving data pointer but disallowing moving data_meta pointer whereas
>> both could be handled if we wanted to go that path.
>>
>>>  		unsigned long clearlen = xdp_frame_end - data;
>>> @@ -2734,6 +2739,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>>>  
>>>  BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>>>  {
>>> +	void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>>>  	void *meta = xdp->data_meta + offset;
>>>  	unsigned long metalen = xdp->data - meta;
>>>  
>>> @@ -2742,6 +2748,11 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>>>  	if (unlikely(meta < xdp->data_hard_start ||
>>>  		     meta > xdp->data))
>>>  		return -EINVAL;
>>> +
>>> +	/* Disallow data_meta to use xdp_frame area */
>>> +	if (unlikely(meta < xdp_frame_end))
>>> +		return -EINVAL;  
>>
>> (Ditto.)
>>
>>>  	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
>>>  		     (metalen > 32)))
>>>  		return -EACCES;  
>>
>> The other, perhaps less invasive/complex option would be to just disallow
>> moving anything into previous sizeof(struct xdp_frame) area. My original
>> concern was that not all drivers use 256 bytes of headroom, e.g. afaik the
>> i40e and ixgbe have around 192 bytes of headroom available, but that should
>> actually still be plenty of space for encap + meta data, and potentially
>> with meta data use I would expect that at best custom decap would be
>> happening when pushing the packet up the stack. So might as well disallow
>> going into that region and not worry about it. Thus, reverting e9e9545e10d3
>> ("xdp: avoid leaking info stored in frame data on page reuse") and adding
>> something like the below (uncompiled), should just do it:
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 3bb0cb9..ad98ddd 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2692,8 +2692,9 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
>>
>>  BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>>  {
>> +	void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>>  	unsigned long metalen = xdp_get_metalen(xdp);
>> -	void *data_start = xdp->data_hard_start + metalen;
>> +	void *data_start = frame_end + metalen;
>>  	void *data = xdp->data + offset;
>>
>>  	if (unlikely(data < data_start ||
>> @@ -2719,13 +2720,13 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>>
>>  BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>>  {
>> +	void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>>  	void *meta = xdp->data_meta + offset;
>>  	unsigned long metalen = xdp->data - meta;
>>
>>  	if (xdp_data_meta_unsupported(xdp))
>>  		return -ENOTSUPP;
>> -	if (unlikely(meta < xdp->data_hard_start ||
>> -		     meta > xdp->data))
>> +	if (unlikely(meta < frame_end || meta > xdp->data))
>>  		return -EINVAL;
>>  	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
>>  		     (metalen > 32)))
> 
> Okay, so you say just disallow using xdp_frame area in general.  It
> would be simpler.  

Yeah, lets do that.

> The advantage it that we don't run into strange situations, where the
> user/bpf_prog increased headroom too much, such that convert_to_xdp_frame()
> fails and thus XDP_REDIRECT action fails.  (That will be confusing to
> users to debug/troubleshoot).

Agree, yet another argument for doing it.

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-18 20:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh
In-Reply-To: <20180418222309-mutt-send-email-mst@kernel.org>

Wed, Apr 18, 2018 at 09:46:04PM CEST, mst@redhat.com wrote:
>On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > This provides a generic interface for paravirtual drivers to listen
>> >> > > > for netdev register/unregister/link change events from pci ethernet
>> >> > > > devices with the same MAC and takeover their datapath. The notifier and
>> >> > > > event handling code is based on the existing netvsc implementation.
>> >> > > > 
>> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> >> > > > master netdev is created. The paravirtual driver registers each bypass
>> >> > > > instance along with a set of ops to manage the slave events.
>> >> > > >       bypass_master_register()
>> >> > > >       bypass_master_unregister()
>> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> >> > > > the bypass module provides interfaces to create/destroy additional master
>> >> > > > netdev and all the slave events are managed internally.
>> >> > > >        bypass_master_create()
>> >> > > >        bypass_master_destroy()
>> >> > > > 
>> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >> > > > ---
>> >> > > > include/linux/netdevice.h |  14 +
>> >> > > > include/net/bypass.h      |  96 ++++++
>> >> > > > net/Kconfig               |  18 +
>> >> > > > net/core/Makefile         |   1 +
>> >> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> > > > 5 files changed, 973 insertions(+)
>> >> > > > create mode 100644 include/net/bypass.h
>> >> > > > create mode 100644 net/core/bypass.c
>> >> > > > 
>> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> >> > > > index cf44503ea81a..587293728f70 100644
>> >> > > > --- a/include/linux/netdevice.h
>> >> > > > +++ b/include/linux/netdevice.h
>> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> >> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
>> >> > > > 	IFF_MACSEC			= 1<<25,
>> >> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
>> >> > > > +	IFF_BYPASS			= 1 << 27,
>> >> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> >> > > I wonder, why you don't follow the existing coding style... Also, please
>> >> > > add these to into the comment above.
>> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> >> > to the existing coding style to be consistent.
>> >> Please do.
>> >> 
>> >> 
>> >> > > 
>> >> > > > };
>> >> > > > 
>> >> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> >> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> >> > > > #define IFF_MACSEC			IFF_MACSEC
>> >> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> >> > > > +#define IFF_BYPASS			IFF_BYPASS
>> >> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> >> > > > 
>> >> > > > /**
>> >> > > >    *	struct net_device - The DEVICE structure.
>> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> >> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> >> > > > }
>> >> > > > 
>> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return dev->priv_flags & IFF_BYPASS;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> >> > > > +}
>> >> > > > +
>> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> >> > > > static inline void netif_keep_dst(struct net_device *dev)
>> >> > > > {
>> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> >> > > > new file mode 100644
>> >> > > > index 000000000000..86b02cb894cf
>> >> > > > --- /dev/null
>> >> > > > +++ b/include/net/bypass.h
>> >> > > > @@ -0,0 +1,96 @@
>> >> > > > +// SPDX-License-Identifier: GPL-2.0
>> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> >> > > > +
>> >> > > > +#ifndef _NET_BYPASS_H
>> >> > > > +#define _NET_BYPASS_H
>> >> > > > +
>> >> > > > +#include <linux/netdevice.h>
>> >> > > > +
>> >> > > > +struct bypass_ops {
>> >> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> >> > > > +				  struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_join)(struct net_device *slave_netdev,
>> >> > > > +			  struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> >> > > > +				    struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_release)(struct net_device *slave_netdev,
>> >> > > > +			     struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> >> > > > +				 struct net_device *bypass_netdev);
>> >> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> >> > > > +};
>> >> > > > +
>> >> > > > +struct bypass_master {
>> >> > > > +	struct list_head list;
>> >> > > > +	struct net_device __rcu *bypass_netdev;
>> >> > > > +	struct bypass_ops __rcu *ops;
>> >> > > > +};
>> >> > > > +
>> >> > > > +/* bypass state */
>> >> > > > +struct bypass_info {
>> >> > > > +	/* passthru netdev with same MAC */
>> >> > > > +	struct net_device __rcu *active_netdev;
>> >> > > You still use "active"/"backup" names which is highly misleading as
>> >> > > it has completely different meaning that in bond for example.
>> >> > > I noted that in my previous review already. Please change it.
>> >> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> >> > matches with the BACKUP feature bit we are adding to virtio_net.
>> >> I think that "backup" is also misleading. Both "active" and "backup"
>> >> mean a *state* of slaves. This should be named differently.
>> >> 
>> >> 
>> >> 
>> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> >> > am not too happy with it.
>> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> I think "stolen" is quite appropriate since it describes the modus
>> >> operandi. The bypass master steals some netdevice according to some
>> >> match.
>> >> 
>> >> But I don't insist on "stolen". Just sounds right.
>> >
>> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> >'backup' name is consistent.
>> 
>> It perhaps makes sense from the view of virtio device. However, as I
>> described couple of times, for master/slave device the name "backup" is
>> highly misleading.
>
>virtio is the backup. You are supposed to use another
>(typically passthrough) device, if that fails use virtio.
>It does seem appropriate to me. If you like, we can
>change that to "standby".  Active I don't like either. "main"?

Sounds much better, yes.


>
>In fact would failover be better than bypass?

Also, much better.


>
>
>> 
>> >
>> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>> >
>> >Will look for any suggestions in the next day or two. If i don't get any, i will go
>> >with 'stolen'
>> >
>> ><snip>
>> >
>> >
>> >> +
>> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> >> +						  struct bypass_ops **ops)
>> >> +{
>> >> +	struct bypass_master *bypass_master;
>> >> +	struct net_device *bypass_netdev;
>> >> +
>> >> +	spin_lock(&bypass_lock);
>> >> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> >> > > As I wrote the last time, you don't need this list, spinlock.
>> >> > > You can do just something like:
>> >> > >           for_each_net(net) {
>> >> > >                   for_each_netdev(net, dev) {
>> >> > > 			if (netif_is_bypass_master(dev)) {
>> >> > This function returns the upper netdev as well as the ops associated
>> >> > with that netdev.
>> >> > bypass_master_list is a list of 'struct bypass_master' that associates
>> >> Well, can't you have it in netdev priv?
>> >
>> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
>> 
>> Howcome? You have no master? I don't understand..
>> 
>> 
>> 
>> >
>> >> 
>> >> 
>> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> >> > NULL for 3-netdev model.
>> >> I see :(
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > 
>> >> > > 
>> >> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> >> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> >> > > > +			*ops = rcu_dereference(bypass_master->ops);
>> >> > > I don't see how rcu_dereference is ok here.
>> >> > > 1) I don't see rcu_read_lock taken
>> >> > > 2) Looks like bypass_master->ops has the same value across the whole
>> >> > >      existence.
>> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> >> > Yes. ops doesn't change.
>> >> If it does not change, you can just access it directly.
>> >> 
>> >> 
>> >> > > 
>> >> > > > +			spin_unlock(&bypass_lock);
>> >> > > > +			return bypass_netdev;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +	spin_unlock(&bypass_lock);
>> >> > > > +	return NULL;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	int ret, orig_mtu;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > For master, could you use word "master" in the variables so it is clear?
>> >> > > Also, "dev" is fine instead of "netdev".
>> >> > > Something like "bpmaster_dev"
>> >> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
>> >> I was trying to point out, that "bypass_netdev" represents a "master"
>> >> netdev, yet it does not say master. That is why I suggested
>> >> "bpmaster_dev"
>> >> 
>> >> 
>> >> > I can change all _netdev suffixes to _dev to make the names shorter.
>> >> ok.
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> >> > > > +					bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > 	Just "if (ret)" will do. You have this on more places.
>> >> > OK.
>> >> > 
>> >> > 
>> >> > > 
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = netdev_rx_handler_register(slave_netdev,
>> >> > > > +					 bypass_ops ? bypass_ops->handle_frame :
>> >> > > > +					 bypass_handle_frame, bypass_netdev);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> >> > > > +			   ret);
>> >> > > > +		goto done;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> >> > > > +			   bypass_netdev->name, ret);
>> >> > > > +		goto upper_link_failed;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > +	if (netif_running(bypass_netdev)) {
>> >> > > > +		ret = dev_open(slave_netdev);
>> >> > > > +		if (ret && (ret != -EBUSY)) {
>> >> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> >> > > > +				   slave_netdev->name, ret);
>> >> > > > +			goto err_interface_up;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	/* Align MTU of slave with master */
>> >> > > > +	orig_mtu = slave_netdev->mtu;
>> >> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> >> > > > +			   slave_netdev->name, bypass_netdev->mtu);
>> >> > > > +		goto err_set_mtu;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > > +		goto err_join;
>> >> > > > +
>> >> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +	goto done;
>> >> > > > +
>> >> > > > +err_join:
>> >> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
>> >> > > > +err_set_mtu:
>> >> > > > +	dev_close(slave_netdev);
>> >> > > > +err_interface_up:
>> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +upper_link_failed:
>> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> >> > > > +				       struct net_device *bypass_netdev,
>> >> > > > +				       struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > +	struct net_device *backup_netdev, *active_netdev;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_pre_unregister)
>> >> > > > +			return -EINVAL;
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> >> > > > +							bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > +		return -EINVAL;
>> >> > > > +
>> >> > > > +	return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> >> > > > +				struct net_device *bypass_netdev,
>> >> > > > +				struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > +	struct net_device *backup_netdev, *active_netdev;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_release)
>> >> > > > +			return -EINVAL;
>> >> > > I think it would be good to make the API to the driver more strict and
>> >> > > have a separate set of ops for "active" and "backup" netdevices.
>> >> > > That should stop people thinking about extending this to more slaves in
>> >> > > the future.
>> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> >> > 'active' slave.
>> >> I'm very well aware of that. I just thought that explicit ops for the
>> >> two slaves would make this more clear.
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > 
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev == backup_netdev) {
>> >> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> >> > > > +	} else {
>> >> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> >> > > > +		if (backup_netdev) {
>> >> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> >> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	dev_put(slave_netdev);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +	return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	int ret;
>> >> > > > +
>> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> >> > > > +					  bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> >> > > > +
>> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_link_change)
>> >> > > > +			goto done;
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_link_change(slave_netdev,
>> >> > > > +						     bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	if (!netif_running(bypass_netdev))
>> >> > > > +		return 0;
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > +		goto done;
>> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> >> > > above is enough.
>> >> > I think we need this check to not allow events from a slave that is not
>> >> > attached to this master but has the same MAC.
>> >> Why do we need such events? Seems wrong to me.
>> >
>> >We want to avoid events from a netdev that is mis-configured with the same MAC as
>> >a bypass setup.
>> >
>> >>   Consider:
>> >> 
>> >> bp1      bp2
>> >> a1 b1    a2 b2
>> >> 
>> >> 
>> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>> >
>> >We should not have 2 bypass configs with the same MAC.
>> >I need to add a check in the bypass_master_register() to prevent this.
>> 
>> Mac can change, you would have to check in change as well. Feels odd
>> thought. 
>> 
>> 
>> >
>> >The above check is to avoid cases where we have
>> >bp1(a1, b1) with mac1
>> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>> >
>> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> >> the order of creation.
>> >> Let's say it will return bp1. Then when we have event for a2, the
>> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> >> 
>> >> 
>> >> You cannot use bypass_master_get_bymac() here.
>> >> 
>> >> 
>> >> 
>> >> > > 
>> >> > > > +
>> >> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> >> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> >> > > > +		netif_carrier_on(bypass_netdev);
>> >> > > > +		netif_tx_wake_all_queues(bypass_netdev);
>> >> > > > +	} else {
>> >> > > > +		netif_carrier_off(bypass_netdev);
>> >> > > > +		netif_tx_stop_all_queues(bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> >> > > > +{
>> >> > > > +	/* Skip parent events */
>> >> > > > +	if (netif_is_bypass_master(dev))
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid non-Ethernet type devices */
>> >> > > > +	if (dev->type != ARPHRD_ETHER)
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
>> >> > > > +	if (is_vlan_dev(dev))
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> >> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> >> > > helpers netif_is_bond_master().
>> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> >> > > 
>> >> > > You need to do it not by blacklisting, but with whitelisting. You need
>> >> > > to whitelist VF devices. My port flavours patchset might help with this.
>> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> >> I don't see such function in the code.
>> >
>> >It is netdev_has_any_lower_dev(). I need to export it.
>> 
>> Come on, you cannot use that. That would allow bonding without slaves,
>> but the slaves could be added later on.
>> 
>> What exactly you are trying to achieve by this?
>> 
>> 
>> >
>> >> 
>> >> 
>> >> > device is not an upper dev.
>> >> > Can you point to your port flavours patchset? Is it upstream?
>> >> I sent rfc couple of weeks ago:
>> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>> >
>> >
>> >

^ permalink raw reply

* Re: [PATCH bpf-next v4 00/10] BTF: BPF Type Format
From: Daniel Borkmann @ 2018-04-18 20:36 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev
  Cc: Alexei Starovoitov, kernel-team, Arnaldo Carvalho de Melo
In-Reply-To: <20180417204243.4028831-1-kafai@fb.com>

On 04/17/2018 10:42 PM, Martin KaFai Lau wrote:
> This patch introduces BPF Type Format (BTF).
> 
> BTF (BPF Type Format) is the meta data format which describes
> the data types of BPF program/map.  Hence, it basically focus
> on the C programming language which the modern BPF is primary
> using.  The first use case is to provide a generic pretty print
> capability for a BPF map.

Ok, as discussed with Martin this will have one more minor respin.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] net: don't use kvzalloc for DMA memory
From: Eric Dumazet @ 2018-04-18 20:38 UTC (permalink / raw)
  To: Michael S. Tsirkin, Eric Dumazet
  Cc: Mikulas Patocka, David S. Miller, Eric Dumazet, Joby Poriyath,
	Ben Hutchings, netdev, linux-kernel
In-Reply-To: <20180418204229-mutt-send-email-mst@kernel.org>



On 04/18/2018 10:55 AM, Michael S. Tsirkin wrote:

> Imagine you want to pass some data to card.
> Natural thing is to just put it in a variable and start DMA.
> However DMA API disallows stack access nowdays,
> so it's natural to put this within struct device.
> 
> See e.g.
> 
> 	commit a725ee3e44e39dab1ec82cc745899a785d2a555e
> 	Author: Andy Lutomirski <luto@kernel.org>
> 	Date:   Mon Jul 18 15:34:49 2016 -0700
> 
> 	    virtio-net: Remove more stack DMA
>

Andy just moved the problem to another one, since at that time we already
had vmalloc() fallback for at least 2 years.

Note that my original patch had :

p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
if (!p)
      p = vzalloc(alloc_size);

So really, normal (less than PAGE_SIZE) allocations would have almost-zero-chance to end up to vmalloc(one_page)

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: add helper for getting xfrm states
From: Daniel Borkmann @ 2018-04-18 20:59 UTC (permalink / raw)
  To: Eyal Birger, netdev; +Cc: shmulik, ast, fw, steffen.klassert
In-Reply-To: <1523940529-3791-2-git-send-email-eyal.birger@gmail.com>

On 04/17/2018 06:48 AM, Eyal Birger wrote:
> This commit introduces a helper which allows fetching xfrm state
> parameters by eBPF programs attached to TC.
> 
> Prototype:
> bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
> 
> skb: pointer to skb
> index: the index in the skb xfrm_state secpath array
> xfrm_state: pointer to 'struct bpf_xfrm_state'
> size: size of 'struct bpf_xfrm_state'
> flags: reserved for future extensions
> 
> The helper returns 0 on success. Non zero if no xfrm state at the index
> is found - or non exists at all.
> 
> struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
> address and the reqid; it can be further extended by adding elements to
> its end - indicating the populated fields by the 'size' argument -
> keeping backwards compatibility.
> 
> Typical usage:
> 
> struct bpf_xfrm_state x = {};
> bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
> ...
> 
> Signed-off-by: Eyal Birger <eyal.birger@gmail.com>

Patch looks good to me, two comments below:

> ---
>  include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
>  net/core/filter.c        | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c5ec897..132e172 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -755,6 +755,15 @@ union bpf_attr {
>   *     @addr: pointer to struct sockaddr to bind socket to
>   *     @addr_len: length of sockaddr structure
>   *     Return: 0 on success or negative error code
> + *
> + * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
> + *     retrieve XFRM state
> + *     @skb: pointer to skb
> + *     @index: index of the xfrm state in the secpath
> + *     @key: pointer to 'struct bpf_xfrm_state'
> + *     @size: size of 'struct bpf_xfrm_state'
> + *     @flags: room for future extensions
> + *     Return: 0 on success or negative error
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -821,7 +830,8 @@ union bpf_attr {
>  	FN(msg_apply_bytes),		\
>  	FN(msg_cork_bytes),		\
>  	FN(msg_pull_data),		\
> -	FN(bind),
> +	FN(bind),			\
> +	FN(skb_get_xfrm_state),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -927,6 +937,19 @@ struct bpf_tunnel_key {
>  	__u32 tunnel_label;
>  };
>  
> +/* user accessible mirror of in-kernel xfrm_state.
> + * new fields can only be added to the end of this structure
> + */
> +struct bpf_xfrm_state {
> +	__u32 reqid;
> +	__u32 spi;
> +	__u16 family;
> +	union {
> +		__u32 remote_ipv4;
> +		__u32 remote_ipv6[4];
> +	};
> +};
> +
>  /* Generic BPF return codes which all BPF program types may support.
>   * The values are binary compatible with their TC_ACT_* counter-part to
>   * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
> diff --git a/net/core/filter.c b/net/core/filter.c
> index d31aff9..c06600a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -57,6 +57,7 @@
>  #include <net/sock_reuseport.h>
>  #include <net/busy_poll.h>
>  #include <net/tcp.h>
> +#include <net/xfrm.h>
>  #include <linux/bpf_trace.h>
>  
>  /**
> @@ -3703,6 +3704,49 @@ static const struct bpf_func_proto bpf_bind_proto = {
>  	.arg3_type	= ARG_CONST_SIZE,
>  };
>  
> +BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index,
> +	   struct bpf_xfrm_state *, to, u32, size, u64, flags)
> +{
> +#ifdef CONFIG_XFRM
> +	const struct sec_path *sp = skb_sec_path(skb);
> +	const struct xfrm_state *x;
> +
> +	if (!sp || index >= sp->len)

This should be something like: if (!sp || unlikely(index >= sp->len || flags))
Such that we unconditionally bail out on any flags currently, since this is
reserved for future use and anything non-zero would be invalid and rejected
until we start extending it.

> +		goto err_clear;
> +
> +	x = sp->xvec[index];
> +
> +	if (unlikely(size != sizeof(struct bpf_xfrm_state)))
> +		goto err_clear;
> +
> +	to->reqid = x->props.reqid;
> +	to->spi = be32_to_cpu(x->id.spi);
> +	to->family = x->props.family;
> +	if (to->family == AF_INET6) {
> +		memcpy(to->remote_ipv6, x->props.saddr.a6,
> +		       sizeof(to->remote_ipv6));
> +	} else {
> +		to->remote_ipv4 = be32_to_cpu(x->props.saddr.a4);
> +	}
> +
> +	return 0;
> +err_clear:
> +#endif
> +	memset(to, 0, size);
> +	return -EINVAL;
> +}
> +
> +static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
> +	.func		= bpf_skb_get_xfrm_state,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg2_type	= ARG_ANYTHING,
> +	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
> +	.arg4_type	= ARG_CONST_SIZE,
> +	.arg5_type	= ARG_ANYTHING,
> +};
> +
>  static const struct bpf_func_proto *
>  bpf_base_func_proto(enum bpf_func_id func_id)
>  {
> @@ -3844,6 +3888,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_get_socket_cookie_proto;
>  	case BPF_FUNC_get_socket_uid:
>  		return &bpf_get_socket_uid_proto;
> +	case BPF_FUNC_skb_get_xfrm_state:
> +		return &bpf_skb_get_xfrm_state_proto;

Potentially, on kernels with !CONFIG_XFRM, you might want to let the program
bail out at program verification phase already? Thus it would become ...

#ifdef CONFIG_XFRM
	case BPF_FUNC_skb_get_xfrm_state:
		return &bpf_skb_get_xfrm_state_proto;
#endif

... where you'd also wrap the helper + state_proto in CONFIG_XFRM.

>  	default:
>  		return bpf_base_func_proto(func_id);
>  	}
> 

^ permalink raw reply

* WARNING: suspicious RCU usage in fib6_info_alloc
From: syzbot @ 2018-04-18 21:02 UTC (permalink / raw)
  To: christian.brauner, davem, dsahern, fw, jbenc, ktkhai,
	linux-kernel, lucien.xin, mschiffer, netdev, syzkaller-bugs,
	vyasevich

Hello,

syzbot hit the following crash on net-next commit
0565de29cbd65b378147d36f9642f93a046240dc (Wed Apr 18 03:41:18 2018 +0000)
Merge branch 'ipv6-Separate-data-structures-for-FIB-and-data-path'
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=2add39b05179b31f912f

So far this crash happened 2 times on net-next.
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=4660613020123136
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5742127124316160
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-5947642240294114534
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+2add39b05179b31f912f@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.
If you forward the report, please keep this part and the footer.

IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21

=============================
WARNING: suspicious RCU usage
4.16.0+ #5 Not tainted
-----------------------------
kernel/sched/core.c:6153 Illegal context switch in RCU-bh read-side  
critical section!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
4 locks held by kworker/1:1/25:
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_data kernel/workqueue.c:617 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 000000007d88bc46 ((work_completion)(&(&ifa->dad_work)->work)){+.+.},  
at: process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 00000000943eaf98 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
  #3: 00000000a39c89a4 (rcu_read_lock_bh){....}, at:  
ipv6_ifa_notify+0x0/0x210 net/ipv6/addrconf.c:5621

stack backtrace:
CPU: 1 PID: 25 Comm: kworker/1:1 Not tainted 4.16.0+ #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: ipv6_addrconf addrconf_dad_work
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
  ___might_sleep+0x2e7/0x320 kernel/sched/core.c:6153
  __might_sleep+0x95/0x190 kernel/sched/core.c:6141
  slab_pre_alloc_hook mm/slab.h:421 [inline]
  slab_alloc mm/slab.c:3378 [inline]
  kmem_cache_alloc_trace+0x2bc/0x780 mm/slab.c:3618
  kmalloc include/linux/slab.h:512 [inline]
  kzalloc include/linux/slab.h:701 [inline]
  fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
  ip6_route_info_create+0x7bf/0x3240 net/ipv6/route.c:2891
  ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
  addrconf_prefix_route.isra.47+0x4f7/0x6f0 net/ipv6/addrconf.c:2347
  __ipv6_ifa_notify+0x591/0xa00 net/ipv6/addrconf.c:5620
  ipv6_ifa_notify+0xff/0x210 net/ipv6/addrconf.c:5650
  addrconf_dad_completed+0xeb/0xbf0 net/ipv6/addrconf.c:4083
  addrconf_dad_begin net/ipv6/addrconf.c:3889 [inline]
  addrconf_dad_work+0x873/0x1300 net/ipv6/addrconf.c:3991
  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
  kthread+0x345/0x410 kernel/kthread.c:238
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
BUG: sleeping function called from invalid context at mm/slab.h:421
in_atomic(): 1, irqs_disabled(): 0, pid: 25, name: kworker/1:1
4 locks held by kworker/1:1/25:
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
__write_once_size include/linux/compiler.h:215 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic64_set include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:57 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_data kernel/workqueue.c:617 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 00000000df858653 ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at:  
process_one_work+0xaef/0x1b50 kernel/workqueue.c:2116
  #1: 000000007d88bc46 ((work_completion)(&(&ifa->dad_work)->work)){+.+.},  
at: process_one_work+0xb46/0x1b50 kernel/workqueue.c:2120
  #2: 00000000943eaf98 (rtnl_mutex){+.+.}, at: rtnl_lock+0x17/0x20  
net/core/rtnetlink.c:74
  #3: 00000000a39c89a4 (rcu_read_lock_bh){....}, at:  
ipv6_ifa_notify+0x0/0x210 net/ipv6/addrconf.c:5621
CPU: 1 PID: 25 Comm: kworker/1:1 Not tainted 4.16.0+ #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: ipv6_addrconf addrconf_dad_work
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  ___might_sleep.cold.88+0x11f/0x13a kernel/sched/core.c:6188
  __might_sleep+0x95/0x190 kernel/sched/core.c:6141
  slab_pre_alloc_hook mm/slab.h:421 [inline]
  slab_alloc mm/slab.c:3378 [inline]
  kmem_cache_alloc_trace+0x2bc/0x780 mm/slab.c:3618
  kmalloc include/linux/slab.h:512 [inline]
  kzalloc include/linux/slab.h:701 [inline]
  fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
  ip6_route_info_create+0x7bf/0x3240 net/ipv6/route.c:2891
  ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
  addrconf_prefix_route.isra.47+0x4f7/0x6f0 net/ipv6/addrconf.c:2347
  __ipv6_ifa_notify+0x591/0xa00 net/ipv6/addrconf.c:5620
  ipv6_ifa_notify+0xff/0x210 net/ipv6/addrconf.c:5650
  addrconf_dad_completed+0xeb/0xbf0 net/ipv6/addrconf.c:4083
  addrconf_dad_begin net/ipv6/addrconf.c:3889 [inline]
  addrconf_dad_work+0x873/0x1300 net/ipv6/addrconf.c:3991
  process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
  worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
  kthread+0x345/0x410 kernel/kthread.c:238
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkaller@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.
Note: all commands must start from beginning of the line in the email body.

^ permalink raw reply

* Re: WARNING: suspicious RCU usage in fib6_info_alloc
From: David Ahern @ 2018-04-18 21:04 UTC (permalink / raw)
  To: syzbot, christian.brauner, davem, fw, jbenc, ktkhai, linux-kernel,
	lucien.xin, mschiffer, netdev, syzkaller-bugs, vyasevich
In-Reply-To: <0000000000004366bc056a25c4f3@google.com>

On 4/18/18 3:02 PM, syzbot wrote:
> stack backtrace:
> CPU: 1 PID: 25 Comm: kworker/1:1 Not tainted 4.16.0+ #5
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Workqueue: ipv6_addrconf addrconf_dad_work
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
>  ___might_sleep+0x2e7/0x320 kernel/sched/core.c:6153
>  __might_sleep+0x95/0x190 kernel/sched/core.c:6141
>  slab_pre_alloc_hook mm/slab.h:421 [inline]
>  slab_alloc mm/slab.c:3378 [inline]
>  kmem_cache_alloc_trace+0x2bc/0x780 mm/slab.c:3618
>  kmalloc include/linux/slab.h:512 [inline]
>  kzalloc include/linux/slab.h:701 [inline]
>  fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
>  ip6_route_info_create+0x7bf/0x3240 net/ipv6/route.c:2891
>  ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
>  addrconf_prefix_route.isra.47+0x4f7/0x6f0 net/ipv6/addrconf.c:2347
>  __ipv6_ifa_notify+0x591/0xa00 net/ipv6/addrconf.c:5620

Eric is faster than the robot ;-)

Patch is queued up; will send later today.

^ permalink raw reply

* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Florian Fainelli @ 2018-04-18 21:07 UTC (permalink / raw)
  To: greearb, netdev; +Cc: linux-wireless, ath10k
In-Reply-To: <1524016176-3881-1-git-send-email-greearb@candelatech.com>

On 04/17/2018 06:49 PM, greearb@candelatech.com wrote:
> From: Ben Greear <greearb@candelatech.com>
> 
> This is similar to ETHTOOL_GSTATS, but it allows you to specify
> flags.  These flags can be used by the driver to decrease the
> amount of stats refreshed.  In particular, this helps with ath10k
> since getting the firmware stats can be slow.

You can configure the statistics refresh rate through the ethtool
coalescing parameter stats_block_coalesce_usecs, not sure if this is
exactly what you had in mind, but if it works, then you might as well
want to use it.

> 
> Signed-off-by: Ben Greear <greearb@candelatech.com>
> ---
>  include/linux/ethtool.h      | 12 ++++++++++++
>  include/uapi/linux/ethtool.h | 10 ++++++++++
>  net/core/ethtool.c           | 40 +++++++++++++++++++++++++++++++++++-----
>  3 files changed, 57 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index ebe4181..a4aa11f 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -243,6 +243,15 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
>   * @get_ethtool_stats: Return extended statistics about the device.
>   *	This is only useful if the device maintains statistics not
>   *	included in &struct rtnl_link_stats64.
> + * @get_ethtool_stats2: Return extended statistics about the device.
> + *	This is only useful if the device maintains statistics not
> + *	included in &struct rtnl_link_stats64.
> + *      Takes a flags argument:  0 means all (same as get_ethtool_stats),
> + *      0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
> + *      Other flags are reserved for now.
> + *      Same number of stats will be returned, but some of them might
> + *      not be as accurate/refreshed.  This is to allow not querying
> + *      firmware or other expensive-to-read stats, for instance.
>   * @begin: Function to be called before any other operation.  Returns a
>   *	negative error code or zero.
>   * @complete: Function to be called after any other operation except
> @@ -355,6 +364,9 @@ struct ethtool_ops {
>  	int	(*set_phys_id)(struct net_device *, enum ethtool_phys_id_state);
>  	void	(*get_ethtool_stats)(struct net_device *,
>  				     struct ethtool_stats *, u64 *);
> +	void	(*get_ethtool_stats2)(struct net_device *dev,
> +				      struct ethtool_stats *gstats, u64 *data,
> +				      u32 flags);
>  	int	(*begin)(struct net_device *);
>  	void	(*complete)(struct net_device *);
>  	u32	(*get_priv_flags)(struct net_device *);
> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
> index 4ca65b5..1c74f3e 100644
> --- a/include/uapi/linux/ethtool.h
> +++ b/include/uapi/linux/ethtool.h
> @@ -1396,11 +1396,21 @@ enum ethtool_fec_config_bits {
>  #define ETHTOOL_PHY_STUNABLE	0x0000004f /* Set PHY tunable configuration */
>  #define ETHTOOL_GFECPARAM	0x00000050 /* Get FEC settings */
>  #define ETHTOOL_SFECPARAM	0x00000051 /* Set FEC settings */
> +#define ETHTOOL_GSTATS2		0x00000052 /* get NIC-specific statistics
> +					    * with ability to specify flags.
> +					    * See ETHTOOL_GS2* below.
> +					    */
>  
>  /* compatibility with older code */
>  #define SPARC_ETH_GSET		ETHTOOL_GSET
>  #define SPARC_ETH_SSET		ETHTOOL_SSET
>  
> +/* GSTATS2 flags */
> +#define ETHTOOL_GS2_SKIP_NONE (0)    /* default is to update all stats */
> +#define ETHTOOL_GS2_SKIP_FW   (1<<0) /* Skip reading stats that probe firmware,
> +				      * and thus are slow/expensive.
> +				      */
> +
>  /* Link mode bit indices */
>  enum ethtool_link_mode_bit_indices {
>  	ETHTOOL_LINK_MODE_10baseT_Half_BIT	= 0,
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index 03416e6..6ec3413 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -1952,16 +1952,14 @@ static int ethtool_phys_id(struct net_device *dev, void __user *useraddr)
>  	return rc;
>  }
>  
> -static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
> +static int _ethtool_get_stats(struct net_device *dev, void __user *useraddr,
> +			      u32 flags)
>  {
>  	struct ethtool_stats stats;
>  	const struct ethtool_ops *ops = dev->ethtool_ops;
>  	u64 *data;
>  	int ret, n_stats;
>  
> -	if (!ops->get_ethtool_stats || !ops->get_sset_count)
> -		return -EOPNOTSUPP;
> -
>  	n_stats = ops->get_sset_count(dev, ETH_SS_STATS);
>  	if (n_stats < 0)
>  		return n_stats;
> @@ -1976,7 +1974,10 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
>  	if (n_stats && !data)
>  		return -ENOMEM;
>  
> -	ops->get_ethtool_stats(dev, &stats, data);
> +	if (flags != ETHTOOL_GS2_SKIP_NONE)
> +		ops->get_ethtool_stats2(dev, &stats, data, flags);
> +	else
> +		ops->get_ethtool_stats(dev, &stats, data);
>  
>  	ret = -EFAULT;
>  	if (copy_to_user(useraddr, &stats, sizeof(stats)))
> @@ -1991,6 +1992,31 @@ static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
>  	return ret;
>  }
>  
> +static int ethtool_get_stats(struct net_device *dev, void __user *useraddr)
> +{
> +	const struct ethtool_ops *ops = dev->ethtool_ops;
> +
> +	if (!ops->get_ethtool_stats || !ops->get_sset_count)
> +		return -EOPNOTSUPP;
> +	return _ethtool_get_stats(dev, useraddr, ETHTOOL_GS2_SKIP_NONE);
> +}
> +
> +static int ethtool_get_stats2(struct net_device *dev, void __user *useraddr)
> +{
> +	struct ethtool_stats stats;
> +	const struct ethtool_ops *ops = dev->ethtool_ops;
> +	u32 flags = 0;
> +
> +	if (!ops->get_ethtool_stats2 || !ops->get_sset_count)
> +		return -EOPNOTSUPP;
> +
> +	if (copy_from_user(&stats, useraddr, sizeof(stats)))
> +		return -EFAULT;
> +
> +	flags = stats.n_stats;
> +	return _ethtool_get_stats(dev, useraddr, flags);
> +}
> +
>  static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
>  {
>  	struct ethtool_stats stats;
> @@ -2632,6 +2658,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
>  	case ETHTOOL_GSSET_INFO:
>  	case ETHTOOL_GSTRINGS:
>  	case ETHTOOL_GSTATS:
> +	case ETHTOOL_GSTATS2:
>  	case ETHTOOL_GPHYSTATS:
>  	case ETHTOOL_GTSO:
>  	case ETHTOOL_GPERMADDR:
> @@ -2742,6 +2769,9 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
>  	case ETHTOOL_GSTATS:
>  		rc = ethtool_get_stats(dev, useraddr);
>  		break;
> +	case ETHTOOL_GSTATS2:
> +		rc = ethtool_get_stats2(dev, useraddr);
> +		break;
>  	case ETHTOOL_GPERMADDR:
>  		rc = ethtool_get_perm_addr(dev, useraddr);
>  		break;
> 


-- 
Florian

^ permalink raw reply

* Re: WARNING: suspicious RCU usage in fib6_info_alloc
From: Eric Dumazet @ 2018-04-18 21:16 UTC (permalink / raw)
  To: David Ahern, syzbot, christian.brauner, davem, fw, jbenc, ktkhai,
	linux-kernel, lucien.xin, mschiffer, netdev, syzkaller-bugs,
	vyasevich
In-Reply-To: <3096e146-ee1d-d3f8-e900-8aa70cb54d00@gmail.com>



On 04/18/2018 02:04 PM, David Ahern wrote:
> On 4/18/18 3:02 PM, syzbot wrote:
>> stack backtrace:
>> CPU: 1 PID: 25 Comm: kworker/1:1 Not tainted 4.16.0+ #5
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Workqueue: ipv6_addrconf addrconf_dad_work
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:77 [inline]
>>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>>  lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
>>  ___might_sleep+0x2e7/0x320 kernel/sched/core.c:6153
>>  __might_sleep+0x95/0x190 kernel/sched/core.c:6141
>>  slab_pre_alloc_hook mm/slab.h:421 [inline]
>>  slab_alloc mm/slab.c:3378 [inline]
>>  kmem_cache_alloc_trace+0x2bc/0x780 mm/slab.c:3618
>>  kmalloc include/linux/slab.h:512 [inline]
>>  kzalloc include/linux/slab.h:701 [inline]
>>  fib6_info_alloc+0xbb/0x280 net/ipv6/ip6_fib.c:152
>>  ip6_route_info_create+0x7bf/0x3240 net/ipv6/route.c:2891
>>  ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
>>  addrconf_prefix_route.isra.47+0x4f7/0x6f0 net/ipv6/addrconf.c:2347
>>  __ipv6_ifa_notify+0x591/0xa00 net/ipv6/addrconf.c:5620
> 
> Eric is faster than the robot ;-)

Not really ;)

> 
> Patch is queued up; will send later today.
> 

David, I simply looked at the report before allowing the bot to send it ;)

I determined the bug was added recently, and warned you about it.

Thanks !

^ permalink raw reply

* Re: [PATCH v3 00/10] New network driver for Amiga X-Surf 100 (m68k)
From: Michael Schmitz @ 2018-04-18 21:26 UTC (permalink / raw)
  To: Finn Thain
  Cc: netdev, Andrew Lunn, Geert Uytterhoeven, Florian Fainelli,
	Linux/m68k, Michael Karcher
In-Reply-To: <alpine.LNX.2.21.1804181531180.8@nippy.intranet>

Hi Finn,

thanks for the feedback!

On Wed, Apr 18, 2018 at 5:45 PM, Finn Thain <fthain@telegraphics.com.au> wrote:
>> > 1/9 net: phy: new Asix Electronics PHY driver
>> > 2/9 net: ax88796: Fix MAC address reading
>> > 3/9 net: ax88796: Attach MII bus only when open
>> > 4/9 net: ax88796: Do not free IRQ in ax_remove() (already freed in ax_close()).
>> > 5/9 net: ax88796: Add block_input/output hooks to ax_plat_data
>
> I found that git am rejects this one, though 'patch' applies it with fuzz.

Can't remember seeing anything there when rebasing the series, odd.

>
>> > 6/9 net: ax88796: add interrupt status callback to platform data
>> > 7/9 net: ax88796: set IRQF_SHARED flag when IRQ resource is marked as shareable
>> > 8/9 net: ax88796: release platform device drvdata on probe error and module remove
>> > 9/9 net: New ax88796 platform driver for Amiga X-Surf 100 Zorro board (m68k)
>
> git am rejected this one and also complained about trailing whitespace.
>
> I'd rebase on v4.17-rc1 and also run checkpatch over the results.

Done that now, thanks. Andrew recommended basing my patches on
net-next so I'll do that before resubmitting.

Cheers,

  Michael

> --
>
>> >
>> >  drivers/net/ethernet/8390/Kconfig    |   17 ++-
>> >  drivers/net/ethernet/8390/Makefile   |    1 +
>> >  drivers/net/ethernet/8390/ax88796.c  |  228 ++++++++++++--------
>> >  drivers/net/ethernet/8390/xsurf100.c |  381 ++++++++++++++++++++++++++++++++++
>> >  drivers/net/phy/Kconfig              |    6 +
>> >  drivers/net/phy/Makefile             |    1 +
>> >  drivers/net/phy/asix.c               |   65 ++++++
>> >  drivers/net/phy/phy_device.c         |    3 +-
>> >  include/linux/phy.h                  |    1 +
>> >  include/net/ax88796.h                |   14 ++
>> >  10 files changed, 621 insertions(+), 96 deletions(-)
>> >
>> > Cheers,
>> >
>> >   Michael

^ permalink raw reply

* Re: [PATCH 1/3] ethtool: Support ETHTOOL_GSTATS2 command.
From: Johannes Berg @ 2018-04-18 21:26 UTC (permalink / raw)
  To: greearb, netdev; +Cc: linux-wireless, ath10k
In-Reply-To: <1524016176-3881-1-git-send-email-greearb@candelatech.com>

On Tue, 2018-04-17 at 18:49 -0700, greearb@candelatech.com wrote:
> 
> + * @get_ethtool_stats2: Return extended statistics about the device.
> + *	This is only useful if the device maintains statistics not
> + *	included in &struct rtnl_link_stats64.
> + *      Takes a flags argument:  0 means all (same as get_ethtool_stats),
> + *      0x1 (ETHTOOL_GS2_SKIP_FW) means skip firmware stats.
> + *      Other flags are reserved for now.

It'd be pretty hard to know which flags are firmware stats?

Anyway, there's no way I'm going to take this patch, so you need to
float it on netdev first (best CC us here) and get it applied there
before we can do anything on the wifi side.

johannes

^ permalink raw reply

* Re: [PATCH v3 00/10] New network driver for Amiga X-Surf 100 (m68k)
From: Michael Schmitz @ 2018-04-18 21:34 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Finn Thain, Geert Uytterhoeven, Florian Fainelli,
	Linux/m68k, Michael Karcher
In-Reply-To: <20180418121957.GB31643@lunn.ch>

Hi Andrew,

sorry, my mistake. I didn't realize how fast DaveM's tree diverges
from Linus' (and Geert's) once the merge window opens.

Cheers,

  Michael


On Thu, Apr 19, 2018 at 12:19 AM, Andrew Lunn <andrew@lunn.ch> wrote:
> On Wed, Apr 18, 2018 at 05:10:45PM +1200, Michael Schmitz wrote:
>> All,
>>
>> just noticed belatedly that the Makefile hunk of patch 9 does no
>> longer apply cleanly in 4.17-rc1, sorry. My series was based on 4.16.
>> I'll resend that one, OK?
>
> Hi Michael
>
> You should be based on DaveM net-next tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
>
> Please also have "net-next" in the patch subject. See
> Documentation/networking/netdev-FAQ.txt
>
>         Andrew

^ permalink raw reply

* [PATCH net 0/3] net: sched: ife: malformed ife packet fixes
From: Alexander Aring @ 2018-04-18 21:35 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring

As promised at netdev 2.2 tc workshop I am working on adding scapy support for
tdc testing. It is still work in progress. I will submit the patches to tdc
later (they are not in good shape yet). The good news is I have been able to
find bugs which normal packet testing would not be able to find.
With fuzzy testing I was able to craft certain malformed packets that IFE
action was not able to deal with. This patch set fixes those bugs.

Alexander Aring (3):
  net: sched: ife: signal not finding metaid
  net: sched: ife: handle malformed tlv length
  net: sched: ife: check on metadata length

 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 38 ++++++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  9 +++++++--
 3 files changed, 45 insertions(+), 5 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH net 1/3] net: sched: ife: signal not finding metaid
From: Alexander Aring @ 2018-04-18 21:35 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180418213534.6215-1-aring@mojatatu.com>

We need to record stats for received metadata that we dont know how
to process. Have find_decode_metaid() return -ENOENT to capture this.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
---
 net/sched/act_ife.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index a5994cf0512b..49b8ab551fbe 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -652,7 +652,7 @@ static int find_decode_metaid(struct sk_buff *skb, struct tcf_ife_info *ife,
 		}
 	}
 
-	return 0;
+	return -ENOENT;
 }
 
 static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
-- 
2.11.0

^ permalink raw reply related

* [PATCH net 2/3] net: sched: ife: handle malformed tlv length
From: Alexander Aring @ 2018-04-18 21:35 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180418213534.6215-1-aring@mojatatu.com>

There is currently no handling to check on a invalid tlv length. This
patch adds such handling to avoid killing the kernel with a malformed
ife packet.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
---
 include/net/ife.h   |  3 ++-
 net/ife/ife.c       | 35 +++++++++++++++++++++++++++++++++--
 net/sched/act_ife.c |  7 ++++++-
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/include/net/ife.h b/include/net/ife.h
index 44b9c00f7223..e117617e3c34 100644
--- a/include/net/ife.h
+++ b/include/net/ife.h
@@ -12,7 +12,8 @@
 void *ife_encode(struct sk_buff *skb, u16 metalen);
 void *ife_decode(struct sk_buff *skb, u16 *metalen);
 
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen);
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen);
 int ife_tlv_meta_encode(void *skbdata, u16 attrtype, u16 dlen,
 			const void *dval);
 
diff --git a/net/ife/ife.c b/net/ife/ife.c
index 7d1ec76e7f43..8632d2685efb 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -92,12 +92,43 @@ struct meta_tlvhdr {
 	__be16 len;
 };
 
+static inline bool __ife_tlv_meta_valid(const unsigned char *skbdata,
+					const unsigned char *ifehdr_end)
+{
+	const struct meta_tlvhdr *tlv;
+	u16 tlvlen;
+
+	if (unlikely(skbdata + sizeof(*tlv) > ifehdr_end))
+		return false;
+
+	tlv = (struct meta_tlvhdr *)skbdata;
+	tlvlen = ntohs(tlv->len);
+
+	/* tlv length field is inc header, check on minimum */
+	if (tlvlen < NLA_HDRLEN)
+		return false;
+
+	/* overflow by NLA_ALIGN check */
+	if (NLA_ALIGN(tlvlen) < tlvlen)
+		return false;
+
+	if (unlikely(skbdata + NLA_ALIGN(tlvlen) > ifehdr_end))
+		return false;
+
+	return true;
+}
+
 /* Caller takes care of presenting data in network order
  */
-void *ife_tlv_meta_decode(void *skbdata, u16 *attrtype, u16 *dlen, u16 *totlen)
+void *ife_tlv_meta_decode(void *skbdata, const void *ifehdr_end, u16 *attrtype,
+			  u16 *dlen, u16 *totlen)
 {
-	struct meta_tlvhdr *tlv = (struct meta_tlvhdr *) skbdata;
+	struct meta_tlvhdr *tlv;
+
+	if (!__ife_tlv_meta_valid(skbdata, ifehdr_end))
+		return NULL;
 
+	tlv = (struct meta_tlvhdr *)skbdata;
 	*dlen = ntohs(tlv->len) - NLA_HDRLEN;
 	*attrtype = ntohs(tlv->type);
 
diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index 49b8ab551fbe..8527cfdc446d 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -682,7 +682,12 @@ static int tcf_ife_decode(struct sk_buff *skb, const struct tc_action *a,
 		u16 mtype;
 		u16 dlen;
 
-		curr_data = ife_tlv_meta_decode(tlv_data, &mtype, &dlen, NULL);
+		curr_data = ife_tlv_meta_decode(tlv_data, ifehdr_end, &mtype,
+						&dlen, NULL);
+		if (!curr_data) {
+			qstats_drop_inc(this_cpu_ptr(ife->common.cpu_qstats));
+			return TC_ACT_SHOT;
+		}
 
 		if (find_decode_metaid(skb, ife, mtype, dlen, curr_data)) {
 			/* abuse overlimits to count when we receive metadata
-- 
2.11.0

^ permalink raw reply related

* [PATCH net 3/3] net: sched: ife: check on metadata length
From: Alexander Aring @ 2018-04-18 21:35 UTC (permalink / raw)
  To: yotam.gi
  Cc: jhs, davem, xiyou.wangcong, jiri, yuvalm, netdev, kernel,
	Alexander Aring
In-Reply-To: <20180418213534.6215-1-aring@mojatatu.com>

This patch checks if sk buffer is available to dererence ife header. If
not then NULL will returned to signal an malformed ife packet. This
avoids to crashing the kernel from outside.

Signed-off-by: Alexander Aring <aring@mojatatu.com>
---
 net/ife/ife.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ife/ife.c b/net/ife/ife.c
index 8632d2685efb..7c100034fbee 100644
--- a/net/ife/ife.c
+++ b/net/ife/ife.c
@@ -70,6 +70,9 @@ void *ife_decode(struct sk_buff *skb, u16 *metalen)
 	u16 ifehdrln;
 
 	ifehdr = (struct ifeheadr *) (skb->data + skb->dev->hard_header_len);
+	if (skb->len < skb->dev->hard_header_len + IFE_METAHDRLEN)
+		return NULL;
+
 	ifehdrln = ntohs(ifehdr->metalen);
 	total_pull = skb->dev->hard_header_len + ifehdrln;
 
-- 
2.11.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox