Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH RFC net-next 00/11] udp gso
From: Alexander Duyck @ 2018-04-18 18:12 UTC (permalink / raw)
  To: David Miller
  Cc: Sowmini Varadhan, Willem de Bruijn, Samudrala, Sridhar, Netdev,
	Willem de Bruijn
In-Reply-To: <20180418.132808.1710130437020293308.davem@davemloft.net>

On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net> wrote:
> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> Date: Wed, 18 Apr 2018 08:31:03 -0400
>
>> However, I share Sridhar's concerns about the very fundamental change
>> to UDP message boundary semantics here.  There is actually no such thing
>> as a "segment" in udp, so in general this feature makes me a little
>> uneasy.  Well behaved udp applications should already be sending mtu
>> sized datagrams. And the not-so-well-behaved ones are probably relying
>> on IP fragmentation/reassembly to take care of datagram boundary semantics
>> for them?
>>
>> As Sridhar points out, the feature is not really "negotiated" - one side
>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>> implementation, it will have no way of knowing that message boundaries
>> have been re-adjusted at the sender.
>
> There are no "semantics".
>
> What ends up on the wire is the same before the kernel/app changes as
> afterwards.
>
> The only difference is that instead of the application doing N - 1
> sendmsg() calls with mtu sized writes, it's giving everything all at
> once and asking the kernel to segment.
>
> It even gives the application control over the size of the packets,
> which I think is completely prudent in this situation.

My only concern with the patch set is verifying what mitigations are
in case so that we aren't trying to set an MSS size that results in a
frame larger than MTU. I'm still digging through the code and trying
to grok it, but I figured I might just put the question out there to
may my reviewing easier.

Also any plans for HW offload support for this? I vaguely recall that
the igb and ixgbe parts had support for something like this in
hardware. I would have to double check to see what exactly is
supported.

Thanks.

- Alex

^ permalink raw reply

* Re: [PATCH 0/8] ipconfig: NTP server support, bug fixes, documentation improvements
From: Chris Novakovic @ 2018-04-18 18:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <e01df1b3-2780-dffb-9c10-b5be8792f126@chrisn.me.uk>

On 18/04/2018 19:06, Chris Novakovic wrote:
> On 18/04/2018 18:59, David Miller wrote:
>> I think a plain file named /proc/net/ntp is quite confusing.  It doesn't
>> give any indication that it's a special file populated only by ipconfig
>> and not some general NTP thing the kernel is doing.
>>
>> I would suggest creating a subdirectory like /proc/net/ipconfig or similar
>> to put such files.  Then the use and meaning is clear.
> 
> That sounds more sensible --- I'll rework patch 7 and resend. (Would you
> prefer me to post the entire series to the list again, or just that one
> patch?)

Sorry, I just realised how stupid that sounded: I can't remove patch 8
unless I repost the series... :)

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Willem de Bruijn @ 2018-04-18 18:12 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Willem de Bruijn
In-Reply-To: <20180418.135010.273233202471255234.davem@davemloft.net>

On Wed, Apr 18, 2018 at 1:50 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Tue, 17 Apr 2018 16:00:50 -0400
>
>> Segmentation offload reduces cycles/byte for large packets by
>> amortizing the cost of protocol stack traversal.
>>
>> This patchset implements GSO for UDP.
>
> This looks great.
>
> And as mentioned in other emails, the interface looks good too by letting
> the app choose the segmentation point.
>
> It's a real shame we lose checksumming with MSG_MORE, perhaps we can find
> a way to lift that restriction?

Actually, yes, I should be able to relax that constraint with
segmentation.

It is there in case a corked packet may grow to the point of
having to be fragmented. But segmentation already ensures
that its datagrams always fit in MTU.

Should be simple refinement to

            !(flags & MSG_MORE) &&

in __ip_append_data.

^ permalink raw reply

* [PATCH net-next] team: account for oper state
From: George Wilkie @ 2018-04-18 10:29 UTC (permalink / raw)
  To: Jiri Pirko, netdev

Account for operational state when determining port linkup state,
as per Documentation/networking/operstates.txt.

Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
---
 drivers/net/team/team.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index a6c6ce19eeee..231264a05e55 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
 	case NETDEV_CHANGE:
 		if (netif_running(port->dev))
 			team_port_change_check(port,
-					       !!netif_carrier_ok(port->dev));
+					       !!(netif_carrier_ok(port->dev) &&
+						  netif_oper_up(port->dev)));
 		break;
 	case NETDEV_UNREGISTER:
 		team_del_slave(port->team->dev, dev);
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net 5/5] nfp: remove false positive offloads in flower vxlan
From: Or Gerlitz @ 2018-04-18 18:18 UTC (permalink / raw)
  To: John Hurley; +Cc: Jakub Kicinski, Linux Netdev List, oss-drivers, Simon Horman
In-Reply-To: <CAK+XE=ks_z4xxx9gmV-HjpOH0KBz+qPZ4c+kB990-W9nBazj+A@mail.gmail.com>

On Wed, Apr 18, 2018 at 3:31 PM, John Hurley <john.hurley@netronome.com> wrote:
> On Wed, Apr 18, 2018 at 8:43 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Fri, Nov 17, 2017 at 4:06 AM, Jakub Kicinski
>> <jakub.kicinski@netronome.com> wrote:
>>> From: John Hurley <john.hurley@netronome.com>
>>>
>>> Pass information to the match offload on whether or not the repr is the
>>> ingress or egress dev. Only accept tunnel matches if repr is the egress dev.
>>>
>>> This means rules such as the following are successfully offloaded:
>>> tc .. add dev vxlan0 .. enc_dst_port 4789 .. action redirect dev nfp_p0
>>>
>>> While rules such as the following are rejected:
>>> tc .. add dev nfp_p0 .. enc_dst_port 4789 .. action redirect dev vxlan0
>>
>> cool
>>
>>
>>> Also reject non tunnel flows that are offloaded to an egress dev.
>>> Non tunnel matches assume that the offload dev is the ingress port and
>>> offload a match accordingly.
>>
>> not following on the "Also" here, see below
>>
>>
>>> diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>> index a0193e0c24a0..f5d73b83dcc2 100644
>>> --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>> +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>>> @@ -131,7 +131,8 @@ static bool nfp_flower_check_higher_than_mac(struct tc_cls_flower_offload *f)
>>>
>>>  static int
>>>  nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>> -                               struct tc_cls_flower_offload *flow)
>>> +                               struct tc_cls_flower_offload *flow,
>>> +                               bool egress)
>>>  {
>>>         struct flow_dissector_key_basic *mask_basic = NULL;
>>>         struct flow_dissector_key_basic *key_basic = NULL;
>>> @@ -167,6 +168,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>                         skb_flow_dissector_target(flow->dissector,
>>>                                                   FLOW_DISSECTOR_KEY_ENC_CONTROL,
>>>                                                   flow->key);
>>> +               if (!egress)
>>> +                       return -EOPNOTSUPP;
>>> +
>>>                 if (mask_enc_ctl->addr_type != 0xffff ||
>>>                     enc_ctl->addr_type != FLOW_DISSECTOR_KEY_IPV4_ADDRS)
>>>                         return -EOPNOTSUPP;
>>> @@ -194,6 +198,9 @@ nfp_flower_calculate_key_layers(struct nfp_fl_key_ls *ret_key_ls,
>>>
>>>                 key_layer |= NFP_FLOWER_LAYER_VXLAN;
>>>                 key_size += sizeof(struct nfp_flower_vxlan);
>>> +       } else if (egress) {
>>> +               /* Reject non tunnel matches offloaded to egress repr. */
>>> +               return -EOPNOTSUPP;
>>>         }
>>
>> with these two hunks we get: egress <- IFF -> encap match, right?
>>
>> (1) we can't offload the egress way if there isn't matching on encap headers
>> (2) we can't go the matching on encap headers way if we are not egress
>>
>
> yes, this is correct.
> With the block code and egdev offload, we do not have access to the
> ingress netdev when doing an offload.
> We need to use the encap headers (especially the enc_port) to
> distinguish the type of tunnel used and, therefore, require that the
> encap matches be present before offloading.
>
>> what other cases are rejected by this logic?
>>
>
> Yes, some other cases may be rejected (like veth mentioned below).

my claim is that the veth case I mentioned below will not be rejected
if it has the matching on encap headers, and a wrong rule will be set
into hw, agree?

> However, this is better than allowing rules to be incorrectly
> offloaded (as could have happened before these changes).

> Currently, we are looking at offloading flows on other ingress devices
> such as bonds so this will require a change to the driver code here.

for the ingress side, Jiri suggested that the slave devices (uplink reps),
will be just getting all the rules set on the bond, so I am not sure what
problem you see here... for decap it will be still vxlan --> vf rep and your
egress logic will allow it.

> IMO, the cleanest solution will also require tc core changes to either
> avoid egdev offload or to have access to the ingress netdev of a rule.

>> e.g If we add a rule with SW device (veth. tap) being the ingress, and
>> HW device (vf rep)
>> being the egress while not using skip_sw (just no flags == both) we
>> get the TC stack
>> go along the egdev callback from the vf rep hw device and add an
>> (uplink --> vf rep) rule
>> which will not be rejected if there is matching on tunnel headers, it
>> will also not be rejected
>> by some driver logic as the one we discussed to identify and ignore
>> rules that are attempted to being added twice.
>>
>> Or.

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: Willem de Bruijn @ 2018-04-18 18:22 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Miller, Sowmini Varadhan, Samudrala, Sridhar, Netdev,
	Willem de Bruijn
In-Reply-To: <CAKgT0UcefCpG2j7RQN8aUUgu2bud_RTFs_s+nFPY2GhKgE8L+A@mail.gmail.com>

On Wed, Apr 18, 2018 at 2:12 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Apr 18, 2018 at 10:28 AM, David Miller <davem@davemloft.net> wrote:
>> From: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> Date: Wed, 18 Apr 2018 08:31:03 -0400
>>
>>> However, I share Sridhar's concerns about the very fundamental change
>>> to UDP message boundary semantics here.  There is actually no such thing
>>> as a "segment" in udp, so in general this feature makes me a little
>>> uneasy.  Well behaved udp applications should already be sending mtu
>>> sized datagrams. And the not-so-well-behaved ones are probably relying
>>> on IP fragmentation/reassembly to take care of datagram boundary semantics
>>> for them?
>>>
>>> As Sridhar points out, the feature is not really "negotiated" - one side
>>> unilaterally sets the option. If the receiver is a classic/POSIX UDP
>>> implementation, it will have no way of knowing that message boundaries
>>> have been re-adjusted at the sender.
>>
>> There are no "semantics".
>>
>> What ends up on the wire is the same before the kernel/app changes as
>> afterwards.
>>
>> The only difference is that instead of the application doing N - 1
>> sendmsg() calls with mtu sized writes, it's giving everything all at
>> once and asking the kernel to segment.
>>
>> It even gives the application control over the size of the packets,
>> which I think is completely prudent in this situation.
>
> My only concern with the patch set is verifying what mitigations are
> in case so that we aren't trying to set an MSS size that results in a
> frame larger than MTU. I'm still digging through the code and trying
> to grok it, but I figured I might just put the question out there to
> may my reviewing easier.

This is checked in udp_send_skb in

                const int hlen = skb_network_header_len(skb) +
                                 sizeof(struct udphdr);

                if (hlen + cork->gso_size > cork->fragsize)
                        return -EINVAL;

At this point cork->fragsize will have been set in ip_setup_cork
based on device or path mtu:

        cork->fragsize = ip_sk_use_pmtu(sk) ?
                         dst_mtu(&rt->dst) : rt->dst.dev->mtu;

The feature bypasses the MTU sanity checks in __ip_append_data
by setting local variable mtu to a network layer max

        mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;

but the above should perform the same check, only later. The
check in udp_send_skb is simpler as it does not have to deal
with the case of fragmentation.

> Also any plans for HW offload support for this? I vaguely recall that
> the igb and ixgbe parts had support for something like this in
> hardware. I would have to double check to see what exactly is
> supported.

I hadn't given that much thought until the request yesterday to
expose the NETIF_F_GSO_UDP_L4 flag through ethtool. By
virtue of having only a single fixed segmentation length, it
appears reasonably straightforward to offload.

^ permalink raw reply

* Re: [PATCH bpf-next v4 03/10] bpf: btf: Check members of struct/union
From: Jakub Kicinski @ 2018-04-18 18:30 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team
In-Reply-To: <20180418180101.6tdt5hr3ned25gu3@kafai-mbp>

On Wed, 18 Apr 2018 11:01:15 -0700, Martin KaFai Lau wrote:
> On Wed, Apr 18, 2018 at 10:22:10AM -0700, Jakub Kicinski wrote:
> > On Tue, 17 Apr 2018 13:42:36 -0700, Martin KaFai Lau wrote:  
> > > This patch checks a few things of struct's members:
> > > 
> > > 1) It has a valid size (e.g. a "const void" is invalid)
> > > 2) A member's size (+ its member's offset) does not exceed
> > >    the containing struct's size.
> > > 3) The member's offset satisfies the alignment requirement  
> > 
> > Could we also introduce a requirement for members to have different
> > names?  Maybe it's there but I missed it.  Would BTF with duplicated
> > member names be considered valid?  
>
> It could check but I don't see BTF needs to check everything
> that clang does.

Agreed, I don't think correct tooling should ever generate duplicated
members.  Should the BTF forbid it then?  It could help catch bugs and
avoid problems.  I was thinking about JSON where duplicated field
names will result in invalid JSON potentially leading to issues in
user space stacks...

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Samudrala, Sridhar @ 2018-04-18 18:43 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <20180418092515.GB1989@nanopsycho>

On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>>>> This provides a generic interface for paravirtual drivers to listen
>>>> for netdev register/unregister/link change events from pci ethernet
>>>> devices with the same MAC and takeover their datapath. The notifier and
>>>> event handling code is based on the existing netvsc implementation.
>>>>
>>>> It exposes 2 sets of interfaces to the paravirtual drivers.
>>>> 1. existing netvsc driver that uses 2 netdev model. In this model, no
>>>> master netdev is created. The paravirtual driver registers each bypass
>>>> instance along with a set of ops to manage the slave events.
>>>>       bypass_master_register()
>>>>       bypass_master_unregister()
>>>> 2. new virtio_net based solution that uses 3 netdev model. In this model,
>>>> the bypass module provides interfaces to create/destroy additional master
>>>> netdev and all the slave events are managed internally.
>>>>        bypass_master_create()
>>>>        bypass_master_destroy()
>>>>
>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>> ---
>>>> include/linux/netdevice.h |  14 +
>>>> include/net/bypass.h      |  96 ++++++
>>>> net/Kconfig               |  18 +
>>>> net/core/Makefile         |   1 +
>>>> net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>>>> 5 files changed, 973 insertions(+)
>>>> create mode 100644 include/net/bypass.h
>>>> create mode 100644 net/core/bypass.c
>>>>
>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>> index cf44503ea81a..587293728f70 100644
>>>> --- a/include/linux/netdevice.h
>>>> +++ b/include/linux/netdevice.h
>>>> @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>>>> 	IFF_PHONY_HEADROOM		= 1<<24,
>>>> 	IFF_MACSEC			= 1<<25,
>>>> 	IFF_NO_RX_HANDLER		= 1<<26,
>>>> +	IFF_BYPASS			= 1 << 27,
>>>> +	IFF_BYPASS_SLAVE		= 1 << 28,
>>> I wonder, why you don't follow the existing coding style... Also, please
>>> add these to into the comment above.
>> To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> to the existing coding style to be consistent.
> Please do.
>
>
>>>
>>>> };
>>>>
>>>> #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>>>> @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>>>> #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>>>> #define IFF_MACSEC			IFF_MACSEC
>>>> #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>>>> +#define IFF_BYPASS			IFF_BYPASS
>>>> +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>>>>
>>>> /**
>>>>    *	struct net_device - The DEVICE structure.
>>>> @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>>>> 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>>>> }
>>>>
>>>> +static inline bool netif_is_bypass_master(const struct net_device *dev)
>>>> +{
>>>> +	return dev->priv_flags & IFF_BYPASS;
>>>> +}
>>>> +
>>>> +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>>>> +{
>>>> +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>>>> +}
>>>> +
>>>> /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>>>> static inline void netif_keep_dst(struct net_device *dev)
>>>> {
>>>> diff --git a/include/net/bypass.h b/include/net/bypass.h
>>>> new file mode 100644
>>>> index 000000000000..86b02cb894cf
>>>> --- /dev/null
>>>> +++ b/include/net/bypass.h
>>>> @@ -0,0 +1,96 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/* Copyright (c) 2018, Intel Corporation. */
>>>> +
>>>> +#ifndef _NET_BYPASS_H
>>>> +#define _NET_BYPASS_H
>>>> +
>>>> +#include <linux/netdevice.h>
>>>> +
>>>> +struct bypass_ops {
>>>> +	int (*slave_pre_register)(struct net_device *slave_netdev,
>>>> +				  struct net_device *bypass_netdev);
>>>> +	int (*slave_join)(struct net_device *slave_netdev,
>>>> +			  struct net_device *bypass_netdev);
>>>> +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>>>> +				    struct net_device *bypass_netdev);
>>>> +	int (*slave_release)(struct net_device *slave_netdev,
>>>> +			     struct net_device *bypass_netdev);
>>>> +	int (*slave_link_change)(struct net_device *slave_netdev,
>>>> +				 struct net_device *bypass_netdev);
>>>> +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>>>> +};
>>>> +
>>>> +struct bypass_master {
>>>> +	struct list_head list;
>>>> +	struct net_device __rcu *bypass_netdev;
>>>> +	struct bypass_ops __rcu *ops;
>>>> +};
>>>> +
>>>> +/* bypass state */
>>>> +struct bypass_info {
>>>> +	/* passthru netdev with same MAC */
>>>> +	struct net_device __rcu *active_netdev;
>>> You still use "active"/"backup" names which is highly misleading as
>>> it has completely different meaning that in bond for example.
>>> I noted that in my previous review already. Please change it.
>> I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> matches with the BACKUP feature bit we are adding to virtio_net.
> I think that "backup" is also misleading. Both "active" and "backup"
> mean a *state* of slaves. This should be named differently.
>
>
>
>> With regards to alternate names for 'active', you suggested 'stolen', but i
>> am not too happy with it.
>> netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> No. The netdev could be any netdevice. It does not have to be a "VF".
> I think "stolen" is quite appropriate since it describes the modus
> operandi. The bypass master steals some netdevice according to some
> match.
>
> But I don't insist on "stolen". Just sounds right.

We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
'backup' name is consistent.

The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.

Will look for any suggestions in the next day or two. If i don't get any, i will go
with 'stolen'

<snip>


> +
> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> +						  struct bypass_ops **ops)
> +{
> +	struct bypass_master *bypass_master;
> +	struct net_device *bypass_netdev;
> +
> +	spin_lock(&bypass_lock);
> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>>> As I wrote the last time, you don't need this list, spinlock.
>>> You can do just something like:
>>>           for_each_net(net) {
>>>                   for_each_netdev(net, dev) {
>>> 			if (netif_is_bypass_master(dev)) {
>> This function returns the upper netdev as well as the ops associated
>> with that netdev.
>> bypass_master_list is a list of 'struct bypass_master' that associates
> Well, can't you have it in netdev priv?

We cannot do this for 2-netdev model as there is no bypass_netdev created.

>
>
>> 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> NULL for 3-netdev model.
> I see :(
>
>
>>
>>>
>>>
>>>
>>>> +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>>>> +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>>>> +			*ops = rcu_dereference(bypass_master->ops);
>>> I don't see how rcu_dereference is ok here.
>>> 1) I don't see rcu_read_lock taken
>>> 2) Looks like bypass_master->ops has the same value across the whole
>>>      existence.
>> We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> Yes. ops doesn't change.
> If it does not change, you can just access it directly.
>
>
>>>
>>>> +			spin_unlock(&bypass_lock);
>>>> +			return bypass_netdev;
>>>> +		}
>>>> +	}
>>>> +	spin_unlock(&bypass_lock);
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +static int bypass_slave_register(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	int ret, orig_mtu;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>> For master, could you use word "master" in the variables so it is clear?
>>> Also, "dev" is fine instead of "netdev".
>>> Something like "bpmaster_dev"
>> bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
> I was trying to point out, that "bypass_netdev" represents a "master"
> netdev, yet it does not say master. That is why I suggested
> "bpmaster_dev"
>
>
>> I can change all _netdev suffixes to _dev to make the names shorter.
> ok.
>
>
>>
>>>
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>>>> +					bypass_ops);
>>>> +	if (ret != 0)
>>> 	Just "if (ret)" will do. You have this on more places.
>> OK.
>>
>>
>>>
>>>> +		goto done;
>>>> +
>>>> +	ret = netdev_rx_handler_register(slave_netdev,
>>>> +					 bypass_ops ? bypass_ops->handle_frame :
>>>> +					 bypass_handle_frame, bypass_netdev);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>>>> +			   ret);
>>>> +		goto done;
>>>> +	}
>>>> +
>>>> +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>>>> +			   bypass_netdev->name, ret);
>>>> +		goto upper_link_failed;
>>>> +	}
>>>> +
>>>> +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>>>> +
>>>> +	if (netif_running(bypass_netdev)) {
>>>> +		ret = dev_open(slave_netdev);
>>>> +		if (ret && (ret != -EBUSY)) {
>>>> +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>>>> +				   slave_netdev->name, ret);
>>>> +			goto err_interface_up;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* Align MTU of slave with master */
>>>> +	orig_mtu = slave_netdev->mtu;
>>>> +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>>>> +	if (ret != 0) {
>>>> +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>>>> +			   slave_netdev->name, bypass_netdev->mtu);
>>>> +		goto err_set_mtu;
>>>> +	}
>>>> +
>>>> +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>>>> +	if (ret != 0)
>>>> +		goto err_join;
>>>> +
>>>> +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +	goto done;
>>>> +
>>>> +err_join:
>>>> +	dev_set_mtu(slave_netdev, orig_mtu);
>>>> +err_set_mtu:
>>>> +	dev_close(slave_netdev);
>>>> +err_interface_up:
>>>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +upper_link_failed:
>>>> +	netdev_rx_handler_unregister(slave_netdev);
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>>>> +				       struct net_device *bypass_netdev,
>>>> +				       struct bypass_ops *bypass_ops)
>>>> +{
>>>> +	struct net_device *backup_netdev, *active_netdev;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_pre_unregister)
>>>> +			return -EINVAL;
>>>> +
>>>> +		return bypass_ops->slave_pre_unregister(slave_netdev,
>>>> +							bypass_netdev);
>>>> +	}
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> +		return -EINVAL;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int bypass_slave_release(struct net_device *slave_netdev,
>>>> +				struct net_device *bypass_netdev,
>>>> +				struct bypass_ops *bypass_ops)
>>>> +{
>>>> +	struct net_device *backup_netdev, *active_netdev;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_release)
>>>> +			return -EINVAL;
>>> I think it would be good to make the API to the driver more strict and
>>> have a separate set of ops for "active" and "backup" netdevices.
>>> That should stop people thinking about extending this to more slaves in
>>> the future.
>> We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> 'active' slave.
> I'm very well aware of that. I just thought that explicit ops for the
> two slaves would make this more clear.
>
>
>>
>>>
>>>
>>>> +
>>>> +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>>>> +	}
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev == backup_netdev) {
>>>> +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>>>> +	} else {
>>>> +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>>>> +		if (backup_netdev) {
>>>> +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>>>> +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	dev_put(slave_netdev);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +int bypass_slave_unregister(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	int ret;
>>>> +
>>>> +	if (!netif_is_bypass_slave(slave_netdev))
>>>> +		goto done;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>>>> +					  bypass_ops);
>>>> +	if (ret != 0)
>>>> +		goto done;
>>>> +
>>>> +	netdev_rx_handler_unregister(slave_netdev);
>>>> +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>>>> +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>>>> +
>>>> +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>>>> +
>>>> +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>>>> +		    slave_netdev->name);
>>>> +
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>>>> +
>>>> +static bool bypass_xmit_ready(struct net_device *dev)
>>>> +{
>>>> +	return netif_running(dev) && netif_carrier_ok(dev);
>>>> +}
>>>> +
>>>> +static int bypass_slave_link_change(struct net_device *slave_netdev)
>>>> +{
>>>> +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>>>> +	struct bypass_ops *bypass_ops;
>>>> +	struct bypass_info *bi;
>>>> +
>>>> +	if (!netif_is_bypass_slave(slave_netdev))
>>>> +		goto done;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>>>> +						&bypass_ops);
>>>> +	if (!bypass_netdev)
>>>> +		goto done;
>>>> +
>>>> +	if (bypass_ops) {
>>>> +		if (!bypass_ops->slave_link_change)
>>>> +			goto done;
>>>> +
>>>> +		return bypass_ops->slave_link_change(slave_netdev,
>>>> +						     bypass_netdev);
>>>> +	}
>>>> +
>>>> +	if (!netif_running(bypass_netdev))
>>>> +		return 0;
>>>> +
>>>> +	bi = netdev_priv(bypass_netdev);
>>>> +
>>>> +	active_netdev = rtnl_dereference(bi->active_netdev);
>>>> +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>>>> +
>>>> +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>>>> +		goto done;
>>> You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>>> above is enough.
>> I think we need this check to not allow events from a slave that is not
>> attached to this master but has the same MAC.
> Why do we need such events? Seems wrong to me.

We want to avoid events from a netdev that is mis-configured with the same MAC as
a bypass setup.

>   Consider:
>
> bp1      bp2
> a1 b1    a2 b2
>
>
> a1 and a2 have the same mac and bp1 and bp2 have the same mac.

We should not have 2 bypass configs with the same MAC.
I need to add a check in the bypass_master_register() to prevent this.

The above check is to avoid cases where we have
bp1(a1, b1) with mac1
and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.

> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> the order of creation.
> Let's say it will return bp1. Then when we have event for a2, the
> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>
>
> You cannot use bypass_master_get_bymac() here.
>
>
>
>>>
>>>> +
>>>> +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>>>> +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>>>> +		netif_carrier_on(bypass_netdev);
>>>> +		netif_tx_wake_all_queues(bypass_netdev);
>>>> +	} else {
>>>> +		netif_carrier_off(bypass_netdev);
>>>> +		netif_tx_stop_all_queues(bypass_netdev);
>>>> +	}
>>>> +
>>>> +done:
>>>> +	return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static bool bypass_validate_event_dev(struct net_device *dev)
>>>> +{
>>>> +	/* Skip parent events */
>>>> +	if (netif_is_bypass_master(dev))
>>>> +		return false;
>>>> +
>>>> +	/* Avoid non-Ethernet type devices */
>>>> +	if (dev->type != ARPHRD_ETHER)
>>>> +		return false;
>>>> +
>>>> +	/* Avoid Vlan dev with same MAC registering as VF */
>>>> +	if (is_vlan_dev(dev))
>>>> +		return false;
>>>> +
>>>> +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>>>> +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>>> Yeah, this is certainly incorrect. One thing is, you should be using the
>>> helpers netif_is_bond_master().
>>> But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>>>
>>> You need to do it not by blacklisting, but with whitelisting. You need
>>> to whitelist VF devices. My port flavours patchset might help with this.
>> May be i can use netdev_has_lower_dev() helper to make sure that the slave
> I don't see such function in the code.

It is netdev_has_any_lower_dev(). I need to export it.

>
>
>> device is not an upper dev.
>> Can you point to your port flavours patchset? Is it upstream?
> I sent rfc couple of weeks ago:
> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation

^ permalink raw reply

* [PATCH net-next] net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
From: Eric Dumazet @ 2018-04-18 18:43 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

After working on IP defragmentation lately, I found that some large
packets defeat CHECKSUM_COMPLETE optimization because of NIC adding
zero paddings on the last (small) fragment.

While removing the padding with pskb_trim_rcsum(), we set skb->ip_summed
to CHECKSUM_NONE, forcing a full csum validation, even if all prior
fragments had CHECKSUM_COMPLETE set.

We can instead compute the checksum of the part we are trimming,
usually smaller than the part we keep.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |  5 ++---
 net/core/skbuff.c      | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9065477ed255a48f7e01b8a28ea6321cce9127f5..d274059529eb5216d041dfdcad4a564a623c8ea0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3131,6 +3131,7 @@ static inline void *skb_push_rcsum(struct sk_buff *skb, unsigned int len)
 	return skb->data;
 }
 
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len);
 /**
  *	pskb_trim_rcsum - trim received skb and update checksum
  *	@skb: buffer to trim
@@ -3144,9 +3145,7 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 {
 	if (likely(len >= skb->len))
 		return 0;
-	if (skb->ip_summed == CHECKSUM_COMPLETE)
-		skb->ip_summed = CHECKSUM_NONE;
-	return __pskb_trim(skb, len);
+	return pskb_trim_rcsum_slow(skb, len);
 }
 
 static inline int __skb_trim_rcsum(struct sk_buff *skb, unsigned int len)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b51837ca80bb709bfffe04d58eedbba0b9907..ff49e352deea1d07049f5c5ac425fbcdb4fd96a8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1839,6 +1839,20 @@ int ___pskb_trim(struct sk_buff *skb, unsigned int len)
 }
 EXPORT_SYMBOL(___pskb_trim);
 
+/* Note : use pskb_trim_rcsum() instead of calling this directly
+ */
+int pskb_trim_rcsum_slow(struct sk_buff *skb, unsigned int len)
+{
+	if (skb->ip_summed == CHECKSUM_COMPLETE) {
+		int delta = skb->len - len;
+
+		skb->csum = csum_sub(skb->csum,
+				     skb_checksum(skb, len, delta, 0));
+	}
+	return __pskb_trim(skb, len);
+}
+EXPORT_SYMBOL(pskb_trim_rcsum_slow);
+
 /**
  *	__pskb_pull_tail - advance tail of skb header
  *	@skb: buffer to reallocate
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related

* Re: [RFC PATCH V1 01/12] audit: add container id
From: Stefan Berger @ 2018-04-18 18:45 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, mszeredi, luto, jlayton, carlos,
	viro, dhowells, simo, trondmy, eparis, serge, ebiederm, madzcar
In-Reply-To: <20180316035837.ddnqvbyrbp3fdk7e@madcap2.tricolour.ca>

On 03/15/2018 11:58 PM, Richard Guy Briggs wrote:
> On 2018-03-15 16:27, Stefan Berger wrote:
>> On 03/01/2018 02:41 PM, Richard Guy Briggs wrote:
>>> Implement the proc fs write to set the audit container ID of a process,
>>> emitting an AUDIT_CONTAINER record to document the event.
>>>
>>> This is a write from the container orchestrator task to a proc entry of
>>> the form /proc/PID/containerid where PID is the process ID of the newly
>>> created task that is to become the first task in a container, or an
>>> additional task added to a container.
>>>
>>> The write expects up to a u64 value (unset: 18446744073709551615).
>>>
>>> This will produce a record such as this:
>>> type=UNKNOWN[1333] msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>>>
>>> The "op" field indicates an initial set.  The "pid" to "ses" fields are
>>> the orchestrator while the "opid" field is the object's PID, the process
>>> being "contained".  Old and new container ID values are given in the
>>> "contid" fields, while res indicates its success.
>>>
>>> It is not permitted to self-set, unset or re-set the container ID.  A
>>> child inherits its parent's container ID, but then can be set only once
>>> after.
>>>
>>> See: https://github.com/linux-audit/audit-kernel/issues/32
>>>
>>>
>>>    /* audit_rule_data supports filter rules with both integer and string
>>>     * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>>> index 4e0a4ac..0ee1e59 100644
>>> --- a/kernel/auditsc.c
>>> +++ b/kernel/auditsc.c
>>> @@ -2073,6 +2073,92 @@ int audit_set_loginuid(kuid_t loginuid)
>>>    	return rc;
>>>    }
>>>
>>> +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
>>> +{
>>> +	struct task_struct *parent;
>>> +	u64 pcontainerid, ccontainerid;
>>> +	pid_t ppid;
>>> +
>>> +	/* Don't allow to set our own containerid */
>>> +	if (current == task)
>>> +		return -EPERM;
>>> +	/* Don't allow the containerid to be unset */
>>> +	if (!cid_valid(containerid))
>>> +		return -EINVAL;
>>> +	/* if we don't have caps, reject */
>>> +	if (!capable(CAP_AUDIT_CONTROL))
>>> +		return -EPERM;
>>> +	/* if containerid is unset, allow */
>>> +	if (!audit_containerid_set(task))
>>> +		return 0;
>> I am wondering whether there should be a check for the target process that
>> will receive the containerid to not have CAP_SYS_ADMIN that would otherwise
>> allow it to arbitrarily unshare()/clone() and leave the set of namespaces
>> that may make up the container whose containerid we assign here?
> This is a reasonable question.  This has been debated and I understood
> the conclusion was that without a clear definition of a "container", the
> task still remains in that container that just now has more
> sub-namespaces (in the case of hierarchical namespaces), we don't want
> to restrict it in such a way and that allows it to create nested
> containers.  I see setns being more problematic if it could switch to
> another existing namespace that was set up by the orchestrator for a
> different container.  The coming v2 patchset acknowledges this situation
> with the network namespace being potentially shared by multiple
> containers.

Are you going to post v2 soon? We would like to build on top of it for 
IMA namespacing and auditing inside of IMA namespaces.

    Stefan

^ permalink raw reply

* [Patch net] llc: hold llc_sap before release_sock()
From: Cong Wang @ 2018-04-18 18:51 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang

syzbot reported we still access llc->sap in llc_backlog_rcv()
after it is freed in llc_sap_remove_socket():

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
 llc_conn_ac_send_sabme_cmd_p_set_x+0x3a8/0x460 net/llc/llc_c_ac.c:785
 llc_exec_conn_trans_actions net/llc/llc_conn.c:475 [inline]
 llc_conn_service net/llc/llc_conn.c:400 [inline]
 llc_conn_state_process+0x4e1/0x13a0 net/llc/llc_conn.c:75
 llc_backlog_rcv+0x195/0x1e0 net/llc/llc_conn.c:891
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x12f/0x3a0 net/core/sock.c:2335
 release_sock+0xa4/0x2b0 net/core/sock.c:2850
 llc_ui_release+0xc8/0x220 net/llc/af_llc.c:204

llc->sap is refcount'ed and llc_sap_remove_socket() is paired
with llc_sap_add_socket(). This can be amended by holding its refcount
before llc_sap_remove_socket() and releasing it after release_sock().

Reported-by: <syzbot+6e181fc95081c2cf9051@syzkaller.appspotmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/llc/af_llc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 01dcc0823d1f..6d29b2b94e84 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -189,6 +189,7 @@ static int llc_ui_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
 	struct llc_sock *llc;
+	struct llc_sap *sap;
 
 	if (unlikely(sk == NULL))
 		goto out;
@@ -199,9 +200,15 @@ static int llc_ui_release(struct socket *sock)
 		llc->laddr.lsap, llc->daddr.lsap);
 	if (!llc_send_disc(sk))
 		llc_ui_wait_for_disc(sk, sk->sk_rcvtimeo);
+	sap = llc->sap;
+	/* Hold this for release_sock(), so that llc_backlog_rcv() could still
+	 * use it.
+	 */
+	llc_sap_hold(sap);
 	if (!sock_flag(sk, SOCK_ZAPPED))
 		llc_sap_remove_socket(llc->sap, sk);
 	release_sock(sk);
+	llc_sap_put(sap);
 	if (llc->dev)
 		dev_put(llc->dev);
 	sock_put(sk);
-- 
2.13.0

^ permalink raw reply related

* sendmmsg flags userspace ABI change in kernel 4.6
From: Florian Weimer @ 2018-04-18 19:03 UTC (permalink / raw)
  To: linux-api, netdev, linux-sctp; +Cc: linux-kernel, Tom Herbert, mjw

Since this commit:

commit 28a94d8fb35b3a75b802f368ae6f4a9f6b0d435a
Author: Tom Herbert <tom@herbertland.com>
Date:   Mon Mar 7 14:11:02 2016 -0800

    net: Allow MSG_EOR in each msghdr of sendmmsg

    This patch allows setting MSG_EOR in each individual msghdr passed
    in sendmmsg. This allows a sendmmsg to send multiple messages when
    using SOCK_SEQPACKET.

    Signed-off-by: Tom Herbert <tom@herbertland.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

the msg_flags argument in individual msghdr arguments is longer
completely ignored for SOCK_SEQPACKET sockets.  msg_flags was and is
still documented as ignored for sendmsg(2), so by analogy for
sendmmsg(2) as well.

It seems that valgrind does not know about this yet, and due to
limited use of SCTP, this userspace ABI change has not been noticed so
far.

What are the plans in this area?  Will other kinds of sockets start
using the msghdr flags for sending?

A fully backwards-compatibility way to achieve this would be to
specify that you have to pass a new flag to sendmmsg (MSG_PERHDR?), in
its flags argument, to activate the per-msghdr flags.

The glibc DNS stub resolver relies on the previously documented
behavior, and I wonder how widely we should backport the change:

  https://sourceware.org/bugzilla/show_bug.cgi?id=23037

If the MSG_PERHDR route will be taken, we can skip this work, and
valgrind can flag uninitialized bits in msg_flags only if MSG_PERHDR
is passed.  (I believe it would be difficult for valgrind to look at
the socket type to determine whether undefined bits need reporting.)

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-18 19:13 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh
In-Reply-To: <dd0d53f0-f5da-cb7d-f8f6-d0c8245eb3cf@intel.com>

Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> > > > This provides a generic interface for paravirtual drivers to listen
>> > > > for netdev register/unregister/link change events from pci ethernet
>> > > > devices with the same MAC and takeover their datapath. The notifier and
>> > > > event handling code is based on the existing netvsc implementation.
>> > > > 
>> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> > > > master netdev is created. The paravirtual driver registers each bypass
>> > > > instance along with a set of ops to manage the slave events.
>> > > >       bypass_master_register()
>> > > >       bypass_master_unregister()
>> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> > > > the bypass module provides interfaces to create/destroy additional master
>> > > > netdev and all the slave events are managed internally.
>> > > >        bypass_master_create()
>> > > >        bypass_master_destroy()
>> > > > 
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > > > ---
>> > > > include/linux/netdevice.h |  14 +
>> > > > include/net/bypass.h      |  96 ++++++
>> > > > net/Kconfig               |  18 +
>> > > > net/core/Makefile         |   1 +
>> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> > > > 5 files changed, 973 insertions(+)
>> > > > create mode 100644 include/net/bypass.h
>> > > > create mode 100644 net/core/bypass.c
>> > > > 
>> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > > > index cf44503ea81a..587293728f70 100644
>> > > > --- a/include/linux/netdevice.h
>> > > > +++ b/include/linux/netdevice.h
>> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
>> > > > 	IFF_MACSEC			= 1<<25,
>> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
>> > > > +	IFF_BYPASS			= 1 << 27,
>> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> > > I wonder, why you don't follow the existing coding style... Also, please
>> > > add these to into the comment above.
>> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> > to the existing coding style to be consistent.
>> Please do.
>> 
>> 
>> > > 
>> > > > };
>> > > > 
>> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> > > > #define IFF_MACSEC			IFF_MACSEC
>> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> > > > +#define IFF_BYPASS			IFF_BYPASS
>> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> > > > 
>> > > > /**
>> > > >    *	struct net_device - The DEVICE structure.
>> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> > > > }
>> > > > 
>> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> > > > +{
>> > > > +	return dev->priv_flags & IFF_BYPASS;
>> > > > +}
>> > > > +
>> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> > > > +{
>> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> > > > +}
>> > > > +
>> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> > > > static inline void netif_keep_dst(struct net_device *dev)
>> > > > {
>> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> > > > new file mode 100644
>> > > > index 000000000000..86b02cb894cf
>> > > > --- /dev/null
>> > > > +++ b/include/net/bypass.h
>> > > > @@ -0,0 +1,96 @@
>> > > > +// SPDX-License-Identifier: GPL-2.0
>> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> > > > +
>> > > > +#ifndef _NET_BYPASS_H
>> > > > +#define _NET_BYPASS_H
>> > > > +
>> > > > +#include <linux/netdevice.h>
>> > > > +
>> > > > +struct bypass_ops {
>> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> > > > +				  struct net_device *bypass_netdev);
>> > > > +	int (*slave_join)(struct net_device *slave_netdev,
>> > > > +			  struct net_device *bypass_netdev);
>> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> > > > +				    struct net_device *bypass_netdev);
>> > > > +	int (*slave_release)(struct net_device *slave_netdev,
>> > > > +			     struct net_device *bypass_netdev);
>> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> > > > +				 struct net_device *bypass_netdev);
>> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> > > > +};
>> > > > +
>> > > > +struct bypass_master {
>> > > > +	struct list_head list;
>> > > > +	struct net_device __rcu *bypass_netdev;
>> > > > +	struct bypass_ops __rcu *ops;
>> > > > +};
>> > > > +
>> > > > +/* bypass state */
>> > > > +struct bypass_info {
>> > > > +	/* passthru netdev with same MAC */
>> > > > +	struct net_device __rcu *active_netdev;
>> > > You still use "active"/"backup" names which is highly misleading as
>> > > it has completely different meaning that in bond for example.
>> > > I noted that in my previous review already. Please change it.
>> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> > matches with the BACKUP feature bit we are adding to virtio_net.
>> I think that "backup" is also misleading. Both "active" and "backup"
>> mean a *state* of slaves. This should be named differently.
>> 
>> 
>> 
>> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> > am not too happy with it.
>> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> No. The netdev could be any netdevice. It does not have to be a "VF".
>> I think "stolen" is quite appropriate since it describes the modus
>> operandi. The bypass master steals some netdevice according to some
>> match.
>> 
>> But I don't insist on "stolen". Just sounds right.
>
>We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>'backup' name is consistent.

It perhaps makes sense from the view of virtio device. However, as I
described couple of times, for master/slave device the name "backup" is
highly misleading.


>
>The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>
>Will look for any suggestions in the next day or two. If i don't get any, i will go
>with 'stolen'
>
><snip>
>
>
>> +
>> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> +						  struct bypass_ops **ops)
>> +{
>> +	struct bypass_master *bypass_master;
>> +	struct net_device *bypass_netdev;
>> +
>> +	spin_lock(&bypass_lock);
>> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> > > As I wrote the last time, you don't need this list, spinlock.
>> > > You can do just something like:
>> > >           for_each_net(net) {
>> > >                   for_each_netdev(net, dev) {
>> > > 			if (netif_is_bypass_master(dev)) {
>> > This function returns the upper netdev as well as the ops associated
>> > with that netdev.
>> > bypass_master_list is a list of 'struct bypass_master' that associates
>> Well, can't you have it in netdev priv?
>
>We cannot do this for 2-netdev model as there is no bypass_netdev created.

Howcome? You have no master? I don't understand..



>
>> 
>> 
>> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> > NULL for 3-netdev model.
>> I see :(
>> 
>> 
>> > 
>> > > 
>> > > 
>> > > 
>> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> > > > +			*ops = rcu_dereference(bypass_master->ops);
>> > > I don't see how rcu_dereference is ok here.
>> > > 1) I don't see rcu_read_lock taken
>> > > 2) Looks like bypass_master->ops has the same value across the whole
>> > >      existence.
>> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> > Yes. ops doesn't change.
>> If it does not change, you can just access it directly.
>> 
>> 
>> > > 
>> > > > +			spin_unlock(&bypass_lock);
>> > > > +			return bypass_netdev;
>> > > > +		}
>> > > > +	}
>> > > > +	spin_unlock(&bypass_lock);
>> > > > +	return NULL;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	int ret, orig_mtu;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > For master, could you use word "master" in the variables so it is clear?
>> > > Also, "dev" is fine instead of "netdev".
>> > > Something like "bpmaster_dev"
>> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
>> I was trying to point out, that "bypass_netdev" represents a "master"
>> netdev, yet it does not say master. That is why I suggested
>> "bpmaster_dev"
>> 
>> 
>> > I can change all _netdev suffixes to _dev to make the names shorter.
>> ok.
>> 
>> 
>> > 
>> > > 
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> > > > +					bypass_ops);
>> > > > +	if (ret != 0)
>> > > 	Just "if (ret)" will do. You have this on more places.
>> > OK.
>> > 
>> > 
>> > > 
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = netdev_rx_handler_register(slave_netdev,
>> > > > +					 bypass_ops ? bypass_ops->handle_frame :
>> > > > +					 bypass_handle_frame, bypass_netdev);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> > > > +			   ret);
>> > > > +		goto done;
>> > > > +	}
>> > > > +
>> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> > > > +			   bypass_netdev->name, ret);
>> > > > +		goto upper_link_failed;
>> > > > +	}
>> > > > +
>> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> > > > +
>> > > > +	if (netif_running(bypass_netdev)) {
>> > > > +		ret = dev_open(slave_netdev);
>> > > > +		if (ret && (ret != -EBUSY)) {
>> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> > > > +				   slave_netdev->name, ret);
>> > > > +			goto err_interface_up;
>> > > > +		}
>> > > > +	}
>> > > > +
>> > > > +	/* Align MTU of slave with master */
>> > > > +	orig_mtu = slave_netdev->mtu;
>> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> > > > +	if (ret != 0) {
>> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> > > > +			   slave_netdev->name, bypass_netdev->mtu);
>> > > > +		goto err_set_mtu;
>> > > > +	}
>> > > > +
>> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> > > > +	if (ret != 0)
>> > > > +		goto err_join;
>> > > > +
>> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +	goto done;
>> > > > +
>> > > > +err_join:
>> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
>> > > > +err_set_mtu:
>> > > > +	dev_close(slave_netdev);
>> > > > +err_interface_up:
>> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +upper_link_failed:
>> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> > > > +				       struct net_device *bypass_netdev,
>> > > > +				       struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > +	struct net_device *backup_netdev, *active_netdev;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_pre_unregister)
>> > > > +			return -EINVAL;
>> > > > +
>> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> > > > +							bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > +		return -EINVAL;
>> > > > +
>> > > > +	return 0;
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> > > > +				struct net_device *bypass_netdev,
>> > > > +				struct bypass_ops *bypass_ops)
>> > > > +{
>> > > > +	struct net_device *backup_netdev, *active_netdev;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_release)
>> > > > +			return -EINVAL;
>> > > I think it would be good to make the API to the driver more strict and
>> > > have a separate set of ops for "active" and "backup" netdevices.
>> > > That should stop people thinking about extending this to more slaves in
>> > > the future.
>> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> > 'active' slave.
>> I'm very well aware of that. I just thought that explicit ops for the
>> two slaves would make this more clear.
>> 
>> 
>> > 
>> > > 
>> > > 
>> > > > +
>> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev == backup_netdev) {
>> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> > > > +	} else {
>> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> > > > +		if (backup_netdev) {
>> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> > > > +		}
>> > > > +	}
>> > > > +
>> > > > +	dev_put(slave_netdev);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +	return 0;
>> > > > +}
>> > > > +
>> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	int ret;
>> > > > +
>> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> > > > +		goto done;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> > > > +					  bypass_ops);
>> > > > +	if (ret != 0)
>> > > > +		goto done;
>> > > > +
>> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> > > > +
>> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> > > > +
>> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> > > > +		    slave_netdev->name);
>> > > > +
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> > > > +
>> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> > > > +{
>> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
>> > > > +}
>> > > > +
>> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> > > > +{
>> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> > > > +	struct bypass_ops *bypass_ops;
>> > > > +	struct bypass_info *bi;
>> > > > +
>> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> > > > +		goto done;
>> > > > +
>> > > > +	ASSERT_RTNL();
>> > > > +
>> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> > > > +						&bypass_ops);
>> > > > +	if (!bypass_netdev)
>> > > > +		goto done;
>> > > > +
>> > > > +	if (bypass_ops) {
>> > > > +		if (!bypass_ops->slave_link_change)
>> > > > +			goto done;
>> > > > +
>> > > > +		return bypass_ops->slave_link_change(slave_netdev,
>> > > > +						     bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +	if (!netif_running(bypass_netdev))
>> > > > +		return 0;
>> > > > +
>> > > > +	bi = netdev_priv(bypass_netdev);
>> > > > +
>> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> > > > +
>> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> > > > +		goto done;
>> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> > > above is enough.
>> > I think we need this check to not allow events from a slave that is not
>> > attached to this master but has the same MAC.
>> Why do we need such events? Seems wrong to me.
>
>We want to avoid events from a netdev that is mis-configured with the same MAC as
>a bypass setup.
>
>>   Consider:
>> 
>> bp1      bp2
>> a1 b1    a2 b2
>> 
>> 
>> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>
>We should not have 2 bypass configs with the same MAC.
>I need to add a check in the bypass_master_register() to prevent this.

Mac can change, you would have to check in change as well. Feels odd
thought. 


>
>The above check is to avoid cases where we have
>bp1(a1, b1) with mac1
>and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>
>> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> the order of creation.
>> Let's say it will return bp1. Then when we have event for a2, the
>> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> 
>> 
>> You cannot use bypass_master_get_bymac() here.
>> 
>> 
>> 
>> > > 
>> > > > +
>> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> > > > +		netif_carrier_on(bypass_netdev);
>> > > > +		netif_tx_wake_all_queues(bypass_netdev);
>> > > > +	} else {
>> > > > +		netif_carrier_off(bypass_netdev);
>> > > > +		netif_tx_stop_all_queues(bypass_netdev);
>> > > > +	}
>> > > > +
>> > > > +done:
>> > > > +	return NOTIFY_DONE;
>> > > > +}
>> > > > +
>> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> > > > +{
>> > > > +	/* Skip parent events */
>> > > > +	if (netif_is_bypass_master(dev))
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid non-Ethernet type devices */
>> > > > +	if (dev->type != ARPHRD_ETHER)
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
>> > > > +	if (is_vlan_dev(dev))
>> > > > +		return false;
>> > > > +
>> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> > > helpers netif_is_bond_master().
>> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> > > 
>> > > You need to do it not by blacklisting, but with whitelisting. You need
>> > > to whitelist VF devices. My port flavours patchset might help with this.
>> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> I don't see such function in the code.
>
>It is netdev_has_any_lower_dev(). I need to export it.

Come on, you cannot use that. That would allow bonding without slaves,
but the slaves could be added later on.

What exactly you are trying to achieve by this?


>
>> 
>> 
>> > device is not an upper dev.
>> > Can you point to your port flavours patchset? Is it upstream?
>> I sent rfc couple of weeks ago:
>> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>
>
>

^ permalink raw reply

* Re: [PATCH net-next] team: account for oper state
From: Jiri Pirko @ 2018-04-18 19:17 UTC (permalink / raw)
  To: George Wilkie; +Cc: netdev
In-Reply-To: <20180418153312.24h7mvle6sy2dv25@debian9.gwilkie>

Wed, Apr 18, 2018 at 05:33:12PM CEST, gwilkie@vyatta.att-mail.com wrote:
>On Wed, Apr 18, 2018 at 04:58:22PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 03:35:49PM CEST, gwilkie@vyatta.att-mail.com wrote:
>> >On Wed, Apr 18, 2018 at 02:56:44PM +0200, Jiri Pirko wrote:
>> >> Wed, Apr 18, 2018 at 12:29:50PM CEST, gwilkie@vyatta.att-mail.com wrote:
>> >> >Account for operational state when determining port linkup state,
>> >> >as per Documentation/networking/operstates.txt.
>> >> 
>> >> Could you please point me to the exact place in the document where this
>> >> is suggested?
>> >> 
>> >
>> >Various places cover it I think.
>> >
>> >In 1. Introduction:
>> >"interface is not usable just because the admin enabled it"
>> >"userspace must be granted the possibility to
>> >influence operational state"
>> >
>> >In 4. Setting from userspace:
>> >"the userspace application can set IFLA_OPERSTATE
>> >to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
>> >netif_carrier_off() or netif_dormant_on()"
>> >
>> >We have a use case where we want to set the oper state of the team ports based
>> >on whether they are actually usable or not (as opposed to just admin up).
>> 
>> Are you running a supplicant there or what is the use-case?
>> 
>
>We are using tun/tap interfaces for the team ports with the physical interfaces
>under the control of a user process.
>
>> How is this handle in other drivers like bond, openvswitch, bridge, etc?
>
>It looks like bridge is using it, looking at br_port_carrier_check() and
>br_add_if().

Okay, so why do you still need to check netif_carrier_ok?
Looks like netif_oper_up is enough, right?


>
>Cheers.
>
>> 
>> >
>> >Cheers.
>> >
>> >> 
>> >> >
>> >> >Signed-off-by: George Wilkie <gwilkie@vyatta.att-mail.com>
>> >> >---
>> >> > drivers/net/team/team.c | 3 ++-
>> >> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >> >
>> >> >diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>> >> >index a6c6ce19eeee..231264a05e55 100644
>> >> >--- a/drivers/net/team/team.c
>> >> >+++ b/drivers/net/team/team.c
>> >> >@@ -2918,7 +2918,8 @@ static int team_device_event(struct notifier_block *unused,
>> >> > 	case NETDEV_CHANGE:
>> >> > 		if (netif_running(port->dev))
>> >> > 			team_port_change_check(port,
>> >> >-					       !!netif_carrier_ok(port->dev));
>> >> >+					       !!(netif_carrier_ok(port->dev) &&
>> >> >+						  netif_oper_up(port->dev)));
>> >> > 		break;
>> >> > 	case NETDEV_UNREGISTER:
>> >> > 		team_del_slave(port->team->dev, dev);
>> >> >-- 
>> >> >2.11.0
>> >> >

^ permalink raw reply

* Re: [PATCH net-next 2/2] udp: implement and use per cpu rx skbs cache
From: Eric Dumazet @ 2018-04-18 19:21 UTC (permalink / raw)
  To: Paolo Abeni, netdev; +Cc: David S. Miller
In-Reply-To: <1524071712.2599.60.camel@redhat.com>

On 04/18/2018 10:15 AM, Paolo Abeni wrote:
is not appealing to me :/
> 
> Thank you for the feedback.
> Sorry for not being clear about it, but knotd is using SO_REUSEPORT and
> the above tests are leveraging it.
> 
> That 5% is on top of that 300%.

Then there is something wrong.

Adding copies should not increase performance.

If it does, there is certainly another way, reaching 10% instead of 5%

^ permalink raw reply

* Re: [RFC PATCH V1 01/12] audit: add container id
From: Richard Guy Briggs @ 2018-04-18 19:23 UTC (permalink / raw)
  To: Stefan Berger
  Cc: mszeredi, ebiederm, simo, jlayton, carlos, linux-api, containers,
	LKML, eparis, dhowells, Linux-Audit Mailing List, viro, luto,
	netdev, linux-fsdevel, cgroups, serge, trondmy
In-Reply-To: <f966fa52-da4b-3d74-0848-1f0b08e57fd9@linux.vnet.ibm.com>

On 2018-04-18 14:45, Stefan Berger wrote:
> On 03/15/2018 11:58 PM, Richard Guy Briggs wrote:
> > On 2018-03-15 16:27, Stefan Berger wrote:
> > > On 03/01/2018 02:41 PM, Richard Guy Briggs wrote:
> > > > Implement the proc fs write to set the audit container ID of a process,
> > > > emitting an AUDIT_CONTAINER record to document the event.
> > > > 
> > > > This is a write from the container orchestrator task to a proc entry of
> > > > the form /proc/PID/containerid where PID is the process ID of the newly
> > > > created task that is to become the first task in a container, or an
> > > > additional task added to a container.
> > > > 
> > > > The write expects up to a u64 value (unset: 18446744073709551615).
> > > > 
> > > > This will produce a record such as this:
> > > > type=UNKNOWN[1333] msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> > > > 
> > > > The "op" field indicates an initial set.  The "pid" to "ses" fields are
> > > > the orchestrator while the "opid" field is the object's PID, the process
> > > > being "contained".  Old and new container ID values are given in the
> > > > "contid" fields, while res indicates its success.
> > > > 
> > > > It is not permitted to self-set, unset or re-set the container ID.  A
> > > > child inherits its parent's container ID, but then can be set only once
> > > > after.
> > > > 
> > > > See: https://github.com/linux-audit/audit-kernel/issues/32
> > > > 
> > > > 
> > > >    /* audit_rule_data supports filter rules with both integer and string
> > > >     * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > > index 4e0a4ac..0ee1e59 100644
> > > > --- a/kernel/auditsc.c
> > > > +++ b/kernel/auditsc.c
> > > > @@ -2073,6 +2073,92 @@ int audit_set_loginuid(kuid_t loginuid)
> > > >    	return rc;
> > > >    }
> > > > 
> > > > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
> > > > +{
> > > > +	struct task_struct *parent;
> > > > +	u64 pcontainerid, ccontainerid;
> > > > +	pid_t ppid;
> > > > +
> > > > +	/* Don't allow to set our own containerid */
> > > > +	if (current == task)
> > > > +		return -EPERM;
> > > > +	/* Don't allow the containerid to be unset */
> > > > +	if (!cid_valid(containerid))
> > > > +		return -EINVAL;
> > > > +	/* if we don't have caps, reject */
> > > > +	if (!capable(CAP_AUDIT_CONTROL))
> > > > +		return -EPERM;
> > > > +	/* if containerid is unset, allow */
> > > > +	if (!audit_containerid_set(task))
> > > > +		return 0;
> > > I am wondering whether there should be a check for the target process that
> > > will receive the containerid to not have CAP_SYS_ADMIN that would otherwise
> > > allow it to arbitrarily unshare()/clone() and leave the set of namespaces
> > > that may make up the container whose containerid we assign here?
> > This is a reasonable question.  This has been debated and I understood
> > the conclusion was that without a clear definition of a "container", the
> > task still remains in that container that just now has more
> > sub-namespaces (in the case of hierarchical namespaces), we don't want
> > to restrict it in such a way and that allows it to create nested
> > containers.  I see setns being more problematic if it could switch to
> > another existing namespace that was set up by the orchestrator for a
> > different container.  The coming v2 patchset acknowledges this situation
> > with the network namespace being potentially shared by multiple
> > containers.
> 
> Are you going to post v2 soon? We would like to build on top of it for IMA
> namespacing and auditing inside of IMA namespaces.

I don't know if it addresses your specific needs, but V2 was posted on
March 16th along with userspace patches:
	https://www.redhat.com/archives/linux-audit/2018-March/msg00110.html
	https://www.redhat.com/archives/linux-audit/2018-March/msg00124.html

V3 is pending.

>    Stefan

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH RFC net-next 00/11] udp gso
From: David Miller @ 2018-04-18 19:33 UTC (permalink / raw)
  To: alexander.duyck
  Cc: sowmini.varadhan, willemdebruijn.kernel, sridhar.samudrala,
	netdev, willemb
In-Reply-To: <CAKgT0UcefCpG2j7RQN8aUUgu2bud_RTFs_s+nFPY2GhKgE8L+A@mail.gmail.com>

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Wed, 18 Apr 2018 11:12:06 -0700

> My only concern with the patch set is verifying what mitigations are
> in case so that we aren't trying to set an MSS size that results in a
> frame larger than MTU. I'm still digging through the code and trying
> to grok it, but I figured I might just put the question out there to
> may my reviewing easier.

It signals an error if a too large segment size is requested.

^ permalink raw reply

* Re: [RFC PATCH V1 01/12] audit: add container id
From: Stefan Berger @ 2018-04-18 19:39 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: mszeredi, ebiederm, simo, jlayton, carlos, linux-api, containers,
	LKML, eparis, dhowells, Linux-Audit Mailing List, viro, luto,
	netdev, linux-fsdevel, cgroups, serge, trondmy
In-Reply-To: <20180418192359.n4q53bvsdhrjftjg@madcap2.tricolour.ca>

On 04/18/2018 03:23 PM, Richard Guy Briggs wrote:
> On 2018-04-18 14:45, Stefan Berger wrote:
>> On 03/15/2018 11:58 PM, Richard Guy Briggs wrote:
>>> On 2018-03-15 16:27, Stefan Berger wrote:
>>>> On 03/01/2018 02:41 PM, Richard Guy Briggs wrote:
>>>>> Implement the proc fs write to set the audit container ID of a process,
>>>>> emitting an AUDIT_CONTAINER record to document the event.
>>>>>
>>>>> This is a write from the container orchestrator task to a proc entry of
>>>>> the form /proc/PID/containerid where PID is the process ID of the newly
>>>>> created task that is to become the first task in a container, or an
>>>>> additional task added to a container.
>>>>>
>>>>> The write expects up to a u64 value (unset: 18446744073709551615).
>>>>>
>>>>> This will produce a record such as this:
>>>>> type=UNKNOWN[1333] msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>>>>>
>>>>> The "op" field indicates an initial set.  The "pid" to "ses" fields are
>>>>> the orchestrator while the "opid" field is the object's PID, the process
>>>>> being "contained".  Old and new container ID values are given in the
>>>>> "contid" fields, while res indicates its success.
>>>>>
>>>>> It is not permitted to self-set, unset or re-set the container ID.  A
>>>>> child inherits its parent's container ID, but then can be set only once
>>>>> after.
>>>>>
>>>>> See: https://github.com/linux-audit/audit-kernel/issues/32
>>>>>
>>>>>
>>>>>     /* audit_rule_data supports filter rules with both integer and string
>>>>>      * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>>>>> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>>>>> index 4e0a4ac..0ee1e59 100644
>>>>> --- a/kernel/auditsc.c
>>>>> +++ b/kernel/auditsc.c
>>>>> @@ -2073,6 +2073,92 @@ int audit_set_loginuid(kuid_t loginuid)
>>>>>     	return rc;
>>>>>     }
>>>>>
>>>>> +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
>>>>> +{
>>>>> +	struct task_struct *parent;
>>>>> +	u64 pcontainerid, ccontainerid;
>>>>> +	pid_t ppid;
>>>>> +
>>>>> +	/* Don't allow to set our own containerid */
>>>>> +	if (current == task)
>>>>> +		return -EPERM;
>>>>> +	/* Don't allow the containerid to be unset */
>>>>> +	if (!cid_valid(containerid))
>>>>> +		return -EINVAL;
>>>>> +	/* if we don't have caps, reject */
>>>>> +	if (!capable(CAP_AUDIT_CONTROL))
>>>>> +		return -EPERM;
>>>>> +	/* if containerid is unset, allow */
>>>>> +	if (!audit_containerid_set(task))
>>>>> +		return 0;
>>>> I am wondering whether there should be a check for the target process that
>>>> will receive the containerid to not have CAP_SYS_ADMIN that would otherwise
>>>> allow it to arbitrarily unshare()/clone() and leave the set of namespaces
>>>> that may make up the container whose containerid we assign here?
>>> This is a reasonable question.  This has been debated and I understood
>>> the conclusion was that without a clear definition of a "container", the
>>> task still remains in that container that just now has more
>>> sub-namespaces (in the case of hierarchical namespaces), we don't want
>>> to restrict it in such a way and that allows it to create nested
>>> containers.  I see setns being more problematic if it could switch to
>>> another existing namespace that was set up by the orchestrator for a
>>> different container.  The coming v2 patchset acknowledges this situation
>>> with the network namespace being potentially shared by multiple
>>> containers.
>> Are you going to post v2 soon? We would like to build on top of it for IMA
>> namespacing and auditing inside of IMA namespaces.
> I don't know if it addresses your specific needs, but V2 was posted on
> March 16th along with userspace patches:
> 	https://www.redhat.com/archives/linux-audit/2018-March/msg00110.html
> 	https://www.redhat.com/archives/linux-audit/2018-March/msg00124.html
>
> V3 is pending.
Thanks. I hadn't actually looked at primarily due to the ghak and ghau 
in the title. Whatever these may mean.

Does V2 or will V3 prevent a privileged process to setns() to a whole 
different set of namespaces and still be audited with that initial 
container id ?

^ permalink raw reply

* net namespaces kernel stack overflow
From: Alexander Aring @ 2018-04-18 19:45 UTC (permalink / raw)
  To: ktkhai; +Cc: netdev, Jamal Hadi Salim

Hi,

I currently can crash my net/master kernel by execute the following script:

--- snip

modprobe dummy

#mkdir /var/run/netns
#touch /var/run/netns/init_net
#mount --bind /proc/1/ns/net /var/run/netns/init_net

while true
do
    mkdir /var/run/netns
    touch /var/run/netns/init_net
    mount --bind /proc/1/ns/net /var/run/netns/init_net

    ip netns add foo
    ip netns exec foo ip link add dummy0 type dummy
    ip netns delete foo
done

--- snap

After max ~1 minute the kernel will crash.
Doing my hack of saving init_net outside the loop it will run fine...
So the mount bind is necessary.

The last message which I see is:

BUG: stack guard page was hit at 00000000f0751759 (stack is
0000000069363195..0000000073ddc474)
kernel stack overflow (double-fault): 0000 [#1] SMP PTI
Modules linked in:
CPU: 0 PID: 13917 Comm: ip Not tainted 4.16.0-11878-gef9d066f6808 #32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:validate_chain.isra.23+0x44/0xc40
RSP: 0018:ffffc900002cbff8 EFLAGS: 00010002
RAX: 0000000000040000 RBX: 0e58b88e1d4d15da RCX: 0e58b88e1d4d15da
RDX: 0000000000000000 RSI: ffff8802b25ee2a0 RDI: ffff8802b25edb00
RBP: 0e58b88e1d4d15da R08: 0000000000000000 R09: 0000000000000004
R10: ffffc900002cc050 R11: ffff8802b1054be8 R12: 0000000000000001
R13: ffff8802b25ee268 R14: ffff8802b25edb00 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8802bfc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc900002cbfe8 CR3: 0000000002024000 CR4: 00000000000006f0
Call Trace:
 ? get_max_files+0x10/0x10
 __lock_acquire+0x332/0x710
 lock_acquire+0x67/0xb0
 ? lockref_put_or_lock+0x9/0x30
 ? dput.part.7+0x17/0x2d0
 _raw_spin_lock+0x2b/0x60
 ? lockref_put_or_lock+0x9/0x30
 lockref_put_or_lock+0x9/0x30
 dput.part.7+0x1ec/0x2d0
 drop_mountpoint+0x10/0x40
 pin_kill+0x9b/0x3a0
 ? wait_woken+0x90/0x90
 ? mnt_pin_kill+0x2d/0x100
 mnt_pin_kill+0x2d/0x100
 cleanup_mnt+0x66/0x70
 pin_kill+0x9b/0x3a0
 ? wait_woken+0x90/0x90
 ? mnt_pin_kill+0x2d/0x100
 mnt_pin_kill+0x2d/0x100
 cleanup_mnt+0x66/0x70
...

I guess maybe it has something to do with recently switching to
migrate per-net ops to async.

- Alex

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Michael S. Tsirkin @ 2018-04-18 19:46 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: alexander.h.duyck, virtio-dev, kubakici, Samudrala, Sridhar,
	virtualization, loseweigh, netdev, davem
In-Reply-To: <20180418191315.GA1922@nanopsycho>

On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
> >> > > > This provides a generic interface for paravirtual drivers to listen
> >> > > > for netdev register/unregister/link change events from pci ethernet
> >> > > > devices with the same MAC and takeover their datapath. The notifier and
> >> > > > event handling code is based on the existing netvsc implementation.
> >> > > > 
> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
> >> > > > master netdev is created. The paravirtual driver registers each bypass
> >> > > > instance along with a set of ops to manage the slave events.
> >> > > >       bypass_master_register()
> >> > > >       bypass_master_unregister()
> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
> >> > > > the bypass module provides interfaces to create/destroy additional master
> >> > > > netdev and all the slave events are managed internally.
> >> > > >        bypass_master_create()
> >> > > >        bypass_master_destroy()
> >> > > > 
> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > > > ---
> >> > > > include/linux/netdevice.h |  14 +
> >> > > > include/net/bypass.h      |  96 ++++++
> >> > > > net/Kconfig               |  18 +
> >> > > > net/core/Makefile         |   1 +
> >> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
> >> > > > 5 files changed, 973 insertions(+)
> >> > > > create mode 100644 include/net/bypass.h
> >> > > > create mode 100644 net/core/bypass.c
> >> > > > 
> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> > > > index cf44503ea81a..587293728f70 100644
> >> > > > --- a/include/linux/netdevice.h
> >> > > > +++ b/include/linux/netdevice.h
> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
> >> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
> >> > > > 	IFF_MACSEC			= 1<<25,
> >> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
> >> > > > +	IFF_BYPASS			= 1 << 27,
> >> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
> >> > > I wonder, why you don't follow the existing coding style... Also, please
> >> > > add these to into the comment above.
> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
> >> > to the existing coding style to be consistent.
> >> Please do.
> >> 
> >> 
> >> > > 
> >> > > > };
> >> > > > 
> >> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
> >> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
> >> > > > #define IFF_MACSEC			IFF_MACSEC
> >> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
> >> > > > +#define IFF_BYPASS			IFF_BYPASS
> >> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
> >> > > > 
> >> > > > /**
> >> > > >    *	struct net_device - The DEVICE structure.
> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
> >> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
> >> > > > }
> >> > > > 
> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
> >> > > > +{
> >> > > > +	return dev->priv_flags & IFF_BYPASS;
> >> > > > +}
> >> > > > +
> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
> >> > > > +{
> >> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
> >> > > > +}
> >> > > > +
> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
> >> > > > static inline void netif_keep_dst(struct net_device *dev)
> >> > > > {
> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
> >> > > > new file mode 100644
> >> > > > index 000000000000..86b02cb894cf
> >> > > > --- /dev/null
> >> > > > +++ b/include/net/bypass.h
> >> > > > @@ -0,0 +1,96 @@
> >> > > > +// SPDX-License-Identifier: GPL-2.0
> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
> >> > > > +
> >> > > > +#ifndef _NET_BYPASS_H
> >> > > > +#define _NET_BYPASS_H
> >> > > > +
> >> > > > +#include <linux/netdevice.h>
> >> > > > +
> >> > > > +struct bypass_ops {
> >> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
> >> > > > +				  struct net_device *bypass_netdev);
> >> > > > +	int (*slave_join)(struct net_device *slave_netdev,
> >> > > > +			  struct net_device *bypass_netdev);
> >> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
> >> > > > +				    struct net_device *bypass_netdev);
> >> > > > +	int (*slave_release)(struct net_device *slave_netdev,
> >> > > > +			     struct net_device *bypass_netdev);
> >> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
> >> > > > +				 struct net_device *bypass_netdev);
> >> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
> >> > > > +};
> >> > > > +
> >> > > > +struct bypass_master {
> >> > > > +	struct list_head list;
> >> > > > +	struct net_device __rcu *bypass_netdev;
> >> > > > +	struct bypass_ops __rcu *ops;
> >> > > > +};
> >> > > > +
> >> > > > +/* bypass state */
> >> > > > +struct bypass_info {
> >> > > > +	/* passthru netdev with same MAC */
> >> > > > +	struct net_device __rcu *active_netdev;
> >> > > You still use "active"/"backup" names which is highly misleading as
> >> > > it has completely different meaning that in bond for example.
> >> > > I noted that in my previous review already. Please change it.
> >> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
> >> > matches with the BACKUP feature bit we are adding to virtio_net.
> >> I think that "backup" is also misleading. Both "active" and "backup"
> >> mean a *state* of slaves. This should be named differently.
> >> 
> >> 
> >> 
> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
> >> > am not too happy with it.
> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
> >> No. The netdev could be any netdevice. It does not have to be a "VF".
> >> I think "stolen" is quite appropriate since it describes the modus
> >> operandi. The bypass master steals some netdevice according to some
> >> match.
> >> 
> >> But I don't insist on "stolen". Just sounds right.
> >
> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
> >'backup' name is consistent.
> 
> It perhaps makes sense from the view of virtio device. However, as I
> described couple of times, for master/slave device the name "backup" is
> highly misleading.

virtio is the backup. You are supposed to use another
(typically passthrough) device, if that fails use virtio.
It does seem appropriate to me. If you like, we can
change that to "standby".  Active I don't like either. "main"?

In fact would failover be better than bypass?


> 
> >
> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
> >
> >Will look for any suggestions in the next day or two. If i don't get any, i will go
> >with 'stolen'
> >
> ><snip>
> >
> >
> >> +
> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
> >> +						  struct bypass_ops **ops)
> >> +{
> >> +	struct bypass_master *bypass_master;
> >> +	struct net_device *bypass_netdev;
> >> +
> >> +	spin_lock(&bypass_lock);
> >> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
> >> > > As I wrote the last time, you don't need this list, spinlock.
> >> > > You can do just something like:
> >> > >           for_each_net(net) {
> >> > >                   for_each_netdev(net, dev) {
> >> > > 			if (netif_is_bypass_master(dev)) {
> >> > This function returns the upper netdev as well as the ops associated
> >> > with that netdev.
> >> > bypass_master_list is a list of 'struct bypass_master' that associates
> >> Well, can't you have it in netdev priv?
> >
> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
> 
> Howcome? You have no master? I don't understand..
> 
> 
> 
> >
> >> 
> >> 
> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
> >> > NULL for 3-netdev model.
> >> I see :(
> >> 
> >> 
> >> > 
> >> > > 
> >> > > 
> >> > > 
> >> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
> >> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
> >> > > > +			*ops = rcu_dereference(bypass_master->ops);
> >> > > I don't see how rcu_dereference is ok here.
> >> > > 1) I don't see rcu_read_lock taken
> >> > > 2) Looks like bypass_master->ops has the same value across the whole
> >> > >      existence.
> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
> >> > Yes. ops doesn't change.
> >> If it does not change, you can just access it directly.
> >> 
> >> 
> >> > > 
> >> > > > +			spin_unlock(&bypass_lock);
> >> > > > +			return bypass_netdev;
> >> > > > +		}
> >> > > > +	}
> >> > > > +	spin_unlock(&bypass_lock);
> >> > > > +	return NULL;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	int ret, orig_mtu;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > For master, could you use word "master" in the variables so it is clear?
> >> > > Also, "dev" is fine instead of "netdev".
> >> > > Something like "bpmaster_dev"
> >> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
> >> I was trying to point out, that "bypass_netdev" represents a "master"
> >> netdev, yet it does not say master. That is why I suggested
> >> "bpmaster_dev"
> >> 
> >> 
> >> > I can change all _netdev suffixes to _dev to make the names shorter.
> >> ok.
> >> 
> >> 
> >> > 
> >> > > 
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
> >> > > > +					bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > 	Just "if (ret)" will do. You have this on more places.
> >> > OK.
> >> > 
> >> > 
> >> > > 
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = netdev_rx_handler_register(slave_netdev,
> >> > > > +					 bypass_ops ? bypass_ops->handle_frame :
> >> > > > +					 bypass_handle_frame, bypass_netdev);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
> >> > > > +			   ret);
> >> > > > +		goto done;
> >> > > > +	}
> >> > > > +
> >> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
> >> > > > +			   bypass_netdev->name, ret);
> >> > > > +		goto upper_link_failed;
> >> > > > +	}
> >> > > > +
> >> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > +	if (netif_running(bypass_netdev)) {
> >> > > > +		ret = dev_open(slave_netdev);
> >> > > > +		if (ret && (ret != -EBUSY)) {
> >> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
> >> > > > +				   slave_netdev->name, ret);
> >> > > > +			goto err_interface_up;
> >> > > > +		}
> >> > > > +	}
> >> > > > +
> >> > > > +	/* Align MTU of slave with master */
> >> > > > +	orig_mtu = slave_netdev->mtu;
> >> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
> >> > > > +	if (ret != 0) {
> >> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
> >> > > > +			   slave_netdev->name, bypass_netdev->mtu);
> >> > > > +		goto err_set_mtu;
> >> > > > +	}
> >> > > > +
> >> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > > +		goto err_join;
> >> > > > +
> >> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +	goto done;
> >> > > > +
> >> > > > +err_join:
> >> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
> >> > > > +err_set_mtu:
> >> > > > +	dev_close(slave_netdev);
> >> > > > +err_interface_up:
> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +upper_link_failed:
> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
> >> > > > +				       struct net_device *bypass_netdev,
> >> > > > +				       struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > +	struct net_device *backup_netdev, *active_netdev;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_pre_unregister)
> >> > > > +			return -EINVAL;
> >> > > > +
> >> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
> >> > > > +							bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > +		return -EINVAL;
> >> > > > +
> >> > > > +	return 0;
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
> >> > > > +				struct net_device *bypass_netdev,
> >> > > > +				struct bypass_ops *bypass_ops)
> >> > > > +{
> >> > > > +	struct net_device *backup_netdev, *active_netdev;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_release)
> >> > > > +			return -EINVAL;
> >> > > I think it would be good to make the API to the driver more strict and
> >> > > have a separate set of ops for "active" and "backup" netdevices.
> >> > > That should stop people thinking about extending this to more slaves in
> >> > > the future.
> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
> >> > 'active' slave.
> >> I'm very well aware of that. I just thought that explicit ops for the
> >> two slaves would make this more clear.
> >> 
> >> 
> >> > 
> >> > > 
> >> > > 
> >> > > > +
> >> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev == backup_netdev) {
> >> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
> >> > > > +	} else {
> >> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
> >> > > > +		if (backup_netdev) {
> >> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
> >> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
> >> > > > +		}
> >> > > > +	}
> >> > > > +
> >> > > > +	dev_put(slave_netdev);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +	return 0;
> >> > > > +}
> >> > > > +
> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	int ret;
> >> > > > +
> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
> >> > > > +					  bypass_ops);
> >> > > > +	if (ret != 0)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
> >> > > > +
> >> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
> >> > > > +
> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
> >> > > > +		    slave_netdev->name);
> >> > > > +
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
> >> > > > +
> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
> >> > > > +{
> >> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
> >> > > > +}
> >> > > > +
> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
> >> > > > +{
> >> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
> >> > > > +	struct bypass_ops *bypass_ops;
> >> > > > +	struct bypass_info *bi;
> >> > > > +
> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	ASSERT_RTNL();
> >> > > > +
> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
> >> > > > +						&bypass_ops);
> >> > > > +	if (!bypass_netdev)
> >> > > > +		goto done;
> >> > > > +
> >> > > > +	if (bypass_ops) {
> >> > > > +		if (!bypass_ops->slave_link_change)
> >> > > > +			goto done;
> >> > > > +
> >> > > > +		return bypass_ops->slave_link_change(slave_netdev,
> >> > > > +						     bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +	if (!netif_running(bypass_netdev))
> >> > > > +		return 0;
> >> > > > +
> >> > > > +	bi = netdev_priv(bypass_netdev);
> >> > > > +
> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
> >> > > > +
> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
> >> > > > +		goto done;
> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
> >> > > above is enough.
> >> > I think we need this check to not allow events from a slave that is not
> >> > attached to this master but has the same MAC.
> >> Why do we need such events? Seems wrong to me.
> >
> >We want to avoid events from a netdev that is mis-configured with the same MAC as
> >a bypass setup.
> >
> >>   Consider:
> >> 
> >> bp1      bp2
> >> a1 b1    a2 b2
> >> 
> >> 
> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
> >
> >We should not have 2 bypass configs with the same MAC.
> >I need to add a check in the bypass_master_register() to prevent this.
> 
> Mac can change, you would have to check in change as well. Feels odd
> thought. 
> 
> 
> >
> >The above check is to avoid cases where we have
> >bp1(a1, b1) with mac1
> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
> >
> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
> >> the order of creation.
> >> Let's say it will return bp1. Then when we have event for a2, the
> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
> >> 
> >> 
> >> You cannot use bypass_master_get_bymac() here.
> >> 
> >> 
> >> 
> >> > > 
> >> > > > +
> >> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
> >> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
> >> > > > +		netif_carrier_on(bypass_netdev);
> >> > > > +		netif_tx_wake_all_queues(bypass_netdev);
> >> > > > +	} else {
> >> > > > +		netif_carrier_off(bypass_netdev);
> >> > > > +		netif_tx_stop_all_queues(bypass_netdev);
> >> > > > +	}
> >> > > > +
> >> > > > +done:
> >> > > > +	return NOTIFY_DONE;
> >> > > > +}
> >> > > > +
> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
> >> > > > +{
> >> > > > +	/* Skip parent events */
> >> > > > +	if (netif_is_bypass_master(dev))
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid non-Ethernet type devices */
> >> > > > +	if (dev->type != ARPHRD_ETHER)
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
> >> > > > +	if (is_vlan_dev(dev))
> >> > > > +		return false;
> >> > > > +
> >> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
> >> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
> >> > > helpers netif_is_bond_master().
> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
> >> > > 
> >> > > You need to do it not by blacklisting, but with whitelisting. You need
> >> > > to whitelist VF devices. My port flavours patchset might help with this.
> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
> >> I don't see such function in the code.
> >
> >It is netdev_has_any_lower_dev(). I need to export it.
> 
> Come on, you cannot use that. That would allow bonding without slaves,
> but the slaves could be added later on.
> 
> What exactly you are trying to achieve by this?
> 
> 
> >
> >> 
> >> 
> >> > device is not an upper dev.
> >> > Can you point to your port flavours patchset? Is it upstream?
> >> I sent rfc couple of weeks ago:
> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
> >
> >
> >

^ permalink raw reply

* [PATCH net] vmxnet3: fix incorrect dereference when rxvlan is disabled
From: Ronak Doshi @ 2018-04-18 19:48 UTC (permalink / raw)
  To: netdev; +Cc: Ronak Doshi, VMware, Inc., open list

vmxnet3_get_hdr_len() is used to calculate the header length which in
turn is used to calculate the gso_size for skb. When rxvlan offload is
disabled, vlan tag is present in the header and the function references
ip header from sizeof(ethhdr) and leads to incorrect pointer reference.

This patch fixes this issue by taking sizeof(vlan_ethhdr) into account
if vlan tag is present and correctly references the ip hdr.

Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Guolin Yang <gyang@vmware.com>
Acked-by: Louis Luo <llouis@vmware.com>
---
 drivers/net/vmxnet3/vmxnet3_drv.c | 17 +++++++++++++----
 drivers/net/vmxnet3/vmxnet3_int.h |  4 ++--
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index e04937f44f33..9ebe2a689966 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -1218,6 +1218,7 @@ vmxnet3_get_hdr_len(struct vmxnet3_adapter *adapter, struct sk_buff *skb,
 	union {
 		void *ptr;
 		struct ethhdr *eth;
+		struct vlan_ethhdr *veth;
 		struct iphdr *ipv4;
 		struct ipv6hdr *ipv6;
 		struct tcphdr *tcp;
@@ -1228,16 +1229,24 @@ vmxnet3_get_hdr_len(struct vmxnet3_adapter *adapter, struct sk_buff *skb,
 	if (unlikely(sizeof(struct iphdr) + sizeof(struct tcphdr) > maplen))
 		return 0;
 
+	if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
+	    skb->protocol == cpu_to_be16(ETH_P_8021AD))
+		hlen = sizeof(struct vlan_ethhdr);
+	else
+		hlen = sizeof(struct ethhdr);
+
 	hdr.eth = eth_hdr(skb);
 	if (gdesc->rcd.v4) {
-		BUG_ON(hdr.eth->h_proto != htons(ETH_P_IP));
-		hdr.ptr += sizeof(struct ethhdr);
+		BUG_ON(hdr.eth->h_proto != htons(ETH_P_IP) &&
+		       hdr.veth->h_vlan_encapsulated_proto != htons(ETH_P_IP));
+		hdr.ptr += hlen;
 		BUG_ON(hdr.ipv4->protocol != IPPROTO_TCP);
 		hlen = hdr.ipv4->ihl << 2;
 		hdr.ptr += hdr.ipv4->ihl << 2;
 	} else if (gdesc->rcd.v6) {
-		BUG_ON(hdr.eth->h_proto != htons(ETH_P_IPV6));
-		hdr.ptr += sizeof(struct ethhdr);
+		BUG_ON(hdr.eth->h_proto != htons(ETH_P_IPV6) &&
+		       hdr.veth->h_vlan_encapsulated_proto != htons(ETH_P_IPV6));
+		hdr.ptr += hlen;
 		/* Use an estimated value, since we also need to handle
 		 * TSO case.
 		 */
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index 59ec34052a65..a3326463b71f 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -69,10 +69,10 @@
 /*
  * Version numbers
  */
-#define VMXNET3_DRIVER_VERSION_STRING   "1.4.13.0-k"
+#define VMXNET3_DRIVER_VERSION_STRING   "1.4.14.0-k"
 
 /* a 32-bit int, each byte encode a verion number in VMXNET3_DRIVER_VERSION */
-#define VMXNET3_DRIVER_VERSION_NUM      0x01040d00
+#define VMXNET3_DRIVER_VERSION_NUM      0x01040e00
 
 #if defined(CONFIG_PCI_MSI)
 	/* RSS only makes sense if MSI-X is supported. */
-- 
2.11.0

^ permalink raw reply related

* Re: [RFC PATCH V1 01/12] audit: add container id
From: Richard Guy Briggs @ 2018-04-18 19:51 UTC (permalink / raw)
  To: Stefan Berger
  Cc: mszeredi, ebiederm, simo, jlayton, carlos, linux-api, containers,
	LKML, eparis, dhowells, Linux-Audit Mailing List, viro, luto,
	netdev, linux-fsdevel, cgroups, serge, trondmy
In-Reply-To: <c1ec93a2-b398-373c-55da-b2be8e60c6b6@linux.vnet.ibm.com>

On 2018-04-18 15:39, Stefan Berger wrote:
> On 04/18/2018 03:23 PM, Richard Guy Briggs wrote:
> > On 2018-04-18 14:45, Stefan Berger wrote:
> > > On 03/15/2018 11:58 PM, Richard Guy Briggs wrote:
> > > > On 2018-03-15 16:27, Stefan Berger wrote:
> > > > > On 03/01/2018 02:41 PM, Richard Guy Briggs wrote:
> > > > > > Implement the proc fs write to set the audit container ID of a process,
> > > > > > emitting an AUDIT_CONTAINER record to document the event.
> > > > > > 
> > > > > > This is a write from the container orchestrator task to a proc entry of
> > > > > > the form /proc/PID/containerid where PID is the process ID of the newly
> > > > > > created task that is to become the first task in a container, or an
> > > > > > additional task added to a container.
> > > > > > 
> > > > > > The write expects up to a u64 value (unset: 18446744073709551615).
> > > > > > 
> > > > > > This will produce a record such as this:
> > > > > > type=UNKNOWN[1333] msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> > > > > > 
> > > > > > The "op" field indicates an initial set.  The "pid" to "ses" fields are
> > > > > > the orchestrator while the "opid" field is the object's PID, the process
> > > > > > being "contained".  Old and new container ID values are given in the
> > > > > > "contid" fields, while res indicates its success.
> > > > > > 
> > > > > > It is not permitted to self-set, unset or re-set the container ID.  A
> > > > > > child inherits its parent's container ID, but then can be set only once
> > > > > > after.
> > > > > > 
> > > > > > See: https://github.com/linux-audit/audit-kernel/issues/32
> > > > > > 
> > > > > > 
> > > > > >     /* audit_rule_data supports filter rules with both integer and string
> > > > > >      * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> > > > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > > > > index 4e0a4ac..0ee1e59 100644
> > > > > > --- a/kernel/auditsc.c
> > > > > > +++ b/kernel/auditsc.c
> > > > > > @@ -2073,6 +2073,92 @@ int audit_set_loginuid(kuid_t loginuid)
> > > > > >     	return rc;
> > > > > >     }
> > > > > > 
> > > > > > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
> > > > > > +{
> > > > > > +	struct task_struct *parent;
> > > > > > +	u64 pcontainerid, ccontainerid;
> > > > > > +	pid_t ppid;
> > > > > > +
> > > > > > +	/* Don't allow to set our own containerid */
> > > > > > +	if (current == task)
> > > > > > +		return -EPERM;
> > > > > > +	/* Don't allow the containerid to be unset */
> > > > > > +	if (!cid_valid(containerid))
> > > > > > +		return -EINVAL;
> > > > > > +	/* if we don't have caps, reject */
> > > > > > +	if (!capable(CAP_AUDIT_CONTROL))
> > > > > > +		return -EPERM;
> > > > > > +	/* if containerid is unset, allow */
> > > > > > +	if (!audit_containerid_set(task))
> > > > > > +		return 0;
> > > > > I am wondering whether there should be a check for the target process that
> > > > > will receive the containerid to not have CAP_SYS_ADMIN that would otherwise
> > > > > allow it to arbitrarily unshare()/clone() and leave the set of namespaces
> > > > > that may make up the container whose containerid we assign here?
> > > > This is a reasonable question.  This has been debated and I understood
> > > > the conclusion was that without a clear definition of a "container", the
> > > > task still remains in that container that just now has more
> > > > sub-namespaces (in the case of hierarchical namespaces), we don't want
> > > > to restrict it in such a way and that allows it to create nested
> > > > containers.  I see setns being more problematic if it could switch to
> > > > another existing namespace that was set up by the orchestrator for a
> > > > different container.  The coming v2 patchset acknowledges this situation
> > > > with the network namespace being potentially shared by multiple
> > > > containers.
> > > Are you going to post v2 soon? We would like to build on top of it for IMA
> > > namespacing and auditing inside of IMA namespaces.
> > I don't know if it addresses your specific needs, but V2 was posted on
> > March 16th along with userspace patches:
> > 	https://www.redhat.com/archives/linux-audit/2018-March/msg00110.html
> > 	https://www.redhat.com/archives/linux-audit/2018-March/msg00124.html
> > 
> > V3 is pending.
> Thanks. I hadn't actually looked at primarily due to the ghak and ghau in
> the title. Whatever these may mean.

They are Github issue numbers:
GHAK: GitHub Audit Kernel
GHAU: GitHub Audit Userspace
GHAD: GitHub Audit Documentation
GHAT: GitHub Audit Testsuite

> Does V2 or will V3 prevent a privileged process to setns() to a whole
> different set of namespaces and still be audited with that initial container
> id ?

No, not significantly different from V1 in that respect.

It does not prevent setns(), but will maintain its containerid.

It will prevent games by blocking a child and parent from setting each
other's containerids.

It does check that the task being conainered does not yet have any
children or peer threads.

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* [PATCH] docs: ip-sysctl.txt: fix name of some ipv6 variables
From: Olivier Gayot @ 2018-04-18 20:03 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Olivier Gayot

The name of the following proc/sysctl entries were incorrectly
documented:

    /proc/sys/net/ipv6/conf/<interface>/max_dst_opts_number
    /proc/sys/net/ipv6/conf/<interface>/max_hbt_opts_number
    /proc/sys/net/ipv6/conf/<interface>/max_dst_opts_length
    /proc/sys/net/ipv6/conf/<interface>/max_hbt_length

Their name was set to the name of the symbol in the .data field of the
control table instead of their .proc name.

Signed-off-by: Olivier Gayot <olivier.gayot@sigexec.com>
---
 Documentation/networking/ip-sysctl.txt | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5dc1a040a2f1..b583a73cf95f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1390,26 +1390,26 @@ mld_qrv - INTEGER
 	Default: 2 (as specified by RFC3810 9.1)
 	Minimum: 1 (as specified by RFC6636 4.5)
 
-max_dst_opts_cnt - INTEGER
+max_dst_opts_number - INTEGER
 	Maximum number of non-padding TLVs allowed in a Destination
 	options extension header. If this value is less than zero
 	then unknown options are disallowed and the number of known
 	TLVs allowed is the absolute value of this number.
 	Default: 8
 
-max_hbh_opts_cnt - INTEGER
+max_hbh_opts_number - INTEGER
 	Maximum number of non-padding TLVs allowed in a Hop-by-Hop
 	options extension header. If this value is less than zero
 	then unknown options are disallowed and the number of known
 	TLVs allowed is the absolute value of this number.
 	Default: 8
 
-max dst_opts_len - INTEGER
+max_dst_opts_length - INTEGER
 	Maximum length allowed for a Destination options extension
 	header.
 	Default: INT_MAX (unlimited)
 
-max hbh_opts_len - INTEGER
+max_hbh_length - INTEGER
 	Maximum length allowed for a Hop-by-Hop options extension
 	header.
 	Default: INT_MAX (unlimited)
-- 
2.17.0

^ permalink raw reply related

* Re: [RFC net-next PATCH 2/2] bpf: disallow XDP data_meta to overlap with xdp_frame area
From: Daniel Borkmann @ 2018-04-18 20:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20180418195322.381b041c@redhat.com>

On 04/18/2018 07:53 PM, Jesper Dangaard Brouer wrote:
> On Wed, 18 Apr 2018 18:21:21 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
> 
>> On 04/18/2018 02:10 PM, Jesper Dangaard Brouer wrote:
>>> If combining xdp_adjust_head and xdp_adjust_meta, then it is possible
>>> to make data_meta overlap with area used by xdp_frame.  And another
>>> invocation of xdp_adjust_head can then clear that area, due to
>>> clearing of xdp_frame area.
>>>
>>> The easiest solution I found was to simply not allow
>>> xdp_buff->data_meta to overlap with area used by xdp_frame.  
>>
>> Thanks Jesper! Trying to answer both emails in one. :) More below.
>>
>>> Fixes: 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")
>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>> ---
>>>  net/core/filter.c |   11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 15e9b5477360..e3623e741181 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -2701,6 +2701,11 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>>>  		     data > xdp->data_end - ETH_HLEN))
>>>  		return -EINVAL;
>>>  
>>> +	/* Disallow data_meta to use xdp_frame area */
>>> +	if (metalen > 0 &&
>>> +	    unlikely((data - metalen) < xdp_frame_end))
>>> +		return -EINVAL;
>>> +
>>>  	/* Avoid info leak, when reusing area prev used by xdp_frame */
>>>  	if (data < xdp_frame_end) {  
>>
>> Effectively, when metalen > 0, then data_meta < data pointer, so above test
>> on new data_meta might be better, but feels like a bit of a workaround to
>> handle moving data pointer but disallowing moving data_meta pointer whereas
>> both could be handled if we wanted to go that path.
>>
>>>  		unsigned long clearlen = xdp_frame_end - data;
>>> @@ -2734,6 +2739,7 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>>>  
>>>  BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>>>  {
>>> +	void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>>>  	void *meta = xdp->data_meta + offset;
>>>  	unsigned long metalen = xdp->data - meta;
>>>  
>>> @@ -2742,6 +2748,11 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>>>  	if (unlikely(meta < xdp->data_hard_start ||
>>>  		     meta > xdp->data))
>>>  		return -EINVAL;
>>> +
>>> +	/* Disallow data_meta to use xdp_frame area */
>>> +	if (unlikely(meta < xdp_frame_end))
>>> +		return -EINVAL;  
>>
>> (Ditto.)
>>
>>>  	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
>>>  		     (metalen > 32)))
>>>  		return -EACCES;  
>>
>> The other, perhaps less invasive/complex option would be to just disallow
>> moving anything into previous sizeof(struct xdp_frame) area. My original
>> concern was that not all drivers use 256 bytes of headroom, e.g. afaik the
>> i40e and ixgbe have around 192 bytes of headroom available, but that should
>> actually still be plenty of space for encap + meta data, and potentially
>> with meta data use I would expect that at best custom decap would be
>> happening when pushing the packet up the stack. So might as well disallow
>> going into that region and not worry about it. Thus, reverting e9e9545e10d3
>> ("xdp: avoid leaking info stored in frame data on page reuse") and adding
>> something like the below (uncompiled), should just do it:
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 3bb0cb9..ad98ddd 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2692,8 +2692,9 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
>>
>>  BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>>  {
>> +	void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>>  	unsigned long metalen = xdp_get_metalen(xdp);
>> -	void *data_start = xdp->data_hard_start + metalen;
>> +	void *data_start = frame_end + metalen;
>>  	void *data = xdp->data + offset;
>>
>>  	if (unlikely(data < data_start ||
>> @@ -2719,13 +2720,13 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>>
>>  BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
>>  {
>> +	void *frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>>  	void *meta = xdp->data_meta + offset;
>>  	unsigned long metalen = xdp->data - meta;
>>
>>  	if (xdp_data_meta_unsupported(xdp))
>>  		return -ENOTSUPP;
>> -	if (unlikely(meta < xdp->data_hard_start ||
>> -		     meta > xdp->data))
>> +	if (unlikely(meta < frame_end || meta > xdp->data))
>>  		return -EINVAL;
>>  	if (unlikely((metalen & (sizeof(__u32) - 1)) ||
>>  		     (metalen > 32)))
> 
> Okay, so you say just disallow using xdp_frame area in general.  It
> would be simpler.  

Yeah, lets do that.

> The advantage it that we don't run into strange situations, where the
> user/bpf_prog increased headroom too much, such that convert_to_xdp_frame()
> fails and thus XDP_REDIRECT action fails.  (That will be confusing to
> users to debug/troubleshoot).

Agree, yet another argument for doing it.

^ permalink raw reply

* Re: [RFC PATCH net-next v6 2/4] net: Introduce generic bypass module
From: Jiri Pirko @ 2018-04-18 20:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Samudrala, Sridhar, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	jasowang, loseweigh
In-Reply-To: <20180418222309-mutt-send-email-mst@kernel.org>

Wed, Apr 18, 2018 at 09:46:04PM CEST, mst@redhat.com wrote:
>On Wed, Apr 18, 2018 at 09:13:15PM +0200, Jiri Pirko wrote:
>> Wed, Apr 18, 2018 at 08:43:15PM CEST, sridhar.samudrala@intel.com wrote:
>> >On 4/18/2018 2:25 AM, Jiri Pirko wrote:
>> >> Wed, Apr 11, 2018 at 09:13:52PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > On 4/11/2018 8:51 AM, Jiri Pirko wrote:
>> >> > > Tue, Apr 10, 2018 at 08:59:48PM CEST, sridhar.samudrala@intel.com wrote:
>> >> > > > This provides a generic interface for paravirtual drivers to listen
>> >> > > > for netdev register/unregister/link change events from pci ethernet
>> >> > > > devices with the same MAC and takeover their datapath. The notifier and
>> >> > > > event handling code is based on the existing netvsc implementation.
>> >> > > > 
>> >> > > > It exposes 2 sets of interfaces to the paravirtual drivers.
>> >> > > > 1. existing netvsc driver that uses 2 netdev model. In this model, no
>> >> > > > master netdev is created. The paravirtual driver registers each bypass
>> >> > > > instance along with a set of ops to manage the slave events.
>> >> > > >       bypass_master_register()
>> >> > > >       bypass_master_unregister()
>> >> > > > 2. new virtio_net based solution that uses 3 netdev model. In this model,
>> >> > > > the bypass module provides interfaces to create/destroy additional master
>> >> > > > netdev and all the slave events are managed internally.
>> >> > > >        bypass_master_create()
>> >> > > >        bypass_master_destroy()
>> >> > > > 
>> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> >> > > > ---
>> >> > > > include/linux/netdevice.h |  14 +
>> >> > > > include/net/bypass.h      |  96 ++++++
>> >> > > > net/Kconfig               |  18 +
>> >> > > > net/core/Makefile         |   1 +
>> >> > > > net/core/bypass.c         | 844 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> > > > 5 files changed, 973 insertions(+)
>> >> > > > create mode 100644 include/net/bypass.h
>> >> > > > create mode 100644 net/core/bypass.c
>> >> > > > 
>> >> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> >> > > > index cf44503ea81a..587293728f70 100644
>> >> > > > --- a/include/linux/netdevice.h
>> >> > > > +++ b/include/linux/netdevice.h
>> >> > > > @@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
>> >> > > > 	IFF_PHONY_HEADROOM		= 1<<24,
>> >> > > > 	IFF_MACSEC			= 1<<25,
>> >> > > > 	IFF_NO_RX_HANDLER		= 1<<26,
>> >> > > > +	IFF_BYPASS			= 1 << 27,
>> >> > > > +	IFF_BYPASS_SLAVE		= 1 << 28,
>> >> > > I wonder, why you don't follow the existing coding style... Also, please
>> >> > > add these to into the comment above.
>> >> > To avoid checkpatch warnings. If it is OK to ignore these warnings, I can switch back
>> >> > to the existing coding style to be consistent.
>> >> Please do.
>> >> 
>> >> 
>> >> > > 
>> >> > > > };
>> >> > > > 
>> >> > > > #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
>> >> > > > @@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
>> >> > > > #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
>> >> > > > #define IFF_MACSEC			IFF_MACSEC
>> >> > > > #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
>> >> > > > +#define IFF_BYPASS			IFF_BYPASS
>> >> > > > +#define IFF_BYPASS_SLAVE		IFF_BYPASS_SLAVE
>> >> > > > 
>> >> > > > /**
>> >> > > >    *	struct net_device - The DEVICE structure.
>> >> > > > @@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
>> >> > > > 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
>> >> > > > }
>> >> > > > 
>> >> > > > +static inline bool netif_is_bypass_master(const struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return dev->priv_flags & IFF_BYPASS;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static inline bool netif_is_bypass_slave(const struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return dev->priv_flags & IFF_BYPASS_SLAVE;
>> >> > > > +}
>> >> > > > +
>> >> > > > /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
>> >> > > > static inline void netif_keep_dst(struct net_device *dev)
>> >> > > > {
>> >> > > > diff --git a/include/net/bypass.h b/include/net/bypass.h
>> >> > > > new file mode 100644
>> >> > > > index 000000000000..86b02cb894cf
>> >> > > > --- /dev/null
>> >> > > > +++ b/include/net/bypass.h
>> >> > > > @@ -0,0 +1,96 @@
>> >> > > > +// SPDX-License-Identifier: GPL-2.0
>> >> > > > +/* Copyright (c) 2018, Intel Corporation. */
>> >> > > > +
>> >> > > > +#ifndef _NET_BYPASS_H
>> >> > > > +#define _NET_BYPASS_H
>> >> > > > +
>> >> > > > +#include <linux/netdevice.h>
>> >> > > > +
>> >> > > > +struct bypass_ops {
>> >> > > > +	int (*slave_pre_register)(struct net_device *slave_netdev,
>> >> > > > +				  struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_join)(struct net_device *slave_netdev,
>> >> > > > +			  struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_pre_unregister)(struct net_device *slave_netdev,
>> >> > > > +				    struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_release)(struct net_device *slave_netdev,
>> >> > > > +			     struct net_device *bypass_netdev);
>> >> > > > +	int (*slave_link_change)(struct net_device *slave_netdev,
>> >> > > > +				 struct net_device *bypass_netdev);
>> >> > > > +	rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>> >> > > > +};
>> >> > > > +
>> >> > > > +struct bypass_master {
>> >> > > > +	struct list_head list;
>> >> > > > +	struct net_device __rcu *bypass_netdev;
>> >> > > > +	struct bypass_ops __rcu *ops;
>> >> > > > +};
>> >> > > > +
>> >> > > > +/* bypass state */
>> >> > > > +struct bypass_info {
>> >> > > > +	/* passthru netdev with same MAC */
>> >> > > > +	struct net_device __rcu *active_netdev;
>> >> > > You still use "active"/"backup" names which is highly misleading as
>> >> > > it has completely different meaning that in bond for example.
>> >> > > I noted that in my previous review already. Please change it.
>> >> > I guess the issue is with only the 'active'  name. 'backup' should be fine as it also
>> >> > matches with the BACKUP feature bit we are adding to virtio_net.
>> >> I think that "backup" is also misleading. Both "active" and "backup"
>> >> mean a *state* of slaves. This should be named differently.
>> >> 
>> >> 
>> >> 
>> >> > With regards to alternate names for 'active', you suggested 'stolen', but i
>> >> > am not too happy with it.
>> >> > netvsc uses vf_netdev, are you OK with this? Or another option is 'passthru'
>> >> No. The netdev could be any netdevice. It does not have to be a "VF".
>> >> I think "stolen" is quite appropriate since it describes the modus
>> >> operandi. The bypass master steals some netdevice according to some
>> >> match.
>> >> 
>> >> But I don't insist on "stolen". Just sounds right.
>> >
>> >We are adding VIRTIO_NET_F_BACKUP as a new feature bit to enable this feature, So i think
>> >'backup' name is consistent.
>> 
>> It perhaps makes sense from the view of virtio device. However, as I
>> described couple of times, for master/slave device the name "backup" is
>> highly misleading.
>
>virtio is the backup. You are supposed to use another
>(typically passthrough) device, if that fails use virtio.
>It does seem appropriate to me. If you like, we can
>change that to "standby".  Active I don't like either. "main"?

Sounds much better, yes.


>
>In fact would failover be better than bypass?

Also, much better.


>
>
>> 
>> >
>> >The intent is to restrict the 'active' netdev to be a VF. If there is a way to check that
>> >a PCI device is a VF in the guest kernel, we could restrict 'active' netdev to be a VF.
>> >
>> >Will look for any suggestions in the next day or two. If i don't get any, i will go
>> >with 'stolen'
>> >
>> ><snip>
>> >
>> >
>> >> +
>> >> +static struct net_device *bypass_master_get_bymac(u8 *mac,
>> >> +						  struct bypass_ops **ops)
>> >> +{
>> >> +	struct bypass_master *bypass_master;
>> >> +	struct net_device *bypass_netdev;
>> >> +
>> >> +	spin_lock(&bypass_lock);
>> >> +	list_for_each_entry(bypass_master, &bypass_master_list, list) {
>> >> > > As I wrote the last time, you don't need this list, spinlock.
>> >> > > You can do just something like:
>> >> > >           for_each_net(net) {
>> >> > >                   for_each_netdev(net, dev) {
>> >> > > 			if (netif_is_bypass_master(dev)) {
>> >> > This function returns the upper netdev as well as the ops associated
>> >> > with that netdev.
>> >> > bypass_master_list is a list of 'struct bypass_master' that associates
>> >> Well, can't you have it in netdev priv?
>> >
>> >We cannot do this for 2-netdev model as there is no bypass_netdev created.
>> 
>> Howcome? You have no master? I don't understand..
>> 
>> 
>> 
>> >
>> >> 
>> >> 
>> >> > 'bypass_netdev' with 'bypass_ops' and gets added via bypass_master_register().
>> >> > We need 'ops' only to support the 2 netdev model of netvsc. ops will be
>> >> > NULL for 3-netdev model.
>> >> I see :(
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > 
>> >> > > 
>> >> > > > +		bypass_netdev = rcu_dereference(bypass_master->bypass_netdev);
>> >> > > > +		if (ether_addr_equal(bypass_netdev->perm_addr, mac)) {
>> >> > > > +			*ops = rcu_dereference(bypass_master->ops);
>> >> > > I don't see how rcu_dereference is ok here.
>> >> > > 1) I don't see rcu_read_lock taken
>> >> > > 2) Looks like bypass_master->ops has the same value across the whole
>> >> > >      existence.
>> >> > We hold rtnl_lock(), i think i need to change this to rtnl_dereference.
>> >> > Yes. ops doesn't change.
>> >> If it does not change, you can just access it directly.
>> >> 
>> >> 
>> >> > > 
>> >> > > > +			spin_unlock(&bypass_lock);
>> >> > > > +			return bypass_netdev;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +	spin_unlock(&bypass_lock);
>> >> > > > +	return NULL;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_register(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	int ret, orig_mtu;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > For master, could you use word "master" in the variables so it is clear?
>> >> > > Also, "dev" is fine instead of "netdev".
>> >> > > Something like "bpmaster_dev"
>> >> > bypass_master is of  type struct bypass_master,  bypass_netdev is of type struct net_device.
>> >> I was trying to point out, that "bypass_netdev" represents a "master"
>> >> netdev, yet it does not say master. That is why I suggested
>> >> "bpmaster_dev"
>> >> 
>> >> 
>> >> > I can change all _netdev suffixes to _dev to make the names shorter.
>> >> ok.
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = bypass_slave_pre_register(slave_netdev, bypass_netdev,
>> >> > > > +					bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > 	Just "if (ret)" will do. You have this on more places.
>> >> > OK.
>> >> > 
>> >> > 
>> >> > > 
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = netdev_rx_handler_register(slave_netdev,
>> >> > > > +					 bypass_ops ? bypass_ops->handle_frame :
>> >> > > > +					 bypass_handle_frame, bypass_netdev);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(slave_netdev, "can not register bypass rx handler (err = %d)\n",
>> >> > > > +			   ret);
>> >> > > > +		goto done;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	ret = netdev_upper_dev_link(slave_netdev, bypass_netdev, NULL);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(slave_netdev, "can not set master device %s (err = %d)\n",
>> >> > > > +			   bypass_netdev->name, ret);
>> >> > > > +		goto upper_link_failed;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	slave_netdev->priv_flags |= IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > +	if (netif_running(bypass_netdev)) {
>> >> > > > +		ret = dev_open(slave_netdev);
>> >> > > > +		if (ret && (ret != -EBUSY)) {
>> >> > > > +			netdev_err(bypass_netdev, "Opening slave %s failed ret:%d\n",
>> >> > > > +				   slave_netdev->name, ret);
>> >> > > > +			goto err_interface_up;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	/* Align MTU of slave with master */
>> >> > > > +	orig_mtu = slave_netdev->mtu;
>> >> > > > +	ret = dev_set_mtu(slave_netdev, bypass_netdev->mtu);
>> >> > > > +	if (ret != 0) {
>> >> > > > +		netdev_err(bypass_netdev, "unable to change mtu of %s to %u register failed\n",
>> >> > > > +			   slave_netdev->name, bypass_netdev->mtu);
>> >> > > > +		goto err_set_mtu;
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	ret = bypass_slave_join(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > > +		goto err_join;
>> >> > > > +
>> >> > > > +	call_netdevice_notifiers(NETDEV_JOIN, slave_netdev);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s registered\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +	goto done;
>> >> > > > +
>> >> > > > +err_join:
>> >> > > > +	dev_set_mtu(slave_netdev, orig_mtu);
>> >> > > > +err_set_mtu:
>> >> > > > +	dev_close(slave_netdev);
>> >> > > > +err_interface_up:
>> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +upper_link_failed:
>> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_pre_unregister(struct net_device *slave_netdev,
>> >> > > > +				       struct net_device *bypass_netdev,
>> >> > > > +				       struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > +	struct net_device *backup_netdev, *active_netdev;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_pre_unregister)
>> >> > > > +			return -EINVAL;
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_pre_unregister(slave_netdev,
>> >> > > > +							bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > +		return -EINVAL;
>> >> > > > +
>> >> > > > +	return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_release(struct net_device *slave_netdev,
>> >> > > > +				struct net_device *bypass_netdev,
>> >> > > > +				struct bypass_ops *bypass_ops)
>> >> > > > +{
>> >> > > > +	struct net_device *backup_netdev, *active_netdev;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_release)
>> >> > > > +			return -EINVAL;
>> >> > > I think it would be good to make the API to the driver more strict and
>> >> > > have a separate set of ops for "active" and "backup" netdevices.
>> >> > > That should stop people thinking about extending this to more slaves in
>> >> > > the future.
>> >> > We have checks in slave_pre_register() that allows only 1 'backup' and 1
>> >> > 'active' slave.
>> >> I'm very well aware of that. I just thought that explicit ops for the
>> >> two slaves would make this more clear.
>> >> 
>> >> 
>> >> > 
>> >> > > 
>> >> > > 
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_release(slave_netdev, bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev == backup_netdev) {
>> >> > > > +		RCU_INIT_POINTER(bi->backup_netdev, NULL);
>> >> > > > +	} else {
>> >> > > > +		RCU_INIT_POINTER(bi->active_netdev, NULL);
>> >> > > > +		if (backup_netdev) {
>> >> > > > +			bypass_netdev->min_mtu = backup_netdev->min_mtu;
>> >> > > > +			bypass_netdev->max_mtu = backup_netdev->max_mtu;
>> >> > > > +		}
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	dev_put(slave_netdev);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s released\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +	return 0;
>> >> > > > +}
>> >> > > > +
>> >> > > > +int bypass_slave_unregister(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	int ret;
>> >> > > > +
>> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ret = bypass_slave_pre_unregister(slave_netdev, bypass_netdev,
>> >> > > > +					  bypass_ops);
>> >> > > > +	if (ret != 0)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	netdev_rx_handler_unregister(slave_netdev);
>> >> > > > +	netdev_upper_dev_unlink(slave_netdev, bypass_netdev);
>> >> > > > +	slave_netdev->priv_flags &= ~IFF_BYPASS_SLAVE;
>> >> > > > +
>> >> > > > +	bypass_slave_release(slave_netdev, bypass_netdev, bypass_ops);
>> >> > > > +
>> >> > > > +	netdev_info(bypass_netdev, "bypass slave:%s unregistered\n",
>> >> > > > +		    slave_netdev->name);
>> >> > > > +
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +EXPORT_SYMBOL_GPL(bypass_slave_unregister);
>> >> > > > +
>> >> > > > +static bool bypass_xmit_ready(struct net_device *dev)
>> >> > > > +{
>> >> > > > +	return netif_running(dev) && netif_carrier_ok(dev);
>> >> > > > +}
>> >> > > > +
>> >> > > > +static int bypass_slave_link_change(struct net_device *slave_netdev)
>> >> > > > +{
>> >> > > > +	struct net_device *bypass_netdev, *active_netdev, *backup_netdev;
>> >> > > > +	struct bypass_ops *bypass_ops;
>> >> > > > +	struct bypass_info *bi;
>> >> > > > +
>> >> > > > +	if (!netif_is_bypass_slave(slave_netdev))
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	ASSERT_RTNL();
>> >> > > > +
>> >> > > > +	bypass_netdev = bypass_master_get_bymac(slave_netdev->perm_addr,
>> >> > > > +						&bypass_ops);
>> >> > > > +	if (!bypass_netdev)
>> >> > > > +		goto done;
>> >> > > > +
>> >> > > > +	if (bypass_ops) {
>> >> > > > +		if (!bypass_ops->slave_link_change)
>> >> > > > +			goto done;
>> >> > > > +
>> >> > > > +		return bypass_ops->slave_link_change(slave_netdev,
>> >> > > > +						     bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +	if (!netif_running(bypass_netdev))
>> >> > > > +		return 0;
>> >> > > > +
>> >> > > > +	bi = netdev_priv(bypass_netdev);
>> >> > > > +
>> >> > > > +	active_netdev = rtnl_dereference(bi->active_netdev);
>> >> > > > +	backup_netdev = rtnl_dereference(bi->backup_netdev);
>> >> > > > +
>> >> > > > +	if (slave_netdev != active_netdev && slave_netdev != backup_netdev)
>> >> > > > +		goto done;
>> >> > > You don't need this check. "if (!netif_is_bypass_slave(slave_netdev))"
>> >> > > above is enough.
>> >> > I think we need this check to not allow events from a slave that is not
>> >> > attached to this master but has the same MAC.
>> >> Why do we need such events? Seems wrong to me.
>> >
>> >We want to avoid events from a netdev that is mis-configured with the same MAC as
>> >a bypass setup.
>> >
>> >>   Consider:
>> >> 
>> >> bp1      bp2
>> >> a1 b1    a2 b2
>> >> 
>> >> 
>> >> a1 and a2 have the same mac and bp1 and bp2 have the same mac.
>> >
>> >We should not have 2 bypass configs with the same MAC.
>> >I need to add a check in the bypass_master_register() to prevent this.
>> 
>> Mac can change, you would have to check in change as well. Feels odd
>> thought. 
>> 
>> 
>> >
>> >The above check is to avoid cases where we have
>> >bp1(a1, b1) with mac1
>> >and a2 is mis-configured with mac1, we want to avoid using a2 link events to update bp1.
>> >
>> >> Now bypass_master_get_bymac() will return always bp1 or bp2 - depending on
>> >> the order of creation.
>> >> Let's say it will return bp1. Then when we have event for a2, the
>> >> bypass_ops->slave_link_change is called with (a2, bp1). That is wrong.
>> >> 
>> >> 
>> >> You cannot use bypass_master_get_bymac() here.
>> >> 
>> >> 
>> >> 
>> >> > > 
>> >> > > > +
>> >> > > > +	if ((active_netdev && bypass_xmit_ready(active_netdev)) ||
>> >> > > > +	    (backup_netdev && bypass_xmit_ready(backup_netdev))) {
>> >> > > > +		netif_carrier_on(bypass_netdev);
>> >> > > > +		netif_tx_wake_all_queues(bypass_netdev);
>> >> > > > +	} else {
>> >> > > > +		netif_carrier_off(bypass_netdev);
>> >> > > > +		netif_tx_stop_all_queues(bypass_netdev);
>> >> > > > +	}
>> >> > > > +
>> >> > > > +done:
>> >> > > > +	return NOTIFY_DONE;
>> >> > > > +}
>> >> > > > +
>> >> > > > +static bool bypass_validate_event_dev(struct net_device *dev)
>> >> > > > +{
>> >> > > > +	/* Skip parent events */
>> >> > > > +	if (netif_is_bypass_master(dev))
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid non-Ethernet type devices */
>> >> > > > +	if (dev->type != ARPHRD_ETHER)
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid Vlan dev with same MAC registering as VF */
>> >> > > > +	if (is_vlan_dev(dev))
>> >> > > > +		return false;
>> >> > > > +
>> >> > > > +	/* Avoid Bonding master dev with same MAC registering as slave dev */
>> >> > > > +	if ((dev->priv_flags & IFF_BONDING) && (dev->flags & IFF_MASTER))
>> >> > > Yeah, this is certainly incorrect. One thing is, you should be using the
>> >> > > helpers netif_is_bond_master().
>> >> > > But what about the rest? macsec, macvlan, team, bridge, ovs and others?
>> >> > > 
>> >> > > You need to do it not by blacklisting, but with whitelisting. You need
>> >> > > to whitelist VF devices. My port flavours patchset might help with this.
>> >> > May be i can use netdev_has_lower_dev() helper to make sure that the slave
>> >> I don't see such function in the code.
>> >
>> >It is netdev_has_any_lower_dev(). I need to export it.
>> 
>> Come on, you cannot use that. That would allow bonding without slaves,
>> but the slaves could be added later on.
>> 
>> What exactly you are trying to achieve by this?
>> 
>> 
>> >
>> >> 
>> >> 
>> >> > device is not an upper dev.
>> >> > Can you point to your port flavours patchset? Is it upstream?
>> >> I sent rfc couple of weeks ago:
>> >> [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
>> >
>> >
>> >

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox