Netdev List
 help / color / mirror / Atom feed
* Re: net: ipv6: Fix a bug in ndisc_send_ns when netdev only has a global address
From: Mark Smith @ 2019-07-30 10:28 UTC (permalink / raw)
  To: Su Yanjun; +Cc: netdev
In-Reply-To: <93c401b9-bf8b-4d49-9c3b-72d09073444e@cn.fujitsu.com>

Hi Su,

On Tue, 30 Jul 2019 at 19:41, Su Yanjun <suyj.fnst@cn.fujitsu.com> wrote:
>
>
> 在 2019/7/30 16:15, Mark Smith 写道:
> > Hi,
> >
> > I'm not subscribed to the Linux netdev mailing list, so I can't
> > directly reply to the patch email.
> >
> > This patch is not the correct solution to this issue.
> >

<snip>

> In linux implementation, one interface may have no link local address if
> kernel config
>
> *addr_gen_mode* is set to IN6_ADDR_GEN_MODE_NONE. My patch is to fix
> this problem.
>

So this "IN6_ADDR_GEN_MODE_NONE" behaviour doesn't comply with RFC 4291.

As RFC 4291 says,

"All interfaces are *required* to have *at least one* Link-Local
unicast address."

That's not an ambiguous requirement.

This specific, explicit requirement goes as back as far as RFC 2373
from 1998, the ancestor of RFC 4291. It is also heavily implied in RFC
1884s, 2.7 A Node's Required Addresses.

> And what you say is related to the lo interface.  I'm not sure whether
> the lo interface needs a ll adreess.
>

It is an IPv6 enabled interface, so it requires a link-local address,
per RFC 4291. RFC 4291 doesn't exclude any interfaces types from the
LL address requirement.

Even special NBMA links/interfaces are not excluded from this
requirement, as Link-Local addresses are formed and used in the NBMA
operation, per RFC 2491.

> IMO the ll address is used to get l2 address by sending ND ns. The lo is
> very special.
>

From an IPv6 perspective, the virtual loopback interface isn't all that special.

A general theme of IPv6 is to try to treat things as similarly as
possible, compared to the IPv4 where a lot of things were treated as
special cases (e.g. ND runs over ICMPv4, in comparison to ARP running
directly and only over Ethernet/802.3. RFC 4861 treats point-to-point
links as multicast capable links, emulating multicast if necessary.
RAs and DHCPv6 are used over PPP links to carry parameters, rather
than using IPv6CP, compared to using IPv4 IPCP to carry e.g. DNS
addresses)

The main place the loopback behaviour causes issues is with IPv6 ND
Duplicate Address Detection. Appendix A of RFC 4861, and RFC 7527,
"Enhanced Duplicate Address Detection" discuss how to deal with that.
Some physical interfaces can be in loopback mode too, so IPv6 DAD has
to deal with that temporary situation.

LL addresses are and can be used for lots of things, including by
end-user applications as a preference when there is a choice between a
set of LL addr(s), GUA and ULA addresses.

Here is an Internet Draft that describes the general characteristics
of Link-Local addresses with references, as well as the benefits of
and how to use them in applications.

"How to use IPv6 Link-Local Addresses in Applications"
https://tools.ietf.org/html/draft-smith-ipv6-link-locals-apps-00


Regards,
Mark.

> Thanks
>
> Su
>
> >
> > "2.1. Addressing Model"
> >
> > ...
> >
> > "All interfaces are required to have at least one Link-Local unicast
> >     address (see Section 2.8 for additional required addresses)."
> >
> > I have submitted a more specific bug regarding no Link-Local
> > address/prefix on the Linux kernel loopback interface through RedHat
> > bugzilla as I use Fedora 30, however it doesn't seem to have been
> > looked at yet.
> >
> > "Loopback network interface does not have a Link Local address,
> > contrary to RFC 4291"
> > https://bugzilla.redhat.com/show_bug.cgi?id=1706709
> >
> >
> > Thanks very much,
> > Mark.
> >
> >
>
>

^ permalink raw reply

* [PATCH nf] netfilter: nf_tables: map basechain priority to hardware priority
From: Pablo Neira Ayuso @ 2019-07-30 10:54 UTC (permalink / raw)
  To: netfilter-devel
  Cc: wenxu, jiri, marcelo.leitner, saeedm, gerlitz.or, paulb, netdev

This patch maps basechain netfilter priorities from -8192 to 8191 to
hardware priority 0xC000 + 1. tcf_auto_prio() uses 0xC000 if the user
specifies no priority, then it subtract 1 for each new tcf_proto object.
This patch uses the hardware priority range from 0xC000 to 0xFFFF for
netfilter.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
This follows a rather conservative approach, I could just expose the
2^16 hardware priority range, but we may need to split this priority
range among the ethtool_rx, tc and netfilter subsystems to start with
and it should be possible to extend the priority range later on.

By netfilter priority, I'm refering to the basechain priority:

	add chain x y { type filter hook ingress device eth0 priority 0; }
                                                             ^^^^^^^^^^^

This is no transparently mapped to hardware, this patch shifts it to
make it fit into the 0xC000 + 1 .. 0xFFFF hardware priority range.

 include/net/netfilter/nf_tables_offload.h |  2 ++
 net/netfilter/nf_tables_api.c             |  4 ++++
 net/netfilter/nf_tables_offload.c         | 32 ++++++++++++++++++++++++++++---
 3 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/nf_tables_offload.h b/include/net/netfilter/nf_tables_offload.h
index 3196663a10e3..2d497394021e 100644
--- a/include/net/netfilter/nf_tables_offload.h
+++ b/include/net/netfilter/nf_tables_offload.h
@@ -73,4 +73,6 @@ int nft_flow_rule_offload_commit(struct net *net);
 	(__reg)->key		= __key;				\
 	memset(&(__reg)->mask, 0xff, (__reg)->len);
 
+u16 nft_chain_offload_priority(struct nft_base_chain *basechain);
+
 #endif
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 93647fdf435c..9ee6db9a668d 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1669,6 +1669,10 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask,
 
 		chain->flags |= NFT_BASE_CHAIN | flags;
 		basechain->policy = NF_ACCEPT;
+		if (chain->flags & NFT_CHAIN_HW_OFFLOAD &&
+		    !nft_chain_offload_priority(basechain))
+			return -EOPNOTSUPP;
+
 		flow_block_init(&basechain->flow_block);
 	} else {
 		chain = kzalloc(sizeof(*chain), GFP_KERNEL);
diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
index ec70978eba5a..df8427ba857c 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -156,10 +156,11 @@ void nft_offload_update_dependency(struct nft_offload_ctx *ctx,
 }
 
 static void nft_flow_offload_common_init(struct flow_cls_common_offload *common,
-					 __be16 proto,
-					struct netlink_ext_ack *extack)
+					 __be16 proto, int priority,
+					 struct netlink_ext_ack *extack)
 {
 	common->protocol = proto;
+	common->prio = priority;
 	common->extack = extack;
 }
 
@@ -180,6 +181,29 @@ static int nft_setup_cb_call(struct nft_base_chain *basechain,
 	return 0;
 }
 
+/* Available priorities for hardware offload range: -8192..8191 */
+#define NFT_BASECHAIN_OFFLOAD_PRIO_MAX		(SHRT_MAX / 4)
+#define NFT_BASECHAIN_OFFLOAD_PRIO_MIN		(SHRT_MIN / 4)
+#define NFT_BASECHAIN_OFFLOAD_PRIO_RANGE	(USHRT_MAX / 4)
+/* tcf_auto_prio() uses 0xC000 as base, then subtract one for each new chain. */
+#define NFT_BASECHAIN_OFFLOAD_HW_PRIO_BASE	(0xC000 + 1)
+
+u16 nft_chain_offload_priority(struct nft_base_chain *basechain)
+{
+	u16 prio;
+
+	if (basechain->ops.priority < NFT_BASECHAIN_OFFLOAD_PRIO_MIN ||
+	    basechain->ops.priority > NFT_BASECHAIN_OFFLOAD_PRIO_MAX)
+		return 0;
+
+	/* map netfilter chain priority to hardware priority. */
+	prio = basechain->ops.priority +
+		NFT_BASECHAIN_OFFLOAD_PRIO_MAX +
+			NFT_BASECHAIN_OFFLOAD_HW_PRIO_BASE;
+
+	return prio;
+}
+
 static int nft_flow_offload_rule(struct nft_trans *trans,
 				 enum flow_cls_command command)
 {
@@ -200,7 +224,9 @@ static int nft_flow_offload_rule(struct nft_trans *trans,
 	if (flow)
 		proto = flow->proto;
 
-	nft_flow_offload_common_init(&cls_flow.common, proto, &extack);
+	nft_flow_offload_common_init(&cls_flow.common, proto,
+				     nft_chain_offload_priority(basechain),
+				     &extack);
 	cls_flow.command = command;
 	cls_flow.cookie = (unsigned long) rule;
 	if (flow)
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH][next] mlxsw: spectrum_ptp: fix duplicated check on orig_egr_types
From: Petr Machata @ 2019-07-30 11:00 UTC (permalink / raw)
  To: Colin King
  Cc: Jiri Pirko, Ido Schimmel, David S . Miller,
	netdev@vger.kernel.org, kernel-janitors@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190730102114.1506-1-colin.king@canonical.com>


Colin King <colin.king@canonical.com> writes:

> From: Colin Ian King <colin.king@canonical.com>
>
> Currently there is a duplicated check on orig_egr_types which is
> redundant, I believe this is a typo and should actually be
> orig_ing_types || orig_egr_types instead of the expression
> orig_egr_types || orig_egr_types.  Fix this.

Good catch, yes, there's a typo. Thanks for the fix!

> Addresses-Coverity: ("Same on both sides")
> Fixes: c6b36bdd04b5 ("mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Reviewed-by: Petr Machata <petrm@mellanox.com>

> ---
>  drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
> index 98c5ba3200bc..f02d74e55d95 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
> @@ -999,7 +999,7 @@ static int mlxsw_sp1_ptp_mtpppc_update(struct mlxsw_sp_port *mlxsw_sp_port,
>  		}
>  	}
>  
> -	if ((ing_types || egr_types) && !(orig_egr_types || orig_egr_types)) {
> +	if ((ing_types || egr_types) && !(orig_ing_types || orig_egr_types)) {
>  		err = mlxsw_sp_nve_inc_parsing_depth_get(mlxsw_sp);
>  		if (err) {
>  			netdev_err(mlxsw_sp_port->dev, "Failed to increase parsing depth");


^ permalink raw reply

* Re: [PATCH][next] mlxsw: spectrum_ptp: fix duplicated check on orig_egr_types
From: Petr Machata @ 2019-07-30 11:05 UTC (permalink / raw)
  To: Colin King
  Cc: Jiri Pirko, Ido Schimmel, David S . Miller,
	netdev@vger.kernel.org, kernel-janitors@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <87mugvzt1m.fsf@mellanox.com>


Petr Machata <petrm@mellanox.com> writes:

> Colin King <colin.king@canonical.com> writes:
>
>> From: Colin Ian King <colin.king@canonical.com>
>>
>> Currently there is a duplicated check on orig_egr_types which is
>> redundant, I believe this is a typo and should actually be
>> orig_ing_types || orig_egr_types instead of the expression
>> orig_egr_types || orig_egr_types.  Fix this.
>
> Good catch, yes, there's a typo. Thanks for the fix!
>
>> Addresses-Coverity: ("Same on both sides")
>> Fixes: c6b36bdd04b5 ("mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled")
>> Signed-off-by: Colin Ian King <colin.king@canonical.com>
>
> Reviewed-by: Petr Machata <petrm@mellanox.com>

However the patch should address "net", not "next". Can you please respin?

^ permalink raw reply

* Re: [PATCH][next] mlxsw: spectrum_ptp: fix duplicated check on orig_egr_types
From: Petr Machata @ 2019-07-30 11:12 UTC (permalink / raw)
  To: Colin King
  Cc: Jiri Pirko, Ido Schimmel, David S . Miller,
	netdev@vger.kernel.org, kernel-janitors@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <87mugvzt1m.fsf@mellanox.com>


Petr Machata <petrm@mellanox.com> writes:

> Colin King <colin.king@canonical.com> writes:
>
>> From: Colin Ian King <colin.king@canonical.com>
>>
>> Currently there is a duplicated check on orig_egr_types which is
>> redundant, I believe this is a typo and should actually be
>> orig_ing_types || orig_egr_types instead of the expression
>> orig_egr_types || orig_egr_types.  Fix this.
>
> Good catch, yes, there's a typo. Thanks for the fix!
>
>> Addresses-Coverity: ("Same on both sides")
>> Fixes: c6b36bdd04b5 ("mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled")
>> Signed-off-by: Colin Ian King <colin.king@canonical.com>
>
> Reviewed-by: Petr Machata <petrm@mellanox.com>

I see that there is an identical problem in the code one block further.
Can you take care of that as well, please? Or should I do it?

^ permalink raw reply

* Re: [PATCH nf] netfilter: nf_tables: map basechain priority to hardware priority
From: Pablo Neira Ayuso @ 2019-07-30 11:18 UTC (permalink / raw)
  To: netfilter-devel
  Cc: wenxu, jiri, marcelo.leitner, saeedm, gerlitz.or, paulb, netdev
In-Reply-To: <20190730105417.14538-1-pablo@netfilter.org>

On Tue, Jul 30, 2019 at 12:54:17PM +0200, Pablo Neira Ayuso wrote:
[...]
> @@ -180,6 +181,29 @@ static int nft_setup_cb_call(struct nft_base_chain *basechain,
>  	return 0;
>  }
>  
> +/* Available priorities for hardware offload range: -8192..8191 */
> +#define NFT_BASECHAIN_OFFLOAD_PRIO_MAX		(SHRT_MAX / 4)
> +#define NFT_BASECHAIN_OFFLOAD_PRIO_MIN		(SHRT_MIN / 4)
> +#define NFT_BASECHAIN_OFFLOAD_PRIO_RANGE	(USHRT_MAX / 4)
> +/* tcf_auto_prio() uses 0xC000 as base, then subtract one for each new chain. */
> +#define NFT_BASECHAIN_OFFLOAD_HW_PRIO_BASE	(0xC000 + 1)
> +
> +u16 nft_chain_offload_priority(struct nft_base_chain *basechain)
> +{
> +	u16 prio;
> +
> +	if (basechain->ops.priority < NFT_BASECHAIN_OFFLOAD_PRIO_MIN ||
> +	    basechain->ops.priority > NFT_BASECHAIN_OFFLOAD_PRIO_MAX)
> +		return 0;
> +
> +	/* map netfilter chain priority to hardware priority. */
> +	prio = basechain->ops.priority +
> +		NFT_BASECHAIN_OFFLOAD_PRIO_MAX +
> +			NFT_BASECHAIN_OFFLOAD_HW_PRIO_BASE;
> +
> +	return prio;

This function should actually return:

        return prio << 16;

> +}
> +
>  static int nft_flow_offload_rule(struct nft_trans *trans,
>  				 enum flow_cls_command command)
>  {

^ permalink raw reply

* Re: [PATCH nf] netfilter: nf_tables: map basechain priority to hardware priority
From: Pablo Neira Ayuso @ 2019-07-30 11:19 UTC (permalink / raw)
  To: netfilter-devel
  Cc: wenxu, jiri, marcelo.leitner, saeedm, gerlitz.or, paulb, netdev
In-Reply-To: <20190730111800.yhtd5pgd32wyfilt@salvia>

On Tue, Jul 30, 2019 at 01:18:00PM +0200, Pablo Neira Ayuso wrote:
> On Tue, Jul 30, 2019 at 12:54:17PM +0200, Pablo Neira Ayuso wrote:
> [...]
> > @@ -180,6 +181,29 @@ static int nft_setup_cb_call(struct nft_base_chain *basechain,
> >  	return 0;
> >  }
> >  
> > +/* Available priorities for hardware offload range: -8192..8191 */
> > +#define NFT_BASECHAIN_OFFLOAD_PRIO_MAX		(SHRT_MAX / 4)
> > +#define NFT_BASECHAIN_OFFLOAD_PRIO_MIN		(SHRT_MIN / 4)
> > +#define NFT_BASECHAIN_OFFLOAD_PRIO_RANGE	(USHRT_MAX / 4)
> > +/* tcf_auto_prio() uses 0xC000 as base, then subtract one for each new chain. */
> > +#define NFT_BASECHAIN_OFFLOAD_HW_PRIO_BASE	(0xC000 + 1)
> > +
> > +u16 nft_chain_offload_priority(struct nft_base_chain *basechain)
> > +{
> > +	u16 prio;
> > +
> > +	if (basechain->ops.priority < NFT_BASECHAIN_OFFLOAD_PRIO_MIN ||
> > +	    basechain->ops.priority > NFT_BASECHAIN_OFFLOAD_PRIO_MAX)
> > +		return 0;
> > +
> > +	/* map netfilter chain priority to hardware priority. */
> > +	prio = basechain->ops.priority +
> > +		NFT_BASECHAIN_OFFLOAD_PRIO_MAX +
> > +			NFT_BASECHAIN_OFFLOAD_HW_PRIO_BASE;
> > +
> > +	return prio;
> 
> This function should actually return:
> 
>         return prio << 16;

Better shift it from here:

static void nft_flow_offload_common_init(struct flow_cls_common_offload *common,
                                         __be16 proto, int priority,
                                         struct netlink_ext_ack *extack)
{
        common->protocol = proto;
        common->prio = priority << 16;
        common->extack = extack;
}

Drivers assume tc handle format.

^ permalink raw reply

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Allan W. Nielsen @ 2019-07-30 11:23 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Ido Schimmel, Horatiu Vultur, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <5b0c92cd-f78a-a504-4730-c07268366e21@cumulusnetworks.com>

The 07/30/2019 12:55, Nikolay Aleksandrov wrote:
> On 30/07/2019 12:21, Allan W. Nielsen wrote:
> > The 07/30/2019 11:58, Nikolay Aleksandrov wrote:
> >> On 30/07/2019 11:30, Allan W. Nielsen wrote:
> >>> The 07/30/2019 10:06, Ido Schimmel wrote:
> >>>> On Tue, Jul 30, 2019 at 08:27:22AM +0200, Allan W. Nielsen wrote:
> >>>>> The 07/29/2019 20:51, Ido Schimmel wrote:
> >>>>>> Can you please clarify what you're trying to achieve? I just read the
> >>>>>> thread again and my impression is that you're trying to locally receive
> >>>>>> packets with a certain link layer multicast address.
> >>>>> Yes. The thread is also a bit confusing because we half way through realized
> >>>>> that we misunderstood how the multicast packets should be handled (sorry about
> >>>>> that). To begin with we had a driver where multicast packets was only copied to
> >>>>> the CPU if someone needed it. Andrew and Nikolay made us aware that this is not
> >>>>> how other drivers are doing it, so we changed the driver to include the CPU in
> >>>>> the default multicast flood-mask.
> >>>> OK, so what prevents you from removing all other ports from the
> >>>> flood-mask and letting the CPU handle the flooding? Then you can install
> >>>> software tc filters to limit the flooding.
> >>> I do not have the bandwidth to forward the multicast traffic in the CPU.
> >>>
> >>> It will also cause enormous latency on the forwarding of L2 multicast packets.
> >>>
> >>>>> This changes the objective a bit. To begin with we needed to get more packets to
> >>>>> the CPU (which could have been done using tc ingress rules and a trap action).
> >>>>>
> >>>>> Now after we changed the driver, we realized that we need something to limit the
> >>>>> flooding of certain L2 multicast packets. This is the new problem we are trying
> >>>>> to solve!
> >>>>>
> >>>>> Example: Say we have a bridge with 4 slave interfaces, then we want to install a
> >>>>> forwarding rule saying that packets to a given L2-multicast MAC address, should
> >>>>> only be flooded to 2 of the 4 ports.
> >>>>>
> >>>>> (instead of adding rules to get certain packets to the CPU, we are now adding
> >>>>> other rules to prevent other packets from going to the CPU and other ports where
> >>>>> they are not needed/wanted).
> >>>>>
> >>>>> This is exactly the same thing as IGMP snooping does dynamically, but only for
> >>>>> IP multicast.
> >>>>>
> >>>>> The "bridge mdb" allow users to manually/static add/del a port to a multicast
> >>>>> group, but still it operates on IP multicast address (not L2 multicast
> >>>>> addresses).
> >>>>>
> >>>>>> Nik suggested SIOCADDMULTI.
> >>>>> It is not clear to me how this should be used to limit the flooding, maybe we
> >>>>> can make some hacks, but as far as I understand the intend of this is maintain
> >>>>> the list of addresses an interface should receive. I'm not sure this should
> >>>>> influence how for forwarding decisions are being made.
> >>>>>
> >>>>>> and I suggested a tc filter to get the packet to the CPU.
> >>>>> The TC solution is a good solution to the original problem where wanted to copy
> >>>>> more frames to the CPU. But we were convinced that this is not the right
> >>>>> approach, and that the CPU by default should receive all multicast packets, and
> >>>>> we should instead try to find a way to limit the flooding of certain frames as
> >>>>> an optimization.
> >>>>
> >>>> This can still work. In Linux, ingress tc filters are executed before the
> >>>> bridge's Rx handler. The same happens in every sane HW. Ingress ACL is
> >>>> performed before L2 forwarding. Assuming you have eth0-eth3 bridged and
> >>>> you want to prevent packets with DMAC 01:21:6C:00:00:01 from egressing
> >>>> eth2:
> >>>>
> >>>> # tc filter add dev eth0 ingress pref 1 flower skip_sw \
> >>>> 	dst_mac 01:21:6C:00:00:01 action trap
> >>>> # tc filter add dev eth2 egress pref 1 flower skip_hw \
> >>>> 	dst_mac 01:21:6C:00:00:01 action drop
> >>>>
> >>>> The first filter is only present in HW ('skip_sw') and should result in
> >>>> your HW passing you the sole copy of the packet.
> >>> Agree.
> >>>
> >>>> The second filter is only present in SW ('skip_hw', not using HW egress
> >>>> ACL that you don't have) and drops the packet after it was flooded by
> >>>> the SW bridge.
> >>> Agree.
> >>>
> >>>> As I mentioned earlier, you can install the filter once in your HW and
> >>>> share it between different ports using a shared block. This means you
> >>>> only consume one TCAM entry.
> >>>>
> >>>> Note that this allows you to keep flooding all other multicast packets
> >>>> in HW.
> >>> Yes, but the frames we want to limit the flood-mask on are the exact frames
> >>> which occurs at a very high rate, and where latency is important.
> >>>
> >>> I really do not consider it as an option to forward this in SW, when it is
> >>> something that can easily be offloaded in HW.
> >>>
> >>>>>> If you now want to limit the ports to which this packet is flooded, then
> >>>>>> you can use tc filters in *software*:
> >>>>>>
> >>>>>> # tc qdisc add dev eth2 clsact
> >>>>>> # tc filter add dev eth2 egress pref 1 flower skip_hw \
> >>>>>> 	dst_mac 01:21:6C:00:00:01 action drop
> >>>>> Yes. This can work in the SW bridge.
> >>>>>
> >>>>>> If you want to forward the packet in hardware and locally receive it,
> >>>>>> you can chain several mirred action and then a trap action.
> >>>>> I'm not I fully understand how this should be done, but it does sound like it
> >>>>> becomes quite complicated. Also, as far as I understand it will mean that we
> >>>>> will be using TCAM/ACL resources to do something that could have been done with
> >>>>> a simple MAC entry.
> >>>>>
> >>>>>> Both options avoid HW egress ACLs which your design does not support.
> >>>>> True, but what is wrong with expanding the functionality of the normal
> >>>>> forwarding/MAC operations to allow multiple destinations?
> >>>>>
> >>>>> It is not an uncommon feature (I just browsed the manual of some common L2
> >>>>> switches and they all has this feature).
> >>>>>
> >>>>> It seems to fit nicely into the existing user-interface:
> >>>>>
> >>>>> bridge fdb add    01:21:6C:00:00:01 port eth0
> >>>>> bridge fdb append 01:21:6C:00:00:01 port eth1
> >>>>
> >>>> Wouldn't it be better to instead extend the MDB entries so that they are
> >>>> either keyed by IP or MAC? I believe FDB should remain as unicast-only.
> >>>
> >>> You might be right, it was not clear to me which of the two would fit the
> >>> purpose best.
> >>>
> >>> From a user-space iproute2 perspective I prefer using the "bridge fdb" command
> >>> as it already supports the needed syntax, and I do not think it will be too
> >>> pretty if we squeeze this into the "bridge mdb" command syntax.
> >>>
> >>
> >> MDB is a much better fit as Ido already suggested. FDB should remain unicast
> >> and mixing them is not a good idea, we already have a good ucast/mcast separation
> >> and we'd like to keep it that way.
> > Okay. We will explore that option.
> > 
> > 
> Great, thanks.
> 
> >>> But that does not mean that it need to go into the FDB database in the
> >>> implementation.
> >>>
> >>> Last evening when I looked at it again, I was considering keeping the
> >>> net_bridge_fdb_entry structure as is, and add a new hashtable with the
> >>> following:
> >>>
> >>> struct net_bridge_fdbmc_entry {
> >>> 	struct rhash_head		rhnode;
> >>> 	struct net_bridge_fdbmc_ports   *dst;
> >>>
> >>> 	struct net_bridge_fdb_key	key;
> >>> 	struct hlist_node		fdb_node;
> >>> 	unsigned char			offloaded:1;
> >>>
> >>> 	struct rcu_head			rcu;
> >>> };
> >>>
> >>
> >> What would the notification for this look like ?
> > Not sure. But we will change the direction and use the MDB structures instead.
> > 
> 
> Ack
> 
> >>> If we go with this approach then we can look at the MAC address and see if it is
> >>> a unicast which will cause a lookup in the fdb, l3-multicast (33:33:* or
> >>> 01:00:5e:*) which will cause a lookup in the mdb, or finally a fdbmc which will
> >>> need to do a lookup in this new hashtable.
> >>
> >> That sounds wrong, you will change the current default behaviour of flooding these
> >> packets. This will have to be well hidden behind a new option and enabled only on user
> >> request.
> > It will only affect users who install a static L2-multicast entry. If no entry
> > is found, it will default to flooding, which will be the same as before.
> > 
> 
> Ack
> 
> >>> Alternative it would be like this:
> >>>
> >>> struct net_bridge_fdb_entry {
> >>> 	struct rhash_head		rhnode;
> >>> 	union net_bridge_port_or_list	*dst;
> >>>
> >>> 	struct net_bridge_fdb_key	key;
> >>> 	struct hlist_node		fdb_node;
> >>> 	unsigned char			is_local:1,
> >>> 					is_static:1,
> >>> 					is_sticky:1,
> >>> 					added_by_user:1,
> >>> 					added_by_external_learn:1,
> >>> 					offloaded:1;
> >>> 					multi_dst:1;
> >>>
> >>> 	/* write-heavy members should not affect lookups */
> >>> 	unsigned long			updated ____cacheline_aligned_in_smp;
> >>> 	unsigned long			used;
> >>>
> >>> 	struct rcu_head			rcu;
> >>> };
> >>>
> >>> Both solutions should require fairly few changes, and should not cause any
> >>> measurable performance hit.
> >>>
> >>
> >> You'll have to convert these bits to use the proper atomic bitops if you go with
> >> the second solution. That has to be done even today, but the second case would
> >> make it a must.
> > Good to know.
> > 
> > Just for my understanding, is this because this is the "current" guide lines on
> > how things should be done, or is this because the multi_dst as a special need.
> > 
> > The multi_dst flag will never be changed in the life-cycle of the structure, and
> > the structure is protected by rcu. If this is causeing a raise, then I do not
> > see it.
> > 
> 
> These flags are changed from different contexts without any locking and you can end
> up with wrong values since these are not atomic ops. We need to move to atomic ops
> to guarantee consistent results. It is not only multi_dst issue, it's a problem
> for all of them, it's just not critical today since you'll end up with wrong
> flag values in such cases, but with multi_dst it will be important because the union
> pointer will have to be treated different based on the multi_dst value.
Okay, thanks for explaining.

> >>> Making it fit into the net_bridge_mdb_entry seems to be harder.
> >>>
> >>
> >> But it is the correct abstraction from bridge POV, so please stop trying to change
> >> the FDB code and try to keep to the multicast code.
> > We are planning on letting the net_bridge_port_or_list union use the
> > net_bridge_port_group structure, which will mean that we can re-use the
> > br_multicast_flood function (if we change the signatire to accept the ports
> > instead of the entry).
> > 
> 
> That sounds great, definitely a step in the right direction.
> 
> >>>> As a bonus, existing drivers could benefit from it, as MDB entries are already
> >>>> notified by MAC.
> >>> Not sure I follow. When FDB entries are added, it also generates notification
> >>> events.
> >>>
> >>
> >> Could you please show fdb event with multiple ports ?
> > We will get to that. Maybe this is an argument for converting to mdb. We have
> > not looked into the details of this yet.
> I can see a few potential problems, the important thing here would be to keep
> backwards compatibility.
Okay, good to be aware of

> >>>>> It seems that it can be added to the existing implementation with out adding
> >>>>> significant complexity.
> >>>>>
> >>>>> It will be easy to offload in HW.
> >>>>>
> >>>>> I do not believe that it will be a performance issue, if this is a concern then
> >>>>> we may have to do a bit of benchmarking, or we can make it a configuration
> >>>>> option.
> >>>>>
> >>>>> Long story short, we (Horatiu and I) learned a lot from the discussion here, and
> >>>>> I think we should try do a new patch with the learning we got. Then it is easier
> >>>>> to see what it actually means to the exiting code, complexity, exiting drivers,
> >>>>> performance, default behavioral, backwards compatibly, and other valid concerns.
> >>>>>
> >>>>> If the patch is no good, and cannot be fixed, then we will go back and look
> >>>>> further into alternative solutions.
> >>>> Overall, I tend to agree with Nik. I think your use case is too specific
> >>>> to justify the amount of changes you want to make in the bridge driver.
> >>>> We also provided other alternatives. That being said, you're more than
> >>>> welcome to send the patches and we can continue the discussion then.
> >>> Okay, good to know. I'm not sure I agree that the alternative solutions really
> >>> solves the issue this is trying to solve, nor do I agree that this is specific
> >>> to our needs.
> >>>
> >>> But lets take a look at a new patch, and see what is the amount of changes we
> >>> are talking about. Without having the patch it is really hard to know for sure.
> >> Please keep in mind that this case is the exception, not the norm, thus it should
> >> not under any circumstance affect the standard deployments.
> > Understood - no surprises.
> > 
> 
> Great, thanks again. Will continue discussing when the new patch arrives.
Sure, looking forward to that.

/Allan

^ permalink raw reply

* [PATCH net] net: bridge: mcast: don't delete permanent entries when fast leave is enabled
From: Nikolay Aleksandrov @ 2019-07-30 11:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, bridge, Nikolay Aleksandrov

When permanent entries were introduced by the commit below, they were
exempt from timing out and thus igmp leave wouldn't affect them unless
fast leave was enabled on the port which was added before permanent
entries existed. It shouldn't matter if fast leave is enabled or not
if the user added a permanent entry it shouldn't be deleted on igmp
leave.

Before:
$ echo 1 > /sys/class/net/eth4/brport/multicast_fast_leave
$ bridge mdb add dev br0 port eth4 grp 229.1.1.1 permanent
$ bridge mdb show
dev br0 port eth4 grp 229.1.1.1 permanent

< join and leave 229.1.1.1 on eth4 >

$ bridge mdb show
$

After:
$ echo 1 > /sys/class/net/eth4/brport/multicast_fast_leave
$ bridge mdb add dev br0 port eth4 grp 229.1.1.1 permanent
$ bridge mdb show
dev br0 port eth4 grp 229.1.1.1 permanent

< join and leave 229.1.1.1 on eth4 >

$ bridge mdb show
dev br0 port eth4 grp 229.1.1.1 permanent

Fixes: ccb1c31a7a87 ("bridge: add flags to distinguish permanent mdb entires")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
I'll re-work this code in net-next as there's a lot of duplication.

 net/bridge/br_multicast.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 3d8deac2353d..f8cac3702712 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1388,6 +1388,9 @@ br_multicast_leave_group(struct net_bridge *br,
 			if (!br_port_group_equal(p, port, src))
 				continue;
 
+			if (p->flags & MDB_PG_FLAGS_PERMANENT)
+				break;
+
 			rcu_assign_pointer(*pp, p->next);
 			hlist_del_init(&p->mglist);
 			del_timer(&p->timer);
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next] be2net: disable bh with spin_lock in be_process_mcc
From: Denis Kirjanov @ 2019-07-30 11:32 UTC (permalink / raw)
  To: sathya.perla, ajit.khaparde, sriharsha.basavapatna; +Cc: netdev, Denis Kirjanov

Signed-off-by: Denis Kirjanov <kdav@linux-powerpc.org>
---
 drivers/net/ethernet/emulex/benet/be_cmds.c | 4 ++--
 drivers/net/ethernet/emulex/benet/be_main.c | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index ef5d61d57597..9365218f4d3b 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -550,7 +550,7 @@ int be_process_mcc(struct be_adapter *adapter)
 	int num = 0, status = 0;
 	struct be_mcc_obj *mcc_obj = &adapter->mcc_obj;
 
-	spin_lock(&adapter->mcc_cq_lock);
+	spin_lock_bh(&adapter->mcc_cq_lock);
 
 	while ((compl = be_mcc_compl_get(adapter))) {
 		if (compl->flags & CQE_FLAGS_ASYNC_MASK) {
@@ -566,7 +566,7 @@ int be_process_mcc(struct be_adapter *adapter)
 	if (num)
 		be_cq_notify(adapter, mcc_obj->cq.id, mcc_obj->rearm_cq, num);
 
-	spin_unlock(&adapter->mcc_cq_lock);
+	spin_unlock_bh(&adapter->mcc_cq_lock);
 	return status;
 }
 
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 2edb86ec9fe9..4d8e40ac66d2 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -5630,9 +5630,7 @@ static void be_worker(struct work_struct *work)
 	 * mcc completions
 	 */
 	if (!netif_running(adapter->netdev)) {
-		local_bh_disable();
 		be_process_mcc(adapter);
-		local_bh_enable();
 		goto reschedule;
 	}
 
-- 
2.12.3


^ permalink raw reply related

* Re: [PATCH][next] mlxsw: spectrum_ptp: fix duplicated check on orig_egr_types
From: Colin Ian King @ 2019-07-30 11:37 UTC (permalink / raw)
  To: Petr Machata
  Cc: Jiri Pirko, Ido Schimmel, David S . Miller,
	netdev@vger.kernel.org, kernel-janitors@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <87imrjzsgu.fsf@mellanox.com>

On 30/07/2019 12:12, Petr Machata wrote:
> 
> Petr Machata <petrm@mellanox.com> writes:
> 
>> Colin King <colin.king@canonical.com> writes:
>>
>>> From: Colin Ian King <colin.king@canonical.com>
>>>
>>> Currently there is a duplicated check on orig_egr_types which is
>>> redundant, I believe this is a typo and should actually be
>>> orig_ing_types || orig_egr_types instead of the expression
>>> orig_egr_types || orig_egr_types.  Fix this.
>>
>> Good catch, yes, there's a typo. Thanks for the fix!
>>
>>> Addresses-Coverity: ("Same on both sides")
>>> Fixes: c6b36bdd04b5 ("mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled")
>>> Signed-off-by: Colin Ian King <colin.king@canonical.com>
>>
>> Reviewed-by: Petr Machata <petrm@mellanox.com>
> 
> I see that there is an identical problem in the code one block further.
> Can you take care of that as well, please? Or should I do it?
> 
I'll sort that out too


^ permalink raw reply

* [PATCH] net: stmmac: Use netif_tx_napi_add() for TX polling function
From: Bartosz Golaszewski @ 2019-07-30 11:38 UTC (permalink / raw)
  To: Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S . Miller, Maxime Coquelin
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-kernel,
	Frode Isaksen, Bartosz Golaszewski

From: Frode Isaksen <fisaksen@baylibre.com>

This variant of netif_napi_add() should be used from drivers
using NAPI to exclusively poll a TX queue.

Signed-off-by: Frode Isaksen <fisaksen@baylibre.com>
Tested-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 98b1a5c6d537..390dad5b9819 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -4313,8 +4313,9 @@ int stmmac_dvr_probe(struct device *device,
 				       NAPI_POLL_WEIGHT);
 		}
 		if (queue < priv->plat->tx_queues_to_use) {
-			netif_napi_add(ndev, &ch->tx_napi, stmmac_napi_poll_tx,
-				       NAPI_POLL_WEIGHT);
+			netif_tx_napi_add(ndev, &ch->tx_napi,
+					  stmmac_napi_poll_tx,
+					  NAPI_POLL_WEIGHT);
 		}
 	}
 
-- 
2.21.0


^ permalink raw reply related

* [PATCH][net-next][V2] mlxsw: spectrum_ptp: fix duplicated check on orig_egr_types
From: Colin King @ 2019-07-30 11:47 UTC (permalink / raw)
  To: Petr Machata, Jiri Pirko, Ido Schimmel, David S . Miller, netdev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Currently are duplicated checks on orig_egr_types which are
redundant, I believe this is a typo and should actually be
orig_ing_types || orig_egr_types instead of the expression
orig_egr_types || orig_egr_types.  Fix these.

Addresses-Coverity: ("Same on both sides")
Fixes: c6b36bdd04b5 ("mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
index 98c5ba3200bc..63b07edd9d81 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
@@ -999,14 +999,14 @@ static int mlxsw_sp1_ptp_mtpppc_update(struct mlxsw_sp_port *mlxsw_sp_port,
 		}
 	}
 
-	if ((ing_types || egr_types) && !(orig_egr_types || orig_egr_types)) {
+	if ((ing_types || egr_types) && !(orig_ing_types || orig_egr_types)) {
 		err = mlxsw_sp_nve_inc_parsing_depth_get(mlxsw_sp);
 		if (err) {
 			netdev_err(mlxsw_sp_port->dev, "Failed to increase parsing depth");
 			return err;
 		}
 	}
-	if (!(ing_types || egr_types) && (orig_egr_types || orig_egr_types))
+	if (!(ing_types || egr_types) && (orig_ing_types || orig_egr_types))
 		mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp);
 
 	return mlxsw_sp1_ptp_mtpppc_set(mlxsw_sp_port->mlxsw_sp,

--

V2: fix both occurances of this typo. Thanks to Petr Machata for spotting
    the 2nd occurrance
-- 
2.20.1


^ permalink raw reply related

* KMSAN: uninit-value in smsc95xx_wait_eeprom
From: syzbot @ 2019-07-30 12:18 UTC (permalink / raw)
  To: UNGLinuxDriver, davem, glider, linux-kernel, linux-usb, netdev,
	steve.glendinning, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    beaab8a3 fix KASAN build
git tree:       kmsan
console output: https://syzkaller.appspot.com/x/log.txt?x=1685d8bfa00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=4db781fe35a84ef5
dashboard link: https://syzkaller.appspot.com/bug?extid=136c17d735f025fc86a7
compiler:       clang version 9.0.0 (/home/glider/llvm/clang  
80fee25776c2fb61e74c1ecb1a523375c2500b69)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+136c17d735f025fc86a7@syzkaller.appspotmail.com

usb 1-1: New USB device found, idVendor=0424, idProduct=9908,  
bcdDevice=6a.5e
usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
usb 1-1: config 0 descriptor??
smsc95xx v1.0.6
==================================================================
BUG: KMSAN: uninit-value in smsc95xx_wait_eeprom+0x1fb/0x3d0  
drivers/net/usb/smsc95xx.c:300
CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.2.0+ #15
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Workqueue: usb_hub_wq hub_event
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x191/0x1f0 lib/dump_stack.c:113
  kmsan_report+0x162/0x2d0 mm/kmsan/kmsan_report.c:109
  __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:294
  smsc95xx_wait_eeprom+0x1fb/0x3d0 drivers/net/usb/smsc95xx.c:300
  smsc95xx_read_eeprom+0x3c2/0x920 drivers/net/usb/smsc95xx.c:357
  smsc95xx_init_mac_address drivers/net/usb/smsc95xx.c:914 [inline]
  smsc95xx_bind+0x467/0x1690 drivers/net/usb/smsc95xx.c:1286
  usbnet_probe+0x10d3/0x3950 drivers/net/usb/usbnet.c:1722
  usb_probe_interface+0xd19/0x1310 drivers/usb/core/driver.c:361
  really_probe+0x1344/0x1d90 drivers/base/dd.c:513
  driver_probe_device+0x1ba/0x510 drivers/base/dd.c:670
  __device_attach_driver+0x5b8/0x790 drivers/base/dd.c:777
  bus_for_each_drv+0x28e/0x3b0 drivers/base/bus.c:454
  __device_attach+0x489/0x750 drivers/base/dd.c:843
  device_initial_probe+0x4a/0x60 drivers/base/dd.c:890
  bus_probe_device+0x131/0x390 drivers/base/bus.c:514
  device_add+0x25b5/0x2df0 drivers/base/core.c:2111
  usb_set_configuration+0x309f/0x3710 drivers/usb/core/message.c:2027
  generic_probe+0xe7/0x280 drivers/usb/core/generic.c:210
  usb_probe_device+0x146/0x200 drivers/usb/core/driver.c:266
  really_probe+0x1344/0x1d90 drivers/base/dd.c:513
  driver_probe_device+0x1ba/0x510 drivers/base/dd.c:670
  __device_attach_driver+0x5b8/0x790 drivers/base/dd.c:777
  bus_for_each_drv+0x28e/0x3b0 drivers/base/bus.c:454
  __device_attach+0x489/0x750 drivers/base/dd.c:843
  device_initial_probe+0x4a/0x60 drivers/base/dd.c:890
  bus_probe_device+0x131/0x390 drivers/base/bus.c:514
  device_add+0x25b5/0x2df0 drivers/base/core.c:2111
  usb_new_device+0x23e5/0x2fb0 drivers/usb/core/hub.c:2534
  hub_port_connect drivers/usb/core/hub.c:5089 [inline]
  hub_port_connect_change drivers/usb/core/hub.c:5204 [inline]
  port_event drivers/usb/core/hub.c:5350 [inline]
  hub_event+0x5853/0x7320 drivers/usb/core/hub.c:5432
  process_one_work+0x1572/0x1f00 kernel/workqueue.c:2269
  worker_thread+0x111b/0x2460 kernel/workqueue.c:2415
  kthread+0x4b5/0x4f0 kernel/kthread.c:256
  ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:355

Local variable description: ----buf.i.i@smsc95xx_wait_eeprom
Variable was created at:
  __smsc95xx_read_reg drivers/net/usb/smsc95xx.c:80 [inline]
  smsc95xx_read_reg drivers/net/usb/smsc95xx.c:144 [inline]
  smsc95xx_wait_eeprom+0xb6/0x3d0 drivers/net/usb/smsc95xx.c:294
  smsc95xx_read_eeprom+0x3c2/0x920 drivers/net/usb/smsc95xx.c:357
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* NETIF_F_LLTX breaks iwlwifi
From: Jean Delvare @ 2019-07-30 12:18 UTC (permalink / raw)
  To: Felix Fietkau, Toke Høiland-Jørgensen, Johannes Berg
  Cc: David S. Miller, linux-wireless, netdev, linux-kernel

Hi Felix, Toke, Johannes,

After updating to kernel 5.2, I started losing wireless network on my
workstation a few minutes after boot. I could restart the network
service to get it back, but it would go away again a few minutes later.
No error message logged, but somehow the network traffic was no long
being processed.

My hardware is:

05:00.0 Network controller [0280]: Intel Corporation Wireless 8265 / 8275 [8086:24fd] (rev 78)

This is an Intel 8265 PCIe WiFI adapter by Gigabyte, model GC-WB867D-I,
which worked flawlessly for me until then.

I bisected it down to:

commit 8dbb000ee73be2c05e34756739ce308885312a29 (refs/bisect/bad)
Author: Felix Fietkau
Date:   Sat Mar 16 18:06:34 2019 +0100

    mac80211: set NETIF_F_LLTX when using intermediate tx queues

So whatever the commit message says, it is apparently not safe to run
TX handlers on multiple CPUs in parallel for this specific driver /
device.

Unless someone has an immediate explanation as to why it broke the
iwlwifi driver and the actual bug is in iwlwifi and it can be fixed
quickly and easily there, I would suggest that the above commit is
reverted for the time being, as apparently it wasn't fixing anything
but was just a performance optimization.

I am available to do any amount of tests or debugging, given the
guidance.

Thanks,
-- 
Jean Delvare
SUSE L3 Support

^ permalink raw reply

* Re: [PATCH][net-next][V2] mlxsw: spectrum_ptp: fix duplicated check on orig_egr_types
From: Petr Machata @ 2019-07-30 12:19 UTC (permalink / raw)
  To: Colin King
  Cc: Jiri Pirko, Ido Schimmel, David S . Miller,
	netdev@vger.kernel.org, kernel-janitors@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190730114752.24133-1-colin.king@canonical.com>


Colin King <colin.king@canonical.com> writes:

> From: Colin Ian King <colin.king@canonical.com>
>
> Currently are duplicated checks on orig_egr_types which are
> redundant, I believe this is a typo and should actually be
> orig_ing_types || orig_egr_types instead of the expression
> orig_egr_types || orig_egr_types.  Fix these.
>
> Addresses-Coverity: ("Same on both sides")
> Fixes: c6b36bdd04b5 ("mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Thank you, it looks good. But can you please direct this to "net", not
"net-next"?

^ permalink raw reply

* [PATCH net-next] net: bridge: mcast: add delete due to fast-leave mdb flag
From: Nikolay Aleksandrov @ 2019-07-30 12:20 UTC (permalink / raw)
  To: netdev; +Cc: davem, roopa, bridge, Nikolay Aleksandrov

In user-space there's no way to distinguish why an mdb entry was deleted
and that is a problem for daemons which would like to keep the mdb in
sync with remote ends (e.g. mlag) but would also like to converge faster.
In almost all cases we'd like to age-out the remote entry for performance
and convergence reasons except when fast-leave is enabled. In that case we
want explicit immediate remote delete, thus add mdb flag which is set only
when the entry is being deleted due to fast-leave.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
---
 include/uapi/linux/if_bridge.h | 1 +
 net/bridge/br_mdb.c            | 2 ++
 net/bridge/br_multicast.c      | 2 +-
 net/bridge/br_private.h        | 1 +
 4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index 773e476a8e54..1b3c2b643a02 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -237,6 +237,7 @@ struct br_mdb_entry {
 #define MDB_PERMANENT 1
 	__u8 state;
 #define MDB_FLAGS_OFFLOAD	(1 << 0)
+#define MDB_FLAGS_FAST_LEAVE	(1 << 1)
 	__u8 flags;
 	__u16 vid;
 	struct {
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index bf6acd34234d..428af1abf8cc 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -60,6 +60,8 @@ static void __mdb_entry_fill_flags(struct br_mdb_entry *e, unsigned char flags)
 	e->flags = 0;
 	if (flags & MDB_PG_FLAGS_OFFLOAD)
 		e->flags |= MDB_FLAGS_OFFLOAD;
+	if (flags & MDB_PG_FLAGS_FAST_LEAVE)
+		e->flags |= MDB_FLAGS_FAST_LEAVE;
 }
 
 static void __mdb_entry_to_br_ip(struct br_mdb_entry *entry, struct br_ip *ip)
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 3d8deac2353d..3d4b2817687f 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -1393,7 +1393,7 @@ br_multicast_leave_group(struct net_bridge *br,
 			del_timer(&p->timer);
 			kfree_rcu(p, rcu);
 			br_mdb_notify(br->dev, port, group, RTM_DELMDB,
-				      p->flags);
+				      p->flags | MDB_PG_FLAGS_FAST_LEAVE);
 
 			if (!mp->ports && !mp->host_joined &&
 			    netif_running(br->dev))
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index e8cf03b43b7d..c4fd307fbfdc 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -199,6 +199,7 @@ struct net_bridge_fdb_entry {
 
 #define MDB_PG_FLAGS_PERMANENT	BIT(0)
 #define MDB_PG_FLAGS_OFFLOAD	BIT(1)
+#define MDB_PG_FLAGS_FAST_LEAVE	BIT(2)
 
 struct net_bridge_port_group {
 	struct net_bridge_port		*port;
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH][net-next][V2] mlxsw: spectrum_ptp: fix duplicated check on orig_egr_types
From: Colin Ian King @ 2019-07-30 12:22 UTC (permalink / raw)
  To: Petr Machata, Jiri Pirko, Ido Schimmel, David S . Miller, netdev
  Cc: kernel-janitors, linux-kernel
In-Reply-To: <20190730114752.24133-1-colin.king@canonical.com>

As pointed out by Petr, this should be [net] and not [net-next]


On 30/07/2019 12:47, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> Currently are duplicated checks on orig_egr_types which are
> redundant, I believe this is a typo and should actually be
> orig_ing_types || orig_egr_types instead of the expression
> orig_egr_types || orig_egr_types.  Fix these.
> 
> Addresses-Coverity: ("Same on both sides")
> Fixes: c6b36bdd04b5 ("mlxsw: spectrum_ptp: Increase parsing depth when PTP is enabled")
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>  drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
> index 98c5ba3200bc..63b07edd9d81 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c
> @@ -999,14 +999,14 @@ static int mlxsw_sp1_ptp_mtpppc_update(struct mlxsw_sp_port *mlxsw_sp_port,
>  		}
>  	}
>  
> -	if ((ing_types || egr_types) && !(orig_egr_types || orig_egr_types)) {
> +	if ((ing_types || egr_types) && !(orig_ing_types || orig_egr_types)) {
>  		err = mlxsw_sp_nve_inc_parsing_depth_get(mlxsw_sp);
>  		if (err) {
>  			netdev_err(mlxsw_sp_port->dev, "Failed to increase parsing depth");
>  			return err;
>  		}
>  	}
> -	if (!(ing_types || egr_types) && (orig_egr_types || orig_egr_types))
> +	if (!(ing_types || egr_types) && (orig_ing_types || orig_egr_types))
>  		mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp);
>  
>  	return mlxsw_sp1_ptp_mtpppc_set(mlxsw_sp_port->mlxsw_sp,
> 
> --
> 
> V2: fix both occurances of this typo. Thanks to Petr Machata for spotting
>     the 2nd occurrance
> 


^ permalink raw reply

* [PATCH] bridge:fragmented packets dropped by bridge
From: Rundong Ge @ 2019-07-30 12:25 UTC (permalink / raw)
  To: davem
  Cc: kuznet, yoshfuji, netdev, pablo, kadlec, fw, roopa,
	netfilter-devel, coreteam, bridge, nikolay, linux-kernel,
	rdong.ge

Given following setup:
-modprobe br_netfilter
-echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
-brctl addbr br0
-brctl addif br0 enp2s0
-brctl addif br0 enp3s0
-brctl addif br0 enp6s0
-ifconfig enp2s0 mtu 1300
-ifconfig enp3s0 mtu 1500
-ifconfig enp6s0 mtu 1500
-ifconfig br0 up

                 multi-port
mtu1500 - mtu1500|bridge|1500 - mtu1500
  A                  |            B
                   mtu1300

With netfilter defragmentation/conntrack enabled, fragmented
packets from A will be defragmented in prerouting, and refragmented
at postrouting.
But in this scenario the bridge found the frag_max_size(1500) is
larger than the dst mtu stored in the fake_rtable whitch is
always equal to the bridge's mtu 1300, then packets will be dopped.

This modifies ip_skb_dst_mtu to use the out dev's mtu instead
of bridge's mtu in bridge refragment.

Signed-off-by: Rundong Ge <rdong.ge@gmail.com>
---
 include/net/ip.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/net/ip.h b/include/net/ip.h
index 29d89de..0512de3 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -450,6 +450,8 @@ static inline unsigned int ip_dst_mtu_maybe_forward(const struct dst_entry *dst,
 static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
 					  const struct sk_buff *skb)
 {
+	if ((skb_dst(skb)->flags & DST_FAKE_RTABLE) && skb->dev)
+		return min(skb->dev->mtu, IP_MAX_MTU);
 	if (!sk || !sk_fullsock(sk) || ip_sk_use_pmtu(sk)) {
 		bool forwarding = IPCB(skb)->flags & IPSKB_FORWARDED;
 
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH] net: bridge: Allow bridge to joing multicast groups
From: Allan W. Nielsen @ 2019-07-30 12:19 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Horatiu Vultur, roopa, davem, bridge, netdev,
	linux-kernel
In-Reply-To: <20190730100416.GA13250@splinter>

The 07/30/2019 13:04, Ido Schimmel wrote:
> On Tue, Jul 30, 2019 at 10:30:28AM +0200, Allan W. Nielsen wrote:
> > The 07/30/2019 10:06, Ido Schimmel wrote:
> > > As a bonus, existing drivers could benefit from it, as MDB entries are already
> > > notified by MAC.
> > Not sure I follow. When FDB entries are added, it also generates notification
> > events.
> 
> I meant the switchdev notification sent to drivers:
> 
> /* SWITCHDEV_OBJ_ID_PORT_MDB */
> struct switchdev_obj_port_mdb {
> 	struct switchdev_obj obj;
> 	unsigned char addr[ETH_ALEN];
> 	u16 vid;
> };
> 
> By extending MDB entries to also be keyed by MAC you basically get a lot
> of things for free without duplicating the same code for multicast FDBs.
Agree, this should be the same.

> AFAICS, then only change in the fast path is in br_mdb_get() where you
> need to use DMAC as key in case Ethertype is not IPv4/IPv6.
That would be nice.

-- 
/Allan

^ permalink raw reply

* Re: [PATCH] bridge:fragmented packets dropped by bridge
From: Florian Westphal @ 2019-07-30 12:35 UTC (permalink / raw)
  To: Rundong Ge
  Cc: davem, kuznet, yoshfuji, netdev, pablo, kadlec, fw, roopa,
	netfilter-devel, coreteam, bridge, nikolay, linux-kernel
In-Reply-To: <20190730122534.30687-1-rdong.ge@gmail.com>

Rundong Ge <rdong.ge@gmail.com> wrote:
> Given following setup:
> -modprobe br_netfilter
> -echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
> -brctl addbr br0
> -brctl addif br0 enp2s0
> -brctl addif br0 enp3s0
> -brctl addif br0 enp6s0
> -ifconfig enp2s0 mtu 1300
> -ifconfig enp3s0 mtu 1500
> -ifconfig enp6s0 mtu 1500
> -ifconfig br0 up
> 
>                  multi-port
> mtu1500 - mtu1500|bridge|1500 - mtu1500
>   A                  |            B
>                    mtu1300

How can a bridge forward a frame from A/B to mtu1300?

> With netfilter defragmentation/conntrack enabled, fragmented
> packets from A will be defragmented in prerouting, and refragmented
> at postrouting.

Yes, but I don't see how that relates to the problem at hand.

> But in this scenario the bridge found the frag_max_size(1500) is
> larger than the dst mtu stored in the fake_rtable whitch is
> always equal to the bridge's mtu 1300, then packets will be dopped.

What happens without netfilter or non-fragmented packets?

> This modifies ip_skb_dst_mtu to use the out dev's mtu instead
> of bridge's mtu in bridge refragment.

It seems quite a hack?  The above setup should use a router, not a bridge.

^ permalink raw reply

* [PATCHv2 net-next 0/5] sctp: clean up __sctp_connect function
From: Xin Long @ 2019-07-30 12:38 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem

This patchset is to factor out some common code for
sctp_sendmsg_new_asoc() and __sctp_connect() into 2
new functioins.

v1->v2:
  - add the patch 1/5 to avoid a slab-out-of-bounds warning.
  - add some code comment for the check change in patch 2/5.
  - remove unused 'addrcnt' as Marcelo noticed in patch 3/5.

Xin Long (5):
  sctp: only copy the available addr data in sctp_transport_init
  sctp: check addr_size with sa_family_t size in
    __sctp_setsockopt_connectx
  sctp: clean up __sctp_connect
  sctp: factor out sctp_connect_new_asoc
  sctp: factor out sctp_connect_add_peer

 net/sctp/socket.c    | 376 ++++++++++++++++++++-------------------------------
 net/sctp/transport.c |   2 +-
 2 files changed, 147 insertions(+), 231 deletions(-)

-- 
2.1.0


^ permalink raw reply

* [PATCHv2 net-next 1/5] sctp: only copy the available addr data in sctp_transport_init
From: Xin Long @ 2019-07-30 12:38 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem
In-Reply-To: <cover.1564490276.git.lucien.xin@gmail.com>

'addr' passed to sctp_transport_init is not always a whole size
of union sctp_addr, like the path:

  sctp_sendmsg() ->
  sctp_sendmsg_new_asoc() ->
  sctp_assoc_add_peer() ->
  sctp_transport_new() -> sctp_transport_init()

In the next patches, we will also pass the address length of data
only to sctp_assoc_add_peer().

So sctp_transport_init() should copy the only available data from
addr to peer->ipaddr, instead of 'peer->ipaddr = *addr' which may
cause slab-out-of-bounds.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/transport.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index e2f8e36..7235a60 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -43,8 +43,8 @@ static struct sctp_transport *sctp_transport_init(struct net *net,
 						  gfp_t gfp)
 {
 	/* Copy in the address.  */
-	peer->ipaddr = *addr;
 	peer->af_specific = sctp_get_af_specific(addr->sa.sa_family);
+	memcpy(&peer->ipaddr, addr, peer->af_specific->sockaddr_len);
 	memset(&peer->saddr, 0, sizeof(union sctp_addr));
 
 	peer->sack_generation = 0;
-- 
2.1.0


^ permalink raw reply related

* [PATCHv2 net-next 2/5] sctp: check addr_size with sa_family_t size in __sctp_setsockopt_connectx
From: Xin Long @ 2019-07-30 12:38 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem
In-Reply-To: <cover.1564490276.git.lucien.xin@gmail.com>

Now __sctp_connect() is called by __sctp_setsockopt_connectx() and
sctp_inet_connect(), the latter has done addr_size check with size
of sa_family_t.

In the next patch to clean up __sctp_connect(), we will remove
addr_size check with size of sa_family_t from __sctp_connect()
for the 1st address.

So before doing that, __sctp_setsockopt_connectx() should do
this check first, as sctp_inet_connect() does.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/socket.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index aa80cda..e9c5b39 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1311,7 +1311,8 @@ static int __sctp_setsockopt_connectx(struct sock *sk,
 	pr_debug("%s: sk:%p addrs:%p addrs_size:%d\n",
 		 __func__, sk, addrs, addrs_size);
 
-	if (unlikely(addrs_size <= 0))
+	/* make sure the 1st addr's sa_family is accessible later */
+	if (unlikely(addrs_size < sizeof(sa_family_t)))
 		return -EINVAL;
 
 	kaddrs = memdup_user(addrs, addrs_size);
-- 
2.1.0


^ permalink raw reply related

* [PATCHv2 net-next 3/5] sctp: clean up __sctp_connect
From: Xin Long @ 2019-07-30 12:38 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: Marcelo Ricardo Leitner, Neil Horman, davem
In-Reply-To: <cover.1564490276.git.lucien.xin@gmail.com>

__sctp_connect is doing quit similar things as sctp_sendmsg_new_asoc.
To factor out common functions, this patch is to clean up their code
to make them look more similar:

  1. create the asoc and add a peer with the 1st addr.
  2. add peers with the other addrs into this asoc one by one.

while at it, also remove the unused 'addrcnt'.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/socket.c | 209 +++++++++++++++++++-----------------------------------
 1 file changed, 73 insertions(+), 136 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index e9c5b39..b9804e5 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1049,153 +1049,105 @@ static int sctp_setsockopt_bindx(struct sock *sk,
  * Common routine for handling connect() and sctp_connectx().
  * Connect will come in with just a single address.
  */
-static int __sctp_connect(struct sock *sk,
-			  struct sockaddr *kaddrs,
-			  int addrs_size, int flags,
-			  sctp_assoc_t *assoc_id)
+static int __sctp_connect(struct sock *sk, struct sockaddr *kaddrs,
+			  int addrs_size, int flags, sctp_assoc_t *assoc_id)
 {
-	struct net *net = sock_net(sk);
-	struct sctp_sock *sp;
-	struct sctp_endpoint *ep;
-	struct sctp_association *asoc = NULL;
-	struct sctp_association *asoc2;
+	struct sctp_association *old, *asoc;
+	struct sctp_sock *sp = sctp_sk(sk);
+	struct sctp_endpoint *ep = sp->ep;
 	struct sctp_transport *transport;
-	union sctp_addr to;
+	struct net *net = sock_net(sk);
+	void *addr_buf = kaddrs;
+	union sctp_addr *daddr;
 	enum sctp_scope scope;
+	struct sctp_af *af;
+	int walk_size, err;
 	long timeo;
-	int err = 0;
-	int addrcnt = 0;
-	int walk_size = 0;
-	union sctp_addr *sa_addr = NULL;
-	void *addr_buf;
-	unsigned short port;
 
-	sp = sctp_sk(sk);
-	ep = sp->ep;
-
-	/* connect() cannot be done on a socket that is already in ESTABLISHED
-	 * state - UDP-style peeled off socket or a TCP-style socket that
-	 * is already connected.
-	 * It cannot be done even on a TCP-style listening socket.
-	 */
 	if (sctp_sstate(sk, ESTABLISHED) || sctp_sstate(sk, CLOSING) ||
-	    (sctp_style(sk, TCP) && sctp_sstate(sk, LISTENING))) {
-		err = -EISCONN;
-		goto out_free;
+	    (sctp_style(sk, TCP) && sctp_sstate(sk, LISTENING)))
+		return -EISCONN;
+
+	daddr = addr_buf;
+	af = sctp_get_af_specific(daddr->sa.sa_family);
+	if (!af || af->sockaddr_len > addrs_size)
+		return -EINVAL;
+
+	err = sctp_verify_addr(sk, daddr, af->sockaddr_len);
+	if (err)
+		return err;
+
+	asoc = sctp_endpoint_lookup_assoc(ep, daddr, &transport);
+	if (asoc)
+		return asoc->state >= SCTP_STATE_ESTABLISHED ? -EISCONN
+							     : -EALREADY;
+
+	if (sctp_endpoint_is_peeled_off(ep, daddr))
+		return -EADDRNOTAVAIL;
+
+	if (!ep->base.bind_addr.port) {
+		if (sctp_autobind(sk))
+			return -EAGAIN;
+	} else {
+		if (ep->base.bind_addr.port < inet_prot_sock(net) &&
+		    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
+			return -EACCES;
 	}
 
-	/* Walk through the addrs buffer and count the number of addresses. */
-	addr_buf = kaddrs;
-	while (walk_size < addrs_size) {
-		struct sctp_af *af;
+	scope = sctp_scope(daddr);
+	asoc = sctp_association_new(ep, sk, scope, GFP_KERNEL);
+	if (!asoc)
+		return -ENOMEM;
 
-		if (walk_size + sizeof(sa_family_t) > addrs_size) {
-			err = -EINVAL;
-			goto out_free;
-		}
+	err = sctp_assoc_set_bind_addr_from_ep(asoc, scope, GFP_KERNEL);
+	if (err < 0)
+		goto out_free;
 
-		sa_addr = addr_buf;
-		af = sctp_get_af_specific(sa_addr->sa.sa_family);
+	transport = sctp_assoc_add_peer(asoc, daddr, GFP_KERNEL, SCTP_UNKNOWN);
+	if (!transport) {
+		err = -ENOMEM;
+		goto out_free;
+	}
 
-		/* If the address family is not supported or if this address
-		 * causes the address buffer to overflow return EINVAL.
-		 */
-		if (!af || (walk_size + af->sockaddr_len) > addrs_size) {
-			err = -EINVAL;
+	addr_buf += af->sockaddr_len;
+	walk_size = af->sockaddr_len;
+	while (walk_size < addrs_size) {
+		err = -EINVAL;
+		if (walk_size + sizeof(sa_family_t) > addrs_size)
 			goto out_free;
-		}
 
-		port = ntohs(sa_addr->v4.sin_port);
-
-		/* Save current address so we can work with it */
-		memcpy(&to, sa_addr, af->sockaddr_len);
+		daddr = addr_buf;
+		af = sctp_get_af_specific(daddr->sa.sa_family);
+		if (!af || af->sockaddr_len + walk_size > addrs_size)
+			goto out_free;
 
-		err = sctp_verify_addr(sk, &to, af->sockaddr_len);
-		if (err)
+		if (asoc->peer.port != ntohs(daddr->v4.sin_port))
 			goto out_free;
 
-		/* Make sure the destination port is correctly set
-		 * in all addresses.
-		 */
-		if (asoc && asoc->peer.port && asoc->peer.port != port) {
-			err = -EINVAL;
+		err = sctp_verify_addr(sk, daddr, af->sockaddr_len);
+		if (err)
 			goto out_free;
-		}
 
-		/* Check if there already is a matching association on the
-		 * endpoint (other than the one created here).
-		 */
-		asoc2 = sctp_endpoint_lookup_assoc(ep, &to, &transport);
-		if (asoc2 && asoc2 != asoc) {
-			if (asoc2->state >= SCTP_STATE_ESTABLISHED)
-				err = -EISCONN;
-			else
-				err = -EALREADY;
+		old = sctp_endpoint_lookup_assoc(ep, daddr, &transport);
+		if (old && old != asoc) {
+			err = old->state >= SCTP_STATE_ESTABLISHED ? -EISCONN
+								   : -EALREADY;
 			goto out_free;
 		}
 
-		/* If we could not find a matching association on the endpoint,
-		 * make sure that there is no peeled-off association matching
-		 * the peer address even on another socket.
-		 */
-		if (sctp_endpoint_is_peeled_off(ep, &to)) {
+		if (sctp_endpoint_is_peeled_off(ep, daddr)) {
 			err = -EADDRNOTAVAIL;
 			goto out_free;
 		}
 
-		if (!asoc) {
-			/* If a bind() or sctp_bindx() is not called prior to
-			 * an sctp_connectx() call, the system picks an
-			 * ephemeral port and will choose an address set
-			 * equivalent to binding with a wildcard address.
-			 */
-			if (!ep->base.bind_addr.port) {
-				if (sctp_autobind(sk)) {
-					err = -EAGAIN;
-					goto out_free;
-				}
-			} else {
-				/*
-				 * If an unprivileged user inherits a 1-many
-				 * style socket with open associations on a
-				 * privileged port, it MAY be permitted to
-				 * accept new associations, but it SHOULD NOT
-				 * be permitted to open new associations.
-				 */
-				if (ep->base.bind_addr.port <
-				    inet_prot_sock(net) &&
-				    !ns_capable(net->user_ns,
-				    CAP_NET_BIND_SERVICE)) {
-					err = -EACCES;
-					goto out_free;
-				}
-			}
-
-			scope = sctp_scope(&to);
-			asoc = sctp_association_new(ep, sk, scope, GFP_KERNEL);
-			if (!asoc) {
-				err = -ENOMEM;
-				goto out_free;
-			}
-
-			err = sctp_assoc_set_bind_addr_from_ep(asoc, scope,
-							      GFP_KERNEL);
-			if (err < 0) {
-				goto out_free;
-			}
-
-		}
-
-		/* Prime the peer's transport structures.  */
-		transport = sctp_assoc_add_peer(asoc, &to, GFP_KERNEL,
+		transport = sctp_assoc_add_peer(asoc, daddr, GFP_KERNEL,
 						SCTP_UNKNOWN);
 		if (!transport) {
 			err = -ENOMEM;
 			goto out_free;
 		}
 
-		addrcnt++;
-		addr_buf += af->sockaddr_len;
+		addr_buf  += af->sockaddr_len;
 		walk_size += af->sockaddr_len;
 	}
 
@@ -1209,39 +1161,24 @@ static int __sctp_connect(struct sock *sk,
 	}
 
 	err = sctp_primitive_ASSOCIATE(net, asoc, NULL);
-	if (err < 0) {
+	if (err < 0)
 		goto out_free;
-	}
 
 	/* Initialize sk's dport and daddr for getpeername() */
 	inet_sk(sk)->inet_dport = htons(asoc->peer.port);
-	sp->pf->to_sk_daddr(sa_addr, sk);
+	sp->pf->to_sk_daddr(daddr, sk);
 	sk->sk_err = 0;
 
-	timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
-
 	if (assoc_id)
 		*assoc_id = asoc->assoc_id;
 
-	err = sctp_wait_for_connect(asoc, &timeo);
-	/* Note: the asoc may be freed after the return of
-	 * sctp_wait_for_connect.
-	 */
-
-	/* Don't free association on exit. */
-	asoc = NULL;
+	timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
+	return sctp_wait_for_connect(asoc, &timeo);
 
 out_free:
 	pr_debug("%s: took out_free path with asoc:%p kaddrs:%p err:%d\n",
 		 __func__, asoc, kaddrs, err);
-
-	if (asoc) {
-		/* sctp_primitive_ASSOCIATE may have added this association
-		 * To the hash table, try to unhash it, just in case, its a noop
-		 * if it wasn't hashed so we're safe
-		 */
-		sctp_association_free(asoc);
-	}
+	sctp_association_free(asoc);
 	return err;
 }
 
-- 
2.1.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox