Netdev List

* Re: [net-next PATCH v1 01/11] net: flow_table: create interface for hw match/action tables
From: John Fastabend @ 2015-01-06  1:19 UTC (permalink / raw)
  To: Simon Horman; +Cc: Thomas Graf, sfeldma, jiri, jhs, netdev, davem, andy
In-Reply-To: <20150106010942.GD14077@vergenet.net>

On 01/05/2015 05:09 PM, Simon Horman wrote:
> On Mon, Jan 05, 2015 at 04:45:50PM -0800, John Fastabend wrote:
>> [...]
>>
>>>>> +/**
>>>>> + * @struct net_flow_field_ref
>>>>> + * @brief uniquely identify field as header:field tuple
>>>>> + */
>>>>> +struct net_flow_field_ref {
>>>>> +    int instance;
>>>>> +    int header;
>>>>> +    int field;
>>>>> +    int mask_type;
>>>>> +    int type;
>>>>> +    union {    /* Are these all the required data types */
>>>>> +        __u8 value_u8;
>>>>> +        __u16 value_u16;
>>>>> +        __u32 value_u32;
>>>>> +        __u64 value_u64;
>>>>> +    };
>>>>> +    union {    /* Are these all the required data types */
>>>>> +        __u8 mask_u8;
>>>>> +        __u16 mask_u16;
>>>>> +        __u32 mask_u32;
>>>>> +        __u64 mask_u64;
>>>>> +    };
>>>>> +};
>>>>
>>>> Does it make sense to write this as follows?
>>>
>>> Yes. I'll make this update; it helps make it clear that value/mask
>>> pairs are needed.
>>>
>>>>
>>>> union {
>>>>          struct {
>>>>                  __u8 value_u8;
>>>>                  __u8 mask_u8;
>>>>          };
>>>>          struct {
>>>>                  __u16 value_u16;
>>>>                  __u16 mask_u16;
>>>>          };
>>>>          ...
>>>> };
>>
>> Another thought is to pull this entirely out of the structure and hide
>> it from the UAPI so we can add more value/mask types as needed without
>> having to spin versions of net_flow_field_ref. On the other hand I've
>> been able to fit all my fields in these types so far and I can't think
>> of any additions we need at the moment.
>
> FWIW, I think it would be cleaner to break both field_ref and action_args
> out into attributes and not expose the structures to user-space. But
> perhaps there is an advantage to dealing with structures directly that
> I am missing.
>

I came to the same conclusion just now as well. I'm reworking it now
for v2.

-- 
John Fastabend         Intel Corporation
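
To illustrate the value/mask pairing discussed in this thread, here is a
minimal user-space sketch. It is not the code from the patch: the names
field_width, field_ref_sketch and field_ref_matches() are hypothetical, and
<stdint.h> types stand in for the kernel's __u8..__u64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical width tags; the patch carries its own 'type' field. */
enum field_width { FIELD_U8, FIELD_U16, FIELD_U32, FIELD_U64 };

/* Sketch only: every union member pairs a value with its mask,
 * so a value can never be set without a mask of the same width. */
struct field_ref_sketch {
	enum field_width type;
	union {
		struct { uint8_t  value_u8;  uint8_t  mask_u8;  };
		struct { uint16_t value_u16; uint16_t mask_u16; };
		struct { uint32_t value_u32; uint32_t mask_u32; };
		struct { uint64_t value_u64; uint64_t mask_u64; };
	};
};

/* Return 1 if 'pkt' equals the stored value on every bit set in the mask. */
static int field_ref_matches(const struct field_ref_sketch *ref, uint64_t pkt)
{
	assert(ref);

	switch (ref->type) {
	case FIELD_U8:
		return ((uint8_t)pkt & ref->mask_u8) ==
		       (ref->value_u8 & ref->mask_u8);
	case FIELD_U16:
		return ((uint16_t)pkt & ref->mask_u16) ==
		       (ref->value_u16 & ref->mask_u16);
	case FIELD_U32:
		return ((uint32_t)pkt & ref->mask_u32) ==
		       (ref->value_u32 & ref->mask_u32);
	case FIELD_U64:
		return (pkt & ref->mask_u64) ==
		       (ref->value_u64 & ref->mask_u64);
	}
	return 0;
}
```

With type FIELD_U16, value 0x0800 and mask 0xff00, an input of 0x08ab
matches while 0x0900 does not. Moving these fields into netlink attributes,
as suggested above, would remove the fixed union from the UAPI altogether.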


* Re: [PATCH] crypto: aesni-intel - avoid IPsec re-ordering
From: Sunderam K @ 2015-01-06  1:05 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20141120080205.GA29710@gondor.apana.org.au>

Herbert Xu <herbert <at> gondor.apana.org.au> writes:

> 
> On Thu, Nov 20, 2014 at 08:59:44AM +0100, Steffen Klassert wrote:
> >
> > Sure, but could be an option if this is really a rare case.
> 
> Well it's rare but when it does hit it'll probably be there all
> the time for that system.  IOW you either have no apps using the
> FPU, but when you do, it's probably going to be hogging it.
> 
> > Anyway, I don't mind too much about the solution as long as we
> > get it to work :)
> 
> :)
> 
> Cheers,


Hi Herbert

Is there an official "blessed" patch for this re-ordering problem?
I saw some issues raised in the thread about the patch that Ming Liu had
provided.

Thanks
-sunderam


* Re: [net-next PATCH v1 01/11] net: flow_table: create interface for hw match/action tables
From: Simon Horman @ 2015-01-06  1:09 UTC (permalink / raw)
  To: John Fastabend; +Cc: Thomas Graf, sfeldma, jiri, jhs, netdev, davem, andy
In-Reply-To: <54AB303E.3000601@gmail.com>

On Mon, Jan 05, 2015 at 04:45:50PM -0800, John Fastabend wrote:
> [...]
> 
> >>>+/**
> >>>+ * @struct net_flow_field_ref
> >>>+ * @brief uniquely identify field as header:field tuple
> >>>+ */
> >>>+struct net_flow_field_ref {
> >>>+    int instance;
> >>>+    int header;
> >>>+    int field;
> >>>+    int mask_type;
> >>>+    int type;
> >>>+    union {    /* Are these all the required data types */
> >>>+        __u8 value_u8;
> >>>+        __u16 value_u16;
> >>>+        __u32 value_u32;
> >>>+        __u64 value_u64;
> >>>+    };
> >>>+    union {    /* Are these all the required data types */
> >>>+        __u8 mask_u8;
> >>>+        __u16 mask_u16;
> >>>+        __u32 mask_u32;
> >>>+        __u64 mask_u64;
> >>>+    };
> >>>+};
> >>
> >>Does it make sense to write this as follows?
> >
> >Yes. I'll make this update; it helps make it clear that value/mask
> >pairs are needed.
> >
> >>
> >>union {
> >>         struct {
> >>                 __u8 value_u8;
> >>                 __u8 mask_u8;
> >>         };
> >>         struct {
> >>                 __u16 value_u16;
> >>                 __u16 mask_u16;
> >>         };
> >>         ...
> >>};
> 
> Another thought is to pull this entirely out of the structure and hide
> it from the UAPI so we can add more value/mask types as needed without
> having to spin versions of net_flow_field_ref. On the other hand I've
> been able to fit all my fields in these types so far and I can't think
> of any additions we need at the moment.

FWIW, I think it would be cleaner to break both field_ref and action_args
out into attributes and not expose the structures to user-space. But
perhaps there is an advantage to dealing with structures directly that
I am missing.


* Re: [PATCH/RFC rocker-net-next 6/6] net: flow: Limit checking of ndo_flow_{set,del}_flows
From: Simon Horman @ 2015-01-06  1:07 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev
In-Reply-To: <54AACA92.5030800@gmail.com>

On Mon, Jan 05, 2015 at 09:32:02AM -0800, John Fastabend wrote:
> On 01/04/2015 10:50 PM, Simon Horman wrote:
> >Only check for availability of ndo_flow_{set,del}_flows when
> >they are to be used.
> >
> 
> I went ahead and merged this, but I'm not sure: does it make
> sense to allow a user to add a flow that can't be deleted? Or
> delete a flow that wasn't ever added? I guess if the driver has
> a reason to do this it doesn't hurt to allow it and I think the
> code looks neater this way.
> 
> Also, thanks for the other fixes. I pulled the other 5 in as well;
> I'll re-submit the series after running some basic tests.

I don't have any strong opinions on this but it
sounds like policy that doesn't belong in flow_table.c.

> >Signed-off-by: Simon Horman <simon.horman@netronome.com>
> >---
> >  net/core/flow_table.c | 15 +++++++++++++--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> >
> >diff --git a/net/core/flow_table.c b/net/core/flow_table.c
> >index bfc984f..6d620d4 100644
> >--- a/net/core/flow_table.c
> >+++ b/net/core/flow_table.c
> >@@ -1206,9 +1206,20 @@ static int net_flow_table_cmd_flows(struct sk_buff *recv_skb,
> >  	if (!dev)
> >  		return -EINVAL;
> >
> >-	if (!dev->netdev_ops->ndo_flow_set_flows ||
> >-	    !dev->netdev_ops->ndo_flow_del_flows)
> >+	switch (cmd) {
> >+	case NET_FLOW_TABLE_CMD_SET_FLOWS:
> >+		if (!dev->netdev_ops->ndo_flow_set_flows)
> >+			goto out;
> >+		break;
> >+
> >+	case NET_FLOW_TABLE_CMD_DEL_FLOWS:
> >+		if (!dev->netdev_ops->ndo_flow_del_flows)
> >+			goto out;
> >+		break;
> >+
> >+	default:
> >  		goto out;
> >+	}
> >
> >  	if (!info->attrs[NET_FLOW_IDENTIFIER_TYPE] ||
> >  	    !info->attrs[NET_FLOW_IDENTIFIER] ||
> >
> 
> 
> -- 
> John Fastabend         Intel Corporation
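
The per-command check in the quoted diff can be sketched in user space as
follows. This is only an illustration: flow_cmd_sketch, flow_ops_sketch,
noop_flows() and flow_cmd_supported() are hypothetical names, and returning
-EOPNOTSUPP is an assumption (the quoted hunk only shows "goto out").

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the NET_FLOW_TABLE_CMD_* values. */
enum flow_cmd_sketch { CMD_SET_FLOWS, CMD_DEL_FLOWS, CMD_OTHER };

/* Hypothetical stand-in for the two ndo_flow_*_flows callbacks. */
struct flow_ops_sketch {
	int (*set_flows)(void);
	int (*del_flows)(void);
};

/* Placeholder callback so an ops table can be populated in examples. */
static int noop_flows(void)
{
	return 0;
}

/* Only require the callback that the requested command will actually use. */
static int flow_cmd_supported(const struct flow_ops_sketch *ops,
			      enum flow_cmd_sketch cmd)
{
	assert(ops);

	switch (cmd) {
	case CMD_SET_FLOWS:
		return ops->set_flows ? 0 : -EOPNOTSUPP;
	case CMD_DEL_FLOWS:
		return ops->del_flows ? 0 : -EOPNOTSUPP;
	default:
		return -EOPNOTSUPP;
	}
}
```

A driver that fills in only set_flows passes the check for CMD_SET_FLOWS and
fails it for CMD_DEL_FLOWS, which is the case questioned above: whether such
a driver makes sense is left to the driver rather than decided here.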


* Re: [PATCH/RFC rocker-net-next 2/6] net: flow: Handle error when putting a field while putting a flow
From: Simon Horman @ 2015-01-06  1:04 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev
In-Reply-To: <54AAC9AE.4010104@gmail.com>

On Mon, Jan 05, 2015 at 09:28:14AM -0800, John Fastabend wrote:
> On 01/04/2015 10:50 PM, Simon Horman wrote:
> >Signed-off-by: Simon Horman <simon.horman@netronome.com>
> >---
> >  net/core/flow_table.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> >diff --git a/net/core/flow_table.c b/net/core/flow_table.c
> >index 2af831e..753ebe0 100644
> >--- a/net/core/flow_table.c
> >+++ b/net/core/flow_table.c
> >@@ -981,8 +981,9 @@ done:
> >
> >  int net_flow_put_flow(struct sk_buff *skb, struct net_flow_flow *flow)
> >  {
> >-	struct nlattr *flows, *matches;
> >+	struct nlattr *flows;
> >  	struct nlattr *actions = NULL; /* must be null to unwind */
> >+	struct nlattr *matches = NULL; /* must be null to unwind */
> 
> Actually we don't need to initialize to NULL now. That was some crazy
> unwind scheme I had in place initially.
> 
> Now we only ever do a nla_nest_cancel on nested attributes that have
> been initialized with nla_nest_start(). So I can simplify this to
> 
> 	struct nlattr *flows, *matches, *actions;

Thanks, that does seem much nicer :)

> 
> >  	int err, j, i = 0;
> >
> >  	flows = nla_nest_start(skb, NET_FLOW_FLOW);
> >@@ -1005,7 +1006,11 @@ int net_flow_put_flow(struct sk_buff *skb, struct net_flow_flow *flow)
> >  			if (!f->header)
> >  				continue;
> >
> >-			nla_put(skb, NET_FLOW_FIELD_REF, sizeof(*f), f);
> >+			err = nla_put(skb, NET_FLOW_FIELD_REF, sizeof(*f), f);
> 
> Great thanks. I missed this one.
> 
> >+			if (err) {
> >+				nla_nest_cancel(skb, matches);
> >+				goto flows_put_failure;
> >+			}
> >  		}
> >  		nla_nest_end(skb, matches);
> >  	}
> >
> 
> I'll fold this into the series and resubmit thanks.
> 
> .John
> 
> -- 
> John Fastabend         Intel Corporation


* Re: [PATCH/RFC rocker-net-next 1/6] net: flow: Cancel innermost nested attribute first
From: Simon Horman @ 2015-01-06  1:03 UTC (permalink / raw)
  To: David Miller; +Cc: john.fastabend, netdev
In-Reply-To: <20150105.161725.1765207203472571760.davem@davemloft.net>

On Mon, Jan 05, 2015 at 04:17:25PM -0500, David Miller wrote:
> From: Simon Horman <simon.horman@netronome.com>
> Date: Mon,  5 Jan 2015 15:50:05 +0900
> 
> > Cancel innermost nested attribute first on error when putting flow actions.
> > 
> > Signed-off-by: Simon Horman <simon.horman@netronome.com>
> > 
> > ---
> > 
> > It's unclear to me if this makes any difference.
> > But it seems more logical to me.
> 
> Hmmm.  Be careful here.  nla_nest_cancel() is just rolling back
> the length of the SKB to right before the netlink attribute being
> given as the cancellation point.
> 
> So you really have to cancel attributes in exactly the reverse order
> in which they were added.  Otherwise we'll make a trim call with a
> negative adjustment that actually expands the SKB past an already
> cancelled attribute.

Thanks for clarifying that.

The aim of my patch is to perform the rollback in reverse order,
which I now know is required.
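
The ordering requirement can be shown with a toy model in user space. This
is a sketch under stated assumptions, not the netlink implementation:
msg_sketch, nest_start() and nest_cancel() are hypothetical, and only the
message length is tracked. Cancelling means trimming back to the remembered
start of the attribute, so an inner mark becomes invalid once the outer
attribute has been cancelled.

```c
#include <assert.h>
#include <stddef.h>

/* Toy message: only the current length matters for this illustration. */
struct msg_sketch {
	size_t len;
};

/* Begin a nested attribute: remember where it starts, then add a header. */
static size_t nest_start(struct msg_sketch *m, size_t hdr_len)
{
	size_t mark;

	assert(m);
	mark = m->len;
	m->len += hdr_len;
	return mark;
}

/* Trim the message back to 'mark'.
 * Returns -1 if 'mark' lies beyond the current end, which is what an
 * out-of-order cancel produces: trimming there would grow the message. */
static int nest_cancel(struct msg_sketch *m, size_t mark)
{
	assert(m);
	if (mark > m->len)
		return -1;
	m->len = mark;
	return 0;
}
```

Starting an outer and then an inner attribute and cancelling inner first,
then outer, restores the original length. Cancelling outer first leaves the
inner mark past the end of the message, and the second cancel is rejected.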


* Re: route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-06  0:56 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <20150105.195128.794605376092864881.davem@davemloft.net>

On Mon, Jan 5, 2015 at 4:51 PM, David Miller <davem@davemloft.net> wrote:
> From: Ani Sinha <ani@arista.com>
> Date: Mon, 5 Jan 2015 16:43:30 -0800
>
>> On Mon, Jan 5, 2015 at 4:36 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Ani Sinha <ani@arista.com>
>>> Date: Mon, 5 Jan 2015 15:48:11 -0800
>>>
>>>> I am looking at the code and it looks like since the route cache for
>>>> ipv4 was removed from the kernel, this sysctl parameter no longer
>>>> serves the same purpose. It does not look like it is even used in the
>>>> ipv4/route.c module. Is there an equivalent sysctl parameter limiting
>>>> the number of route entries in the kernel? Or is there now no
>>>> mechanism to limit the number of route entries?
>>>
>>> There is nothing to limit, since the cache was removed.
>>
>> Shouldn't the documentation be updated to reflect that? Also what's
>> the point of having a dummy variable that does nothing? Should we not
>> simply remove it?
>
> There is nothing to update, the behavior is completely transparent.
> Absolutely no cache entries exist, therefore the limit cannot be
> reached.

I disagree. You are advertising a feature in official documentation
that simply does not exist for ipv4. This is very confusing. If I had
not dug into the code, I wouldn't have known that this particular knob
has been a no-op ever since the route cache was removed.


>
> The sysctl is kept so that scripts reading it don't suddenly stop
> working.  We can't just remove sysctl values.
>


* [PATCH 1/1] bridge: remove BR_GROUPFWD_RESTRICTED for arbitrary forwarding of reserved addresses
From: Bernhard Thaler @ 2015-01-06  0:56 UTC (permalink / raw)
  To: stephen, davem; +Cc: netdev, bridge, Bernhard Thaler

BR_GROUPFWD_RESTRICTED bitmask restricts users from setting values to
/sys/class/net/brX/bridge/group_fwd_mask that allow forwarding of
some IEEE 802.1D Table 7-10 Reserved addresses:
	(MAC Control) 802.3		01-80-C2-00-00-01
	(Link Aggregation) 802.3	01-80-C2-00-00-02
	802.1AB LLDP			01-80-C2-00-00-0E
BR_GROUPFWD_RESTRICTED may have been set as an extra protection against
forwarding these control frames as forwarding 802.1X PAE (01-80-C2-00-00-03)
in 802.1X setups satisfies most common use-cases.
Other situations, such as placing a software-based bridge as a "TAP" between two
devices, may require forwarding e.g. LLDP frames while debugging network problems
or actively changing/filtering traffic with ebtables.

This patch allows setting e.g.:
	echo 65535 > /sys/class/net/brX/bridge/group_fwd_mask
which sets no restrictions on the forwardable reserved addresses.

- the default value 0 will still comply with 802.1D and not forward any
  reserved addresses
- values such as 8 for forwarding 802.1X related frames will behave the
  same way as with BR_GROUPFWD_RESTRICTED currently in place, so backward
  compatibility with current scripts using group_fwd_mask should be possible

Administrators and network engineers, however, will be able to arbitrarily
forward any reserved addresses without BR_GROUPFWD_RESTRICTED. This is
not standards-compliant behavior, but neither is forwarding any reserved
address in the first place. Users should be aware of this anyway and
know what they are doing and why when setting values such as 65535, 32768,
16384, 4, 2 for group_fwd_mask.

This patch was tested on a bridge with two interfaces created with bridge-utils.

Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
---
 net/bridge/br_input.c    |    8 ++++++--
 net/bridge/br_private.h  |    2 --
 net/bridge/br_sysfs_br.c |    3 ---
 3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 1f1de71..e44fe38 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -262,8 +262,12 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
 				goto forward;
 			break;
 
-		case 0x01:	/* IEEE MAC (Pause) */
-			goto drop;
+		case 0x01:      /* IEEE MAC (Pause) */
+			fwd_mask |= p->br->group_fwd_mask;
+			if (fwd_mask & (1u << dest[5]))
+				goto forward;
+			else
+				goto drop;
 
 		default:
 			/* Allow selective forwarding for most other protocols */
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index aea3d13..9b548754 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -33,8 +33,6 @@
 
 /* Control of forwarding link local multicast */
 #define BR_GROUPFWD_DEFAULT	0
-/* Don't allow forwarding control protocols like STP and LLDP */
-#define BR_GROUPFWD_RESTRICTED	0x4007u
 /* The Nearest Customer Bridge Group Address, 01-80-C2-00-00-[00,0B,0C,0D,0F] */
 #define BR_GROUPFWD_8021AD	0xB801u
 
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 4c97fc5..7f04d8b 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -171,9 +171,6 @@ static ssize_t group_fwd_mask_store(struct device *d,
 	if (endp == buf)
 		return -EINVAL;
 
-	if (val & BR_GROUPFWD_RESTRICTED)
-		return -EINVAL;
-
 	br->group_fwd_mask = val;
 
 	return len;
-- 
1.7.10.4
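
The relation between group_fwd_mask values and reserved addresses can be
checked with a small user-space sketch. The bit test mirrors the
"fwd_mask & (1u << dest[5])" expression in the diff above; the names
RESTRICTED_SKETCH and group_fwd_allowed() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Value of BR_GROUPFWD_RESTRICTED as removed by the patch:
 * bits 0, 1, 2 and 14 are set. */
#define RESTRICTED_SKETCH 0x4007u

/* Return 1 if a frame addressed to 01-80-C2-00-00-<last_octet> would be
 * forwarded under 'mask'. Bit N of the mask corresponds to last octet N,
 * so only 0x00..0x0F are meaningful here. */
static int group_fwd_allowed(uint16_t mask, uint8_t last_octet)
{
	assert(last_octet <= 0x0f);
	return (mask >> last_octet) & 1u;
}
```

A mask of 8 sets bit 3 and forwards 802.1X PAE (01-80-C2-00-00-03) only.
A mask of 16384 sets bit 14 and forwards LLDP (01-80-C2-00-00-0E), which
overlaps RESTRICTED_SKETCH and was therefore rejected before this patch.
65535 sets all sixteen bits.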


* Re: route/max_size sysctl in ipv4
From: David Miller @ 2015-01-06  0:51 UTC (permalink / raw)
  To: ani; +Cc: netdev
In-Reply-To: <CAOxq_8OtQ0WD-PKBd4kX0JwZiaStqJ-cQiQRhWnTAK9Q0JquxA@mail.gmail.com>

From: Ani Sinha <ani@arista.com>
Date: Mon, 5 Jan 2015 16:43:30 -0800

> On Mon, Jan 5, 2015 at 4:36 PM, David Miller <davem@davemloft.net> wrote:
>> From: Ani Sinha <ani@arista.com>
>> Date: Mon, 5 Jan 2015 15:48:11 -0800
>>
>>> I am looking at the code and it looks like since the route cache for
>>> ipv4 was removed from the kernel, this sysctl parameter no longer
>>> serves the same purpose. It does not look like it is even used in the
>>> ipv4/route.c module. Is there an equivalent sysctl parameter limiting
>>> the number of route entries in the kernel? Or is there now no
>>> mechanism to limit the number of route entries?
>>
>> There is nothing to limit, since the cache was removed.
> 
> Shouldn't the documentation be updated to reflect that? Also what's
> the point of having a dummy variable that does nothing? Should we not
> simply remove it?

There is nothing to update, the behavior is completely transparent.
Absolutely no cache entries exist, therefore the limit cannot be
reached.

The sysctl is kept so that scripts reading it don't suddenly stop
working.  We can't just remove sysctl values.


* Re: [net-next PATCH v1 01/11] net: flow_table: create interface for hw match/action tables
From: John Fastabend @ 2015-01-06  0:45 UTC (permalink / raw)
  To: Thomas Graf; +Cc: sfeldma, jiri, jhs, simon.horman, netdev, davem, andy
In-Reply-To: <54AADEFF.3090306@gmail.com>

[...]

>>> +/**
>>> + * @struct net_flow_field_ref
>>> + * @brief uniquely identify field as header:field tuple
>>> + */
>>> +struct net_flow_field_ref {
>>> +    int instance;
>>> +    int header;
>>> +    int field;
>>> +    int mask_type;
>>> +    int type;
>>> +    union {    /* Are these all the required data types */
>>> +        __u8 value_u8;
>>> +        __u16 value_u16;
>>> +        __u32 value_u32;
>>> +        __u64 value_u64;
>>> +    };
>>> +    union {    /* Are these all the required data types */
>>> +        __u8 mask_u8;
>>> +        __u16 mask_u16;
>>> +        __u32 mask_u32;
>>> +        __u64 mask_u64;
>>> +    };
>>> +};
>>
>> Does it make sense to write this as follows?
>
> Yes. I'll make this update; it helps make it clear that value/mask
> pairs are needed.
>
>>
>> union {
>>          struct {
>>                  __u8 value_u8;
>>                  __u8 mask_u8;
>>          };
>>          struct {
>>                  __u16 value_u16;
>>                  __u16 mask_u16;
>>          };
>>          ...
>> };

Another thought is to pull this entirely out of the structure and hide
it from the UAPI so we can add more value/mask types as needed without
having to spin versions of net_flow_field_ref. On the other hand I've
been able to fit all my fields in these types so far and I can't think
of any additions we need at the moment.

>>
>>> +#define NET_FLOW_TABLE_EGRESS_ROOT 1
>>> +#define    NET_FLOW_TABLE_INGRESS_ROOT 2
>>
>> Tab/space mix.
>>
>

[...]


-- 
John Fastabend         Intel Corporation


* Re: route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-06  0:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <20150105.193614.1827024424476781168.davem@davemloft.net>

On Mon, Jan 5, 2015 at 4:36 PM, David Miller <davem@davemloft.net> wrote:
> From: Ani Sinha <ani@arista.com>
> Date: Mon, 5 Jan 2015 15:48:11 -0800
>
>> I am looking at the code and it looks like since the route cache for
>> ipv4 was removed from the kernel, this sysctl parameter no longer
>> serves the same purpose. It does not look like it is even used in the
>> ipv4/route.c module. Is there an equivalent sysctl parameter limiting
>> the number of route entries in the kernel? Or is there now no
>> mechanism to limit the number of route entries?
>
> There is nothing to limit, since the cache was removed.

Shouldn't the documentation be updated to reflect that? Also what's
the point of having a dummy variable that does nothing? Should we not
simply remove it?


* Re: route/max_size sysctl in ipv4
From: David Miller @ 2015-01-06  0:36 UTC (permalink / raw)
  To: ani; +Cc: netdev
In-Reply-To: <CAOxq_8NdxBScZ182bXs5KmTZLKRMeF3AVt+DJNFUff42gaZG4w@mail.gmail.com>

From: Ani Sinha <ani@arista.com>
Date: Mon, 5 Jan 2015 15:48:11 -0800

> I am looking at the code and it looks like since the route cache for
> ipv4 was removed from the kernel, this sysctl parameter no longer
> serves the same purpose. It does not look like it is even used in the
> ipv4/route.c module. Is there an equivalent sysctl parameter limiting
> the number of route entries in the kernel? Or is there now no
> mechanism to limit the number of route entries?

There is nothing to limit, since the cache was removed.


* [PATCH net-next] netlink: Warn on unordered or illegal nla_nest_cancel() or nlmsg_cancel()
From: Thomas Graf @ 2015-01-06  0:04 UTC (permalink / raw)
  To: davem; +Cc: netdev

Calling nla_nest_cancel() in a different order than the nesting was
built up can lead to negative offsets being calculated, which
results in skb_trim() being called with an underflowed unsigned
int. Warn if mark < skb->data as it's definitely a bug.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/net/netlink.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/netlink.h b/include/net/netlink.h
index 6415835..d5869b9 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -520,8 +520,10 @@ static inline void *nlmsg_get_pos(struct sk_buff *skb)
  */
 static inline void nlmsg_trim(struct sk_buff *skb, const void *mark)
 {
-	if (mark)
+	if (mark) {
+		WARN_ON((unsigned char *) mark < skb->data);
 		skb_trim(skb, (unsigned char *) mark - skb->data);
+	}
 }
 
 /**
-- 
1.9.3


* Re: [PATCH net-next v3 0/5]: ixgbevf: Allow querying VFs RSS indirection table and key
From: Greg Rose @ 2015-01-05 23:54 UTC (permalink / raw)
  To: Vlad Zolotarov; +Cc: netdev, gleb, avi, jeffrey.t.kirsher
In-Reply-To: <1420467311-6680-1-git-send-email-vladz@cloudius-systems.com>

On Mon, Jan 5, 2015 at 6:15 AM, Vlad Zolotarov
<vladz@cloudius-systems.com> wrote:
> Add the ethtool ops to VF driver to allow querying the RSS indirection table
> and RSS Random Key.
>
>  - PF driver: Add new VF-PF channel commands.
>  - VF driver: Utilize these new commands and add the corresponding
>               ethtool callbacks.
>
> New in v3:
>    - Added a missing support for x550 devices.
>    - Mask the indirection table values according to PSRTYPE[n].RQPL.
>    - Minimized the number of added VF-PF commands.
>
> New in v2:
>    - Added a detailed description to patches 4 and 5.
>
> New in v1 (compared to RFC):
>    - Use "if-else" statement instead of a "switch-case" for a single option case.
>      More specifically: in cases where the newly added API version is the only one
>      allowed. We may consider using a "switch-case" back again when the list of
>      allowed API versions in these specific places grows up.
>
> Vlad Zolotarov (5):
>   ixgbe: Add a RETA query command to VF-PF channel API
>   ixgbevf: Add a RETA query code
>   ixgbe: Add GET_RSS_KEY command to VF-PF channel commands set
>   ixgbevf: Add RSS Key query code
>   ixgbevf: Add the appropriate ethtool ops to query RSS indirection
>     table and key
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h      |  10 ++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c    |  91 +++++++++++++++
>  drivers/net/ethernet/intel/ixgbevf/ethtool.c      |  43 +++++++
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |   4 +-
>  drivers/net/ethernet/intel/ixgbevf/mbx.h          |  10 ++
>  drivers/net/ethernet/intel/ixgbevf/vf.c           | 132 ++++++++++++++++++++++
>  drivers/net/ethernet/intel/ixgbevf/vf.h           |   2 +
>  7 files changed, 291 insertions(+), 1 deletion(-)

I've given this code a review and I don't see a way to
set a policy in the PF driver as to whether this request should be
allowed or not.  We cannot enable this query by default - it is a
security risk. To make this acceptable you need to do a
couple of things.

A) Have the query disabled by default such that when a VF driver
requests the RSS info the request is denied.

B) Add hooks to allow system admins to set the policy in the PF driver
as to whether the RSS info requests from the VFs are allowed or
denied.  Only provide the VF the privilege to request the RSS info if
the system admin has explicitly set the policy to allow it.  All other
times the request should be denied.

As it stands this is a non-starter.  Privileged information cannot be
made available to VFs without a way for the system admin to set
policy as to whether the information should be made available or not.

- Greg Rose
Intel Corp
Networking Division
<gregory.v.rose@intel.com>
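
Points A and B above amount to a default-deny policy gate. The following is
a minimal user-space sketch of that idea only. It is not ixgbe code:
vf_policy_sketch and vf_rss_query_check() are hypothetical names, and
returning -EPERM for a denied request is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-VF policy as kept by the PF driver.
 * A zero-initialized policy denies the query (point A). */
struct vf_policy_sketch {
	bool rss_query_allowed;	/* set only by an explicit admin action (point B) */
};

/* Return 0 if the VF may read the RSS info, -EPERM otherwise. */
static int vf_rss_query_check(const struct vf_policy_sketch *policy)
{
	assert(policy);
	return policy->rss_query_allowed ? 0 : -EPERM;
}
```

The PF side would call such a check before answering a RETA or RSS key
request from a VF, and the flag would only ever be flipped through an
administrator-facing control.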


>
> --
> 2.1.0
>


* route/max_size sysctl in ipv4
From: Ani Sinha @ 2015-01-05 23:48 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hi all :

Please see : http://lxr.free-electrons.com/source/Documentation/networking/ip-sysctl.txt

I am looking at the code and it looks like since the route cache for
ipv4 was removed from the kernel, this sysctl parameter no longer
serves the same purpose. It does not look like it is even used in the
ipv4/route.c module. Is there an equivalent sysctl parameter limiting
the number of route entries in the kernel? Or is there now no
mechanism to limit the number of route entries?

Could someone please enlighten me?

Thanks a lot in advance,
ani


* Re: [net-next PATCH v1 01/11] net: flow_table: create interface for hw match/action tables
From: John Fastabend @ 2015-01-05 23:29 UTC (permalink / raw)
  To: Thomas Graf; +Cc: sfeldma, jiri, jhs, simon.horman, netdev, davem, andy
In-Reply-To: <54AADEFF.3090306@gmail.com>

On 01/05/2015 10:59 AM, John Fastabend wrote:

[...]

>>> +#ifndef _UAPI_LINUX_IF_FLOW
>>> +#define _UAPI_LINUX_IF_FLOW
>>> +
>>> +#include <linux/types.h>
>>> +#include <linux/netlink.h>
>>> +#include <linux/if.h>
>>> +
>>> +#define NET_FLOW_NAMSIZ 80
>>
>> Did you consider allocating the memory for names? I don't have a grasp
>> for the typical number of net_flow_* instances in memory yet.
>>
>
> <100k in the devices I have. Maybe Simon can pitch in on what is typical
> on the NPUs; I'm not sure about them.
>
> Rocker tables can grow as large as needed at the moment.
>
> Allocating the memory may help. I'll go ahead and give it a try.
>

One issue with breaking this up is that a couple of structures are being
passed as attributes with name[] as a field. I think it's best to break
these up; passing empty arrays seems ugly at best. So I'll need to
adjust some of the messaging as well.

-- 
John Fastabend         Intel Corporation


* Re: r8169 crash also in 3.16.7
From: Francois Romieu @ 2015-01-05 23:00 UTC (permalink / raw)
  To: ael; +Cc: netdev
In-Reply-To: <20150105112044.GA5659@shelf.conquest>

ael <lawrence_a_e@ntlworld.com> :
[...]
> Jan  4 22:09:16 shelf kernel: [10650.233180] WARNING: CPU: 0 PID: 0 at /build/linux-CMiYW9/linux-3.16.7-ckt2/net/sched/sch_generic.c:264 dev_watchdog+0x236/0x240()
[...]
> Jan  4 22:09:16 shelf kernel: [10650.233263] Hardware name: Notebook                         W54_55SU1,SUW/W54_55SU1,SUW, BIOS 4.6.5 05/29/2014

The current kernel contains several changes for your (RTL_GIGA_MAC_VER_44)
8411 r8169 chipset that 3.16.7 doesn't know of. You should give it a try.

-- 
Ueimor


* [PATCH net-next v3 6/6] net: tcp: add per route congestion control
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

This work adds the possibility to define a per route/destination
congestion control algorithm. Generally, this opens up the possibility
for a machine with different links to enforce specific congestion
control algorithms with optimal strategies for each of them based
on their network characteristics, even transparently for a single
application listening on all links.

For our specific use case, this additionally facilitates deployment
of DCTCP; for example, applications can easily serve internal
traffic/dsts with DCTCP and external ones with CUBIC. Other scenarios
would also allow for utilizing e.g. long-lived, low-priority
background flows for certain destinations/routes while still
allowing normal traffic to utilize the default congestion control
algorithm. We also thought about a per netns setting (where different
defaults are possible), but given it's actually a link-specific
property, we argue that a per route/destination setting is the most
natural and flexible.

The administrator can utilize this through ip-route(8) by appending
"congctl [lock] <name>", where <name> denotes the name of a
congestion control algorithm and the optional lock parameter
enforces the given algorithm so that applications in user space
are not allowed to override that algorithm for that destination.

The dst metric lookups are being done when a dst entry is already
available in order to avoid a costly lookup and still before the
algorithms are being initialized, thus overhead is very low when the
feature is not being used. While the client side would need to drop
the current reference on the module, on server side this can actually
even be avoided as we just got a flat-copied socket clone.

Joint work with Florian Westphal.

Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/tcp.h        |  6 ++++++
 net/ipv4/tcp_ipv4.c      |  2 ++
 net/ipv4/tcp_minisocks.c | 30 ++++++++++++++++++++++++++----
 net/ipv4/tcp_output.c    | 21 +++++++++++++++++++++
 net/ipv6/tcp_ipv6.c      |  2 ++
 5 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 95bb237..b8fdc6b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -448,6 +448,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb);
 struct sock *tcp_create_openreq_child(struct sock *sk,
 				      struct request_sock *req,
 				      struct sk_buff *skb);
+void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst);
 struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 				  struct request_sock *req,
 				  struct dst_entry *dst);
@@ -636,6 +637,11 @@ static inline u32 tcp_rto_min_us(struct sock *sk)
 	return jiffies_to_usecs(tcp_rto_min(sk));
 }
 
+static inline bool tcp_ca_dst_locked(const struct dst_entry *dst)
+{
+	return dst_metric_locked(dst, RTAX_CC_ALGO);
+}
+
 /* Compute the actual receive window we are currently advertising.
  * Rcv_nxt can be after the window if our peer push more data
  * than the offered window.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a3f72d7..ad3e65b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1340,6 +1340,8 @@ struct sock *tcp_v4_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 	}
 	sk_setup_caps(newsk, dst);
 
+	tcp_ca_openreq_child(newsk, dst);
+
 	tcp_sync_mss(newsk, dst_mtu(dst));
 	newtp->advmss = dst_metric_advmss(dst);
 	if (tcp_sk(sk)->rx_opt.user_mss &&
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 63d2680..bc9216d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -399,6 +399,32 @@ static void tcp_ecn_openreq_child(struct tcp_sock *tp,
 	tp->ecn_flags = inet_rsk(req)->ecn_ok ? TCP_ECN_OK : 0;
 }
 
+void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	u32 ca_key = dst_metric(dst, RTAX_CC_ALGO);
+	bool ca_got_dst = false;
+
+	if (ca_key != TCP_CA_UNSPEC) {
+		const struct tcp_congestion_ops *ca;
+
+		rcu_read_lock();
+		ca = tcp_ca_find_key(ca_key);
+		if (likely(ca && try_module_get(ca->owner))) {
+			icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
+			icsk->icsk_ca_ops = ca;
+			ca_got_dst = true;
+		}
+		rcu_read_unlock();
+	}
+
+	if (!ca_got_dst && !try_module_get(icsk->icsk_ca_ops->owner))
+		tcp_assign_congestion_control(sk);
+
+	tcp_set_ca_state(sk, TCP_CA_Open);
+}
+EXPORT_SYMBOL_GPL(tcp_ca_openreq_child);
+
 /* This is not only more efficient than what we used to do, it eliminates
  * a lot of code duplication between IPv4/IPv6 SYN recv processing. -DaveM
  *
@@ -451,10 +477,6 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->snd_cwnd = TCP_INIT_CWND;
 		newtp->snd_cwnd_cnt = 0;
 
-		if (!try_module_get(newicsk->icsk_ca_ops->owner))
-			tcp_assign_congestion_control(newsk);
-
-		tcp_set_ca_state(newsk, TCP_CA_Open);
 		tcp_init_xmit_timers(newsk);
 		__skb_queue_head_init(&newtp->out_of_order_queue);
 		newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..dc30cb5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2939,6 +2939,25 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 }
 EXPORT_SYMBOL(tcp_make_synack);
 
+static void tcp_ca_dst_init(struct sock *sk, const struct dst_entry *dst)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	const struct tcp_congestion_ops *ca;
+	u32 ca_key = dst_metric(dst, RTAX_CC_ALGO);
+
+	if (ca_key == TCP_CA_UNSPEC)
+		return;
+
+	rcu_read_lock();
+	ca = tcp_ca_find_key(ca_key);
+	if (likely(ca && try_module_get(ca->owner))) {
+		module_put(icsk->icsk_ca_ops->owner);
+		icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
+		icsk->icsk_ca_ops = ca;
+	}
+	rcu_read_unlock();
+}
+
 /* Do all connect socket setups that can be done AF independent. */
 static void tcp_connect_init(struct sock *sk)
 {
@@ -2964,6 +2983,8 @@ static void tcp_connect_init(struct sock *sk)
 	tcp_mtup_init(sk);
 	tcp_sync_mss(sk, dst_mtu(dst));
 
+	tcp_ca_dst_init(sk, dst);
+
 	if (!tp->window_clamp)
 		tp->window_clamp = dst_metric(dst, RTAX_WINDOW);
 	tp->advmss = dst_metric_advmss(dst);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 9c0b54e..5d46832 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1199,6 +1199,8 @@ static struct sock *tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		inet_csk(newsk)->icsk_ext_hdr_len = (newnp->opt->opt_nflen +
 						     newnp->opt->opt_flen);
 
+	tcp_ca_openreq_child(newsk, dst);
+
 	tcp_sync_mss(newsk, dst_mtu(dst));
 	newtp->advmss = dst_metric_advmss(dst);
 	if (tcp_sk(sk)->rx_opt.user_mss &&
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 1/6] net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev, Michal Kubeček
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

When IPv6 host routes with metrics attached are being added, we fetch
the metrics store from the dst via COW through dst_metrics_write_ptr(),
added through commit e5fd387ad5b3.

One remaining problem here is that we actually call into inet_getpeer()
and may end up allocating/creating a new peer from the kmemcache, which
may fail.

Example trace from perf probe (inet_getpeer:41) where create is 1:

ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293)
  85e294 inet_getpeer.part.7 (<- kmem_cache_alloc())
  85e578 inet_getpeer
  8eb333 ipv6_cow_metrics
  8f10ff fib6_commit_metrics

Therefore, a check for NULL on the return of dst_metrics_write_ptr()
is necessary here.
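
The control flow of the fix can be sketched in user space as follows.
This is only a minimal model under stated assumptions: struct model_dst,
model_write_ptr() and model_commit_metric() are hypothetical stand-ins
for the dst entry, dst_metrics_write_ptr() and fib6_commit_metrics(),
and the shared_store_fails flag simulates inet_getpeer() failing to
allocate a peer. The point is that a single NULL check covers both
sources of the metrics array.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define METRICS_MAX 16	/* hypothetical stand-in for RTAX_MAX */

/* Hypothetical model of a dst entry: host routes borrow a shared
 * metrics store that may fail to materialize, other routes own one.
 */
struct model_dst {
	bool is_host;
	bool shared_store_fails;	/* simulates inet_getpeer() failing */
	uint32_t shared_store[METRICS_MAX];
	uint32_t *metrics;		/* owned array for non-host routes */
};

/* Stand-in for dst_metrics_write_ptr(): may return NULL. */
static uint32_t *model_write_ptr(struct model_dst *dst)
{
	return dst->shared_store_fails ? NULL : dst->shared_store;
}

/* One NULL check covers both the borrowed and the allocated array. */
static int model_commit_metric(struct model_dst *dst, int type, uint32_t val)
{
	uint32_t *mp;

	assert(dst != NULL);

	if (type < 1 || type > METRICS_MAX)
		return -EINVAL;

	mp = dst->is_host ? model_write_ptr(dst)
			  : calloc(METRICS_MAX, sizeof(*mp));
	if (!mp)
		return -ENOMEM;
	if (!dst->is_host)
		dst->metrics = mp;	/* non-host route owns the array */

	mp[type - 1] = val;
	return 0;
}
```

Without the check, the host route case would write through a NULL
pointer whenever the shared store cannot be created.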

Joint work with Florian Westphal.

Fixes: e5fd387ad5b3 ("ipv6: do not overwrite inetpeer metrics prematurely")
Cc: Michal Kubeček <mkubecek@suse.cz>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/ipv6/ip6_fib.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index b2d1838..db4984e 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -633,18 +633,17 @@ static bool rt6_qualify_for_ecmp(struct rt6_info *rt)
 static int fib6_commit_metrics(struct dst_entry *dst,
 			       struct nlattr *mx, int mx_len)
 {
+	bool dst_host = dst->flags & DST_HOST;
 	struct nlattr *nla;
 	int remaining;
 	u32 *mp;
 
-	if (dst->flags & DST_HOST) {
-		mp = dst_metrics_write_ptr(dst);
-	} else {
-		mp = kzalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
-		if (!mp)
-			return -ENOMEM;
+	mp = dst_host ? dst_metrics_write_ptr(dst) :
+			kzalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
+	if (unlikely(!mp))
+		return -ENOMEM;
+	if (!dst_host)
 		dst_init_metrics(dst, mp, 0);
-	}
 
 	nla_for_each_attr(nla, mx, mx_len, remaining) {
 		int type = nla_type(nla);
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 5/6] net: tcp: add RTAX_CC_ALGO fib handling
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
control metric to be set up and dumped back to user space.

While RTAX_CC_ALGO is represented internally as a u32 key, we avoid
exposing this implementation detail to user space. Instead, the netlink
attribute exchanged with user space carries the actual congestion control
algorithm name, similar to the setsockopt(2) API, which allows for maximum
flexibility, even for 3rd party modules.
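
The translation at the user space boundary can be sketched like this.
It is a minimal user-space model, not the kernel code: ca_table, its key
values, metric_from_user() and metric_to_user() are hypothetical names
made up for illustration. Names are turned into keys on input, and keys
back into names on dump; a key that no longer resolves is simply
skipped.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CA_NAME_MAX 16	/* stand-in for TCP_CA_NAME_MAX */
#define CA_UNSPEC   0	/* stand-in for TCP_CA_UNSPEC */

/* Hypothetical registry; the key values are made up for illustration. */
struct ca_entry {
	const char *name;
	uint32_t key;
};

static const struct ca_entry ca_table[] = {
	{ "reno",  0x1001u },
	{ "cubic", 0x1002u },
};

#define CA_TABLE_LEN (sizeof(ca_table) / sizeof(ca_table[0]))

/* Input path: user space sends a name, the metric stores a key. */
static int metric_from_user(const char *name, uint32_t *val)
{
	assert(name != NULL && val != NULL);

	*val = CA_UNSPEC;
	for (size_t i = 0; i < CA_TABLE_LEN; i++) {
		if (strcmp(ca_table[i].name, name) == 0) {
			*val = ca_table[i].key;
			return 0;
		}
	}
	return -1;	/* unknown algorithm: reject the route */
}

/* Dump path: returns 1 and fills buf if the key resolves, else 0. */
static int metric_to_user(uint32_t key, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	for (size_t i = 0; i < CA_TABLE_LEN; i++) {
		if (ca_table[i].key == key) {
			snprintf(buf, len, "%s", ca_table[i].name);
			return 1;
		}
	}
	return 0;	/* stale key: skip the attribute in the dump */
}
```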

It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot, as
it should have been stored in RTAX_FEATURES instead. We first thought
about reusing it for the congestion control key, but that brings more
complications and/or confusion than it is worth.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/tcp.h              |  7 +++++++
 include/uapi/linux/rtnetlink.h |  2 ++
 net/core/rtnetlink.c           | 15 +++++++++++++--
 net/decnet/dn_fib.c            |  3 ++-
 net/decnet/dn_table.c          |  4 +++-
 net/ipv4/fib_semantics.c       | 14 ++++++++++++--
 net/ipv6/route.c               | 17 +++++++++++++++--
 7 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 135b70c..95bb237 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -846,7 +846,14 @@ extern struct tcp_congestion_ops tcp_reno;
 
 struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
 u32 tcp_ca_get_key_by_name(const char *name);
+#ifdef CONFIG_INET
 char *tcp_ca_get_name_by_key(u32 key, char *buffer);
+#else
+static inline char *tcp_ca_get_name_by_key(u32 key, char *buffer)
+{
+	return NULL;
+}
+#endif
 
 static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 {
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 9c9b8b4..d81f22d 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -389,6 +389,8 @@ enum {
 #define RTAX_INITRWND RTAX_INITRWND
 	RTAX_QUICKACK,
 #define RTAX_QUICKACK RTAX_QUICKACK
+	RTAX_CC_ALGO,
+#define RTAX_CC_ALGO RTAX_CC_ALGO
 	__RTAX_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 9cf6fe9..001372f 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -50,6 +50,7 @@
 #include <net/arp.h>
 #include <net/route.h>
 #include <net/udp.h>
+#include <net/tcp.h>
 #include <net/sock.h>
 #include <net/pkt_sched.h>
 #include <net/fib_rules.h>
@@ -669,9 +670,19 @@ int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics)
 
 	for (i = 0; i < RTAX_MAX; i++) {
 		if (metrics[i]) {
+			if (i == RTAX_CC_ALGO - 1) {
+				char tmp[TCP_CA_NAME_MAX], *name;
+
+				name = tcp_ca_get_name_by_key(metrics[i], tmp);
+				if (!name)
+					continue;
+				if (nla_put_string(skb, i + 1, name))
+					goto nla_put_failure;
+			} else {
+				if (nla_put_u32(skb, i + 1, metrics[i]))
+					goto nla_put_failure;
+			}
 			valid++;
-			if (nla_put_u32(skb, i+1, metrics[i]))
-				goto nla_put_failure;
 		}
 	}
 
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index d332aef..df48034 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -298,7 +298,8 @@ struct dn_fib_info *dn_fib_create_info(const struct rtmsg *r, struct nlattr *att
 			int type = nla_type(attr);
 
 			if (type) {
-				if (type > RTAX_MAX || nla_len(attr) < 4)
+				if (type > RTAX_MAX || type == RTAX_CC_ALGO ||
+				    nla_len(attr) < 4)
 					goto err_inval;
 
 				fi->fib_metrics[type-1] = nla_get_u32(attr);
diff --git a/net/decnet/dn_table.c b/net/decnet/dn_table.c
index 86e3807..3f19fcb 100644
--- a/net/decnet/dn_table.c
+++ b/net/decnet/dn_table.c
@@ -29,6 +29,7 @@
 #include <linux/route.h> /* RTF_xxx */
 #include <net/neighbour.h>
 #include <net/netlink.h>
+#include <net/tcp.h>
 #include <net/dst.h>
 #include <net/flow.h>
 #include <net/fib_rules.h>
@@ -273,7 +274,8 @@ static inline size_t dn_fib_nlmsg_size(struct dn_fib_info *fi)
 	size_t payload = NLMSG_ALIGN(sizeof(struct rtmsg))
 			 + nla_total_size(4) /* RTA_TABLE */
 			 + nla_total_size(2) /* RTA_DST */
-			 + nla_total_size(4); /* RTA_PRIORITY */
+			 + nla_total_size(4) /* RTA_PRIORITY */
+			 + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 
 	/* space for nested metrics */
 	payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f99f41b..d2b7b55 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -360,7 +360,8 @@ static inline size_t fib_nlmsg_size(struct fib_info *fi)
 			 + nla_total_size(4) /* RTA_TABLE */
 			 + nla_total_size(4) /* RTA_DST */
 			 + nla_total_size(4) /* RTA_PRIORITY */
-			 + nla_total_size(4); /* RTA_PREFSRC */
+			 + nla_total_size(4) /* RTA_PREFSRC */
+			 + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 
 	/* space for nested metrics */
 	payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
@@ -859,7 +860,16 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 
 				if (type > RTAX_MAX)
 					goto err_inval;
-				val = nla_get_u32(nla);
+				if (type == RTAX_CC_ALGO) {
+					char tmp[TCP_CA_NAME_MAX];
+
+					nla_strlcpy(tmp, nla, sizeof(tmp));
+					val = tcp_ca_get_key_by_name(tmp);
+					if (val == TCP_CA_UNSPEC)
+						goto err_inval;
+				} else {
+					val = nla_get_u32(nla);
+				}
 				if (type == RTAX_ADVMSS && val > 65535 - 40)
 					val = 65535 - 40;
 				if (type == RTAX_MTU && val > 65535 - 15)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 454771d..34dcbb5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1488,10 +1488,22 @@ static int ip6_convert_metrics(struct mx6_config *mxc,
 		int type = nla_type(nla);
 
 		if (type) {
+			u32 val;
+
 			if (unlikely(type > RTAX_MAX))
 				goto err;
+			if (type == RTAX_CC_ALGO) {
+				char tmp[TCP_CA_NAME_MAX];
+
+				nla_strlcpy(tmp, nla, sizeof(tmp));
+				val = tcp_ca_get_key_by_name(tmp);
+				if (val == TCP_CA_UNSPEC)
+					goto err;
+			} else {
+				val = nla_get_u32(nla);
+			}
 
-			mp[type - 1] = nla_get_u32(nla);
+			mp[type - 1] = val;
 			__set_bit(type - 1, mxc->mx_valid);
 		}
 	}
@@ -2571,7 +2583,8 @@ static inline size_t rt6_nlmsg_size(void)
 	       + nla_total_size(4) /* RTA_OIF */
 	       + nla_total_size(4) /* RTA_PRIORITY */
 	       + RTAX_MAX * nla_total_size(4) /* RTA_METRICS */
-	       + nla_total_size(sizeof(struct rta_cacheinfo));
+	       + nla_total_size(sizeof(struct rta_cacheinfo))
+	       + nla_total_size(TCP_CA_NAME_MAX); /* RTAX_CC_ALGO */
 }
 
 static int rt6_fill_node(struct net *net,
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 0/6] net: allow setting congctl via routing table
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev

This is the second part of our work and allows for setting the congestion
control algorithm via routing table. For details, please see individual
patches.

Since patch 1 is a bug fix, we suggest applying patch 1 to net, and then
merging net into net-next, for example, and following up with the remaining
feature patches wrt dependencies.

Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.

Patch for iproute2 is available under [1], but will be reposted along
with the man-page update when this set hits net-next.

  [1] http://patchwork.ozlabs.org/patch/418149/

Thanks!

v2 -> v3:
 - Added module auto-loading as suggested by David Miller, thanks!
  - Added patch 2 for handling possible sleeps in fib6
  - While working on this, we discovered a bug, hence fix in patch 1
  - Added auto-loading to patch 4
 - Rebased, retested, rest the same.
v1 -> v2:
 - Very sorry, I noticed I had decnet disabled during testing.
   Added missing header include in decnet, rest as is.

Daniel Borkmann (5):
  net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference
  net: tcp: refactor reinitialization of congestion control
  net: tcp: add key management to congestion control
  net: tcp: add RTAX_CC_ALGO fib handling
  net: tcp: add per route congestion control

Florian Westphal (1):
  net: fib6: convert cfg metric to u32 outside of table write lock

 include/net/inet_connection_sock.h |   3 +-
 include/net/ip6_fib.h              |  10 ++-
 include/net/tcp.h                  |  22 ++++++-
 include/uapi/linux/rtnetlink.h     |   2 +
 net/core/rtnetlink.c               |  15 ++++-
 net/decnet/dn_fib.c                |   3 +-
 net/decnet/dn_table.c              |   4 +-
 net/ipv4/fib_semantics.c           |  14 ++++-
 net/ipv4/tcp_cong.c                | 121 +++++++++++++++++++++++++++++--------
 net/ipv4/tcp_ipv4.c                |   2 +
 net/ipv4/tcp_minisocks.c           |  30 +++++++--
 net/ipv4/tcp_output.c              |  21 +++++++
 net/ipv6/ip6_fib.c                 |  68 +++++++++++----------
 net/ipv6/route.c                   |  72 ++++++++++++++++++----
 net/ipv6/tcp_ipv6.c                |   2 +
 15 files changed, 304 insertions(+), 85 deletions(-)

-- 
1.7.11.7

^ permalink raw reply

* [PATCH net-next v3 4/6] net: tcp: add key management to congestion control
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

This patch adds necessary infrastructure to the congestion control
framework for later per route congestion control support.

To support per route congestion control, our aim is to store a
unique u32 key identifier in the dst metrics, which can then be
mapped to a tcp_congestion_ops struct. We argue that an RTAX key
entry is the simplest, most generic and easiest approach to manage,
and that it also keeps the memory footprint of dst entries lower on
64 bit than storing a pointer directly, for example. A unique key
id also decouples the actual TCP congestion control module
management from the FIB layer, i.e. we don't have to care about
expensive module refcounting inside the FIB at this point.

We first thought of using an IDR store for the realization, which
takes over dynamic assignment of unused key space and also performs
the key to pointer mapping in RCU. While doing so, we stumbled upon
the issue that, due to the nature of dynamic key distribution,
excessive module loads and unloads can, on admittedly very rare
occasions, lead to reuse of previously used key space. Stale keys
in the dst metric would then be reassigned to a different
congestion control algorithm, which might lead to unexpected
behaviour. One way to resolve this would have been to walk the FIBs
on the rare occasion of a module unload and reset the metric keys
for each FIB in each netns, but that is very costly.

Therefore, we argue that a better solution is to reuse the unique
congestion control algorithm name member and map it into u32 key
space through jhash. For that, we split the flags attribute (which
currently uses only 2 bits anyway) into two u32 attributes, flags
and key, so that the struct stays within 2 cachelines on x86_64 and
the precalculated key is cached at registration time for the fast
path. On average we might expect 2 - 4 modules to be loaded, in the
worst case perhaps 15, so the probability of a key collision is
extremely low, and keys are guaranteed collision-free on LE/BE for
all in-tree modules. Overall this results in much simpler code, and
all without the overhead of an IDR. Since keys are deterministic,
modules can now be unloaded: the congestion control algorithm for
a specific but unloaded key falls back to the default one, and on
module reload it switches back to the expected algorithm
transparently.
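
The deterministic key scheme can be sketched in user space as below.
This is a minimal model under loud assumptions: FNV-1a is used purely
as a stand-in for the kernel's jhash(), and the fixed-size registry,
name_to_key(), find_key(), ca_register() and ca_unregister() are
hypothetical names, not the tcp_cong.c API. What it shows is that a key
derived from the name survives an unregister/register cycle, and that
registration rejects the reserved key 0 and duplicate keys.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CA_UNSPEC 0u	/* reserved: "no algorithm set" */
#define CA_SLOTS  8	/* hypothetical registry capacity */

struct ca_ops {
	const char *name;
	uint32_t key;
	int in_use;
};

static struct ca_ops registry[CA_SLOTS];

/* FNV-1a, used here only as a stand-in for the kernel's jhash(). */
static uint32_t name_to_key(const char *name)
{
	uint32_t h = 2166136261u;

	for (; *name; name++) {
		h ^= (unsigned char)*name;
		h *= 16777619u;
	}
	return h;
}

/* Simple linear search, as in the patch. */
static struct ca_ops *find_key(uint32_t key)
{
	for (size_t i = 0; i < CA_SLOTS; i++)
		if (registry[i].in_use && registry[i].key == key)
			return &registry[i];
	return NULL;
}

/* Reject the reserved key and any key that is already taken. */
static int ca_register(const char *name)
{
	uint32_t key;

	assert(name != NULL);

	key = name_to_key(name);
	if (key == CA_UNSPEC || find_key(key))
		return -1;

	for (size_t i = 0; i < CA_SLOTS; i++) {
		if (!registry[i].in_use) {
			registry[i].name = name;
			registry[i].key = key;
			registry[i].in_use = 1;
			return 0;
		}
	}
	return -1;	/* registry full */
}

static void ca_unregister(const char *name)
{
	struct ca_ops *ca = find_key(name_to_key(name));

	if (ca && strcmp(ca->name, name) == 0)
		ca->in_use = 0;
}
```

A key stored elsewhere while the algorithm is unregistered simply fails
to resolve, and resolves again once the same name is registered, which
is the property an IDR-assigned id cannot guarantee.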

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/net/inet_connection_sock.h |  3 +-
 include/net/tcp.h                  |  9 +++-
 net/ipv4/tcp_cong.c                | 97 +++++++++++++++++++++++++++++++-------
 3 files changed, 91 insertions(+), 18 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 848e85c..5976bde 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -98,7 +98,8 @@ struct inet_connection_sock {
 	const struct tcp_congestion_ops *icsk_ca_ops;
 	const struct inet_connection_sock_af_ops *icsk_af_ops;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
-	__u8			  icsk_ca_state;
+	__u8			  icsk_ca_state:7,
+				  icsk_ca_dst_locked:1;
 	__u8			  icsk_retransmits;
 	__u8			  icsk_pending;
 	__u8			  icsk_backoff;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f50f29faf..135b70c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -787,6 +787,8 @@ enum tcp_ca_ack_event_flags {
 #define TCP_CA_MAX	128
 #define TCP_CA_BUF_MAX	(TCP_CA_NAME_MAX*TCP_CA_MAX)
 
+#define TCP_CA_UNSPEC	0
+
 /* Algorithm can be set on socket without CAP_NET_ADMIN privileges */
 #define TCP_CONG_NON_RESTRICTED 0x1
 /* Requires ECN/ECT set on all packets */
@@ -794,7 +796,8 @@ enum tcp_ca_ack_event_flags {
 
 struct tcp_congestion_ops {
 	struct list_head	list;
-	unsigned long flags;
+	u32 key;
+	u32 flags;
 
 	/* initialize private data (optional) */
 	void (*init)(struct sock *sk);
@@ -841,6 +844,10 @@ u32 tcp_reno_ssthresh(struct sock *sk);
 void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked);
 extern struct tcp_congestion_ops tcp_reno;
 
+struct tcp_congestion_ops *tcp_ca_find_key(u32 key);
+u32 tcp_ca_get_key_by_name(const char *name);
+char *tcp_ca_get_name_by_key(u32 key, char *buffer);
+
 static inline bool tcp_ca_needs_ecn(const struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 38f2f8a..63c29db 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -13,6 +13,7 @@
 #include <linux/types.h>
 #include <linux/list.h>
 #include <linux/gfp.h>
+#include <linux/jhash.h>
 #include <net/tcp.h>
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -31,6 +32,34 @@ static struct tcp_congestion_ops *tcp_ca_find(const char *name)
 	return NULL;
 }
 
+/* Must be called with rcu lock held */
+static const struct tcp_congestion_ops *__tcp_ca_find_autoload(const char *name)
+{
+	const struct tcp_congestion_ops *ca = tcp_ca_find(name);
+#ifdef CONFIG_MODULES
+	if (!ca && capable(CAP_NET_ADMIN)) {
+		rcu_read_unlock();
+		request_module("tcp_%s", name);
+		rcu_read_lock();
+		ca = tcp_ca_find(name);
+	}
+#endif
+	return ca;
+}
+
+/* Simple linear search, not much in here. */
+struct tcp_congestion_ops *tcp_ca_find_key(u32 key)
+{
+	struct tcp_congestion_ops *e;
+
+	list_for_each_entry_rcu(e, &tcp_cong_list, list) {
+		if (e->key == key)
+			return e;
+	}
+
+	return NULL;
+}
+
 /*
  * Attach new congestion control algorithm to the list
  * of available options.
@@ -45,9 +74,12 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
 		return -EINVAL;
 	}
 
+	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
+
 	spin_lock(&tcp_cong_list_lock);
-	if (tcp_ca_find(ca->name)) {
-		pr_notice("%s already registered\n", ca->name);
+	if (ca->key == TCP_CA_UNSPEC || tcp_ca_find_key(ca->key)) {
+		pr_notice("%s already registered or non-unique key\n",
+			  ca->name);
 		ret = -EEXIST;
 	} else {
 		list_add_tail_rcu(&ca->list, &tcp_cong_list);
@@ -70,9 +102,50 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
 	spin_lock(&tcp_cong_list_lock);
 	list_del_rcu(&ca->list);
 	spin_unlock(&tcp_cong_list_lock);
+
+	/* Wait for outstanding readers to complete before the
+	 * module gets removed entirely.
+	 *
+	 * A try_module_get() should fail by now as our module is
+	 * in "going" state since no refs are held anymore and
+	 * module_exit() handler being called.
+	 */
+	synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
 
+u32 tcp_ca_get_key_by_name(const char *name)
+{
+	const struct tcp_congestion_ops *ca;
+	u32 key;
+
+	might_sleep();
+
+	rcu_read_lock();
+	ca = __tcp_ca_find_autoload(name);
+	key = ca ? ca->key : TCP_CA_UNSPEC;
+	rcu_read_unlock();
+
+	return key;
+}
+EXPORT_SYMBOL_GPL(tcp_ca_get_key_by_name);
+
+char *tcp_ca_get_name_by_key(u32 key, char *buffer)
+{
+	const struct tcp_congestion_ops *ca;
+	char *ret = NULL;
+
+	rcu_read_lock();
+	ca = tcp_ca_find_key(key);
+	if (ca)
+		ret = strncpy(buffer, ca->name,
+			      TCP_CA_NAME_MAX);
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(tcp_ca_get_name_by_key);
+
 /* Assign choice of congestion control. */
 void tcp_assign_congestion_control(struct sock *sk)
 {
@@ -253,25 +326,17 @@ out:
 int tcp_set_congestion_control(struct sock *sk, const char *name)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
-	struct tcp_congestion_ops *ca;
+	const struct tcp_congestion_ops *ca;
 	int err = 0;
 
-	rcu_read_lock();
-	ca = tcp_ca_find(name);
+	if (icsk->icsk_ca_dst_locked)
+		return -EPERM;
 
-	/* no change asking for existing value */
+	rcu_read_lock();
+	ca = __tcp_ca_find_autoload(name);
+	/* No change asking for existing value */
 	if (ca == icsk->icsk_ca_ops)
 		goto out;
-
-#ifdef CONFIG_MODULES
-	/* not found attempt to autoload module */
-	if (!ca && capable(CAP_NET_ADMIN)) {
-		rcu_read_unlock();
-		request_module("tcp_%s", name);
-		rcu_read_lock();
-		ca = tcp_ca_find(name);
-	}
-#endif
 	if (!ca)
 		err = -ENOENT;
 	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 2/6] net: fib6: convert cfg metric to u32 outside of table write lock
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

From: Florian Westphal <fw@strlen.de>

Do the nla validation earlier, outside the write lock.

This is needed by a follow-up patch, which must be able to call
request_module() (which can sleep) when converting the metrics.
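
The two-phase split can be sketched in user space as follows. This is
only a model under stated assumptions: the table_locked flag stands in
for the table write lock, and struct attr, struct mx_config,
convert_metrics(), commit_metrics() and route_add() are hypothetical
simplifications of the nlattr walk, mx6_config, ip6_convert_metrics(),
fib6_copy_metrics() and ip6_route_add(). Everything that may allocate
or sleep happens before the lock is taken; under the lock only a plain
copy remains.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define METRICS_MAX 16	/* hypothetical stand-in for RTAX_MAX */

struct attr {		/* simplified netlink attribute */
	int type;
	uint32_t val;
};

struct mx_config {	/* simplified mx6_config */
	uint32_t *mx;
	bool valid[METRICS_MAX];
};

static bool table_locked;	/* models the table write lock */

/* Phase 1: validate and convert; may allocate, so must run unlocked. */
static int convert_metrics(struct mx_config *mxc,
			   const struct attr *attrs, size_t n)
{
	uint32_t *mp;

	assert(!table_locked);

	if (n == 0)
		return 0;

	mp = calloc(METRICS_MAX, sizeof(*mp));
	if (!mp)
		return -ENOMEM;

	for (size_t i = 0; i < n; i++) {
		int type = attrs[i].type;

		if (type == 0)
			continue;
		if (type < 0 || type > METRICS_MAX) {
			free(mp);
			return -EINVAL;
		}
		mp[type - 1] = attrs[i].val;
		mxc->valid[type - 1] = true;
	}

	mxc->mx = mp;
	return 0;
}

/* Phase 2: runs under the lock and only copies validated values. */
static void commit_metrics(uint32_t *dst, const struct mx_config *mxc)
{
	assert(table_locked);

	if (!mxc->mx)
		return;
	for (int i = 0; i < METRICS_MAX; i++)
		if (mxc->valid[i])
			dst[i] = mxc->mx[i];
}

static int route_add(uint32_t *dst, const struct attr *attrs, size_t n)
{
	struct mx_config mxc = { 0 };
	int err = convert_metrics(&mxc, attrs, n);

	if (err)
		return err;

	table_locked = true;
	commit_metrics(dst, &mxc);
	table_locked = false;

	free(mxc.mx);
	return 0;
}
```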

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/ip6_fib.h | 10 +++++---
 net/ipv6/ip6_fib.c    | 69 +++++++++++++++++++++++++++------------------------
 net/ipv6/route.c      | 57 ++++++++++++++++++++++++++++++++++--------
 3 files changed, 90 insertions(+), 46 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 8eea35d..20e80fa 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -74,6 +74,11 @@ struct fib6_node {
 #define FIB6_SUBTREE(fn)	((fn)->subtree)
 #endif
 
+struct mx6_config {
+	const u32 *mx;
+	DECLARE_BITMAP(mx_valid, RTAX_MAX);
+};
+
 /*
  *	routing information
  *
@@ -291,9 +296,8 @@ struct fib6_node *fib6_locate(struct fib6_node *root,
 void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg),
 		    void *arg);
 
-int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
-	     struct nlattr *mx, int mx_len);
-
+int fib6_add(struct fib6_node *root, struct rt6_info *rt,
+	     struct nl_info *info, struct mx6_config *mxc);
 int fib6_del(struct rt6_info *rt, struct nl_info *info);
 
 void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index db4984e..03c520a 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -630,31 +630,35 @@ static bool rt6_qualify_for_ecmp(struct rt6_info *rt)
 	       RTF_GATEWAY;
 }
 
-static int fib6_commit_metrics(struct dst_entry *dst,
-			       struct nlattr *mx, int mx_len)
+static void fib6_copy_metrics(u32 *mp, const struct mx6_config *mxc)
 {
-	bool dst_host = dst->flags & DST_HOST;
-	struct nlattr *nla;
-	int remaining;
-	u32 *mp;
+	int i;
 
-	mp = dst_host ? dst_metrics_write_ptr(dst) :
-			kzalloc(sizeof(u32) * RTAX_MAX, GFP_ATOMIC);
-	if (unlikely(!mp))
-		return -ENOMEM;
-	if (!dst_host)
-		dst_init_metrics(dst, mp, 0);
+	for (i = 0; i < RTAX_MAX; i++) {
+		if (test_bit(i, mxc->mx_valid))
+			mp[i] = mxc->mx[i];
+	}
+}
+
+static int fib6_commit_metrics(struct dst_entry *dst, struct mx6_config *mxc)
+{
+	if (!mxc->mx)
+		return 0;
 
-	nla_for_each_attr(nla, mx, mx_len, remaining) {
-		int type = nla_type(nla);
+	if (dst->flags & DST_HOST) {
+		u32 *mp = dst_metrics_write_ptr(dst);
 
-		if (type) {
-			if (type > RTAX_MAX)
-				return -EINVAL;
+		if (unlikely(!mp))
+			return -ENOMEM;
 
-			mp[type - 1] = nla_get_u32(nla);
-		}
+		fib6_copy_metrics(mp, mxc);
+	} else {
+		dst_init_metrics(dst, mxc->mx, false);
+
+		/* We've stolen mx now. */
+		mxc->mx = NULL;
 	}
+
 	return 0;
 }
 
@@ -663,7 +667,7 @@ static int fib6_commit_metrics(struct dst_entry *dst,
  */
 
 static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
-			    struct nl_info *info, struct nlattr *mx, int mx_len)
+			    struct nl_info *info, struct mx6_config *mxc)
 {
 	struct rt6_info *iter = NULL;
 	struct rt6_info **ins;
@@ -772,11 +776,10 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 			pr_warn("NLM_F_CREATE should be set when creating new route\n");
 
 add:
-		if (mx) {
-			err = fib6_commit_metrics(&rt->dst, mx, mx_len);
-			if (err)
-				return err;
-		}
+		err = fib6_commit_metrics(&rt->dst, mxc);
+		if (err)
+			return err;
+
 		rt->dst.rt6_next = iter;
 		*ins = rt;
 		rt->rt6i_node = fn;
@@ -796,11 +799,11 @@ add:
 			pr_warn("NLM_F_REPLACE set, but no existing node found!\n");
 			return -ENOENT;
 		}
-		if (mx) {
-			err = fib6_commit_metrics(&rt->dst, mx, mx_len);
-			if (err)
-				return err;
-		}
+
+		err = fib6_commit_metrics(&rt->dst, mxc);
+		if (err)
+			return err;
+
 		*ins = rt;
 		rt->rt6i_node = fn;
 		rt->dst.rt6_next = iter->dst.rt6_next;
@@ -837,8 +840,8 @@ void fib6_force_start_gc(struct net *net)
  *	with source addr info in sub-trees
  */
 
-int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
-	     struct nlattr *mx, int mx_len)
+int fib6_add(struct fib6_node *root, struct rt6_info *rt,
+	     struct nl_info *info, struct mx6_config *mxc)
 {
 	struct fib6_node *fn, *pn = NULL;
 	int err = -ENOMEM;
@@ -933,7 +936,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info,
 	}
 #endif
 
-	err = fib6_add_rt2node(fn, rt, info, mx, mx_len);
+	err = fib6_add_rt2node(fn, rt, info, mxc);
 	if (!err) {
 		fib6_start_gc(info->nl_net, rt);
 		if (!(rt->rt6i_flags & RTF_CACHE))
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index c910831..454771d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -853,14 +853,14 @@ EXPORT_SYMBOL(rt6_lookup);
  */
 
 static int __ip6_ins_rt(struct rt6_info *rt, struct nl_info *info,
-			struct nlattr *mx, int mx_len)
+			struct mx6_config *mxc)
 {
 	int err;
 	struct fib6_table *table;
 
 	table = rt->rt6i_table;
 	write_lock_bh(&table->tb6_lock);
-	err = fib6_add(&table->tb6_root, rt, info, mx, mx_len);
+	err = fib6_add(&table->tb6_root, rt, info, mxc);
 	write_unlock_bh(&table->tb6_lock);
 
 	return err;
@@ -868,10 +868,10 @@ static int __ip6_ins_rt(struct rt6_info *rt, struct nl_info *info,
 
 int ip6_ins_rt(struct rt6_info *rt)
 {
-	struct nl_info info = {
-		.nl_net = dev_net(rt->dst.dev),
-	};
-	return __ip6_ins_rt(rt, &info, NULL, 0);
+	struct nl_info info = {	.nl_net = dev_net(rt->dst.dev), };
+	struct mx6_config mxc = { .mx = NULL, };
+
+	return __ip6_ins_rt(rt, &info, &mxc);
 }
 
 static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort,
@@ -1470,9 +1470,39 @@ out:
 	return entries > rt_max_size;
 }
 
-/*
- *
- */
+static int ip6_convert_metrics(struct mx6_config *mxc,
+			       const struct fib6_config *cfg)
+{
+	struct nlattr *nla;
+	int remaining;
+	u32 *mp;
+
+	if (cfg->fc_mx == NULL)
+		return 0;
+
+	mp = kzalloc(sizeof(u32) * RTAX_MAX, GFP_KERNEL);
+	if (unlikely(!mp))
+		return -ENOMEM;
+
+	nla_for_each_attr(nla, cfg->fc_mx, cfg->fc_mx_len, remaining) {
+		int type = nla_type(nla);
+
+		if (type) {
+			if (unlikely(type > RTAX_MAX))
+				goto err;
+
+			mp[type - 1] = nla_get_u32(nla);
+			__set_bit(type - 1, mxc->mx_valid);
+		}
+	}
+
+	mxc->mx = mp;
+
+	return 0;
+ err:
+	kfree(mp);
+	return -EINVAL;
+}
 
 int ip6_route_add(struct fib6_config *cfg)
 {
@@ -1482,6 +1512,7 @@ int ip6_route_add(struct fib6_config *cfg)
 	struct net_device *dev = NULL;
 	struct inet6_dev *idev = NULL;
 	struct fib6_table *table;
+	struct mx6_config mxc = { .mx = NULL, };
 	int addr_type;
 
 	if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128)
@@ -1677,8 +1708,14 @@ install_route:
 
 	cfg->fc_nlinfo.nl_net = dev_net(dev);
 
-	return __ip6_ins_rt(rt, &cfg->fc_nlinfo, cfg->fc_mx, cfg->fc_mx_len);
+	err = ip6_convert_metrics(&mxc, cfg);
+	if (err)
+		goto out;
+
+	err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, &mxc);
 
+	kfree(mxc.mx);
+	return err;
 out:
 	if (dev)
 		dev_put(dev);
-- 
1.7.11.7

^ permalink raw reply related

* [PATCH net-next v3 3/6] net: tcp: refactor reinitialization of congestion control
From: Daniel Borkmann @ 2015-01-05 22:57 UTC (permalink / raw)
  To: davem; +Cc: hannes, fw, netdev
In-Reply-To: <1420498668-4660-1-git-send-email-dborkman@redhat.com>

Move this into a separate function to make the code a bit more
readable. No functional change.

Joint work with Florian Westphal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 net/ipv4/tcp_cong.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 27ead0d..38f2f8a 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -107,6 +107,18 @@ void tcp_init_congestion_control(struct sock *sk)
 		icsk->icsk_ca_ops->init(sk);
 }
 
+static void tcp_reinit_congestion_control(struct sock *sk,
+					  const struct tcp_congestion_ops *ca)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	tcp_cleanup_congestion_control(sk);
+	icsk->icsk_ca_ops = ca;
+
+	if (sk->sk_state != TCP_CLOSE && icsk->icsk_ca_ops->init)
+		icsk->icsk_ca_ops->init(sk);
+}
+
 /* Manage refcounts on socket close. */
 void tcp_cleanup_congestion_control(struct sock *sk)
 {
@@ -262,21 +274,13 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 #endif
 	if (!ca)
 		err = -ENOENT;
-
 	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
 		   ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
 		err = -EPERM;
-
 	else if (!try_module_get(ca->owner))
 		err = -EBUSY;
-
-	else {
-		tcp_cleanup_congestion_control(sk);
-		icsk->icsk_ca_ops = ca;
-
-		if (sk->sk_state != TCP_CLOSE && icsk->icsk_ca_ops->init)
-			icsk->icsk_ca_ops->init(sk);
-	}
+	else
+		tcp_reinit_congestion_control(sk, ca);
  out:
 	rcu_read_unlock();
 	return err;
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH] Revert "ipw2200: select CFG80211_WEXT"
From: Arend van Spriel @ 2015-01-05 22:13 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Johannes Berg, Linus Torvalds, Marcel Holtmann,
	Stanislav Yakovlev, Kalle Valo, Jiri Kosina, linux-wireless,
	Network Development, Linux Kernel Mailing List
In-Reply-To: <1420495519.14308.29.camel@x220>

On 01/05/15 23:05, Paul Bolle wrote:
> On Mon, 2015-01-05 at 19:57 +0100, Johannes Berg wrote:
>> Multiple other groups of ioctls could be converted in similar patches,
>> until at the end you can completely remove ipw_wx_handlers and rely
>> entirely on cfg80211's wext compatibility.
>>
>> So far the theory - in practice nobody cared enough to start working on
>> any of these drivers, let alone actually has the hardware today.
>
> So my suggestion to make ipw2200 no longer use cfg80211_wext_giwname()
> would actually be backwards. What's actually needed, in theory, is to
> use more of what's provided under CFG80211_WEXT (and, I guess, less of
> what's provided under WIRELESS_EXT). Did I get that right?

Yes, but as Johannes indicated it needs consideration what to group in 
the patches.

Regards,
Arend

> Thanks,
>
>
> Paul Bolle
>

^ permalink raw reply
