* [PATCH net-next 1/2] geneve: Allow users to specify source port range
@ 2025-02-24 15:39 Daniel Borkmann
2025-02-24 15:39 ` [PATCH net-next 2/2] geneve, specs: Add port range to rt_link specification Daniel Borkmann
2025-02-26 2:26 ` [PATCH net-next 1/2] geneve: Allow users to specify source port range Jakub Kicinski
0 siblings, 2 replies; 4+ messages in thread
From: Daniel Borkmann @ 2025-02-24 15:39 UTC (permalink / raw)
To: kuba; +Cc: pabeni, netdev
Recently, in the case of Cilium, we ran into users on Azure who need to use
tunneling for east/west traffic because they would hit IPAM API limits for
Kubernetes Pods if they used publicly routable IPs for Pods. For tunneling,
Cilium supports either vxlan or geneve. To spread flows among remote CPUs
via RSS, both derive a source port hash via udp_flow_src_port(), which takes
the inner packet's skb->hash into account.
For clusters with many nodes, this can then hit a new limitation [0]: Today,
the Azure networking stack supports 1M total flows (500k inbound and 500k
outbound) for a VM. [...] Once this limit is hit, other connections are
dropped. [...] Each flow is distinguished by a 5-tuple (protocol, local IP
address, remote IP address, local port, and remote port) information. [...]
For vxlan and geneve, this can create a massive number of UDP flows, which
then run into the limit if stale flows are not evicted fast enough. One
option to mitigate this for vxlan is to narrow the source port range via
IFLA_VXLAN_PORT_RANGE while still benefiting from RSS. However, geneve
currently does not have this option and spreads traffic across the full
source port range of [1, USHRT_MAX]. To overcome this limitation for geneve
as well, add an equivalent IFLA_GENEVE_PORT_RANGE setting for users.
Note that struct geneve_config still occupies 2 cachelines on x86-64 after
this change.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://learn.microsoft.com/en-us/azure/virtual-network/virtual-machine-network-throughput [0]
---
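To illustrate the reasoning in the commit message, here is a minimal
userspace sketch of how a 32-bit flow hash can be scaled into a configured
port window. It is modeled on the multiply-and-shift style of scaling that
udp_flow_src_port() is understood to use, but it is not the kernel helper:
scale_hash_to_port() is a hypothetical name, and the real function also
obfuscates the hash and returns a big-endian value, so include/net/udp.h
remains the reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a kernel API: scale a 32-bit flow hash into
 * [port_min, port_max) with a multiply and shift instead of a modulo.
 */
static uint16_t scale_hash_to_port(uint32_t hash, uint16_t port_min,
				   uint16_t port_max)
{
	uint32_t span;

	/* Same condition geneve_validate() enforces (high must not be < low). */
	assert(port_min <= port_max);
	span = (uint32_t)port_max - port_min;

	/* (hash * span) >> 32 is always below span, or 0 when span is 0. */
	return (uint16_t)((((uint64_t)hash * span) >> 32) + port_min);
}
```

Under these assumptions, the 4000-5000 range from the example in patch 2/2
maps every hash onto one of 1000 source ports, so flows are still spread
for RSS while the number of distinct outer 5-tuples towards a given remote
stays bounded by the size of the window.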
drivers/net/geneve.c | 52 +++++++++++++++++++++++++++++++-----
include/uapi/linux/if_link.h | 6 +++++
2 files changed, 52 insertions(+), 6 deletions(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index dbb3960126ee..9a3ea0042900 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -57,6 +57,8 @@ struct geneve_config {
bool ttl_inherit;
enum ifla_geneve_df df;
bool inner_proto_inherit;
+ u16 port_min;
+ u16 port_max;
};
/* Pseudo network device */
@@ -835,7 +837,8 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
use_cache = ip_tunnel_dst_cache_usable(skb, info);
tos = geneve_get_dsfield(skb, dev, info, &use_cache);
- sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
+ sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min,
+ geneve->cfg.port_max, true);
rt = udp_tunnel_dst_lookup(skb, dev, geneve->net, 0, &saddr,
&info->key,
@@ -945,7 +948,8 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
use_cache = ip_tunnel_dst_cache_usable(skb, info);
prio = geneve_get_dsfield(skb, dev, info, &use_cache);
- sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
+ sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min,
+ geneve->cfg.port_max, true);
dst = udp_tunnel6_dst_lookup(skb, dev, geneve->net, gs6->sock, 0,
&saddr, key, sport,
@@ -1083,8 +1087,8 @@ static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
use_cache = ip_tunnel_dst_cache_usable(skb, info);
tos = geneve_get_dsfield(skb, dev, info, &use_cache);
- sport = udp_flow_src_port(geneve->net, skb,
- 1, USHRT_MAX, true);
+ sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min,
+ geneve->cfg.port_max, true);
rt = udp_tunnel_dst_lookup(skb, dev, geneve->net, 0, &saddr,
&info->key,
@@ -1109,8 +1113,8 @@ static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
use_cache = ip_tunnel_dst_cache_usable(skb, info);
prio = geneve_get_dsfield(skb, dev, info, &use_cache);
- sport = udp_flow_src_port(geneve->net, skb,
- 1, USHRT_MAX, true);
+ sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min,
+ geneve->cfg.port_max, true);
dst = udp_tunnel6_dst_lookup(skb, dev, geneve->net, gs6->sock, 0,
&saddr, &info->key, sport,
@@ -1234,6 +1238,7 @@ static const struct nla_policy geneve_policy[IFLA_GENEVE_MAX + 1] = {
[IFLA_GENEVE_TTL_INHERIT] = { .type = NLA_U8 },
[IFLA_GENEVE_DF] = { .type = NLA_U8 },
[IFLA_GENEVE_INNER_PROTO_INHERIT] = { .type = NLA_FLAG },
+ [IFLA_GENEVE_PORT_RANGE] = NLA_POLICY_EXACT_LEN(sizeof(struct ifla_geneve_port_range)),
};
static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
@@ -1279,6 +1284,17 @@ static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
}
}
+ if (data[IFLA_GENEVE_PORT_RANGE]) {
+ const struct ifla_geneve_port_range *p =
+ nla_data(data[IFLA_GENEVE_PORT_RANGE]);
+
+ if (p->high < p->low) {
+ NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_PORT_RANGE],
+ "Invalid source port range");
+ return -EINVAL;
+ }
+ }
+
return 0;
}
@@ -1506,6 +1522,18 @@ static int geneve_nl2info(struct nlattr *tb[], struct nlattr *data[],
info->key.tp_dst = nla_get_be16(data[IFLA_GENEVE_PORT]);
}
+ if (data[IFLA_GENEVE_PORT_RANGE]) {
+ const struct ifla_geneve_port_range *p =
+ nla_data(data[IFLA_GENEVE_PORT_RANGE]);
+
+ if (changelink) {
+ attrtype = IFLA_GENEVE_PORT_RANGE;
+ goto change_notsup;
+ }
+ cfg->port_min = p->low;
+ cfg->port_max = p->high;
+ }
+
if (data[IFLA_GENEVE_COLLECT_METADATA]) {
if (changelink) {
attrtype = IFLA_GENEVE_COLLECT_METADATA;
@@ -1623,6 +1651,8 @@ static int geneve_newlink(struct net *net, struct net_device *dev,
.use_udp6_rx_checksums = false,
.ttl_inherit = false,
.collect_md = false,
+ .port_min = 1,
+ .port_max = USHRT_MAX,
};
int err;
@@ -1741,6 +1771,7 @@ static size_t geneve_get_size(const struct net_device *dev)
nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_UDP_ZERO_CSUM6_RX */
nla_total_size(sizeof(__u8)) + /* IFLA_GENEVE_TTL_INHERIT */
nla_total_size(0) + /* IFLA_GENEVE_INNER_PROTO_INHERIT */
+ nla_total_size(sizeof(struct ifla_geneve_port_range)) + /* IFLA_GENEVE_PORT_RANGE */
0;
}
@@ -1750,6 +1781,10 @@ static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev)
struct ip_tunnel_info *info = &geneve->cfg.info;
bool ttl_inherit = geneve->cfg.ttl_inherit;
bool metadata = geneve->cfg.collect_md;
+ struct ifla_geneve_port_range ports = {
+ .low = geneve->cfg.port_min,
+ .high = geneve->cfg.port_max,
+ };
__u8 tmp_vni[3];
__u32 vni;
@@ -1806,6 +1841,9 @@ static int geneve_fill_info(struct sk_buff *skb, const struct net_device *dev)
nla_put_flag(skb, IFLA_GENEVE_INNER_PROTO_INHERIT))
goto nla_put_failure;
+ if (nla_put(skb, IFLA_GENEVE_PORT_RANGE, sizeof(ports), &ports))
+ goto nla_put_failure;
+
return 0;
nla_put_failure:
@@ -1838,6 +1876,8 @@ struct net_device *geneve_dev_create_fb(struct net *net, const char *name,
.use_udp6_rx_checksums = true,
.ttl_inherit = false,
.collect_md = true,
+ .port_min = 1,
+ .port_max = USHRT_MAX,
};
memset(tb, 0, sizeof(tb));
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index bfe880fbbb24..730fecdb51a5 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1438,6 +1438,7 @@ enum {
IFLA_GENEVE_TTL_INHERIT,
IFLA_GENEVE_DF,
IFLA_GENEVE_INNER_PROTO_INHERIT,
+ IFLA_GENEVE_PORT_RANGE,
__IFLA_GENEVE_MAX
};
#define IFLA_GENEVE_MAX (__IFLA_GENEVE_MAX - 1)
@@ -1450,6 +1451,11 @@ enum ifla_geneve_df {
GENEVE_DF_MAX = __GENEVE_DF_END - 1,
};
+struct ifla_geneve_port_range {
+ __u16 low;
+ __u16 high;
+};
+
/* Bareudp section */
enum {
IFLA_BAREUDP_UNSPEC,
--
2.43.0
* [PATCH net-next 2/2] geneve, specs: Add port range to rt_link specification
From: Daniel Borkmann @ 2025-02-24 15:39 UTC (permalink / raw)
To: kuba; +Cc: pabeni, netdev
Add the port range to the rt_link specification. Example:
  # tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/rt_link.yaml \
      --do getlink --json '{"ifname": "geneve1"}' --output-json | jq
  {
    "ifname": "geneve1",
    [...]
    "linkinfo": {
      "kind": "geneve",
      "data": {
        "id": 1000,
        "remote": "147.28.227.100",
        "udp-csum": 0,
        "ttl": 0,
        "tos": 0,
        "label": 0,
        "df": 0,
        "port": 49431,
        "udp-zero-csum6-rx": 1,
        "ttl-inherit": 0,
        "port-range": {
          "low": 4000,
          "high": 5000
        }
      }
    },
    [...]
  }
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
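For readers consuming the attribute outside of YNL, the following is a
rough sketch of decoding the 4-byte binary payload into low/high. It
assumes the layout from patch 1/2 as posted here (two host-order u16
fields); struct port_range_payload and decode_port_range() are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring struct ifla_geneve_port_range from patch 1/2. */
struct port_range_payload {
	uint16_t low;
	uint16_t high;
};

static_assert(sizeof(struct port_range_payload) == 4,
	      "two u16 fields, no padding expected");

/* Hypothetical helper: copy an attribute payload into the struct.
 * Returns 0 on success, -1 if the length is not exactly the struct size
 * (the kernel policy uses NLA_POLICY_EXACT_LEN for the same purpose).
 */
static int decode_port_range(const void *payload, size_t len,
			     struct port_range_payload *out)
{
	if (len != sizeof(*out))
		return -1;
	memcpy(out, payload, sizeof(*out));	/* avoids unaligned access */
	return 0;
}
```

A caller would pass the attribute payload pointer and length obtained from
its netlink parsing library of choice and then read out->low and out->high.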
Documentation/netlink/specs/rt_link.yaml | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/Documentation/netlink/specs/rt_link.yaml b/Documentation/netlink/specs/rt_link.yaml
index 0d492500c7e5..de07c8ba2df4 100644
--- a/Documentation/netlink/specs/rt_link.yaml
+++ b/Documentation/netlink/specs/rt_link.yaml
@@ -770,6 +770,16 @@ definitions:
-
name: to
type: u32
+ -
+ name: ifla-geneve-port-range
+ type: struct
+ members:
+ -
+ name: low
+ type: u16
+ -
+ name: high
+ type: u16
-
name: ifla-vf-mac
type: struct
@@ -1915,6 +1925,10 @@ attribute-sets:
-
name: inner-proto-inherit
type: flag
+ -
+ name: port-range
+ type: binary
+ struct: ifla-geneve-port-range
-
name: linkinfo-iptun-attrs
name-prefix: ifla-iptun-
--
2.43.0
* Re: [PATCH net-next 1/2] geneve: Allow users to specify source port range
From: Jakub Kicinski @ 2025-02-26 2:26 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: pabeni, netdev
On Mon, 24 Feb 2025 16:39:26 +0100 Daniel Borkmann wrote:
> @@ -1083,8 +1087,8 @@ static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
>
> use_cache = ip_tunnel_dst_cache_usable(skb, info);
> tos = geneve_get_dsfield(skb, dev, info, &use_cache);
> - sport = udp_flow_src_port(geneve->net, skb,
> - 1, USHRT_MAX, true);
> + sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min,
nit: we do still prefer breaking at 80 columns if it doesn't make code
less readable.
> + geneve->cfg.port_max, true);
>
> rt = udp_tunnel_dst_lookup(skb, dev, geneve->net, 0, &saddr,
> &info->key,
> @@ -1279,6 +1284,17 @@ static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
> }
> }
>
> + if (data[IFLA_GENEVE_PORT_RANGE]) {
> + const struct ifla_geneve_port_range *p =
> + nla_data(data[IFLA_GENEVE_PORT_RANGE]);
nit: would be more readable as fully separate assignment
> + if (p->high < p->low) {
> + NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_PORT_RANGE],
> + "Invalid source port range");
> + return -EINVAL;
> + }
> + }
> +
> return 0;
> }
> @@ -1450,6 +1451,11 @@ enum ifla_geneve_df {
> GENEVE_DF_MAX = __GENEVE_DF_END - 1,
> };
>
> +struct ifla_geneve_port_range {
> + __u16 low;
> + __u16 high;
I agree with the choice in abstract, but since VXLAN uses byte swapped
fields I think we may be setting an annoying trap for user space
implementations. I'd err on the side of consistency. No?
> +};
> +
> /* Bareudp section */
> enum {
> IFLA_BAREUDP_UNSPEC,
--
pw-bot: cr
* Re: [PATCH net-next 1/2] geneve: Allow users to specify source port range
From: Daniel Borkmann @ 2025-02-26 7:03 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: pabeni, netdev
On 2/26/25 3:26 AM, Jakub Kicinski wrote:
> On Mon, 24 Feb 2025 16:39:26 +0100 Daniel Borkmann wrote:
>> @@ -1083,8 +1087,8 @@ static int geneve_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
>>
>> use_cache = ip_tunnel_dst_cache_usable(skb, info);
>> tos = geneve_get_dsfield(skb, dev, info, &use_cache);
>> - sport = udp_flow_src_port(geneve->net, skb,
>> - 1, USHRT_MAX, true);
>> + sport = udp_flow_src_port(geneve->net, skb, geneve->cfg.port_min,
>
> nit: we do still prefer breaking at 80 columns if it doesn't make code
> less readable.
>
>> + geneve->cfg.port_max, true);
>>
>> rt = udp_tunnel_dst_lookup(skb, dev, geneve->net, 0, &saddr,
>> &info->key,
>
>> @@ -1279,6 +1284,17 @@ static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
>> }
>> }
>>
>> + if (data[IFLA_GENEVE_PORT_RANGE]) {
>> + const struct ifla_geneve_port_range *p =
>> + nla_data(data[IFLA_GENEVE_PORT_RANGE]);
>
> nit: would be more readable as fully separate assignment
>
>> + if (p->high < p->low) {
>> + NL_SET_ERR_MSG_ATTR(extack, data[IFLA_GENEVE_PORT_RANGE],
>> + "Invalid source port range");
>> + return -EINVAL;
>> + }
>> + }
>> +
>> return 0;
>> }
>
>> @@ -1450,6 +1451,11 @@ enum ifla_geneve_df {
>> GENEVE_DF_MAX = __GENEVE_DF_END - 1,
>> };
>>
>> +struct ifla_geneve_port_range {
>> + __u16 low;
>> + __u16 high;
>
> I agree with the choice in abstract, but since VXLAN uses byte swapped
> fields I think we may be setting an annoying trap for user space
> implementations. I'd err on the side of consistency. No?
If this is preferred, I can change it. The byte swapping in vxlan is an
odd and needless back-and-forth for sure, but if consistency is preferred
I'll do it the same way and note it in the commit message.
>> +};
>> +
>> /* Bareudp section */
>> enum {
>> IFLA_BAREUDP_UNSPEC,
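To make the byte-order point from the review concrete, here is a small
userspace sketch of what a network-byte-order variant could look like,
following the way struct ifla_vxlan_port_range declares its fields as
__be16. The struct and helper names below are hypothetical and are not
taken from this series.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: same two fields, but big-endian on the wire,
 * as struct ifla_vxlan_port_range does with __be16 low/high.
 */
struct port_range_be {
	uint16_t low;	/* network byte order */
	uint16_t high;	/* network byte order */
};

static_assert(sizeof(struct port_range_be) == 4, "no padding expected");

/* Fill the range from host-order ports; returns -1 for an inverted range. */
static int port_range_be_fill(struct port_range_be *p, uint16_t low,
			      uint16_t high)
{
	if (high < low)
		return -1;
	p->low = htons(low);
	p->high = htons(high);
	return 0;
}

/* The check must convert back first: comparing the raw big-endian
 * values directly gives the wrong ordering on little-endian hosts.
 */
static int port_range_be_valid(const struct port_range_be *p)
{
	return ntohs(p->high) >= ntohs(p->low);
}
```

The trap is easy to hit: for the range 255-256, the raw stored values on a
little-endian host are 0xff00 and 0x0001, so a comparison without ntohs()
would wrongly reject it.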