* [PATCH net-next 1/4] net: route: expire rt if the dst it holds is expired
2024-03-07 17:11 [PATCH net-next 0/4] net: route: improve route hinting Leone Fernando
@ 2024-03-07 17:11 ` Leone Fernando
2024-03-07 17:12 ` [PATCH net-next 2/4] net: dst_cache: add input_dst_cache API Leone Fernando
` (3 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Leone Fernando @ 2024-03-07 17:11 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemb
Cc: netdev, linux-kernel, Leone Fernando
The function rt_is_expired is used to verify that a cached dst is valid.
Currently, this function ignores the rt.dst->expires value.
Add a check to rt_is_expired that validates that the dst is not expired.
Signed-off-by: Leone Fernando <leone4fernando@gmail.com>
---
net/ipv4/route.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index c8f76f56dc16..73e742424e97 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -392,7 +392,8 @@ static inline int ip_rt_proc_init(void)
static inline bool rt_is_expired(const struct rtable *rth)
{
- return rth->rt_genid != rt_genid_ipv4(dev_net(rth->dst.dev));
+ return rth->rt_genid != rt_genid_ipv4(dev_net(rth->dst.dev)) ||
+ (rth->dst.expires && time_after(jiffies, rth->dst.expires));
}
void rt_cache_flush(struct net *net)
--
2.34.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH net-next 2/4] net: dst_cache: add input_dst_cache API
2024-03-07 17:11 [PATCH net-next 0/4] net: route: improve route hinting Leone Fernando
2024-03-07 17:11 ` [PATCH net-next 1/4] net: route: expire rt if the dst it holds is expired Leone Fernando
@ 2024-03-07 17:12 ` Leone Fernando
2024-03-09 3:55 ` Jakub Kicinski
2024-03-07 17:12 ` [PATCH net-next 3/4] net: route: always compile dst_cache Leone Fernando
` (2 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Leone Fernando @ 2024-03-07 17:12 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemb
Cc: netdev, linux-kernel, Leone Fernando
The input_dst_cache allows fast lookup of frequently encountered dsts.
In order to provide stable results, I implemented a simple linear
hashtable with each bucket containing a constant amount of
entries (DST_CACHE_INPUT_BUCKET_SIZE).
Similarly to how the route hint is used, I defined the hashtable key to
contain the daddr and the tos of the IP header.
Lookup is performed in a straightforward manner: start at the bucket head
corresponding the hashed key and search the following
DST_CACHE_INPUT_BUCKET_SIZE entries of the array for a matching key.
When inserting a new dst to the cache, if all the bucket entries are
full, the oldest one is deleted to make room for the new dst.
Signed-off-by: Leone Fernando <leone4fernando@gmail.com>
---
include/net/dst_cache.h | 68 +++++++++++++++++++++
net/core/dst_cache.c | 132 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 200 insertions(+)
diff --git a/include/net/dst_cache.h b/include/net/dst_cache.h
index df6622a5fe98..1c3b06f08d7d 100644
--- a/include/net/dst_cache.h
+++ b/include/net/dst_cache.h
@@ -7,12 +7,40 @@
#if IS_ENABLED(CONFIG_IPV6)
#include <net/ip6_fib.h>
#endif
+#include <net/ip.h>
+
+#define DST_CACHE_INPUT_SHIFT (9)
+#define DST_CACHE_INPUT_SIZE (1 << DST_CACHE_INPUT_SHIFT)
+#define DST_CACHE_INPUT_BUCKET_SIZE (4)
+#define DST_CACHE_INPUT_HASH_MASK (~(DST_CACHE_INPUT_BUCKET_SIZE - 1))
+#define INVALID_DST_CACHE_INPUT_KEY (~(u64)(0))
struct dst_cache {
struct dst_cache_pcpu __percpu *cache;
unsigned long reset_ts;
};
+extern unsigned int dst_cache_net_id __read_mostly;
+
+/**
+ * idst_for_each_in_bucket - iterate over a dst cache bucket
+ * @pos: the type * to use as a loop cursor
+ * @head: the head of the cpu dst cache.
+ * @hash: the hash of the bucket
+ */
+#define idst_for_each_in_bucket(pos, head, hash) \
+ for (pos = &head[hash]; \
+ pos < &head[hash + DST_CACHE_INPUT_BUCKET_SIZE]; \
+ pos++)
+
+/**
+ * idst_for_each_in_cache - iterate over the dst cache
+ * @pos: the type * to use as a loop cursor
+ * @head: the head of the cpu dst cache.
+ */
+#define idst_for_each_in_cache(pos, head) \
+ for (pos = head; pos < head + DST_CACHE_INPUT_SIZE; pos++)
+
/**
* dst_cache_get - perform cache lookup
* @dst_cache: the cache
@@ -106,4 +134,44 @@ int dst_cache_init(struct dst_cache *dst_cache, gfp_t gfp);
*/
void dst_cache_destroy(struct dst_cache *dst_cache);
+/**
+ * dst_cache_input_get_noref - perform lookup in the input cache,
+ * return a noref dst
+ * @dst_cache: the input cache
+ * @skb: the packet according to which the dst entry will be searched
+ * local BH must be disabled.
+ */
+struct dst_entry *dst_cache_input_get_noref(struct dst_cache *dst_cache,
+ struct sk_buff *skb);
+
+/**
+ * dst_cache_input_add - add the dst of the given skb to the input cache.
+ *
+ * in case the cache bucket is full, the oldest entry will be deleted
+ * and replaced with the new one.
+ * @dst_cache: the input cache
+ * @skb: The packet according to which the dst entry will be searched
+ *
+ * local BH must be disabled.
+ */
+void dst_cache_input_add(struct dst_cache *dst_cache,
+ const struct sk_buff *skb);
+
+/**
+ * dst_cache_input_init - initialize the input cache,
+ * allocating the required storage
+ */
+int __init dst_cache_input_init(void);
+
+static inline u64 create_dst_cache_key_ip4(const struct sk_buff *skb)
+{
+ struct iphdr *iphdr = ip_hdr(skb);
+
+ return (((u64)iphdr->daddr) << 8) | iphdr->tos;
+}
+
+static inline u32 hash_dst_cache_key(u64 key)
+{
+ return hash_64(key, DST_CACHE_INPUT_SHIFT) & DST_CACHE_INPUT_HASH_MASK;
+}
#endif
diff --git a/net/core/dst_cache.c b/net/core/dst_cache.c
index 0ccfd5fa5cb9..245f5546a7ff 100644
--- a/net/core/dst_cache.c
+++ b/net/core/dst_cache.c
@@ -13,6 +13,7 @@
#include <net/ip6_fib.h>
#endif
#include <uapi/linux/in.h>
+#include <net/netns/generic.h>
struct dst_cache_pcpu {
unsigned long refresh_ts;
@@ -21,9 +22,12 @@ struct dst_cache_pcpu {
union {
struct in_addr in_saddr;
struct in6_addr in6_saddr;
+ u64 key;
};
};
+unsigned int dst_cache_net_id __read_mostly;
+
static void dst_cache_per_cpu_dst_set(struct dst_cache_pcpu *dst_cache,
struct dst_entry *dst, u32 cookie)
{
@@ -181,3 +185,131 @@ void dst_cache_reset_now(struct dst_cache *dst_cache)
}
}
EXPORT_SYMBOL_GPL(dst_cache_reset_now);
+
+static void dst_cache_input_set(struct dst_cache_pcpu *idst,
+ struct dst_entry *dst, u64 key)
+{
+ dst_cache_per_cpu_dst_set(idst, dst, 0);
+ idst->key = key;
+ idst->refresh_ts = jiffies;
+}
+
+static struct dst_entry *__dst_cache_input_get_noref(struct dst_cache_pcpu *idst)
+{
+ struct dst_entry *dst = idst->dst;
+
+ if (unlikely(dst->obsolete && !dst->ops->check(dst, idst->cookie))) {
+ dst_cache_input_set(idst, NULL, INVALID_DST_CACHE_INPUT_KEY);
+ goto fail;
+ }
+
+ idst->refresh_ts = jiffies;
+ return dst;
+
+fail:
+ return NULL;
+}
+
+struct dst_entry *dst_cache_input_get_noref(struct dst_cache *dst_cache,
+ struct sk_buff *skb)
+{
+ struct dst_entry *out_dst = NULL;
+ struct dst_cache_pcpu *pcpu_cache;
+ struct dst_cache_pcpu *idst;
+ u32 hash;
+ u64 key;
+
+ pcpu_cache = this_cpu_ptr(dst_cache->cache);
+ key = create_dst_cache_key_ip4(skb);
+ hash = hash_dst_cache_key(key);
+ idst_for_each_in_bucket(idst, pcpu_cache, hash) {
+ if (key == idst->key) {
+ out_dst = __dst_cache_input_get_noref(idst);
+ goto out;
+ }
+ }
+out:
+ return out_dst;
+}
+
+static void dst_cache_input_reset_now(struct dst_cache *dst_cache)
+{
+ struct dst_cache_pcpu *caches;
+ struct dst_cache_pcpu *idst;
+ struct dst_entry *dst;
+ int i;
+
+ for_each_possible_cpu(i) {
+ caches = per_cpu_ptr(dst_cache->cache, i);
+ idst_for_each_in_cache(idst, caches) {
+ idst->key = INVALID_DST_CACHE_INPUT_KEY;
+ dst = idst->dst;
+ if (dst)
+ dst_release(dst);
+ }
+ }
+}
+
+static int __net_init dst_cache_input_net_init(struct net *net)
+{
+ struct dst_cache *dst_cache = net_generic(net, dst_cache_net_id);
+
+ dst_cache->cache = (struct dst_cache_pcpu __percpu *)alloc_percpu_gfp(struct dst_cache_pcpu[DST_CACHE_INPUT_SIZE],
+ GFP_KERNEL | __GFP_ZERO);
+ if (!dst_cache->cache)
+ return -ENOMEM;
+
+ dst_cache_input_reset_now(dst_cache);
+ return 0;
+}
+
+static void __net_exit dst_cache_input_net_exit(struct net *net)
+{
+ struct dst_cache *dst_cache = net_generic(net, dst_cache_net_id);
+
+ dst_cache_input_reset_now(dst_cache);
+ free_percpu(dst_cache->cache);
+ dst_cache->cache = NULL;
+}
+
+static bool idst_empty(struct dst_cache_pcpu *idst)
+{
+ return idst->key == INVALID_DST_CACHE_INPUT_KEY;
+}
+
+void dst_cache_input_add(struct dst_cache *dst_cache, const struct sk_buff *skb)
+{
+ struct dst_cache_pcpu *entry = NULL;
+ struct dst_cache_pcpu *pcpu_cache;
+ struct dst_cache_pcpu *idst;
+ u32 hash;
+ u64 key;
+
+ pcpu_cache = this_cpu_ptr(dst_cache->cache);
+ key = create_dst_cache_key_ip4(skb);
+ hash = hash_dst_cache_key(key);
+ idst_for_each_in_bucket(idst, pcpu_cache, hash) {
+ if (idst_empty(idst)) {
+ entry = idst;
+ goto add_to_cache;
+ }
+ if (!entry || time_before(idst->refresh_ts, entry->refresh_ts))
+ entry = idst;
+ }
+
+add_to_cache:
+ dst_cache_input_set(entry, skb_dst(skb), key);
+}
+
+static struct pernet_operations dst_cache_input_ops __net_initdata = {
+ .init = dst_cache_input_net_init,
+ .exit = dst_cache_input_net_exit,
+ .id = &dst_cache_net_id,
+ .size = sizeof(struct dst_cache),
+};
+
+int __init dst_cache_input_init(void)
+{
+ return register_pernet_subsys(&dst_cache_input_ops);
+}
+subsys_initcall(dst_cache_input_init);
--
2.34.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH net-next 3/4] net: route: always compile dst_cache
2024-03-07 17:11 [PATCH net-next 0/4] net: route: improve route hinting Leone Fernando
2024-03-07 17:11 ` [PATCH net-next 1/4] net: route: expire rt if the dst it holds is expired Leone Fernando
2024-03-07 17:12 ` [PATCH net-next 2/4] net: dst_cache: add input_dst_cache API Leone Fernando
@ 2024-03-07 17:12 ` Leone Fernando
2024-03-07 17:12 ` [PATCH net-next 4/4] net: route: replace route hints with input_dst_cache Leone Fernando
2024-03-09 4:53 ` [PATCH net-next 0/4] net: route: improve route hinting David Ahern
4 siblings, 0 replies; 13+ messages in thread
From: Leone Fernando @ 2024-03-07 17:12 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemb
Cc: netdev, linux-kernel, Leone Fernando
make dst_cache always compile, delete all relevant ifdefs
Signed-off-by: Leone Fernando <leone4fernando@gmail.com>
---
drivers/net/Kconfig | 1 -
include/net/dst_metadata.h | 2 --
include/net/ip_tunnels.h | 2 --
net/Kconfig | 4 ----
net/core/Makefile | 3 +--
net/core/dst.c | 4 ----
net/ipv4/Kconfig | 1 -
net/ipv4/ip_tunnel_core.c | 4 ----
net/ipv4/udp_tunnel_core.c | 4 ----
net/ipv6/Kconfig | 4 ----
net/ipv6/ip6_udp_tunnel.c | 4 ----
net/netfilter/nft_tunnel.c | 2 --
net/openvswitch/Kconfig | 1 -
net/sched/act_tunnel_key.c | 2 --
14 files changed, 1 insertion(+), 37 deletions(-)
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 8ca0bc223b30..1be1ec8368b6 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -78,7 +78,6 @@ config WIREGUARD
depends on IPV6 || !IPV6
depends on !KMSAN # KMSAN doesn't support the crypto configs below
select NET_UDP_TUNNEL
- select DST_CACHE
select CRYPTO
select CRYPTO_LIB_CURVE25519
select CRYPTO_LIB_CHACHA20POLY1305
diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 1b7fae4c6b24..c41a857bf0ed 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -165,7 +165,6 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
memcpy(&new_md->u.tun_info, &md_dst->u.tun_info,
sizeof(struct ip_tunnel_info) + md_size);
-#ifdef CONFIG_DST_CACHE
/* Unclone the dst cache if there is one */
if (new_md->u.tun_info.dst_cache.cache) {
int ret;
@@ -176,7 +175,6 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
return ERR_PTR(ret);
}
}
-#endif
skb_dst_drop(skb);
skb_dst_set(skb, &new_md->dst);
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 5cd64bb2104d..2fe04edc23b9 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -84,9 +84,7 @@ struct ip_tunnel_encap {
struct ip_tunnel_info {
struct ip_tunnel_key key;
struct ip_tunnel_encap encap;
-#ifdef CONFIG_DST_CACHE
struct dst_cache dst_cache;
-#endif
u8 options_len;
u8 mode;
};
diff --git a/net/Kconfig b/net/Kconfig
index 3e57ccf0da27..21caffd1758d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -438,10 +438,6 @@ config LWTUNNEL_BPF
Allows to run BPF programs as a nexthop action following a route
lookup for incoming and outgoing packets.
-config DST_CACHE
- bool
- default n
-
config GRO_CELLS
bool
default n
diff --git a/net/core/Makefile b/net/core/Makefile
index 821aec06abf1..53582fef633b 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -13,7 +13,7 @@ obj-y += dev.o dev_addr_lists.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
fib_notifier.o xdp.o flow_offload.o gro.o \
- netdev-genl.o netdev-genl-gen.o gso.o
+ netdev-genl.o netdev-genl-gen.o gso.o dst_cache.o
obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
@@ -32,7 +32,6 @@ obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
-obj-$(CONFIG_DST_CACHE) += dst_cache.o
obj-$(CONFIG_HWBM) += hwbm.o
obj-$(CONFIG_GRO_CELLS) += gro_cells.o
obj-$(CONFIG_FAILOVER) += failover.o
diff --git a/net/core/dst.c b/net/core/dst.c
index 95f533844f17..f035c39be104 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -291,10 +291,8 @@ EXPORT_SYMBOL_GPL(metadata_dst_alloc);
void metadata_dst_free(struct metadata_dst *md_dst)
{
-#ifdef CONFIG_DST_CACHE
if (md_dst->type == METADATA_IP_TUNNEL)
dst_cache_destroy(&md_dst->u.tun_info.dst_cache);
-#endif
if (md_dst->type == METADATA_XFRM)
dst_release(md_dst->u.xfrm_info.dst_orig);
kfree(md_dst);
@@ -326,10 +324,8 @@ void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst)
for_each_possible_cpu(cpu) {
struct metadata_dst *one_md_dst = per_cpu_ptr(md_dst, cpu);
-#ifdef CONFIG_DST_CACHE
if (one_md_dst->type == METADATA_IP_TUNNEL)
dst_cache_destroy(&one_md_dst->u.tun_info.dst_cache);
-#endif
if (one_md_dst->type == METADATA_XFRM)
dst_release(one_md_dst->u.xfrm_info.dst_orig);
}
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 8e94ed7c56a0..189f716b03e8 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -185,7 +185,6 @@ config NET_IPGRE_DEMUX
config NET_IP_TUNNEL
tristate
- select DST_CACHE
select GRO_CELLS
default n
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 80ccd6661aa3..f060d55dc249 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -682,13 +682,11 @@ static int ip_tun_build_state(struct net *net, struct nlattr *attr,
return err;
}
-#ifdef CONFIG_DST_CACHE
err = dst_cache_init(&tun_info->dst_cache, GFP_KERNEL);
if (err) {
lwtstate_free(new_state);
return err;
}
-#endif
if (tb[LWTUNNEL_IP_ID])
tun_info->key.tun_id = nla_get_be64(tb[LWTUNNEL_IP_ID]);
@@ -720,11 +718,9 @@ static int ip_tun_build_state(struct net *net, struct nlattr *attr,
static void ip_tun_destroy_state(struct lwtunnel_state *lwtstate)
{
-#ifdef CONFIG_DST_CACHE
struct ip_tunnel_info *tun_info = lwt_tun_info(lwtstate);
dst_cache_destroy(&tun_info->dst_cache);
-#endif
}
static int ip_tun_fill_encap_opts_geneve(struct sk_buff *skb,
diff --git a/net/ipv4/udp_tunnel_core.c b/net/ipv4/udp_tunnel_core.c
index 860aff5f8599..ecc0990d8cab 100644
--- a/net/ipv4/udp_tunnel_core.c
+++ b/net/ipv4/udp_tunnel_core.c
@@ -215,13 +215,11 @@ struct rtable *udp_tunnel_dst_lookup(struct sk_buff *skb,
struct rtable *rt = NULL;
struct flowi4 fl4;
-#ifdef CONFIG_DST_CACHE
if (dst_cache) {
rt = dst_cache_get_ip4(dst_cache, saddr);
if (rt)
return rt;
}
-#endif
memset(&fl4, 0, sizeof(fl4));
fl4.flowi4_mark = skb->mark;
@@ -244,10 +242,8 @@ struct rtable *udp_tunnel_dst_lookup(struct sk_buff *skb,
ip_rt_put(rt);
return ERR_PTR(-ELOOP);
}
-#ifdef CONFIG_DST_CACHE
if (dst_cache)
dst_cache_set_ip4(dst_cache, &rt->dst, fl4.saddr);
-#endif
*saddr = fl4.saddr;
return rt;
}
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 08d4b7132d4c..093c768d41ab 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -124,7 +124,6 @@ config IPV6_MIP6
config IPV6_ILA
tristate "IPv6: Identifier Locator Addressing (ILA)"
depends on NETFILTER
- select DST_CACHE
select LWTUNNEL
help
Support for IPv6 Identifier Locator Addressing (ILA).
@@ -203,7 +202,6 @@ config IPV6_NDISC_NODETYPE
config IPV6_TUNNEL
tristate "IPv6: IP-in-IPv6 tunnel (RFC2473)"
select INET6_TUNNEL
- select DST_CACHE
select GRO_CELLS
help
Support for IPv6-in-IPv6 and IPv4-in-IPv6 tunnels described in
@@ -291,7 +289,6 @@ config IPV6_SEG6_LWTUNNEL
bool "IPv6: Segment Routing Header encapsulation support"
depends on IPV6
select LWTUNNEL
- select DST_CACHE
select IPV6_MULTIPLE_TABLES
help
Support for encapsulation of packets within an outer IPv6
@@ -333,7 +330,6 @@ config IPV6_IOAM6_LWTUNNEL
bool "IPv6: IOAM Pre-allocated Trace insertion support"
depends on IPV6
select LWTUNNEL
- select DST_CACHE
help
Support for the insertion of IOAM Pre-allocated Trace
Header using the lightweight tunnels mechanism.
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index c99053189ea8..de92aea01cfc 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -145,13 +145,11 @@ struct dst_entry *udp_tunnel6_dst_lookup(struct sk_buff *skb,
struct dst_entry *dst = NULL;
struct flowi6 fl6;
-#ifdef CONFIG_DST_CACHE
if (dst_cache) {
dst = dst_cache_get_ip6(dst_cache, saddr);
if (dst)
return dst;
}
-#endif
memset(&fl6, 0, sizeof(fl6));
fl6.flowi6_mark = skb->mark;
fl6.flowi6_proto = IPPROTO_UDP;
@@ -173,10 +171,8 @@ struct dst_entry *udp_tunnel6_dst_lookup(struct sk_buff *skb,
dst_release(dst);
return ERR_PTR(-ELOOP);
}
-#ifdef CONFIG_DST_CACHE
if (dst_cache)
dst_cache_set_ip6(dst_cache, dst, &fl6.saddr);
-#endif
*saddr = fl6.saddr;
return dst;
}
diff --git a/net/netfilter/nft_tunnel.c b/net/netfilter/nft_tunnel.c
index f735d79d8be5..02cbd7f3f6dc 100644
--- a/net/netfilter/nft_tunnel.c
+++ b/net/netfilter/nft_tunnel.c
@@ -508,13 +508,11 @@ static int nft_tunnel_obj_init(const struct nft_ctx *ctx,
return -ENOMEM;
memcpy(&md->u.tun_info, &info, sizeof(info));
-#ifdef CONFIG_DST_CACHE
err = dst_cache_init(&md->u.tun_info.dst_cache, GFP_KERNEL);
if (err < 0) {
metadata_dst_free(md);
return err;
}
-#endif
ip_tunnel_info_opts_set(&md->u.tun_info, &priv->opts.u, priv->opts.len,
priv->opts.flags);
priv->md = md;
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 29a7081858cd..b7a5ab6374b8 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -13,7 +13,6 @@ config OPENVSWITCH
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
- select DST_CACHE
select NET_NSH
select NF_CONNTRACK_OVS if NF_CONNTRACK
select NF_NAT_OVS if NF_NAT
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 1536f8b16f1b..1da69fa82512 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -476,11 +476,9 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
goto err_out;
}
-#ifdef CONFIG_DST_CACHE
ret = dst_cache_init(&metadata->u.tun_info.dst_cache, GFP_KERNEL);
if (ret)
goto release_tun_meta;
-#endif
if (opts_len) {
ret = tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS],
--
2.34.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH net-next 4/4] net: route: replace route hints with input_dst_cache
2024-03-07 17:11 [PATCH net-next 0/4] net: route: improve route hinting Leone Fernando
` (2 preceding siblings ...)
2024-03-07 17:12 ` [PATCH net-next 3/4] net: route: always compile dst_cache Leone Fernando
@ 2024-03-07 17:12 ` Leone Fernando
2024-03-09 4:53 ` [PATCH net-next 0/4] net: route: improve route hinting David Ahern
4 siblings, 0 replies; 13+ messages in thread
From: Leone Fernando @ 2024-03-07 17:12 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, dsahern, willemb
Cc: netdev, linux-kernel, Leone Fernando
Replace route hints with cached dsts - ip_rcv_finish_core will first try
to use the cache and only then fall back to the demux or perform a full
lookup.
Only add newly found dsts to the cache after all the checks have passed
successfully to avoid adding a dropped packet's dst to the cache.
Multicast dsts are not added to the dst_cache as it will require additional
checks and multicast packets are rarer and a slower path anyway.
A check was added to ip_route_use_dst_cache that prevents forwarding
packets received by devices for which forwarding is disabled.
Relevant checks were added to ip_route_use_dst_cache to make sure the
dst can be used and to ensure IPCB(skb) flags are correct.
Signed-off-by: Leone Fernando <leone4fernando@gmail.com>
---
include/net/route.h | 6 ++--
net/ipv4/ip_input.c | 58 +++++++++++++++++++-----------------
net/ipv4/route.c | 72 +++++++++++++++++++++++++++++++++------------
3 files changed, 88 insertions(+), 48 deletions(-)
diff --git a/include/net/route.h b/include/net/route.h
index d4a0147942f1..0940383719a0 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -186,9 +186,9 @@ int ip_mc_validate_source(struct sk_buff *skb, __be32 daddr, __be32 saddr,
struct in_device *in_dev, u32 *itag);
int ip_route_input_noref(struct sk_buff *skb, __be32 dst, __be32 src,
u8 tos, struct net_device *devin);
-int ip_route_use_hint(struct sk_buff *skb, __be32 dst, __be32 src,
- u8 tos, struct net_device *devin,
- const struct sk_buff *hint);
+int ip_route_use_dst_cache(struct sk_buff *skb, __be32 daddr, __be32 saddr,
+ u8 tos, struct net_device *dev,
+ struct dst_entry *dst);
static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
u8 tos, struct net_device *devin)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 5e9c8156656a..35c8b122d62f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -305,30 +305,44 @@ static inline bool ip_rcv_options(struct sk_buff *skb, struct net_device *dev)
return true;
}
-static bool ip_can_use_hint(const struct sk_buff *skb, const struct iphdr *iph,
- const struct sk_buff *hint)
+static bool ip_can_add_dst_cache(struct sk_buff *skb, __u16 rt_type)
{
- return hint && !skb_dst(skb) && ip_hdr(hint)->daddr == iph->daddr &&
- ip_hdr(hint)->tos == iph->tos;
+ return skb_valid_dst(skb) &&
+ rt_type != RTN_BROADCAST &&
+ rt_type != RTN_MULTICAST &&
+ !(IPCB(skb)->flags & IPSKB_MULTIPATH);
+}
+
+static bool ip_can_use_dst_cache(const struct net *net, struct sk_buff *skb)
+{
+ return !skb_dst(skb) && !fib4_has_custom_rules(net);
}
int tcp_v4_early_demux(struct sk_buff *skb);
int udp_v4_early_demux(struct sk_buff *skb);
static int ip_rcv_finish_core(struct net *net, struct sock *sk,
- struct sk_buff *skb, struct net_device *dev,
- const struct sk_buff *hint)
+ struct sk_buff *skb, struct net_device *dev)
{
+ struct dst_cache *dst_cache = net_generic(net, dst_cache_net_id);
const struct iphdr *iph = ip_hdr(skb);
+ struct dst_entry *dst;
int err, drop_reason;
struct rtable *rt;
+ bool do_cache;
drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
- if (ip_can_use_hint(skb, iph, hint)) {
- err = ip_route_use_hint(skb, iph->daddr, iph->saddr, iph->tos,
- dev, hint);
- if (unlikely(err))
- goto drop_error;
+ do_cache = ip_can_use_dst_cache(net, skb);
+ if (do_cache) {
+ dst = dst_cache_input_get_noref(dst_cache, skb);
+ if (dst) {
+ err = ip_route_use_dst_cache(skb, iph->daddr,
+ iph->saddr, iph->tos,
+ dev, dst);
+ if (unlikely(err))
+ goto drop_error;
+ do_cache = false;
+ }
}
if (READ_ONCE(net->ipv4.sysctl_ip_early_demux) &&
@@ -418,6 +432,9 @@ static int ip_rcv_finish_core(struct net *net, struct sock *sk,
}
}
+ if (do_cache && ip_can_add_dst_cache(skb, rt->rt_type))
+ dst_cache_input_add(dst_cache, skb);
+
return NET_RX_SUCCESS;
drop:
@@ -444,7 +461,7 @@ static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
if (!skb)
return NET_RX_SUCCESS;
- ret = ip_rcv_finish_core(net, sk, skb, dev, NULL);
+ ret = ip_rcv_finish_core(net, sk, skb, dev);
if (ret != NET_RX_DROP)
ret = dst_input(skb);
return ret;
@@ -581,21 +598,11 @@ static void ip_sublist_rcv_finish(struct list_head *head)
}
}
-static struct sk_buff *ip_extract_route_hint(const struct net *net,
- struct sk_buff *skb, int rt_type)
-{
- if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST ||
- IPCB(skb)->flags & IPSKB_MULTIPATH)
- return NULL;
-
- return skb;
-}
-
static void ip_list_rcv_finish(struct net *net, struct sock *sk,
struct list_head *head)
{
- struct sk_buff *skb, *next, *hint = NULL;
struct dst_entry *curr_dst = NULL;
+ struct sk_buff *skb, *next;
struct list_head sublist;
INIT_LIST_HEAD(&sublist);
@@ -610,14 +617,11 @@ static void ip_list_rcv_finish(struct net *net, struct sock *sk,
skb = l3mdev_ip_rcv(skb);
if (!skb)
continue;
- if (ip_rcv_finish_core(net, sk, skb, dev, hint) == NET_RX_DROP)
+ if (ip_rcv_finish_core(net, sk, skb, dev) == NET_RX_DROP)
continue;
dst = skb_dst(skb);
if (curr_dst != dst) {
- hint = ip_extract_route_hint(net, skb,
- ((struct rtable *)dst)->rt_type);
-
/* dispatch old sublist */
if (!list_empty(&sublist))
ip_sublist_rcv_finish(&sublist);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 73e742424e97..e7de683f7b49 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1787,6 +1787,24 @@ static void ip_handle_martian_source(struct net_device *dev,
#endif
}
+static void ip_route_set_doredirect(struct in_device *in_dev,
+ struct in_device *out_dev,
+ struct sk_buff *skb,
+ u8 gw_family,
+ __be32 gw4,
+ __be32 saddr)
+{
+ if (out_dev == in_dev && IN_DEV_TX_REDIRECTS(out_dev) &&
+ skb->protocol == htons(ETH_P_IP)) {
+ __be32 gw;
+
+ gw = gw_family == AF_INET ? gw4 : 0;
+ if (IN_DEV_SHARED_MEDIA(out_dev) ||
+ inet_addr_onlink(out_dev, saddr, gw))
+ IPCB(skb)->flags |= IPSKB_DOREDIRECT;
+ }
+}
+
/* called in rcu_read_lock() section */
static int __mkroute_input(struct sk_buff *skb,
const struct fib_result *res,
@@ -1819,15 +1837,10 @@ static int __mkroute_input(struct sk_buff *skb,
}
do_cache = res->fi && !itag;
- if (out_dev == in_dev && err && IN_DEV_TX_REDIRECTS(out_dev) &&
- skb->protocol == htons(ETH_P_IP)) {
- __be32 gw;
-
- gw = nhc->nhc_gw_family == AF_INET ? nhc->nhc_gw.ipv4 : 0;
- if (IN_DEV_SHARED_MEDIA(out_dev) ||
- inet_addr_onlink(out_dev, saddr, gw))
- IPCB(skb)->flags |= IPSKB_DOREDIRECT;
- }
+ if (err)
+ ip_route_set_doredirect(in_dev, out_dev, skb,
+ nhc->nhc_gw_family,
+ nhc->nhc_gw.ipv4, saddr);
if (skb->protocol != htons(ETH_P_IP)) {
/* Not IP (i.e. ARP). Do not create route, if it is
@@ -2157,14 +2170,15 @@ static int ip_mkroute_input(struct sk_buff *skb,
/* Implements all the saddr-related checks as ip_route_input_slow(),
* assuming daddr is valid and the destination is not a local broadcast one.
- * Uses the provided hint instead of performing a route lookup.
+ * Uses the provided dst from dst_cache instead of performing a route lookup.
*/
-int ip_route_use_hint(struct sk_buff *skb, __be32 daddr, __be32 saddr,
- u8 tos, struct net_device *dev,
- const struct sk_buff *hint)
+int ip_route_use_dst_cache(struct sk_buff *skb, __be32 daddr, __be32 saddr,
+ u8 tos, struct net_device *dev,
+ struct dst_entry *dst)
{
+ struct in_device *out_dev = __in_dev_get_rcu(dst->dev);
struct in_device *in_dev = __in_dev_get_rcu(dev);
- struct rtable *rt = skb_rtable(hint);
+ struct rtable *rt = (struct rtable *)dst;
struct net *net = dev_net(dev);
int err = -EINVAL;
u32 tag = 0;
@@ -2178,21 +2192,43 @@ int ip_route_use_hint(struct sk_buff *skb, __be32 daddr, __be32 saddr,
if (ipv4_is_loopback(saddr) && !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
goto martian_source;
- if (rt->rt_type != RTN_LOCAL)
- goto skip_validate_source;
+ if (ipv4_is_loopback(daddr) && !IN_DEV_NET_ROUTE_LOCALNET(in_dev, net))
+ goto martian_destination;
+ if (rt->rt_type != RTN_LOCAL) {
+ if (!IN_DEV_FORWARD(in_dev)) {
+ err = -EHOSTUNREACH;
+ goto out_err;
+ }
+ goto skip_validate_source;
+ }
tos &= IPTOS_RT_MASK;
err = fib_validate_source(skb, saddr, daddr, tos, 0, dev, in_dev, &tag);
if (err < 0)
goto martian_source;
+ if (err)
+ ip_route_set_doredirect(in_dev, out_dev, skb, rt->rt_gw_family,
+ rt->rt_gw4, saddr);
+
skip_validate_source:
- skb_dst_copy(skb, hint);
+ skb_dst_set_noref(skb, dst);
return 0;
martian_source:
ip_handle_martian_source(dev, in_dev, skb, daddr, saddr);
+out_err:
return err;
+
+martian_destination:
+ RT_CACHE_STAT_INC(in_martian_dst);
+#ifdef CONFIG_IP_ROUTE_VERBOSE
+ if (IN_DEV_LOG_MARTIANS(in_dev))
+ net_warn_ratelimited("martian destination %pI4 from %pI4, dev %s\n",
+ &daddr, &saddr, dev->name);
+#endif
+ err = -EINVAL;
+ goto out_err;
}
/* get device for dst_alloc with local routes */
@@ -2213,7 +2249,7 @@ static struct net_device *ip_rt_get_dev(struct net *net,
* addresses, because every properly looped back packet
* must have correct destination already attached by output routine.
* Changes in the enforced policies must be applied also to
- * ip_route_use_hint().
+ * ip_route_use_dst_cache().
*
* Such approach solves two big problems:
* 1. Not simplex devices are handled properly.
--
2.34.1
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH net-next 0/4] net: route: improve route hinting
2024-03-07 17:11 [PATCH net-next 0/4] net: route: improve route hinting Leone Fernando
` (3 preceding siblings ...)
2024-03-07 17:12 ` [PATCH net-next 4/4] net: route: replace route hints with input_dst_cache Leone Fernando
@ 2024-03-09 4:53 ` David Ahern
2024-03-12 15:38 ` Leone Fernando
2024-04-02 10:08 ` Leone Fernando
4 siblings, 2 replies; 13+ messages in thread
From: David Ahern @ 2024-03-09 4:53 UTC (permalink / raw)
To: Leone Fernando, davem, edumazet, kuba, pabeni, willemb
Cc: netdev, linux-kernel
On 3/7/24 10:11 AM, Leone Fernando wrote:
> In 2017, Paolo Abeni introduced the hinting mechanism [1] to the routing
> sub-system. The hinting optimization improves performance by reusing
> previously found dsts instead of looking them up for each skb.
>
> This patch series introduces a generalized version of the hinting mechanism that
> can "remember" a larger number of dsts. This reduces the number of dst
> lookups for frequently encountered daddrs.
>
> Before diving into the code and the benchmarking results, it's important
> to address the deletion of the old route cache [2] and why
> this solution is different. The original cache was complicated,
> vulnerable to DOS attacks and had unstable performance.
>
> The new input dst_cache is much simpler thanks to its lazy approach,
> improving performance without the overhead of the removed cache
> implementation. Instead of using timers and GC, the deletion of invalid
> entries is performed lazily during their lookups.
> The dsts are stored in a simple, lightweight, static hash table. This
> keeps the lookup times fast yet stable, preventing DOS upon cache misses.
> The new input dst_cache implementation is built over the existing
> dst_cache code which supplies a fast lockless percpu behavior.
>
> I tested this patch using udp floods with different number of daddrs.
> The benchmarking setup is comprised of 3 machines: a sender,
> a forwarder and a receiver. I measured the PPS received by the receiver
> as the forwarder was running either the mainline kernel or the patched
> kernel, comparing the results. The dst_cache I tested in this benchmark
> used a total of 512 hash table entries, split into buckets of 4
> entries each.
>
> These are the results:
> UDP mainline patched delta
> conns pcpu Kpps Kpps %
> 1 274.0255 269.2205 -1.75
> 2 257.3748 268.0947 4.17
> 15 241.3513 258.8016 7.23
> 100 238.3419 258.4939 8.46
> 500 238.5390 252.6425 5.91
> 1000 238.7570 242.1820 1.43
> 2000 238.7780 236.2640 -1.05
> 4000 239.0440 233.5320 -2.31
> 8000 239.3248 232.5680 -2.82
>
I have looked at all of the sets sent. I can not convince myself this is
a good idea, but at the same time I do not have constructive feedback on
why it is not acceptable. The gains are modest at best.
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH net-next 0/4] net: route: improve route hinting
2024-03-09 4:53 ` [PATCH net-next 0/4] net: route: improve route hinting David Ahern
@ 2024-03-12 15:38 ` Leone Fernando
2024-04-02 10:08 ` Leone Fernando
1 sibling, 0 replies; 13+ messages in thread
From: Leone Fernando @ 2024-03-12 15:38 UTC (permalink / raw)
To: David Ahern, davem, edumazet, kuba, pabeni, willemb; +Cc: netdev, linux-kernel
David Ahern wrote:
> I have looked at all of the sets sent. I can not convince myself this is
> a good idea, but at the same time I do not have constructive feedback on
> why it is not acceptable. The gains are modest at best.
>
Thanks for the comment.
I believe an improvement of 5-8% in PPS is significant.
Note that the cache is per-cpu (e.g., for a machine with 10 CPUs, the improvement
affects 10X the conns mentioned).
Could you please provide more information about what you don't like
in the patch?
Some possible issues I can think of:
- Do you think the improvement is not affecting the common case?
In this case, it can be solved by tweaking the cache parameters.
- In case performance degradation for some # of conns is problematic - we can
find ways to reduce it. For example, the degradation for 1 connection can
probably be solved by keeping the route hints.
Leone
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 0/4] net: route: improve route hinting
2024-03-09 4:53 ` [PATCH net-next 0/4] net: route: improve route hinting David Ahern
2024-03-12 15:38 ` Leone Fernando
@ 2024-04-02 10:08 ` Leone Fernando
2024-04-02 15:02 ` David Ahern
1 sibling, 1 reply; 13+ messages in thread
From: Leone Fernando @ 2024-04-02 10:08 UTC (permalink / raw)
To: David Ahern, davem, edumazet, kuba, pabeni, willemb; +Cc: netdev, linux-kernel
Hi David,
I plan to continue working on this patch, and it
would be helpful if you could share some more thoughts.
As I said before, I think this patch is significant, and my measurements
showed a consistent improvement in most cases.
Thanks,
Leone
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 0/4] net: route: improve route hinting
2024-04-02 10:08 ` Leone Fernando
@ 2024-04-02 15:02 ` David Ahern
2024-04-03 11:50 ` Leone Fernando
0 siblings, 1 reply; 13+ messages in thread
From: David Ahern @ 2024-04-02 15:02 UTC (permalink / raw)
To: Leone Fernando, davem, edumazet, kuba, pabeni, willemb
Cc: netdev, linux-kernel
On 4/2/24 4:08 AM, Leone Fernando wrote:
> Hi David,
>
> I plan to continue working on this patch, and it
> would be helpful if you could share some more thoughts.
> As I said before, I think this patch is significant, and my measurements
> showed a consistent improvement in most cases.
>
It seems to me patch 1 and a version of it for IPv6 should go in
independent of this set.
For the rest of it, Jakub's response was a good summary: it is hard to
know if there is a benefit to real workloads. Cache's consume resources
(memory and cpu) and will be wrong some percentage of the time
increasing overhead.
Also, it is targeted at a very narrow use case -- IPv4 only, no custom
FIB rules and no multipath.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net-next 0/4] net: route: improve route hinting
2024-04-02 15:02 ` David Ahern
@ 2024-04-03 11:50 ` Leone Fernando
0 siblings, 0 replies; 13+ messages in thread
From: Leone Fernando @ 2024-04-03 11:50 UTC (permalink / raw)
To: David Ahern, davem, edumazet, kuba, pabeni, willemb; +Cc: netdev, linux-kernel
Hi David,
>
> For the rest of it, Jakub's response was a good summary: it is hard to
> know if there is a benefit to real workloads. Cache's consume resources
> (memory and cpu) and will be wrong some percentage of the time
> increasing overhead.
>
I understand what you are saying here.
Do you have an idea for extra tests or measurements to make sure
the patch also improves in real workloads?
> Also, it is targeted at a very narrow use case -- IPv4 only, no custom
> FIB rules and no multipath.
Implementation for IPv6 is almost identical. I decided to start with IPv4
and I plan to submit IPv6 support in a future patch.
This patch basically improves the hinting mechanism that Paolo introduced
and it handles the same cases.
Adding support for FIB rules and multipath is a bit more complicated but
also possible. I am willing to keep working on this patch to meet the
remaining cases including IPv6.
Thanks,
Leone
^ permalink raw reply [flat|nested] 13+ messages in thread