[PATCH net] veth: Fix RX stats for bpf_redirect

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic
@ 2023-10-27 18:46 Peilin Ye
  2023-10-27 19:02 ` Peilin Ye
  0 siblings, 1 reply; 8+ messages in thread
From: Peilin Ye @ 2023-10-27 18:46 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Jesper Dangaard Brouer
  Cc: Peilin Ye, netdev, bpf, linux-kernel, Cong Wang, Jiang Wang,
	Youlun Zhang, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Traffic redirected by bpf_redirect_peer() (used by recent CNIs like
Cilium) is not accounted for in the RX stats of veth devices, confusing
user space metrics collectors such as cAdvisor [1], as reported by
Youlun.

Currently veth devices use the @lstats per-CPU counters, which only
cover TX traffic.  veth_get_stats64() actually collects RX stats of a
veth device from its peer's TX (@lstats) counters, based on the
assumption that a veth device can _only_ receive packets from its peer,
which is no longer true.

Instead, use @tstats to maintain both per-CPU RX and TX traffic counters
for each veth device, and count bpf_redirect_peer() traffic in
skb_do_redirect().

veth_stats_rx() might need a name change (perhaps to "veth_stats_xdp()")
for less confusion, but let's leave it to a separate patch to keep this
fix minimal.

[1] Specifically, the "container_network_receive_{byte,packet}s_total"
    counters are affected.

Reported-by: Youlun Zhang <zhangyoulun@bytedance.com>
Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper")
Cc: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 drivers/net/veth.c | 36 ++++++++++++++----------------------
 net/core/filter.c  |  1 +
 2 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9980517ed8b0..df7a7c21a46d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -373,7 +373,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_tx_timestamp(skb);
 	if (likely(veth_forward_skb(rcv, skb, rq, use_napi) == NET_RX_SUCCESS)) {
 		if (!use_napi)
-			dev_lstats_add(dev, length);
+			dev_sw_netstats_tx_add(dev, 1, length);
 		else
 			__veth_xdp_flush(rq);
 	} else {
@@ -387,14 +387,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	return ret;
 }
 
-static u64 veth_stats_tx(struct net_device *dev, u64 *packets, u64 *bytes)
-{
-	struct veth_priv *priv = netdev_priv(dev);
-
-	dev_lstats_read(dev, packets, bytes);
-	return atomic64_read(&priv->dropped);
-}
-
 static void veth_stats_rx(struct veth_stats *result, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -432,24 +424,24 @@ static void veth_get_stats64(struct net_device *dev,
 	struct veth_priv *priv = netdev_priv(dev);
 	struct net_device *peer;
 	struct veth_stats rx;
-	u64 packets, bytes;
 
-	tot->tx_dropped = veth_stats_tx(dev, &packets, &bytes);
-	tot->tx_bytes = bytes;
-	tot->tx_packets = packets;
+	tot->tx_dropped = atomic64_read(&priv->dropped);
+	dev_fetch_sw_netstats(tot, dev->tstats);
 
 	veth_stats_rx(&rx, dev);
 	tot->tx_dropped += rx.xdp_tx_err;
 	tot->rx_dropped = rx.rx_drops + rx.peer_tq_xdp_xmit_err;
-	tot->rx_bytes = rx.xdp_bytes;
-	tot->rx_packets = rx.xdp_packets;
+	tot->rx_bytes += rx.xdp_bytes;
+	tot->rx_packets += rx.xdp_packets;
 
 	rcu_read_lock();
 	peer = rcu_dereference(priv->peer);
 	if (peer) {
-		veth_stats_tx(peer, &packets, &bytes);
-		tot->rx_bytes += bytes;
-		tot->rx_packets += packets;
+		struct rtnl_link_stats64 tot_peer = {};
+
+		dev_fetch_sw_netstats(&tot_peer, peer->tstats);
+		tot->rx_bytes += tot_peer.tx_bytes;
+		tot->rx_packets += tot_peer.tx_packets;
 
 		veth_stats_rx(&rx, peer);
 		tot->tx_dropped += rx.peer_tq_xdp_xmit_err;
@@ -1508,13 +1500,13 @@ static int veth_dev_init(struct net_device *dev)
 {
 	int err;
 
-	dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
-	if (!dev->lstats)
+	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+	if (!dev->tstats)
 		return -ENOMEM;
 
 	err = veth_alloc_queues(dev);
 	if (err) {
-		free_percpu(dev->lstats);
+		free_percpu(dev->tstats);
 		return err;
 	}
 
@@ -1524,7 +1516,7 @@ static int veth_dev_init(struct net_device *dev)
 static void veth_dev_free(struct net_device *dev)
 {
 	veth_free_queues(dev);
-	free_percpu(dev->lstats);
+	free_percpu(dev->tstats);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
diff --git a/net/core/filter.c b/net/core/filter.c
index 21d75108c2e9..7aca28b7d0fd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
 			     net_eq(net, dev_net(dev))))
 			goto out_drop;
 		skb->dev = dev;
+		dev_sw_netstats_rx_add(dev, skb->len);
 		return -EAGAIN;
 	}
 	return flags & BPF_F_NEIGH ?
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic
  2023-10-27 18:46 [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic Peilin Ye
@ 2023-10-27 19:02 ` Peilin Ye
  2023-10-28  7:06   ` Daniel Borkmann
  0 siblings, 1 reply; 8+ messages in thread
From: Peilin Ye @ 2023-10-27 19:02 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Jesper Dangaard Brouer
  Cc: Peilin Ye, netdev, bpf, linux-kernel, Cong Wang, Jiang Wang,
	Youlun Zhang

Hi all,

On Fri, Oct 27, 2023 at 06:46:57PM +0000, Peilin Ye wrote:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 21d75108c2e9..7aca28b7d0fd 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
>  			     net_eq(net, dev_net(dev))))
>  			goto out_drop;
>  		skb->dev = dev;
> +		dev_sw_netstats_rx_add(dev, skb->len);

This assumes that all devices that support BPF_F_PEER (currently only
veth) use tstats (instead of lstats, or dstats) - is that okay?

If not, should I add another NDO e.g. ->ndo_stats_rx_add()?

Thanks,
Peilin Ye


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic
  2023-10-27 19:02 ` Peilin Ye
@ 2023-10-28  7:06   ` Daniel Borkmann
  2023-10-28 23:11     ` Peilin Ye
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Borkmann @ 2023-10-28  7:06 UTC (permalink / raw)
  To: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Jesper Dangaard Brouer
  Cc: Peilin Ye, netdev, bpf, linux-kernel, Cong Wang, Jiang Wang,
	Youlun Zhang

Hi Peilin,

On 10/27/23 9:02 PM, Peilin Ye wrote:
> On Fri, Oct 27, 2023 at 06:46:57PM +0000, Peilin Ye wrote:
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 21d75108c2e9..7aca28b7d0fd 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
>>   			     net_eq(net, dev_net(dev))))
>>   			goto out_drop;
>>   		skb->dev = dev;
>> +		dev_sw_netstats_rx_add(dev, skb->len);
> 
> This assumes that all devices that support BPF_F_PEER (currently only
> veth) use tstats (instead of lstats, or dstats) - is that okay?

Dumb question, but why all this change and not simply just call ...

   dev_lstats_add(dev, skb->len)

... on the host dev ?

> If not, should I add another NDO e.g. ->ndo_stats_rx_add()?

Definitely no new stats ndo resp indirect call in fast path.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic
  2023-10-28  7:06   ` Daniel Borkmann
@ 2023-10-28 23:11     ` Peilin Ye
  2023-10-30 14:19       ` Daniel Borkmann
  0 siblings, 1 reply; 8+ messages in thread
From: Peilin Ye @ 2023-10-28 23:11 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Jesper Dangaard Brouer, Peilin Ye, netdev,
	bpf, linux-kernel, Cong Wang, Jiang Wang, Youlun Zhang

Hi Daniel,

Thanks for taking a look!

On Sat, Oct 28, 2023 at 09:06:44AM +0200, Daniel Borkmann wrote:
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 21d75108c2e9..7aca28b7d0fd 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
> > >   			     net_eq(net, dev_net(dev))))
> > >   			goto out_drop;
> > >   		skb->dev = dev;
> > > +		dev_sw_netstats_rx_add(dev, skb->len);
> > 
> > This assumes that all devices that support BPF_F_PEER (currently only
> > veth) use tstats (instead of lstats, or dstats) - is that okay?
> 
> Dumb question, but why all this change and not simply just call ...
> 
>   dev_lstats_add(dev, skb->len)
> 
> ... on the host dev ?

Since I didn't want to update host-veth's TX counters.  If we
bpf_redirect_peer()ed a packet from NIC TC ingress to Pod-veth TC ingress,
I think it means we've bypassed host-veth TX?

> > If not, should I add another NDO e.g. ->ndo_stats_rx_add()?
> 
> Definitely no new stats ndo resp indirect call in fast path.

Yeah, I think I'll put a comment saying that all devices that support
BPF_F_PEER must use tstats (or must use lstats), then.

Thanks,
Peilin Ye


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic
  2023-10-28 23:11     ` Peilin Ye
@ 2023-10-30 14:19       ` Daniel Borkmann
  2023-10-30 21:51         ` Peilin Ye
  2023-10-31 19:53         ` Jakub Kicinski
  0 siblings, 2 replies; 8+ messages in thread
From: Daniel Borkmann @ 2023-10-30 14:19 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Jesper Dangaard Brouer, Peilin Ye, netdev,
	bpf, linux-kernel, Cong Wang, Jiang Wang, Youlun Zhang

On 10/29/23 1:11 AM, Peilin Ye wrote:
> On Sat, Oct 28, 2023 at 09:06:44AM +0200, Daniel Borkmann wrote:
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index 21d75108c2e9..7aca28b7d0fd 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -2492,6 +2492,7 @@ int skb_do_redirect(struct sk_buff *skb)
>>>>    			     net_eq(net, dev_net(dev))))
>>>>    			goto out_drop;
>>>>    		skb->dev = dev;
>>>> +		dev_sw_netstats_rx_add(dev, skb->len);
>>>
>>> This assumes that all devices that support BPF_F_PEER (currently only
>>> veth) use tstats (instead of lstats, or dstats) - is that okay?
>>
>> Dumb question, but why all this change and not simply just call ...
>>
>>    dev_lstats_add(dev, skb->len)
>>
>> ... on the host dev ?
> 
> Since I didn't want to update host-veth's TX counters.  If we
> bpf_redirect_peer()ed a packet from NIC TC ingress to Pod-veth TC ingress,
> I think it means we've bypassed host-veth TX?

Yes. So the idea is to transition to tstats replace the location where
we used to bump lstats with tstat's tx counter, and only the peer redirect
would bump the rx counter.. then upon stats traversal we fold the latter into
the rx stats which was populated by the opposite's tx counters. Makes sense.

OT: does cadvisor run inside the Pod to collect the device stats? Just
curious how it gathers them.

>>> If not, should I add another NDO e.g. ->ndo_stats_rx_add()?
>>
>> Definitely no new stats ndo resp indirect call in fast path.
> 
> Yeah, I think I'll put a comment saying that all devices that support
> BPF_F_PEER must use tstats (or must use lstats), then.

sgtm.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic
  2023-10-30 14:19       ` Daniel Borkmann
@ 2023-10-30 21:51         ` Peilin Ye
  2023-10-31 19:53         ` Jakub Kicinski
  1 sibling, 0 replies; 8+ messages in thread
From: Peilin Ye @ 2023-10-30 21:51 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Jesper Dangaard Brouer, Peilin Ye, netdev,
	bpf, linux-kernel, Cong Wang, Jiang Wang, Youlun Zhang

On Mon, Oct 30, 2023 at 03:19:26PM +0100, Daniel Borkmann wrote:
> OT: does cadvisor run inside the Pod to collect the device stats? Just
> curious how it gathers them.

Seems like it reads from /proc/$PID/net/dev:
https://github.com/google/cadvisor/blob/d19df4f3b8e73c7e807d4b9894c252b54852fa8a/container/libcontainer/handler.go#L461

Thanks,
Peilin Ye


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic
  2023-10-30 14:19       ` Daniel Borkmann
  2023-10-30 21:51         ` Peilin Ye
@ 2023-10-31 19:53         ` Jakub Kicinski
  2023-10-31 22:20           ` Daniel Borkmann
  1 sibling, 1 reply; 8+ messages in thread
From: Jakub Kicinski @ 2023-10-31 19:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Paolo Abeni,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Jesper Dangaard Brouer, Peilin Ye, netdev,
	bpf, linux-kernel, Cong Wang, Jiang Wang, Youlun Zhang

On Mon, 30 Oct 2023 15:19:26 +0100 Daniel Borkmann wrote:
> > Since I didn't want to update host-veth's TX counters.  If we
> > bpf_redirect_peer()ed a packet from NIC TC ingress to Pod-veth TC ingress,
> > I think it means we've bypassed host-veth TX?  
> 
> Yes. So the idea is to transition to tstats replace the location where
> we used to bump lstats with tstat's tx counter, and only the peer redirect
> would bump the rx counter.. then upon stats traversal we fold the latter into
> the rx stats which was populated by the opposite's tx counters. Makes sense.
> 
> OT: does cadvisor run inside the Pod to collect the device stats? Just
> curious how it gathers them.

Somewhat related - where does netkit count stats?

> >> Definitely no new stats ndo resp indirect call in fast path.  
> > 
> > Yeah, I think I'll put a comment saying that all devices that support
> > BPF_F_PEER must use tstats (or must use lstats), then.  
> 
> sgtm.

Is comment good enough? Can we try to do something more robust?
Move the allocation of stats into the core at registration based 
on some u8 assigned in the driver? (I haven't looked at the code TBH)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic
  2023-10-31 19:53         ` Jakub Kicinski
@ 2023-10-31 22:20           ` Daniel Borkmann
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Borkmann @ 2023-10-31 22:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Paolo Abeni,
	Alexei Starovoitov, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Jesper Dangaard Brouer, Peilin Ye, netdev,
	bpf, linux-kernel, Cong Wang, Jiang Wang, Youlun Zhang

On 10/31/23 8:53 PM, Jakub Kicinski wrote:
> On Mon, 30 Oct 2023 15:19:26 +0100 Daniel Borkmann wrote:
>>> Since I didn't want to update host-veth's TX counters.  If we
>>> bpf_redirect_peer()ed a packet from NIC TC ingress to Pod-veth TC ingress,
>>> I think it means we've bypassed host-veth TX?
>>
>> Yes. So the idea is to transition to tstats replace the location where
>> we used to bump lstats with tstat's tx counter, and only the peer redirect
>> would bump the rx counter.. then upon stats traversal we fold the latter into
>> the rx stats which was populated by the opposite's tx counters. Makes sense.
>>
>> OT: does cadvisor run inside the Pod to collect the device stats? Just
>> curious how it gathers them.
> 
> Somewhat related - where does netkit count stats?

Yeap, it needs it as well, I have a local branch here where I pushed all
of it - coming out soon; I was planning to add some selftests in addition
till end of this week:

https://github.com/cilium/linux/commits/pr/ndo_peer

>>>> Definitely no new stats ndo resp indirect call in fast path.
>>>
>>> Yeah, I think I'll put a comment saying that all devices that support
>>> BPF_F_PEER must use tstats (or must use lstats), then.
>>
>> sgtm.
> 
> Is comment good enough? Can we try to do something more robust?
> Move the allocation of stats into the core at registration based
> on some u8 assigned in the driver? (I haven't looked at the code TBH)
Hm, not sure. One thing that comes to mind is lazy one-time allocation
like in case of netdev_core_stats_inc(), so whenever one of the helpers
like dev_sw_netstats_{rx,tx}_add() are called and dev->tstats are still
NULL, the core knows about the driver's intent, but tbh that doesn't
feel overly clean and in case of netdev_core_stats_inc() it's more in
the exception case rather than fast-path.

Other option could be to have two small helpers in the core which then
set a flag as well:

static inline int netdev_tstats_alloc(struct net_device *dev)
{
	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
	if (!dev->tstats)
		return -ENOMEM;
	dev->priv_flags |= IFF_USES_TSTATS;
	return 0;
}

static inline void netdev_tstats_free(struct net_device *dev)
{
	free_percpu(dev->tstats);
}

They can then be used from .ndo_init/uninit - not sure if this would
be overall nicer.. or just leaving it at the .ndo callback comment for
the time being until really more users show up (which I doubt tbh).

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-10-31 22:20 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-10-27 18:46 [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic Peilin Ye
2023-10-27 19:02 ` Peilin Ye
2023-10-28  7:06   ` Daniel Borkmann
2023-10-28 23:11     ` Peilin Ye
2023-10-30 14:19       ` Daniel Borkmann
2023-10-30 21:51         ` Peilin Ye
2023-10-31 19:53         ` Jakub Kicinski
2023-10-31 22:20           ` Daniel Borkmann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).