From: "Daniel Xu" <dxu@dxuuu.xyz>
To: "Jesper Dangaard Brouer" <hawk@kernel.org>
Cc: "Andrii Nakryiko" <andrii.nakryiko@gmail.com>,
"Alexei Starovoitov" <alexei.starovoitov@gmail.com>,
"Eduard Zingerman" <eddyz87@gmail.com>,
"Andrii Nakryiko" <andrii@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"Alexei Starovoitov" <ast@kernel.org>,
"Shuah Khan" <shuah@kernel.org>,
"John Fastabend" <john.fastabend@gmail.com>,
"Martin KaFai Lau" <martin.lau@linux.dev>,
"Song Liu" <song@kernel.org>,
"Yonghong Song" <yonghong.song@linux.dev>,
"KP Singh" <kpsingh@kernel.org>,
"Stanislav Fomichev" <sdf@fomichev.me>,
"Hao Luo" <haoluo@google.com>, "Jiri Olsa" <jolsa@kernel.org>,
"Mykola Lysenko" <mykolal@fb.com>,
LKML <linux-kernel@vger.kernel.org>,
"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
"open list:KERNEL SELFTEST FRAMEWORK"
<linux-kselftest@vger.kernel.org>,
"Kernel Team" <kernel-team@meta.com>,
kernel-team <kernel-team@cloudflare.com>
Subject: Re: CPUMAP latency metrics (was: [PATCH bpf-next] bpf: ringbuf: Support consuming BPF_MAP_TYPE_RINGBUF from prog)
Date: Thu, 12 Sep 2024 19:15:59 -0700
Message-ID: <4c9e00be-8c48-4a3b-a50d-79ef47740c99@app.fastmail.com>
In-Reply-To: <e0321fe6-da0e-4745-b23c-24bb2b02613d@kernel.org>
On Thu, Sep 12, 2024, at 2:40 AM, Jesper Dangaard Brouer wrote:
> On 11/09/2024 20.53, Daniel Xu wrote:
>> On Wed, Sep 11, 2024 at 10:32:56AM GMT, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 11/09/2024 06.43, Daniel Xu wrote:
>>>> [cc Jesper]
>>>>
>>>> On Tue, Sep 10, 2024, at 8:31 PM, Daniel Xu wrote:
>>>>> On Tue, Sep 10, 2024 at 05:39:55PM GMT, Andrii Nakryiko wrote:
>>>>>> On Tue, Sep 10, 2024 at 4:44 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>>>>>>>
>>>>>>> On Tue, Sep 10, 2024 at 03:21:04PM GMT, Andrii Nakryiko wrote:
>>>>>>>> On Tue, Sep 10, 2024 at 3:16 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>>>>>>>>>
>>> [...cut...]
>>>
>>>>>> Can you give us a bit more details on what
>>>>>> you are trying to achieve?
>>>>>
>>>>> BPF cpumap, under the hood, has one MPSC ring buffer (ptr_ring) for each
>>>>> entry in the cpumap. When a prog redirects to an entry in the cpumap,
>>>>> the machinery queues up the xdp frame onto the destination CPU ptr_ring.
>>>>> This can occur on any CPU, thus multi-producer. On the processing
>>>>> side, there is only the kthread created by the cpumap entry and bound
>>>>> to the specific CPU that is consuming entries. So single consumer.
>>>>>
>>>
>>> An important detail: to get Multi-Producer (MP) to scale, the CPUMAP
>>> does bulk enqueue into the ptr_ring. It stores the xdp_frames in a
>>> per-CPU array and does the flush/enqueue as part of xdp_do_flush().
>>> Because I was afraid of this adding latency, I chose to also flush
>>> every 8 frames (CPU_MAP_BULK_SIZE).
>>>
>>> Looking at code I see this is also explained in a comment:
>>>
>>> /* General idea: XDP packets getting XDP redirected to another CPU,
>>> * will maximum be stored/queued for one driver ->poll() call. It is
>>> * guaranteed that queueing the frame and the flush operation happen on
>>> * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr()
>>> * which queue in bpf_cpu_map_entry contains packets.
>>> */
>>>
>>>
>>>>> Goal is to track the latency overhead added by the ptr_ring and the
>>>>> kthread (versus softirq, where there is less overhead). Ideally we
>>>>> want p50, p90, p95, p99 percentiles.
>>>>>
>>>
>>> I'm very interested in this use-case of understanding the latency of
>>> CPUMAP. I'm a fan of latency histograms that I turn into heatmaps in
>>> Grafana.
>>>
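
Same here. For the percentile goal, one option I'm considering is a
plain log2 latency histogram in a per-CPU array that userspace folds
into p50/p90/p95/p99. Rough sketch (untested; map and helper names are
made up):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Sketch: log2 latency histogram; userspace walks the buckets to
 * compute approximate p50/p90/p95/p99.
 */
#define MAX_BUCKETS 64

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, MAX_BUCKETS);
        __type(key, __u32);
        __type(value, __u64);
} latency_hist SEC(".maps");

static __always_inline void record_latency(__u64 delta_ns)
{
        __u32 bucket = 0;
        __u64 *count;

        /* bucket N counts deltas in [2^N, 2^(N+1)) nanoseconds */
        while (delta_ns > 1 && bucket < MAX_BUCKETS - 1) {
                delta_ns >>= 1;
                bucket++;
        }

        count = bpf_map_lookup_elem(&latency_hist, &bucket);
        if (count)
                *count += 1;    /* per-CPU map, no atomics needed */
}

Userspace would then sum the buckets across CPUs and report where the
cumulative count crosses 50/90/95/99%, which is good enough for a
heatmap.
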
>>>>> To do this, we need to track every single entry enqueue time as well as
>>>>> dequeue time - events that occur in the tail are quite important.
>>>>>
>>>>> Since ptr_ring is also a ring buffer, I thought it would be easy,
>>>>> reliable, and fast to just create a "shadow" ring buffer. Every time
>>>>> the producer enqueues entries, I'd enqueue the same number of current
>>>>> timestamps onto the shadow RB. Same thing on the consumer side, except
>>>>> dequeue and calculate the timestamp delta.
>>>>>
>>>
>>> This idea seems overkill and will likely produce unreliable results.
>>> E.g. the overhead of this additional ring buffer will also affect the
>>> measurements.
>>
>> Yeah, good point.
>>
>>>
>>>>> I was originally planning on writing my own lockless ring buffer in pure
>>>>> BPF (b/c spinlocks cannot be used w/ tracepoints yet) but was hoping I
>>>>> could avoid that with this patch.
>>>>
>>>> [...]
>>>>
>>>> Alternatively, could add a u64 timestamp to xdp_frame, which makes all
>>>> this tracking inline (and thus more reliable). But I'm not sure how precious
>>>> the space in that struct is - I see some references online saying most drivers
>>>> save 128B headroom. I also see:
>>>>
>>>> #define XDP_PACKET_HEADROOM 256
>>>>
>>>
>>> I like the inline idea. I would suggest adding a u64 timestamp into the
>>> XDP-metadata area (ctx->data_meta, code example[1]) when XDP runs in
>>> RX-NAPI. Then on the remote CPU you can run another CPUMAP-XDP program
>>> that picks up this timestamp and calcs a delta from a "now" timestamp.
>>>
>>>
>>> [1] https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62-L77
>>
>> Cool! This is a much better idea than mine :)
>>
>> I'll give this a try.
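
Roughly what I'm picturing, in case it helps to sanity check it
(untested sketch; the meta layout and map sizing are mine, and I'm
going from memory on the cpumap program section name):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Sketch only: timestamp we stash in the XDP metadata area. */
struct ts_meta {
        __u64 rx_ktime_ns;
};

struct {
        __uint(type, BPF_MAP_TYPE_CPUMAP);
        __uint(max_entries, 64);
        __type(key, __u32);
        __type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

/* RX-NAPI side: stamp the frame, then redirect into the cpumap. */
SEC("xdp")
int xdp_rx_stamp(struct xdp_md *ctx)
{
        struct ts_meta *meta;
        void *data;

        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                return XDP_PASS;

        data = (void *)(long)ctx->data;
        meta = (void *)(long)ctx->data_meta;
        if ((void *)(meta + 1) > data)
                return XDP_PASS;

        meta->rx_ktime_ns = bpf_ktime_get_ns();

        /* CPU selection elided; 0 is just a placeholder */
        return bpf_redirect_map(&cpu_map, 0, 0);
}

/* Remote-CPU side: runs from the cpumap kthread, computes the delta. */
SEC("xdp/cpumap")
int xdp_cpumap_delta(struct xdp_md *ctx)
{
        struct ts_meta *meta = (void *)(long)ctx->data_meta;
        void *data = (void *)(long)ctx->data;
        __u64 delta;

        if ((void *)(meta + 1) > data)
                return XDP_PASS;

        delta = bpf_ktime_get_ns() - meta->rx_ktime_ns;
        /* sketch: just print; the real thing would feed a histogram */
        bpf_printk("cpumap latency %llu ns", delta);

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

One caveat I can see is that data_meta depends on driver support for
bpf_xdp_adjust_meta(), so on NICs without it I'd have to fall back to
something else.
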
>>
>>>
>>>
>>>> Could probably amortize the timestamp read by setting it in
>>>> bq_flush_to_queue().
>>>
>>> To amortize, consider that you might not need to timestamp EVERY packet to
>>> get sufficient statistics on the latency.
>>>
>>> Regarding bq_flush_to_queue() and the enqueue tracepoint:
>>> trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu)
>>>
>>> I have an idea for you on how to measure the latency overhead from XDP
>>> RX-processing to when the enqueue "flush" happens. It is a little tricky
>>> to explain, so I will outline the steps.
>>>
>>> 1. The XDP bpf_prog stores a timestamp in a per-CPU array,
>>>    unless the timestamp is already set.
>>>
>>> 2. The trace_xdp_cpumap_enqueue bpf_prog reads the per-CPU timestamp,
>>>    calcs the latency diff, and clears the timestamp.
>>>
>>> This measures the latency overhead of bulk enqueue. (Notice: only the
>>> first XDP redirect frame after a bq_flush_to_queue() will set the
>>> timestamp.) This per-CPU store should work, as this all runs under the
>>> same RX-NAPI "poll" execution.
>>
>> Makes sense to me. This breaks down the latency even further. I'll keep
>> it in mind if we need further troubleshooting.
>>
>
> Yes, this breaks down the latency even further :-)
>
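
In case I do wire that up later, my reading of your two steps is
roughly the sketch below (untested; the map name is made up, and I'm
assuming the tp_btf attach point for xdp_cpumap_enqueue works -- a
plain tracepoint attach would do as well):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* One slot per CPU: set by the first redirected frame after a flush,
 * cleared by the enqueue tracepoint.
 */
struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
} first_redirect_ts SEC(".maps");

/* Step 1: called from the XDP RX program for frames being redirected
 * into the cpumap; only stamps if nothing is pending.
 */
static __always_inline void mark_first_redirect(void)
{
        __u32 key = 0;
        __u64 *ts = bpf_map_lookup_elem(&first_redirect_ts, &key);

        if (ts && !*ts)
                *ts = bpf_ktime_get_ns();
}

/* Step 2: bulk enqueue flushed; compute the delta and clear the slot. */
SEC("tp_btf/xdp_cpumap_enqueue")
int BPF_PROG(cpumap_enqueue_lat, int map_id, unsigned int processed,
             unsigned int drops, int to_cpu)
{
        __u32 key = 0;
        __u64 *ts = bpf_map_lookup_elem(&first_redirect_ts, &key);
        __u64 delta;

        if (!ts || !*ts)
                return 0;

        delta = bpf_ktime_get_ns() - *ts;
        *ts = 0;

        bpf_printk("bulk enqueue latency %llu ns (cpu %d)", delta, to_cpu);
        return 0;
}

char _license[] SEC("license") = "GPL";

mark_first_redirect() would be called from the RX program right before
the bpf_redirect_map() into the cpumap.
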
>>> This latency overhead of bulk enqueue will (unfortunately) also
>>> count/measure the XDP_PASS packets that get processed by the normal
>>> netstack. So, watch out for this. E.g. you could have XDP action (e.g.
>>> XDP_PASS) counters as part of step 1, and have statistics for cases
>>> where XDP_PASS interfered.
>>
>> Not sure I got this. If we only set the percpu timestamp for
>> XDP_REDIRECT frames, then I don't see how XDP_PASS interferes. Maybe I
>> misunderstand something.
>>
>
> Not quite. You only set the timestamp for the first XDP_REDIRECT frame.
> I'm simply saying that all the packets processed *after* this timestamp
> contribute to the time it takes until trace_xdp_cpumap_enqueue() runs.
> That part should be obvious. Then, I'm saying remember that an XDP_PASS
> packet takes more time than an XDP_REDIRECT packet. I hope this makes it
> clearer. Point: this is a pitfall you need to be aware of when looking
> at your metrics.
>
> For the inline timestamping the same pitfall actually applies. There you
> timestamp the packets themselves, but the CPUMAP enqueue happens as the
> *LAST* thing in the NAPI loop, during the xdp_do_flush() call. This means
> that interleaved normal netstack (XDP_PASS) packets will be processed
> BEFORE this call to xdp_do_flush(). As noted earlier, to compensate for
> this effect, the code will enqueue-flush every 8 frames (CPU_MAP_BULK_SIZE).
Ah, that makes sense. Thanks for explaining.
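
To keep an eye on that pitfall I'd probably also add the per-action
counters you suggested in step 1, something like (again just a sketch,
names are mine):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Per-CPU count of each XDP verdict, so latency samples can be
 * correlated with how many XDP_PASS packets shared the same NAPI poll.
 */
struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, XDP_REDIRECT + 1);
        __type(key, __u32);
        __type(value, __u64);
} action_count SEC(".maps");

static __always_inline int count_action(int action)
{
        __u32 key = action;
        __u64 *cnt = bpf_map_lookup_elem(&action_count, &key);

        if (cnt)
                *cnt += 1;
        return action;  /* usage: return count_action(XDP_PASS); */
}
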
>
> I hope I've not confused you more :-|
>
No, that was helpful!
Thanks,
Daniel