From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D9F20C433E2 for ; Tue, 14 Jul 2020 20:47:59 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A84CA2064B for ; Tue, 14 Jul 2020 20:47:59 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="AqntVCpw" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727854AbgGNUr6 (ORCPT ); Tue, 14 Jul 2020 16:47:58 -0400 Received: from us-smtp-delivery-1.mimecast.com ([207.211.31.120]:29564 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726898AbgGNUr6 (ORCPT ); Tue, 14 Jul 2020 16:47:58 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1594759675; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=0MaXgRjPugCbTyz5Iy3UwDJY6zPZsf/bR3z6EVEFfZM=; b=AqntVCpwJYoyPnaViHdy1DgE83HwAEMhSCuBOGX7WDjUXORMU1ZEm28NeVSRCwWtLft66K pcTw5L9SZ2k0F85uq5JpxUsM1XsYpj8CSmguenJLGgnqbTRvaSXTU7fTq5J1Ri1bDkNcPe YePHmnQLRS3O+2OsfDkSF0ycBcNl/28= Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com [209.85.222.198]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-313-trGmxnlfNBemIZjpl3mjqw-1; Tue, 14 Jul 2020 16:47:53 -0400 X-MC-Unique: trGmxnlfNBemIZjpl3mjqw-1 Received: by mail-qk1-f198.google.com with SMTP id k16so13902180qkh.12 for ; Tue, 14 Jul 2020 13:47:53 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:in-reply-to:references:date :message-id:mime-version:content-transfer-encoding; bh=0MaXgRjPugCbTyz5Iy3UwDJY6zPZsf/bR3z6EVEFfZM=; b=C0Sej7j76yRO51uczOLP4X088zeE+J/rdzU9PVC4JRmgyDEP4QmQbG4+Q/WXSpYt1M h0Xe48k/A8aTXNLulWPGaKqscgi23hlqLcgOahQPNb0t3gWfNfxcbLVYEF4kqwIOkqJH qZVHVN8XFnRdIcILv85GaIj07nkY8EWJDazZ7vBhAWN6e/ACUnqsVClGwo1Z600x9rgh HA+nKNixZnbBLM4SAfRjBxXQnBiQldnF3k9BIyhFYFuIZqrSPW3BQvP2Jxf3Z/M2DY0o IFKVKRbNZIe60COKGShoXo11DhK/H2TPdTVfSkuDjE7JAojo4OB4Pe0S2vkW/nAQeV0z n1xQ== X-Gm-Message-State: AOAM530fJweFtY8cZ7yc+ZIIr5aSxQAo55kkw3nSueKCT5P+3JRJaI0x BigM+xJ1zQO7igwGVL+LWajXMbGF5Yfoq6yH0rl0BQsh4vyRSYw2okzqzXAmpcIa6igRx4M4bWk bG9BUEMjvZ9lKq1In X-Received: by 2002:ac8:ac3:: with SMTP id g3mr6789475qti.178.1594759672602; Tue, 14 Jul 2020 13:47:52 -0700 (PDT) X-Google-Smtp-Source: ABdhPJy6nR2xXy0jRiZMLGRj0TzghbtfvfbIEqz7vomQo5C8fD8NnQrKy39FQyz0IWfKbXO5iFSe+A== X-Received: by 2002:ac8:ac3:: with SMTP id g3mr6789461qti.178.1594759672187; Tue, 14 Jul 2020 13:47:52 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id 79sm24071619qkd.134.2020.07.14.13.47.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 14 Jul 2020 13:47:51 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id 86BD71804F0; Tue, 14 Jul 2020 22:47:48 +0200 (CEST) From: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= To: Andrii Nakryiko Cc: bpf , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau , Song Liu , Yonghong Song , Andrii Nakryiko , John Fastabend , KP Singh , Networking , Kernel Team Subject: Re: BPF logging infrastructure. Was: [PATCH bpf-next 4/6] tools: add new members to bpf_attr.raw_tracepoint in bpf.h In-Reply-To: References: <159467113970.370286.17656404860101110795.stgit@toke.dk> <159467114405.370286.1690821122507970067.stgit@toke.dk> <87r1tegusj.fsf@toke.dk> X-Clacks-Overhead: GNU Terry Pratchett Date: Tue, 14 Jul 2020 22:47:48 +0200 Message-ID: <87pn8xg6x7.fsf@toke.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Andrii Nakryiko writes: > On Tue, Jul 14, 2020 at 5:12 AM Toke H=C3=B8iland-J=C3=B8rgensen wrote: >> >> Andrii Nakryiko writes: >> >> > On Mon, Jul 13, 2020 at 1:13 PM Toke H=C3=B8iland-J=C3=B8rgensen wrote: >> >> >> >> From: Toke H=C3=B8iland-J=C3=B8rgensen >> >> >> >> Sync addition of new members from main kernel tree. >> >> >> >> Signed-off-by: Toke H=C3=B8iland-J=C3=B8rgensen >> >> --- >> >> tools/include/uapi/linux/bpf.h | 9 +++++++-- >> >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> >> >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linu= x/bpf.h >> >> index da9bf35a26f8..662a15e4a1a1 100644 >> >> --- a/tools/include/uapi/linux/bpf.h >> >> +++ b/tools/include/uapi/linux/bpf.h >> >> @@ -573,8 +573,13 @@ union bpf_attr { >> >> } query; >> >> >> >> struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN = command */ >> >> - __u64 name; >> >> - __u32 prog_fd; >> >> + __u64 name; >> >> + __u32 prog_fd; >> >> + __u32 log_level; /* verbosity level of= log */ >> >> + __u32 log_size; /* size of user buffe= r */ >> >> + __aligned_u64 log_buf; /* user supplied buff= er */ >> >> + __u32 tgt_prog_fd; >> >> + __u32 tgt_btf_id; >> >> } raw_tracepoint; >> >> >> >> struct { /* anonymous struct for BPF_BTF_LOAD */ >> >> >> > >> > I think BPF syscall would benefit from common/generalized log support >> > across all commands, given how powerful/complex it already is. >> > Sometimes it's literally impossible to understand why one gets -EINVAL >> > without adding printk()s in the kernel. >> >> Yes, I agree! This is horrible UI! > > UI?.. It's a perfectly fine and extensible API for all functionality > it provides, it just needs a better human-readable feedback mechanism, > which is what I'm proposing. Error codes are not working when you have > so many different situations that can result in error. Yes. I was agreeing with you: the lack of understandable error messages means you have to play "guess where this EINVAL came from", which is a terrible user experience (should have been UX instead of UI, I guess, sorry about that).=20 >> > But it feels wrong to add log_level/log_size/log_buf fields to every >> > section of bpf_attr and require the user to specify it differently for >> > each command. So before we go and start adding per-command fields, >> > let's discuss how we can do this more generically. I wonder if we can >> > come up with a good way to do it in one common way and then gradually >> > support that way throughout all BPF commands. >> > >> > Unfortunately it's too late to just add a few common fields to >> > bpf_attr in front of all other per-command fields, but here's two more >> > ideas just to start discussion. I hope someone can come up with >> > something nicer. >> > >> > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size). >> > We can extend it with two more optional arguments: one for pointer to >> > log-defining attr (for log_buf pointer, log_size, log_level, maybe >> > something more later) and another for the size of that log attr. >> > Beyond that we'd need some way to specify that the user actually meant >> > to provide those 2 optional args. The most straightforward way would >> > be to use the highest bit of cmd argument. So instead of calling bpf() >> > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE | >> > BPF_LOGGED, where BPF_LOGGED =3D 1<<31. >> >> Well, if only we'd had a 'flags' argument to the syscall... I don't >> suppose we want a bpf2()? :) >> >> I like your idea of just using the top bits of the 'cmd' field as flag >> bits, but in that case we should just define this explicitly, say >> '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'? > > sure > >> >> And instead of adding new arguments, why not just change the meaning of >> the 'attr' argument? Say we define: >> >> struct bpf_extended_attr { >> __u32 log_level; >> __u32 log_size; >> __aligned_u64 log_buf; >> __u8 reserved[48]; >> union bpf_attr attr; >> }; > > because this is a major PITA for libraries, plus it's not very > extensible, once you run out of 48 bytes? And when you don't need > those 48 bytes, you still need to zero them out, the kernel still > needs to copy them, etc. It just feels unclean to me. > > But before we argue that, is there a problem adding 2 more arguments > which are never used/read unless we have an extra bit set in cmd? > Honest question, are there any implications to user-space with such an > approach? Backwards-compatibility issues or anything? No idea; I don't know enough about how the lower-level details of the syscall interface works to tell either way. Is it even *possible* to add arguments like that in a backwards-compatible way? However, assuming it *is* possible, my larger point was that we shouldn't add just a 'logging struct', but rather a 'common options struct' which can be extended further as needed. And if it is *not* possible to add new arguments to a syscall like you're proposing, my suggestion above would be a different way to achieve basically the same (at the cost of having to specify the maximum reserved space in advance). >> And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the >> kernel to interpret the second argument of the syscall as a pointer to >> that struct instead of to the bpf_attr union? >> >> > 2. A more "stateful" approach, would be to have an extra BPF command >> > to set log buffer (and level) per thread. And if such a log is set, it >> > would be overwritten with content on each bpf() syscall invocation >> > (i.e., log position would be reset to 0 on each BPF syscall). >> >> I don't think adding something stateful is a good idea; that's bound to >> lead to weird issues when someone messes up the state management in >> userspace. > > I agree, I'd prefer a stateless approach, but wanted to lay out a > stateful one for completeness as well. OK, cool. >> >> > Of course, the existing BPF_LOAD_PROG command would keep working with >> > existing log, but could fall back to the "common one", if >> > BPF_LOAD_PROG-specific one is not set. >> > >> > It also doesn't seem to be all that critical to signal when the log >> > buffer is overflown. It's pretty easy to detect from user-space: >> > - either last byte would be non-zero, if we don't care about >> > guaranteeing zero-termination for truncated log; >> > - of second-to-last byte would be non-zero, if BPF syscall will always >> > zero-terminate the log. >> >> I think if we're doing this we should think about making the contents of >> the log machine-readable, so applications can figure out what's going on >> without having to parse the text strings. The simplest would be to make >> this new log buffer use TLV-style messaging, say we define the log >> buffer output to be a series of messages like these: >> >> struct bpf_log_msg { >> __u16 type; >> __u32 len; >> __u8 contents[]; /* of size 'len' */ >> } __attribute__((packed)); >> >> To begin with we could define two types: >> >> struct bpf_log_msg_string { >> __u16 type; /* BPF_LOG_MSG_TYPE_STRING */ >> __u32 len; >> char message[]; >> } __attribute__((packed)); >> >> struct bpf_log_msg_end { >> __u16 type; /* BPF_LOG_MSG_TYPE_END */ >> __u32 len; >> __u16 num_truncations; >> } __attribute__((packed)); >> >> The TYPE_STRING would just be a wrapper for the existing text-based >> messages, but delimited so userspace can pick them apart. And the second >> one would solve your 'has the log been truncated' issue above; the >> kernel simply always reserves eight bytes at the end of the buffer and >> ends with writing a TYPE_END message with the number of messages that >> were dropped due to lack of space. That would make it trivial for >> userspace to detect truncation. >> > > Log truncation is not an issue, we can make bpf syscall to write back > the actual size of emitted log (and optionally extra bool to mean > "truncated") into the original log_size field. So what was all that you were talking about with "check the second-to-last byte" in your previous email? I understood that to be about detecting truncation? >> We could then add new message types later for machine-consumption. Say, >> and extended error code, or offsets into the BTF information, or >> whatever we end up needing. But just wrapping the existing log messages >> in TLVs like the ones above could be fairly straight-forwardly >> implemented with the existing bpf_log() infrastructure in the kernel, >> without having to settle on which machine-readable information is useful >> ahead of time... And the TLV format makes it easy for userspace to just >> skip message types it doesn't understand. >> >> WDYT? > > I think it's taking it a bit too far and adds more API on top of > existing API. Now all those types become a fixed API, messages become > an API, etc. Just more backwards compatibility stuff we need to > support forever, for, what I believe, very little gain. > > In practice, using human-readable strings works just fine. If there is > any kind of real issue, usually it involves humans reading debug logs > and understanding what's going on. Let me give an example of what I want to be able to do. Just today I helped someone debug getting xdp-filter to run, and they were getting output like this: libbpf: -- BEGIN DUMP LOG --- libbpf: xdpfilt_alw_all() is not a global function processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 libbpf: -- END LOG -- I would like to have xdp-filter catch that, and turn it into a friendlier "your compiler is too old" message. Having to effectively grep through the free-form log output to pick out that message feels brittle and error prone, as opposed to just having the kernel add a machine-readable id ("err_func_linkage_not_global" or somesuch) and stick it in a machine-parsable TLV. > Also adopting these packet-like messages is not as straightforward > through BPF code, as now you can't just construct a single log line > with few calls to bpf_log(). Why not? bpf_log() could just transparently write the four bytes of header (TYPE_STRING, followed by strlen(msg)) into the buffer before the string? And in the future, an enhanced version could take (say) an error ID as another parameter and transparently add that as a separate message. >> > Of course, if user code cares about such detection of log truncation, >> > it will need to set last/second-to-last bytes to zero before each >> > syscall. >> > >> > Both approaches have their pros/cons, we can dig into those later, but >> > I'd like to start this discussion and see what other people think. I >> > also wonder if there are other syscalls with similarly advanced input >> > arguments (like bpf_attr) and common logging, we could learn from >> > those. >> >> The first one that comes to mind is netlink extacks. Of course it's not >> quite comparable since netlink already has message-based semantics, but >> it does do sorta-kinda the same thing from a user PoV. The TLV format is >> obviously inspired by netlink (or, well, binary networking protocols in >> general). >> > > Yeah, I'm aware of extack, but as you said, TLV is more of a netlink > format, extack messages themselves are just strings. But my question > was more of how this log could be done for complicated API calls using > similar extendable attrs, like perf_event_open, clone3, openat2, etc. Ah, right, no idea :) -Toke