BPF List
 help / color / mirror / Atom feed
From: Alexei Starovoitov <ast@fb.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: "Daniel Borkmann" <daniel@iogearbox.net>,
	"Andrii Nakryiko" <andrii@kernel.org>, bpf <bpf@vger.kernel.org>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Kernel Team" <kernel-team@fb.com>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>
Subject: Re: [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF
Date: Tue, 14 Dec 2021 12:51:33 -0800	[thread overview]
Message-ID: <35b54575-ede0-e9b3-ad3c-4b18ffe0089e@fb.com> (raw)
In-Reply-To: <CAEf4BzbmBNRB0sWAxHpSaW6fYMbgrCDm9K=8XScYGa2PEpdsPA@mail.gmail.com>

On 12/14/21 10:31 AM, Andrii Nakryiko wrote:
> On Tue, Dec 14, 2021 at 9:58 AM Alexei Starovoitov <ast@fb.com> wrote:
>>
>> On 12/14/21 9:51 AM, Andrii Nakryiko wrote:
>>> On Tue, Dec 14, 2021 at 7:09 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>
>>>> On 12/14/21 1:48 AM, Andrii Nakryiko wrote:
>>>>> The need to increase RLIMIT_MEMLOCK to do anything useful with BPF is
>>>>> one of the first extremely frustrating gotchas that all new BPF users go
>>>>> through and in some cases have to learn it a very hard way.
>>>>>
>>>>> Luckily, starting with upstream Linux kernel version 5.11, BPF subsystem
>>>>> dropped the dependency on memlock and uses memcg-based memory accounting
>>>>> instead. Unfortunately, detecting memcg-based BPF memory accounting is
>>>>> far from trivial (as can be evidenced by this patch), so in practice
>>>>> most BPF applications still do unconditional RLIMIT_MEMLOCK increase.
>>>>>
>>>>> As we move towards libbpf 1.0, it would be good to allow users to forget
>>>>> about RLIMIT_MEMLOCK vs memcg and let libbpf do the sensible adjustment
>>>>> automatically. This patch paves the way forward in this matter. Libbpf
>>>>> will do feature detection of memcg-based accounting, and if detected,
>>>>> will do nothing. But if the kernel is too old, just like BCC, libbpf
>>>>> will automatically increase RLIMIT_MEMLOCK on behalf of user
>>>>> application ([0]).
>>>>>
>>>>> As this is technically a breaking change, during the transition period
>>>>> applications have to opt into libbpf 1.0 mode by setting
>>>>> LIBBPF_STRICT_AUTO_RLIMIT_MEMLOCK bit when calling
>>>>> libbpf_set_strict_mode().
>>>>>
>>>>> Libbpf allows to control the exact amount of set RLIMIT_MEMLOCK limit
>>>>> with libbpf_set_memlock_rlim_max() API. Passing 0 will make libbpf do
>>>>> nothing with RLIMIT_MEMLOCK. libbpf_set_memlock_rlim_max() has to be
>>>>> called before the first bpf_prog_load(), bpf_btf_load(), or
>>>>> bpf_object__load() call, otherwise it has no effect and will return
>>>>> -EBUSY.
>>>>>
>>>>>      [0] Closes: https://github.com/libbpf/libbpf/issues/369
>>>>>
>>>>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>>>> [...]
>>>>>
>>>>> +/* Probe whether kernel switched from memlock-based (RLIMIT_MEMLOCK) to
>>>>> + * memcg-based memory accounting for BPF maps and progs. This was done in [0].
>>>>> + * We use the difference in reporting memlock value in BPF map's fdinfo before
>>>>> + * and after [0] to detect whether memcg accounting is done for BPF subsystem
>>>>> + * or not.
>>>>> + *
>>>>> + * Before the change, memlock value for ARRAY map would be calculated as:
>>>>> + *
>>>>> + *   memlock = sizeof(struct bpf_array) + round_up(value_size, 8) * max_entries;
>>>>> + *   memlock = round_up(memlock, PAGE_SIZE);
>>>>> + *
>>>>> + *
>>>>> + * After, memlock is approximated as:
>>>>> + *
>>>>> + *   memlock = round_up(key_size + value_size, 8) * max_entries;
>>>>> + *   memlock = round_up(memlock, PAGE_SIZE);
>>>>> + *
>>>>> + * In this check we use the fact that sizeof(struct bpf_array) is about 300
>>>>> + * bytes, so if we use value_size = (PAGE_SIZE - 100), before memcg
>>>>> + * approximation memlock would be rounded up to 2 * PAGE_SIZE, while with
>>>>> + * memcg approximation it will stay at single PAGE_SIZE (key_size is 4 for
>>>>> + * array and doesn't make much difference given 100 byte decrement we use for
>>>>> + * value_size).
>>>>> + *
>>>>> + *   [0] https://lore.kernel.org/bpf/20201201215900.3569844-1-guro@fb.com/
>>>>> + */
>>>>> +int probe_memcg_account(void)
>>>>> +{
>>>>> +     const size_t map_create_attr_sz = offsetofend(union bpf_attr, map_extra);
>>>>> +     long page_sz = sysconf(_SC_PAGESIZE), memlock_sz;
>>>>> +     char buf[128];
>>>>> +     union bpf_attr attr;
>>>>> +     int map_fd;
>>>>> +     FILE *f;
>>>>> +
>>>>> +     memset(&attr, 0, map_create_attr_sz);
>>>>> +     attr.map_type = BPF_MAP_TYPE_ARRAY;
>>>>> +     attr.key_size = 4;
>>>>> +     attr.value_size = page_sz - 100;
>>>>> +     attr.max_entries = 1;
>>>>> +     map_fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, map_create_attr_sz);
>>>>> +     if (map_fd < 0)
>>>>> +             return -errno;
>>>>> +
>>>>> +     sprintf(buf, "/proc/self/fdinfo/%d", map_fd);
>>>>> +     f = fopen(buf, "r");
>>>>> +     while (f && !feof(f) && fgets(buf, sizeof(buf), f)) {
>>>>> +             if (fscanf(f, "memlock: %ld\n", &memlock_sz) == 1) {
>>>>> +                     fclose(f);
>>>>> +                     close(map_fd);
>>>>> +                     return memlock_sz == page_sz ? 1 : 0;
>>>>> +             }
>>>>> +     }
>>>>> +
>>>>> +     /* proc FS is disabled or we failed to parse fdinfo properly, assume
>>>>> +      * we need setrlimit
>>>>> +      */
>>>>> +     if (f)
>>>>> +             fclose(f);
>>>>> +     close(map_fd);
>>>>> +     return 0;
>>>>> +}
>>>>
>>>> One other option which might be slightly more robust perhaps could be to probe
>>>> for a BPF helper that has been added along with 5.11 kernel. As Toke noted earlier
>>>> it might not work with ooo backports, but if its good with RHEL in this specific
>>>> case, we should be covered for 99% of cases. Potentially, we could then still try
>>>> to fallback to the above probing logic?
>>>
>>> Ok, I was originally thinking of probe bpf_sock_from_file() (which was
>>> added after memcg change), but it's PITA. But I see that slightly
>>> before that (but in the same 5.11 release) bpf_ktime_get_coarse_ns()
>>
>> Note that it had fixes after that, so in the kernel version where
> 
> You mean 5e0bc3082e2e ("bpf: Forbid bpf_ktime_get_coarse_ns and
> bpf_timer_* in tracing progs"), right? This shouldn't matter if I use
> BPF_PROG_TYPE_SOCKET_FILTER for probing.

hmm. I guess we allow it in unpriv too.

> fdinfo parsing approach has unnecessary dependency on PROCFS and is
> more code (and very detailed knowledge of approximation and memlock
> calculation formula). I like ktime_get_coarse_ns approach due to
> minimal amount of code and no reliance on any other kernel config
> besides CONFIG_BPF_SYSCALL.
> 
> But in the end I care about the overall feature, not a particular
> implementation of the detection. Should I send
> ktime_get_coarse_ns-based approach or we go with this one? I've
> implemented and tested all three variants already, so no time savings
> are expected either way.

Either v3 or v4 are fine by me. Let Daniel pick.

  reply	other threads:[~2021-12-14 20:51 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-14  0:48 [PATCH v3 bpf-next 0/2] libbpf: auto-bump RLIMIT_MEMLOCK on old kernels Andrii Nakryiko
2021-12-14  0:48 ` [PATCH v3 bpf-next 1/2] libbpf: auto-bump RLIMIT_MEMLOCK if kernel needs it for BPF Andrii Nakryiko
2021-12-14 15:09   ` Daniel Borkmann
2021-12-14 17:51     ` Andrii Nakryiko
2021-12-14 17:58       ` Alexei Starovoitov
2021-12-14 18:31         ` Andrii Nakryiko
2021-12-14 20:51           ` Alexei Starovoitov [this message]
2021-12-14 21:09             ` Daniel Borkmann
2021-12-14  0:48 ` [PATCH v3 bpf-next 2/2] selftests/bpf: remove explicit setrlimit(RLIMIT_MEMLOCK) in main selftests Andrii Nakryiko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=35b54575-ede0-e9b3-ad3c-4b18ffe0089e@fb.com \
    --to=ast@fb.com \
    --cc=andrii.nakryiko@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=toke@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox