From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <netdev-owner@vger.kernel.org>
Received: from mail-pf0-f193.google.com ([209.85.192.193]:35569 "EHLO
        mail-pf0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751320AbeCTFyp (ORCPT
        <rfc822;netdev@vger.kernel.org>); Tue, 20 Mar 2018 01:54:45 -0400
Received: by mail-pf0-f193.google.com with SMTP id y186so255788pfb.2
        for <netdev@vger.kernel.org>; Mon, 19 Mar 2018 22:54:44 -0700 (PDT)
Subject: Re: [bpf-next PATCH v3 08/18] bpf: sk_msg program helper
 bpf_sk_msg_pull_data
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: davejwatson@fb.com, davem@davemloft.net, daniel@iogearbox.net,
        ast@kernel.org, netdev@vger.kernel.org
References: <20180318195501.14466.25366.stgit@john-Precision-Tower-5810>
 <20180318195725.14466.97172.stgit@john-Precision-Tower-5810>
 <20180319202400.unsb3wjr546ew4sb@ast-mbp.dhcp.thefacebook.com>
From: John Fastabend <john.fastabend@gmail.com>
Message-ID: <b3d504e8-2fc2-e520-f6ce-bbaa72c35037@gmail.com>
Date: Mon, 19 Mar 2018 22:54:28 -0700
MIME-Version: 1.0
In-Reply-To: <20180319202400.unsb3wjr546ew4sb@ast-mbp.dhcp.thefacebook.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 03/19/2018 01:24 PM, Alexei Starovoitov wrote:
> On Sun, Mar 18, 2018 at 12:57:25PM -0700, John Fastabend wrote:
>> Currently, if a bpf sk msg program is run the program
>> can only parse data that the (start,end) pointers already
>> consumed. For sendmsg hooks this is likely the first
>> scatterlist element. For sendpage this will be the range
>> (0,0) because the data is shared with userspace and by
>> default we want to avoid allowing userspace to modify
>> data while (or after) BPF verdict is being decided.
>>
>> To support pulling in additional bytes for parsing use
>> a new helper bpf_sk_msg_pull(start, end, flags) which
>> works similarly to cls tc logic. This helper will attempt
>> to point the data start pointer at 'start' bytes offset
>> into msg and data end pointer at 'end' bytes offset into
>> message.
>>
>> After basic sanity checks to ensure 'start' <= 'end' and
>> 'end' <= msg_length there are a few cases we need to
>> handle.
>>
>> First, the sendmsg hook has already copied the data from
>> userspace and has exclusive access to it. Therefore, a
>> copy is not always necessary, although it may still be
>> required. After finding the scatterlist element that
>> contains the 'start' offset byte there are two cases. In
>> the first, the range (start,end) is entirely contained in
>> the sg element and is already linear. All that is needed
>> is to update the data pointers; no allocate/copy is
>> needed. In the second, (start, end) crosses sg element
>> boundaries. In this case we allocate a block of size
>> 'end - start' and copy the data to linearize it.
>>
>> Next, the sendpage hook has not copied any data in its
>> initial state, so the data pointers are (0,0). We handle
>> this case similarly to the sendmsg case above, except
>> that the allocation/copy must always happen. Then when
>> sending the data we have possibly three memory regions
>> that need to be sent, (0, start - 1), (start, end), and
>> (end + 1, msg_length). This is required to ensure any
>> writes by the BPF program are correctly transmitted.
>>
>> Lastly, this operation will invalidate any previous
>> data checks, so BPF programs will have to revalidate
>> pointers after making this BPF call.
>>
>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ..
>> +
>> +	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
>> +	if (unlikely(!page))
>> +		return -ENOMEM;
> 
> I think that's fine. Just curious what order do you see in practice?

At the moment I'm mostly reading headers, so this only
happens when a header is split across multiple scatterlist
elements. In these cases a copy size of less than 4k is good
enough, which with 4k pages is an order-0 allocation.
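
To make the "header split across elements" case concrete, here is a
rough userspace sketch of the copy/no-copy decision as the commit
message describes it. It is not the code from the patch: the names
sketch_pull_action, enum pull_action and the is_sendpage flag are made
up for illustration, and the scatterlist is reduced to an array of
element lengths.

```c
#include <stddef.h>

/* Hypothetical result codes, for illustration only. */
enum pull_action {
	PULL_INVALID,		/* range fails the sanity checks */
	PULL_POINTERS_ONLY,	/* range already linear, just move pointers */
	PULL_COPY,		/* allocate 'end - start' bytes and copy */
};

/*
 * Decide what a pull of (start, end) needs, given the lengths of the
 * scatterlist elements. On PULL_COPY, *copy is set to the number of
 * bytes to linearize; otherwise it is set to 0.
 */
static enum pull_action sketch_pull_action(const size_t *sg_len, size_t nr_sg,
					   size_t start, size_t end,
					   int is_sendpage, size_t *copy)
{
	size_t msg_len = 0, off = 0, i;

	*copy = 0;
	for (i = 0; i < nr_sg; i++)
		msg_len += sg_len[i];

	/* Sanity checks: 'start' <= 'end' and 'end' <= msg_length. */
	if (start > end || end > msg_len)
		return PULL_INVALID;

	/* Find the element that holds the byte at offset 'start'. */
	for (i = 0; i < nr_sg; i++) {
		if (start < off + sg_len[i])
			break;
		off += sg_len[i];
	}
	if (i == nr_sg)
		return PULL_INVALID;	/* 'start' is at or past the end */

	/* sendmsg data is already private: a contained range needs no copy. */
	if (!is_sendpage && end <= off + sg_len[i])
		return PULL_POINTERS_ONLY;

	/* Range crosses elements, or the pages are shared with userspace. */
	*copy = end - start;
	return PULL_COPY;
}
```

With elements of 100 and 200 bytes, a sendmsg pull of (90, 120)
crosses the boundary and needs a 30-byte copy, while (120, 250) sits
inside the second element and only moves the pointers.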

Some of the nginx configurations I have use a max sendfile
size of 128KB. Pulling all of that would need a larger
allocation, but unless the program looks at the payload we
can avoid reading/writing it. If that becomes commonplace we
could look at optimizing it. That should be doable without
changing the user-facing API.
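
For reference, the arithmetic behind those orders, as a userspace
stand-in for get_order(). This is only a sketch: sketch_get_order and
SKETCH_PAGE_SHIFT are names invented here, and 4k pages are assumed.

```c
#include <stddef.h>

/* Assumption for illustration: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Smallest order such that (page size << order) >= size.
 * Returns 0 for size == 0, which is a choice made for this sketch.
 */
static unsigned int sketch_get_order(size_t size)
{
	size_t page_size = (size_t)1 << SKETCH_PAGE_SHIFT;
	size_t pages = (size + page_size - 1) >> SKETCH_PAGE_SHIFT;
	unsigned int order = 0;

	/* Round the page count up to the next power of two. */
	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

Under that assumption any copy of up to 4096 bytes is order 0, and
pulling a full 128KB sendfile chunk (32 pages) would be order 5.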

> 
> Acked-by: Alexei Starovoitov <ast@kernel.org>
>