From mboxrd@z Thu Jan  1 00:00:00 1970
From: John Fastabend <john.fastabend@gmail.com>
Subject: Re: [bpf-next PATCH 5/7] bpf: create tcp_bpf_ulp allowing BPF to
 monitor socket TX/RX data
Date: Wed, 17 Jan 2018 12:32:03 -0800
Message-ID: <80c4b9fb-7938-782f-6b5f-012db04c41e6@gmail.com>
References: <20180112175029.21531.54693.stgit@john-Precision-Tower-5810>
 <20180112181111.21531.25665.stgit@john-Precision-Tower-5810>
 <20180117022545.7c2tlipmugcfeyme@ast-mbp>
 <1df36069-4b7c-5244-8d49-37ad831a298e@gmail.com>
 <20180117062013.upgi7q5qia6tuawc@ast-mbp>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Cc: borkmann@iogearbox.net, ast@kernel.org, netdev@vger.kernel.org,
        kafai@fb.com
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail-ot0-f171.google.com ([74.125.82.171]:42854 "EHLO
        mail-ot0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752578AbeAQUcG (ORCPT
        <rfc822;netdev@vger.kernel.org>); Wed, 17 Jan 2018 15:32:06 -0500
Received: by mail-ot0-f171.google.com with SMTP id s3so18014169otc.9
        for <netdev@vger.kernel.org>; Wed, 17 Jan 2018 12:32:06 -0800 (PST)
In-Reply-To: <20180117062013.upgi7q5qia6tuawc@ast-mbp>
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 01/16/2018 10:20 PM, Alexei Starovoitov wrote:
> On Tue, Jan 16, 2018 at 09:49:16PM -0800, John Fastabend wrote:
>>
>>> but this program will see only first SG ?
>>
>> Correct, to read further into the msg we would need to have a helper
>> or some other way to catch reads/writes past the first 4k and read
>> the next sg. We have the same limitation in cls_bpf.
>>
>> I have started a patch on top of this series with the current working
>> title msg_apply_bytes(int bytes). This would let us apply a verdict to
>> a set number of bytes instead of the entire msg. By calling
>> msg_apply_bytes(data_end - data) a user who needs to read an entire msg
>> could do this in 4k chunks until the entire msg is passed through the
>> bpf prog.
> 
> good idea.
> I think would be good to add this helper as part of this patch set
> to make sure there is a way for user to look through the whole
> tcp stream if the program really wants to.

Sure I'll add it to this series.

> I understand that program cannot examine every byte anyway
> due to lack of loops and helpers, but this part of sockmap api
> should still provide an interface from day one.
> One example would be the program parsing http2 or similar
> where in the header it sees length. Then it can do
> msg_apply_bytes(length)
> to skip the bytes it processed, but still continue within
> the same 64Kbyte chunk when 0 < length < 64k
> 

Yep, this is my use case.

>>> and it's typically going to be one page only ?
>>
>> yep
>>
>>> then what's the value of waiting for MAX_SKB_FRAGS ?
>>>
>> Its not waiting for MAX_SKB_FRAGS its simple copying up to MAX_SKB_FRAGS
>> pages in one call if possible. It seems better to me to run this loop
>> over as much data as we can.
> 
> agree on trying to do MAX_SKB_FRAGS as a 'processing unit',
> but program should still be able to skip or redirect 
> parts of the bytes and not the whole 64k chunk.
> From program point of view it should never see or worry about
> SG list boundaries whereas right now it seems that below code
> is dealing with them (though program doesn't know where sg ends):
> 

The apply_bytes() helper allows for skip and redirect of
parts of the bytes and not the whole 64k chunk.

>> +
>> +		switch (eval) {
>> +		case __SK_PASS:
>> +			sg_mark_end(sg + sg_num - 1);

All this does is tell the TCP sendpage call that after this sg entry
we don't have anymore data so go ahead and flush it.

>> +			err = bpf_tcp_push(sk, sg, &sg_curr, flags, true);
>> +			if (unlikely(err)) {
>> +				copied -= free_sg(sk, sg, sg_curr, sg_num);
>> +				goto out_err;
>> +			}
>> +			break;
>> +		case __SK_REDIRECT:
>> +			sg_mark_end(sg + sg_num - 1);
>> +			goto do_redir;
> ...
>>>> +static int bpf_tcp_ulp_register(void)
>>>> +{
>>>> +	tcp_bpf_proto = tcp_prot;
>>>> +	tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
>>>> +	tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
>>>> +	return tcp_register_ulp(&bpf_tcp_ulp_ops);
>>>
>>> I don't see corresponding tcp_unregister_ulp().
>>>
>>
>> There is none. tcp_register_ulp() adds the bpf_tcp_ulp to the list of
>> available ULPs and never removes it. To remove it we would have to
>> keep a ref count on the reg/unreg calls. This would require a couple
>> more patches to the ULP infra and not sure it hurts to leave the ULP
>> reference around...
>>
>> Maybe a follow on patch? Or else it could be a patch in front of this
>> patch.
> 
> I see. I'm ok with leaving that for latter.
> It doesn't hurt to keep it registered. Please add a comment though.
> 

OK will add comment.