netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: brakmo <brakmo@fb.com>, netdev <netdev@vger.kernel.org>,
	Martin Lau <kafai@fb.com>, Alexei Starovoitov <ast@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
Date: Mon, 25 Mar 2019 01:33:49 -0700	[thread overview]
Message-ID: <5aec97f1-545a-f898-fdd9-c5821d5c6e39@gmail.com> (raw)
In-Reply-To: <20190324161911.h5eotv2j7f5avcpm@ast-mbp>



On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 10:36:24PM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
>>> On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 03/23/2019 01:05 AM, brakmo wrote:
>>>>> This patchset adds support for propagating congestion notifications (cn)
>>>>> to TCP from cgroup inet skb egress BPF programs.
>>>>>
>>>>> Current cgroup skb BPF programs cannot trigger TCP congestion window
>>>>> reductions, even when they drop a packet. This patch-set adds support
>>>>> for cgroup skb BPF programs to send congestion notifications in the
>>>>> return value when the packets are TCP packets. Rather than the
>>>>> current 1 for keeping the packet and 0 for dropping it, they can
>>>>> now return:
>>>>>     NET_XMIT_SUCCESS    (0)    - continue with packet output
>>>>>     NET_XMIT_DROP       (1)    - drop packet and do cn
>>>>>     NET_XMIT_CN         (2)    - continue with packet output and do cn
>>>>>     -EPERM                     - drop packet
>>>>>
>>>>
>>>> I believe I already mentioned this model is broken, if you have any virtual
>>>> device before the cgroup BPF program.
>>>>
>>>> Please think about offloading the pacing/throttling in the NIC,
>>>> there is no way we will report back to tcp stack instant notifications.
>>>
>>> I don't think 'offload to google proprietary nic' is a suggestion
>>> that folks can practically follow.
>>> Very few NICs can offload pacing to hw and there are plenty of limitations.
>>> This patch set represents a pure sw solution that works and scales to millions of flows.
>>>
>>>> This patch series is going way too far for my taste.
>>>
>>> I would really appreciate if you can do a technical review of the patches.
>>> Our previous approach didn't quite work due to complexity around locked/non-locked socket.
>>> This is a cleaner approach.
>>> Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
>>> This approach is better since it works for other protocols and can be
>>> used by qdiscs w/o any bpf.
>>>
>>>> This idea is not new, you were at Google when it was experimented by Nandita and
>>>> others, and we know it is not worth the pain.
>>>
>>> google networking needs are different from the rest of the world.
>>>
>>
>> This has nothing to do with Google against Facebook really, it is a bit sad you react like this Alexei.
>>
>> We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers to show that this strategy
>> is not enough.
>>
>> All recent discussions about ECN (TCP Prague and SCE) do not _require_ instant feedback to the sender.
>>
>> Please show us experimental results before we have to carry these huge hacks.
> 
> I have a feeling we're looking at different patches.
> All of your comments seems to be talking about something else.
> I have a hard time connecting them to this patch set.
> Have you read the cover letter and patches of _this_ set?
> 
> The results are in the cover letter and there is no change in behavior in networking stack.
> Patches 1, 2, 3 are simple refactoring of bpf-cgroup return codes.
> Patch 4 is the only one that touches net/ipv4/ip_output.c only to pass
> the return code to tcp/udp layer.
> The concept works for both despite of what you're claiming being tcp only.
> 
> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> Please realize that existing qdiscs already doing this.
> The patchset allows bpf-cgroup to do the same.

Not the same thing I am afraid.

> 
> If you were arguing about sysctl knob in patch 5 that I would understand.
> Instead of a knob we can hard code the value for now. But please explain
> what issues you see with that knob.
> 
> Also to be fair I applied your
> commit f11216b24219 ("bpf: add skb->tstamp r/w access from tc clsact and cg skb progs")
> without demanding 'experimental results' because the feature makes sense.
> Yet, you folks didn't produce a _single_ example or test result since then.
> This is not cool.

This patch did not change fast path at all Alexei

Look at [PATCH bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls

This part changes ip stacks.


  reply	other threads:[~2019-03-25  8:33 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
2019-03-23  8:05 ` [PATCH bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY brakmo
2019-03-23  8:05 ` [PATCH bpf-next 2/7] bpf: cgroup inet skb programs can return 0 to 3 brakmo
2019-03-23  8:05 ` [PATCH bpf-next 3/7] bpf: Update __cgroup_bpf_run_filter_skb with cn brakmo
2019-03-23  8:05 ` [PATCH bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls brakmo
2019-03-23  8:05 ` [PATCH bpf-next 5/7] bpf: sysctl for probe_on_drop brakmo
2019-03-23  8:05 ` [PATCH bpf-next 6/7] bpf: Add cn support to hbm_out_kern.c brakmo
2019-03-23  8:05 ` [PATCH bpf-next 7/7] bpf: Add more stats to HBM brakmo
2019-03-23  9:12 ` [PATCH bpf-next 0/7] bpf: Propagate cn to TCP Eric Dumazet
2019-03-23 15:41   ` Alexei Starovoitov
2019-03-24  5:36     ` Eric Dumazet
2019-03-24 16:19       ` Alexei Starovoitov
2019-03-25  8:33         ` Eric Dumazet [this message]
2019-03-25  8:48           ` Eric Dumazet
2019-03-26  4:27             ` Alexei Starovoitov
2019-03-26  8:06               ` Eric Dumazet
2019-03-26 15:07                 ` Alexei Starovoitov
2019-03-26 15:43                   ` Eric Dumazet
2019-03-26 17:01                     ` Alexei Starovoitov
2019-03-26 18:07                       ` Eric Dumazet
2019-03-26  8:13               ` Eric Dumazet
2019-03-24  5:48     ` Eric Dumazet
2019-03-24  1:14   ` Lawrence Brakmo
2019-03-24  5:58     ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5aec97f1-545a-f898-fdd9-c5821d5c6e39@gmail.com \
    --to=eric.dumazet@gmail.com \
    --cc=Kernel-team@fb.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=ast@fb.com \
    --cc=brakmo@fb.com \
    --cc=daniel@iogearbox.net \
    --cc=kafai@fb.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).