References: <20201119162527.GB9877@fuller.cnet> <6e43ea1e-b166-f60e-9dd1-e907108a3b12@iogearbox.net>
In-Reply-To: <6e43ea1e-b166-f60e-9dd1-e907108a3b12@iogearbox.net>
Reply-To: marcel@redhat.com
From: Marcel Apfelbaum
Date: Sat, 21 Nov 2020 07:39:04 +0200
Subject: Re: [kuba@kernel.org: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask]
To: Daniel Borkmann
Cc: Marcel Apfelbaum, Saeed Mahameed, Jakub Kicinski, netdev@vger.kernel.org, Paolo Abeni, Jonathan Corbet, "David S.
 Miller", Shuah Khan, Marcelo Tosatti
X-Mailing-List: netdev@vger.kernel.org

Hi Daniel,

On Fri, Nov 20, 2020 at 11:56 PM Daniel Borkmann wrote:
>
> On 11/20/20 6:39 PM, Marcel Apfelbaum wrote:
> > +netdev
> > [Sorry for the second email, I failed to set the text-only mode]
> > On Fri, Nov 20, 2020 at 7:30 PM Marcel Apfelbaum wrote:
> [...]
> >>> ---------- Forwarded message ----------
> >>> From: Jakub Kicinski
> >>> To: Paolo Abeni
> >>> Cc: Saeed Mahameed, netdev@vger.kernel.org, Jonathan Corbet, "David S. Miller", Shuah Khan, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Marcelo Tosatti, Daniel Borkmann
> >>> Bcc:
> >>> Date: Wed, 4 Nov 2020 11:42:26 -0800
> >>> Subject: Re: [PATCH net-next v2 0/3] net: introduce rps_default_mask
> >>> On Wed, 04 Nov 2020 18:36:08 +0100 Paolo Abeni wrote:
> >>>> On Tue, 2020-11-03 at 08:52 -0800, Jakub Kicinski wrote:
> >>>>> On Tue, 03 Nov 2020 16:22:07 +0100 Paolo Abeni wrote:
> >>>>>> The relevant use case is a host running containers (with the related
> >>>>>> orchestration tools) in an RT environment. Virtual devices (veths, ovs
> >>>>>> ports, etc.) are created by the orchestration tools at run-time.
> >>>>>> Critical processes are allowed to send packets/generate outgoing
> >>>>>> network traffic - but any interrupt is moved away from the related
> >>>>>> cores, so that usual incoming network traffic processing does not
> >>>>>> happen there.
> >>>>>>
> >>>>>> Still an xmit operation on a virtual device may be transmitted via ovs
> >>>>>> or veth, with the relevant forwarding operation happening in a softirq
> >>>>>> on the same CPU originating the packet.
> >>>>>>
> >>>>>> RPS is configured (even) on such virtual devices to move away the
> >>>>>> forwarding from the relevant CPUs.
> >>>>>>
> >>>>>> As Saeed noted, such configuration could be possibly performed via some
> >>>>>> user-space daemon monitoring network devices and network namespaces
> >>>>>> creation. That will be anyway prone to some race: the orchestration tool
> >>>>>> may create and enable the netns and virtual devices before the daemon
> >>>>>> has properly set the RPS mask.
> >>>>>>
> >>>>>> In the latter scenario some packet forwarding could still slip in the
> >>>>>> relevant CPU, causing measurable latency. In all non RT scenarios the
> >>>>>> above will be likely irrelevant, but in the RT context that is not
> >>>>>> acceptable - e.g. it causes in real environments latency above the
> >>>>>> defined limits, while the proposed patches avoid the issue.
> >>>>>>
> >>>>>> Do you see any other simple way to avoid the above race?
> >>>>>>
> >>>>>> Please let me know if the above answers your doubts,
> >>>>>
> >>>>> Thanks, that makes it clearer now.
> >>>>>
> >>>>> Depending on how RT-aware your container management is it may or may not
> >>>>> be the right place to configure this, as it creates the veth interface.
> >>>>> Presumably it's the container management which does the placement of
> >>>>> the tasks to cores, why is it not setting other attributes, like RPS?
> >>
> >> The CPU isolation is done statically at system boot by setting Linux kernel parameters,
> >> so the container management component, in this case the Machine Configuration Operator (for Openshift)
> >> or the K8s counterpart can't really help. (actually they would help if a global RPS mask would exist)
> >>
> >> I tried to tweak the rps_cpus mask using the container management stack, but there
> >> is no sane way to do it, please let me get a little into the details.
> >>
> >> The k8s orchestration component that deals with injecting the network device(s) into the
> >> container is CNI, which is interface based and implemented by a lot of plugins, making it
> >> hardly feasible to go over all the existing plugins and change them. Also what about
> >> the 3rd party ones?
> >>
> >> Writing a new CNI plugin and chain it into the existing one is also not an option AFAIK,
> >> they work at the network level and do not have access to sysfs (they handle the network namespaces).
> >> Even if it would be possible (I don't have a deep CNI understanding), it will require a cluster global configuration
> >> that is actually needed only for some of the cluster nodes.
>
> CNI chaining would be ugly, agree, but in a typical setting you'd have the CNI plugin
> itself which is responsible to set up the Pod for communication to the outside world;
> part of it would be creation of devices and moving them into the target netns, and
> then you also typically have an agent running in kube-system namespace in the hostns
> to which the CNI plugin talks to via IPC e.g. to set up IPAM and other state. Such
> agent usually sets up all sort of knobs for the networking layer upon bootstrap.

The main issue is that CNI is networking related, but the way to set the RPS
is by writing to /sys, which is not considered network namespace related
and is read-only inside the containers.

> Assuming you have a cluster where only some of the nodes have RT kernel, these would
> likely have special node annotations in K8s so you could select them to run certain
> workloads on it.. couldn't such agent be taught to be RT-aware and set up all the
> needed knobs?

I do agree this part may be doable; sadly, it is by far not the biggest problem.

> Agree it's a bit ugly to change the relevant CNI plugins to be RT-aware,
> but what if you also need other settings in future aside from RPS mask for RT? At some
> point you'd likely end up having to extend these anyway, no?
>

All networking changes are fair play; however, setting the RPS mask is related
to networking but is not a networking operation per se. It is a cross-domain
operation (network namespace/mount namespace).

Thank you for your response,
Marcel

[...]
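
For readers following the thread, here is a minimal sketch of the sysfs write under discussion, assuming the per-queue files live at /sys/class/net/<dev>/queues/rx-<n>/rps_cpus and accept a hex CPU bitmap in comma-separated 32-bit groups. The helper names cpus_to_rps_mask and set_rps_mask are hypothetical, not taken from the patches. It only illustrates the mechanism: it needs a writable /sys, so it has to run on the host rather than inside a container, and it does nothing about the race between device creation and the write.

```python
import glob
import os


def cpus_to_rps_mask(cpus):
    """Build a hex CPU bitmap string from an iterable of CPU ids.

    Hypothetical helper: emits comma-separated 32-bit groups, most
    significant group first, e.g. CPU 32 alone gives "1,00000000".
    """
    bits = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU ids must be non-negative")
        bits |= 1 << cpu
    digits = format(bits, "x")
    width = -(-len(digits) // 8) * 8  # round up to a multiple of 8 hex digits
    digits = digits.zfill(width)
    groups = [digits[i:i + 8] for i in range(0, width, 8)]
    groups[0] = groups[0].lstrip("0") or "0"  # drop padding on the top group
    return ",".join(groups)


def set_rps_mask(dev, cpus, sysfs_root="/sys/class/net"):
    """Write the mask to every rx queue of `dev`; return the paths written.

    Assumption: sysfs_root is writable, which the thread notes is not the
    case inside a container.
    """
    mask = cpus_to_rps_mask(cpus)
    pattern = os.path.join(sysfs_root, dev, "queues", "rx-*", "rps_cpus")
    written = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as f:
            f.write(mask)
        written.append(path)
    return written
```

A host-side agent would call something like set_rps_mask("veth0", [0, 1]) after the device appears, where "veth0" and the CPU list are placeholders; any packet forwarded before that call still lands on the originating CPU, which is the window the thread is concerned about.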