From: Jesper Dangaard Brouer <brouer@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: netdev@vger.kernel.org, jakub.kicinski@netronome.com,
"Michael S. Tsirkin" <mst@redhat.com>,
pavel.odintsov@gmail.com, Jason Wang <jasowang@redhat.com>,
mchan@broadcom.com, peter.waskiewicz.jr@intel.com,
Daniel Borkmann <borkmann@iogearbox.net>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Andy Gospodarek <andy@greyhouse.net>,
brouer@redhat.com
Subject: Re: [net-next V6 PATCH 0/5] New bpf cpumap type for XDP_REDIRECT
Date: Wed, 11 Oct 2017 10:06:08 +0200
Message-ID: <20171011100608.48bc1c86@redhat.com>
In-Reply-To: <a430c181-aa56-61a7-fc59-9b135bbb262b@gmail.com>
On Tue, 10 Oct 2017 23:10:39 -0700
John Fastabend <john.fastabend@gmail.com> wrote:
> On 10/10/2017 05:47 AM, Jesper Dangaard Brouer wrote:
> > Introducing a new way to redirect XDP frames. Notice how no driver
> > changes are necessary given the design of XDP_REDIRECT.
> >
> > This redirect map type is called 'cpumap', as it allows redirecting
> > XDP frames to remote CPUs. The remote CPU will do the SKB allocation
> > and start the network stack invocation on that CPU.
> >
> > This is a scalability and isolation mechanism that allows separating
> > the early driver network XDP layer from the rest of the netstack,
> > and assigning dedicated CPUs to this stage. The sysadmin
> > controls/configures the RX-CPU to NIC-RX-queue mapping (as usual)
> > via procfs smp_affinity, and how many queues exist via ethtool
> > --set-channels. Benchmarks show that a single CPU can handle approx
> > 11Mpps. Thus, assigning only two NIC RX-queues (and two CPUs) is
> > sufficient for handling 10Gbit/s wirespeed at the smallest packet
> > size, 14.88Mpps. Reducing the number of queues has the advantage
> > that more packets become available as a "bulk" per hard interrupt[1].
> >
> > [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
> >
> > Use-cases:
> >
> > 1. End-host based pre-filtering for DDoS mitigation. This is fast
> > enough to allow software to see and filter all packets at wirespeed.
> > Thus, no packets get silently dropped by hardware.
> >
> > 2. Given that NIC HW unevenly distributes packets across RX queues,
> > this mechanism can be used to redistribute load across CPUs. This
> > usually happens when the HW is unaware of a new protocol. This
> > resembles RPS (Receive Packet Steering), just faster, but with more
> > responsibility placed on the BPF program for correct steering.
>
> Hi Jesper,
>
> Another (somewhat meta) comment about the performance benchmarks. In
> one of the original threads you showed that the XDP cpu map outperformed
> RPS in TCP_CRR netperf tests. It was significant iirc in the mpps range.
Let me correct this. This is (significantly) faster than RPS, and it
has the same performance in the netperf TCP_CRR and TCP_RR tests, as
this is just invoking the network stack (on a remote CPU). Thus, I'm
very happy to see the same comparative performance. The netperf TCP_RR
test is actually the worst-case scenario, where the "hidden" bulking
doesn't work, and RPS is the best-case scenario. I've even left
several optimization opportunities for later.
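
Since use-case 2 above puts the steering decision in the BPF program,
here is a minimal sketch of what the BPF side of a cpumap redirect can
look like. Note this is only an illustration, not the exact
xdp_redirect_cpu sample from patch 5/5; the map size and the
hard-coded cpu_dest are placeholders, and a real program would derive
the destination CPU from e.g. the RSS-hash or the protocol.

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* cpumap: key = CPU index, value = queue size (set from userspace) */
struct bpf_map_def SEC("maps") cpu_map = {
	.type		= BPF_MAP_TYPE_CPUMAP,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(__u32),
	.max_entries	= 64,	/* placeholder upper bound on CPUs */
};

SEC("xdp")
int xdp_redirect_cpu_sketch(struct xdp_md *ctx)
{
	__u32 cpu_dest = 0;	/* placeholder: pick from hash/proto */

	/* Enqueue the frame to the remote CPU; that CPU allocates the
	 * SKB and invokes the regular network stack.
	 */
	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
}

char _license[] SEC("license") = "GPL";

Userspace still has to add an entry (with a queue size as the value)
for each CPU that should receive frames, before redirecting to it.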
> But, with this series we will skip GRO. Do you have any idea how this
> looks with other tests such as TCP_STREAM? I'm trying to understand
> if this is something that can be used in the general case or is more
> for the special case and will have to be enabled/disabled by the
> orchestration layer depending on workload/network conditions.
On my testlab server, the TCP_STREAM tests show the same results (full
10G with MTU-size packets). This is because my server is fast enough
and doesn't need the GRO aggregation to keep up; at MTU-size packets,
10Gbit/s wirespeed is "only" 812Kpps.
> My intuition is the general case will be slower due to lack of GRO. If
> this is the case any ideas how we could add GRO? Not needed in the
> initial patchset but trying to see if the two are mutually exclusive.
> I don't off-hand see an easy way to pull GRO into this feature.
Adding GRO _later_ is a big part of my plan. I haven't figured out the
exact code paths yet. The general idea is to perform partial sorting of
flows, based on the RSS-hash or something provided by the BPF prog.
NetFlix's extension to FreeBSD illustrates the GRO sorting problem
nicely[1], see the section "RSS Assisted LRO". For the record, my idea
is not based on theirs; I had this idea long before reading their
article.

I want to do partial sorting at many levels. E.g. the cpumap enqueue
can have 8 times 8 percpu packet queues (matching the 64-packet max
NAPI budget), sorted on some bits of the RSS-hash. The BPF prog
choosing a CPU destination is also a sorting step. The cpumap dequeue
kthread step, which needs to invoke a GRO netstack function, can also
perform a partial sorting step, plus implement a GRO flush point when
the queue runs empty.
[1] https://medium.com/netflix-techblog/serving-100-gbps-from-an-open-connect-appliance-cdb51dda3b99
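
To make the partial-sorting idea concrete, here is a rough sketch of
the bucketing step (this code does not exist anywhere yet; the frame
struct and the gro_receive/gro_flush callbacks are stand-ins for
whatever the real kernel hooks end up being):

#define NAPI_BUDGET	64
#define NR_BUCKETS	8	/* e.g. 3 bits of the RSS-hash */

struct frame {
	unsigned int rss_hash;
	void *data;
};

struct bucket {
	struct frame *slot[NAPI_BUDGET];
	int cnt;
};

/* Partial sort: O(n) hash-bit bucketing, no comparisons. Packets from
 * the same flow end up adjacent, which is what GRO needs to merge them.
 */
static void partial_sort(struct frame **burst, int n, struct bucket *b)
{
	int i;

	for (i = 0; i < n; i++) {
		unsigned int idx = burst[i]->rss_hash & (NR_BUCKETS - 1);

		b[idx].slot[b[idx].cnt++] = burst[i];
	}
}

/* Dequeue side: walk the buckets, feed frames to a GRO-receive-like
 * function, and flush GRO once the queue has run empty.
 */
static void dequeue_and_gro(struct bucket *b, int nr,
			    void (*gro_receive)(struct frame *),
			    void (*gro_flush)(void))
{
	int i, j;

	for (i = 0; i < nr; i++) {
		for (j = 0; j < b[i].cnt; j++)
			gro_receive(b[i].slot[j]);
		b[i].cnt = 0;
	}
	gro_flush();	/* flush point: queue ran empty */
}

The same trick can be applied both at cpumap enqueue (picking one of
the 8x8 percpu queues) and in the dequeue kthread before calling into
the stack.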
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer