netdev.vger.kernel.org archive mirror
From: Jesper Dangaard Brouer <jbrouer@redhat.com>
To: Yan Zhai <yan@cloudflare.com>, Jesper Dangaard Brouer <hawk@kernel.org>
Cc: brouer@redhat.com,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	netdev@vger.kernel.org, Paolo Abeni <pabeni@redhat.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Wander Lairson Costa <wander@redhat.com>,
	linux-kernel@vger.kernel.org,
	kernel-team <kernel-team@cloudflare.com>
Subject: Re: [RFC PATCH 2/2] softirq: Drop the warning from do_softirq_post_smp_call_flush().
Date: Wed, 16 Aug 2023 23:02:34 +0200
Message-ID: <22d992aa-2b65-0de3-b88c-fd216ae0218e@redhat.com>
In-Reply-To: <CAO3-PbpbrK6FAACw5TQyBxJ6jgO7_bhLFuPVAziUE+40_o_GnA@mail.gmail.com>



On 16/08/2023 17.15, Yan Zhai wrote:
> On Wed, Aug 16, 2023 at 9:49 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>> On 15/08/2023 14.08, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
>>>> This is an undesired situation and it has been attempted to avoid the
>>>> situation in which ksoftirqd becomes scheduled. This changed since
>>>> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
>>>> and now a threaded interrupt handler will handle soft interrupts at its
>>>> end even if ksoftirqd is pending. That means that they will be processed
>>>> in the context in which they were raised.
>>>
>>> $ git describe --contains d15121be74856
>>> v6.5-rc1~232^2~4
>>>
>>> That revert basically removes the "overload" protection that was added
>>> to cope with DDoS situations in Aug 2016 (Cc. Cloudflare).  As described
>>> in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
>>> ksoftirqd do its job") in UDP overload situations when UDP socket
>>> receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and almost
>>> doesn't process packets (because softirq steals CPU/sched time from UDP
>>> pid).  Warning Cloudflare (Cc) as this might affect their production
>>> use-cases, and I recommend getting involved to evaluate the effect of
>>> these changes.
>>>
>>
>> I did some testing on net-next (with commit d15121be74856 ("Revert
>> "softirq: Let ksoftirqd do its job"")) using UDP pktgen + udp_sink.
>>
>> And I observe the old overload issue occurring again, where the
>> userspace process (udp_sink) processes very few packets when running on
>> the *same* CPU as the NAPI-RX/IRQ processing.  The perf report "comm"
>> clearly shows that NAPI runs in the context of the "udp_sink" process,
>> stealing its sched time. (Same CPU around 3Kpps and diff CPU 1722Kpps,
>> see details below).
>> What happens is that NAPI takes 64 packets and queues them to the
>> udp_sink process *socket*; the udp_sink process *wakes up*, processes 1
>> packet from the socket queue, and on exit (__local_bh_enable_ip) runs
>> softirq, which starts NAPI (to again process 64 packets... repeat).
>>
> I think there are two scenarios to consider:
>
> 1. Actual DoS scenario. In this case, we would drop DoS packets
> through XDP, which might actually relieve the stress. According to
> Marek's blog XDP can indeed drop 10M pps [1] so it might not steal too
> much time. This is also something I would like to validate again since

Yes, using XDP to drop packets will/should relieve the stress, as it
basically can discard some of the 64 packets processed by NAPI vs the 1
packet received by userspace (which re-triggers NAPI), giving a better
balance.

> I cannot tell if those tests were performed before or after the
> reverted commit.

Marek's tests most likely included patch 4cd13c21b207 ("softirq: Let
ksoftirqd do its job"), as the blog is from 2018 and the patch from 2016,
but it shouldn't matter much.


> 2. Legit elephant flows (so they should not just be dropped). This one
> is closer to what you tested above, and it is a much harder issue
> since packets are legit and should not be dropped early at XDP. Letting
> the scheduler move affected processes away seems to be the non-optimal
> but straightforward answer for now. However, I suspect this would impose
> an overload issue for those programmed with RFS or ARFS, since flows
> would "follow" the processes. They probably have to force threaded
> NAPI for tuning.
>

True, this is the case I don't know how to solve.

For UDP packets it is NOT optimal to let the process "follow"/run on the
NAPI-RX CPU. For TCP traffic it is faster to run on the same CPU, which
could be related to the GRO effect, or simply that tcp_recvmsg gets a
stream of data (before it invokes __local_bh_enable_ip causing do_softirq).

I have also tested with netperf UDP packets[2] in a scenario that
doesn't cause "overload" and where the CPU has idle cycles.  When
UDP-netserver is running on the same CPU as NAPI, I see approx 38%
(82020/216362) UdpRcvbufErrors [3] (and 2.8% on separate CPUs).  Sure, I
could increase the buffer size, but the point is that NAPI can enqueue 64
packets while the UDP receiver dequeues 1 packet.
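
To illustrate why a larger buffer does not help, here is a toy model in
Python (not kernel code). It assumes the worst case where every receiver
wakeup is preceded by a full NAPI poll of 64 packets; the function name
`simulate` and the slot-based buffer (`rcvbuf_slots`) are made up for
illustration and do not correspond to any kernel structure.

```python
def simulate(cycles, napi_budget=64, dequeue_per_wakeup=1, rcvbuf_slots=128):
    """Return (delivered, dropped) after `cycles` poll/wakeup rounds."""
    queued = delivered = dropped = 0
    for _ in range(cycles):
        # NAPI poll: enqueue up to a full budget; anything that does not
        # fit in the socket buffer is counted as a receive-buffer drop.
        accepted = min(napi_budget, rcvbuf_slots - queued)
        queued += accepted
        dropped += napi_budget - accepted
        # Receiver wakeup: dequeue, after which the next poll runs.
        taken = min(dequeue_per_wakeup, queued)
        queued -= taken
        delivered += taken
    return delivered, dropped


if __name__ == "__main__":
    for slots in (128, 1024):
        delivered, dropped = simulate(1000, rcvbuf_slots=slots)
        print(f"rcvbuf_slots={slots}: delivered={delivered} dropped={dropped}")
```

In this model the delivered count is bounded by the number of wakeups
regardless of `rcvbuf_slots`; a bigger buffer only delays the point where
drops start.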

This reminded me that the kernel has a recvmmsg (extra "m") syscall for
multiple packets.  I tested this (as udp_sink has support), but no
luck. This is because internally the kernel (do_recvmmsg) is just a
loop over ___sys_recvmsg/__skb_recv_udp, which takes a BH-spinlock per
packet, and each unlock invokes __local_bh_enable_ip/do_softirq.  I guess
we/netdev could fix recvmmsg() to bulk-dequeue from the socket queue (the
BH-socket unlock is what triggers __local_bh_enable_ip/do_softirq) and
then have a solution for UDP(?).
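
A rough way to see the difference between the current per-packet loop and
a bulk dequeue is the following Python sketch. It is a toy model, not the
kernel implementation: `recv_loop`, `recv_bulk` and `run_softirq` are
hypothetical names, and it assumes a softirq is always pending (overload),
so every BH re-enable runs a full NAPI poll of 64 packets.

```python
from collections import deque

NAPI_BUDGET = 64  # packets NAPI tries to enqueue per softirq run


def run_softirq(sock_queue, rcvbuf_slots):
    """Model one NAPI poll; return how many packets did not fit."""
    accepted = min(NAPI_BUDGET, rcvbuf_slots - len(sock_queue))
    sock_queue.extend(range(accepted))
    return NAPI_BUDGET - accepted


def recv_loop(sock_queue, vlen, rcvbuf_slots):
    """Per-packet dequeue: each packet ends with a BH re-enable."""
    got = dropped = 0
    for _ in range(vlen):
        if not sock_queue:
            break
        sock_queue.popleft()
        got += 1
        dropped += run_softirq(sock_queue, rcvbuf_slots)
    return got, dropped


def recv_bulk(sock_queue, vlen, rcvbuf_slots):
    """Bulk dequeue: take up to vlen packets, then one BH re-enable."""
    got = min(vlen, len(sock_queue))
    for _ in range(got):
        sock_queue.popleft()
    return got, run_softirq(sock_queue, rcvbuf_slots)


if __name__ == "__main__":
    print("loop:", recv_loop(deque(range(64)), 32, 64))
    print("bulk:", recv_bulk(deque(range(64)), 32, 64))
```

Both variants hand 32 packets to userspace, but in this model the
per-packet loop gives NAPI 32 chances to overflow the socket queue, while
the bulk variant gives it one.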


[2] netperf -H 198.18.1.1 -D1 -l 1200 -t UDP_STREAM -T 0,0 -- -m 1472 -N -n

[3]
$ nstat -n && sleep 1 && nstat
#kernel
IpInReceives                    216362             0.0
IpInDelivers                    216354             0.0
UdpInDatagrams                  134356             0.0
UdpInErrors                     82020              0.0
UdpRcvbufErrors                 82020              0.0
IpExtInOctets                   324600000          0.0
IpExtInNoECTPkts                216400             0.0

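
For reference, the drop percentages can be recomputed from the nstat
counters as UdpRcvbufErrors / IpInReceives. The small Python helper below
does that for the netperf sample [3] and for the two udp_sink samples
quoted further down; the helper name `drop_ratio` is only for
illustration.

```python
def drop_ratio(udp_rcvbuf_errors, ip_in_receives):
    """Fraction of received IP packets dropped at the UDP receive buffer."""
    return udp_rcvbuf_errors / ip_in_receives


# (UdpRcvbufErrors, IpInReceives) taken from the nstat samples in this mail
samples = {
    "netperf, same CPU [3]": (82020, 216362),
    "udp_sink, same CPU": (2828118, 2831056),
    "udp_sink, different CPU": (596280, 2318560),
}

if __name__ == "__main__":
    for name, (errors, received) in samples.items():
        print(f"{name}: {100 * drop_ratio(errors, received):.1f}% dropped")
```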

> [1] https://blog.cloudflare.com/how-to-drop-10-million-packets/
> 
>>
>>> I do realize/acknowledge that the reverted patch caused other latency
>>> issues, given it was a "big-hammer" approach affecting other softirq
>>> processing (as can be seen by e.g. the watchdog fixes patches).
>>> Thus, the revert makes sense, but how to regain the "overload"
>>> protection such that RX networking cannot starve processes reading from
>>> the socket? (is this what Sebastian's patchset does?)
>>>
>>
>> I'm no expert in the sched / softirq area of the kernel, but I'm willing
>> to help test different solutions that can regain the "overload"
>> protection, e.g. avoid that packet processing "falls-off-an-edge" (and
>> thus opens the kernel to being DDoS'ed easily).
>> Is this what Sebastian's patchset does?
>>
>>
>>>
>>> Thread link for people Cc'ed:
>>> https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
>>
>> --Jesper
>> (some testlab results below)
>>
>> [udp_sink]
>> https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
>>
>>
>> When udp_sink runs on the same CPU as NAPI/softirq:
>>    - UdpInDatagrams: 2,948 pps
>>
>> $ nstat -n && sleep 1 && nstat
>> #kernel
>> IpInReceives                    2831056            0.0
>> IpInDelivers                    2831053            0.0
>> UdpInDatagrams                  2948               0.0
>> UdpInErrors                     2828118            0.0
>> UdpRcvbufErrors                 2828118            0.0
>> IpExtInOctets                   130206496          0.0
>> IpExtInNoECTPkts                2830576            0.0
>>
>> When udp_sink runs on a different CPU than NAPI-RX:
>>    - UdpInDatagrams: 1,722,307 pps
>>
>> $ nstat -n && sleep 1 && nstat
>> #kernel
>> IpInReceives                    2318560            0.0
>> IpInDelivers                    2318562            0.0
>> UdpInDatagrams                  1722307            0.0
>> UdpInErrors                     596280             0.0
>> UdpRcvbufErrors                 596280             0.0
>> IpExtInOctets                   106634256          0.0
>> IpExtInNoECTPkts                2318136            0.0
>>
>>
> 
> 


