Date: Tue, 13 Aug 2024 23:16:07 -0400
From: Willem de Bruijn
To: Martin Karsten, Stanislav Fomichev
Cc: netdev@vger.kernel.org, Joe Damato, amritha.nambiar@intel.com,
 sridhar.samudrala@intel.com, Alexander Lobakin, Alexander Viro,
 Breno Leitao, Christian Brauner, Daniel Borkmann, "David S. Miller",
 Eric Dumazet, Jakub Kicinski, Jan Kara, Jiri Pirko, Johannes Berg,
 Jonathan Corbet, "open list:DOCUMENTATION",
 "open list:FILESYSTEMS (VFS and infrastructure)", open list,
 Lorenzo Bianconi, Paolo Abeni, Sebastian Andrzej Siewior
Message-ID: <66bc21772c6bd_985bf294b0@willemb.c.googlers.com.notmuch>
In-Reply-To: <6f40b6df-4452-48f6-b552-0eceaa1f0bbc@uwaterloo.ca>
References: <20240812125717.413108-1-jdamato@fastly.com>
 <2bb121dd-3dcd-4142-ab87-02ccf4afd469@uwaterloo.ca>
 <5e52b556-fe49-4fe0-8bd3-543b3afd89fa@uwaterloo.ca>
 <6f40b6df-4452-48f6-b552-0eceaa1f0bbc@uwaterloo.ca>
Subject: Re: [RFC net-next 0/5] Suspend IRQs during preferred busy poll

Martin Karsten wrote:
> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > On 08/12, Martin Karsten wrote:
> >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> >>> On 08/12, Martin Karsten wrote:
> >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> >>>>> On 08/12, Martin Karsten wrote:
> >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> >>>>>>> On 08/12, Joe Damato wrote:
> >>>>>>>> Greetings:
>
> [snip]
>
> >>>>>>> Maybe expand more on which code paths we are trying to improve?
> >>>>>>> The existing busy polling code is not super readable, so it would
> >>>>>>> be nice to simplify it a bit in the process (if possible) instead
> >>>>>>> of adding one more tunable.
> >>>>>>
> >>>>>> There are essentially three possible loops for network processing:
> >>>>>>
> >>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline
> >>>>>> functionality
> >>>>>>
> >>>>>> 2) timer -> softirq -> napi poll; this is the deferred irq
> >>>>>> processing scheme with the shortcomings described above
> >>>>>>
> >>>>>> 3) epoll -> busy-poll -> napi poll
> >>>>>>
> >>>>>> If a system is configured for 1), not much can be done, as it is
> >>>>>> difficult to interject anything into this loop without adding
> >>>>>> state and side effects. This is what we tried for the paper, but
> >>>>>> it ended up being a hack.
> >>>>>>
> >>>>>> If however the system is configured for irq deferral, Loops 2) and
> >>>>>> 3) "wrestle" with each other for control. Injecting the larger
> >>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this
> >>>>>> in favour of Loop 3) and creates the nice pattern described above.
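
[For concreteness, Loop 2) is configured through two existing
per-device sysfs knobs. A minimal sketch; the device name and the
values are placeholders for illustration, not recommendations from
this thread:]

    #include <stdio.h>

    /* Write a single value to a sysfs attribute. */
    static int write_sysfs(const char *path, const char *val)
    {
    	FILE *f = fopen(path, "w");

    	if (!f)
    		return -1;
    	fputs(val, f);
    	return fclose(f);
    }

    int main(void)
    {
    	/* Hold off hard IRQs for up to 2 consecutive NAPI polls. */
    	write_sysfs("/sys/class/net/eth0/napi_defer_hard_irqs", "2");
    	/* Re-poll from the deferral timer every 200us (value in ns). */
    	write_sysfs("/sys/class/net/eth0/gro_flush_timeout", "200000");
    	return 0;
    }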
> >>>>>
> >>>>> And you hit (2) when the epoll goes to sleep and/or when the
> >>>>> userspace isn't fast enough to keep up with the timer, presumably?
> >>>>> I wonder if we need to use this opportunity and do a proper API, as
> >>>>> Joe hints in the cover letter. Something over netlink to say "I'm
> >>>>> gonna busy-poll on this queue / napi_id and with this timeout". And
> >>>>> then we can essentially make gro_flush_timeout per queue (and avoid
> >>>>> napi_resume_irqs/napi_suspend_irqs). The existing gro_flush_timeout
> >>>>> feels too hacky already :-(
> >>>>
> >>>> If someone would implement the necessary changes to make these
> >>>> parameters per-napi, this would improve things further, but note
> >>>> that the current proposal gives strong performance across a range
> >>>> of workloads, which is otherwise difficult or impossible to achieve.
> >>>
> >>> Let's see what other people have to say. But we tried to do a similar
> >>> setup at Google recently and getting all these parameters right was
> >>> not trivial. Joe's recent patch series to push some of these into the
> >>> epoll context is a step in the right direction. It would be nice to
> >>> have a more explicit interface for users to express busy polling
> >>> preference, vs chasing a bunch of global tunables and fighting
> >>> against softirq wakeups.
> >>
> >> One of the goals of this patch set is to reduce parameter tuning and
> >> make the parameter setting independent of workload dynamics, so it
> >> should make things easier. This is of course notwithstanding that
> >> per-napi settings would be even better.

I don't follow how adding another tunable reduces parameter tuning.

> >>
> >> If you are able to share more details of your previous experiments
> >> (here or off-list), I would be very interested.
>
> > We went through a similar exercise of trying to get the tail latencies
> > down. Starting with SO_BUSY_POLL, then switching to the per-epoll
> > variant (except we went with a hard-coded napi_id argument instead of
> > tracking) and trying to get a workable set of budget/timeout/gro_flush.
> > We were fine with burning all the cpu capacity we had and no sleep at
> > all, so we ended up having a bunch of special cases in the epoll loop
> > to avoid the sleep.
>
> > But we were trying to make a different model work (the one you mention
> > in the paper as well) where the userspace busy-pollers are just running
> > napi_poll on one cpu and the actual work is consumed by the userspace
> > on a different cpu. (we had two epoll fds - one with napi_id=xxx and no
> > sockets to drive napi_poll and another epoll fd with actual sockets for
> > signaling).
>
> > This mode has a different set of challenges with the socket lock, the
> > socket rx queue and the backlog processing :-(
>
> I agree. That model has challenges and is extremely difficult to tune
> right.
>
> >>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the
> >>>> sake of an individual queue or application to make sure that IRQ
> >>>> suspension is enabled/disabled right away when the state of the
> >>>> system changes from busy to idle and back.
> >>>
> >>> Can we not handle everything in napi_busy_loop? If we can mark some
> >>> napi contexts as "explicitly polled by userspace with a larger defer
> >>> timeout", we should be able to do better compared to the current
> >>> NAPI_F_PREFER_BUSY_POLL, which is more like "this particular
> >>> napi_poll call is user busy polling".
> >>
> >> Then either the application needs to be polling all the time (wasting
> >> cpu cycles) or latencies will be determined by the timeout.
> >>
> >> Only when switching back and forth between polling and interrupts is
> >> it possible to get low latencies across a large spectrum of offered
> >> loads without burning cpu cycles at 100%.
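
[As an aside, a sketch of the per-epoll plumbing from Joe's earlier
series mentioned above: the EPIOCSPARAMS ioctl, available since 6.9.
The fallback definitions cover older uapi headers, and the parameter
values are illustrative assumptions, not values from this thread:]

    #include <stdint.h>
    #include <sys/epoll.h>
    #include <sys/ioctl.h>

    #ifndef EPIOCSPARAMS
    /* Fallback for older headers; mirrors include/uapi/linux/eventpoll.h. */
    struct epoll_params {
    	uint32_t busy_poll_usecs;
    	uint16_t busy_poll_budget;
    	uint8_t  prefer_busy_poll;
    	uint8_t  __pad;
    };
    #define EPOLL_IOC_TYPE 0x8A
    #define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
    #endif

    /* Attach busy-poll parameters to one epoll instance. */
    static int enable_epoll_busy_poll(int epfd)
    {
    	struct epoll_params p = {
    		.busy_poll_usecs  = 64, /* busy-poll up to 64us per call */
    		.busy_poll_budget = 64, /* packet budget per poll */
    		.prefer_busy_poll = 1,  /* request NAPI_F_PREFER_BUSY_POLL */
    	};

    	return ioctl(epfd, EPIOCSPARAMS, &p);
    }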
>
> > Ah, I see what you're saying, yes, you're right. In this case ignore
> > my comment about ep_suspend_napi_irqs/napi_resume_irqs.
>
> Thanks for probing and double-checking everything! Feedback is important
> for us to properly document our proposal.
>
> > Let's see how other people feel about the per-dev irq_suspend_timeout.
> > Properly disabling napi during busy polling is super useful, but it
> > would still be nice to plumb irq_suspend_timeout via the epoll context
> > or have it set on a per-napi basis imho.
>
> Fingers crossed. I hope this patch will be accepted, because it has
> practical performance and efficiency benefits, and that this will
> further increase the motivation to re-design the entire irq
> defer(/suspend) infrastructure for per-napi settings.

Overall, the idea of keeping interrupts disabled during event
processing is very interesting. Hopefully the interface can be made
more intuitive, or at least documented more easily. I had to read the
kernel patches to (perhaps) fully grasp it.

Another +1 on the referenced paper. It points out a specific difference
in behavior that is unrelated to the protection domain, rather than
making a straightforward kernel vs user argument.

The paper also had some explanation that may be clearer for a commit
message than the current cover letter:

"user-level network stacks put the application in charge of the entire
network stack processing (cf. Section 2). Interrupts are disabled and
the application coordinates execution by alternating between processing
existing requests and polling the RX queues for new data"

[This series extends this behavior to kernel busy polling, while
falling back onto interrupt processing to limit CPU overhead.]

"Instead of re-enabling the respective interrupt(s) as soon as
epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay
masked until a subsequent epoll_wait() call comes up empty, i.e., no
events of interest are found and the application thread is about to be
blocked."

"A fallback technical approach would use a kernel timeout set on the
return path from epoll_wait(). If necessary, the timeout re-enables
interrupts regardless of the application's (mis)behaviour."

[Where misbehavior is not calling epoll_wait again]

"The resulting execution model mimics the execution model of typical
user-level network stacks and does not add any requirements compared to
user-level networking. In fact, it is slightly better, because it can
resort to blocking and interrupt delivery, instead of having to
continuously busyloop during idle times."

This last part shows a preference on your part in a trade-off: you want
low latency, but also low cpu utilization where possible. This also
came up in this thread. Please state that design decision explicitly.
There are plenty of workloads where burning a core is acceptable
(especially as core count continues increasing); for those,
busylooping is not "slightly worse".

Kernel polling with full busy polling is also already possible, by
choosing a very high napi_defer_hard_irqs and gro_flush_timeout. So
high, in fact, that these tunables need not be tuned carefully. So what
this series adds is not interrupt suppression during event processing
per se, but doing so in a hybrid mode that balances latency and cpu
load.
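
[To make the quoted execution model concrete, a sketch of the
application side under the semantics described in the cover letter:
with busy polling configured and irq_suspend_timeout set, IRQs are
assumed to stay masked while epoll_wait() keeps returning events, and
to be restored once a call comes up empty and the thread blocks.
process_event() is a hypothetical placeholder for application work:]

    #include <sys/epoll.h>

    /* Hypothetical application handler; stands in for real work. */
    static void process_event(struct epoll_event *ev)
    {
    	(void)ev;
    }

    static void event_loop(int epfd)
    {
    	struct epoll_event events[64];
    	int i, n;

    	for (;;) {
    		/* Busy-polls the associated NAPI context first; only
    		 * blocks (letting IRQs be re-enabled) once the poll
    		 * comes up empty. */
    		n = epoll_wait(epfd, events, 64, -1);

    		for (i = 0; i < n; i++)
    			process_event(&events[i]);
    	}
    }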