From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 11 Sep 2025 21:28:59 +0000
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
Mime-Version: 1.0
X-Mailer: git-send-email
Message-ID: <20250911212901.1718508-1-skhawaja@google.com>
Subject: [PATCH net-next v9 0/2] Add support to do threaded napi busy poll
From: Samiullah Khawaja
To: Jakub Kicinski, "David S. Miller", Eric Dumazet, Paolo Abeni,
	almasrymina@google.com, willemb@google.com
Cc: Joe Damato, mkarsten@uwaterloo.ca, netdev@vger.kernel.org,
	skhawaja@google.com
Content-Type: text/plain; charset="UTF-8"

Extend the existing threaded NAPI poll support to do continuous busy
polling. This is used to continuously poll a NAPI instance and fetch
descriptors from the backing RX/TX queues for low-latency applications.
Allow enabling threaded busy polling through netlink so it can be turned
on for a set of dedicated NAPIs. Once enabled, the user can fetch the
PID of the kthread doing the NAPI polling and set its affinity, priority
and scheduler according to the low-latency requirements. Extend the
netlink interface to allow enabling/disabling threaded busy polling at
the individual NAPI level.

We use this for our AF_XDP based hard low-latency use case with
microsecond-level latency requirements. For our use case we want low
jitter and stable latency at P99.

Following is an analysis and comparison of the available (and
compatible) busy poll interfaces for a low-latency use case with stable
P99. This is suitable for applications that want very low latency at the
expense of CPU usage and efficiency.

The already existing APIs (SO_BUSY_POLL and epoll) allow busy polling
the NAPI backing a socket, but the missing piece is a mechanism to busy
poll a NAPI instance in a dedicated thread while ignoring available
events or packets, regardless of the userspace API. Most existing
mechanisms are designed to poll until new packets or events are
received, after which userspace is expected to handle them. As a result,
one has to hack together a solution using a mechanism intended to
receive packets or events, not to simply NAPI poll.
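For contrast with the threaded approach, the existing per-socket knob is
set with setsockopt(SO_BUSY_POLL). A minimal sketch follows; note the
numeric constant is taken from include/uapi/asm-generic/socket.h because
Python's socket module does not export it, and some kernels may require
CAP_NET_ADMIN to set it:

```python
import socket

# SO_BUSY_POLL is not exported by Python's socket module; 46 is its
# value in include/uapi/asm-generic/socket.h (assumed for this platform).
SO_BUSY_POLL = 46

def set_busy_poll(sock, usecs):
    """Busy poll the NAPI backing `sock` for up to `usecs` microseconds
    during a blocking receive, instead of sleeping until the interrupt."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, usecs)
    except OSError:
        return None  # e.g. EPERM without CAP_NET_ADMIN on older kernels
    return sock.getsockopt(socket.SOL_SOCKET, SO_BUSY_POLL)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    print(set_busy_poll(s, 400))  # 400 usecs, as in the test script below
    s.close()
```

Note that this still follows the poll-until-event pattern: the thread
calling recv() does the busy polling, which is exactly the limitation
that threaded NAPI busy polling removes.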
NAPI threaded busy polling, on the other hand, provides this capability
natively, independent of any userspace API. This makes it really easy to
set up and manage.

For the analysis we use an AF_XDP based benchmarking tool, `xsk_rr`. A
description of the tool and how it tries to simulate a real workload
follows:

- It sends UDP packets between 2 machines.
- The client machine sends packets at a fixed frequency. To maintain the
  frequency of the packets being sent, we use open-loop sampling, that
  is, the packets are sent from a separate thread.
- The server replies to the packet inline by reading the packet from the
  recv ring and replying using the tx ring.
- To simulate the application processing time, we add a configurable
  delay in usecs on the client side after a reply is received from the
  server.

The xsk_rr tool is posted separately as an RFC for
tools/testing/selftest.

We use this tool with the following NAPI polling configurations:

- Interrupts only
- SO_BUSY_POLL (inline in the same thread where the client receives the
  packet)
- SO_BUSY_POLL (separate thread and separate core)
- Threaded NAPI busy poll

The system is configured using the following script in all 4 cases:

```
echo 0 | sudo tee /sys/class/net/eth0/threaded
echo 0 | sudo tee /proc/sys/kernel/timer_migration
echo off | sudo tee /sys/devices/system/cpu/smt/control

sudo ethtool -L eth0 rx 1 tx 1
sudo ethtool -G eth0 rx 1024

echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus

# pin IRQs on CPU 2
IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
	print arr[0]}' < /proc/interrupts)"
for irq in "${IRQS}"; \
	do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done

echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us

for i in /sys/devices/virtual/workqueue/*/cpumask; \
	do echo $i; echo 1,2,3,4,5,6 > $i; done

if [[ -z "$1" ]]; then
	echo 400 | sudo tee /proc/sys/net/core/busy_read
	echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
	echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi

sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

if [[ "$1" == "enable_threaded" ]]; then
	echo 0 | sudo tee /proc/sys/net/core/busy_poll
	echo 0 | sudo tee /proc/sys/net/core/busy_read
	echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
	echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
	NAPI_ID=$(ynl --family netdev --output-json --do queue-get \
		--json '{"ifindex": '${IFINDEX}', "id": '0', "type": "rx"}' \
		| jq '."napi-id"')
	ynl --family netdev --do napi-set \
		--json '{"id": "'${NAPI_ID}'", "threaded": "busy-poll-enabled"}'
	NAPI_T=$(ynl --family netdev --output-json --do napi-get \
		--json '{"id": "'$NAPI_ID'"}' | jq '."pid"')
	sudo chrt -f -p 50 $NAPI_T

	# pin threaded poll thread to CPU 2
	sudo taskset -pc 2 $NAPI_T
fi

if [[ "$1" == "enable_interrupt" ]]; then
	echo 0 | sudo tee /proc/sys/net/core/busy_read
	echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
	echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
```

To enable the various configurations, the script can be run as follows:

- Interrupt Only
```