Message-ID: <7d7516a6-07b7-4882-9da2-2c192ef43039@redhat.com>
Date: Tue, 26 Aug 2025 13:31:43 +0200
From: Paolo Abeni <pabeni@redhat.com>
To: John Ousterhout <ouster@cs.stanford.edu>, netdev@vger.kernel.org
Cc: edumazet@google.com, horms@kernel.org, kuba@kernel.org
Subject: Re: [PATCH net-next v15 09/15] net: homa: create homa_rpc.h and homa_rpc.c
References: <20250818205551.2082-1-ouster@cs.stanford.edu>
 <20250818205551.2082-10-ouster@cs.stanford.edu>
In-Reply-To: <20250818205551.2082-10-ouster@cs.stanford.edu>
X-Mailing-List: netdev@vger.kernel.org

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/**
> + * homa_rpc_reap() - Invoked to release resources associated with dead
> + * RPCs for a given socket.
> + * @hsk:      Homa socket that may contain dead RPCs. Must not be locked
> + *            by the caller; this function will lock and release.
> + * @reap_all: False means do a small chunk of work; there may still be
> + *            unreaped RPCs on return. True means reap all dead RPCs for
> + *            hsk. Will busy-wait if reaping has been disabled for some
> + *            RPCs.
> + *
> + * Return: A return value of 0 means that we ran out of work to do;
> + *         calling again will do no work (there could be unreaped RPCs,
> + *         but if so, they cannot currently be reaped). A value greater
> + *         than zero means there is still more reaping work to be done.
> + */
> +int homa_rpc_reap(struct homa_sock *hsk, bool reap_all)
> +{
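Side note: the return contract above amounts to "keep calling while it
returns nonzero", i.e. (illustration only; message_ready() is a made-up
stand-in for whatever the waiting thread is actually polling for):

	while (!message_ready(hsk)) {
		if (homa_rpc_reap(hsk, false) == 0)
			break;	/* nothing more can be reaped right now */
	}
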
> +	/* RPC Reaping Strategy:
> +	 *
> +	 * (Note: there are references to this comment elsewhere in the
> +	 * Homa code)
> +	 *
> +	 * Most of the cost of reaping comes from freeing sk_buffs; this
> +	 * can be quite expensive for RPCs with long messages.
> +	 *
> +	 * The natural time to reap is when homa_rpc_end is invoked to
> +	 * terminate an RPC, but this doesn't work for two reasons. First,
> +	 * there may be outstanding references to the RPC; it cannot be
> +	 * reaped until all of those references have been released. Second,
> +	 * reaping is potentially expensive and RPC termination could occur
> +	 * in homa_softirq when there are short messages waiting to be
> +	 * processed. Taking time to reap a long RPC could result in
> +	 * significant delays for subsequent short RPCs.
> +	 *
> +	 * Thus Homa doesn't reap immediately in homa_rpc_end. Instead,
> +	 * dead RPCs are queued up and reaping occurs in this function,
> +	 * which is invoked later when it is less likely to impact latency.
> +	 * The challenge is to do this so that (a) we don't allow large
> +	 * numbers of dead RPCs to accumulate and (b) we minimize the
> +	 * impact of reaping on latency.
> +	 *
> +	 * The primary place where homa_rpc_reap is invoked is when threads
> +	 * are waiting for incoming messages. The thread has nothing else
> +	 * to do (it may even be polling for input), so reaping can be
> +	 * performed with no latency impact on the application. However, if
> +	 * a machine is overloaded then it may never wait, so this
> +	 * mechanism isn't always sufficient.
> +	 *
> +	 * Homa now reaps in two other places, if reaping while waiting for
> +	 * messages isn't adequate:
> +	 * 1. If too many dead skbs accumulate, then homa_timer will call
> +	 *    homa_rpc_reap.
> +	 * 2. If the timer thread cannot keep up with all the reaping to be
> +	 *    done then as a last resort homa_dispatch_pkts will reap in
> +	 *    small increments (a few sk_buffs or RPCs) for every incoming
> +	 *    batch of packets. This is undesirable because it will impact
> +	 *    Homa's performance.
> +	 *
> +	 * During the introduction of homa_pools for managing input
> +	 * buffers, freeing of packets for incoming messages was moved to
> +	 * homa_copy_to_user under the assumption that this code wouldn't
> +	 * be on the critical path. However, there is evidence that with
> +	 * fast networks (e.g. 100 Gbps) copying to user space is the
> +	 * bottleneck for incoming messages, and packet freeing takes about
> +	 * 20-25% of the total time in homa_copy_to_user. So, it may
> +	 * eventually be desirable to move packet freeing out of
> +	 * homa_copy_to_user.

See skb_attempt_defer_free().
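That is, something along these lines on the copy path (rough sketch; the
helper and its call site are assumptions, only skb_attempt_defer_free()
itself is an existing API):

	#include <linux/skbuff.h>

	/* Sketch: once a packet's payload has been copied to user space,
	 * hand each skb back to the CPU that allocated it instead of
	 * freeing it inline, so the expensive free runs off the
	 * copy-to-user critical path.
	 */
	static void homa_defer_free_skbs(struct sk_buff **skbs, int count)
	{
		int i;

		for (i = 0; i < count; i++)
			skb_attempt_defer_free(skbs[i]);
	}
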
> +	 */
> +#define BATCH_MAX 20
> +	struct homa_rpc *rpcs[BATCH_MAX];
> +	struct sk_buff *skbs[BATCH_MAX];

A lot of bytes on the stack (two arrays of 20 pointers are 320 bytes on a
64-bit arch), and quite a large batch; you should probably decrease it.
Also, the need for yet another tx free strategy on top of the several
existing caches still feels suspect.

> +	int num_skbs, num_rpcs;
> +	struct homa_rpc *rpc;
> +	struct homa_rpc *tmp;
> +	int i, batch_size;
> +	int skbs_to_reap;
> +	int result = 0;
> +	int rx_frees;
> +
> +	/* Each iteration through the following loop will reap
> +	 * BATCH_MAX skbs.
> +	 */
> +	skbs_to_reap = hsk->homa->reap_limit;
> +	while (skbs_to_reap > 0 && !list_empty(&hsk->dead_rpcs)) {
> +		batch_size = BATCH_MAX;
> +		if (!reap_all) {
> +			if (batch_size > skbs_to_reap)
> +				batch_size = skbs_to_reap;
> +			skbs_to_reap -= batch_size;
> +		}
> +		num_skbs = 0;
> +		num_rpcs = 0;
> +		rx_frees = 0;
> +
> +		homa_sock_lock(hsk);
> +		if (atomic_read(&hsk->protect_count)) {
> +			homa_sock_unlock(hsk);
> +			if (reap_all)
> +				continue;
> +			return 0;
> +		}
> +
> +		/* Collect buffers and freeable RPCs. */
> +		list_for_each_entry_safe(rpc, tmp, &hsk->dead_rpcs,
> +					 dead_links) {
> +			int refs;
> +
> +			/* Make sure that all outstanding uses of the RPC
> +			 * have completed. We can only be sure if the
> +			 * reference count is zero when we're holding the
> +			 * lock. Note: it isn't safe to block while locking
> +			 * the RPC here, since we hold the socket lock.
> +			 */
> +			if (homa_rpc_try_lock(rpc)) {
> +				refs = atomic_read(&rpc->refs);
> +				homa_rpc_unlock(rpc);
> +			} else {
> +				refs = 1;
> +			}
> +			if (refs != 0)
> +				continue;
> +			rpc->magic = 0;
> +
> +			/* For Tx sk_buffs, collect them here but defer
> +			 * freeing until after releasing the socket lock.
> +			 */
> +			if (rpc->msgout.length >= 0) {
> +				while (rpc->msgout.packets) {
> +					skbs[num_skbs] = rpc->msgout.packets;
> +					rpc->msgout.packets = homa_get_skb_info(
> +						rpc->msgout.packets)->next_skb;
> +					num_skbs++;
> +					rpc->msgout.num_skbs--;
> +					if (num_skbs >= batch_size)
> +						goto release;
> +				}
> +			}
> +
> +			/* In the normal case rx sk_buffs will already have
> +			 * been freed before we got here. Thus it's OK to
> +			 * free immediately in rare situations where there
> +			 * are buffers left.
> +			 */
> +			if (rpc->msgin.length >= 0 &&
> +			    !skb_queue_empty_lockless(&rpc->msgin.packets)) {
> +				rx_frees += skb_queue_len(&rpc->msgin.packets);
> +				__skb_queue_purge(&rpc->msgin.packets);
> +			}
> +
> +			/* If we get here, it means all packets have been
> +			 * removed from the RPC.
> +			 */
> +			rpcs[num_rpcs] = rpc;
> +			num_rpcs++;
> +			list_del(&rpc->dead_links);
> +			WARN_ON(refcount_sub_and_test(rpc->msgout.skb_memory,
> +						      &hsk->sock.sk_wmem_alloc));
> +			if (num_rpcs >= batch_size)
> +				goto release;
> +		}
> +
> +		/* Free all of the collected resources; release the socket
> +		 * lock while doing this.
> +		 */
> +release:
> +		hsk->dead_skbs -= num_skbs + rx_frees;
> +		result = !list_empty(&hsk->dead_rpcs) &&
> +			 (num_skbs + num_rpcs) != 0;
> +		homa_sock_unlock(hsk);
> +		homa_skb_free_many_tx(hsk->homa, skbs, num_skbs);
> +		for (i = 0; i < num_rpcs; i++) {
> +			rpc = rpcs[i];
> +
> +			if (unlikely(rpc->msgin.num_bpages))
> +				homa_pool_release_buffers(rpc->hsk->buffer_pool,
> +							  rpc->msgin.num_bpages,
> +							  rpc->msgin.bpage_offsets);
> +			if (rpc->msgin.length >= 0) {
> +				while (1) {
> +					struct homa_gap *gap;
> +
> +					gap = list_first_entry_or_null(
> +						&rpc->msgin.gaps,
> +						struct homa_gap,
> +						links);
> +					if (!gap)
> +						break;
> +					list_del(&gap->links);
> +					kfree(gap);
> +				}
> +			}
> +			if (rpc->peer) {
> +				homa_peer_release(rpc->peer);
> +				rpc->peer = NULL;
> +			}
> +			rpc->state = 0;
> +			kfree(rpc);
> +		}
> +		homa_sock_wakeup_wmem(hsk);

Here num_rpcs can be zero, and you can get spurious wake-ups.
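Something like the following guard would avoid that (sketch):

		/* Wake tx-memory waiters only when this pass actually
		 * released wmem; otherwise the wakeup is spurious.
		 */
		if (num_rpcs)
			homa_sock_wakeup_wmem(hsk);
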
> +/**
> + * homa_rpc_hold() - Increment the reference count on an RPC, which will
> + * prevent it from being freed until homa_rpc_put() is called. References
> + * are taken in two situations:
> + * 1. An RPC is going to be manipulated by a collection of functions. In
> + *    this case the top-most function that identifies the RPC takes the
> + *    reference; any function that receives an RPC as an argument can
> + *    assume that a reference has been taken on the RPC by some higher
> + *    function on the call stack.
> + * 2. A pointer to an RPC is stored in an object for use later, such as
> + *    an interest. A reference must be held as long as the pointer
> + *    remains accessible in the object.
> + * @rpc: RPC on which to take a reference.
> + */
> +static inline void homa_rpc_hold(struct homa_rpc *rpc)
> +{
> +	atomic_inc(&rpc->refs);

`refs` should be a refcount_t, since it is used as such.
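Minimally sketched (the release hook in homa_rpc_put() is hypothetical;
note that refcount_inc() WARNs when the count is zero, so Homa's "count
drops to zero while the RPC stays alive" convention would need a bias or
a dec_and_test-driven release design):

	#include <linux/refcount.h>

	static inline void homa_rpc_hold(struct homa_rpc *rpc)
	{
		refcount_inc(&rpc->refs);	/* 'refs' redeclared as refcount_t */
	}

	static inline void homa_rpc_put(struct homa_rpc *rpc)
	{
		/* Unlike a bare atomic_dec(), this traps underflow. */
		if (refcount_dec_and_test(&rpc->refs))
			homa_rpc_reap_notify(rpc->hsk);	/* hypothetical hook */
	}

/P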