netdev.vger.kernel.org archive mirror
* [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol
@ 2025-08-18 20:55 John Ousterhout
  2025-08-18 20:55 ` [PATCH net-next v15 01/15] net: homa: define user-visible API for Homa John Ousterhout
                   ` (15 more replies)
  0 siblings, 16 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

This patch series begins the process of upstreaming the Homa transport
protocol. Homa is an alternative to TCP for use in datacenter
environments. It provides 10-100x reductions in tail latency for short
messages relative to TCP. Its benefits are greatest for mixed workloads
containing both short and long messages running under high network loads.
Homa is not API-compatible with TCP: it is connectionless and message-
oriented (but still reliable and flow-controlled). Homa's new API not
only contributes to its performance gains, but it also eliminates the
massive amount of connection state required by TCP for highly connected
datacenter workloads (Homa uses ~1 socket per application, whereas
TCP requires a separate socket for each peer).
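
As a concrete illustration, here is a minimal sketch of a client issuing
a single RPC. It is illustrative only: it is based on the uapi header in
patch 01 of this series, assumes IPPROTO_HOMA is the setsockopt level,
and omits all error handling:

    #include <linux/homa.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    int rpc_example(struct sockaddr_in *server, char *req, size_t reqlen)
    {
            struct homa_sendmsg_args send_args = {0};
            struct homa_recvmsg_args recv_args = {0};
            struct homa_rcvbuf_args rcvbuf;
            struct msghdr msg = {0};
            struct iovec iov;
            void *region;
            int fd;

            fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_HOMA);

            /* Register a buffer region; incoming message data is
             * delivered here rather than through an iovec.
             */
            region = mmap(NULL, 64 * HOMA_BPAGE_SIZE,
                          PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            rcvbuf.start = (__u64)(uintptr_t)region;
            rcvbuf.length = 64 * HOMA_BPAGE_SIZE;
            setsockopt(fd, IPPROTO_HOMA, SO_HOMA_RCVBUF, &rcvbuf,
                       sizeof(rcvbuf));

            /* Send a request: id 0 means "new RPC"; the kernel returns
             * the new RPC's id in send_args.id.
             */
            send_args.flags = HOMA_SENDMSG_PRIVATE;
            iov.iov_base = req;
            iov.iov_len = reqlen;
            msg.msg_name = server;
            msg.msg_namelen = sizeof(*server);
            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_control = &send_args;
            msg.msg_controllen = sizeof(send_args);
            sendmsg(fd, &msg, 0);

            /* Wait for this RPC's response; its data lands in the
             * buffer region at recv_args.bpage_offsets.
             */
            recv_args.id = send_args.id;
            msg.msg_control = &recv_args;
            msg.msg_controllen = sizeof(recv_args);
            return recvmsg(fd, &msg, 0);
    }

(A server-side sketch would additionally bind the socket and enable
SO_HOMA_SERVER so that it can receive requests.)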

For more details on Homa, please consult the Homa Wiki:
https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
The Wiki has pointers to two papers on Homa (one of which describes
this implementation) as well as man pages describing the application
API and other information.

There is also a GitHub repo for Homa:
https://github.com/PlatformLab/HomaModule
The GitHub repo contains a superset of this patch set, including:
* Additional source code that will eventually be upstreamed
* Extensive unit tests (which will also be upstreamed eventually)
* Application-level library functions (which need to go in glibc?)
* Man pages (which need to be upstreamed as well)
* Benchmarking and instrumentation code

For this patch series, Homa has been stripped down to the bare minimum
functionality capable of actually executing remote procedure calls (about
8000 lines of source code, compared to 15000 in the complete Homa). The
remaining code will be upstreamed in smaller batches once this patch
series has been accepted. Note: the code in this patch series is
functional but its performance is not very interesting (about the same
as TCP).

The patch series is arranged to introduce the major functional components
of Homa. Until the last patch has been applied, the code is inert (it
will not be compiled).

Note: this implementation of Homa supports both IPv4 and IPv6.

Changes for v15:
* This series is a resubmit of the v14 series to repair broken Author
  email addresses in the commits. There are no other changes.

Changes for v14:
* There were no comments on the v13 patch series.
* Fix a couple of bugs and clean up a few APIs (see individual patches for
  details).

Changes for v13:
* Modify all files to include GPL-2.0+ as an option in the SPDX license line
* Fix a couple of bugs in homa_outgoing.c and one bug in homa_plumbing.c

Major changes for v12:
* There were no comments on the v11 patch series, so there are no major
  changes in this version. See individual patch files for a few small
  local changes.

Major changes for v11 (see individual patches for additional details):
* There were no comments on the v10 patch series, so there are not many
  changes in this version
* Rework the mechanism for waking up RPCs that stalled waiting for
  buffer pool space (the old approach deprioritized waking RPCs, which
  led to starvation and server overload).
* Clean up and simplify the use of RPC reference counts. Before, references
  were only acquired to bridge gaps in lock ownership; this was complicated
  and error-prone. Now, reference counts are acquired at the "top level" when
  an RPC is selected to work on. Any function that receives a homa_rpc as
  an argument can assume it is protected by a reference.
* Clean up sparse annotations (use name of lock variable, not address)

Major changes for v10 (see individual patches for additional details):
- Refactor resend mechanism: consolidate code for sending RESEND packets
  in new function homa_request_retrans (simplifies homa_timer.c); a few
  bug fixes (updating "granted" field in homa_resend_pkt, etc.)
- Revise sparse annotations to eliminate __context__ definition
- Use the destroy function from struct proto properly (fixes races in
  socket cleanup)

Major changes for v9 (see individual patches for additional details):
- Introduce homa_net objects; there is now a single global struct homa
  shared by all network namespaces, with one homa_net per network namespace
  with netns-specific information. Most info, including socket table and
  peer table, is stored in the struct homa.
- Introduce homa_clock as an abstraction layer for the fine-grain clock.
- Implement limits on the number of active homa_peer objects. This includes
  adding reference counts in homa_peers and adding code to release peers
  where there are too many.
- Switch to using rhashtable to store homa_peers; the table is shared
  across all network namespaces, though individual peers are namespace-
  specific.

v8 changes:
- There were no reviews of the v7 patch series, so there are not many changes
  in this version
- Pull out pacer code into separate files pacer.h and pacer.c
- Refactor homa_pool APIs (move allocation/deallocation into homa_pool.c,
  move locking responsibility out)
- Fix various problems from sparse, checkpatch, and kernel-doc

v7 changes:
- Add documentation files reap.txt and sync.txt.
- Replace __u64 with u64 (and __s64 with s64) in non-uapi settings.
- Replace '__aligned(L1_CACHE_BYTES)' with '____cacheline_aligned_in_smp'.
- Use alloc_percpu_gfp for homa_pool::cores.
- Extract bool homa_bpage_available from homa_pool_get_pages.
- Rename homa_rpc_free to homa_rpc_end.
- Use skb_queue_purge in homa_rpc_reap instead of hand-coding.
- Clean up RCU usage in several places:
  - Eliminate unnecessary use of RCU for homa_sock::dead_rpcs.
  - Eliminate use of RCU for homa::throttled_rpcs (unnecessary, unclear
    that it would have worked). Added return value from homa_pacer_xmit.
  - Call rcu_read_lock/unlock in homa_peer_find (just to be safe; probably
    isn't necessary)
  - Eliminate extraneous use of RCU in homa_pool_allocate.
  - Clean up RCU usage around homa_sock::active_rpcs.
  - Change homa_sock_find to take a reference on the returned socket;
    caller no longer has to worry about RCU issues.
- Remove "locker" arguments from homa_lock_rpc, homa_lock_sock,
  homa_rpc_try_lock, and homa_bucket_lock (shouldn't be needed, given
  CONFIG_PROVE_LOCKING).
- Use __GFP_ZERO in *alloc calls instead of initializing individual
  struct fields to zero.
- Don't use raw_smp_processor_id; use smp_processor_id instead.
- Remove homa_peertab_get_peers from this patch series (and also fix
  problems in it related to RCU usage).
- Add annotation to homa_peertab_gc_dsts requiring write_lock.
- Remove "lock_slow" functions, which don't add functionality in this patch
  series.
- Remove unused fields from homa_peer structs.
- Reorder fields in homa_rpc_bucket to squeeze out padding.
- Refactor homa_sock_start_scan etc.
  - Take a reference on the current socket to keep it from being freed.
  - No need now for homa_socktab::active_scans or struct homa_socktab_links.
  - rcu_read_lock/unlock is now entirely in the homa_sock scan methods;
    no need for callers to worry about this.
- Add homa_rpc_hold and homa_rpc_put. Replaces several ad-hoc mechanisms,
  such as RPC_COPYING_FROM_USER and RPC_COPYING_TO_USER, with a single
  general-purpose mechanism.
- Use __skb_queue_purge instead of skb_queue_purge (locking isn't needed
  because Homa has its own locks).
- Rename UNKNOWN packet type to RPC_UNKNOWN.
- Add hsk->is_server plus SO_HOMA_SERVER setsockopt: by default, sockets
  will not accept incoming RPCs unless they have been bound.
- Refactor waiting mechanism for incoming packets: simplify wait
  criteria and use standard mechanisms (wait_event_*) for blocking
  threads. Create homa_interest.c and homa_interest.h.
- Add memory accounting for outbound messages (e.g. new sysctl value
  wmem_max); senders now block when memory limit is exceeded.
- Make Homa a pernet subsystem (a separate Homa transport for each
  network namespace).

v6 changes:
- Make hrtimer variable in homa_timer_main static instead of stack-allocated
  (avoids complaints when in debug mode).
- Remove unnecessary cast in homa_dst_refresh.
- Replace erroneous uses of GFP_KERNEL with GFP_ATOMIC.
- Check for "all ports in use" in homa_sock_init.
- Refactor API for homa_rpc_reap to incorporate "reap all" feature,
  eliminate need for callers to specify exact amount of work to do
  when in "reap a few" mode.
- Fix bug in homa_rpc_reap (wasn't resetting rx_frees for each iteration
  of outer loop).

v5 changes:
- Change type of start in struct homa_rcvbuf_args from void* to __u64;
  also add more __user annotations.
- Refactor homa_interest: replace awkward ready_rpc field with two
  fields: rpc and rpc_ready. Added new functions homa_interest_get_rpc
  and homa_interest_set_rpc to encapsulate/clarify access to
  interest->rpc_ready.
- Eliminate use of LIST_POISON1 etc. in homa_interests (use list_del_init
  instead of list_del).
- Remove homa_next_skb function, which is obsolete, unused, and incorrect
- Eliminate ipv4_to_ipv6 function (use ipv6_addr_set_v4mapped instead)
- Eliminate is_mapped_ipv4 function (use ipv6_addr_v4mapped instead)
- Use __u64 instead of uint64_t in homa.h
- Remove 'extern "C"' from homa.h
- Various fixes from patchwork checks (checkpatch.pl, etc.)
- A few improvements to comments

v4 changes:
- Remove sport argument for homa_find_server_rpc (unneeded). Also
  remove client_port field from struct homa_ack
- Refactor ICMP packet handling (v6 was incorrect)
- Check for socket shutdown in homa_poll
- Fix potential for memory garbling in homa_symbol_for_type
- Remove unused ETHERNET_MAX_PAYLOAD declaration
- Rename classes in homa_wire.h so they all have "homa_" prefixes
- Various fixes from patchwork checks (checkpatch.pl, etc.)
- A few improvements to comments

v3 changes:
- Fix formatting in Kconfig
- Set ipv6_pinfo_offset in struct proto
- Check return value of inet6_register_protosw
- In homa_load cleanup, don't cleanup things that haven't been
  initialized
- Add MODULE_ALIAS_NET_PF_PROTO_TYPE to auto-load module
- Check return value from kzalloc call in homa_sock_init
- Change SO_HOMA_SET_BUF to SO_HOMA_RCVBUF
- Change struct homa_set_buf_args to struct homa_rcvbuf_args
- Implement getsockopt for SO_HOMA_RCVBUF
- Return ENOPROTOOPT instead of EINVAL where appropriate in
  setsockopt and getsockopt
- Fix crash in homa_pool_check_waiting if pool has no region yet
- Check for NULL msg->msg_name in homa_sendmsg
- Change addr->in6.sin6_family to addr->sa.sa_family in homa_sendmsg
  for clarity
- For some errors in homa_recvmsg, return directly rather than "goto done"
- Return error from recvmsg if offsets of returned read buffers are bogus
- Added comments to clarify lock-unlock pairs for RPCs
- Renamed homa_try_bucket_lock to homa_try_rpc_lock
- Fix issues found by test robot and checkpatch.pl
- Ensure first argument to do_div is 64 bits
- Remove C++ style comments
- Removed some code that will only be relevant in future patches that
  fill in missing Homa functionality

v2 changes:
- Remove sockaddr_in_union declaration from public API in homa.h
- Remove kernel wrapper functions (homa_send, etc.) from homa.h
- Fix many sparse warnings (still more work to do here) and other issues
  uncovered by test robot
- Fix checkpatch.pl issues
- Remove residual code related to unit tests
- Remove references to tt_record from comments
- Make it safe to delete sockets during homa_socktab scans
- Use uintptr_t for portability to 32-bit platforms
- Use do_div instead of "/" for portability
- Remove homa->busy_usecs and homa->gro_busy_usecs (not needed in
  this stripped down version of Homa)
- Eliminate usage of cpu_khz, use sched_clock instead of get_cycles
- Add missing checks of kmalloc return values
- Remove "inline" qualifier from functions in .c files
- Document that pad fields must be zero
- Use more precise type "uint32_t" rather than "int"
- Remove unneeded #include of linux/version.h

John Ousterhout (15):
  net: homa: define user-visible API for Homa
  net: homa: create homa_wire.h
  net: homa: create shared Homa header files
  net: homa: create homa_pool.h and homa_pool.c
  net: homa: create homa_peer.h and homa_peer.c
  net: homa: create homa_sock.h and homa_sock.c
  net: homa: create homa_interest.h and homa_interest.c
  net: homa: create homa_pacer.h and homa_pacer.c
  net: homa: create homa_rpc.h and homa_rpc.c
  net: homa: create homa_outgoing.c
  net: homa: create homa_utils.c
  net: homa: create homa_incoming.c
  net: homa: create homa_timer.c
  net: homa: create homa_plumbing.c
  net: homa: create Makefile and Kconfig

 include/uapi/linux/homa.h |  158 ++++++
 net/Kconfig               |    1 +
 net/Makefile              |    1 +
 net/homa/Kconfig          |   21 +
 net/homa/Makefile         |   16 +
 net/homa/homa_impl.h      |  703 +++++++++++++++++++++++
 net/homa/homa_incoming.c  |  886 +++++++++++++++++++++++++++++
 net/homa/homa_interest.c  |  114 ++++
 net/homa/homa_interest.h  |   93 +++
 net/homa/homa_outgoing.c  |  599 ++++++++++++++++++++
 net/homa/homa_pacer.c     |  303 ++++++++++
 net/homa/homa_pacer.h     |  173 ++++++
 net/homa/homa_peer.c      |  595 ++++++++++++++++++++
 net/homa/homa_peer.h      |  373 +++++++++++++
 net/homa/homa_plumbing.c  | 1118 +++++++++++++++++++++++++++++++++++++
 net/homa/homa_pool.c      |  483 ++++++++++++++++
 net/homa/homa_pool.h      |  136 +++++
 net/homa/homa_rpc.c       |  638 +++++++++++++++++++++
 net/homa/homa_rpc.h       |  501 +++++++++++++++++
 net/homa/homa_sock.c      |  432 ++++++++++++++
 net/homa/homa_sock.h      |  408 ++++++++++++++
 net/homa/homa_stub.h      |   91 +++
 net/homa/homa_timer.c     |  136 +++++
 net/homa/homa_utils.c     |  122 ++++
 net/homa/homa_wire.h      |  345 ++++++++++++
 25 files changed, 8446 insertions(+)
 create mode 100644 include/uapi/linux/homa.h
 create mode 100644 net/homa/Kconfig
 create mode 100644 net/homa/Makefile
 create mode 100644 net/homa/homa_impl.h
 create mode 100644 net/homa/homa_incoming.c
 create mode 100644 net/homa/homa_interest.c
 create mode 100644 net/homa/homa_interest.h
 create mode 100644 net/homa/homa_outgoing.c
 create mode 100644 net/homa/homa_pacer.c
 create mode 100644 net/homa/homa_pacer.h
 create mode 100644 net/homa/homa_peer.c
 create mode 100644 net/homa/homa_peer.h
 create mode 100644 net/homa/homa_plumbing.c
 create mode 100644 net/homa/homa_pool.c
 create mode 100644 net/homa/homa_pool.h
 create mode 100644 net/homa/homa_rpc.c
 create mode 100644 net/homa/homa_rpc.h
 create mode 100644 net/homa/homa_sock.c
 create mode 100644 net/homa/homa_sock.h
 create mode 100644 net/homa/homa_stub.h
 create mode 100644 net/homa/homa_timer.c
 create mode 100644 net/homa/homa_utils.c
 create mode 100644 net/homa/homa_wire.h

--
2.43.0



* [PATCH net-next v15 01/15] net: homa: define user-visible API for Homa
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-18 20:55 ` [PATCH net-next v15 02/15] net: homa: create homa_wire.h John Ousterhout
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

Note: for man pages, see the Homa Wiki at:
https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v14:
* Add "WITH Linux-syscall-note" SPDX license note

Changes for v11:
* Add explicit padding to struct homa_recvmsg_args to fix problems compiling
  on 32-bit machines.

Changes for v9:
* Eliminate use of _Static_assert
* Remove declarations related to now-defunct homa_api.c

Changes for v7:
* Add HOMA_SENDMSG_NONBLOCKING flag for sendmsg
* API changes for new mechanism for waiting for incoming messages
* Add setsockopt SO_HOMA_SERVER (enable incoming requests)
* Use u64 and __u64 properly
---
 include/uapi/linux/homa.h | 158 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 include/uapi/linux/homa.h

diff --git a/include/uapi/linux/homa.h b/include/uapi/linux/homa.h
new file mode 100644
index 000000000000..3a010cc13b25
--- /dev/null
+++ b/include/uapi/linux/homa.h
@@ -0,0 +1,158 @@
+/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+ WITH Linux-syscall-note */
+
+/* This file defines the kernel call interface for the Homa
+ * transport protocol.
+ */
+
+#ifndef _UAPI_LINUX_HOMA_H
+#define _UAPI_LINUX_HOMA_H
+
+#include <linux/types.h>
+#ifndef __KERNEL__
+#include <netinet/in.h>
+#include <sys/socket.h>
+#endif
+
+/* IANA-assigned Internet Protocol number for Homa. */
+#define IPPROTO_HOMA 146
+
+/**
+ * define HOMA_MAX_MESSAGE_LENGTH - Maximum bytes of payload in a Homa
+ * request or response message.
+ */
+#define HOMA_MAX_MESSAGE_LENGTH 1000000
+
+/**
+ * define HOMA_BPAGE_SIZE - Number of bytes in pages used for receive
+ * buffers. Must be a power of two.
+ */
+#define HOMA_BPAGE_SIZE (1 << HOMA_BPAGE_SHIFT)
+#define HOMA_BPAGE_SHIFT 16
+
+/**
+ * define HOMA_MAX_BPAGES - The largest number of bpages that will be required
+ * to store an incoming message.
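+ * With the values above this is 16 (ceil(1000000 / 65536)).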
+ */
+#define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) >> \
+		HOMA_BPAGE_SHIFT)
+
+/**
+ * define HOMA_MIN_DEFAULT_PORT - The 16-bit port space is divided into
+ * two nonoverlapping regions. Ports 1-32767 are reserved exclusively
+ * for well-defined server ports. The remaining ports are used for client
+ * ports; these are allocated automatically by Homa. Port 0 is reserved.
+ */
+#define HOMA_MIN_DEFAULT_PORT 0x8000
+
+/**
+ * struct homa_sendmsg_args - Provides information needed by Homa's
+ * sendmsg; passed to sendmsg using the msg_control field.
+ */
+struct homa_sendmsg_args {
+	/**
+	 * @id: (in/out) An initial value of 0 means a new request is
+	 * being sent; nonzero means the message is a reply to the given
+	 * id. If the message is a request, then the value is modified to
+	 * hold the id of the new RPC.
+	 */
+	__u64 id;
+
+	/**
+	 * @completion_cookie: (in) Used only for request messages; will be
+	 * returned by recvmsg when the RPC completes. Typically used to
+	 * locate app-specific info about the RPC.
+	 */
+	__u64 completion_cookie;
+
+	/**
+	 * @flags: (in) OR-ed combination of bits that control the operation.
+	 * See below for values.
+	 */
+	__u32 flags;
+
+	/** @reserved: Not currently used, must be 0. */
+	__u32 reserved;
+};
+
+/* Flag bits for homa_sendmsg_args.flags (see man page for documentation):
+ */
+#define HOMA_SENDMSG_PRIVATE       0x01
+#define HOMA_SENDMSG_VALID_FLAGS   0x01
+
+/**
+ * struct homa_recvmsg_args - Provides information needed by Homa's
+ * recvmsg; passed to recvmsg using the msg_control field.
+ */
+struct homa_recvmsg_args {
+	/**
+	 * @id: (in/out) Initial value is 0 to wait for any shared RPC;
+	 * nonzero means wait for that specific (private) RPC. Returns
+	 * the id of the RPC received.
+	 */
+	__u64 id;
+
+	/**
+	 * @completion_cookie: (out) If the incoming message is a response,
+	 * this will return the completion cookie specified when the
+	 * request was sent. For requests this will always be zero.
+	 */
+	__u64 completion_cookie;
+
+	/**
+	 * @num_bpages: (in/out) Number of valid entries in @bpage_offsets.
+	 * Passes in bpages from previous messages that can now be
+	 * recycled; returns bpages from the new message.
+	 */
+	__u32 num_bpages;
+
+	/** @reserved: Not currently used, must be 0. */
+	__u32 reserved;
+
+	/**
+	 * @bpage_offsets: (in/out) Each entry is an offset into the buffer
+	 * region for the socket pool. When returned from recvmsg, the
+	 * offsets indicate where fragments of the new message are stored. All
+	 * entries but the last refer to full buffer pages (HOMA_BPAGE_SIZE
+	 * bytes) and are bpage-aligned. The last entry may refer to a bpage
+	 * fragment and is not necessarily aligned. The application now owns
+	 * these bpages and must eventually return them to Homa, using
+	 * bpage_offsets in a future recvmsg invocation.
+	 */
+	__u32 bpage_offsets[HOMA_MAX_BPAGES];
+};
+
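+/* Illustrative receive loop (an example only, not part of the uapi):
+ * the bpages returned by one recvmsg call are passed back in, and
+ * thereby recycled, by the next call:
+ *
+ *	struct homa_recvmsg_args args = {0};
+ *
+ *	msg.msg_control = &args;
+ *	msg.msg_controllen = sizeof(args);
+ *	while (1) {
+ *		recvmsg(fd, &msg, 0);
+ *		// ... process data at args.bpage_offsets ...
+ *		args.id = 0;	// wait for any shared RPC again
+ *	}
+ *
+ * Leaving num_bpages and bpage_offsets unmodified returns ownership of
+ * those bpages to Homa on the next call.
+ */
+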
+/** define SO_HOMA_RCVBUF - setsockopt option for specifying buffer region. */
+#define SO_HOMA_RCVBUF 10
+
+/**
+ * define SO_HOMA_SERVER - setsockopt option for specifying whether a
+ * socket will act as server.
+ */
+#define SO_HOMA_SERVER 11
+
+/** struct homa_rcvbuf_args - setsockopt argument for SO_HOMA_RCVBUF. */
+struct homa_rcvbuf_args {
+	/** @start: Address of first byte of buffer region in user space. */
+	__u64 start;
+
+	/** @length: Total number of bytes available at @start. */
+	size_t length;
+};
+
+/* Meanings of the bits in Homa's flag word, which can be set using
+ * "sysctl /net/homa/flags".
+ */
+
+/**
+ * define HOMA_FLAG_DONT_THROTTLE - disable the output throttling mechanism
+ * (always send all packets immediately).
+ */
+#define HOMA_FLAG_DONT_THROTTLE   2
+
+/* I/O control calls on Homa sockets. These are mapped into the
+ * SIOCPROTOPRIVATE range of 0x89e0 through 0x89ef.
+ */
+
+#define HOMAIOCFREEZE _IO(0x89, 0xef)
+
+#endif /* _UAPI_LINUX_HOMA_H */
-- 
2.43.0



* [PATCH net-next v15 02/15] net: homa: create homa_wire.h
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
  2025-08-18 20:55 ` [PATCH net-next v15 01/15] net: homa: define user-visible API for Homa John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-18 20:55 ` [PATCH net-next v15 03/15] net: homa: create shared Homa header files John Ousterhout
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

This file defines the on-the-wire packet formats for Homa.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v11:
* Rework the mechanism for waking up RPCs that stalled waiting for
  buffer pool space

Changes for v10:
* Replace __u16 with u16, __u8 with u8, etc.
* Refactor resend mechanism

Changes for v9:
* Eliminate use of _Static_assert
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory, Replace BOGUS in enum homa_packet_type with MAX_OP)
* Remove HOMA_IPV6_HEADER_LENGTH and similar defs, use sizeof(ipv6hdr) instead

Changes for v7:
* Rename UNKNOWN packet type to RPC_UNKNOWN
* Use u64 and __u64 properly
---
 net/homa/homa_wire.h | 345 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 345 insertions(+)
 create mode 100644 net/homa/homa_wire.h

diff --git a/net/homa/homa_wire.h b/net/homa/homa_wire.h
new file mode 100644
index 000000000000..03479357f8e9
--- /dev/null
+++ b/net/homa/homa_wire.h
@@ -0,0 +1,345 @@
+/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+ */
+
+/* This file defines the on-the-wire format of Homa packets. */
+
+#ifndef _HOMA_WIRE_H
+#define _HOMA_WIRE_H
+
+#include <linux/skbuff.h>
+#include <net/tcp.h>
+
+/* Defines the possible types of Homa packets.
+ *
+ * See the homa_*_hdr structs below for more information about each type.
+ */
+enum homa_packet_type {
+	DATA               = 0x10,
+	RESEND             = 0x12,
+	RPC_UNKNOWN        = 0x13,
+	BUSY               = 0x14,
+	NEED_ACK           = 0x17,
+	ACK                = 0x18,
+	MAX_OP             = 0x18,
+	/* If you add a new type here, you must also do the following:
+	 * 1. Change MAX_OP so it is the highest valid opcode
+	 * 2. Add support for the new opcode in homa_print_packet,
+	 *    homa_print_packet_short, homa_symbol_for_type, and mock_skb_alloc.
+	 * 3. Add the header length to header_lengths in homa_plumbing.c.
+	 */
+};
+
+/**
+ * define HOMA_SKB_EXTRA - How many bytes of additional space to allow at the
+ * beginning of each sk_buff, before the Homa header. This includes room for
+ * either an IPV4 or IPV6 header, Ethernet header, VLAN header, etc. This is
+ * a bit of an overestimate, since it also includes space for a TCP header.
+ */
+#define HOMA_SKB_EXTRA MAX_TCP_HEADER
+
+/**
+ * define HOMA_ETH_FRAME_OVERHEAD - Additional overhead bytes for each
+ * Ethernet packet that are not included in the packet header (preamble,
+ * start frame delimiter, CRC, and inter-packet gap).
+ */
+#define HOMA_ETH_FRAME_OVERHEAD 24
+
+/**
+ * define HOMA_ETH_OVERHEAD - Number of bytes per Ethernet packet for Ethernet
+ * header, CRC, preamble, and inter-packet gap.
+ */
+#define HOMA_ETH_OVERHEAD (18 + HOMA_ETH_FRAME_OVERHEAD)
+
+/**
+ * define HOMA_MIN_PKT_LENGTH - Every Homa packet must be padded to at least
+ * this length to meet Ethernet frame size limitations. This number includes
+ * Homa headers and data, but not IP or Ethernet headers.
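+ * (This is the 46-byte minimum Ethernet payload less a 20-byte IPv4
+ * header.)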
+ */
+#define HOMA_MIN_PKT_LENGTH 26
+
+/**
+ * define HOMA_MAX_HEADER - Number of bytes in the largest Homa header.
+ */
+#define HOMA_MAX_HEADER 90
+
+/**
+ * struct homa_common_hdr - Wire format for the first bytes in every Homa
+ * packet. This must (mostly) match the format of a TCP header to enable
+ * Homa packets to actually be transmitted as TCP packets (and thereby
+ * take advantage of TSO and other features).
+ */
+struct homa_common_hdr {
+	/**
+	 * @sport: Port on source machine from which packet was sent.
+	 * Must be in the same position as in a TCP header.
+	 */
+	__be16 sport;
+
+	/**
+	 * @dport: Port on destination that is to receive packet. Must be
+	 * in the same position as in a TCP header.
+	 */
+	__be16 dport;
+
+	/**
+	 * @sequence: corresponds to the sequence number field in TCP headers;
+	 * used in DATA packets to hold the offset in the message of the first
+	 * byte of data. However, when TSO is used without TCP hijacking, this
+	 * value will only be correct in the first segment of a GSO packet.
+	 */
+	__be32 sequence;
+
+	/**
+	 * @ack: Corresponds to the high-order bits of the acknowledgment
+	 * field in TCP headers; not used by Homa.
+	 */
+	char ack[3];
+
+	/**
+	 * @type: Homa packet type (one of the values of the homa_packet_type
+	 * enum). Corresponds to the low-order byte of the ack in TCP.
+	 */
+	u8 type;
+
+	/**
+	 * @doff: The high-order 4 bits hold the number of 4-byte chunks in a
+	 * homa_data_hdr (low-order bits unused). Used only for DATA packets;
+	 * must be in the same position as the data offset in a TCP header.
+	 * Used by TSO to determine where the replicated header portion ends.
+	 */
+	u8 doff;
+
+	/** @reserved1: Not used (corresponds to TCP flags). */
+	u8 reserved1;
+
+	/**
+	 * @window: Corresponds to the window field in TCP headers. Not used
+	 * by HOMA.
+	 */
+	__be16 window;
+
+	/**
+	 * @checksum: Not used by Homa, but must occupy the same bytes as
+	 * the checksum in a TCP header (TSO may modify this?).
+	 */
+	__be16 checksum;
+
+	/** @reserved2: Not used (corresponds to TCP urgent field). */
+	__be16 reserved2;
+
+	/**
+	 * @sender_id: the identifier of this RPC as used on the sender (i.e.,
+	 * if the low-order bit is set, then the sender is the server for
+	 * this RPC).
+	 */
+	__be64 sender_id;
+} __packed;
+
+/**
+ * struct homa_ack - Identifies an RPC that can be safely deleted by its
+ * server. After sending the response for an RPC, the server must retain its
+ * state for the RPC until it knows that the client has successfully
+ * received the entire response. An ack indicates this. Clients will
+ * piggyback acks on future data packets, but if a client doesn't send
+ * any data to the server, the server will eventually request an ack
+ * explicitly with a NEED_ACK packet, in which case the client will
+ * return an explicit ACK.
+ */
+struct homa_ack {
+	/**
+	 * @client_id: The client's identifier for the RPC. 0 means this ack
+	 * is invalid.
+	 */
+	__be64 client_id;
+
+	/** @server_port: The server-side port for the RPC. */
+	__be16 server_port;
+} __packed;
+
+/* struct homa_data_hdr - Contains data for part or all of a Homa message.
+ * An incoming packet consists of a homa_data_hdr followed by message data.
+ * An outgoing packet can have this simple format as well, or it can be
+ * structured as a GSO packet with the following format:
+ *
+ *    |-----------------------|
+ *    |                       |
+ *    |     homa_data_hdr     |
+ *    |                       |
+ *    |-----------------------|
+ *    |                       |
+ *    |                       |
+ *    |     segment data      |
+ *    |                       |
+ *    |                       |
+ *    |-----------------------|
+ *    |     homa_seg_hdr      |
+ *    |-----------------------|
+ *    |                       |
+ *    |                       |
+ *    |     segment data      |
+ *    |                       |
+ *    |                       |
+ *    |-----------------------|
+ *    |     homa_seg_hdr      |
+ *    |-----------------------|
+ *    |                       |
+ *    |                       |
+ *    |     segment data      |
+ *    |                       |
+ *    |                       |
+ *    |-----------------------|
+ *
+ * TSO will not adjust @homa_common_hdr.sequence in the segments, so Homa
+ * sprinkles correct offsets (in homa_seg_hdrs) throughout the segment data;
+ * TSO/GSO will include a different homa_seg_hdr in each generated packet.
+ */
+
+struct homa_seg_hdr {
+	/**
+	 * @offset: Offset within message of the first byte of data in
+	 * this segment.
+	 */
+	__be32 offset;
+} __packed;
+
+struct homa_data_hdr {
+	struct homa_common_hdr common;
+
+	/** @message_length: Total #bytes in the message. */
+	__be32 message_length;
+
+	__be32 reserved1;
+
+	/**
+	 * @ack: If the @client_id field of this is nonzero, it provides
+	 * info about an RPC that the recipient can now safely free. Note:
+	 * in TSO packets this will get duplicated in each of the segments;
+	 * in order to avoid repeated attempts to ack the same RPC,
+	 * homa_gro_receive will clear this field in all segments but the
+	 * first.
+	 */
+	struct homa_ack ack;
+
+	__be16 reserved2;
+
+	/**
+	 * @retransmit: 1 means this packet was sent in response to a RESEND
+	 * (it has already been sent previously).
+	 */
+	u8 retransmit;
+
+	char pad[3];
+
+	/** @seg: First of possibly many segments. */
+	struct homa_seg_hdr seg;
+} __packed;
+
+/**
+ * homa_data_len() - Returns the total number of bytes in a DATA packet
+ * after the homa_data_hdr. Note: if the packet is a GSO packet, the result
+ * may include metadata as well as packet data.
+ * @skb:   Incoming data packet
+ * Return: see above
+ */
+static inline int homa_data_len(struct sk_buff *skb)
+{
+	return skb->len - skb_transport_offset(skb) -
+			sizeof(struct homa_data_hdr);
+}
+
+/**
+ * struct homa_resend_hdr - Wire format for RESEND packets.
+ *
+ * A RESEND is sent by the receiver when it believes that message data may
+ * have been lost in transmission (or if it is concerned that the sender may
+ * have crashed). The sender should retransmit the specified portion of the
+ * message, even if it has already sent it previously.
+ */
+struct homa_resend_hdr {
+	/** @common: Fields common to all packet types. */
+	struct homa_common_hdr common;
+
+	/**
+	 * @offset: Offset within the message of the first byte of data that
+	 * should be retransmitted.
+	 */
+	__be32 offset;
+
+	/**
+	 * @length: Number of bytes of data to retransmit. -1 means no data
+	 * has been received for the message, so everything sent previously
+	 * should be retransmitted.
+	 */
+	__be32 length;
+} __packed;
+
+/**
+ * struct homa_rpc_unknown_hdr - Wire format for RPC_UNKNOWN packets.
+ *
+ * An RPC_UNKNOWN packet is sent by either server or client when it receives a
+ * packet for an RPC that is unknown to it. When a client receives an
+ * RPC_UNKNOWN packet it will typically restart the RPC from the beginning;
+ * when a server receives an RPC_UNKNOWN packet it will typically discard its
+ * state for the RPC.
+ */
+struct homa_rpc_unknown_hdr {
+	/** @common: Fields common to all packet types. */
+	struct homa_common_hdr common;
+} __packed;
+
+/**
+ * struct homa_busy_hdr - Wire format for BUSY packets.
+ *
+ * These packets tell the recipient that the sender is still alive (even if
+ * it isn't sending data expected by the recipient).
+ */
+struct homa_busy_hdr {
+	/** @common: Fields common to all packet types. */
+	struct homa_common_hdr common;
+} __packed;
+
+/**
+ * struct homa_need_ack_hdr - Wire format for NEED_ACK packets.
+ *
+ * These packets ask the recipient (a client) to return an ACK message if
+ * the packet's RPC is no longer active.
+ */
+struct homa_need_ack_hdr {
+	/** @common: Fields common to all packet types. */
+	struct homa_common_hdr common;
+} __packed;
+
+/**
+ * struct homa_ack_hdr - Wire format for ACK packets.
+ *
+ * These packets are sent from a client to a server to indicate that
+ * a set of RPCs is no longer active on the client, so the server can
+ * free any state it may have for them.
+ */
+struct homa_ack_hdr {
+	/** @common: Fields common to all packet types. */
+	struct homa_common_hdr common;
+
+	/** @num_acks: Number of (leading) elements in @acks that are valid. */
+	__be16 num_acks;
+
+#define HOMA_MAX_ACKS_PER_PKT 5
+	/** @acks: Info about RPCs that are no longer active. */
+	struct homa_ack acks[HOMA_MAX_ACKS_PER_PKT];
+} __packed;
+
+/**
+ * homa_local_id() - Given an RPC identifier from an input packet (which
+ * is network-encoded), return the decoded id we should use for that
+ * RPC on this machine.
+ * @sender_id:  RPC id from an incoming packet, such as h->common.sender_id
+ * Return: see above
+ */
+static inline u64 homa_local_id(__be64 sender_id)
+{
+	/* If the client bit was set on the sender side, it needs to be
+	 * removed here, and conversely.
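+	 * For example, a client's id 100 arrives at its server as
+	 * sender_id 100 and becomes local id 101 there; the server's
+	 * response carries sender_id 101, which maps back to 100.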
+	 */
+	return be64_to_cpu(sender_id) ^ 1;
+}
+
+#endif /* _HOMA_WIRE_H */
-- 
2.43.0



* [PATCH net-next v15 03/15] net: homa: create shared Homa header files
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
  2025-08-18 20:55 ` [PATCH net-next v15 01/15] net: homa: define user-visible API for Homa John Ousterhout
  2025-08-18 20:55 ` [PATCH net-next v15 02/15] net: homa: create homa_wire.h John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-26  9:05   ` Paolo Abeni
  2025-08-18 20:55 ` [PATCH net-next v15 04/15] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

homa_impl.h defines "struct homa", which contains overall information
about the Homa transport, plus various odds and ends that are used
throughout the Homa implementation.

homa_stub.h is a temporary header file that provides stubs for
facilities that have been omitted from this first patch series. This
file will go away once Homa is fully upstreamed.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v12:
* Use tsc_khz instead of cpu_khz
* Make is_homa_pkt work properly with IPv6 (it only worked for IPv4)

Changes for v11:
* Move link_mbps variable from struct homa_pacer back to struct homa.

Changes for v10:
* Eliminate __context__ definition
* Replace __u16 with u16, __u8 with u8, etc.
* Refactor resend mechanism

Changes for v9:
* Move information from sync.txt into comments in homa_impl.h
* Add limits on number of active peer structs
* Introduce homa_net objects; there is now a single global struct homa
  shared by all network namespaces, with one homa_net per network namespace
  with netns-specific information.
* Introduce homa_clock as an abstraction layer for the fine-grain clock.
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory)
* Eliminate sizeof32 definition

Changes for v8:
* Pull out pacer-related fields into separate struct homa_pacer in homa_pacer.h

Changes for v7:
* Make Homa a per-net subsystem
* Track tx buffer memory usage
* Refactor waiting mechanism for incoming packets: simplify wait
  criteria and use standard Linux mechanisms for waiting
* Remove "lock_slow" functions, which don't add functionality in this
  patch series
* Rename homa_rpc_free to homa_rpc_end
* Add homa_make_header_avl function
* Use u64 and __u64 properly
---
 net/homa/homa_impl.h | 631 +++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_stub.h |  91 +++++++
 2 files changed, 722 insertions(+)
 create mode 100644 net/homa/homa_impl.h
 create mode 100644 net/homa/homa_stub.h

diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
new file mode 100644
index 000000000000..f5191ec0b198
--- /dev/null
+++ b/net/homa/homa_impl.h
@@ -0,0 +1,631 @@
+/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+ */
+
+/* This file contains definitions that are shared across the files
+ * that implement Homa for Linux.
+ */
+
+#ifndef _HOMA_IMPL_H
+#define _HOMA_IMPL_H
+
+#include <linux/bug.h>
+
+#include <linux/audit.h>
+#include <linux/icmp.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/completion.h>
+#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/vmalloc.h>
+#include <net/icmp.h>
+#include <net/ip.h>
+#include <net/netns/generic.h>
+#include <net/protocol.h>
+#include <net/inet_common.h>
+#include <net/gro.h>
+#include <net/rps.h>
+#ifdef CONFIG_X86_TSC
+#include <asm/tsc.h>
+#endif
+
+#include <linux/homa.h>
+#include "homa_wire.h"
+
+/* Forward declarations. */
+struct homa;
+struct homa_peer;
+struct homa_rpc;
+struct homa_sock;
+
+/**
+ * union sockaddr_in_union - Holds either an IPv4 or IPv6 address (smaller
+ * and easier to use than sockaddr_storage).
+ */
+union sockaddr_in_union {
+	/** @sa: Used to access as a generic sockaddr. */
+	struct sockaddr sa;
+
+	/** @in4: Used to access as IPv4 socket. */
+	struct sockaddr_in in4;
+
+	/** @in6: Used to access as IPv6 socket.  */
+	struct sockaddr_in6 in6;
+};
+
+/**
+ * struct homa - Stores overall information about the Homa transport, which
+ * is shared across all Homa sockets and all network namespaces.
+ */
+struct homa {
+	/**
+	 * @next_outgoing_id: Id to use for next outgoing RPC request.
+	 * This is always even: it's used only to generate client-side ids.
+	 * Accessed without locks. Note: RPC ids are unique within a
+	 * single client machine.
+	 */
+	atomic64_t next_outgoing_id;
+
+	/**
+	 * @pacer:  Information related to the pacer; managed by homa_pacer.c.
+	 */
+	struct homa_pacer *pacer;
+
+	/**
+	 * @peertab: Info about all the other hosts we have communicated with;
+	 * includes peers from all network namespaces.
+	 */
+	struct homa_peertab *peertab;
+
+	/**
+	 * @socktab: Information about all open sockets. Dynamically
+	 * allocated; must be kfreed.
+	 */
+	struct homa_socktab *socktab;
+
+	/** @max_numa: Highest NUMA node id in use by any core. */
+	int max_numa;
+
+	/**
+	 * @link_mbps: The raw bandwidth of the network uplink, in
+	 * units of 1e06 bits per second.  Set externally via sysctl.
+	 */
+	int link_mbps;
+
+	/**
+	 * @resend_ticks: When an RPC's @silent_ticks reaches this value,
+	 * start sending RESEND requests.
+	 */
+	int resend_ticks;
+
+	/**
+	 * @resend_interval: minimum number of homa timer ticks between
+	 * RESENDs for the same RPC.
+	 */
+	int resend_interval;
+
+	/**
+	 * @timeout_ticks: abort an RPC if its silent_ticks reaches this value.
+	 */
+	int timeout_ticks;
+
+	/**
+	 * @timeout_resends: Assume that a server is dead if it has not
+	 * responded after this many RESENDs have been sent to it.
+	 */
+	int timeout_resends;
+
+	/**
+	 * @request_ack_ticks: How many timer ticks we'll wait for the
+	 * client to ack an RPC before explicitly requesting an ack.
+	 * Set externally via sysctl.
+	 */
+	int request_ack_ticks;
+
+	/**
+	 * @reap_limit: Maximum number of packet buffers to free in a
+	 * single call to homa_rpc_reap.
+	 */
+	int reap_limit;
+
+	/**
+	 * @dead_buffs_limit: If the number of packet buffers in dead but
+	 * not yet reaped RPCs is less than this number, then Homa reaps
+	 * RPCs in a way that minimizes impact on performance but may permit
+	 * dead RPCs to accumulate. If the number of dead packet buffers
+	 * exceeds this value, then Homa switches to a more aggressive approach
+	 * to reaping RPCs. Set externally via sysctl.
+	 */
+	int dead_buffs_limit;
+
+	/**
+	 * @max_dead_buffs: The largest aggregate number of packet buffers
+	 * in dead (but not yet reaped) RPCs that has existed so far in a
+	 * single socket.  Readable via sysctl, and may be reset via sysctl
+	 * to begin recalculating.
+	 */
+	int max_dead_buffs;
+
+	/**
+	 * @max_gso_size: Maximum number of bytes that will be included
+	 * in a single output packet that Homa passes to Linux. Can be set
+	 * externally via sysctl to lower the limit already enforced by Linux.
+	 */
+	int max_gso_size;
+
+	/**
+	 * @gso_force_software: A non-zero value will cause Homa to perform
+	 * segmentation in software using GSO; zero means ask the NIC to
+	 * perform TSO. Set externally via sysctl.
+	 */
+	int gso_force_software;
+
+	/**
+	 * @wmem_max: Limit on the value of sk_sndbuf for any socket. Set
+	 * externally via sysctl.
+	 */
+	int wmem_max;
+
+	/**
+	 * @timer_ticks: number of times that homa_timer has been invoked
+	 * (may wraparound, which is safe).
+	 */
+	u32 timer_ticks;
+
+	/**
+	 * @flags: a collection of bits that can be set using sysctl
+	 * to trigger various behaviors.
+	 */
+	int flags;
+
+	/**
+	 * @bpage_lease_usecs: how long a core can own a bpage (microseconds)
+	 * before its ownership can be revoked to reclaim the page.
+	 */
+	int bpage_lease_usecs;
+
+	/**
+	 * @bpage_lease_cycles: same as bpage_lease_usecs except in
+	 * homa_clock() units.
+	 */
+	int bpage_lease_cycles;
+
+	/**
+	 * @next_id: Set via sysctl; causes next_outgoing_id to be set to
+	 * this value; always reads as zero. Typically used while debugging to
+	 * ensure that different nodes use different ranges of ids.
+	 */
+	int next_id;
+
+	/**
+	 * @destroyed: True means that this structure is being destroyed
+	 * so everyone should clean up.
+	 */
+	bool destroyed;
+
+};
+
+/**
+ * struct homa_net - Contains Homa information that is specific to a
+ * particular network namespace.
+ */
+struct homa_net {
+	/** @net: Network namespace corresponding to this structure. */
+	struct net *net;
+
+	/** @homa: Global Homa information. */
+	struct homa *homa;
+
+	/**
+	 * @prev_default_port: The most recent port number assigned from
+	 * the range of default ports.
+	 */
+	u16 prev_default_port;
+
+	/**
+	 * @num_peers: The total number of struct homa_peers that exist
+	 * for this namespace. Managed by homa_peer.c under the peertab lock.
+	 */
+	int num_peers;
+
+};
+
+/**
+ * struct homa_skb_info - Additional information needed by Homa for each
+ * outbound DATA packet. Space is allocated for this at the very end of the
+ * linear part of the skb.
+ */
+struct homa_skb_info {
+	/**
+	 * @next_skb: used to link together all of the skb's for an
+	 * outgoing Homa message (in order of offset).
+	 */
+	struct sk_buff *next_skb;
+
+	/**
+	 * @wire_bytes: total number of bytes of network bandwidth that
+	 * will be consumed by this packet. This includes everything,
+	 * including additional headers added by GSO, IP header, Ethernet
+	 * header, CRC, preamble, and inter-packet gap.
+	 */
+	int wire_bytes;
+
+	/**
+	 * @data_bytes: total bytes of message data across all of the
+	 * segments in this packet.
+	 */
+	int data_bytes;
+
+	/** @seg_length: maximum number of data bytes in each GSO segment. */
+	int seg_length;
+
+	/**
+	 * @offset: offset within the message of the first byte of data in
+	 * this packet.
+	 */
+	int offset;
+
+	/**
+	 * @bytes_left: number of bytes in this packet and all later packets
+	 * in the same message. Used to prioritize packets for SRPT.
+	 */
+	int bytes_left;
+
+	/**
+	 * @rpc: RPC that this packet came from. Used only as a unique
+	 * identifier: it is not safe to dereference this pointer (the RPC
+	 * may no longer exist).
+	 */
+	void *rpc;
+
+	/**
+	 * @next_sibling: next packet in @rpc that has been deferred in
+	 * homa_qdisc because the NIC queue was too long, or NULL if none.
+	 */
+	struct sk_buff *next_sibling;
+
+	/**
+	 * @last_sibling: last packet in @next_sibling list. Only valid
+	 * for the "head" packet (which is in qdev->homa_deferred).
+	 */
+	struct sk_buff *last_sibling;
+};
+
+/**
+ * homa_get_skb_info() - Return the address of Homa's private information
+ * for an sk_buff.
+ * @skb:     Socket buffer whose info is needed.
+ * Return: address of Homa's private information for @skb.
+ */
+static inline struct homa_skb_info *homa_get_skb_info(struct sk_buff *skb)
+{
+	return (struct homa_skb_info *)(skb_end_pointer(skb)) - 1;
+}
+
+/**
+ * homa_set_doff() - Fills in the doff TCP header field for a Homa packet.
+ * @h:     Packet header whose doff field is to be set.
+ * @size:  Size of the "header", in bytes (must be a multiple of 4). This
+ *         information is used only for TSO; it's the number of bytes
+ *         that should be replicated in each segment. The bytes after
+ *         this will be distributed among segments.
+ */
+static inline void homa_set_doff(struct homa_data_hdr *h, int size)
+{
+	/* Drop the 2 low-order bits from size and set the 4 high-order
+	 * bits of doff from what's left.
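+	 * For example, size = 40 (ten 4-byte chunks) yields doff = 0xa0.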
+	 */
+	h->common.doff = size << 2;
+}
+
+/**
+ * skb_is_ipv6() - Return true if the packet is encapsulated with IPv6,
+ * false otherwise (presumably it's IPv4).
+ * @skb:   Packet buffer to check.
+ * Return: see above.
+ */
+static inline bool skb_is_ipv6(const struct sk_buff *skb)
+{
+	return ipv6_hdr(skb)->version == 6;
+}
+
+/**
+ * ipv6_to_ipv4() - Given an IPv6 address produced by ipv4_to_ipv6, return
+ * the original IPv4 address (in network byte order).
+ * @ip6:  IPv6 address; assumed to be a mapped IPv4 address.
+ * Return: IPv4 address stored in @ip6.
+ */
+static inline __be32 ipv6_to_ipv4(const struct in6_addr ip6)
+{
+	return ip6.in6_u.u6_addr32[3];
+}
+
+/**
+ * canonical_ipv6_addr() - Convert a socket address to the "standard"
+ * form used in Homa, which is always an IPv6 address; if the original address
+ * was IPv4, convert it to an IPv4-mapped IPv6 address.
+ * @addr:   Address to canonicalize (if NULL, "any" is returned).
+ * Return: IPv6 address corresponding to @addr.
+ */
+static inline struct in6_addr canonical_ipv6_addr(const union sockaddr_in_union
+						  *addr)
+{
+	struct in6_addr mapped;
+
+	if (addr) {
+		if (addr->sa.sa_family == AF_INET6)
+			return addr->in6.sin6_addr;
+		ipv6_addr_set_v4mapped(addr->in4.sin_addr.s_addr, &mapped);
+		return mapped;
+	}
+	return in6addr_any;
+}
+
+/**
+ * skb_canonical_ipv6_saddr() - Given a packet buffer, return its source
+ * address in the "standard" form used in Homa, which is always an IPv6
+ * address; if the original address was IPv4, convert it to an IPv4-mapped
+ * IPv6 address.
+ * @skb:   The source address will be extracted from this packet buffer.
+ * Return: IPv6 address for @skb's source machine.
+ */
+static inline struct in6_addr skb_canonical_ipv6_saddr(struct sk_buff *skb)
+{
+	struct in6_addr mapped;
+
+	if (skb_is_ipv6(skb))
+		return ipv6_hdr(skb)->saddr;
+	ipv6_addr_set_v4mapped(ip_hdr(skb)->saddr, &mapped);
+	return mapped;
+}
+
+/**
+ * is_homa_pkt() - Return true if @skb is a Homa packet, false otherwise.
+ * @skb:    Packet buffer to check.
+ * Return:  see above.
+ */
+static inline bool is_homa_pkt(struct sk_buff *skb)
+{
+	int protocol;
+
+	/* If the network header hasn't been created yet, assume it's a
+	 * Homa packet (Homa never generates any non-Homa packets).
+	 */
+	if (skb->network_header == 0)
+		return true;
+	protocol = (skb_is_ipv6(skb)) ? ipv6_hdr(skb)->nexthdr :
+					ip_hdr(skb)->protocol;
+	return protocol == IPPROTO_HOMA;
+}
+
+/**
+ * homa_make_header_avl() - Invokes pskb_may_pull to make sure that all the
+ * Homa header information for a packet is in the linear part of the skb
+ * where it can be addressed using skb_transport_header.
+ * @skb:     Packet for which header is needed.
+ * Return:   The result of pskb_may_pull (true for success)
+ */
+static inline bool homa_make_header_avl(struct sk_buff *skb)
+{
+	int pull_length;
+
+	pull_length = skb_transport_header(skb) - skb->data + HOMA_MAX_HEADER;
+	if (pull_length > skb->len)
+		pull_length = skb->len;
+	return pskb_may_pull(skb, pull_length);
+}
+
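+/* No-op stubs for Homa's unit-test logging and hook points; the real
+ * versions are part of the unit tests, which have not yet been
+ * upstreamed (see the cover letter).
+ */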
+#define UNIT_LOG(...)
+#define UNIT_HOOK(...)
+
+extern unsigned int homa_net_id;
+
+/**
+ * homa_net_from_net() - Return the struct homa_net associated with a particular
+ * struct net.
+ * @net:     Get the Homa data for this net namespace.
+ * Return:   see above.
+ */
+static inline struct homa_net *homa_net_from_net(struct net *net)
+{
+	return (struct homa_net *)net_generic(net, homa_net_id);
+}
+
+/**
+ * homa_from_skb() - Return the struct homa associated with a particular
+ * sk_buff.
+ * @skb:     Get the struct homa for this packet buffer.
+ * Return:   see above.
+ */
+static inline struct homa *homa_from_skb(struct sk_buff *skb)
+{
+	struct homa_net *hnet;
+
+	hnet = net_generic(dev_net(skb->dev), homa_net_id);
+	return hnet->homa;
+}
+
+/**
+ * homa_net_from_skb() - Return the struct homa_net associated with a particular
+ * sk_buff.
+ * @skb:     Get the struct homa for this packet buffer.
+ * Return:   see above.
+ */
+static inline struct homa_net *homa_net_from_skb(struct sk_buff *skb)
+{
+	struct homa_net *hnet;
+
+	hnet = net_generic(dev_net(skb->dev), homa_net_id);
+	return hnet;
+}
+
+/**
+ * homa_clock() - Return a fine-grain clock value that is monotonic and
+ * consistent across cores.
+ * Return: see above.
+ */
+static inline u64 homa_clock(void)
+{
+	/* As of May 2025 there does not appear to be a portable API that
+	 * meets Homa's needs:
+	 * - The Intel X86 TSC works well but is not portable.
+	 * - sched_clock() does not guarantee monotonicity or consistency.
+	 * - ktime_get_mono_fast_ns and ktime_get_raw_fast_ns are very slow
+	 *   (27 ns to read, vs 8 ns for TSC)
+	 * Thus we use a hybrid approach that uses TSC (via get_cycles) where
+	 * available (which should be just about everywhere Homa runs).
+	 */
+#ifdef CONFIG_X86_TSC
+	return get_cycles();
+#else
+	return ktime_get_mono_fast_ns();
+#endif /* CONFIG_X86_TSC */
+}
+
+/**
+ * homa_clock_khz() - Return the frequency of the values returned by
+ * homa_clock, in units of KHz.
+ * Return: see above.
+ */
+static inline u64 homa_clock_khz(void)
+{
+#ifdef CONFIG_X86_TSC
+	return tsc_khz;
+#else
+	return 1000000;
+#endif /* CONFIG_X86_TSC */
+}
+
+/**
+ * homa_ns_to_cycles() - Convert from units of nanoseconds to units of
+ * homa_clock().
+ * @ns:      A time measurement in nanoseconds
+ * Return:   The time in homa_clock() units corresponding to @ns.
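+ *           For example, with a 3 GHz TSC (homa_clock_khz() == 3000000),
+ *           homa_ns_to_cycles(1000) returns 3000.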
+ */
+static inline u64 homa_ns_to_cycles(u64 ns)
+{
+	u64 tmp;
+
+	tmp = ns * homa_clock_khz();
+	do_div(tmp, 1000000);
+	return tmp;
+}
+
+/**
+ * homa_usecs_to_cycles() - Convert from units of microseconds to units of
+ * homa_clock().
+ * @usecs:   A time measurement in microseconds
+ * Return:   The time in homa_clock() units corresponding to @usecs.
+ */
+static inline u64 homa_usecs_to_cycles(u64 usecs)
+{
+	u64 tmp;
+
+	tmp = usecs * homa_clock_khz();
+	do_div(tmp, 1000);
+	return tmp;
+}
+
+/* Homa Locking Strategy:
+ *
+ * (Note: this documentation is referenced in several other places in the
+ * Homa code)
+ *
+ * In the Linux TCP/IP stack the primary locking mechanism is a sleep-lock
+ * per socket. However, per-socket locks aren't adequate for Homa, because
+ * sockets are "larger" in Homa. In TCP, a socket corresponds to a single
+ * connection between two peers; an application can have hundreds or
+ * thousands of sockets open at once, so per-socket locks leave lots of
+ * opportunities for concurrency. With Homa, a single socket can be used for
+ * communicating with any number of peers, so there will typically be just
+ * one socket per thread. As a result, a single Homa socket must support many
+ * concurrent RPCs efficiently, and a per-socket lock would create a bottleneck
+ * (Homa tried this approach initially).
+ *
+ * Thus, the primary locks used in Homa are spinlocks at RPC granularity. This
+ * allows operations on different RPCs for the same socket to proceed
+ * concurrently. Homa also has socket locks (which are spinlocks different
+ * from the official socket sleep-locks) but these are used much less
+ * frequently than RPC locks.
+ *
+ * Lock Ordering:
+ *
+ * There are several other locks in Homa besides RPC locks, all of which
+ * are spinlocks. When multiple locks are held, they must be acquired in a
+ * consistent order in order to prevent deadlock. Here are the rules for Homa:
+ * 1. Except for RPC and socket locks, all locks should be considered
+ *    "leaf" locks: don't acquire other locks while holding them.
+ * 2. The lock order is:
+ *    * RPC lock
+ *    * Socket lock
+ *    * Other lock
+ * 3. It is not safe to wait on an RPC lock while holding any other lock.
+ * 4. It is safe to wait on a socket lock while holding an RPC lock, but
+ *    not while holding any other lock.
+ *
+ * It may seem surprising that RPC locks are acquired *before* socket locks,
+ * but this is essential for high performance. Homa has been designed so that
+ * many common operations (such as processing input packets) can be performed
+ * while holding only an RPC lock; this allows operations on different RPCs
+ * to proceed in parallel. Only a few operations, such as handing off an
+ * incoming message to a waiting thread, require the socket lock. If socket
+ * locks had to be acquired first, any operation that might eventually need
+ * the socket lock would have to acquire it before the RPC lock, which would
+ * severely restrict concurrency.
+ *
+ * Socket Shutdown:
+ *
+ * It is possible for socket shutdown to begin while operations are underway
+ * that hold RPC locks but not the socket lock. For example, a new RPC
+ * creation might be underway when a socket is shut down. The RPC creation
+ * will eventually acquire the socket lock and add the new RPC to those
+ * for the socket; it would be very bad if this were to happen after
+ * homa_sock_shutdown has deleted all RPCs for the socket.
+ * In general, any operation that acquires a socket lock must check
+ * hsk->shutdown after acquiring the lock and abort if hsk->shutdown is set.
+ *
+ * Spinlock Implications:
+ *
+ * Homa uses spinlocks exclusively; this is needed because locks typically
+ * need to be acquired at atomic level, such as in SoftIRQ code.
+ *
+ * Operations that can block, such as memory allocation and copying data
+ * to/from user space, are not permitted while holding spinlocks (spinlocks
+ * disable interrupts, so the holder must not block). This results in awkward
+ * code in several places to move restricted operations outside locked
+ * regions. Such code typically looks like this:
+ *   - Acquire a reference on an object such as an RPC, in order to prevent
+ *     the object from being deleted.
+ *   - Release the object's lock.
+ *   - Perform the restricted operation.
+ *   - Re-acquire the lock.
+ *   - Release the reference.
+ * It is possible that the object may have been modified by some other party
+ * while it was unlocked, so additional checks may be needed after reacquiring
+ * the lock. As one example, an RPC may have been terminated, in which case
+ * any operation in progress on that RPC should be aborted after reacquiring
+ * the lock.
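+ *
+ * In code, the pattern looks roughly like this (an illustrative sketch
+ * only: homa_rpc_hold/homa_rpc_put are the RPC reference-count
+ * primitives added in this series, and RPC_DEAD is the terminated
+ * state):
+ *
+ *	homa_rpc_hold(rpc);			// pin the RPC's memory
+ *	homa_rpc_unlock(rpc);
+ *	err = copy_to_user(dst, src, len);	// may block; no locks held
+ *	homa_rpc_lock(rpc);
+ *	homa_rpc_put(rpc);
+ *	if (rpc->state == RPC_DEAD)		// recheck after relocking
+ *		goto abort;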
+ *
+ * Lists of RPCs:
+ *
+ * There are a few places where Homa needs to process all of the RPCs
+ * associated with a socket, such as the timer. Such code must first lock
+ * the socket (to protect access to the link pointers) then lock
+ * individual RPCs on the list. However, this violates the rules for locking
+ * order. It isn't safe to unlock the socket before locking the individual RPCs,
+ * because RPCs could be deleted and their memory recycled between the unlock
+ * of the socket lock and the lock of the RPC; this could result in corruption.
+ * Homa uses two different approaches to handle this situation:
+ * 1. Use ``homa_protect_rpcs`` to prevent RPC reaping for a socket. RPCs can
+ *    still be terminated, but their memory won't go away until
+ *    homa_unprotect_rpcs is invoked. This allows the socket lock to be
+ *    released before acquiring RPC locks; after acquiring each RPC lock,
+ *    the RPC must be checked to see if it has been terminated; if so,
+ *    skip it (see the sketch at the end of this comment).
+ * 2. Use ``spin_trylock_bh`` to acquire the RPC lock while still holding the
+ *    socket lock. If this fails, then release the socket lock and retry
+ *    both the socket lock and the RPC lock. Of course, the state of both
+ *    socket and RPC could change before the locks are finally acquired.
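+ *
+ * A sketch of approach 1 (the list and field names are illustrative,
+ * not the actual ones):
+ *
+ *	homa_protect_rpcs(hsk);
+ *	homa_sock_lock(hsk);
+ *	list_for_each_entry(rpc, &hsk->active_rpcs, active_links) {
+ *		homa_sock_unlock(hsk);
+ *		homa_rpc_lock(rpc);
+ *		if (rpc->state != RPC_DEAD)
+ *			...process rpc...
+ *		homa_rpc_unlock(rpc);
+ *		homa_sock_lock(hsk);
+ *	}
+ *	homa_sock_unlock(hsk);
+ *	homa_unprotect_rpcs(hsk);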
+ */
+
+#endif /* _HOMA_IMPL_H */
diff --git a/net/homa/homa_stub.h b/net/homa/homa_stub.h
new file mode 100644
index 000000000000..875d3bfe440d
--- /dev/null
+++ b/net/homa/homa_stub.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+ */
+
+/* This file contains stripped-down replacements that have been
+ * temporarily removed from Homa during the Linux upstreaming
+ * process. By the time upstreaming is complete this file will
+ * have gone away.
+ */
+
+#ifndef _HOMA_STUB_H
+#define _HOMA_STUB_H
+
+#include "homa_impl.h"
+
+static inline int homa_skb_init(struct homa *homa)
+{
+	return 0;
+}
+
+static inline void homa_skb_cleanup(struct homa *homa)
+{}
+
+static inline void homa_skb_release_pages(struct homa *homa)
+{}
+
+static inline int homa_skb_append_from_iter(struct homa *homa,
+					    struct sk_buff *skb,
+					    struct iov_iter *iter, int length)
+{
+	char *dst = skb_put(skb, length);
+
+	if (copy_from_iter(dst, length, iter) != length)
+		return -EFAULT;
+	return 0;
+}
+
+static inline int homa_skb_append_to_frag(struct homa *homa,
+					  struct sk_buff *skb, void *buf,
+					  int length)
+{
+	char *dst = skb_put(skb, length);
+
+	memcpy(dst, buf, length);
+	return 0;
+}
+
+static inline int  homa_skb_append_from_skb(struct homa *homa,
+					    struct sk_buff *dst_skb,
+					    struct sk_buff *src_skb,
+					    int offset, int length)
+{
+	return homa_skb_append_to_frag(homa, dst_skb,
+			skb_transport_header(src_skb) + offset, length);
+}
+
+static inline void homa_skb_free_tx(struct homa *homa, struct sk_buff *skb)
+{
+	kfree_skb(skb);
+}
+
+static inline void homa_skb_free_many_tx(struct homa *homa,
+					 struct sk_buff **skbs, int count)
+{
+	int i;
+
+	for (i = 0; i < count; i++)
+		kfree_skb(skbs[i]);
+}
+
+static inline void homa_skb_get(struct sk_buff *skb, void *dest, int offset,
+				int length)
+{
+	memcpy(dest, skb_transport_header(skb) + offset, length);
+}
+
+static inline struct sk_buff *homa_skb_alloc_tx(int length)
+{
+	struct sk_buff *skb;
+
+	skb = alloc_skb(HOMA_SKB_EXTRA + sizeof(struct homa_skb_info) + length,
+			GFP_ATOMIC);
+	if (likely(skb)) {
+		skb_reserve(skb, HOMA_SKB_EXTRA);
+		skb_reset_transport_header(skb);
+	}
+	return skb;
+}
+
+static inline void homa_skb_stash_pages(struct homa *homa, int length)
+{}
+
+#endif /* _HOMA_STUB_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 04/15] net: homa: create homa_pool.h and homa_pool.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (2 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 03/15] net: homa: create shared Homa header files John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-18 20:55 ` [PATCH net-next v15 05/15] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

These files implement Homa's mechanism for managing application-level
buffer space for incoming messages. This mechanism is needed to allow
Homa to copy data out to user space in parallel with receiving packets;
it was discussed in a talk at NetDev 0x17.
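
As a usage illustration, an application might set up its receive buffer
region roughly as follows (a hedged sketch: the SOL_HOMA level, the use
of HOMA_SO_RCVBUF with setsockopt, and the 64-bpage sizing are
assumptions; struct homa_rcvbuf_args and HOMA_SO_RCVBUF are defined in
this patch; includes and error handling omitted):

	struct homa_rcvbuf_args args;
	void *region;

	/* Reserve page-aligned buffer space for incoming messages. */
	region = mmap(NULL, 64 * HOMA_BPAGE_SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	args.start = (uintptr_t)region;
	args.length = 64 * HOMA_BPAGE_SIZE;
	setsockopt(fd, SOL_HOMA, HOMA_SO_RCVBUF, &args, sizeof(args));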

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v11:
* Rework the mechanism for waking up RPCs that stalled waiting for
  buffer pool space

Changes for v10:
* Fix minor syntactic issues such as reverse xmas tree

Changes for v9:
* Eliminate use of _Static_assert
* Use new homa_clock abstraction layer.
* Allow memory to be allocated without GFP_ATOMIC
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory)
* Remove sync.txt, move its contents into comments (mostly in homa_impl.h)

Changes for v8:
* Refactor homa_pool APIs (move allocation/deallocation into homa_pool.c,
  move locking responsibility out)

Changes for v7:
* Use u64 and __u64 properly
* Eliminate extraneous use of RCU
* Refactor pool->cores to use percpu variable
* Use smp_processor_id instead of raw_smp_processor_id
---
 net/homa/homa_pool.c | 483 +++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_pool.h | 136 ++++++++++++
 2 files changed, 619 insertions(+)
 create mode 100644 net/homa/homa_pool.c
 create mode 100644 net/homa/homa_pool.h

diff --git a/net/homa/homa_pool.c b/net/homa/homa_pool.c
new file mode 100644
index 000000000000..60cd79eef493
--- /dev/null
+++ b/net/homa/homa_pool.c
@@ -0,0 +1,483 @@
+// SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+
+
+#include "homa_impl.h"
+#include "homa_pool.h"
+
+/* This file contains functions that manage user-space buffer pools. */
+
+/* Pools must always have at least this many bpages (no particular
+ * reasoning behind this value).
+ */
+#define MIN_POOL_SIZE 2
+
+/* Used when determining how many bpages to consider for allocation. */
+#define MIN_EXTRA 4
+
+/**
+ * set_bpages_needed() - Set the bpages_needed field of @pool based
+ * on the length of the first RPC that's waiting for buffer space.
+ * The caller must own the lock for @pool->hsk.
+ * @pool: Pool to update.
+ */
+static void set_bpages_needed(struct homa_pool *pool)
+{
+	struct homa_rpc *rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+						struct homa_rpc, buf_links);
+
+	pool->bpages_needed = (rpc->msgin.length + HOMA_BPAGE_SIZE - 1) >>
+			      HOMA_BPAGE_SHIFT;
+}
+
+/**
+ * homa_pool_alloc() - Allocate and initialize a new homa_pool (it will have
+ * no region associated with it until homa_pool_set_region is invoked).
+ * @hsk:          Socket the pool will be associated with.
+ * Return: A pointer to the new pool or ERR_PTR(-errno) if there was a problem.
+ */
+struct homa_pool *homa_pool_alloc(struct homa_sock *hsk)
+{
+	struct homa_pool *pool;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return ERR_PTR(-ENOMEM);
+	pool->hsk = hsk;
+	return pool;
+}
+
+/**
+ * homa_pool_set_region() - Associate a region of memory with a pool.
+ * @hsk:          Socket whose pool the region will be associated with.
+ *                Must not be locked, and the pool must not currently
+ *                have a region associated with it.
+ * @region:       First byte of the memory region for the pool, allocated
+ *                by the application; must be page-aligned.
+ * @region_size:  Total number of bytes available at @region.
+ * Return: Either zero (for success) or a negative errno for failure.
+ */
+int homa_pool_set_region(struct homa_sock *hsk, void __user *region,
+			 u64 region_size)
+{
+	struct homa_pool_core __percpu *cores;
+	struct homa_bpage *descriptors;
+	int i, result, num_bpages;
+	struct homa_pool *pool;
+
+	if (((uintptr_t)region) & ~PAGE_MASK)
+		return -EINVAL;
+
+	/* Allocate memory before locking the socket, so we can allocate
+	 * without GFP_ATOMIC.
+	 */
+	num_bpages = region_size >> HOMA_BPAGE_SHIFT;
+	if (num_bpages < MIN_POOL_SIZE)
+		return -EINVAL;
+	descriptors = kmalloc_array(num_bpages, sizeof(struct homa_bpage),
+				    GFP_KERNEL | __GFP_ZERO);
+	if (!descriptors)
+		return -ENOMEM;
+	cores = alloc_percpu_gfp(struct homa_pool_core,
+				 GFP_KERNEL | __GFP_ZERO);
+	if (!cores) {
+		result = -ENOMEM;
+		goto error;
+	}
+
+	homa_sock_lock(hsk);
+	pool = hsk->buffer_pool;
+	if (pool->region) {
+		result = -EINVAL;
+		homa_sock_unlock(hsk);
+		goto error;
+	}
+
+	pool->region = (char __user *)region;
+	pool->num_bpages = num_bpages;
+	pool->descriptors = descriptors;
+	atomic_set(&pool->free_bpages, pool->num_bpages);
+	pool->bpages_needed = INT_MAX;
+	pool->cores = cores;
+	pool->check_waiting_invoked = 0;
+
+	for (i = 0; i < pool->num_bpages; i++) {
+		struct homa_bpage *bp = &pool->descriptors[i];
+
+		spin_lock_init(&bp->lock);
+		bp->owner = -1;
+	}
+
+	homa_sock_unlock(hsk);
+	return 0;
+
+error:
+	kfree(descriptors);
+	free_percpu(cores);
+	return result;
+}
+
+/**
+ * homa_pool_free() - Destructor for homa_pool. After this method
+ * returns, the object should not be used (it will be freed here).
+ * @pool: Pool to destroy.
+ */
+void homa_pool_free(struct homa_pool *pool)
+{
+	if (pool->region) {
+		kfree(pool->descriptors);
+		free_percpu(pool->cores);
+		pool->region = NULL;
+	}
+	kfree(pool);
+}
+
+/**
+ * homa_pool_get_rcvbuf() - Return information needed to handle getsockopt
+ * for HOMA_SO_RCVBUF.
+ * @pool:         Pool for which information is needed.
+ * @args:         Store info here.
+ */
+void homa_pool_get_rcvbuf(struct homa_pool *pool,
+			  struct homa_rcvbuf_args *args)
+{
+	args->start = (uintptr_t)pool->region;
+	args->length = pool->num_bpages << HOMA_BPAGE_SHIFT;
+}
+
+/**
+ * homa_bpage_available() - Check whether a bpage is available for use.
+ * @bpage:      Bpage to check
+ * @now:        Current time (homa_clock() units)
+ * Return:      True if the bpage is free or if it can be stolen, otherwise
+ *              false.
+ */
+bool homa_bpage_available(struct homa_bpage *bpage, u64 now)
+{
+	int ref_count = atomic_read(&bpage->refs);
+
+	return ref_count == 0 || (ref_count == 1 && bpage->owner >= 0 &&
+			bpage->expiration <= now);
+}
+
+/**
+ * homa_pool_get_pages() - Allocate one or more full pages from the pool.
+ * @pool:         Pool from which to allocate pages
+ * @num_pages:    Number of pages needed
+ * @pages:        The indices of the allocated pages are stored here; caller
+ *                must ensure this array is big enough. Reference counts have
+ *                been set to 1 on all of these pages (or 2 if set_owner
+ *                was specified).
+ * @set_owner:    If nonzero, the current core is marked as owner of all
+ *                of the allocated pages (and the expiration time is also
+ *                set). Otherwise the pages are left unowned.
+ * Return: 0 for success, -1 if there wasn't enough free space in the pool.
+ */
+int homa_pool_get_pages(struct homa_pool *pool, int num_pages, u32 *pages,
+			int set_owner)
+{
+	int core_num = smp_processor_id();
+	struct homa_pool_core *core;
+	u64 now = homa_clock();
+	int alloced = 0;
+	int limit = 0;
+
+	core = this_cpu_ptr(pool->cores);
+	if (atomic_sub_return(num_pages, &pool->free_bpages) < 0) {
+		atomic_add(num_pages, &pool->free_bpages);
+		return -1;
+	}
+
+	/* Once we get to this point we know we will be able to find
+	 * enough free pages; now we just have to find them.
+	 */
+	while (alloced != num_pages) {
+		struct homa_bpage *bpage;
+		int cur;
+
+		/* If we don't need to use all of the bpages in the pool,
+		 * then try to use only the ones with low indexes. This
+		 * will reduce the cache footprint for the pool by reusing
+		 * a few bpages over and over. Specifically this code will
+		 * not consider any candidate page whose index is >= limit.
+		 * Limit is chosen to make sure there are a reasonable
+		 * number of free pages in the range, so we won't have to
+		 * check a huge number of pages.
+		 */
+		if (limit == 0) {
+			int extra;
+
+			limit = pool->num_bpages -
+				atomic_read(&pool->free_bpages);
+			extra = limit >> 2;
+			limit += (extra < MIN_EXTRA) ? MIN_EXTRA : extra;
+			if (limit > pool->num_bpages)
+				limit = pool->num_bpages;
+		}
+
+		cur = core->next_candidate;
+		core->next_candidate++;
+		if (cur >= limit) {
+			core->next_candidate = 0;
+
+			/* Must recompute the limit for each new loop through
+			 * the bpage array: we may need to consider a larger
+			 * range of pages because of concurrent allocations.
+			 */
+			limit = 0;
+			continue;
+		}
+		bpage = &pool->descriptors[cur];
+
+		/* Figure out whether this candidate is free (or can be
+		 * stolen). Do a quick check without locking the page, and
+		 * if the page looks promising, then lock it and check again
+		 * (must check again in case someone else snuck in and
+		 * grabbed the page).
+		 */
+		if (!homa_bpage_available(bpage, now))
+			continue;
+		if (!spin_trylock_bh(&bpage->lock))
+			/* Rather than wait for a locked page to become free,
+			 * just go on to the next page. If the page is locked,
+			 * it probably won't turn out to be available anyway.
+			 */
+			continue;
+		if (!homa_bpage_available(bpage, now)) {
+			spin_unlock_bh(&bpage->lock);
+			continue;
+		}
+		if (bpage->owner >= 0)
+			atomic_inc(&pool->free_bpages);
+		if (set_owner) {
+			atomic_set(&bpage->refs, 2);
+			bpage->owner = core_num;
+			bpage->expiration = now +
+					    pool->hsk->homa->bpage_lease_cycles;
+		} else {
+			atomic_set(&bpage->refs, 1);
+			bpage->owner = -1;
+		}
+		spin_unlock_bh(&bpage->lock);
+		pages[alloced] = cur;
+		alloced++;
+	}
+	return 0;
+}
+
+/**
+ * homa_pool_alloc_msg() - Allocate buffer space for an incoming message.
+ * @rpc:  RPC that needs space allocated for its incoming message (space must
+ *        not already have been allocated). The fields @msgin->num_bpages
+ *        and @msgin->bpage_offsets are filled in. Must be locked by caller.
+ * Return: The return value is normally 0, which means either buffer space
+ * was allocated or the @rpc was queued on @hsk->waiting_for_bufs. If a
+ * fatal error occurred, such as no buffer pool present, then a negative
+ * errno is returned.
+ */
+int homa_pool_alloc_msg(struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_pool *pool = rpc->hsk->buffer_pool;
+	int full_pages, partial, i, core_id;
+	struct homa_pool_core *core;
+	u32 pages[HOMA_MAX_BPAGES];
+	struct homa_bpage *bpage;
+	struct homa_rpc *other;
+
+	if (!pool->region)
+		return -ENOMEM;
+
+	/* First allocate any full bpages that are needed. */
+	full_pages = rpc->msgin.length >> HOMA_BPAGE_SHIFT;
+	if (unlikely(full_pages)) {
+		if (homa_pool_get_pages(pool, full_pages, pages, 0) != 0)
+			goto out_of_space;
+		for (i = 0; i < full_pages; i++)
+			rpc->msgin.bpage_offsets[i] = pages[i] <<
+					HOMA_BPAGE_SHIFT;
+	}
+	rpc->msgin.num_bpages = full_pages;
+
+	/* The last chunk may be less than a full bpage; for this we use
+	 * the bpage that we own (and reuse it for multiple messages).
+	 */
+	partial = rpc->msgin.length & (HOMA_BPAGE_SIZE - 1);
+	if (unlikely(partial == 0))
+		goto success;
+	core_id = smp_processor_id();
+	core = this_cpu_ptr(pool->cores);
+	bpage = &pool->descriptors[core->page_hint];
+	spin_lock_bh(&bpage->lock);
+	if (bpage->owner != core_id) {
+		spin_unlock_bh(&bpage->lock);
+		goto new_page;
+	}
+	if ((core->allocated + partial) > HOMA_BPAGE_SIZE) {
+		if (atomic_read(&bpage->refs) == 1) {
+			/* Bpage is totally free, so we can reuse it. */
+			core->allocated = 0;
+		} else {
+			bpage->owner = -1;
+
+			/* We know the reference count can't reach zero here
+			 * because of the check above, so we won't have to
+			 * decrement pool->free_bpages.
+			 */
+			atomic_dec_return(&bpage->refs);
+			spin_unlock_bh(&bpage->lock);
+			goto new_page;
+		}
+	}
+	bpage->expiration = homa_clock() +
+			    pool->hsk->homa->bpage_lease_cycles;
+	atomic_inc(&bpage->refs);
+	spin_unlock_bh(&bpage->lock);
+	goto allocate_partial;
+
+	/* Can't use the current page; get another one. */
+new_page:
+	if (homa_pool_get_pages(pool, 1, pages, 1) != 0) {
+		homa_pool_release_buffers(pool, rpc->msgin.num_bpages,
+					  rpc->msgin.bpage_offsets);
+		rpc->msgin.num_bpages = 0;
+		goto out_of_space;
+	}
+	core->page_hint = pages[0];
+	core->allocated = 0;
+
+allocate_partial:
+	rpc->msgin.bpage_offsets[rpc->msgin.num_bpages] = core->allocated
+			+ (core->page_hint << HOMA_BPAGE_SHIFT);
+	rpc->msgin.num_bpages++;
+	core->allocated += partial;
+
+success:
+	return 0;
+
+	/* We get here if there wasn't enough buffer space for this
+	 * message; add the RPC to hsk->waiting_for_bufs. The list is sorted
+	 * by RPC length in order to implement SRPT (shortest remaining
+	 * processing time first).
+	 */
+out_of_space:
+	homa_sock_lock(pool->hsk);
+	list_for_each_entry(other, &pool->hsk->waiting_for_bufs, buf_links) {
+		if (other->msgin.length > rpc->msgin.length) {
+			list_add_tail(&rpc->buf_links, &other->buf_links);
+			goto queued;
+		}
+	}
+	list_add_tail(&rpc->buf_links, &pool->hsk->waiting_for_bufs);
+
+queued:
+	set_bpages_needed(pool);
+	homa_sock_unlock(pool->hsk);
+	return 0;
+}
+
+/**
+ * homa_pool_get_buffer() - Given an RPC, figure out where to store incoming
+ * message data.
+ * @rpc:        RPC for which incoming message data is being processed; its
+ *              msgin must be properly initialized and buffer space must have
+ *              been allocated for the message.
+ * @offset:     Offset within @rpc's incoming message.
+ * @available:  Will be filled in with the number of bytes of space available
+ *              at the returned address (could be zero if offset is
+ *              (erroneously) past the end of the message).
+ * Return:      The application's virtual address for buffer space corresponding
+ *              to @offset in the incoming message for @rpc.
+ */
+void __user *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
+				  int *available)
+{
+	int bpage_index, bpage_offset;
+
+	bpage_index = offset >> HOMA_BPAGE_SHIFT;
+	if (offset >= rpc->msgin.length) {
+		WARN_ONCE(true, "%s got offset %d >= message length %d\n",
+			  __func__, offset, rpc->msgin.length);
+		*available = 0;
+		return NULL;
+	}
+	bpage_offset = offset & (HOMA_BPAGE_SIZE - 1);
+	*available = (bpage_index < (rpc->msgin.num_bpages - 1))
+			? HOMA_BPAGE_SIZE - bpage_offset
+			: rpc->msgin.length - offset;
+	return rpc->hsk->buffer_pool->region +
+			rpc->msgin.bpage_offsets[bpage_index] + bpage_offset;
+}
+
+/**
+ * homa_pool_release_buffers() - Release buffer space so that it can be
+ * reused.
+ * @pool:         Pool that the buffer space belongs to. Doesn't need to
+ *                be locked.
+ * @num_buffers:  How many buffers to release.
+ * @buffers:      Points to @num_buffers values, each of which is an offset
+ *                from the start of the pool to the buffer to be released.
+ * Return:        0 for success, otherwise a negative errno.
+ */
+int homa_pool_release_buffers(struct homa_pool *pool, int num_buffers,
+			      u32 *buffers)
+{
+	int result = 0;
+	int i;
+
+	if (!pool->region)
+		return result;
+	for (i = 0; i < num_buffers; i++) {
+		u32 bpage_index = buffers[i] >> HOMA_BPAGE_SHIFT;
+		struct homa_bpage *bpage;
+
+		if (bpage_index < pool->num_bpages) {
+			/* Check the index before computing the descriptor
+			 * address, so a bogus offset can't reference memory
+			 * outside the descriptor array.
+			 */
+			bpage = &pool->descriptors[bpage_index];
+			if (atomic_dec_return(&bpage->refs) == 0)
+				atomic_inc(&pool->free_bpages);
+		} else {
+			result = -EINVAL;
+		}
+	}
+	return result;
+}
+
+/**
+ * homa_pool_check_waiting() - Checks to see if there are enough free
+ * bpages to wake up any RPCs that were blocked. Whenever
+ * homa_pool_release_buffers is invoked, this function must be invoked later,
+ * at a point when the caller holds no locks (homa_pool_release_buffers may
+ * be invoked with locks held, so it can't safely invoke this function).
+ * This is regrettably tricky, but I can't think of a better solution.
+ * @pool:         Information about the buffer pool.
+ */
+void homa_pool_check_waiting(struct homa_pool *pool)
+{
+	if (!pool->region)
+		return;
+	while (atomic_read(&pool->free_bpages) >= pool->bpages_needed) {
+		struct homa_rpc *rpc;
+
+		homa_sock_lock(pool->hsk);
+		if (list_empty(&pool->hsk->waiting_for_bufs)) {
+			pool->bpages_needed = INT_MAX;
+			homa_sock_unlock(pool->hsk);
+			break;
+		}
+		rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+				       struct homa_rpc, buf_links);
+		if (!homa_rpc_try_lock(rpc)) {
+			/* Can't just spin on the RPC lock because we're
+			 * holding the socket lock and the lock order is
+			 * rpc-then-socket (see "Homa Locking Strategy" in
+			 * homa_impl.h). Instead, release the socket lock
+			 * and try the entire operation again.
+			 */
+			homa_sock_unlock(pool->hsk);
+			continue;
+		}
+		list_del_init(&rpc->buf_links);
+		if (list_empty(&pool->hsk->waiting_for_bufs))
+			pool->bpages_needed = INT_MAX;
+		else
+			set_bpages_needed(pool);
+		homa_sock_unlock(pool->hsk);
+		homa_pool_alloc_msg(rpc);
+		homa_rpc_unlock(rpc);
+	}
+}
diff --git a/net/homa/homa_pool.h b/net/homa/homa_pool.h
new file mode 100644
index 000000000000..15ba5c5d58b6
--- /dev/null
+++ b/net/homa/homa_pool.h
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+ */
+
+/* This file contains definitions used to manage user-space buffer pools.
+ */
+
+#ifndef _HOMA_POOL_H
+#define _HOMA_POOL_H
+
+#include <linux/percpu.h>
+
+#include "homa_rpc.h"
+
+/**
+ * struct homa_bpage - Contains information about a single page in
+ * a buffer pool.
+ */
+struct homa_bpage {
+	/** @lock: to synchronize shared access. */
+	spinlock_t lock;
+
+	/**
+	 * @refs: Counts the number of distinct uses of this
+	 * bpage (1 tick for each message that is using
+	 * this page, plus an additional tick if the @owner
+	 * field is set).
+	 */
+	atomic_t refs;
+
+	/**
+	 * @owner: kernel core that currently owns this page
+	 * (< 0 if none).
+	 */
+	int owner;
+
+	/**
+	 * @expiration: homa_clock() time after which it's OK to steal this
+	 * page from its current owner (if @refs is 1).
+	 */
+	u64 expiration;
+} ____cacheline_aligned_in_smp;
+
+/**
+ * struct homa_pool_core - Holds core-specific data for a homa_pool (a bpage
+ * out of which that core is allocating small chunks).
+ */
+struct homa_pool_core {
+	/**
+	 * @page_hint: Index of bpage in pool->descriptors,
+	 * which may be owned by this core. If so, we'll use it
+	 * for allocating partial pages.
+	 */
+	int page_hint;
+
+	/**
+	 * @allocated: if the page given by @page_hint is
+	 * owned by this core, this variable gives the number of
+	 * (initial) bytes that have already been allocated
+	 * from the page.
+	 */
+	int allocated;
+
+	/**
+	 * @next_candidate: when searching for free bpages,
+	 * check this index next.
+	 */
+	int next_candidate;
+};
+
+/**
+ * struct homa_pool - Describes a pool of buffer space for incoming
+ * messages for a particular socket; managed by homa_pool.c. The pool is
+ * divided up into "bpages", which are a multiple of the hardware page size.
+ * A bpage may be owned by a particular core so that it can more efficiently
+ * allocate space for small messages.
+ */
+struct homa_pool {
+	/**
+	 * @hsk: the socket that this pool belongs to.
+	 */
+	struct homa_sock *hsk;
+
+	/**
+	 * @region: beginning of the pool's region (in the app's virtual
+	 * memory). Divided into bpages. NULL means the pool hasn't yet been
+	 * initialized.
+	 */
+	char __user *region;
+
+	/** @num_bpages: total number of bpages in the pool. */
+	int num_bpages;
+
+	/** @descriptors: kmalloced area containing one entry for each bpage. */
+	struct homa_bpage *descriptors;
+
+	/**
+	 * @free_bpages: the number of pages still available for allocation
+	 * by homa_pool_get_pages. This equals the number of pages with zero
+	 * reference counts, minus the number of pages that have been claimed
+	 * by homa_pool_get_pages but not yet allocated.
+	 */
+	atomic_t free_bpages;
+
+	/**
+	 * @bpages_needed: the number of free bpages required to satisfy the
+	 * needs of the first RPC on @hsk->waiting_for_bufs, or INT_MAX if
+	 * that queue is empty.
+	 */
+	int bpages_needed;
+
+	/** @cores: core-specific info; dynamically allocated. */
+	struct homa_pool_core __percpu *cores;
+
+	/**
+	 * @check_waiting_invoked: incremented during unit tests when
+	 * homa_pool_check_waiting is invoked.
+	 */
+	int check_waiting_invoked;
+};
+
+bool     homa_bpage_available(struct homa_bpage *bpage, u64 now);
+struct   homa_pool *homa_pool_alloc(struct homa_sock *hsk);
+int      homa_pool_alloc_msg(struct homa_rpc *rpc);
+void     homa_pool_check_waiting(struct homa_pool *pool);
+void     homa_pool_free(struct homa_pool *pool);
+void __user *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
+				  int *available);
+int      homa_pool_get_pages(struct homa_pool *pool, int num_pages,
+			     u32 *pages, int set_owner);
+void     homa_pool_get_rcvbuf(struct homa_pool *pool,
+			      struct homa_rcvbuf_args *args);
+int      homa_pool_release_buffers(struct homa_pool *pool,
+				   int num_buffers, u32 *buffers);
+int      homa_pool_set_region(struct homa_sock *hsk, void __user *region,
+			      u64 region_size);
+
+#endif /* _HOMA_POOL_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 05/15] net: homa: create homa_peer.h and homa_peer.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (3 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 04/15] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-26  9:32   ` Paolo Abeni
  2025-08-18 20:55 ` [PATCH net-next v15 06/15] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

Homa needs to keep a small amount of information for each peer that
it has communicated with. These files define that state and provide
functions for storing and accessing it.

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v11:
* Clean up sparse annotations

Changes for v10:
* Use kzalloc instead of __GFP_ZERO
* Remove log messages after alloc errors
* Fix issues found by sparse, xmastree.py, etc.
* Add missing initialization for peertab->lock

Changes for v9:
* Add support for homa_net objects
* Implement limits on the number of active homa_peer objects. This includes
  adding reference counts in homa_peers and adding code to release peers
  where there are too many.
* Switch to using rhashtable to store homa_peers; the table is shared
  across all network namespaces, though individual peers are namespace-
  specific
* Invoke dst->ops->check in addition to checking the obsolete flag
* Various name improvements
* Remove the homa_peertab_gc_dsts mechanism, which is unnecessary

Changes for v7:
* Remove homa_peertab_get_peers
* Remove "lock_slow" functions, which don't add functionality in this
  patch
* Remove unused fields from homa_peer structs
* Use u64 and __u64 properly
* Add lock annotations
* Refactor homa_peertab_get_peers
* Use __GFP_ZERO in kmalloc calls
---
 net/homa/homa_peer.c | 595 +++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_peer.h | 373 +++++++++++++++++++++++++++
 2 files changed, 968 insertions(+)
 create mode 100644 net/homa/homa_peer.c
 create mode 100644 net/homa/homa_peer.h

diff --git a/net/homa/homa_peer.c b/net/homa/homa_peer.c
new file mode 100644
index 000000000000..a7970c566c64
--- /dev/null
+++ b/net/homa/homa_peer.c
@@ -0,0 +1,595 @@
+// SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+
+
+/* This file provides functions related to homa_peer and homa_peertab
+ * objects.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+
+static const struct rhashtable_params ht_params = {
+	.key_len     = sizeof(struct homa_peer_key),
+	.key_offset  = offsetof(struct homa_peer, ht_key),
+	.head_offset = offsetof(struct homa_peer, ht_linkage),
+	.nelem_hint = 10000,
+	.hashfn = homa_peer_hash,
+	.obj_cmpfn = homa_peer_compare
+};
+
+/**
+ * homa_peer_alloc_peertab() - Allocate and initialize a homa_peertab.
+ *
+ * Return:    A pointer to the new homa_peertab, or ERR_PTR(-errno) if there
+ *            was a problem.
+ */
+struct homa_peertab *homa_peer_alloc_peertab(void)
+{
+	struct homa_peertab *peertab;
+	int err;
+
+	peertab = kzalloc(sizeof(*peertab), GFP_KERNEL);
+	if (!peertab)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock_init(&peertab->lock);
+	err = rhashtable_init(&peertab->ht, &ht_params);
+	if (err) {
+		kfree(peertab);
+		return ERR_PTR(err);
+	}
+	peertab->ht_valid = true;
+	rhashtable_walk_enter(&peertab->ht, &peertab->ht_iter);
+	INIT_LIST_HEAD(&peertab->dead_peers);
+	peertab->gc_threshold = 5000;
+	peertab->net_max = 10000;
+	peertab->idle_secs_min = 10;
+	peertab->idle_secs_max = 120;
+
+	homa_peer_update_sysctl_deps(peertab);
+	return peertab;
+}
+
+/**
+ * homa_peer_free_net() - Garbage collect all of the peer information
+ * associated with a particular network namespace.
+ * @hnet:    Network namespace whose peers should be freed. There must not
+ *           be any active sockets or RPCs for this namespace.
+ */
+void homa_peer_free_net(struct homa_net *hnet)
+{
+	struct homa_peertab *peertab = hnet->homa->peertab;
+	struct rhashtable_iter iter;
+	struct homa_peer *peer;
+
+	spin_lock_bh(&peertab->lock);
+	peertab->gc_stop_count++;
+	spin_unlock_bh(&peertab->lock);
+
+	rhashtable_walk_enter(&peertab->ht, &iter);
+	rhashtable_walk_start(&iter);
+	while (1) {
+		peer = rhashtable_walk_next(&iter);
+		if (!peer)
+			break;
+		if (IS_ERR(peer))
+			continue;
+		if (peer->ht_key.hnet != hnet)
+			continue;
+		if (rhashtable_remove_fast(&peertab->ht, &peer->ht_linkage,
+					   ht_params) == 0) {
+			homa_peer_free(peer);
+			hnet->num_peers--;
+			peertab->num_peers--;
+		}
+	}
+	rhashtable_walk_stop(&iter);
+	rhashtable_walk_exit(&iter);
+	WARN(hnet->num_peers != 0, "%s ended up with hnet->num_peers %d",
+	     __func__, hnet->num_peers);
+
+	spin_lock_bh(&peertab->lock);
+	peertab->gc_stop_count--;
+	spin_unlock_bh(&peertab->lock);
+}
+
+/**
+ * homa_peer_free_fn() - This function is invoked for each entry in
+ * the peer hash table by the rhashtable code when the table is being
+ * deleted. It frees its argument.
+ * @object:     struct homa_peer to free.
+ * @dummy:      Not used.
+ */
+void homa_peer_free_fn(void *object, void *dummy)
+{
+	struct homa_peer *peer = object;
+
+	homa_peer_free(peer);
+}
+
+/**
+ * homa_peer_free_peertab() - Destructor for homa_peertabs. After this
+ * function returns, it is unsafe to use any results from previous calls
+ * to homa_peer_get, since all existing homa_peer objects will have been
+ * destroyed.
+ * @peertab:  The table to destroy.
+ */
+void homa_peer_free_peertab(struct homa_peertab *peertab)
+{
+	spin_lock_bh(&peertab->lock);
+	peertab->gc_stop_count++;
+	spin_unlock_bh(&peertab->lock);
+
+	if (peertab->ht_valid) {
+		rhashtable_walk_exit(&peertab->ht_iter);
+		rhashtable_free_and_destroy(&peertab->ht, homa_peer_free_fn,
+					    NULL);
+	}
+	while (!list_empty(&peertab->dead_peers))
+		homa_peer_free_dead(peertab);
+	kfree(peertab);
+}
+
+/**
+ * homa_peer_rcu_callback() - This function is invoked as the callback
+ * for an invocation of call_rcu. It clears the peertab's @call_rcu_pending
+ * flag to indicate that the RCU grace period has elapsed.
+ * @head:    Contains information used to locate the peertab.
+ */
+void homa_peer_rcu_callback(struct rcu_head *head)
+{
+	struct homa_peertab *peertab;
+
+	peertab = container_of(head, struct homa_peertab, rcu_head);
+	atomic_set(&peertab->call_rcu_pending, 0);
+}
+
+/**
+ * homa_peer_free_dead() - Release peers on peertab->dead_peers
+ * if possible.
+ * @peertab:    Check the dead peers here.
+ */
+void homa_peer_free_dead(struct homa_peertab *peertab)
+	__must_hold(peertab->lock)
+{
+	struct homa_peer *peer, *tmp;
+
+	/* A dead peer can be freed only if:
+	 * (a) there are no call_rcu calls pending (if there are, it's
+	 *     possible that a new reference might get created for the
+	 *     peer)
+	 * (b) the peer's reference count is zero.
+	 */
+	if (atomic_read(&peertab->call_rcu_pending))
+		return;
+	list_for_each_entry_safe(peer, tmp, &peertab->dead_peers, dead_links) {
+		if (atomic_read(&peer->refs) == 0) {
+			list_del_init(&peer->dead_links);
+			homa_peer_free(peer);
+		}
+	}
+}
+
+/**
+ * homa_peer_wait_dead() - Don't return until all of the dead peers have
+ * been freed.
+ * @peertab:    Overall information about peers, which includes a dead list.
+ */
+void homa_peer_wait_dead(struct homa_peertab *peertab)
+{
+	while (1) {
+		spin_lock_bh(&peertab->lock);
+		homa_peer_free_dead(peertab);
+		if (list_empty(&peertab->dead_peers)) {
+			spin_unlock_bh(&peertab->lock);
+			return;
+		}
+		spin_unlock_bh(&peertab->lock);
+	}
+}
+
+/**
+ * homa_peer_prefer_evict() - Given two peers, determine which one is
+ * a better candidate for eviction.
+ * @peertab:    Overall information used to manage peers.
+ * @peer1:      First peer.
+ * @peer2:      Second peer.
+ * Return:      True if @peer1 is a better candidate for eviction than @peer2.
+ */
+int homa_peer_prefer_evict(struct homa_peertab *peertab,
+			   struct homa_peer *peer1,
+			   struct homa_peer *peer2)
+{
+	/* Prefer a peer whose homa-net is over its limit; if both are either
+	 * over or under, then prefer the peer with the shortest idle time.
+	 */
+	if (peer1->ht_key.hnet->num_peers > peertab->net_max) {
+		if (peer2->ht_key.hnet->num_peers <= peertab->net_max)
+			return true;
+		else
+			return peer1->access_jiffies < peer2->access_jiffies;
+	}
+	if (peer2->ht_key.hnet->num_peers > peertab->net_max)
+		return false;
+	else
+		return peer1->access_jiffies < peer2->access_jiffies;
+}
+
+/**
+ * homa_peer_pick_victims() - Select a few peers that can be freed.
+ * @peertab:      Choose peers that are stored here.
+ * @victims:      Return addresses of victims here.
+ * @max_victims:  Limit on how many victims to choose (and size of @victims
+ *                array).
+ * Return:        The number of peers stored in @victims; may be zero.
+ */
+int homa_peer_pick_victims(struct homa_peertab *peertab,
+			   struct homa_peer *victims[], int max_victims)
+{
+	struct homa_peer *peer;
+	int num_victims = 0;
+	int to_scan;
+	int i, idle;
+
+	/* Scan 2 peers for every potential victim and keep the "best"
+	 * peers for removal.
+	 */
+	rhashtable_walk_start(&peertab->ht_iter);
+	for (to_scan = 2 * max_victims; to_scan > 0; to_scan--) {
+		peer = rhashtable_walk_next(&peertab->ht_iter);
+		if (!peer) {
+			/* Reached the end of the table; restart at
+			 * the beginning.
+			 */
+			rhashtable_walk_stop(&peertab->ht_iter);
+			rhashtable_walk_exit(&peertab->ht_iter);
+			rhashtable_walk_enter(&peertab->ht, &peertab->ht_iter);
+			rhashtable_walk_start(&peertab->ht_iter);
+			peer = rhashtable_walk_next(&peertab->ht_iter);
+			if (!peer)
+				break;
+		}
+		if (IS_ERR(peer)) {
+			/* rhashtable decided to restart the search at the
+			 * beginning.
+			 */
+			peer = rhashtable_walk_next(&peertab->ht_iter);
+			if (!peer || IS_ERR(peer))
+				break;
+		}
+
+		/* Has this peer been idle long enough to be a candidate for
+		 * eviction?
+		 */
+		idle = jiffies - peer->access_jiffies;
+		if (idle < peertab->idle_jiffies_min)
+			continue;
+		if (idle < peertab->idle_jiffies_max &&
+		    peer->ht_key.hnet->num_peers <= peertab->net_max)
+			continue;
+
+		/* Sort the candidate into the existing list of victims. */
+		for (i = 0; i < num_victims; i++) {
+			if (peer == victims[i]) {
+				/* This can happen if there aren't very many
+				 * peers and we wrapped around in the hash
+				 * table.
+				 */
+				peer = NULL;
+				break;
+			}
+			if (homa_peer_prefer_evict(peertab, peer, victims[i])) {
+				struct homa_peer *tmp;
+
+				tmp = victims[i];
+				victims[i] = peer;
+				peer = tmp;
+			}
+		}
+
+		if (num_victims < max_victims && peer) {
+			victims[num_victims] = peer;
+			num_victims++;
+		}
+	}
+	rhashtable_walk_stop(&peertab->ht_iter);
+	return num_victims;
+}
+
+/**
+ * homa_peer_gc() - This function is invoked by Homa at regular intervals;
+ * its job is to ensure that the number of peers stays within limits.
+ * If the number grows too large, it selectively deletes peers to get
+ * back under the limit.
+ * @peertab:   Structure whose peers should be considered for garbage
+ *             collection.
+ */
+void homa_peer_gc(struct homa_peertab *peertab)
+{
+#define EVICT_BATCH_SIZE 5
+	struct homa_peer *victims[EVICT_BATCH_SIZE];
+	int num_victims;
+	int i;
+
+	spin_lock_bh(&peertab->lock);
+	if (peertab->gc_stop_count != 0)
+		goto done;
+	if (!list_empty(&peertab->dead_peers))
+		homa_peer_free_dead(peertab);
+	if (atomic_read(&peertab->call_rcu_pending) ||
+	    peertab->num_peers < peertab->gc_threshold)
+		goto done;
+	num_victims = homa_peer_pick_victims(peertab, victims,
+					     EVICT_BATCH_SIZE);
+	if (num_victims == 0)
+		goto done;
+
+	for (i = 0; i < num_victims; i++) {
+		struct homa_peer *peer = victims[i];
+
+		if (rhashtable_remove_fast(&peertab->ht, &peer->ht_linkage,
+					   ht_params) == 0) {
+			list_add_tail(&peer->dead_links, &peertab->dead_peers);
+			peertab->num_peers--;
+			peer->ht_key.hnet->num_peers--;
+		}
+	}
+	atomic_set(&peertab->call_rcu_pending, 1);
+	call_rcu(&peertab->rcu_head, homa_peer_rcu_callback);
+done:
+	spin_unlock_bh(&peertab->lock);
+}
+
+/**
+ * homa_peer_alloc() - Allocate and initialize a new homa_peer object.
+ * @hsk:        Socket for which the peer will be used.
+ * @addr:       Address of the desired host: IPv4 addresses are represented
+ *              as IPv4-mapped IPv6 addresses.
+ * Return:      The peer associated with @addr, or a negative errno if an
+ *              error occurred. On a successful return the reference count
+ *              will be incremented for the returned peer.
+ */
+struct homa_peer *homa_peer_alloc(struct homa_sock *hsk,
+				  const struct in6_addr *addr)
+{
+	struct homa_peer *peer;
+	struct dst_entry *dst;
+
+	peer = kzalloc(sizeof(*peer), GFP_ATOMIC);
+	if (!peer)
+		return (struct homa_peer *)ERR_PTR(-ENOMEM);
+	peer->ht_key.addr = *addr;
+	peer->ht_key.hnet = hsk->hnet;
+	INIT_LIST_HEAD(&peer->dead_links);
+	atomic_set(&peer->refs, 1);
+	peer->access_jiffies = jiffies;
+	peer->addr = *addr;
+	dst = homa_peer_get_dst(peer, hsk);
+	if (IS_ERR(dst)) {
+		kfree(peer);
+		return (struct homa_peer *)dst;
+	}
+	peer->dst = dst;
+	peer->current_ticks = -1;
+	spin_lock_init(&peer->ack_lock);
+	return peer;
+}
+
+/**
+ * homa_peer_free() - Release any resources in a peer and free the homa_peer
+ * struct.
+ * @peer:       Structure to free. Must not currently be linked into
+ *              peertab->ht.
+ */
+void homa_peer_free(struct homa_peer *peer)
+{
+	dst_release(peer->dst);
+
+	if (atomic_read(&peer->refs) == 0)
+		kfree(peer);
+	else
+		WARN(1, "%s found peer with reference count %d",
+		     __func__, atomic_read(&peer->refs));
+}
+
+/**
+ * homa_peer_get() - Returns the peer associated with a given host; creates
+ * a new homa_peer if one doesn't already exist.
+ * @hsk:        Socket where the peer will be used.
+ * @addr:       Address of the desired host: IPv4 addresses are represented
+ *              as IPv4-mapped IPv6 addresses.
+ *
+ * Return:      The peer associated with @addr, or a negative errno if an
+ *              error occurred. On a successful return the reference count
+ *              will be incremented for the returned peer. The caller must
+ *              eventually call homa_peer_release to release the reference.
+ */
+struct homa_peer *homa_peer_get(struct homa_sock *hsk,
+				const struct in6_addr *addr)
+{
+	struct homa_peertab *peertab = hsk->homa->peertab;
+	struct homa_peer *peer, *other;
+	struct homa_peer_key key;
+
+	key.addr = *addr;
+	key.hnet = hsk->hnet;
+	rcu_read_lock();
+	peer = rhashtable_lookup(&peertab->ht, &key, ht_params);
+	if (peer) {
+		homa_peer_hold(peer);
+		peer->access_jiffies = jiffies;
+		rcu_read_unlock();
+		return peer;
+	}
+
+	/* No existing entry, so we have to create a new one. */
+	peer = homa_peer_alloc(hsk, addr);
+	if (IS_ERR(peer)) {
+		rcu_read_unlock();
+		return peer;
+	}
+	spin_lock_bh(&peertab->lock);
+	other = rhashtable_lookup_get_insert_fast(&peertab->ht,
+						  &peer->ht_linkage, ht_params);
+	if (IS_ERR(other)) {
+		/* Couldn't insert; return the error info. */
+		homa_peer_release(peer);
+		homa_peer_free(peer);
+		peer = other;
+	} else if (other) {
+		/* Someone else already created the desired peer; use that
+		 * one instead of ours.
+		 */
+		homa_peer_release(peer);
+		homa_peer_free(peer);
+		peer = other;
+		homa_peer_hold(peer);
+		peer->access_jiffies = jiffies;
+	} else {
+		peertab->num_peers++;
+		key.hnet->num_peers++;
+	}
+	spin_unlock_bh(&peertab->lock);
+	rcu_read_unlock();
+	return peer;
+}
+
+/**
+ * homa_dst_refresh() - This method is called when the dst for a peer is
+ * obsolete; it releases that dst and creates a new one.
+ * @peertab:  Table containing the peer.
+ * @peer:     Peer whose dst is obsolete.
+ * @hsk:      Socket that will be used to transmit data to the peer.
+ */
+void homa_dst_refresh(struct homa_peertab *peertab, struct homa_peer *peer,
+		      struct homa_sock *hsk)
+{
+	struct dst_entry *dst;
+
+	dst = homa_peer_get_dst(peer, hsk);
+	if (IS_ERR(dst))
+		return;
+	dst_release(peer->dst);
+	peer->dst = dst;
+}
+
+/**
+ * homa_peer_get_dst() - Find an appropriate dst structure (either IPv4
+ * or IPv6) for a peer.
+ * @peer:   The peer for which a dst is needed. Note: this peer's flow
+ *          struct will be overwritten.
+ * @hsk:    Socket that will be used for sending packets.
+ * Return:  The dst structure (or an ERR_PTR); a reference has been taken.
+ */
+struct dst_entry *homa_peer_get_dst(struct homa_peer *peer,
+				    struct homa_sock *hsk)
+{
+	memset(&peer->flow, 0, sizeof(peer->flow));
+	if (hsk->sock.sk_family == AF_INET) {
+		struct rtable *rt;
+
+		flowi4_init_output(&peer->flow.u.ip4, hsk->sock.sk_bound_dev_if,
+				   hsk->sock.sk_mark, hsk->inet.tos,
+				   RT_SCOPE_UNIVERSE, hsk->sock.sk_protocol, 0,
+				   peer->addr.in6_u.u6_addr32[3],
+				   hsk->inet.inet_saddr, 0, 0,
+				   hsk->sock.sk_uid);
+		security_sk_classify_flow(&hsk->sock,
+					  &peer->flow.u.__fl_common);
+		rt = ip_route_output_flow(sock_net(&hsk->sock),
+					  &peer->flow.u.ip4, &hsk->sock);
+		if (IS_ERR(rt))
+			return ERR_CAST(rt);
+		return &rt->dst;
+	}
+	peer->flow.u.ip6.flowi6_oif = hsk->sock.sk_bound_dev_if;
+	peer->flow.u.ip6.flowi6_iif = LOOPBACK_IFINDEX;
+	peer->flow.u.ip6.flowi6_mark = hsk->sock.sk_mark;
+	peer->flow.u.ip6.flowi6_scope = RT_SCOPE_UNIVERSE;
+	peer->flow.u.ip6.flowi6_proto = hsk->sock.sk_protocol;
+	peer->flow.u.ip6.flowi6_flags = 0;
+	peer->flow.u.ip6.flowi6_secid = 0;
+	peer->flow.u.ip6.flowi6_tun_key.tun_id = 0;
+	peer->flow.u.ip6.flowi6_uid = hsk->sock.sk_uid;
+	peer->flow.u.ip6.daddr = peer->addr;
+	peer->flow.u.ip6.saddr = hsk->inet.pinet6->saddr;
+	peer->flow.u.ip6.fl6_dport = 0;
+	peer->flow.u.ip6.fl6_sport = 0;
+	peer->flow.u.ip6.mp_hash = 0;
+	peer->flow.u.ip6.__fl_common.flowic_tos = hsk->inet.tos;
+	peer->flow.u.ip6.flowlabel = ip6_make_flowinfo(hsk->inet.tos, 0);
+	security_sk_classify_flow(&hsk->sock, &peer->flow.u.__fl_common);
+	return ip6_dst_lookup_flow(sock_net(&hsk->sock), &hsk->sock,
+			&peer->flow.u.ip6, NULL);
+}
+
+/**
+ * homa_peer_add_ack() - Add a given RPC to the list of unacked
+ * RPCs for its server. Once this method has been invoked, it's safe
+ * to delete the RPC, since it will eventually be acked to the server.
+ * @rpc:    Client RPC that has now completed.
+ */
+void homa_peer_add_ack(struct homa_rpc *rpc)
+{
+	struct homa_peer *peer = rpc->peer;
+	struct homa_ack_hdr ack;
+
+	homa_peer_lock(peer);
+	if (peer->num_acks < HOMA_MAX_ACKS_PER_PKT) {
+		peer->acks[peer->num_acks].client_id = cpu_to_be64(rpc->id);
+		peer->acks[peer->num_acks].server_port = htons(rpc->dport);
+		peer->num_acks++;
+		homa_peer_unlock(peer);
+		return;
+	}
+
+	/* The peer has filled up; send an ACK message to empty it. The
+	 * RPC in the message header will also be considered ACKed.
+	 */
+	memcpy(ack.acks, peer->acks, sizeof(peer->acks));
+	ack.num_acks = htons(peer->num_acks);
+	peer->num_acks = 0;
+	homa_peer_unlock(peer);
+	homa_xmit_control(ACK, &ack, sizeof(ack), rpc);
+}
+
+/**
+ * homa_peer_get_acks() - Copy acks out of a peer, and remove them from the
+ * peer.
+ * @peer:    Peer to check for possible unacked RPCs.
+ * @count:   Maximum number of acks to return.
+ * @dst:     The acks are copied to this location.
+ *
+ * Return:   The number of acks extracted from the peer (<= count).
+ */
+int homa_peer_get_acks(struct homa_peer *peer, int count, struct homa_ack *dst)
+{
+	/* Don't waste time acquiring the lock if there are no ids available. */
+	if (peer->num_acks == 0)
+		return 0;
+
+	homa_peer_lock(peer);
+
+	if (count > peer->num_acks)
+		count = peer->num_acks;
+	memcpy(dst, &peer->acks[peer->num_acks - count],
+	       count * sizeof(peer->acks[0]));
+	peer->num_acks -= count;
+
+	homa_peer_unlock(peer);
+	return count;
+}
+
+/**
+ * homa_peer_update_sysctl_deps() - Update any peertab fields that depend
+ * on values set by sysctl. This function is invoked anytime a peer sysctl
+ * value is updated.
+ * @peertab:   Struct to update.
+ */
+void homa_peer_update_sysctl_deps(struct homa_peertab *peertab)
+{
+	peertab->idle_jiffies_min = peertab->idle_secs_min * HZ;
+	peertab->idle_jiffies_max = peertab->idle_secs_max * HZ;
+}
+
diff --git a/net/homa/homa_peer.h b/net/homa/homa_peer.h
new file mode 100644
index 000000000000..ae54e4875b1c
--- /dev/null
+++ b/net/homa/homa_peer.h
@@ -0,0 +1,373 @@
+/* SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+ */
+
+/* This file contains definitions related to managing peers (homa_peer
+ * and homa_peertab).
+ */
+
+#ifndef _HOMA_PEER_H
+#define _HOMA_PEER_H
+
+#include "homa_wire.h"
+#include "homa_sock.h"
+
+#include <linux/rhashtable.h>
+
+struct homa_rpc;
+
+/**
+ * struct homa_peertab - Stores homa_peer objects, indexed by IPv6
+ * address.
+ */
+struct homa_peertab {
+	/**
+	 * @lock: Used to synchronize updates to @ht as well as other
+	 * operations on this object.
+	 */
+	spinlock_t lock;
+
+	/** @ht: Hash table that stores all homa_peer structs. */
+	struct rhashtable ht;
+
+	/** @ht_iter: Used to scan ht to find peers to garbage collect. */
+	struct rhashtable_iter ht_iter;
+
+	/** @num_peers: Total number of peers currently in @ht. */
+	int num_peers;
+
+	/**
+	 * @ht_valid: True means ht and ht_iter have been initialized and must
+	 * eventually be destroyed.
+	 */
+	bool ht_valid;
+
+	/**
+	 * @dead_peers: List of peers that have been removed from ht
+	 * but can't yet be freed (because they have nonzero reference
+	 * counts or an rcu sync point hasn't been reached).
+	 */
+	struct list_head dead_peers;
+
+	/** @rcu_head: Holds state of a pending call_rcu invocation. */
+	struct rcu_head rcu_head;
+
+	/**
+	 * @call_rcu_pending: Nonzero means that call_rcu has been
+	 * invoked but it has not invoked the callback function; until the
+	 * callback has been invoked we can't free peers on dead_peers or
+	 * invoke call_rcu again (which means we can't add more peers to
+	 * dead_peers).
+	 */
+	atomic_t call_rcu_pending;
+
+	/**
+	 * @gc_stop_count: Nonzero means that peer garbage collection
+	 * should not be performed (conflicting state changes are underway).
+	 */
+	int gc_stop_count;
+
+	/**
+	 * @gc_threshold: If @num_peers is less than this, don't bother
+	 * doing any peer garbage collection. Set externally via sysctl.
+	 */
+	int gc_threshold;
+
+	/**
+	 * @net_max: If the number of peers for a homa_net exceeds this number,
+	 * work aggressively to reclaim peers for that homa_net. Set
+	 * externally via sysctl.
+	 */
+	int net_max;
+
+	/**
+	 * @idle_secs_min: A peer will not be considered for garbage collection
+	 * under any circumstances if it has been idle less than this many
+	 * seconds. Set externally via sysctl.
+	 */
+	int idle_secs_min;
+
+	/**
+	 * @idle_jiffies_min: Same as idle_secs_min except in units
+	 * of jiffies.
+	 */
+	unsigned long idle_jiffies_min;
+
+	/**
+	 * @idle_secs_max: A peer that has been idle for less than
+	 * this many seconds will not be considered for garbage collection
+	 * unless its homa_net has more than @net_max peers. Set
+	 * externally via sysctl.
+	 */
+	int idle_secs_max;
+
+	/**
+	 * @idle_jiffies_max: Same as idle_secs_max except in units
+	 * of jiffies.
+	 */
+	unsigned long idle_jiffies_max;
+
+};
+
+/**
+ * struct homa_peer_key - Used to look up homa_peer structs in an rhashtable.
+ */
+struct homa_peer_key {
+	/**
+	 * @addr: Address of the desired host. IPv4 addresses are represented
+	 * with IPv4-mapped IPv6 addresses.
+	 */
+	struct in6_addr addr;
+
+	/** @hnet: The network namespace in which this peer is valid. */
+	struct homa_net *hnet;
+};
+
+/**
+ * struct homa_peer - One of these objects exists for each machine that we
+ * have communicated with (either as client or server).
+ */
+struct homa_peer {
+	/** @ht_key: The hash table key for this peer in peertab->ht. */
+	struct homa_peer_key ht_key;
+
+	/**
+	 * @ht_linkage: Used by the rhashtable implementation to link this
+	 * peer into peertab->ht.
+	 */
+	struct rhash_head ht_linkage;
+
+	/** @dead_links: Used to link this peer into peertab->dead_peers. */
+	struct list_head dead_links;
+
+	/**
+	 * @refs: Number of unmatched calls to homa_peer_hold; it's not safe
+	 * to free this object until the reference count is zero.
+	 */
+	atomic_t refs ____cacheline_aligned_in_smp;
+
+	/**
+	 * @access_jiffies: Time in jiffies of most recent access to this
+	 * peer.
+	 */
+	unsigned long access_jiffies;
+
+	/**
+	 * @addr: IPv6 address for the machine (IPv4 addresses are stored
+	 * as IPv4-mapped IPv6 addresses).
+	 */
+	struct in6_addr addr ____cacheline_aligned_in_smp;
+
+	/** @flow: Addressing info needed to send packets. */
+	struct flowi flow;
+
+	/**
+	 * @dst: Used to route packets to this peer; we own a reference
+	 * to this, which we must eventually release.
+	 */
+	struct dst_entry *dst;
+
+	/**
+	 * @outstanding_resends: the number of resend requests we have
+	 * sent to this server (spaced @homa.resend_interval apart) since
+	 * we received a packet from this peer.
+	 */
+	int outstanding_resends;
+
+	/**
+	 * @most_recent_resend: @homa->timer_ticks when the most recent
+	 * resend was sent to this peer.
+	 */
+	int most_recent_resend;
+
+	/**
+	 * @least_recent_rpc: of all the RPCs for this peer scanned at
+	 * @current_ticks, this is the RPC whose @resend_timer_ticks
+	 * is farthest in the past.
+	 */
+	struct homa_rpc *least_recent_rpc;
+
+	/**
+	 * @least_recent_ticks: the @resend_timer_ticks value for
+	 * @least_recent_rpc.
+	 */
+	u32 least_recent_ticks;
+
+	/**
+	 * @current_ticks: the value of @homa->timer_ticks the last time
+	 * that @least_recent_rpc and @least_recent_ticks were computed.
+	 * Used to detect the start of a new homa_timer pass.
+	 */
+	u32 current_ticks;
+
+	/**
+	 * @resend_rpc: the value of @least_recent_rpc computed in the
+	 * previous homa_timer pass. This RPC will be issued a RESEND
+	 * in the current pass, if it still needs one.
+	 */
+	struct homa_rpc *resend_rpc;
+
+	/**
+	 * @num_acks: the number of (initial) entries in @acks that
+	 * currently hold valid information.
+	 */
+	int num_acks;
+
+	/**
+	 * @acks: info about client RPCs whose results have been completely
+	 * received.
+	 */
+	struct homa_ack acks[HOMA_MAX_ACKS_PER_PKT];
+
+	/**
+	 * @ack_lock: used to synchronize access to @num_acks and @acks.
+	 */
+	spinlock_t ack_lock;
+};
+
+void     homa_dst_refresh(struct homa_peertab *peertab,
+			  struct homa_peer *peer, struct homa_sock *hsk);
+void     homa_peer_add_ack(struct homa_rpc *rpc);
+struct homa_peer
+	*homa_peer_alloc(struct homa_sock *hsk, const struct in6_addr *addr);
+struct homa_peertab
+	*homa_peer_alloc_peertab(void);
+int      homa_peer_dointvec(const struct ctl_table *table, int write,
+			    void *buffer, size_t *lenp, loff_t *ppos);
+void     homa_peer_free(struct homa_peer *peer);
+void     homa_peer_free_dead(struct homa_peertab *peertab);
+void     homa_peer_free_fn(void *object, void *dummy);
+void     homa_peer_free_net(struct homa_net *hnet);
+void     homa_peer_free_peertab(struct homa_peertab *peertab);
+void     homa_peer_gc(struct homa_peertab *peertab);
+struct homa_peer
+	*homa_peer_get(struct homa_sock *hsk, const struct in6_addr *addr);
+int      homa_peer_get_acks(struct homa_peer *peer, int count,
+			    struct homa_ack *dst);
+struct dst_entry
+	*homa_peer_get_dst(struct homa_peer *peer, struct homa_sock *hsk);
+int      homa_peer_pick_victims(struct homa_peertab *peertab,
+				struct homa_peer *victims[], int max_victims);
+int      homa_peer_prefer_evict(struct homa_peertab *peertab,
+				struct homa_peer *peer1,
+				struct homa_peer *peer2);
+void     homa_peer_rcu_callback(struct rcu_head *head);
+void     homa_peer_wait_dead(struct homa_peertab *peertab);
+void     homa_peer_update_sysctl_deps(struct homa_peertab *peertab);
+
+/**
+ * homa_peer_lock() - Acquire a peer's @ack_lock.
+ * @peer:    Peer to lock.
+ */
+static inline void homa_peer_lock(struct homa_peer *peer)
+	__acquires(peer->ack_lock)
+{
+	spin_lock_bh(&peer->ack_lock);
+}
+
+/**
+ * homa_peer_unlock() - Release a peer's @ack_lock.
+ * @peer:   Peer to unlock.
+ */
+static inline void homa_peer_unlock(struct homa_peer *peer)
+	__releases(peer->ack_lock)
+{
+	spin_unlock_bh(&peer->ack_lock);
+}
+
+/**
+ * homa_get_dst() - Returns destination information associated with a peer,
+ * updating it if the cached information is stale.
+ * @peer:   Peer whose destination information is desired.
+ * @hsk:    Homa socket; needed by lower-level code to recreate the dst.
+ * Return:  Up-to-date destination for peer; a reference has been taken
+ *          on this dst_entry, which the caller must eventually release.
+ */
+static inline struct dst_entry *homa_get_dst(struct homa_peer *peer,
+					     struct homa_sock *hsk)
+{
+	if (unlikely(peer->dst->obsolete &&
+		     !peer->dst->ops->check(peer->dst, 0)))
+		homa_dst_refresh(hsk->homa->peertab, peer, hsk);
+	dst_hold(peer->dst);
+	return peer->dst;
+}
+
+/**
+ * homa_peer_hold() - Increment the reference count on a peer, which will
+ * prevent it from being freed until homa_peer_release() is called.
+ * @peer:      Object on which to take a reference.
+ */
+static inline void homa_peer_hold(struct homa_peer *peer)
+{
+	atomic_inc(&peer->refs);
+}
+
+/**
+ * homa_peer_release() - Release a reference on a peer (cancels the effect of
+ * a previous call to homa_peer_hold). If the reference count becomes zero
+ * then the peer may be deleted at any time.
+ * @peer:      Object to release.
+ */
+static inline void homa_peer_release(struct homa_peer *peer)
+{
+	atomic_dec(&peer->refs);
+}
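+
+/* A typical usage pattern for peer references (an illustrative sketch,
+ * not code from this series):
+ *
+ *	peer = homa_peer_get(hsk, &addr);
+ *	if (IS_ERR(peer))
+ *		return PTR_ERR(peer);
+ *	...transmit to the peer, consult peer->dst, etc...
+ *	homa_peer_release(peer);
+ */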
+
+/**
+ * homa_peer_hash() - Hash function used for @peertab->ht.
+ * @data:    Pointer to key for which a hash is desired. Must actually
+ *           be a struct homa_peer_key.
+ * @dummy:   Not used
+ * @seed:    Seed for the hash.
+ * Return:   A 32-bit hash value for the given key.
+ */
+static inline u32 homa_peer_hash(const void *data, u32 dummy, u32 seed)
+{
+	/* This is MurmurHash3, used instead of the jhash default because it
+	 * is faster (25 ns vs. 40 ns as of May 2025).
+	 */
+	BUILD_BUG_ON(sizeof(struct homa_peer_key) & 0x3);
+	const u32 len = sizeof(struct homa_peer_key) >> 2;
+	const u32 c1 = 0xcc9e2d51;
+	const u32 c2 = 0x1b873593;
+	const u32 *key = data;
+	u32 h = seed;
+
+	for (size_t i = 0; i < len; i++) {
+		u32 k = key[i];
+
+		k *= c1;
+		k = (k << 15) | (k >> (32 - 15));
+		k *= c2;
+
+		h ^= k;
+		h = (h << 13) | (h >> (32 - 13));
+		h = h * 5 + 0xe6546b64;
+	}
+
+	h ^= len * 4;  /* total number of input bytes */
+
+	h ^= h >> 16;
+	h *= 0x85ebca6b;
+	h ^= h >> 13;
+	h *= 0xc2b2ae35;
+	h ^= h >> 16;
+	return h;
+}
+
+/**
+ * homa_peer_compare() - Comparison function for entries in @peertab->ht.
+ * @arg:   Contains one of the keys to compare.
+ * @obj:   homa_peer object containing the other key to compare.
+ * Return: 0 means the keys match, 1 means mismatch.
+ */
+static inline int homa_peer_compare(struct rhashtable_compare_arg *arg,
+				    const void *obj)
+{
+	const struct homa_peer_key *key = arg->key;
+	const struct homa_peer *peer = obj;
+
+	return !(ipv6_addr_equal(&key->addr, &peer->ht_key.addr) &&
+		 peer->ht_key.hnet == key->hnet);
+}
+
+#endif /* _HOMA_PEER_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 06/15] net: homa: create homa_sock.h and homa_sock.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (4 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 05/15] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-26 10:10   ` Paolo Abeni
  2025-08-18 20:55 ` [PATCH net-next v15 07/15] net: homa: create homa_interest.h and homa_interest.c John Ousterhout
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

These files provide functions for managing the state that Homa keeps
for each open Homa socket.
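
For illustration, a minimal sketch of the socket-table scan API added
by this patch (the same pattern homa_socktab_destroy uses below; the
reference-counting rules are spelled out in the kernel-doc comments):

	struct homa_socktab_scan scan;
	struct homa_sock *hsk;

	for (hsk = homa_socktab_start_scan(socktab, &scan); hsk;
	     hsk = homa_socktab_next(&scan)) {
		/* Use hsk; a reference is held on it, so it cannot
		 * be reclaimed during this iteration.
		 */
	}
	homa_socktab_end_scan(&scan);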

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v11:
* Clean up sparse annotations

Changes for v10:
* Revise sparse annotations to eliminate __context__ definition
* Replace __u16 with u16, __u8 with u8, etc.
* Use the destroy function from struct proto properly (fixes races in
  socket cleanup)

Changes for v9:
* Add support for homa_net objects; there is now a single socket table shared
  across all network namespaces
* Set SOCK_RCU_FREE in homa_sock_init, not homa_sock_shutdown
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory)

Changes for v8:
* Update for new homa_pool APIs

Changes for v7:
* Refactor homa_sock_start_scan etc. (take a reference on the socket, so
  homa_socktab::active_scans and struct homa_socktab_links are no longer
  needed; encapsulate RCU usage entirely in homa_sock.c).
* Add functions for tx memory accounting
* Refactor waiting mechanism for incoming messages
* Add hsk->is_server, setsockopt SO_HOMA_SERVER
* Remove "lock_slow" functions, which don't add functionality in this
  patch series
* Remove locker argument from locking functions
* Use u64 and __u64 properly
* Take a reference to the socket in homa_sock_find
---
 net/homa/homa_sock.c | 432 +++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_sock.h | 408 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 840 insertions(+)
 create mode 100644 net/homa/homa_sock.c
 create mode 100644 net/homa/homa_sock.h

diff --git a/net/homa/homa_sock.c b/net/homa/homa_sock.c
new file mode 100644
index 000000000000..f3e2ec0be64e
--- /dev/null
+++ b/net/homa/homa_sock.c
@@ -0,0 +1,432 @@
+// SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+
+
+/* This file manages homa_sock and homa_socktab objects. */
+
+#include "homa_impl.h"
+#include "homa_interest.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/**
+ * homa_socktab_init() - Constructor for homa_socktabs.
+ * @socktab:  The object to initialize; previous contents are discarded.
+ */
+void homa_socktab_init(struct homa_socktab *socktab)
+{
+	int i;
+
+	spin_lock_init(&socktab->write_lock);
+	for (i = 0; i < HOMA_SOCKTAB_BUCKETS; i++)
+		INIT_HLIST_HEAD(&socktab->buckets[i]);
+}
+
+/**
+ * homa_socktab_destroy() - Destructor for homa_socktabs: deletes all
+ * existing sockets.
+ * @socktab:  The object to destroy.
+ * @hnet:     If non-NULL, only sockets for this namespace are deleted.
+ */
+void homa_socktab_destroy(struct homa_socktab *socktab, struct homa_net *hnet)
+{
+	struct homa_socktab_scan scan;
+	struct homa_sock *hsk;
+
+	for (hsk = homa_socktab_start_scan(socktab, &scan); hsk;
+			hsk = homa_socktab_next(&scan)) {
+		if (hnet && hnet != hsk->hnet)
+			continue;
+
+		/* In actual use there should be no sockets left when this
+		 * function is invoked, so the code below should never
+		 * execute. However, it is useful during unit tests.
+		 */
+		homa_sock_shutdown(hsk);
+		homa_sock_destroy(&hsk->sock);
+	}
+	homa_socktab_end_scan(&scan);
+}
+
+/**
+ * homa_socktab_start_scan() - Begin an iteration over all of the sockets
+ * in a socktab.
+ * @socktab:   Socktab to scan.
+ * @scan:      Will hold the current state of the scan; any existing
+ *             contents are discarded. The caller must eventually pass this
+ *             to homa_socktab_end_scan.
+ *
+ * Return:     The first socket in the table, or NULL if the table is
+ *             empty. If non-NULL, a reference is held on the socket to
+ *             prevent its deletion.
+ *
+ * Each call to homa_socktab_next will return the next socket in the table.
+ * All sockets that are present in the table at the time this function is
+ * invoked will eventually be returned, as long as they are not removed
+ * from the table. It is safe to remove sockets from the table while the
+ * scan is in progress. If a socket is removed from the table during the scan,
+ * it may or may not be returned by homa_socktab_next. New entries added
+ * during the scan may or may not be returned.
+ */
+struct homa_sock *homa_socktab_start_scan(struct homa_socktab *socktab,
+					  struct homa_socktab_scan *scan)
+{
+	scan->socktab = socktab;
+	scan->hsk = NULL;
+	scan->current_bucket = -1;
+
+	return homa_socktab_next(scan);
+}
+
+/**
+ * homa_socktab_next() - Return the next socket in an iteration over a socktab.
+ * @scan:      State of the scan.
+ *
+ * Return:     The next socket in the table, or NULL if the iteration has
+ *             returned all of the sockets in the table.  If non-NULL, a
+ *             reference is held on the socket to prevent its deletion.
+ *             Sockets are not returned in any particular order. It's
+ *             possible that the returned socket has been destroyed.
+ */
+struct homa_sock *homa_socktab_next(struct homa_socktab_scan *scan)
+{
+	struct hlist_head *bucket;
+	struct hlist_node *next;
+
+	rcu_read_lock();
+	if (scan->hsk) {
+		sock_put(&scan->hsk->sock);
+		next = rcu_dereference(hlist_next_rcu(&scan->hsk->socktab_links));
+		if (next)
+			goto success;
+	}
+	for (scan->current_bucket++;
+	     scan->current_bucket < HOMA_SOCKTAB_BUCKETS;
+	     scan->current_bucket++) {
+		bucket = &scan->socktab->buckets[scan->current_bucket];
+		next = rcu_dereference(hlist_first_rcu(bucket));
+		if (next)
+			goto success;
+	}
+	scan->hsk = NULL;
+	rcu_read_unlock();
+	return NULL;
+
+success:
+	scan->hsk = hlist_entry(next, struct homa_sock, socktab_links);
+	sock_hold(&scan->hsk->sock);
+	rcu_read_unlock();
+	return scan->hsk;
+}
+
+/**
+ * homa_socktab_end_scan() - Must be invoked on completion of each scan
+ * to clean up state associated with the scan.
+ * @scan:      State of the scan.
+ */
+void homa_socktab_end_scan(struct homa_socktab_scan *scan)
+{
+	if (scan->hsk) {
+		sock_put(&scan->hsk->sock);
+		scan->hsk = NULL;
+	}
+}
+
+/**
+ * homa_sock_init() - Constructor for homa_sock objects. This function
+ * initializes only the parts of the socket that are owned by Homa.
+ * @hsk:    Object to initialize. The Homa-specific parts must have been
+ *          initialized to zeroes by the caller.
+ *
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_sock_init(struct homa_sock *hsk)
+{
+	struct homa_pool *buffer_pool;
+	struct homa_socktab *socktab;
+	struct homa_sock *other;
+	struct homa_net *hnet;
+	struct homa *homa;
+	int starting_port;
+	int result = 0;
+	int i;
+
+	hnet = (struct homa_net *)net_generic(sock_net(&hsk->sock),
+					      homa_net_id);
+	homa = hnet->homa;
+	socktab = homa->socktab;
+
+	/* Initialize fields outside the Homa part. */
+	hsk->sock.sk_sndbuf = homa->wmem_max;
+	sock_set_flag(&hsk->inet.sk, SOCK_RCU_FREE);
+
+	/* Do things requiring memory allocation before locking the socket,
+	 * so that GFP_ATOMIC is not needed.
+	 */
+	buffer_pool = homa_pool_alloc(hsk);
+	if (IS_ERR(buffer_pool))
+		return PTR_ERR(buffer_pool);
+
+	/* Initialize Homa-specific fields. */
+	hsk->homa = homa;
+	hsk->hnet = hnet;
+	hsk->buffer_pool = buffer_pool;
+
+	/* Pick a default port. Must keep the socktab locked from now
+	 * until the new socket is added to the socktab, to ensure that
+	 * no other socket chooses the same port.
+	 */
+	spin_lock_bh(&socktab->write_lock);
+	starting_port = hnet->prev_default_port;
+	while (1) {
+		hnet->prev_default_port++;
+		if (hnet->prev_default_port < HOMA_MIN_DEFAULT_PORT)
+			hnet->prev_default_port = HOMA_MIN_DEFAULT_PORT;
+		other = homa_sock_find(hnet, hnet->prev_default_port);
+		if (!other)
+			break;
+		sock_put(&other->sock);
+		if (hnet->prev_default_port == starting_port) {
+			spin_unlock_bh(&socktab->write_lock);
+			hsk->shutdown = true;
+			hsk->homa = NULL;
+			result = -EADDRNOTAVAIL;
+			goto error;
+		}
+	}
+	hsk->port = hnet->prev_default_port;
+	hsk->inet.inet_num = hsk->port;
+	hsk->inet.inet_sport = htons(hsk->port);
+
+	hsk->is_server = false;
+	hsk->shutdown = false;
+	hsk->ip_header_length = (hsk->inet.sk.sk_family == AF_INET) ?
+				sizeof(struct iphdr) : sizeof(struct ipv6hdr);
+	spin_lock_init(&hsk->lock);
+	atomic_set(&hsk->protect_count, 0);
+	INIT_LIST_HEAD(&hsk->active_rpcs);
+	INIT_LIST_HEAD(&hsk->dead_rpcs);
+	hsk->dead_skbs = 0;
+	INIT_LIST_HEAD(&hsk->waiting_for_bufs);
+	INIT_LIST_HEAD(&hsk->ready_rpcs);
+	INIT_LIST_HEAD(&hsk->interests);
+	for (i = 0; i < HOMA_CLIENT_RPC_BUCKETS; i++) {
+		struct homa_rpc_bucket *bucket = &hsk->client_rpc_buckets[i];
+
+		spin_lock_init(&bucket->lock);
+		bucket->id = i;
+		INIT_HLIST_HEAD(&bucket->rpcs);
+	}
+	for (i = 0; i < HOMA_SERVER_RPC_BUCKETS; i++) {
+		struct homa_rpc_bucket *bucket = &hsk->server_rpc_buckets[i];
+
+		spin_lock_init(&bucket->lock);
+		bucket->id = i + 1000000;
+		INIT_HLIST_HEAD(&bucket->rpcs);
+	}
+	hlist_add_head_rcu(&hsk->socktab_links,
+			   &socktab->buckets[homa_socktab_bucket(hnet,
+								 hsk->port)]);
+	spin_unlock_bh(&socktab->write_lock);
+	return result;
+
+error:
+	homa_pool_free(buffer_pool);
+	return result;
+}
+
+/**
+ * homa_sock_unlink() - Unlinks a socket from its socktab and does
+ * related cleanups. Once this method returns, the socket will not be
+ * discoverable through the socktab.
+ * @hsk:  Socket to unlink.
+ */
+void homa_sock_unlink(struct homa_sock *hsk)
+{
+	struct homa_socktab *socktab = hsk->homa->socktab;
+
+	spin_lock_bh(&socktab->write_lock);
+	hlist_del_rcu(&hsk->socktab_links);
+	spin_unlock_bh(&socktab->write_lock);
+}
+
+/**
+ * homa_sock_shutdown() - Disable a socket so that it can no longer
+ * be used for either sending or receiving messages. Any system calls
+ * currently waiting to send or receive messages will be aborted. This
+ * function will terminate any existing use of the socket, but it does
+ * not free up socket resources: that happens in homa_sock_destroy.
+ * @hsk:       Socket to shut down.
+ */
+void homa_sock_shutdown(struct homa_sock *hsk)
+{
+	struct homa_interest *interest;
+	struct homa_rpc *rpc;
+
+	homa_sock_lock(hsk);
+	if (hsk->shutdown || !hsk->homa) {
+		homa_sock_unlock(hsk);
+		return;
+	}
+
+	/* The order of cleanup is very important, because there could be
+	 * active operations that hold RPC locks but not the socket lock.
+	 * 1. Set @shutdown; this ensures that no new RPCs will be created for
+	 *    this socket (though some creations might already be in progress).
+	 * 2. Remove the socket from its socktab: this ensures that
+	 *    incoming packets for the socket will be dropped.
+	 * 3. Go through all of the RPCs and delete them; this will
+	 *    synchronize with any operations in progress.
+	 * 4. Perform other socket cleanup: at this point we know that
+	 *    there will be no concurrent activities on individual RPCs.
+	 * 5. Don't delete the buffer pool until after all of the RPCs
+	 *    have been reaped.
+	 * See "Homa Locking Strategy" in homa_impl.h for additional information
+	 * about locking.
+	 */
+	hsk->shutdown = true;
+	homa_sock_unlink(hsk);
+	homa_sock_unlock(hsk);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+		homa_rpc_lock(rpc);
+		homa_rpc_end(rpc);
+		homa_rpc_unlock(rpc);
+	}
+	rcu_read_unlock();
+
+	homa_sock_lock(hsk);
+	while (!list_empty(&hsk->interests)) {
+		interest = list_first_entry(&hsk->interests,
+					    struct homa_interest, links);
+		list_del_init(&interest->links);
+		atomic_set_release(&interest->ready, 1);
+		wake_up(&interest->wait_queue);
+	}
+	homa_sock_unlock(hsk);
+}
+
+/**
+ * homa_sock_destroy() - Release all of the internal resources associated
+ * with a socket; is invoked at time when that is safe (i.e., all references
+ * on the socket have been dropped).
+ * @sk:       Socket to destroy.
+ */
+void homa_sock_destroy(struct sock *sk)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+
+	if (!hsk->homa)
+		return;
+
+	while (!list_empty(&hsk->dead_rpcs))
+		homa_rpc_reap(hsk, true);
+
+	WARN_ON_ONCE(refcount_read(&hsk->sock.sk_wmem_alloc) != 1);
+
+	if (hsk->buffer_pool) {
+		homa_pool_free(hsk->buffer_pool);
+		hsk->buffer_pool = NULL;
+	}
+}
+
+/**
+ * homa_sock_bind() - Associates a server port with a socket; if there
+ * was a previous server port assignment for @hsk, it is abandoned.
+ * @hnet:      Network namespace with which port is associated.
+ * @hsk:       Homa socket.
+ * @port:      Desired server port for @hsk. If 0, then this call
+ *             becomes a no-op: the socket will continue to use
+ *             its randomly assigned client port.
+ *
+ * Return:  0 for success, otherwise a negative errno.
+ */
+int homa_sock_bind(struct homa_net *hnet, struct homa_sock *hsk,
+		   u16 port)
+{
+	struct homa_socktab *socktab = hnet->homa->socktab;
+	struct homa_sock *owner;
+	int result = 0;
+
+	if (port == 0)
+		return result;
+	if (port >= HOMA_MIN_DEFAULT_PORT)
+		return -EINVAL;
+	homa_sock_lock(hsk);
+	spin_lock_bh(&socktab->write_lock);
+	if (hsk->shutdown) {
+		result = -ESHUTDOWN;
+		goto done;
+	}
+
+	owner = homa_sock_find(hnet, port);
+	if (owner) {
+		sock_put(&owner->sock);
+		if (owner != hsk)
+			result = -EADDRINUSE;
+		goto done;
+	}
+	hlist_del_rcu(&hsk->socktab_links);
+	hsk->port = port;
+	hsk->inet.inet_num = port;
+	hsk->inet.inet_sport = htons(hsk->port);
+	hlist_add_head_rcu(&hsk->socktab_links,
+			   &socktab->buckets[homa_socktab_bucket(hnet, port)]);
+	hsk->is_server = true;
+done:
+	spin_unlock_bh(&socktab->write_lock);
+	homa_sock_unlock(hsk);
+	return result;
+}
+
+/**
+ * homa_sock_find() - Returns the socket associated with a given port.
+ * @hnet:       Network namespace where the socket will be used.
+ * @port:       The port of interest.
+ * Return:      The socket that owns @port, or NULL if none. If non-NULL
+ *              then this method has taken a reference on the socket and
+ *              the caller must call sock_put to release it.
+ */
+struct homa_sock *homa_sock_find(struct homa_net *hnet, u16 port)
+{
+	int bucket = homa_socktab_bucket(hnet, port);
+	struct homa_sock *result = NULL;
+	struct homa_sock *hsk;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(hsk, &hnet->homa->socktab->buckets[bucket],
+				 socktab_links) {
+		if (hsk->port == port && hsk->hnet == hnet) {
+			result = hsk;
+			sock_hold(&hsk->sock);
+			break;
+		}
+	}
+	rcu_read_unlock();
+	return result;
+}
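+
+/* Sketch of the expected lookup pattern (illustrative only); a non-NULL
+ * result carries a reference that the caller owns:
+ *
+ *	hsk = homa_sock_find(hnet, port);
+ *	if (hsk) {
+ *		... use hsk ...
+ *		sock_put(&hsk->sock);
+ *	}
+ */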
+
+/**
+ * homa_sock_wait_wmem() - Block the thread until @hsk's usage of tx
+ * packet memory drops below the socket's limit.
+ * @hsk:          Socket of interest.
+ * @nonblocking:  If there's not enough memory, return -EWOULDBLOCK instead
+ *                of blocking.
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_sock_wait_wmem(struct homa_sock *hsk, int nonblocking)
+{
+	long timeo = hsk->sock.sk_sndtimeo;
+	int result;
+
+	if (nonblocking)
+		timeo = 0;
+	set_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
+	result = wait_event_interruptible_timeout(*sk_sleep(&hsk->sock),
+				homa_sock_wmem_avl(hsk) || hsk->shutdown,
+				timeo);
+	if (signal_pending(current))
+		return -EINTR;
+	if (result == 0)
+		return -EWOULDBLOCK;
+	return 0;
+}
diff --git a/net/homa/homa_sock.h b/net/homa/homa_sock.h
new file mode 100644
index 000000000000..1f649c1da628
--- /dev/null
+++ b/net/homa/homa_sock.h
@@ -0,0 +1,408 @@
+/* SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+ */
+
+/* This file defines structs and other things related to Homa sockets.  */
+
+#ifndef _HOMA_SOCK_H
+#define _HOMA_SOCK_H
+
+/* Forward declarations. */
+struct homa;
+struct homa_pool;
+
+/* Number of hash buckets in a homa_socktab. Must be a power of 2. */
+#define HOMA_SOCKTAB_BUCKET_BITS 10
+#define HOMA_SOCKTAB_BUCKETS BIT(HOMA_SOCKTAB_BUCKET_BITS)
+
+/**
+ * struct homa_socktab - A hash table that maps from port numbers (either
+ * client or server) to homa_sock objects.
+ *
+ * This table is managed exclusively by homa_socktab.c, using RCU to
+ * minimize synchronization during lookups.
+ */
+struct homa_socktab {
+	/**
+	 * @write_lock: Controls all modifications to this object; not needed
+	 * for socket lookups (RCU is used instead). Also used to
+	 * synchronize port allocation.
+	 */
+	spinlock_t write_lock;
+
+	/**
+	 * @buckets: Heads of chains for hash table buckets. Chains
+	 * consist of homa_sock objects.
+	 */
+	struct hlist_head buckets[HOMA_SOCKTAB_BUCKETS];
+};
+
+/**
+ * struct homa_socktab_scan - Records the state of an iteration over all
+ * the entries in a homa_socktab, in a way that is safe against concurrent
+ * reclamation of sockets.
+ */
+struct homa_socktab_scan {
+	/** @socktab: The table that is being scanned. */
+	struct homa_socktab *socktab;
+
+	/**
+	 * @hsk: Points to the current socket in the iteration, or NULL if
+	 * we're at the beginning or end of the iteration. If non-NULL then
+	 * we are holding a reference to this socket.
+	 */
+	struct homa_sock *hsk;
+
+	/**
+	 * @current_bucket: The index of the bucket in socktab->buckets
+	 * currently being scanned (-1 if @hsk == NULL).
+	 */
+	int current_bucket;
+};
+
+/**
+ * struct homa_rpc_bucket - One bucket in a hash table of RPCs.
+ */
+struct homa_rpc_bucket {
+	/**
+	 * @lock: serves as a lock both for this bucket (e.g., when
+	 * adding and removing RPCs) and also for all of the RPCs in
+	 * the bucket. Must be held whenever looking up an RPC in
+	 * this bucket or manipulating an RPC in the bucket. This approach
+	 * has the following properties:
+	 * 1. An RPC can be looked up and locked (a common operation) with
+	 *    a single lock acquisition.
+	 * 2. Looking up and locking are atomic: there is no window of
+	 *    vulnerability where someone else could delete an RPC after
+	 *    it has been looked up and before it has been locked.
+	 * 3. The lookup mechanism does not use RCU.  This is important because
+	 *    RPCs are created rapidly and typically live only a few tens of
+	 *    microseconds.  As of May 2025, RCU introduces a lag of about
+	 *    25 ms before objects can be deleted; for RPCs this would result
+	 *    in hundreds or thousands of RPCs accumulating before RCU allows
+	 *    them to be deleted.
+	 * This approach has the disadvantage that RPCs within a bucket share
+	 * locks and thus may not be able to work concurrently, but there are
+	 * enough buckets in the table to make such collisions rare.
+	 *
+	 * See "Homa Locking Strategy" in homa_impl.h for more info about
+	 * locking.
+	 */
+	spinlock_t lock;
+
+	/**
+	 * @id: identifier for this bucket, used in error messages etc.
+	 * It's the index of the bucket within its hash table bucket
+	 * array, with an additional offset to separate server and
+	 * client RPCs.
+	 */
+	int id;
+
+	/** @rpcs: list of RPCs that hash to this bucket. */
+	struct hlist_head rpcs;
+};
+
+/**
+ * define HOMA_CLIENT_RPC_BUCKETS - Number of buckets in hash tables for
+ * client RPCs. Must be a power of 2.
+ */
+#define HOMA_CLIENT_RPC_BUCKETS 1024
+
+/**
+ * define HOMA_SERVER_RPC_BUCKETS - Number of buckets in hash tables for
+ * server RPCs. Must be a power of 2.
+ */
+#define HOMA_SERVER_RPC_BUCKETS 1024
+
+/**
+ * struct homa_sock - Information about an open socket.
+ */
+struct homa_sock {
+	/* Info for other network layers. Note: IPv6 info (struct ipv6_pinfo)
+	 * comes at the very end of the struct, *after* Homa's data, if this
+	 * socket uses IPv6.
+	 */
+	union {
+		/** @sock: generic socket data; must be the first field. */
+		struct sock sock;
+
+		/**
+		 * @inet: generic Internet socket data; must also be the
+		 * first field (contains sock as its first member).
+		 */
+		struct inet_sock inet;
+	};
+
+	/**
+	 * @homa: Overall state about the Homa implementation. NULL
+	 * means this socket was never initialized or has been deleted.
+	 */
+	struct homa *homa;
+
+	/**
+	 * @hnet: Overall state specific to the network namespace for
+	 * this socket.
+	 */
+	struct homa_net *hnet;
+
+	/**
+	 * @buffer_pool: used to allocate buffer space for incoming messages.
+	 * Storage is dynamically allocated.
+	 */
+	struct homa_pool *buffer_pool;
+
+	/**
+	 * @port: Port number: identifies this socket uniquely among all
+	 * those on this node.
+	 */
+	u16 port;
+
+	/**
+	 * @is_server: True means that this socket can act as both client
+	 * and server; false means the socket is client-only.
+	 */
+	bool is_server;
+
+	/**
+	 * @shutdown: True means the socket is no longer usable (either
+	 * shutdown has already been invoked, or the socket was never
+	 * properly initialized).
+	 */
+	bool shutdown;
+
+	/**
+	 * @ip_header_length: Length of IP headers for this socket (depends
+	 * on IPv4 vs. IPv6).
+	 */
+	int ip_header_length;
+
+	/** @socktab_links: Links this socket into a homa_socktab bucket. */
+	struct hlist_node socktab_links;
+
+	/* Information above is (almost) never modified; start a new
+	 * cache line below for info that is modified frequently.
+	 */
+
+	/**
+	 * @lock: Must be held when modifying fields such as interests
+	 * and lists of RPCs. This lock is used in place of sk->sk_lock
+	 * because it's used differently (it's always used as a simple
+	 * spin lock).  See "Homa Locking Strategy" in homa_impl.h
+	 * for more on Homa's synchronization strategy.
+	 */
+	spinlock_t lock ____cacheline_aligned_in_smp;
+
+	/**
+	 * @protect_count: counts the number of calls to homa_protect_rpcs
+	 * for which there have not yet been calls to homa_unprotect_rpcs.
+	 */
+	atomic_t protect_count;
+
+	/**
+	 * @active_rpcs: List of all existing RPCs related to this socket,
+	 * including both client and server RPCs. This list isn't strictly
+	 * needed, since RPCs are already in one of the hash tables below,
+	 * but it's more efficient for homa_timer to have this list
+	 * (so it doesn't have to scan large numbers of hash buckets).
+	 * The list is sorted, with the oldest RPC first. Manipulate with
+	 * RCU so timer can access without locking.
+	 */
+	struct list_head active_rpcs;
+
+	/**
+	 * @dead_rpcs: Contains RPCs for which homa_rpc_end has been
+	 * called, but their packet buffers haven't yet been freed.
+	 */
+	struct list_head dead_rpcs;
+
+	/** @dead_skbs: Total number of socket buffers in RPCs on dead_rpcs. */
+	int dead_skbs;
+
+	/**
+	 * @waiting_for_bufs: Contains RPCs that are blocked because there
+	 * wasn't enough space in the buffer pool region for their incoming
+	 * messages. Sorted in increasing order of message length.
+	 */
+	struct list_head waiting_for_bufs;
+
+	/**
+	 * @ready_rpcs: List of all RPCs that are ready for attention from
+	 * an application thread.
+	 */
+	struct list_head ready_rpcs;
+
+	/**
+	 * @interests: List of threads that are currently waiting for
+	 * incoming messages via homa_wait_shared.
+	 */
+	struct list_head interests;
+
+	/**
+	 * @client_rpc_buckets: Hash table for fast lookup of client RPCs.
+	 * Modifications are synchronized with bucket locks, not
+	 * the socket lock.
+	 */
+	struct homa_rpc_bucket client_rpc_buckets[HOMA_CLIENT_RPC_BUCKETS];
+
+	/**
+	 * @server_rpc_buckets: Hash table for fast lookup of server RPCs.
+	 * Modifications are synchronized with bucket locks, not
+	 * the socket lock.
+	 */
+	struct homa_rpc_bucket server_rpc_buckets[HOMA_SERVER_RPC_BUCKETS];
+};
+
+/**
+ * struct homa_v6_sock - For IPv6, additional IPv6-specific information
+ * is present in the socket struct after Homa-specific information.
+ */
+struct homa_v6_sock {
+	/** @homa: All socket info except for IPv6-specific stuff. */
+	struct homa_sock homa;
+
+	/** @inet6: Socket info specific to IPv6. */
+	struct ipv6_pinfo inet6;
+};
+
+int                homa_sock_bind(struct homa_net *hnet, struct homa_sock *hsk,
+				  u16 port);
+void               homa_sock_destroy(struct sock *sk);
+struct homa_sock  *homa_sock_find(struct homa_net *hnet, u16 port);
+int                homa_sock_init(struct homa_sock *hsk);
+void               homa_sock_shutdown(struct homa_sock *hsk);
+void               homa_sock_unlink(struct homa_sock *hsk);
+int                homa_sock_wait_wmem(struct homa_sock *hsk, int nonblocking);
+void               homa_socktab_destroy(struct homa_socktab *socktab,
+					struct homa_net *hnet);
+void               homa_socktab_end_scan(struct homa_socktab_scan *scan);
+void               homa_socktab_init(struct homa_socktab *socktab);
+struct homa_sock  *homa_socktab_next(struct homa_socktab_scan *scan);
+struct homa_sock  *homa_socktab_start_scan(struct homa_socktab *socktab,
+					   struct homa_socktab_scan *scan);
+
+/**
+ * homa_sock_lock() - Acquire the lock for a socket.
+ * @hsk:     Socket to lock.
+ */
+static inline void homa_sock_lock(struct homa_sock *hsk)
+	__acquires(hsk->lock)
+{
+	spin_lock_bh(&hsk->lock);
+}
+
+/**
+ * homa_sock_unlock() - Release the lock for a socket.
+ * @hsk:   Socket to unlock.
+ */
+static inline void homa_sock_unlock(struct homa_sock *hsk)
+	__releases(hsk->lock)
+{
+	spin_unlock_bh(&hsk->lock);
+}
+
+/**
+ * homa_socktab_bucket() - Compute the bucket number in a homa_socktab
+ * that will contain a particular socket.
+ * @hnet:   Network namespace of the desired socket.
+ * @port:   Port number of the socket.
+ *
+ * Return:  The index of the bucket in which a socket matching @hnet and
+ *          @port will be found (if it exists).
+ */
+static inline int homa_socktab_bucket(struct homa_net *hnet, u16 port)
+{
+	return hash_32((uintptr_t)hnet ^ port, HOMA_SOCKTAB_BUCKET_BITS);
+}
+
+/**
+ * homa_client_rpc_bucket() - Find the bucket containing a given
+ * client RPC.
+ * @hsk:      Socket associated with the RPC.
+ * @id:       Id of the desired RPC.
+ *
+ * Return:    The bucket in which this RPC will appear, if the RPC exists.
+ */
+static inline struct homa_rpc_bucket
+		*homa_client_rpc_bucket(struct homa_sock *hsk, u64 id)
+{
+	/* We can use a really simple hash function here because RPC ids
+	 * are allocated sequentially.
+	 */
+	return &hsk->client_rpc_buckets[(id >> 1)
+			& (HOMA_CLIENT_RPC_BUCKETS - 1)];
+}
+
+/**
+ * homa_server_rpc_bucket() - Find the bucket containing a given
+ * server RPC.
+ * @hsk:         Socket associated with the RPC.
+ * @id:          Id of the desired RPC.
+ *
+ * Return:    The bucket in which this RPC will appear, if the RPC exists.
+ */
+static inline struct homa_rpc_bucket
+		*homa_server_rpc_bucket(struct homa_sock *hsk, u64 id)
+{
+	/* Each client allocates RPC ids sequentially, so they will
+	 * naturally distribute themselves across the hash space.
+	 * Thus we can use the id directly as hash.
+	 */
+	return &hsk->server_rpc_buckets[(id >> 1)
+			& (HOMA_SERVER_RPC_BUCKETS - 1)];
+}
+
+/**
+ * homa_bucket_lock() - Acquire the lock for an RPC hash table bucket.
+ * @bucket:    Bucket to lock.
+ * @id:        Id of the RPC on whose behalf the bucket is being locked.
+ *             Used only for metrics.
+ */
+static inline void homa_bucket_lock(struct homa_rpc_bucket *bucket, u64 id)
+	__acquires(bucket->lock)
+{
+	spin_lock_bh(&bucket->lock);
+}
+
+/**
+ * homa_bucket_unlock() - Release the lock for an RPC hash table bucket.
+ * @bucket:   Bucket to unlock.
+ * @id:       ID of the RPC that was using the lock.
+ */
+static inline void homa_bucket_unlock(struct homa_rpc_bucket *bucket, u64 id)
+	__releases(bucket->lock)
+{
+	spin_unlock_bh(&bucket->lock);
+}
+
+static inline struct homa_sock *homa_sk(const struct sock *sk)
+{
+	return (struct homa_sock *)sk;
+}
+
+/**
+ * homa_sock_wmem_avl() - Returns true if the socket is within its limit
+ * for output memory usage. False means that no new messages should be sent
+ * until memory is freed.
+ * @hsk:   Socket of interest.
+ * Return: See above.
+ */
+static inline bool homa_sock_wmem_avl(struct homa_sock *hsk)
+{
+	return refcount_read(&hsk->sock.sk_wmem_alloc) < hsk->sock.sk_sndbuf;
+}
+
+/**
+ * homa_sock_wakeup_wmem() - Invoked when tx packet memory has been freed;
+ * if memory usage is below the limit and there are tasks waiting for memory,
+ * wake them up.
+ * @hsk:   Socket of interest.
+ */
+static inline void homa_sock_wakeup_wmem(struct homa_sock *hsk)
+{
+	if (test_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags) &&
+	    homa_sock_wmem_avl(hsk)) {
+		clear_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
+		wake_up_interruptible_poll(sk_sleep(&hsk->sock), EPOLLOUT);
+	}
+}
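+
+/* Sketch of how the tx-memory helpers pair up (illustrative only).
+ * A sender blocks until memory is available:
+ *
+ *	if (!homa_sock_wmem_avl(hsk))
+ *		err = homa_sock_wait_wmem(hsk, nonblocking);
+ *
+ * and the path that frees tx skbs then wakes any waiters:
+ *
+ *	homa_sock_wakeup_wmem(hsk);
+ */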
+
+#endif /* _HOMA_SOCK_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 07/15] net: homa: create homa_interest.h and homa_interest.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (5 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 06/15] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-18 20:55 ` [PATCH net-next v15 08/15] net: homa: create homa_pacer.h and homa_pacer.c John Ousterhout
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

These files implement the homa_interest struct, which is used to
wait for incoming messages.
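
As a rough sketch (using only functions from this patch; error paths
elided), a thread that waits for any incoming message on a socket
proceeds as follows:

	struct homa_interest interest;
	int err;

	homa_sock_lock(hsk);
	homa_interest_init_shared(&interest, hsk);
	homa_sock_unlock(hsk);

	err = homa_interest_wait(&interest);

	homa_sock_lock(hsk);
	homa_interest_unlink_shared(&interest);
	homa_sock_unlock(hsk);

	/* Even if err != 0 the interest may have become ready
	 * concurrently, so interest.ready must still be checked.
	 */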

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v14:
* Fix race in homa_wait_shared (an RPC could get lost if it became
  ready at the same time that homa_interest_wait returned with an error)
* Remove nonblocking parameter from homa_interest_wait (handle this elsewhere)

Changes for v11:
* Clean up sparse annotations

Changes for v10: none

Changes for v9:
* Remove unused field homa_interest->core
---
 net/homa/homa_interest.c | 114 +++++++++++++++++++++++++++++++++++++++
 net/homa/homa_interest.h |  93 ++++++++++++++++++++++++++++++++
 2 files changed, 207 insertions(+)
 create mode 100644 net/homa/homa_interest.c
 create mode 100644 net/homa/homa_interest.h

diff --git a/net/homa/homa_interest.c b/net/homa/homa_interest.c
new file mode 100644
index 000000000000..6daeedd21309
--- /dev/null
+++ b/net/homa/homa_interest.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+
+
+/* This file contains functions for managing homa_interest structs. */
+
+#include "homa_impl.h"
+#include "homa_interest.h"
+#include "homa_rpc.h"
+#include "homa_sock.h"
+
+/**
+ * homa_interest_init_shared() - Initialize an interest and queue it up on
+ * a socket.
+ * @interest:  Interest to initialize
+ * @hsk:       Socket on which the interests should be queued. Must be locked
+ *             by caller.
+ */
+void homa_interest_init_shared(struct homa_interest *interest,
+			       struct homa_sock *hsk)
+	__must_hold(hsk->lock)
+{
+	interest->rpc = NULL;
+	atomic_set(&interest->ready, 0);
+	interest->blocked = 0;
+	init_waitqueue_head(&interest->wait_queue);
+	interest->hsk = hsk;
+	list_add(&interest->links, &hsk->interests);
+}
+
+/**
+ * homa_interest_init_private() - Initialize an interest that will wait
+ * on a particular (private) RPC, and link it to that RPC.
+ * @interest:   Interest to initialize.
+ * @rpc:        RPC to associate with the interest. Must be private, and
+ *              caller must have locked it.
+ *
+ * Return:      0 for success, otherwise a negative errno.
+ */
+int homa_interest_init_private(struct homa_interest *interest,
+			       struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	if (rpc->private_interest)
+		return -EINVAL;
+
+	interest->rpc = rpc;
+	atomic_set(&interest->ready, 0);
+	interest->blocked = 0;
+	init_waitqueue_head(&interest->wait_queue);
+	interest->hsk = rpc->hsk;
+	rpc->private_interest = interest;
+	return 0;
+}
+
+/**
+ * homa_interest_wait() - Wait for an interest to have an actionable RPC,
+ * or for an error to occur.
+ * @interest:     Interest to wait for; must previously have been initialized
+ *                and linked to a socket or RPC. On return, the interest
+ *                will have been unlinked if its ready flag is set; otherwise
+ *                it may still be linked.
+ *
+ * Return: 0 for success (the ready flag is set in the interest), or -EINTR
+ * if the thread received an interrupt.
+ */
+int homa_interest_wait(struct homa_interest *interest)
+{
+	struct homa_sock *hsk = interest->hsk;
+	int result = 0;
+	int iteration;
+	int wait_err;
+
+	interest->blocked = 0;
+
+	/* This loop iterates in order to poll and/or reap dead RPCS. */
+	for (iteration = 0; ; iteration++) {
+		if (iteration != 0)
+			/* Give NAPI/SoftIRQ tasks a chance to run. */
+			schedule();
+
+		if (atomic_read_acquire(&interest->ready) != 0)
+			goto done;
+
+		/* See if we can cleanup dead RPCs while waiting. */
+		if (homa_rpc_reap(hsk, false) != 0)
+			continue;
+
+		break;
+	}
+
+	interest->blocked = 1;
+	wait_err = wait_event_interruptible_exclusive(interest->wait_queue,
+			atomic_read_acquire(&interest->ready) != 0);
+	if (wait_err == -ERESTARTSYS)
+		result = -EINTR;
+
+done:
+	return result;
+}
+
+/**
+ * homa_interest_notify_private() - If a thread is waiting on the private
+ * interest for an RPC, wake it up.
+ * @rpc:      RPC that may (potentially) have a private interest. Must be
+ *            locked by the caller.
+ */
+void homa_interest_notify_private(struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	if (rpc->private_interest) {
+		atomic_set_release(&rpc->private_interest->ready, 1);
+		wake_up(&rpc->private_interest->wait_queue);
+	}
+}
diff --git a/net/homa/homa_interest.h b/net/homa/homa_interest.h
new file mode 100644
index 000000000000..d9f932960fd8
--- /dev/null
+++ b/net/homa/homa_interest.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+ */
+
+/* This file defines struct homa_interest and related functions.  */
+
+#ifndef _HOMA_INTEREST_H
+#define _HOMA_INTEREST_H
+
+#include "homa_rpc.h"
+#include "homa_sock.h"
+
+/**
+ * struct homa_interest - Holds info that allows applications to wait for
+ * incoming RPC messages. An interest can be either private, in which case
+ * the application is waiting for a single specific RPC response and the
+ * interest is referenced by an rpc->private_interest, or shared, in which
+ * case the application is waiting for any incoming message that isn't
+ * private and the interest is present on hsk->interests.
+ */
+struct homa_interest {
+	/**
+	 * @rpc: If ready is set, then this holds an RPC that needs
+	 * attention, or NULL if this is a shared interest and hsk has
+	 * been shutdown. If ready is not set, this will be NULL if the
+	 * interest is shared; if it's private, it holds the RPC the
+	 * interest is associated with. If non-NULL, a reference has been
+	 * taken on the RPC.
+	 */
+	struct homa_rpc *rpc;
+
+	/**
+	 * @ready: Nonzero means the interest is ready for attention: either
+	 * there is an RPC that needs attention or @hsk has been shutdown.
+	 */
+	atomic_t ready;
+
+	/**
+	 * @blocked: Zero means a handoff was received without the thread
+	 * needing to block; nonzero means the thread blocked.
+	 */
+	int blocked;
+
+	/**
+	 * @wait_queue: Used to block the thread while waiting (will never
+	 * have more than one queued thread).
+	 */
+	struct wait_queue_head wait_queue;
+
+	/** @hsk: Socket that the interest is associated with. */
+	struct homa_sock *hsk;
+
+	/**
+	 * @links: If the interest is shared, used to link this object into
+	 * @hsk->interests.
+	 */
+	struct list_head links;
+};
+
+/**
+ * homa_interest_unlink_shared() - Remove an interest from the list for a
+ * socket. Note: this can race with homa_rpc_handoff, so on return it's
+ * possible that the interest is ready.
+ * @interest:    Interest to remove. Must have been initialized with
+ *               homa_interest_init_shared.
+ */
+static inline void homa_interest_unlink_shared(struct homa_interest *interest)
+	__must_hold(interest->hsk->lock)
+{
+	list_del_init(&interest->links);
+}
+
+/**
+ * homa_interest_unlink_private() - Detach a private interest from its
+ * RPC. Note: this can race with homa_rpc_handoff, so on return it's
+ * possible that the interest is ready.
+ * @interest:    Interest to remove. Must have been initialized with
+ *               homa_interest_init_private. Its RPC must be locked by
+ *               the caller.
+ */
+static inline void homa_interest_unlink_private(struct homa_interest *interest)
+	__must_hold(interest->rpc->bucket->lock)
+{
+	if (interest == interest->rpc->private_interest)
+		interest->rpc->private_interest = NULL;
+}
+
+void     homa_interest_init_shared(struct homa_interest *interest,
+				   struct homa_sock *hsk);
+int      homa_interest_init_private(struct homa_interest *interest,
+				    struct homa_rpc *rpc);
+void     homa_interest_notify_private(struct homa_rpc *rpc);
+int      homa_interest_wait(struct homa_interest *interest);
+
+#endif /* _HOMA_INTEREST_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 08/15] net: homa: create homa_pacer.h and homa_pacer.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (6 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 07/15] net: homa: create homa_interest.h and homa_interest.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-26 10:53   ` Paolo Abeni
  2025-08-18 20:55 ` [PATCH net-next v15 09/15] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

These files provide facilities to pace packet output in order to prevent
queue buildup in the NIC. This functionality is needed to implement SRPT
on output, so short messages don't get stuck in long NIC queues. Note: the
pacer eventually needs to be replaced with a Homa-specific qdisc, which can
better manage simultaneous transmissions by Homa and TCP. The current
implementation can coexist with TCP and doesn't harm TCP, but
Homa's latency suffers when TCP runs concurrently.
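
As a simplified sketch of the pacing arithmetic (ignoring the cmpxchg
retry loop and the HOMA_FLAG_DONT_THROTTLE escape hatch), the pacer
maintains link_idle_time, an estimate of when the NIC will finish
transmitting everything queued so far, and declines to queue a packet
when the estimated backlog exceeds max_nic_queue_cycles:

	cycles = (cycles_per_mbyte * wire_bytes) / 1000000;
	if (link_idle_time - now > max_nic_queue_cycles && !force)
		return 0;	/* Queue too long; caller must wait. */
	link_idle_time = max(link_idle_time, now) + cycles;
	return 1;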

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v12:
* Clean up pacer thread exit: use kthread_should_stop instead of custom
  variables.

Changes for v11:
* Clean up sparse annotations.
* Move link_mbps variable from struct homa_pacer back to struct homa.
* Cleanup and simplify use of RPC reference counts.

Changes for v10:
* Revise sparse annotations to eliminate __context__ definition
* Use kzalloc instead of __GFP_ZERO
* Remove log messages after alloc errors
* Replace __u64 with u64
* Fix xmastree violations

Changes for v9:
* Add support for homa_net objects
* Use new homa_clock abstraction layer
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory)

Changes for v8:
* This file is new in v8 (functionality extracted from other files)
---
 net/homa/homa_impl.h  |   4 +
 net/homa/homa_pacer.c | 303 ++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_pacer.h | 173 ++++++++++++++++++++++++
 3 files changed, 480 insertions(+)
 create mode 100644 net/homa/homa_pacer.c
 create mode 100644 net/homa/homa_pacer.h

diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
index f5191ec0b198..aa9be99891b0 100644
--- a/net/homa/homa_impl.h
+++ b/net/homa/homa_impl.h
@@ -421,6 +421,10 @@ static inline bool homa_make_header_avl(struct sk_buff *skb)
 
 extern unsigned int homa_net_id;
 
+int      homa_xmit_control(enum homa_packet_type type, void *contents,
+			   size_t length, struct homa_rpc *rpc);
+void     homa_xmit_data(struct homa_rpc *rpc, bool force);
+
 /**
  * homa_net_from_net() - Return the struct homa_net associated with a particular
  * struct net.
diff --git a/net/homa/homa_pacer.c b/net/homa/homa_pacer.c
new file mode 100644
index 000000000000..2450445d48f7
--- /dev/null
+++ b/net/homa/homa_pacer.c
@@ -0,0 +1,303 @@
+// SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+
+
+/* This file implements the Homa pacer, which implements SRPT for packet
+ * output. In order to do that, it throttles packet transmission to prevent
+ * the buildup of large queues in the NIC.
+ */
+
+#include "homa_impl.h"
+#include "homa_pacer.h"
+#include "homa_rpc.h"
+
+/**
+ * homa_pacer_alloc() - Allocate and initialize a new pacer object, which
+ * will hold pacer-related information for @homa.
+ * @homa:   Homa transport that the pacer will be associated with.
+ * Return:  A pointer to the new struct pacer, or a negative errno.
+ */
+struct homa_pacer *homa_pacer_alloc(struct homa *homa)
+{
+	struct homa_pacer *pacer;
+	int err;
+
+	pacer = kzalloc(sizeof(*pacer), GFP_KERNEL);
+	if (!pacer)
+		return ERR_PTR(-ENOMEM);
+	pacer->homa = homa;
+	spin_lock_init(&pacer->mutex);
+	pacer->fifo_count = 1000;
+	spin_lock_init(&pacer->throttle_lock);
+	INIT_LIST_HEAD_RCU(&pacer->throttled_rpcs);
+	pacer->fifo_fraction = 50;
+	pacer->max_nic_queue_ns = 5000;
+	pacer->throttle_min_bytes = 1000;
+	init_waitqueue_head(&pacer->wait_queue);
+	pacer->kthread = kthread_run(homa_pacer_main, pacer, "homa_pacer");
+	if (IS_ERR(pacer->kthread)) {
+		err = PTR_ERR(pacer->kthread);
+		pr_err("Homa couldn't create pacer thread: error %d\n", err);
+		goto error;
+	}
+	atomic64_set(&pacer->link_idle_time, homa_clock());
+
+	homa_pacer_update_sysctl_deps(pacer);
+	return pacer;
+
+error:
+	homa_pacer_free(pacer);
+	return ERR_PTR(err);
+}
+
+/**
+ * homa_pacer_free() - Cleanup and free the pacer object for a Homa
+ * transport.
+ * @pacer:    Object to destroy; caller must not reference the object
+ *            again once this function returns.
+ */
+void homa_pacer_free(struct homa_pacer *pacer)
+{
+	if (pacer->kthread) {
+		kthread_stop(pacer->kthread);
+		pacer->kthread = NULL;
+	}
+	kfree(pacer);
+}
+
+/**
+ * homa_pacer_check_nic_q() - This function is invoked before passing a
+ * packet to the NIC for transmission. It serves two purposes. First, it
+ * maintains an estimate of the NIC queue length. Second, it indicates to
+ * the caller whether the NIC queue is so full that no new packets should be
+ * queued (Homa's SRPT depends on keeping the NIC queue short).
+ * @pacer:    Pacer information for a Homa transport.
+ * @skb:      Packet that is about to be transmitted.
+ * @force:    True means this packet is going to be transmitted
+ *            regardless of the queue length.
+ * Return:    Nonzero is returned if either the NIC queue length is
+ *            acceptably short or @force was specified. 0 means that the
+ *            NIC queue is at capacity or beyond, so the caller should delay
+ *            the transmission of @skb. If nonzero is returned, then the
+ *            queue estimate is updated to reflect the transmission of @skb.
+ */
+int homa_pacer_check_nic_q(struct homa_pacer *pacer, struct sk_buff *skb,
+			   bool force)
+{
+	u64 idle, new_idle, clock, cycles_for_packet;
+	int bytes;
+
+	bytes = homa_get_skb_info(skb)->wire_bytes;
+	cycles_for_packet = pacer->cycles_per_mbyte;
+	cycles_for_packet *= bytes;
+	do_div(cycles_for_packet, 1000000);
+	while (1) {
+		clock = homa_clock();
+		idle = atomic64_read(&pacer->link_idle_time);
+		if ((clock + pacer->max_nic_queue_cycles) < idle && !force &&
+		    !(pacer->homa->flags & HOMA_FLAG_DONT_THROTTLE))
+			return 0;
+		if (idle < clock)
+			new_idle = clock + cycles_for_packet;
+		else
+			new_idle = idle + cycles_for_packet;
+
+		/* This method must be thread-safe. */
+		if (atomic64_cmpxchg_relaxed(&pacer->link_idle_time, idle,
+					     new_idle) == idle)
+			break;
+	}
+	return 1;
+}
+
+/**
+ * homa_pacer_main() - Top-level function for the pacer thread.
+ * @arg:  Pointer to pacer struct.
+ *
+ * Return:         Always 0.
+ */
+int homa_pacer_main(void *arg)
+{
+	struct homa_pacer *pacer = arg;
+	int status;
+
+	while (1) {
+		if (kthread_should_stop())
+			break;
+		pacer->wake_time = homa_clock();
+		homa_pacer_xmit(pacer);
+		pacer->wake_time = 0;
+		if (!list_empty(&pacer->throttled_rpcs)) {
+			/* NIC queue is full; before calling pacer again,
+			 * give other threads a chance to run (otherwise
+			 * low-level packet processing such as softirq could
+			 * get locked out).
+			 */
+			schedule();
+			continue;
+		}
+
+		status = wait_event_interruptible(pacer->wait_queue,
+				kthread_should_stop() ||
+				!list_empty(&pacer->throttled_rpcs));
+		if (status != 0 && status != -ERESTARTSYS)
+			break;
+	}
+	return 0;
+}
+
+/**
+ * homa_pacer_xmit() - Transmit packets from the throttled list until
+ * either (a) the throttled list is empty or (b) the NIC queue has
+ * reached maximum allowable length. Note: this function may be invoked
+ * from either process context or softirq (BH) level. This function is
+ * invoked from multiple places, not just in the pacer thread. The reason
+ * for this is that (as of 10/2019) Linux's scheduling of the pacer thread
+ * is unpredictable: the thread may block for long periods of time (e.g.,
+ * because it is assigned to the same CPU as a busy interrupt handler).
+ * This can result in poor utilization of the network link. So, this method
+ * gets invoked from other places as well, to increase the likelihood that we
+ * keep the link busy. Those other invocations are not guaranteed to happen,
+ * so the pacer thread provides a backstop.
+ * @pacer:    Pacer information for a Homa transport.
+ */
+void homa_pacer_xmit(struct homa_pacer *pacer)
+{
+	struct homa_rpc *rpc;
+	s64 queue_cycles;
+
+	/* Make sure only one instance of this function executes at a time. */
+	if (!spin_trylock_bh(&pacer->mutex))
+		return;
+
+	while (1) {
+		queue_cycles = atomic64_read(&pacer->link_idle_time) -
+					     homa_clock();
+		if (queue_cycles >= pacer->max_nic_queue_cycles)
+			break;
+		if (list_empty(&pacer->throttled_rpcs))
+			break;
+
+		/* Select an RPC to transmit (either SRPT or FIFO) and
+		 * take a reference on it. Must do this while holding the
+		 * throttle_lock to prevent the RPC from being reaped. Then
+		 * release the throttle lock and lock the RPC (can't acquire
+		 * the RPC lock while holding the throttle lock; see "Homa
+		 * Locking Strategy" in homa_impl.h).
+		 */
+		homa_pacer_throttle_lock(pacer);
+		pacer->fifo_count -= pacer->fifo_fraction;
+		if (pacer->fifo_count <= 0) {
+			struct homa_rpc *cur;
+			u64 oldest = ~0;
+
+			pacer->fifo_count += 1000;
+			rpc = NULL;
+			list_for_each_entry(cur, &pacer->throttled_rpcs,
+					    throttled_links) {
+				if (cur->msgout.init_time < oldest) {
+					rpc = cur;
+					oldest = cur->msgout.init_time;
+				}
+			}
+		} else {
+			rpc = list_first_entry_or_null(&pacer->throttled_rpcs,
+						       struct homa_rpc,
+						       throttled_links);
+		}
+		if (!rpc) {
+			homa_pacer_throttle_unlock(pacer);
+			break;
+		}
+		homa_rpc_hold(rpc);
+		homa_pacer_throttle_unlock(pacer);
+		homa_rpc_lock(rpc);
+		homa_xmit_data(rpc, true);
+
+		/* Note: rpc->state could be RPC_DEAD here, but the code
+		 * below should work anyway.
+		 */
+		if (!*rpc->msgout.next_xmit)
+			/* No more data can be transmitted from this message
+			 * (right now), so remove it from the throttled list.
+			 */
+			homa_pacer_unmanage_rpc(rpc);
+		homa_rpc_unlock(rpc);
+		homa_rpc_put(rpc);
+	}
+	spin_unlock_bh(&pacer->mutex);
+}
+
+/**
+ * homa_pacer_manage_rpc() - Arrange for the pacer to transmit packets
+ * from this RPC (make sure that an RPC is on the throttled list and wake up
+ * the pacer thread if necessary).
+ * @rpc:     RPC with outbound packets that have been granted but can't be
+ *           sent because of NIC queue restrictions. Must be locked by caller.
+ */
+void homa_pacer_manage_rpc(struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_pacer *pacer = rpc->hsk->homa->pacer;
+	struct homa_rpc *candidate;
+	int bytes_left;
+
+	if (!list_empty(&rpc->throttled_links))
+		return;
+	bytes_left = rpc->msgout.length - rpc->msgout.next_xmit_offset;
+	homa_pacer_throttle_lock(pacer);
+	list_for_each_entry(candidate, &pacer->throttled_rpcs,
+			    throttled_links) {
+		int bytes_left_cand;
+
+		/* Watch out: the pacer might have just transmitted the last
+		 * packet from candidate.
+		 */
+		bytes_left_cand = candidate->msgout.length -
+				candidate->msgout.next_xmit_offset;
+		if (bytes_left_cand > bytes_left) {
+			list_add_tail(&rpc->throttled_links,
+				      &candidate->throttled_links);
+			goto done;
+		}
+	}
+	list_add_tail(&rpc->throttled_links, &pacer->throttled_rpcs);
+done:
+	homa_pacer_throttle_unlock(pacer);
+	wake_up(&pacer->wait_queue);
+}
+
+/**
+ * homa_pacer_unmanage_rpc() - Make sure that an RPC is no longer managed
+ * by the pacer.
+ * @rpc:     RPC of interest.
+ */
+void homa_pacer_unmanage_rpc(struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_pacer *pacer = rpc->hsk->homa->pacer;
+
+	if (unlikely(!list_empty(&rpc->throttled_links))) {
+		homa_pacer_throttle_lock(pacer);
+		list_del_init(&rpc->throttled_links);
+		homa_pacer_throttle_unlock(pacer);
+	}
+}
+
+/**
+ * homa_pacer_update_sysctl_deps() - Update any pacer fields that depend
+ * on values set by sysctl. This function is invoked anytime a pacer sysctl
+ * value is updated.
+ * @pacer:   Pacer to update.
+ */
+void homa_pacer_update_sysctl_deps(struct homa_pacer *pacer)
+{
+	u64 tmp;
+
+	pacer->max_nic_queue_cycles =
+			homa_ns_to_cycles(pacer->max_nic_queue_ns);
+
+	/* Underestimate link bandwidth (overestimate time) by 1%. */
+	tmp = 101 * 8000 * (u64)homa_clock_khz();
+	do_div(tmp, pacer->homa->link_mbps * 100);
+	pacer->cycles_per_mbyte = tmp;
+}
diff --git a/net/homa/homa_pacer.h b/net/homa/homa_pacer.h
new file mode 100644
index 000000000000..a5279f4f576c
--- /dev/null
+++ b/net/homa/homa_pacer.h
@@ -0,0 +1,173 @@
+/* SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+ */
+
+/* This file defines structs and functions related to the Homa pacer,
+ * which implements SRPT for packet output. In order to do that, it
+ * throttles packet transmission to prevent the buildup of
+ * large queues in the NIC.
+ */
+
+#ifndef _HOMA_PACER_H
+#define _HOMA_PACER_H
+
+#include "homa_impl.h"
+
+/**
+ * struct homa_pacer - Contains information that the pacer uses to
+ * manage packet output. There is one instance of this object stored
+ * in each struct homa.
+ */
+struct homa_pacer {
+	/** @homa: Transport that this pacer is associated with. */
+	struct homa *homa;
+
+	/**
+	 * @mutex: Ensures that only one instance of homa_pacer_xmit
+	 * runs at a time. Only used in "try" mode: never block on this.
+	 */
+	spinlock_t mutex;
+
+	/**
+	 * @fifo_count: When this becomes <= zero, it's time for the
+	 * pacer to allow the oldest RPC to transmit.
+	 */
+	int fifo_count;
+
+	/**
+	 * @wake_time: homa_clock() time when the pacer woke up (if the pacer
+	 * is running) or 0 if the pacer is sleeping.
+	 */
+	u64 wake_time;
+
+	/**
+	 * @throttle_lock: Used to synchronize access to @throttled_rpcs. Must
+	 * be held when inserting or removing an RPC from throttled_rpcs.
+	 */
+	spinlock_t throttle_lock;
+
+	/**
+	 * @throttled_rpcs: Contains all homa_rpcs that have bytes ready
+	 * for transmission, but which couldn't be sent without exceeding
+	 * the NIC queue limit.
+	 */
+	struct list_head throttled_rpcs;
+
+	/**
+	 * @fifo_fraction: Out of every 1000 packets transmitted by the
+	 * pacer, this number will be transmitted from the oldest message
+	 * rather than the highest-priority message. Set externally via
+	 * sysctl.
+	 */
+	int fifo_fraction;
+
+	/**
+	 * @max_nic_queue_ns: Limits the NIC queue length: we won't queue
+	 * up a packet for transmission if link_idle_time is this many
+	 * nanoseconds in the future (or more). Set externally via sysctl.
+	 */
+	int max_nic_queue_ns;
+
+	/**
+	 * @max_nic_queue_cycles: Same as max_nic_queue_ns except in
+	 * homa_clock() units.
+	 */
+	int max_nic_queue_cycles;
+
+	/**
+	 * @cycles_per_mbyte: the number of homa_clock() cycles that it takes to
+	 * transmit 10**6 bytes on our uplink. This is actually a slight
+	 * overestimate of the value, to ensure that we don't underestimate
+	 * NIC queue length and queue too many packets.
+	 */
+	u32 cycles_per_mbyte;
+
+	/**
+	 * @throttle_min_bytes: If a packet has fewer bytes than this, then it
+	 * bypasses the throttle mechanism and is transmitted immediately.
+	 * We have this limit because for very small packets CPU overheads
+	 * make it impossible to keep up with the NIC so (a) the NIC queue
+	 * can't grow and (b) using the pacer would serialize all of these
+	 * packets through a single core, which makes things even worse.
+	 * Set externally via sysctl.
+	 */
+	int throttle_min_bytes;
+
+	/**
+	 * @wait_queue: Used to block the pacer thread when there
+	 * are no throttled RPCs.
+	 */
+	struct wait_queue_head wait_queue;
+
+	/**
+	 * @kthread: Kernel thread that transmits packets from
+	 * throttled_rpcs in a way that limits queue buildup in the
+	 * NIC.
+	 */
+	struct task_struct *kthread;
+
+	/**
+	 * @link_idle_time: The homa_clock() time at which we estimate
+	 * that all of the packets we have passed to the NIC for transmission
+	 * will have been transmitted. May be in the past. This estimate
+	 * assumes that only Homa is transmitting data, so it could be a
+	 * severe underestimate if there is competing traffic from, say, TCP.
+	 */
+	atomic64_t link_idle_time ____cacheline_aligned_in_smp;
+};
+
+struct homa_pacer *homa_pacer_alloc(struct homa *homa);
+int      homa_pacer_check_nic_q(struct homa_pacer *pacer,
+				struct sk_buff *skb, bool force);
+int      homa_pacer_dointvec(const struct ctl_table *table, int write,
+			     void *buffer, size_t *lenp, loff_t *ppos);
+void     homa_pacer_free(struct homa_pacer *pacer);
+void     homa_pacer_unmanage_rpc(struct homa_rpc *rpc);
+void     homa_pacer_log_throttled(struct homa_pacer *pacer);
+int      homa_pacer_main(void *transport);
+void     homa_pacer_manage_rpc(struct homa_rpc *rpc);
+void     homa_pacer_throttle_lock_slow(struct homa_pacer *pacer);
+void     homa_pacer_update_sysctl_deps(struct homa_pacer *pacer);
+void     homa_pacer_xmit(struct homa_pacer *pacer);
+
+/**
+ * homa_pacer_check() - This method is invoked at various places in Homa to
+ * see if the pacer needs to transmit more packets and, if so, transmit
+ * them. It's needed because the pacer thread may get descheduled by
+ * Linux, resulting in output stalls.
+ * @pacer:    Pacer information for a Homa transport.
+ */
+static inline void homa_pacer_check(struct homa_pacer *pacer)
+{
+	if (list_empty(&pacer->throttled_rpcs))
+		return;
+
+	/* The ">> 1" in the line below gives homa_pacer_main the first chance
+	 * to queue new packets; if the NIC queue becomes more than half
+	 * empty, then we will help out here.
+	 */
+	if ((homa_clock() + (pacer->max_nic_queue_cycles >> 1)) <
+			atomic64_read(&pacer->link_idle_time))
+		return;
+	homa_pacer_xmit(pacer);
+}
+
+/**
+ * homa_pacer_throttle_lock() - Acquire the throttle lock.
+ * @pacer:    Pacer information for a Homa transport.
+ */
+static inline void homa_pacer_throttle_lock(struct homa_pacer *pacer)
+	__acquires(pacer->throttle_lock)
+{
+	spin_lock_bh(&pacer->throttle_lock);
+}
+
+/**
+ * homa_pacer_throttle_unlock() - Release the throttle lock.
+ * @pacer:    Pacer information for a Homa transport.
+ */
+static inline void homa_pacer_throttle_unlock(struct homa_pacer *pacer)
+	__releases(pacer->throttle_lock)
+{
+	spin_unlock_bh(&pacer->throttle_lock);
+}
+
+#endif /* _HOMA_PACER_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 09/15] net: homa: create homa_rpc.h and homa_rpc.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (7 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 08/15] net: homa: create homa_pacer.h and homa_pacer.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-26 11:31   ` Paolo Abeni
  2025-08-18 20:55 ` [PATCH net-next v15 10/15] net: homa: create homa_outgoing.c John Ousterhout
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

These files provide basic functions for managing remote procedure calls,
which are the fundamental entities managed by Homa. Each RPC consists
of a request message from a client to a server, followed by a response
message returned from the server to the client.
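
As a rough illustration (not part of this patch), the client side of an
RPC's lifetime looks like this (error handling elided):

    crpc = homa_rpc_alloc_client(hsk, &dest);   /* returned locked */
    homa_rpc_unlock(crpc);
    ...
    /* Later, locate the RPC for an incoming packet: */
    crpc = homa_rpc_find_client(hsk, id);       /* locks the RPC */
    if (crpc) {
            ...
            homa_rpc_unlock(crpc);
    }
    /* homa_rpc_end() makes the RPC inaccessible; its resources are
     * freed later by homa_rpc_reap().
     */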

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v14:
* Add msgout.first_not_tx field needed by homa_rpc_tx_end function
  (better abstraction)

Changes for v11:
* Cleanup and simplify use of RPC reference counts.
* Rework the mechanism for waking up RPCs that stalled waiting for
  buffer pool space.

Changes for v10:
* Replace __u16 with u16, __u8 with u8, etc.
* Improve documentation
* Revise sparse annotations to eliminate __context__ definition
* Use kzalloc instead of __GFP_ZERO
* Fix issues from xmastree, sparse, etc.

Changes for v9:
* Eliminate reap.txt; move its contents into code as a comment
  in homa_rpc_reap
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory)
* Add support for homa_net objects
* Use new homa_clock abstraction layer

Changes for v8:
* Updates to reflect pacer refactoring

Changes for v7:
* Implement accounting for bytes in tx skbs
* Fix potential races related to homa->active_rpcs
* Refactor waiting mechanism for incoming packets: simplify wait
  criteria and use standard Linux mechanisms for waiting
* Add reference counting for RPCs (homa_rpc_hold, homa_rpc_put)
* Remove locker argument from locking functions
* Rename homa_rpc_free to homa_rpc_end
* Use u64 and __u64 properly
* Use __skb_queue_purge instead of skb_queue_purge
* Use __GFP_ZERO in kmalloc calls
* Eliminate spurious RCU usage
---
 net/homa/homa_impl.h |   3 +
 net/homa/homa_rpc.c  | 638 +++++++++++++++++++++++++++++++++++++++++++
 net/homa/homa_rpc.h  | 501 +++++++++++++++++++++++++++++++++
 3 files changed, 1142 insertions(+)
 create mode 100644 net/homa/homa_rpc.c
 create mode 100644 net/homa/homa_rpc.h

diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
index aa9be99891b0..cb31101f3684 100644
--- a/net/homa/homa_impl.h
+++ b/net/homa/homa_impl.h
@@ -421,10 +421,13 @@ static inline bool homa_make_header_avl(struct sk_buff *skb)
 
 extern unsigned int homa_net_id;
 
+void     homa_rpc_handoff(struct homa_rpc *rpc);
 int      homa_xmit_control(enum homa_packet_type type, void *contents,
 			   size_t length, struct homa_rpc *rpc);
 void     homa_xmit_data(struct homa_rpc *rpc, bool force);
 
+int      homa_message_in_init(struct homa_rpc *rpc, int unsched);
+
 /**
  * homa_net_from_net() - Return the struct homa_net associated with a particular
  * struct net.
diff --git a/net/homa/homa_rpc.c b/net/homa/homa_rpc.c
new file mode 100644
index 000000000000..940c85e2db73
--- /dev/null
+++ b/net/homa/homa_rpc.c
@@ -0,0 +1,638 @@
+// SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+
+
+/* This file contains functions for managing homa_rpc structs. */
+
+#include "homa_impl.h"
+#include "homa_interest.h"
+#include "homa_pacer.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+#include "homa_stub.h"
+
+/**
+ * homa_rpc_alloc_client() - Allocate and initialize a client RPC (one that
+ * is used to issue an outgoing request). Doesn't send any packets. Invoked
+ * with no locks held.
+ * @hsk:      Socket to which the RPC belongs.
+ * @dest:     Address of host (ip and port) to which the RPC will be sent.
+ *
+ * Return:    A pointer to the newly allocated object, or a negative
+ *            errno if an error occurred. The RPC will be locked; the
+ *            caller must eventually unlock it.
+ */
+struct homa_rpc *homa_rpc_alloc_client(struct homa_sock *hsk,
+				       const union sockaddr_in_union *dest)
+	__cond_acquires(crpc->bucket->lock)
+{
+	struct in6_addr dest_addr_as_ipv6 = canonical_ipv6_addr(dest);
+	struct homa_rpc_bucket *bucket;
+	struct homa_rpc *crpc;
+	int err;
+
+	crpc = kzalloc(sizeof(*crpc), GFP_KERNEL);
+	if (unlikely(!crpc))
+		return ERR_PTR(-ENOMEM);
+
+	/* Initialize fields that don't require the socket lock. */
+	crpc->hsk = hsk;
+	crpc->id = atomic64_fetch_add(2, &hsk->homa->next_outgoing_id);
+	bucket = homa_client_rpc_bucket(hsk, crpc->id);
+	crpc->bucket = bucket;
+	crpc->state = RPC_OUTGOING;
+	crpc->peer = homa_peer_get(hsk, &dest_addr_as_ipv6);
+	if (IS_ERR(crpc->peer)) {
+		err = PTR_ERR(crpc->peer);
+		crpc->peer = NULL;
+		goto error;
+	}
+	crpc->dport = ntohs(dest->in6.sin6_port);
+	crpc->msgin.length = -1;
+	crpc->msgout.length = -1;
+	INIT_LIST_HEAD(&crpc->ready_links);
+	INIT_LIST_HEAD(&crpc->buf_links);
+	INIT_LIST_HEAD(&crpc->dead_links);
+	INIT_LIST_HEAD(&crpc->throttled_links);
+	crpc->resend_timer_ticks = hsk->homa->timer_ticks;
+	crpc->magic = HOMA_RPC_MAGIC;
+	crpc->start_time = homa_clock();
+
+	/* Initialize fields that require locking. This allows the most
+	 * expensive work, such as copying in the message from user space,
+	 * to be performed without holding locks. Also, we can't hold spin
+	 * locks while doing things that could block, such as memory allocation.
+	 */
+	homa_bucket_lock(bucket, crpc->id);
+	homa_sock_lock(hsk);
+	if (hsk->shutdown) {
+		homa_sock_unlock(hsk);
+		homa_rpc_unlock(crpc);
+		err = -ESHUTDOWN;
+		goto error;
+	}
+	hlist_add_head(&crpc->hash_links, &bucket->rpcs);
+	rcu_read_lock();
+	list_add_tail_rcu(&crpc->active_links, &hsk->active_rpcs);
+	rcu_read_unlock();
+	homa_sock_unlock(hsk);
+
+	return crpc;
+
+error:
+	if (crpc->peer)
+		homa_peer_release(crpc->peer);
+	kfree(crpc);
+	return ERR_PTR(err);
+}
+
+/**
+ * homa_rpc_alloc_server() - Allocate and initialize a server RPC (one that is
+ * used to manage an incoming request). If appropriate, the RPC will also
+ * be handed off (we do it here, while we have the socket locked, to avoid
+ * acquiring the socket lock a second time later for the handoff).
+ * @hsk:      Socket that owns this RPC.
+ * @source:   IP address (network byte order) of the RPC's client.
+ * @h:        Header for the first data packet received for this RPC; used
+ *            to initialize the RPC.
+ * @created:  Will be set to 1 if a new RPC was created and 0 if an
+ *            existing RPC was found.
+ *
+ * Return:  A pointer to a new RPC, which is locked, or a negative errno
+ *          if an error occurred. If there is already an RPC corresponding
+ *          to h, then it is returned instead of creating a new RPC.
+ */
+struct homa_rpc *homa_rpc_alloc_server(struct homa_sock *hsk,
+				       const struct in6_addr *source,
+				       struct homa_data_hdr *h, int *created)
+	__cond_acquires(srpc->bucket->lock)
+{
+	u64 id = homa_local_id(h->common.sender_id);
+	struct homa_rpc_bucket *bucket;
+	struct homa_rpc *srpc = NULL;
+	int err;
+
+	if (!hsk->buffer_pool)
+		return ERR_PTR(-ENOMEM);
+
+	/* Lock the bucket, and make sure no-one else has already created
+	 * the desired RPC.
+	 */
+	bucket = homa_server_rpc_bucket(hsk, id);
+	homa_bucket_lock(bucket, id);
+	hlist_for_each_entry(srpc, &bucket->rpcs, hash_links) {
+		if (srpc->id == id &&
+		    srpc->dport == ntohs(h->common.sport) &&
+		    ipv6_addr_equal(&srpc->peer->addr, source)) {
+			/* RPC already exists; just return it instead
+			 * of creating a new RPC.
+			 */
+			*created = 0;
+			return srpc;
+		}
+	}
+
+	/* Initialize fields that don't require the socket lock. */
+	srpc = kzalloc(sizeof(*srpc), GFP_ATOMIC);
+	if (!srpc) {
+		err = -ENOMEM;
+		goto error;
+	}
+	srpc->hsk = hsk;
+	srpc->bucket = bucket;
+	srpc->state = RPC_INCOMING;
+	srpc->peer = homa_peer_get(hsk, source);
+	if (IS_ERR(srpc->peer)) {
+		err = PTR_ERR(srpc->peer);
+		srpc->peer = NULL;
+		goto error;
+	}
+	srpc->dport = ntohs(h->common.sport);
+	srpc->id = id;
+	srpc->msgin.length = -1;
+	srpc->msgout.length = -1;
+	INIT_LIST_HEAD(&srpc->ready_links);
+	INIT_LIST_HEAD(&srpc->buf_links);
+	INIT_LIST_HEAD(&srpc->dead_links);
+	INIT_LIST_HEAD(&srpc->throttled_links);
+	srpc->resend_timer_ticks = hsk->homa->timer_ticks;
+	srpc->magic = HOMA_RPC_MAGIC;
+	srpc->start_time = homa_clock();
+	err = homa_message_in_init(srpc, ntohl(h->message_length));
+	if (err != 0)
+		goto error;
+
+	/* Initialize fields that require socket to be locked. */
+	homa_sock_lock(hsk);
+	if (hsk->shutdown) {
+		homa_sock_unlock(hsk);
+		err = -ESHUTDOWN;
+		goto error;
+	}
+	hlist_add_head(&srpc->hash_links, &bucket->rpcs);
+	list_add_tail_rcu(&srpc->active_links, &hsk->active_rpcs);
+	homa_sock_unlock(hsk);
+	if (ntohl(h->seg.offset) == 0 && srpc->msgin.num_bpages > 0) {
+		atomic_or(RPC_PKTS_READY, &srpc->flags);
+		homa_rpc_handoff(srpc);
+	}
+	*created = 1;
+	return srpc;
+
+error:
+	homa_bucket_unlock(bucket, id);
+	if (srpc && srpc->peer)
+		homa_peer_release(srpc->peer);
+	kfree(srpc);
+	return ERR_PTR(err);
+}
+
+/**
+ * homa_rpc_acked() - This function is invoked when an ack is received
+ * for an RPC; if the RPC still exists, it is freed.
+ * @hsk:     Socket on which the ack was received. May or may not correspond
+ *           to the RPC, but can sometimes be used to avoid a socket lookup.
+ * @saddr:   Source address from which the ack was received (the client
+ *           node for the RPC).
+ * @ack:     Information about an RPC from @saddr that may now be deleted
+ *           safely.
+ */
+void homa_rpc_acked(struct homa_sock *hsk, const struct in6_addr *saddr,
+		    struct homa_ack *ack)
+{
+	u16 server_port = ntohs(ack->server_port);
+	u64 id = homa_local_id(ack->client_id);
+	struct homa_sock *hsk2 = hsk;
+	struct homa_rpc *rpc;
+
+	if (hsk->port != server_port) {
+		/* Without RCU, sockets other than hsk can be deleted
+		 * out from under us.
+		 */
+		hsk2 = homa_sock_find(hsk->hnet, server_port);
+		if (!hsk2)
+			return;
+	}
+	rpc = homa_rpc_find_server(hsk2, saddr, id);
+	if (rpc) {
+		homa_rpc_end(rpc);
+		homa_rpc_unlock(rpc); /* Locked by homa_rpc_find_server. */
+	}
+	if (hsk->port != server_port)
+		sock_put(&hsk2->sock);
+}
+
+/**
+ * homa_rpc_end() - Stop all activity on an RPC and begin the process of
+ * releasing its resources; this process will continue in the background
+ * until homa_rpc_reap eventually completes it.
+ * @rpc:  Structure to clean up, or NULL. Must be locked. Its socket must
+ *        not be locked. Once this function returns the caller should not
+ *        use the RPC except to unlock it.
+ */
+void homa_rpc_end(struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	/* The goal for this function is to make the RPC inaccessible,
+	 * so that no other code will ever access it again. However, don't
+	 * actually release resources or tear down the internal structure
+	 * of the RPC; leave that to homa_rpc_reap, which runs later. There
+	 * are two reasons for this. First, releasing resources may be
+	 * expensive, so we don't want to keep the caller waiting; homa_rpc_reap
+	 * will run in situations where there is time to spare. Second, there
+	 * may be other code that currently has pointers to this RPC but
+	 * temporarily released the lock (e.g. to copy data to/from user space).
+	 * It isn't safe to clean up until that code has finished its work and
+	 * released any pointers to the RPC (homa_rpc_reap will ensure that
+	 * this has happened). So, this function should only make changes
+	 * needed to make the RPC inaccessible.
+	 */
+	if (!rpc || rpc->state == RPC_DEAD)
+		return;
+	rpc->state = RPC_DEAD;
+	rpc->error = -EINVAL;
+
+	/* Unlink from all lists, so no-one will ever find this RPC again. */
+	homa_sock_lock(rpc->hsk);
+	__hlist_del(&rpc->hash_links);
+	list_del_rcu(&rpc->active_links);
+	list_add_tail(&rpc->dead_links, &rpc->hsk->dead_rpcs);
+	__list_del_entry(&rpc->ready_links);
+	__list_del_entry(&rpc->buf_links);
+	homa_interest_notify_private(rpc);
+
+	if (rpc->msgin.length >= 0) {
+		rpc->hsk->dead_skbs += skb_queue_len(&rpc->msgin.packets);
+		while (1) {
+			struct homa_gap *gap;
+
+			gap = list_first_entry_or_null(&rpc->msgin.gaps,
+						       struct homa_gap, links);
+			if (!gap)
+				break;
+			list_del(&gap->links);
+			kfree(gap);
+		}
+	}
+	rpc->hsk->dead_skbs += rpc->msgout.num_skbs;
+	if (rpc->hsk->dead_skbs > rpc->hsk->homa->max_dead_buffs)
+		/* This update isn't thread-safe; it's just a
+		 * statistic so it's OK if updates occasionally get
+		 * missed.
+		 */
+		rpc->hsk->homa->max_dead_buffs = rpc->hsk->dead_skbs;
+
+	homa_sock_unlock(rpc->hsk);
+	homa_pacer_unmanage_rpc(rpc);
+}
+
+/**
+ * homa_rpc_abort() - Terminate an RPC.
+ * @rpc:     RPC to be terminated.  Must be locked by caller.
+ * @error:   A negative errno value indicating the error that caused the abort.
+ *           If this is a client RPC, the error will be returned to the
+ *           application; if it's a server RPC, the error is ignored and
+ *           we just free the RPC.
+ */
+void homa_rpc_abort(struct homa_rpc *rpc, int error)
+	__must_hold(rpc->bucket->lock)
+{
+	if (!homa_is_client(rpc->id)) {
+		homa_rpc_end(rpc);
+		return;
+	}
+	rpc->error = error;
+	homa_rpc_handoff(rpc);
+}
+
+/**
+ * homa_abort_rpcs() - Abort all RPCs to/from a particular peer.
+ * @homa:    Overall data about the Homa protocol implementation.
+ * @addr:    Address (network order) of the destination whose RPCs are
+ *           to be aborted.
+ * @port:    If nonzero, then RPCs will only be aborted if they were
+ *	     targeted at this server port.
+ * @error:   Negative errno value indicating the reason for the abort.
+ */
+void homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
+		     int port, int error)
+{
+	struct homa_socktab_scan scan;
+	struct homa_sock *hsk;
+	struct homa_rpc *rpc;
+
+	for (hsk = homa_socktab_start_scan(homa->socktab, &scan); hsk;
+	     hsk = homa_socktab_next(&scan)) {
+		/* Skip the (expensive) lock acquisition if there's no
+		 * work to do.
+		 */
+		if (list_empty(&hsk->active_rpcs))
+			continue;
+		if (!homa_protect_rpcs(hsk))
+			continue;
+		rcu_read_lock();
+		list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+			if (!ipv6_addr_equal(&rpc->peer->addr, addr))
+				continue;
+			if (port && rpc->dport != port)
+				continue;
+			homa_rpc_lock(rpc);
+			homa_rpc_abort(rpc, error);
+			homa_rpc_unlock(rpc);
+		}
+		rcu_read_unlock();
+		homa_unprotect_rpcs(hsk);
+	}
+	homa_socktab_end_scan(&scan);
+}
+
+/**
+ * homa_rpc_reap() - Invoked to release resources associated with dead
+ * RPCs for a given socket.
+ * @hsk:      Homa socket that may contain dead RPCs. Must not be locked by the
+ *            caller; this function will lock and release.
+ * @reap_all: False means do a small chunk of work; there may still be
+ *            unreaped RPCs on return. True means reap all dead RPCs for
+ *            hsk.  Will busy-wait if reaping has been disabled for some RPCs.
+ *
+ * Return: A return value of 0 means that we ran out of work to do; calling
+ *         again will do no work (there could be unreaped RPCs, but if so,
+ *         they cannot currently be reaped).  A value greater than zero means
+ *         there is still more reaping work to be done.
+ */
+int homa_rpc_reap(struct homa_sock *hsk, bool reap_all)
+{
+	/* RPC Reaping Strategy:
+	 *
+	 * (Note: there are references to this comment elsewhere in the
+	 * Homa code)
+	 *
+	 * Most of the cost of reaping comes from freeing sk_buffs; this can be
+	 * quite expensive for RPCs with long messages.
+	 *
+	 * The natural time to reap is when homa_rpc_end is invoked to
+	 * terminate an RPC, but this doesn't work for two reasons. First,
+	 * there may be outstanding references to the RPC; it cannot be reaped
+	 * until all of those references have been released. Second, reaping
+	 * is potentially expensive and RPC termination could occur in
+	 * homa_softirq when there are short messages waiting to be processed.
+	 * Taking time to reap a long RPC could result in significant delays
+	 * for subsequent short RPCs.
+	 *
+	 * Thus Homa doesn't reap immediately in homa_rpc_end. Instead, dead
+	 * RPCs are queued up and reaping occurs in this function, which is
+	 * invoked later when it is less likely to impact latency. The
+	 * challenge is to do this so that (a) we don't allow large numbers of
+	 * dead RPCs to accumulate and (b) we minimize the impact of reaping
+	 * on latency.
+	 *
+	 * The primary place where homa_rpc_reap is invoked is when threads
+	 * are waiting for incoming messages. The thread has nothing else to
+	 * do (it may even be polling for input), so reaping can be performed
+	 * with no latency impact on the application.  However, if a machine
+	 * is overloaded then it may never wait, so this mechanism isn't always
+	 * sufficient.
+	 *
+	 * Homa now reaps in two other places, if reaping while waiting for
+	 * messages isn't adequate:
+	 * 1. If too many dead skbs accumulate, then homa_timer will call
+	 *    homa_rpc_reap.
+	 * 2. If the timer thread cannot keep up with all the reaping to be
+	 *    done, then as a last resort homa_dispatch_pkts will reap in small
+	 *    increments (a few sk_buffs or RPCs) for every incoming batch
+	 *    of packets. This is undesirable because it will impact Homa's
+	 *    performance.
+	 *
+	 * During the introduction of homa_pools for managing input
+	 * buffers, freeing of packets for incoming messages was moved to
+	 * homa_copy_to_user under the assumption that this code wouldn't be
+	 * on the critical path. However, there is evidence that with
+	 * fast networks (e.g. 100 Gbps) copying to user space is the
+	 * bottleneck for incoming messages, and packet freeing takes about
+	 * 20-25% of the total time in homa_copy_to_user. So, it may eventually
+	 * be desirable to move packet freeing out of homa_copy_to_user.
+	 */
+#define BATCH_MAX 20
+	struct homa_rpc *rpcs[BATCH_MAX];
+	struct sk_buff *skbs[BATCH_MAX];
+	int num_skbs, num_rpcs;
+	struct homa_rpc *rpc;
+	struct homa_rpc *tmp;
+	int i, batch_size;
+	int skbs_to_reap;
+	int result = 0;
+	int rx_frees;
+
+	/* Each iteration through the following loop will reap
+	 * up to BATCH_MAX skbs.
+	 */
+	skbs_to_reap = hsk->homa->reap_limit;
+	while (skbs_to_reap > 0 && !list_empty(&hsk->dead_rpcs)) {
+		batch_size = BATCH_MAX;
+		if (!reap_all) {
+			if (batch_size > skbs_to_reap)
+				batch_size = skbs_to_reap;
+			skbs_to_reap -= batch_size;
+		}
+		num_skbs = 0;
+		num_rpcs = 0;
+		rx_frees = 0;
+
+		homa_sock_lock(hsk);
+		if (atomic_read(&hsk->protect_count)) {
+			homa_sock_unlock(hsk);
+			if (reap_all)
+				continue;
+			return 0;
+		}
+
+		/* Collect buffers and freeable RPCs. */
+		list_for_each_entry_safe(rpc, tmp, &hsk->dead_rpcs,
+					 dead_links) {
+			int refs;
+
+			/* Make sure that all outstanding uses of the RPC have
+			 * completed. We can only be sure if the reference
+			 * count is zero when we're holding the lock. Note:
+			 * it isn't safe to block while locking the RPC here,
+			 * since we hold the socket lock.
+			 */
+			if (homa_rpc_try_lock(rpc)) {
+				refs = atomic_read(&rpc->refs);
+				homa_rpc_unlock(rpc);
+			} else {
+				refs = 1;
+			}
+			if (refs != 0)
+				continue;
+			rpc->magic = 0;
+
+			/* For Tx sk_buffs, collect them here but defer
+			 * freeing until after releasing the socket lock.
+			 */
+			if (rpc->msgout.length >= 0) {
+				while (rpc->msgout.packets) {
+					skbs[num_skbs] = rpc->msgout.packets;
+					rpc->msgout.packets = homa_get_skb_info(
+						rpc->msgout.packets)->next_skb;
+					num_skbs++;
+					rpc->msgout.num_skbs--;
+					if (num_skbs >= batch_size)
+						goto release;
+				}
+			}
+
+			/* In the normal case rx sk_buffs will already have been
+			 * freed before we got here. Thus it's OK to free
+			 * immediately in rare situations where there are
+			 * buffers left.
+			 */
+			if (rpc->msgin.length >= 0 &&
+			    !skb_queue_empty_lockless(&rpc->msgin.packets)) {
+				rx_frees += skb_queue_len(&rpc->msgin.packets);
+				__skb_queue_purge(&rpc->msgin.packets);
+			}
+
+			/* If we get here, it means all packets have been
+			 * removed from the RPC.
+			 */
+			rpcs[num_rpcs] = rpc;
+			num_rpcs++;
+			list_del(&rpc->dead_links);
+			WARN_ON(refcount_sub_and_test(rpc->msgout.skb_memory,
+						      &hsk->sock.sk_wmem_alloc));
+			if (num_rpcs >= batch_size)
+				goto release;
+		}
+
+		/* Free all of the collected resources; release the socket
+		 * lock while doing this.
+		 */
+release:
+		hsk->dead_skbs -= num_skbs + rx_frees;
+		result = !list_empty(&hsk->dead_rpcs) &&
+				(num_skbs + num_rpcs) != 0;
+		homa_sock_unlock(hsk);
+		homa_skb_free_many_tx(hsk->homa, skbs, num_skbs);
+		for (i = 0; i < num_rpcs; i++) {
+			rpc = rpcs[i];
+
+			if (unlikely(rpc->msgin.num_bpages))
+				homa_pool_release_buffers(rpc->hsk->buffer_pool,
+							  rpc->msgin.num_bpages,
+							  rpc->msgin.bpage_offsets);
+			if (rpc->msgin.length >= 0) {
+				while (1) {
+					struct homa_gap *gap;
+
+					gap = list_first_entry_or_null(
+							&rpc->msgin.gaps,
+							struct homa_gap,
+							links);
+					if (!gap)
+						break;
+					list_del(&gap->links);
+					kfree(gap);
+				}
+			}
+			if (rpc->peer) {
+				homa_peer_release(rpc->peer);
+				rpc->peer = NULL;
+			}
+			rpc->state = 0;
+			kfree(rpc);
+		}
+		homa_sock_wakeup_wmem(hsk);
+		if (!result && !reap_all)
+			break;
+	}
+	homa_pool_check_waiting(hsk->buffer_pool);
+	return result;
+}
+
+/**
+ * homa_abort_sock_rpcs() - Abort all outgoing (client-side) RPCs on a given
+ * socket.
+ * @hsk:         Socket whose RPCs should be aborted.
+ * @error:       Zero means that the aborted RPCs should be freed immediately.
+ *               A nonzero value means that the RPCs should be marked
+ *               complete, so that they can be returned to the application;
+ *               this value (a negative errno) will be returned from
+ *               recvmsg.
+ */
+void homa_abort_sock_rpcs(struct homa_sock *hsk, int error)
+{
+	struct homa_rpc *rpc;
+
+	if (list_empty(&hsk->active_rpcs))
+		return;
+	if (!homa_protect_rpcs(hsk))
+		return;
+	rcu_read_lock();
+	list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+		if (!homa_is_client(rpc->id))
+			continue;
+		homa_rpc_lock(rpc);
+		if (rpc->state == RPC_DEAD) {
+			homa_rpc_unlock(rpc);
+			continue;
+		}
+		if (error)
+			homa_rpc_abort(rpc, error);
+		else
+			homa_rpc_end(rpc);
+		homa_rpc_unlock(rpc);
+	}
+	rcu_read_unlock();
+	homa_unprotect_rpcs(hsk);
+}
+
+/**
+ * homa_rpc_find_client() - Locate client-side information about the RPC that
+ * a packet belongs to, if there is any. Thread-safe without socket lock.
+ * @hsk:      Socket via which packet was received.
+ * @id:       Unique identifier for the RPC.
+ *
+ * Return:    A pointer to the homa_rpc for this id, or NULL if none.
+ *            The RPC will be locked; the caller must eventually unlock it
+ *            by invoking homa_rpc_unlock.
+ */
+struct homa_rpc *homa_rpc_find_client(struct homa_sock *hsk, u64 id)
+	__cond_acquires(crpc->bucket->lock)
+{
+	struct homa_rpc_bucket *bucket = homa_client_rpc_bucket(hsk, id);
+	struct homa_rpc *crpc;
+
+	homa_bucket_lock(bucket, id);
+	hlist_for_each_entry(crpc, &bucket->rpcs, hash_links) {
+		if (crpc->id == id)
+			return crpc;
+	}
+	homa_bucket_unlock(bucket, id);
+	return NULL;
+}
+
+/**
+ * homa_rpc_find_server() - Locate server-side information about the RPC that
+ * a packet belongs to, if there is any. Thread-safe without socket lock.
+ * @hsk:      Socket via which packet was received.
+ * @saddr:    Address from which the packet was sent.
+ * @id:       Unique identifier for the RPC (must have server bit set).
+ *
+ * Return:    A pointer to the homa_rpc matching the arguments, or NULL
+ *            if none. The RPC will be locked; the caller must eventually
+ *            unlock it by invoking homa_rpc_unlock.
+ */
+struct homa_rpc *homa_rpc_find_server(struct homa_sock *hsk,
+				      const struct in6_addr *saddr, u64 id)
+	__cond_acquires(srpc->bucket->lock)
+{
+	struct homa_rpc_bucket *bucket = homa_server_rpc_bucket(hsk, id);
+	struct homa_rpc *srpc;
+
+	homa_bucket_lock(bucket, id);
+	hlist_for_each_entry(srpc, &bucket->rpcs, hash_links) {
+		if (srpc->id == id && ipv6_addr_equal(&srpc->peer->addr, saddr))
+			return srpc;
+	}
+	homa_bucket_unlock(bucket, id);
+	return NULL;
+}
diff --git a/net/homa/homa_rpc.h b/net/homa/homa_rpc.h
new file mode 100644
index 000000000000..b9f77f3d401c
--- /dev/null
+++ b/net/homa/homa_rpc.h
@@ -0,0 +1,501 @@
+/* SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+ */
+
+/* This file defines homa_rpc and related structs.  */
+
+#ifndef _HOMA_RPC_H
+#define _HOMA_RPC_H
+
+#include <linux/percpu-defs.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+
+#include "homa_sock.h"
+#include "homa_wire.h"
+
+/* Forward references. */
+struct homa_ack;
+
+/**
+ * struct homa_message_out - Describes a message (either request or response)
+ * for which this machine is the sender.
+ */
+struct homa_message_out {
+	/**
+	 * @length: Total bytes in message (excluding headers).  A value
+	 * less than 0 means this structure is uninitialized and therefore
+	 * not in use (all other fields will be zero in this case).
+	 */
+	int length;
+
+	/** @num_skbs: Total number of buffers currently in @packets. */
+	int num_skbs;
+
+	/**
+	 * @skb_memory: Total number of bytes of memory occupied by
+	 * the sk_buffs for this message.
+	 */
+	int skb_memory;
+
+	/**
+	 * @copied_from_user: Number of bytes of the message that have
+	 * been copied from user space into skbs in @packets.
+	 */
+	int copied_from_user;
+
+	/**
+	 * @packets: Singly-linked list of all packets in message, linked
+	 * using homa_next_skb. The list is in order of offset in the message
+	 * (offset 0 first); each sk_buff can potentially contain multiple
+	 * data_segments, which will be split into separate packets by GSO.
+	 * This list grows gradually as data is copied in from user space,
+	 * so it may not be complete.
+	 */
+	struct sk_buff *packets;
+
+	/**
+	 * @next_xmit: Pointer to pointer to next packet to transmit (will
+	 * either refer to @packets or homa_next_skb(skb) for some skb
+	 * in @packets).
+	 */
+	struct sk_buff **next_xmit;
+
+	/**
+	 * @next_xmit_offset: All bytes in the message, up to but not
+	 * including this one, have been passed to ip_queue_xmit or
+	 * ip6_xmit.
+	 */
+	int next_xmit_offset;
+
+	/**
+	 * @first_not_tx: All packets in @packets preceding this one have
+	 * been confirmed to have been transmitted by the NIC (the driver
+	 * has released its reference). NULL means all packets are known to
+	 * have been transmitted. Used by homa_rpc_tx_end.
+	 */
+	struct sk_buff *first_not_tx;
+
+	/**
+	 * @init_time: homa_clock() time when this structure was initialized.
+	 * Used to find the oldest outgoing message.
+	 */
+	u64 init_time;
+};
+
+/**
+ * struct homa_gap - Represents a range of bytes within a message that have
+ * not yet been received.
+ */
+struct homa_gap {
+	/** @start: offset of first byte in this gap. */
+	int start;
+
+	/** @end: offset of byte just after last one in this gap. */
+	int end;
+
+	/**
+	 * @time: homa_clock() time when the gap was first detected.
+	 * As of 7/2024 this isn't used for anything.
+	 */
+	u64 time;
+
+	/** @links: for linking into list in homa_message_in. */
+	struct list_head links;
+};
+
+/**
+ * struct homa_message_in - Holds the state of a message received by
+ * this machine; used for both requests and responses.
+ */
+struct homa_message_in {
+	/**
+	 * @length: Payload size in bytes. A value less than 0 means this
+	 * structure is uninitialized and therefore not in use.
+	 */
+	int length;
+
+	/**
+	 * @packets: DATA packets for this message that have been received but
+	 * not yet copied to user space (no particular order).
+	 */
+	struct sk_buff_head packets;
+
+	/**
+	 * @recv_end: Offset of the byte just after the highest one that
+	 * has been received so far.
+	 */
+	int recv_end;
+
+	/**
+	 * @gaps: List of homa_gaps describing all of the bytes with
+	 * offsets less than @recv_end that have not yet been received.
+	 */
+	struct list_head gaps;
+
+	/**
+	 * @bytes_remaining: Amount of data for this message that has
+	 * not yet been received; will determine the message's priority.
+	 */
+	int bytes_remaining;
+
+	/**
+	 * @num_bpages: The number of entries in @bpage_offsets used for this
+	 * message (0 means buffers not allocated yet).
+	 */
+	u32 num_bpages;
+
+	/**
+	 * @bpage_offsets: Describes buffer space allocated for this message.
+	 * Each entry is an offset from the start of the buffer region.
+	 * All but the last pointer refer to areas of size HOMA_BPAGE_SIZE.
+	 */
+	u32 bpage_offsets[HOMA_MAX_BPAGES];
+};
+
+/**
+ * struct homa_rpc - One of these structures exists for each active
+ * RPC. The same structure is used to manage both outgoing RPCs on
+ * clients and incoming RPCs on servers.
+ */
+struct homa_rpc {
+	/** @hsk:  Socket that owns the RPC. */
+	struct homa_sock *hsk;
+
+	/**
+	 * @bucket: Pointer to the bucket in hsk->client_rpc_buckets or
+	 * hsk->server_rpc_buckets where this RPC is linked. Used primarily
+	 * for locking the RPC (which is done by locking its bucket).
+	 */
+	struct homa_rpc_bucket *bucket;
+
+	/**
+	 * @state: The current state of this RPC:
+	 *
+	 * @RPC_OUTGOING:     The RPC is waiting for @msgout to be transmitted
+	 *                    to the peer.
+	 * @RPC_INCOMING:     The RPC is waiting for data @msgin to be received
+	 *                    from the peer; at least one packet has already
+	 *                    been received.
+	 * @RPC_IN_SERVICE:   Used only for server RPCs: the request message
+	 *                    has been read from the socket, but the response
+	 *                    message has not yet been presented to the kernel.
+	 * @RPC_DEAD:         RPC has been deleted and is waiting to be
+	 *                    reaped. In some cases, information in the RPC
+	 *                    structure may be accessed in this state.
+	 *
+	 * Client RPCs pass through states in the following order:
+	 * RPC_OUTGOING, RPC_INCOMING, RPC_DEAD.
+	 *
+	 * Server RPCs pass through states in the following order:
+	 * RPC_INCOMING, RPC_IN_SERVICE, RPC_OUTGOING, RPC_DEAD.
+	 */
+	enum {
+		RPC_OUTGOING            = 5,
+		RPC_INCOMING            = 6,
+		RPC_IN_SERVICE          = 8,
+		RPC_DEAD                = 9
+	} state;
+
+	/**
+	 * @flags: Additional state information: an OR'ed combination of
+	 * various single-bit flags. See below for definitions. Must be
+	 * manipulated with atomic operations because some of the manipulations
+	 * occur without holding the RPC lock.
+	 */
+	atomic_t flags;
+
+	/* Valid bits for @flags:
+	 * RPC_PKTS_READY -        The RPC has input packets ready to be
+	 *                         copied to user space.
+	 * APP_NEEDS_LOCK -        Means that code in the application thread
+	 *                         needs the RPC lock (e.g. so it can start
+	 *                         copying data to user space) so others
+	 *                         (e.g. SoftIRQ processing) should relinquish
+	 *                         the lock ASAP. Without this, SoftIRQ can
+	 *                         lock out the application for a long time,
+	 *                         preventing data copies to user space from
+	 *                         starting (and they limit throughput at
+	 *                         high network speeds).
+	 * RPC_PRIVATE -           This RPC will be waited on in "private" mode,
+	 *                         where the app explicitly requests the
+	 *                         response from this particular RPC.
+	 */
+#define RPC_PKTS_READY        1
+#define APP_NEEDS_LOCK        4
+#define RPC_PRIVATE           8
+
+	/**
+	 * @refs: Number of unmatched calls to homa_rpc_hold; it's not safe
+	 * to free the RPC until this is zero.
+	 */
+	atomic_t refs;
+
+	/**
+	 * @peer: Information about the other machine (the server, if
+	 * this is a client RPC, or the client, if this is a server RPC).
+	 * If non-NULL then we own a reference on the object.
+	 */
+	struct homa_peer *peer;
+
+	/** @dport: Port number on @peer that will handle packets. */
+	u16 dport;
+
+	/**
+	 * @id: Unique identifier for the RPC among all those issued
+	 * from its port. The low-order bit indicates whether we are
+	 * server (1) or client (0) for this RPC.
+	 */
+	u64 id;
+
+	/**
+	 * @completion_cookie: Only used on clients. Contains identifying
+	 * information about the RPC provided by the application; returned to
+	 * the application with the RPC's result.
+	 */
+	u64 completion_cookie;
+
+	/**
+	 * @error: Only used on clients. If nonzero, then the RPC has
+	 * failed and the value is a negative errno that describes the
+	 * problem.
+	 */
+	int error;
+
+	/**
+	 * @msgin: Information about the message we receive for this RPC
+	 * (for server RPCs this is the request, for client RPCs this is the
+	 * response).
+	 */
+	struct homa_message_in msgin;
+
+	/**
+	 * @msgout: Information about the message we send for this RPC
+	 * (for client RPCs this is the request, for server RPCs this is the
+	 * response).
+	 */
+	struct homa_message_out msgout;
+
+	/**
+	 * @hash_links: Used to link this object into a hash bucket for
+	 * either @hsk->client_rpc_buckets (for a client RPC), or
+	 * @hsk->server_rpc_buckets (for a server RPC).
+	 */
+	struct hlist_node hash_links;
+
+	/**
+	 * @ready_links: Used to link this object into @hsk->ready_rpcs.
+	 */
+	struct list_head ready_links;
+
+	/**
+	 * @buf_links: Used to link this RPC into @hsk->waiting_for_bufs.
+	 * If the RPC isn't on @hsk->waiting_for_bufs, this is an empty
+	 * list pointing to itself.
+	 */
+	struct list_head buf_links;
+
+	/**
+	 * @active_links: For linking this object into @hsk->active_rpcs.
+	 * The next field will be LIST_POISON1 if this RPC hasn't yet been
+	 * linked into @hsk->active_rpcs. Access with RCU.
+	 */
+	struct list_head active_links;
+
+	/** @dead_links: For linking this object into @hsk->dead_rpcs. */
+	struct list_head dead_links;
+
+	/**
+	 * @private_interest: If there is a thread waiting for this RPC in
+	 * homa_wait_private, then this points to that thread's interest.
+	 */
+	struct homa_interest *private_interest;
+
+	/**
+	 * @throttled_links: Used to link this RPC into
+	 * homa->pacer.throttled_rpcs. If this RPC isn't in
+	 * homa->pacer.throttled_rpcs, this is an empty
+	 * list pointing to itself.
+	 */
+	struct list_head throttled_links;
+
+	/**
+	 * @silent_ticks: Number of times homa_timer has been invoked
+	 * since the last time a packet indicating progress was received
+	 * for this RPC, so we don't need to send a resend for a while.
+	 */
+	int silent_ticks;
+
+	/**
+	 * @resend_timer_ticks: Value of homa->timer_ticks the last time
+	 * we sent a RESEND for this RPC.
+	 */
+	u32 resend_timer_ticks;
+
+	/**
+	 * @done_timer_ticks: The value of homa->timer_ticks the first
+	 * time we noticed that this (server) RPC is done (all response
+	 * packets have been transmitted), so we're ready for an ack.
+	 * Zero means we haven't reached that point yet.
+	 */
+	u32 done_timer_ticks;
+
+	/**
+	 * @magic: when the RPC is alive, this holds a distinct value that
+	 * is unlikely to occur naturally. The value is cleared when the
+	 * RPC is reaped, so we can detect accidental use of an RPC after
+	 * it has been reaped.
+	 */
+#define HOMA_RPC_MAGIC 0xdeadbeef
+	int magic;
+
+	/**
+	 * @start_time: homa_clock() time when this RPC was created. Used
+	 * occasionally for testing.
+	 */
+	u64 start_time;
+};
+
+void     homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
+			 int port, int error);
+void     homa_abort_sock_rpcs(struct homa_sock *hsk, int error);
+void     homa_rpc_abort(struct homa_rpc *crpc, int error);
+struct homa_rpc
+	*homa_rpc_alloc_client(struct homa_sock *hsk,
+			       const union sockaddr_in_union *dest);
+struct homa_rpc
+	*homa_rpc_alloc_server(struct homa_sock *hsk,
+			       const struct in6_addr *source,
+			       struct homa_data_hdr *h, int *created);
+void     homa_rpc_end(struct homa_rpc *rpc);
+struct homa_rpc
+	*homa_rpc_find_client(struct homa_sock *hsk, u64 id);
+struct homa_rpc
+	*homa_rpc_find_server(struct homa_sock *hsk,
+			      const struct in6_addr *saddr, u64 id);
+void     homa_rpc_acked(struct homa_sock *hsk, const struct in6_addr *saddr,
+			struct homa_ack *ack);
+int      homa_rpc_reap(struct homa_sock *hsk, bool reap_all);
+
+/**
+ * homa_rpc_lock() - Acquire the lock for an RPC.
+ * @rpc:    RPC to lock.
+ */
+static inline void homa_rpc_lock(struct homa_rpc *rpc)
+	__acquires(rpc->bucket->lock)
+{
+	homa_bucket_lock(rpc->bucket, rpc->id);
+}
+
+/**
+ * homa_rpc_try_lock() - Acquire the lock for an RPC if it is available.
+ * @rpc:       RPC to lock.
+ * Return:     Nonzero if lock was successfully acquired, zero if it is
+ *             currently owned by someone else.
+ */
+static inline int homa_rpc_try_lock(struct homa_rpc *rpc)
+	__cond_acquires(rpc->bucket->lock)
+{
+	if (!spin_trylock_bh(&rpc->bucket->lock))
+		return 0;
+	return 1;
+}
+
+/**
+ * homa_rpc_unlock() - Release the lock for an RPC.
+ * @rpc:   RPC to unlock.
+ */
+static inline void homa_rpc_unlock(struct homa_rpc *rpc)
+	__releases(rpc->bucket->lock)
+{
+	homa_bucket_unlock(rpc->bucket, rpc->id);
+}
+
+/**
+ * homa_protect_rpcs() - Ensures that no RPCs will be reaped for a given
+ * socket until homa_unprotect_rpcs is called. Typically used by functions
+ * that want to scan the active RPCs for a socket without holding the socket
+ * lock.  Multiple calls to this function may be in effect at once. See
+ * "Homa Locking Strategy" in homa_impl.h for more info on why this function
+ * is needed.
+ * @hsk:    Socket whose RPCs should be protected. Must not be locked
+ *          by the caller; will be locked here.
+ *
+ * Return:  1 for success, 0 if the socket has been shutdown, in which
+ *          case its RPCs cannot be protected.
+ */
+static inline int homa_protect_rpcs(struct homa_sock *hsk)
+{
+	int result;
+
+	homa_sock_lock(hsk);
+	result = !hsk->shutdown;
+	if (result)
+		atomic_inc(&hsk->protect_count);
+	homa_sock_unlock(hsk);
+	return result;
+}
+
+/**
+ * homa_unprotect_rpcs() - Cancel the effect of a previous call to
+ * homa_protect_rpcs(), so that RPCs can once again be reaped.
+ * @hsk:    Socket whose RPCs should be unprotected.
+ */
+static inline void homa_unprotect_rpcs(struct homa_sock *hsk)
+{
+	atomic_dec(&hsk->protect_count);
+}
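+
+/* Illustrative scan pattern (mirrors homa_abort_rpcs in homa_rpc.c):
+ *
+ *	if (homa_protect_rpcs(hsk)) {
+ *		rcu_read_lock();
+ *		list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links)
+ *			...;
+ *		rcu_read_unlock();
+ *		homa_unprotect_rpcs(hsk);
+ *	}
+ */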
+
+/**
+ * homa_rpc_hold() - Increment the reference count on an RPC, which will
+ * prevent it from being freed until homa_rpc_put() is called. References
+ * are taken in two situations:
+ * 1. An RPC is going to be manipulated by a collection of functions. In
+ *    this case the top-most function that identifies the RPC takes the
+ *    reference; any function that receives an RPC as an argument can
+ *    assume that a reference has been taken on the RPC by some higher
+ *    function on the call stack.
+ * 2. A pointer to an RPC is stored in an object for use later, such as
+ *    an interest. A reference must be held as long as the pointer remains
+ *    accessible in the object.
+ * @rpc:      RPC on which to take a reference.
+ */
+static inline void homa_rpc_hold(struct homa_rpc *rpc)
+{
+	atomic_inc(&rpc->refs);
+}
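+
+/* Illustrative only (not part of the API): a typical pattern when an
+ * RPC's lock must be dropped while the RPC must remain alive:
+ *
+ *	homa_rpc_hold(rpc);
+ *	homa_rpc_unlock(rpc);
+ *	...do work that may block...
+ *	homa_rpc_lock(rpc);
+ *	homa_rpc_put(rpc);
+ */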
+
+/**
+ * homa_rpc_put() - Release a reference on an RPC (cancels the effect of
+ * a previous call to homa_rpc_hold).
+ * @rpc:      RPC to release.
+ */
+static inline void homa_rpc_put(struct homa_rpc *rpc)
+{
+	atomic_dec(&rpc->refs);
+}
+
+/**
+ * homa_is_client() - Returns true if we are the client for a particular
+ * RPC, false if we are the server.
+ * @id:  Id of the RPC in question.
+ * Return: true if we are the client for RPC id, false otherwise
+ */
+static inline bool homa_is_client(u64 id)
+{
+	return (id & 1) == 0;
+}
+
+/**
+ * homa_rpc_needs_attention() - Returns true if @rpc has failed or if
+ * its incoming message is ready for attention by an application thread
+ * (e.g., packets are ready to copy to user space).
+ * @rpc: RPC to check.
+ * Return: See above
+ */
+static inline bool homa_rpc_needs_attention(struct homa_rpc *rpc)
+{
+	return (rpc->error != 0 || atomic_read(&rpc->flags) & RPC_PKTS_READY);
+}
+
+#endif /* _HOMA_RPC_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 10/15] net: homa: create homa_outgoing.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (8 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 09/15] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-26 11:50   ` Paolo Abeni
  2025-08-18 20:55 ` [PATCH net-next v15 11/15] net: homa: create homa_utils.c John Ousterhout
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

This file does most of the work of transmitting outgoing messages.
It is also responsible for copying data from user space into skbs.
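
As a rough sketch (illustrative, not a complete example), the send path
is driven by homa_message_out_fill, given a locked RPC and an iov_iter
describing the message in user space:

    err = homa_message_out_fill(rpc, &iter, 1);
    /* Copies data from user space into GSO skbs; with a nonzero third
     * argument, transmission is overlapped with copying, subject to
     * the pacer's NIC queue limit (homa_pacer_check_nic_q).
     */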

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v14:
* Implement homa_rpc_tx_end function

Changes for v13:
* Fix bug in homa_resend_data: wasn't fully initializing new skb.
* Fix bug in homa_tx_data_pkt_alloc: wasn't allocating enough space
  in the new skb.

Changes for v12:
* Move the RPC_DEAD check in homa_xmit_data to eliminate a window and
  give more complete coverage.

Changes for v11:
* Cleanup and simplify use of RPC reference counts.

Changes for v10:
* Revise sparse annotations to eliminate __context__ definition
* Remove log messages after alloc errors

Changes for v9:
* Use new homa_clock abstraction layer
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory)
* Eliminate sizeof32 define: use sizeof instead

Changes for v7:
* Implement accounting for bytes in tx skbs
* Rename UNKNOWN packet type to RPC_UNKNOWN
* Use new RPC reference counts; eliminates need for RCU
* Remove locker argument from locking functions
* Use u64 and __u64 properly
* Fix incorrect skb check in homa_message_out_fill
---
 net/homa/homa_impl.h     |  14 +
 net/homa/homa_outgoing.c | 599 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 613 insertions(+)
 create mode 100644 net/homa/homa_outgoing.c

diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
index cb31101f3684..c6c3cae9ac52 100644
--- a/net/homa/homa_impl.h
+++ b/net/homa/homa_impl.h
@@ -421,12 +421,26 @@ static inline bool homa_make_header_avl(struct sk_buff *skb)
 
 extern unsigned int homa_net_id;
 
+int      homa_fill_data_interleaved(struct homa_rpc *rpc,
+				    struct sk_buff *skb, struct iov_iter *iter);
+int      homa_message_out_fill(struct homa_rpc *rpc,
+			       struct iov_iter *iter, int xmit);
+void     homa_message_out_init(struct homa_rpc *rpc, int length);
 void     homa_rpc_handoff(struct homa_rpc *rpc);
+int      homa_rpc_tx_end(struct homa_rpc *rpc);
+struct sk_buff *homa_tx_data_pkt_alloc(struct homa_rpc *rpc,
+				       struct iov_iter *iter, int offset,
+				       int length, int max_seg_data);
 int      homa_xmit_control(enum homa_packet_type type, void *contents,
 			   size_t length, struct homa_rpc *rpc);
+int      __homa_xmit_control(void *contents, size_t length,
+			     struct homa_peer *peer, struct homa_sock *hsk);
 void     homa_xmit_data(struct homa_rpc *rpc, bool force);
+void     homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk);
 
 int      homa_message_in_init(struct homa_rpc *rpc, int unsched);
+void     homa_resend_data(struct homa_rpc *rpc, int start, int end);
+void     __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc);
 
 /**
  * homa_net_from_net() - Return the struct homa_net associated with a particular
diff --git a/net/homa/homa_outgoing.c b/net/homa/homa_outgoing.c
new file mode 100644
index 000000000000..0e373c498245
--- /dev/null
+++ b/net/homa/homa_outgoing.c
@@ -0,0 +1,599 @@
+// SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+
+
+/* This file contains functions related to the sender side of message
+ * transmission. It also contains utility functions for sending packets.
+ */
+
+#include "homa_impl.h"
+#include "homa_pacer.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_wire.h"
+#include "homa_stub.h"
+
+/**
+ * homa_message_out_init() - Initialize rpc->msgout.
+ * @rpc:       RPC whose output message should be initialized. Must be
+ *             locked by caller.
+ * @length:    Number of bytes that will eventually be in rpc->msgout.
+ */
+void homa_message_out_init(struct homa_rpc *rpc, int length)
+	__must_hold(rpc->bucket->lock)
+{
+	memset(&rpc->msgout, 0, sizeof(rpc->msgout));
+	rpc->msgout.length = length;
+	rpc->msgout.next_xmit = &rpc->msgout.packets;
+	rpc->msgout.init_time = homa_clock();
+}
+
+/**
+ * homa_fill_data_interleaved() - This function is invoked to fill in the
+ * part of a data packet after the initial header, when GSO is being used.
+ * homa_seg_hdrs must be interleaved with the data to provide the correct
+ * offset for each segment.
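+ *
+ * For illustration, the GSO packet built here looks roughly like:
+ *
+ *   homa_data_hdr | seg 0 data | homa_seg_hdr | seg 1 data | ...
+ *
+ * When the packet is later segmented, the initial header is replicated
+ * into each wire packet and the interleaved homa_seg_hdr supplies that
+ * segment's offset.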
+ * @rpc:            RPC whose output message is being created. Must be
+ *                  locked by caller.
+ * @skb:            The packet being filled. The initial homa_data_hdr was
+ *                  created and initialized by the caller and the
+ *                  homa_skb_info has been filled in with the packet geometry.
+ * @iter:           Describes location(s) of (remaining) message data in user
+ *                  space.
+ * Return:          Either a negative errno or 0 (for success).
+ */
+int homa_fill_data_interleaved(struct homa_rpc *rpc, struct sk_buff *skb,
+			       struct iov_iter *iter)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_skb_info *homa_info = homa_get_skb_info(skb);
+	int seg_length = homa_info->seg_length;
+	int bytes_left = homa_info->data_bytes;
+	int offset = homa_info->offset;
+	int err;
+
+	/* Each iteration of the following loop adds info for one packet,
+	 * which includes a homa_seg_hdr followed by the data for that
+	 * segment. The first homa_seg_hdr was already added by the caller.
+	 */
+	while (1) {
+		struct homa_seg_hdr seg;
+
+		if (bytes_left < seg_length)
+			seg_length = bytes_left;
+		err = homa_skb_append_from_iter(rpc->hsk->homa, skb, iter,
+						seg_length);
+		if (err != 0)
+			return err;
+		bytes_left -= seg_length;
+		offset += seg_length;
+
+		if (bytes_left == 0)
+			break;
+
+		seg.offset = htonl(offset);
+		err = homa_skb_append_to_frag(rpc->hsk->homa, skb, &seg,
+					      sizeof(seg));
+		if (err != 0)
+			return err;
+	}
+	return 0;
+}
+
+/**
+ * homa_tx_data_pkt_alloc() - Allocate a new sk_buff and fill it with an
+ * outgoing Homa data packet. The resulting packet will be a GSO packet
+ * that will eventually be segmented by the NIC.
+ * @rpc:          RPC that packet will belong to (msgout must have been
+ *                initialized). Must be locked by caller.
+ * @iter:         Describes location(s) of (remaining) message data in user
+ *                space.
+ * @offset:       Offset in the message of the first byte of data in this
+ *                packet.
+ * @length:       How many bytes of data to include in the skb. Caller must
+ *                ensure that this amount of data isn't too much for a
+ *                well-formed GSO packet, and that iter has at least this
+ *                much data.
+ * @max_seg_data: Maximum number of bytes of message data that can go in
+ *                a single segment of the GSO packet.
+ * Return: A pointer to the new packet, or a negative errno.
+ */
+struct sk_buff *homa_tx_data_pkt_alloc(struct homa_rpc *rpc,
+				       struct iov_iter *iter, int offset,
+				       int length, int max_seg_data)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_skb_info *homa_info;
+	struct homa_data_hdr *h;
+	struct sk_buff *skb;
+	int err, gso_size;
+	u64 segs;
+
+	segs = length + max_seg_data - 1;
+	do_div(segs, max_seg_data);
+
+	/* Initialize the overall skb. */
+	skb = homa_skb_alloc_tx(sizeof(struct homa_data_hdr) + length +
+			      (segs - 1) * sizeof(struct homa_seg_hdr));
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	/* Fill in the Homa header (which will be replicated in every
+	 * network packet by GSO).
+	 */
+	h = (struct homa_data_hdr *)skb_put(skb, sizeof(struct homa_data_hdr));
+	h->common.sport = htons(rpc->hsk->port);
+	h->common.dport = htons(rpc->dport);
+	h->common.sequence = htonl(offset);
+	h->common.type = DATA;
+	homa_set_doff(h, sizeof(struct homa_data_hdr));
+	h->common.checksum = 0;
+	h->common.sender_id = cpu_to_be64(rpc->id);
+	h->message_length = htonl(rpc->msgout.length);
+	h->ack.client_id = 0;
+	homa_peer_get_acks(rpc->peer, 1, &h->ack);
+	h->retransmit = 0;
+	h->seg.offset = htonl(offset);
+
+	homa_info = homa_get_skb_info(skb);
+	homa_info->next_skb = NULL;
+	homa_info->wire_bytes = length + segs * (sizeof(struct homa_data_hdr)
+			+  rpc->hsk->ip_header_length + HOMA_ETH_OVERHEAD);
+	homa_info->data_bytes = length;
+	homa_info->seg_length = max_seg_data;
+	homa_info->offset = offset;
+	homa_info->bytes_left = rpc->msgout.length - offset;
+	homa_info->rpc = rpc;
+
+	if (segs > 1) {
+		homa_set_doff(h, sizeof(struct homa_data_hdr)  -
+				sizeof(struct homa_seg_hdr));
+		gso_size = max_seg_data + sizeof(struct homa_seg_hdr);
+		err = homa_fill_data_interleaved(rpc, skb, iter);
+	} else {
+		gso_size = max_seg_data;
+		err = homa_skb_append_from_iter(rpc->hsk->homa, skb, iter,
+						length);
+	}
+	if (err)
+		goto error;
+
+	if (segs > 1) {
+		skb_shinfo(skb)->gso_segs = segs;
+		skb_shinfo(skb)->gso_size = gso_size;
+
+		/* It's unclear what gso_type should be used to force software
+		 * GSO; the value below seems to work...
+		 */
+		skb_shinfo(skb)->gso_type =
+		    rpc->hsk->homa->gso_force_software ? 0xd : SKB_GSO_TCPV6;
+	}
+	return skb;
+
+error:
+	homa_skb_free_tx(rpc->hsk->homa, skb);
+	return ERR_PTR(err);
+}
+
+/**
+ * homa_message_out_fill() - Initializes information for sending a message
+ * for an RPC (either request or response); copies the message data from
+ * user space and (possibly) begins transmitting the message.
+ * @rpc:     RPC for which to send message; this function must not
+ *           previously have been called for the RPC. Must be locked. The RPC
+ *           will be unlocked while copying data, but will be locked again
+ *           before returning.
+ * @iter:    Describes location(s) of message data in user space.
+ * @xmit:    Nonzero means this method should start transmitting packets;
+ *           transmission will be overlapped with copying from user space.
+ *           Zero means the caller will initiate transmission after this
+ *           function returns.
+ *
+ * Return:   0 for success, or a negative errno for failure. It is possible
+ *           for the RPC to be freed while this function is active. If that
+ *           happens, copying will cease, -EINVAL will be returned, and
+ *           rpc->state will be RPC_DEAD.
+ */
+int homa_message_out_fill(struct homa_rpc *rpc, struct iov_iter *iter, int xmit)
+	__must_hold(rpc->bucket->lock)
+{
+	/* Geometry information for packets:
+	 * mtu:              largest size for an on-the-wire packet (including
+	 *                   all headers through IP header, but not Ethernet
+	 *                   header).
+	 * max_seg_data:     largest amount of Homa message data that fits
+	 *                   in an on-the-wire packet (after segmentation).
+	 * max_gso_data:     largest amount of Homa message data that fits
+	 *                   in a GSO packet (before segmentation).
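+	 *
+	 * Worked example (all numbers hypothetical): with mtu = 1500, a
+	 * 40-byte IPv6 header and a 60-byte homa_data_hdr, max_seg_data
+	 * would be 1400; with gso_size = 9000 and a 4-byte homa_seg_hdr,
+	 * segs_per_gso = (9000 - 40 - 60 + 4) / (1400 + 4) = 6, so
+	 * max_gso_data would be 8400.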
+	 */
+	int mtu, max_seg_data, max_gso_data;
+
+	struct sk_buff **last_link;
+	struct dst_entry *dst;
+	u64 segs_per_gso;
+	int overlap_xmit;
+
+	/* Bytes of the message that haven't yet been copied into skbs. */
+	int bytes_left;
+
+	int gso_size;
+	int err;
+
+	if (unlikely(iter->count > HOMA_MAX_MESSAGE_LENGTH ||
+		     iter->count == 0)) {
+		err = -EINVAL;
+		goto error;
+	}
+	homa_message_out_init(rpc, iter->count);
+
+	/* Compute the geometry of packets. */
+	dst = homa_get_dst(rpc->peer, rpc->hsk);
+	mtu = dst_mtu(dst);
+	max_seg_data = mtu - rpc->hsk->ip_header_length
+			- sizeof(struct homa_data_hdr);
+	gso_size = dst->dev->gso_max_size;
+	if (gso_size > rpc->hsk->homa->max_gso_size)
+		gso_size = rpc->hsk->homa->max_gso_size;
+	dst_release(dst);
+
+	/* Round gso_size down to a whole number of full segments. */
+	segs_per_gso = gso_size - rpc->hsk->ip_header_length -
+			sizeof(struct homa_data_hdr) +
+			sizeof(struct homa_seg_hdr);
+	do_div(segs_per_gso, max_seg_data +
+			sizeof(struct homa_seg_hdr));
+	if (segs_per_gso == 0)
+		segs_per_gso = 1;
+	max_gso_data = segs_per_gso * max_seg_data;
+
+	overlap_xmit = rpc->msgout.length > 2 * max_gso_data;
+	homa_skb_stash_pages(rpc->hsk->homa, rpc->msgout.length);
+
+	/* Each iteration of the loop below creates one GSO packet. */
+	last_link = &rpc->msgout.packets;
+	for (bytes_left = rpc->msgout.length; bytes_left > 0; ) {
+		int skb_data_bytes, offset;
+		struct sk_buff *skb;
+
+		homa_rpc_unlock(rpc);
+		skb_data_bytes = max_gso_data;
+		offset = rpc->msgout.length - bytes_left;
+		if (skb_data_bytes > bytes_left)
+			skb_data_bytes = bytes_left;
+		skb = homa_tx_data_pkt_alloc(rpc, iter, offset, skb_data_bytes,
+					     max_seg_data);
+		if (IS_ERR(skb)) {
+			err = PTR_ERR(skb);
+			homa_rpc_lock(rpc);
+			goto error;
+		}
+		bytes_left -= skb_data_bytes;
+
+		homa_rpc_lock(rpc);
+		if (rpc->state == RPC_DEAD) {
+			/* RPC was freed while we were copying. */
+			err = -EINVAL;
+			homa_skb_free_tx(rpc->hsk->homa, skb);
+			goto error;
+		}
+		*last_link = skb;
+		last_link = &(homa_get_skb_info(skb)->next_skb);
+		*last_link = NULL;
+		rpc->msgout.num_skbs++;
+		rpc->msgout.skb_memory += skb->truesize;
+		rpc->msgout.copied_from_user = rpc->msgout.length - bytes_left;
+		rpc->msgout.first_not_tx = rpc->msgout.packets;
+		if (overlap_xmit && list_empty(&rpc->throttled_links) &&
+		    xmit)
+			homa_pacer_manage_rpc(rpc);
+	}
+	refcount_add(rpc->msgout.skb_memory, &rpc->hsk->sock.sk_wmem_alloc);
+	if (!overlap_xmit && xmit)
+		homa_xmit_data(rpc, false);
+	return 0;
+
+error:
+	refcount_add(rpc->msgout.skb_memory, &rpc->hsk->sock.sk_wmem_alloc);
+	return err;
+}
+
+/**
+ * homa_xmit_control() - Send a control packet to the other end of an RPC.
+ * @type:      Packet type, such as DATA.
+ * @contents:  Address of buffer containing the contents of the packet.
+ *             Only information after the common header must be valid;
+ *             the common header will be filled in by this function.
+ * @length:    Length of @contents (including the common header).
+ * @rpc:       The packet will go to the socket that handles the other end
+ *             of this RPC. Addressing info for the packet, including all of
+ *             the fields of homa_common_hdr except type, will be set from this.
+ *             Caller must hold either the lock or a reference.
+ *
+ * Return:     Either zero (for success), or a negative errno value if there
+ *             was a problem.
+ */
+int homa_xmit_control(enum homa_packet_type type, void *contents,
+		      size_t length, struct homa_rpc *rpc)
+{
+	struct homa_common_hdr *h = contents;
+
+	h->type = type;
+	h->sport = htons(rpc->hsk->port);
+	h->dport = htons(rpc->dport);
+	h->sender_id = cpu_to_be64(rpc->id);
+	return __homa_xmit_control(contents, length, rpc->peer, rpc->hsk);
+}
+
+/**
+ * __homa_xmit_control() - Lower-level version of homa_xmit_control: sends
+ * a control packet.
+ * @contents:  Address of buffer containing the contents of the packet.
+ *             The caller must have filled in all of the information,
+ *             including the common header.
+ * @length:    Length of @contents.
+ * @peer:      Destination to which the packet will be sent.
+ * @hsk:       Socket via which the packet will be sent.
+ *
+ * Return:     Either zero (for success), or a negative errno value if there
+ *             was a problem.
+ */
+int __homa_xmit_control(void *contents, size_t length, struct homa_peer *peer,
+			struct homa_sock *hsk)
+{
+	struct homa_common_hdr *h;
+	struct sk_buff *skb;
+	int extra_bytes;
+	int result;
+
+	skb = homa_skb_alloc_tx(HOMA_MAX_HEADER);
+	if (unlikely(!skb))
+		return -ENOBUFS;
+	skb_dst_set(skb, homa_get_dst(peer, hsk));
+
+	h = skb_put(skb, length);
+	memcpy(h, contents, length);
+	extra_bytes = HOMA_MIN_PKT_LENGTH - length;
+	if (extra_bytes > 0)
+		memset(skb_put(skb, extra_bytes), 0, extra_bytes);
+	skb->ooo_okay = 1;
+	skb_get(skb);
+	if (hsk->inet.sk.sk_family == AF_INET6)
+		result = ip6_xmit(&hsk->inet.sk, skb, &peer->flow.u.ip6, 0,
+				  NULL, 0, 0);
+	else
+		result = ip_queue_xmit(&hsk->inet.sk, skb, &peer->flow);
+	if (unlikely(result != 0)) {
+		/* It appears that ip*_xmit frees skbuffs after
+		 * errors; the following code is to raise an alert if
+		 * this isn't actually the case. The extra skb_get above
+		 * and kfree_skb call below are needed to do the check
+		 * accurately (otherwise the buffer could be freed and
+		 * its memory used for some other purpose, resulting in
+		 * a bogus "reference count").
+		 */
+		if (refcount_read(&skb->users) > 1) {
+			if (hsk->inet.sk.sk_family == AF_INET6)
+				pr_notice("ip6_xmit didn't free Homa control packet (type %d) after error %d\n",
+					  h->type, result);
+			else
+				pr_notice("ip_queue_xmit didn't free Homa control packet (type %d) after error %d\n",
+					  h->type, result);
+		}
+	}
+	kfree_skb(skb);
+	return result;
+}
+
+/**
+ * homa_xmit_unknown() - Send an RPC_UNKNOWN packet to a peer.
+ * @skb:         Buffer containing an incoming packet; identifies the peer to
+ *               which the RPC_UNKNOWN packet should be sent.
+ * @hsk:         Socket that should be used to send the RPC_UNKNOWN packet.
+ */
+void homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk)
+{
+	struct homa_common_hdr *h = (struct homa_common_hdr *)skb->data;
+	struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+	struct homa_rpc_unknown_hdr unknown;
+	struct homa_peer *peer;
+
+	unknown.common.sport = h->dport;
+	unknown.common.dport = h->sport;
+	unknown.common.type = RPC_UNKNOWN;
+	unknown.common.sender_id = cpu_to_be64(homa_local_id(h->sender_id));
+	peer = homa_peer_get(hsk, &saddr);
+	if (!IS_ERR(peer)) {
+		__homa_xmit_control(&unknown, sizeof(unknown), peer, hsk);
+		homa_peer_release(peer);
+	}
+}
+
+/**
+ * homa_xmit_data() - If an RPC has outbound data packets that are permitted
+ * to be transmitted according to the scheduling mechanism, arrange for
+ * them to be sent (some may be sent immediately; others may be sent
+ * later by the pacer thread).
+ * @rpc:       RPC to check for transmittable packets. Must be locked by
+ *             caller. Note: this function will release the RPC lock while
+ *             passing packets through the RPC stack, then reacquire it
+ *             before returning. It is possible that the RPC gets freed
+ *             when the lock isn't held, in which case the state will
+ *             be RPC_DEAD on return.
+ * @force:     True means send at least one packet, even if the NIC queue
+ *             is too long. False means that zero packets may be sent, if
+ *             the NIC queue is sufficiently long.
+ */
+void homa_xmit_data(struct homa_rpc *rpc, bool force)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa *homa = rpc->hsk->homa;
+	int length;
+
+	while (*rpc->msgout.next_xmit && rpc->state != RPC_DEAD) {
+		struct sk_buff *skb = *rpc->msgout.next_xmit;
+
+		if (rpc->msgout.length - rpc->msgout.next_xmit_offset >
+		    homa->pacer->throttle_min_bytes) {
+			if (!homa_pacer_check_nic_q(homa->pacer, skb, force)) {
+				homa_pacer_manage_rpc(rpc);
+				break;
+			}
+		}
+
+		rpc->msgout.next_xmit = &(homa_get_skb_info(skb)->next_skb);
+		length = homa_get_skb_info(skb)->data_bytes;
+		rpc->msgout.next_xmit_offset += length;
+
+		homa_rpc_unlock(rpc);
+		skb_get(skb);
+		__homa_xmit_data(skb, rpc);
+		force = false;
+		homa_rpc_lock(rpc);
+	}
+}
+
+/**
+ * __homa_xmit_data() - Handles packet transmission stuff that is common
+ * to homa_xmit_data and homa_resend_data.
+ * @skb:      Packet to be sent. The packet will be freed after transmission
+ *            (and also if errors prevented transmission).
+ * @rpc:      Information about the RPC that the packet belongs to.
+ */
+void __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc)
+{
+	skb_dst_set(skb, homa_get_dst(rpc->peer, rpc->hsk));
+
+	skb->ooo_okay = 1;
+	skb->ip_summed = CHECKSUM_PARTIAL;
+	skb->csum_start = skb_transport_header(skb) - skb->head;
+	skb->csum_offset = offsetof(struct homa_common_hdr, checksum);
+	if (rpc->hsk->inet.sk.sk_family == AF_INET6)
+		ip6_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow.u.ip6,
+			 0, NULL, 0, 0);
+	else
+		ip_queue_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow);
+}
+
+/**
+ * homa_resend_data() - This function is invoked as part of handling RESEND
+ * requests. It retransmits the packet(s) containing a given range of bytes
+ * from a message.
+ * @rpc:      RPC for which data should be resent.
+ * @start:    Offset within @rpc->msgout of the first byte to retransmit.
+ * @end:      Offset within @rpc->msgout of the byte just after the last one
+ *            to retransmit.
+ */
+void homa_resend_data(struct homa_rpc *rpc, int start, int end)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_skb_info *homa_info;
+	struct sk_buff *skb;
+
+	if (end <= start)
+		return;
+
+	/* Each iteration of this loop checks one packet in the message
+	 * to see if it contains segments that need to be retransmitted.
+	 */
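+	/* Example (illustrative numbers): if a GSO skb carries bytes
+	 * [10000,20000) as 1000-byte segments, a request to resend
+	 * [12000,13500) retransmits the two whole segments that overlap
+	 * that range: [12000,13000) and [13000,14000).
+	 */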
+	for (skb = rpc->msgout.packets; skb; skb = homa_info->next_skb) {
+		int seg_offset, offset, seg_length, data_left;
+		struct homa_data_hdr *h;
+
+		homa_info = homa_get_skb_info(skb);
+		offset = homa_info->offset;
+		if (offset >= end)
+			break;
+		if (start >= (offset + homa_info->data_bytes))
+			continue;
+
+		seg_offset = sizeof(struct homa_data_hdr);
+		data_left = homa_info->data_bytes;
+		if (skb_shinfo(skb)->gso_segs <= 1) {
+			seg_length = data_left;
+		} else {
+			seg_length = homa_info->seg_length;
+			h = (struct homa_data_hdr *)skb_transport_header(skb);
+		}
+		for ( ; data_left > 0; data_left -= seg_length,
+		     offset += seg_length,
+		     seg_offset += skb_shinfo(skb)->gso_size) {
+			struct homa_skb_info *new_homa_info;
+			struct sk_buff *new_skb;
+			int err;
+
+			if (seg_length > data_left)
+				seg_length = data_left;
+
+			if (end <= offset)
+				goto resend_done;
+			if ((offset + seg_length) <= start)
+				continue;
+
+			/* This segment must be retransmitted. */
+			new_skb = homa_skb_alloc_tx(sizeof(struct homa_data_hdr)
+					+ seg_length);
+			if (unlikely(!new_skb))
+				goto resend_done;
+			h = __skb_put_data(new_skb, skb_transport_header(skb),
+					   sizeof(struct homa_data_hdr));
+			h->common.sequence = htonl(offset);
+			h->seg.offset = htonl(offset);
+			h->retransmit = 1;
+			err = homa_skb_append_from_skb(rpc->hsk->homa, new_skb,
+						       skb, seg_offset,
+						       seg_length);
+			if (err != 0) {
+				pr_err("%s got error %d from homa_skb_append_from_skb\n",
+				       __func__, err);
+				kfree_skb(new_skb);
+				goto resend_done;
+			}
+
+			new_homa_info = homa_get_skb_info(new_skb);
+			new_homa_info->next_skb = NULL;
+			new_homa_info->wire_bytes = rpc->hsk->ip_header_length
+					+ sizeof(struct homa_data_hdr)
+					+ seg_length + HOMA_ETH_OVERHEAD;
+			new_homa_info->data_bytes = seg_length;
+			new_homa_info->seg_length = seg_length;
+			new_homa_info->offset = offset;
+			homa_pacer_check_nic_q(rpc->hsk->homa->pacer, new_skb,
+					       true);
+			__homa_xmit_data(new_skb, rpc);
+		}
+	}
+
+resend_done:
+	return;
+}
+
+/**
+ * homa_rpc_tx_end() - Return the offset of the first byte in an
+ * RPC's outgoing message that has not yet been fully transmitted.
+ * "Fully transmitted" means the message has been transmitted by the
+ * NIC and the skb has been released by the driver. This is different from
+ * rpc->msgout.next_xmit_offset, which is the first offset that
+ * hasn't yet been passed to the IP stack.
+ * @rpc:    RPC to check
+ * Return:  See above. If the message has been fully transmitted then
+ *          rpc->msgout.length is returned.
+ */
+int homa_rpc_tx_end(struct homa_rpc *rpc)
+{
+	struct sk_buff *skb = rpc->msgout.first_not_tx;
+
+	while (skb) {
+		struct homa_skb_info *homa_info = homa_get_skb_info(skb);
+
+		/* next_xmit_offset tells us whether the packet has been
+		 * passed to the IP stack. Checking the reference count tells
+		 * us whether the packet has been released by the driver
+		 * (which only happens after notification from the NIC that
+		 * transmission is complete).
+		 */
+		if (homa_info->offset >= rpc->msgout.next_xmit_offset ||
+		    refcount_read(&skb->users) > 1)
+			return homa_info->offset;
+		skb = homa_info->next_skb;
+		rpc->msgout.first_not_tx = skb;
+	}
+	return rpc->msgout.length;
+}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 11/15] net: homa: create homa_utils.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (9 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 10/15] net: homa: create homa_outgoing.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-26 11:52   ` Paolo Abeni
  2025-08-18 20:55 ` [PATCH net-next v15 12/15] net: homa: create homa_incoming.c John Ousterhout
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

This file contains functions for constructing and destructing
homa structs.
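
For orientation, here is a rough sketch (not part of this patch;
example_start, example_stop and example_homa are hypothetical names)
of how these constructors and destructors pair up, mirroring the
pernet hooks added later in homa_plumbing.c:

  static struct homa example_homa;

  int example_start(struct net *net, struct homa_net *hnet)
  {
          int err = homa_init(&example_homa);

          if (err) {
                  /* Safe even after a failed homa_init. */
                  homa_destroy(&example_homa);
                  return err;
          }
          return homa_net_init(hnet, net, &example_homa);
  }

  void example_stop(struct homa_net *hnet)
  {
          homa_net_destroy(hnet);      /* per-net state first */
          homa_destroy(&example_homa); /* then the global state */
  }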

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v11:
* Move link_mbps variable from struct homa_pacer back to struct homa.

Changes for v10:
* Remove log messages after alloc errors

Changes for v9:
* Add support for homa_net objects
* Use new homa_clock abstraction layer
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory)

Changes for v8:
* Accommodate homa_pacer refactoring

Changes for v7:
* Make Homa a pernet subsystem
* Add support for tx memory accounting
* Remove "lock_slow" functions, which don't add functionality in this
  patch series
* Use u64 and __u64 properly
---
 net/homa/homa_impl.h  |   6 +++
 net/homa/homa_utils.c | 122 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 128 insertions(+)
 create mode 100644 net/homa/homa_utils.c

diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
index c6c3cae9ac52..49ca4abfb50b 100644
--- a/net/homa/homa_impl.h
+++ b/net/homa/homa_impl.h
@@ -421,13 +421,19 @@ static inline bool homa_make_header_avl(struct sk_buff *skb)
 
 extern unsigned int homa_net_id;
 
+void     homa_destroy(struct homa *homa);
 int      homa_fill_data_interleaved(struct homa_rpc *rpc,
 				    struct sk_buff *skb, struct iov_iter *iter);
+int      homa_init(struct homa *homa);
 int      homa_message_out_fill(struct homa_rpc *rpc,
 			       struct iov_iter *iter, int xmit);
 void     homa_message_out_init(struct homa_rpc *rpc, int length);
+void     homa_net_destroy(struct homa_net *hnet);
+int      homa_net_init(struct homa_net *hnet, struct net *net,
+		       struct homa *homa);
 void     homa_rpc_handoff(struct homa_rpc *rpc);
 int      homa_rpc_tx_end(struct homa_rpc *rpc);
+void     homa_spin(int ns);
 struct sk_buff *homa_tx_data_pkt_alloc(struct homa_rpc *rpc,
 				       struct iov_iter *iter, int offset,
 				       int length, int max_seg_data);
diff --git a/net/homa/homa_utils.c b/net/homa/homa_utils.c
new file mode 100644
index 000000000000..abab0277ea0e
--- /dev/null
+++ b/net/homa/homa_utils.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+
+
+/* This file contains miscellaneous utility functions for Homa, such
+ * as initializing and destroying homa structs.
+ */
+
+#include "homa_impl.h"
+#include "homa_pacer.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_stub.h"
+
+/**
+ * homa_init() - Constructor for homa objects.
+ * @homa:   Object to initialize.
+ *
+ * Return:  0 on success, or a negative errno if there was an error. Even
+ *          if an error occurs, it is safe (and necessary) to call
+ *          homa_destroy at some point.
+ */
+int homa_init(struct homa *homa)
+{
+	int err;
+
+	memset(homa, 0, sizeof(*homa));
+
+	atomic64_set(&homa->next_outgoing_id, 2);
+	homa->link_mbps = 25000;
+	homa->pacer = homa_pacer_alloc(homa);
+	if (IS_ERR(homa->pacer)) {
+		err = PTR_ERR(homa->pacer);
+		homa->pacer = NULL;
+		return err;
+	}
+	homa->peertab = homa_peer_alloc_peertab();
+	if (IS_ERR(homa->peertab)) {
+		err = PTR_ERR(homa->peertab);
+		homa->peertab = NULL;
+		return err;
+	}
+	homa->socktab = kmalloc(sizeof(*homa->socktab), GFP_KERNEL);
+	if (!homa->socktab)
+		return -ENOMEM;
+	homa_socktab_init(homa->socktab);
+
+	/* Wild guesses to initialize configuration values... */
+	homa->resend_ticks = 5;
+	homa->resend_interval = 5;
+	homa->timeout_ticks = 100;
+	homa->timeout_resends = 5;
+	homa->request_ack_ticks = 2;
+	homa->reap_limit = 10;
+	homa->dead_buffs_limit = 5000;
+	homa->max_gso_size = 10000;
+	homa->wmem_max = 100000000;
+	homa->bpage_lease_usecs = 10000;
+	return 0;
+}
+
+/**
+ * homa_destroy() -  Destructor for homa objects.
+ * @homa:      Object to destroy. It is safe if this object has already
+ *             been previously destroyed.
+ */
+void homa_destroy(struct homa *homa)
+{
+	/* The order of the following cleanups matters! */
+	if (homa->socktab) {
+		homa_socktab_destroy(homa->socktab, NULL);
+		kfree(homa->socktab);
+		homa->socktab = NULL;
+	}
+	if (homa->pacer) {
+		homa_pacer_free(homa->pacer);
+		homa->pacer = NULL;
+	}
+	if (homa->peertab) {
+		homa_peer_free_peertab(homa->peertab);
+		homa->peertab = NULL;
+	}
+}
+
+/**
+ * homa_net_init() - Initialize a new struct homa_net as a per-net subsystem.
+ * @hnet:    Struct to initialize.
+ * @net:     The network namespace the struct will be associated with.
+ * @homa:    The main Homa data structure to use for the net.
+ * Return:  0 on success, otherwise a negative errno.
+ */
+int homa_net_init(struct homa_net *hnet, struct net *net, struct homa *homa)
+{
+	memset(hnet, 0, sizeof(*hnet));
+	hnet->net = net;
+	hnet->homa = homa;
+	hnet->prev_default_port = HOMA_MIN_DEFAULT_PORT - 1;
+	return 0;
+}
+
+/**
+ * homa_net_destroy() - Release any resources associated with a homa_net.
+ * @hnet:    Object to destroy; must not be used again after this function
+ *           returns.
+ */
+void homa_net_destroy(struct homa_net *hnet)
+{
+	homa_socktab_destroy(hnet->homa->socktab, hnet);
+	homa_peer_free_net(hnet);
+}
+
+/**
+ * homa_spin() - Delay (without sleeping) for a given time interval.
+ * @ns:   How long to delay (in nanoseconds)
+ */
+void homa_spin(int ns)
+{
+	u64 end;
+
+	end = homa_clock() + homa_ns_to_cycles(ns);
+	while (homa_clock() < end)
+		/* Empty loop body. */
+		;
+}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 12/15] net: homa: create homa_incoming.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (10 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 11/15] net: homa: create homa_utils.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-26 12:05   ` Paolo Abeni
  2025-09-02  7:19   ` Eric Dumazet
  2025-08-18 20:55 ` [PATCH net-next v15 13/15] net: homa: create homa_timer.c John Ousterhout
                   ` (3 subsequent siblings)
  15 siblings, 2 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

This file contains most of the code for handling incoming packets,
including top-level dispatching code plus specific handlers for each
packet type. It also contains code for dispatching fully-received
messages to waiting application threads.
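
In rough outline (a simplification of homa_dispatch_pkts in this
patch; the real loop also handles RPC lookup, locking, reference
counts and ack batching), each packet is routed by its type:

  switch (h->common.type) {
  case DATA:        homa_data_pkt(skb, rpc); break;
  case RESEND:      homa_resend_pkt(skb, rpc, hsk); break;
  case RPC_UNKNOWN: homa_rpc_unknown_pkt(skb, rpc); break;
  case BUSY:        kfree_skb(skb); break; /* just resets timeouts */
  case NEED_ACK:    homa_need_ack_pkt(skb, hsk, rpc); break;
  case ACK:         homa_ack_pkt(skb, hsk, rpc); break;
  default:          kfree_skb(skb); break;
  }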

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v14:
* Use new homa_rpc_tx_end function
* Fix race in homa_wait_shared (an RPC could get lost if it became
  ready at the same time that homa_interest_wait returned with an error)
* Handle nonblocking behavior here, rather than in homa_interest.c
* Change API for homa_wait_private to distinguish errors in an RPC from
  errors that prevented the wait operation from completing.

Changes for v11:
* Cleanup and simplify use of RPC reference counts.
* Cleanup sparse annotations.
* Rework the mechanism for waking up RPCs that stalled waiting for
  buffer pool space.

Changes for v10:
* Revise sparse annotations to eliminate __context__ definition
* Refactor resend mechanism (new function homa_request_retrans replaces
  homa_gap_retry)
* Remove log messages after alloc errors
* Fix socket cleanup race

Changes for v9:
* Add support for homa_net objects
* Use new homa_clock abstraction layer
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory)

Changes for v7:
* API change for homa_rpc_handoff
* Refactor waiting mechanism for incoming packets: simplify wait
  criteria and use standard Linux mechanisms for waiting, use
  new homa_interest struct
* Reject unauthorized incoming request messages
* Improve documentation for code that spins (and reduce spin length)
* Use RPC reference counts, eliminate RPC_HANDING_OFF flag
* Replace erroneous use of "safe" list iteration with "rcu" version
* Remove locker argument from locking functions
* Check incoming messages against HOMA_MAX_MESSAGE_LENGTH
* Use u64 and __u64 properly
---
 net/homa/homa_impl.h     |  15 +
 net/homa/homa_incoming.c | 886 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 901 insertions(+)
 create mode 100644 net/homa/homa_incoming.c

diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
index 49ca4abfb50b..3d91b7f44de9 100644
--- a/net/homa/homa_impl.h
+++ b/net/homa/homa_impl.h
@@ -421,22 +421,37 @@ static inline bool homa_make_header_avl(struct sk_buff *skb)
 
 extern unsigned int homa_net_id;
 
+void     homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+		      struct homa_rpc *rpc);
+void     homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb);
+int      homa_copy_to_user(struct homa_rpc *rpc);
+void     homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
 void     homa_destroy(struct homa *homa);
+void     homa_dispatch_pkts(struct sk_buff *skb);
 int      homa_fill_data_interleaved(struct homa_rpc *rpc,
 				    struct sk_buff *skb, struct iov_iter *iter);
+struct homa_gap *homa_gap_alloc(struct list_head *next, int start, int end);
 int      homa_init(struct homa *homa);
 int      homa_message_out_fill(struct homa_rpc *rpc,
 			       struct iov_iter *iter, int xmit);
 void     homa_message_out_init(struct homa_rpc *rpc, int length);
+void     homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+			   struct homa_rpc *rpc);
 void     homa_net_destroy(struct homa_net *hnet);
 int      homa_net_init(struct homa_net *hnet, struct net *net,
 		       struct homa *homa);
+void     homa_request_retrans(struct homa_rpc *rpc);
+void     homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
+			 struct homa_sock *hsk);
 void     homa_rpc_handoff(struct homa_rpc *rpc);
 int      homa_rpc_tx_end(struct homa_rpc *rpc);
 void     homa_spin(int ns);
 struct sk_buff *homa_tx_data_pkt_alloc(struct homa_rpc *rpc,
 				       struct iov_iter *iter, int offset,
 				       int length, int max_seg_data);
+void     homa_rpc_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
+int      homa_wait_private(struct homa_rpc *rpc, int nonblocking);
+struct homa_rpc *homa_wait_shared(struct homa_sock *hsk, int nonblocking);
 int      homa_xmit_control(enum homa_packet_type type, void *contents,
 			   size_t length, struct homa_rpc *rpc);
 int      __homa_xmit_control(void *contents, size_t length,
diff --git a/net/homa/homa_incoming.c b/net/homa/homa_incoming.c
new file mode 100644
index 000000000000..c485dd98cba9
--- /dev/null
+++ b/net/homa/homa_incoming.c
@@ -0,0 +1,886 @@
+// SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+
+
+/* This file contains functions that handle incoming Homa messages. */
+
+#include "homa_impl.h"
+#include "homa_interest.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/**
+ * homa_message_in_init() - Constructor for homa_message_in.
+ * @rpc:          RPC whose msgin structure should be initialized. The
+ *                msgin struct is assumed to be zeroed.
+ * @length:       Total number of bytes in message.
+ * Return:        Zero for successful initialization, or a negative errno
+ *                if rpc->msgin could not be initialized.
+ */
+int homa_message_in_init(struct homa_rpc *rpc, int length)
+	__must_hold(rpc->bucket->lock)
+{
+	int err;
+
+	if (length > HOMA_MAX_MESSAGE_LENGTH)
+		return -EINVAL;
+
+	rpc->msgin.length = length;
+	skb_queue_head_init(&rpc->msgin.packets);
+	INIT_LIST_HEAD(&rpc->msgin.gaps);
+	rpc->msgin.bytes_remaining = length;
+	err = homa_pool_alloc_msg(rpc);
+	if (err != 0) {
+		rpc->msgin.length = -1;
+		return err;
+	}
+	return 0;
+}
+
+/**
+ * homa_gap_alloc() - Allocate a new gap and add it to a gap list.
+ * @next:   Add the new gap just before this list element.
+ * @start:  Offset of first byte covered by the gap.
+ * @end:    Offset of byte just after the last one covered by the gap.
+ * Return:  Pointer to the new gap, or NULL if memory couldn't be allocated
+ *          for the gap object.
+ */
+struct homa_gap *homa_gap_alloc(struct list_head *next, int start, int end)
+{
+	struct homa_gap *gap;
+
+	gap = kmalloc(sizeof(*gap), GFP_ATOMIC);
+	if (!gap)
+		return NULL;
+	gap->start = start;
+	gap->end = end;
+	gap->time = homa_clock();
+	list_add_tail(&gap->links, next);
+	return gap;
+}
+
+/**
+ * homa_request_retrans() - This function is invoked when it appears that
+ * data packets for a message have been lost. It issues RESEND requests
+ * as appropriate and may modify the state of the RPC.
+ * @rpc:     RPC for which incoming data is delinquent; must be locked by
+ *           caller.
+ */
+void homa_request_retrans(struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_resend_hdr resend;
+	struct homa_gap *gap;
+	int offset, length;
+
+	if (rpc->msgin.length >= 0) {
+		/* Issue RESENDs for any gaps in incoming data. */
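+		/* Example (illustrative numbers): with a single gap
+		 * [2000,3000), recv_end = 5000 and length = 8000, the
+		 * loop below requests [2000,3000) and the code after it
+		 * requests [5000,8000).
+		 */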
+		list_for_each_entry(gap, &rpc->msgin.gaps, links) {
+			resend.offset = htonl(gap->start);
+			resend.length = htonl(gap->end - gap->start);
+			homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
+		}
+
+		/* Issue a RESEND for any missing data after the last gap. */
+		offset = rpc->msgin.recv_end;
+		length = rpc->msgin.length - rpc->msgin.recv_end;
+		if (length <= 0)
+			return;
+	} else {
+		/* No data has been received for the RPC. Ask the sender to
+		 * resend everything it has sent so far.
+		 */
+		offset = 0;
+		length = -1;
+	}
+
+	resend.offset = htonl(offset);
+	resend.length = htonl(length);
+	homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
+}
+
+/**
+ * homa_add_packet() - Add an incoming packet to the contents of a
+ * partially received message.
+ * @rpc:   Add the packet to the msgin for this RPC.
+ * @skb:   The new packet. This function takes ownership of the packet
+ *         (the packet will either be freed or added to rpc->msgin.packets).
+ */
+void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
+	struct homa_gap *gap, *dummy, *gap2;
+	int start = ntohl(h->seg.offset);
+	int length = homa_data_len(skb);
+	int end = start + length;
+
+	if ((start + length) > rpc->msgin.length)
+		goto discard;
+
+	if (start == rpc->msgin.recv_end) {
+		/* Common case: packet is sequential. */
+		rpc->msgin.recv_end += length;
+		goto keep;
+	}
+
+	if (start > rpc->msgin.recv_end) {
+		/* Packet creates a new gap. */
+		if (!homa_gap_alloc(&rpc->msgin.gaps,
+				    rpc->msgin.recv_end, start))
+			goto discard;
+		rpc->msgin.recv_end = end;
+		goto keep;
+	}
+
+	/* Must now check to see if the packet fills in part or all of
+	 * an existing gap.
+	 */
+	list_for_each_entry_safe(gap, dummy, &rpc->msgin.gaps, links) {
+		/* Is packet at the start of this gap? */
+		if (start <= gap->start) {
+			if (end <= gap->start)
+				continue;
+			if (start < gap->start)
+				goto discard;
+			if (end > gap->end)
+				goto discard;
+			gap->start = end;
+			if (gap->start >= gap->end) {
+				list_del(&gap->links);
+				kfree(gap);
+			}
+			goto keep;
+		}
+
+		/* Is packet at the end of this gap? Note that at this point
+		 * we know the packet can't cover the entire gap.
+		 */
+		if (end >= gap->end) {
+			if (start >= gap->end)
+				continue;
+			if (end > gap->end)
+				goto discard;
+			gap->end = start;
+			goto keep;
+		}
+
+		/* Packet is in the middle of the gap; must split the gap. */
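+		/* Example (illustrative offsets): a gap [0,3000) and a
+		 * packet covering [1000,2000) leave gap2 = [0,1000) and
+		 * gap = [2000,3000).
+		 */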
+		gap2 = homa_gap_alloc(&gap->links, gap->start, start);
+		if (!gap2)
+			goto discard;
+		gap2->time = gap->time;
+		gap->start = end;
+		goto keep;
+	}
+
+discard:
+	kfree_skb(skb);
+	return;
+
+keep:
+	__skb_queue_tail(&rpc->msgin.packets, skb);
+	rpc->msgin.bytes_remaining -= length;
+}
+
+/**
+ * homa_copy_to_user() - Copy as much data as possible from incoming
+ * packet buffers to buffers in user space.
+ * @rpc:     RPC for which data should be copied. Must be locked by caller.
+ * Return:   Zero for success or a negative errno if there is an error.
+ *           It is possible for the RPC to be freed while this function
+ *           executes (it releases and reacquires the RPC lock). If that
+ *           happens, -EINVAL will be returned and the state of @rpc
+ *           will be RPC_DEAD. Clears the RPC_PKTS_READY bit in @rpc->flags
+ *           if all available packets have been copied out.
+ */
+int homa_copy_to_user(struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+#define MAX_SKBS 20
+	struct sk_buff *skbs[MAX_SKBS];
+	int error = 0;
+	int n = 0;             /* Number of filled entries in skbs. */
+	int i;
+
+	/* Tricky note: we can't hold the RPC lock while we're actually
+	 * copying to user space, because (a) it's illegal to hold a spinlock
+	 * while copying to user space and (b) we'd like for homa_softirq
+	 * to add more packets to the RPC while we're copying these out.
+	 * So, collect a bunch of packets to copy, then release the lock,
+	 * copy them, and reacquire the lock.
+	 */
+	while (true) {
+		struct sk_buff *skb;
+
+		if (rpc->state == RPC_DEAD) {
+			error = -EINVAL;
+			break;
+		}
+
+		skb = __skb_dequeue(&rpc->msgin.packets);
+		if (skb) {
+			skbs[n] = skb;
+			n++;
+			if (n < MAX_SKBS)
+				continue;
+		}
+		if (n == 0) {
+			atomic_andnot(RPC_PKTS_READY, &rpc->flags);
+			break;
+		}
+
+		/* At this point we've collected a batch of packets (or
+		 * run out of packets); copy any available packets out to
+		 * user space.
+		 */
+		homa_rpc_unlock(rpc);
+
+		/* Each iteration of this loop copies out one skb. */
+		for (i = 0; i < n; i++) {
+			struct homa_data_hdr *h = (struct homa_data_hdr *)
+					skbs[i]->data;
+			int pkt_length = homa_data_len(skbs[i]);
+			int offset = ntohl(h->seg.offset);
+			int buf_bytes, chunk_size;
+			struct iov_iter iter;
+			int copied = 0;
+			char __user *dst;
+
+			/* Each iteration of this loop copies to one
+			 * user buffer.
+			 */
+			while (copied < pkt_length) {
+				chunk_size = pkt_length - copied;
+				dst = homa_pool_get_buffer(rpc, offset + copied,
+							   &buf_bytes);
+				if (buf_bytes < chunk_size) {
+					if (buf_bytes == 0) {
+						/* skb has data beyond message
+						 * end?
+						 */
+						break;
+					}
+					chunk_size = buf_bytes;
+				}
+				error = import_ubuf(READ, dst, chunk_size,
+						    &iter);
+				if (error)
+					goto free_skbs;
+				error = skb_copy_datagram_iter(skbs[i],
+							       sizeof(*h) +
+							       copied,  &iter,
+							       chunk_size);
+				if (error)
+					goto free_skbs;
+				copied += chunk_size;
+			}
+		}
+
+free_skbs:
+		for (i = 0; i < n; i++)
+			kfree_skb(skbs[i]);
+		n = 0;
+		atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+		homa_rpc_lock(rpc);
+		atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
+		if (error)
+			break;
+	}
+	return error;
+}
+
+/**
+ * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
+ * all related to the same RPC.
+ * @skb:       First packet in the batch, linked through skb->next.
+ */
+void homa_dispatch_pkts(struct sk_buff *skb)
+{
+#define MAX_ACKS 10
+	const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+	struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
+	u64 id = homa_local_id(h->common.sender_id);
+	int dport = ntohs(h->common.dport);
+
+	/* Used to collect acks from data packets so we can process them
+	 * all at the end (can't process them inline because that may
+	 * require locking conflicting RPCs). If we run out of space just
+	 * ignore the extra acks; they'll be regenerated later through
+	 * the explicit NEED_ACK mechanism.
+	 */
+	struct homa_ack acks[MAX_ACKS];
+	struct homa_rpc *rpc = NULL;
+	struct homa_sock *hsk;
+	struct homa_net *hnet;
+	struct sk_buff *next;
+	int num_acks = 0;
+
+	/* Find the appropriate socket. */
+	hnet = homa_net_from_skb(skb);
+	hsk = homa_sock_find(hnet, dport);
+	if (!hsk || (!homa_is_client(id) && !hsk->is_server)) {
+		if (skb_is_ipv6(skb))
+			icmp6_send(skb, ICMPV6_DEST_UNREACH,
+				   ICMPV6_PORT_UNREACH, 0, NULL, IP6CB(skb));
+		else
+			icmp_send(skb, ICMP_DEST_UNREACH,
+				  ICMP_PORT_UNREACH, 0);
+		while (skb) {
+			next = skb->next;
+			kfree_skb(skb);
+			skb = next;
+		}
+		if (hsk)
+			sock_put(&hsk->sock);
+		return;
+	}
+
+	/* Each iteration through the following loop processes one packet. */
+	for (; skb; skb = next) {
+		h = (struct homa_data_hdr *)skb->data;
+		next = skb->next;
+
+		/* Relinquish the RPC lock temporarily if it's needed
+		 * elsewhere.
+		 */
+		if (rpc) {
+			int flags = atomic_read(&rpc->flags);
+
+			if (flags & APP_NEEDS_LOCK) {
+				homa_rpc_unlock(rpc);
+
+				/* This short spin is needed to ensure that the
+				 * other thread gets the lock before this thread
+				 * grabs it again below (the need for this
+				 * was confirmed experimentally in 2/2025;
+				 * without it, the handoff fails 20-25% of the
+				 * time). Furthermore, the call to homa_spin
+				 * seems to allow the other thread to acquire
+				 * the lock more quickly.
+				 */
+				homa_spin(100);
+				homa_rpc_lock(rpc);
+			}
+		}
+
+		/* If we don't already have an RPC, find it, lock it,
+		 * and create a reference on it.
+		 */
+		if (!rpc) {
+			if (!homa_is_client(id)) {
+				/* We are the server for this RPC. */
+				if (h->common.type == DATA) {
+					int created;
+
+					/* Create a new RPC if one doesn't
+					 * already exist.
+					 */
+					rpc = homa_rpc_alloc_server(hsk, &saddr,
+								    h,
+								    &created);
+					if (IS_ERR(rpc)) {
+						rpc = NULL;
+						goto discard;
+					}
+				} else {
+					rpc = homa_rpc_find_server(hsk, &saddr,
+								   id);
+				}
+			} else {
+				rpc = homa_rpc_find_client(hsk, id);
+			}
+			if (rpc)
+				homa_rpc_hold(rpc);
+		}
+		if (unlikely(!rpc)) {
+			if (h->common.type != NEED_ACK &&
+			    h->common.type != ACK &&
+			    h->common.type != RESEND)
+				goto discard;
+		} else {
+			if (h->common.type == DATA ||
+			    h->common.type == BUSY)
+				rpc->silent_ticks = 0;
+			rpc->peer->outstanding_resends = 0;
+		}
+
+		switch (h->common.type) {
+		case DATA:
+			if (h->ack.client_id) {
+				/* Save the ack for processing later, when we
+				 * have released the RPC lock.
+				 */
+				if (num_acks < MAX_ACKS) {
+					acks[num_acks] = h->ack;
+					num_acks++;
+				}
+			}
+			homa_data_pkt(skb, rpc);
+			break;
+		case RESEND:
+			homa_resend_pkt(skb, rpc, hsk);
+			break;
+		case RPC_UNKNOWN:
+			homa_rpc_unknown_pkt(skb, rpc);
+			break;
+		case BUSY:
+			/* Nothing to do for these packets except reset
+			 * silent_ticks, which happened above.
+			 */
+			goto discard;
+		case NEED_ACK:
+			homa_need_ack_pkt(skb, hsk, rpc);
+			break;
+		case ACK:
+			homa_ack_pkt(skb, hsk, rpc);
+			break;
+		default:
+			goto discard;
+		}
+		continue;
+
+discard:
+		kfree_skb(skb);
+	}
+	if (rpc) {
+		homa_rpc_put(rpc);
+		homa_rpc_unlock(rpc);
+	}
+
+	while (num_acks > 0) {
+		num_acks--;
+		homa_rpc_acked(hsk, &saddr, &acks[num_acks]);
+	}
+
+	if (hsk->dead_skbs >= 2 * hsk->homa->dead_buffs_limit)
+		/* We get here if other approaches are not keeping up with
+		 * reaping dead RPCs. See "RPC Reaping Strategy" in
+		 * homa_rpc_reap code for details.
+		 */
+		homa_rpc_reap(hsk, false);
+	sock_put(&hsk->sock);
+}
+
+/**
+ * homa_data_pkt() - Handler for incoming DATA packets
+ * @skb:     Incoming packet; size known to be large enough for the header.
+ *           This function now owns the packet.
+ * @rpc:     Information about the RPC corresponding to this packet.
+ *           Must be locked by the caller.
+ */
+void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
+
+	if (rpc->state != RPC_INCOMING && homa_is_client(rpc->id)) {
+		if (unlikely(rpc->state != RPC_OUTGOING))
+			goto discard;
+		rpc->state = RPC_INCOMING;
+		if (homa_message_in_init(rpc, ntohl(h->message_length)) != 0)
+			goto discard;
+	} else if (rpc->state != RPC_INCOMING) {
+		/* Must be server; note that homa_rpc_alloc_server already
+		 * initialized msgin and allocated buffers.
+		 */
+		if (unlikely(rpc->msgin.length >= 0))
+			goto discard;
+	}
+
+	if (rpc->msgin.num_bpages == 0)
+		/* Drop packets that arrive when we can't allocate buffer
+		 * space. If we keep them around, packet buffer usage can
+		 * exceed available cache space, resulting in poor
+		 * performance.
+		 */
+		goto discard;
+
+	homa_add_packet(rpc, skb);
+
+	if (skb_queue_len(&rpc->msgin.packets) != 0 &&
+	    !(atomic_read(&rpc->flags) & RPC_PKTS_READY)) {
+		atomic_or(RPC_PKTS_READY, &rpc->flags);
+		homa_rpc_handoff(rpc);
+	}
+
+	return;
+
+discard:
+	kfree_skb(skb);
+}
+
+/**
+ * homa_resend_pkt() - Handler for incoming RESEND packets
+ * @skb:     Incoming packet; size already verified large enough for header.
+ *           This function now owns the packet.
+ * @rpc:     Information about the RPC corresponding to this packet; must
+ *           be locked by caller, but may be NULL if there is no RPC matching
+ *           this packet
+ * @hsk:     Socket on which the packet was received.
+ */
+void homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
+		     struct homa_sock *hsk)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_resend_hdr *h = (struct homa_resend_hdr *)skb->data;
+	int offset = ntohl(h->offset);
+	int length = ntohl(h->length);
+	int end = offset + length;
+	struct homa_busy_hdr busy;
+	int tx_end;
+
+	if (!rpc) {
+		homa_xmit_unknown(skb, hsk);
+		goto done;
+	}
+
+	tx_end = homa_rpc_tx_end(rpc);
+	if (!homa_is_client(rpc->id) && rpc->state != RPC_OUTGOING) {
+		/* We are the server for this RPC and don't yet have a
+		 * response message, so send BUSY to keep the client
+		 * waiting.
+		 */
+		homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
+		goto done;
+	}
+
+	if (length == -1)
+		end = tx_end;
+
+	homa_resend_data(rpc, offset, (end > tx_end) ? tx_end : end);
+
+	if (offset >= tx_end) {
+		/* We have chosen not to transmit any of the requested data;
+		 * send BUSY so the receiver knows we are alive.
+		 */
+		homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
+	}
+
+done:
+	kfree_skb(skb);
+}
+
+/**
+ * homa_rpc_unknown_pkt() - Handler for incoming RPC_UNKNOWN packets.
+ * @skb:     Incoming packet; size known to be large enough for the header.
+ *           This function now owns the packet.
+ * @rpc:     Information about the RPC corresponding to this packet. Must
+ *           be locked by caller.
+ */
+void homa_rpc_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	if (homa_is_client(rpc->id)) {
+		if (rpc->state == RPC_OUTGOING) {
+			int tx_end = homa_rpc_tx_end(rpc);
+
+			/* It appears that everything we've already transmitted
+			 * has been lost; retransmit it.
+			 */
+			homa_resend_data(rpc, 0, tx_end);
+			goto done;
+		}
+	} else {
+		homa_rpc_end(rpc);
+	}
+done:
+	kfree_skb(skb);
+}
+
+/**
+ * homa_need_ack_pkt() - Handler for incoming NEED_ACK packets
+ * @skb:     Incoming packet; size already verified large enough for header.
+ *           This function now owns the packet.
+ * @hsk:     Socket on which the packet was received.
+ * @rpc:     The RPC named in the packet header, or NULL if no such
+ *           RPC exists. The RPC has been locked by the caller.
+ */
+void homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+		       struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_common_hdr *h = (struct homa_common_hdr *)skb->data;
+	const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+	u64 id = homa_local_id(h->sender_id);
+	struct homa_ack_hdr ack;
+	struct homa_peer *peer;
+
+	/* Don't ack if it's not safe for the peer to purge its state
+	 * for this RPC (the RPC still exists and we haven't received
+	 * the entire response), or if we can't find peer info.
+	 */
+	if (rpc && (rpc->state != RPC_INCOMING ||
+		    rpc->msgin.bytes_remaining)) {
+		homa_request_retrans(rpc);
+		goto done;
+	} else {
+		peer = homa_peer_get(hsk, &saddr);
+		if (IS_ERR(peer))
+			goto done;
+	}
+
+	/* Send an ACK for this RPC. At the same time, include all of the
+	 * other acks available for the peer. Note: can't use rpc below,
+	 * since it may be NULL.
+	 */
+	ack.common.type = ACK;
+	ack.common.sport = h->dport;
+	ack.common.dport = h->sport;
+	ack.common.sender_id = cpu_to_be64(id);
+	ack.num_acks = htons(homa_peer_get_acks(peer,
+						HOMA_MAX_ACKS_PER_PKT,
+						ack.acks));
+	__homa_xmit_control(&ack, sizeof(ack), peer, hsk);
+	homa_peer_release(peer);
+
+done:
+	kfree_skb(skb);
+}
+
+/**
+ * homa_ack_pkt() - Handler for incoming ACK packets
+ * @skb:     Incoming packet; size already verified large enough for header.
+ *           This function now owns the packet.
+ * @hsk:     Socket on which the packet was received.
+ * @rpc:     The RPC named in the packet header, or NULL if no such
+ *           RPC exists. Must be locked by the caller; the RPC will be
+ *           dead (state RPC_DEAD) on return.
+ */
+void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+		  struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+	struct homa_ack_hdr *h = (struct homa_ack_hdr *)skb->data;
+	int i, count;
+
+	if (rpc)
+		homa_rpc_end(rpc);
+
+	count = ntohs(h->num_acks);
+	if (count > 0) {
+		if (rpc) {
+			/* Must temporarily release rpc's lock because
+			 * homa_rpc_acked needs to acquire RPC locks.
+			 */
+			homa_rpc_unlock(rpc);
+			for (i = 0; i < count; i++)
+				homa_rpc_acked(hsk, &saddr, &h->acks[i]);
+			homa_rpc_lock(rpc);
+		} else {
+			for (i = 0; i < count; i++)
+				homa_rpc_acked(hsk, &saddr, &h->acks[i]);
+		}
+	}
+	kfree_skb(skb);
+}
+
+/**
+ * homa_wait_private() - Waits until the response has been received for
+ * a specific RPC or the RPC has failed with an error.
+ * @rpc:          RPC to wait for; an error will be returned if the RPC is
+ *                not a client RPC or not private. Must be locked by caller.
+ * @nonblocking:  Nonzero means return immediately if @rpc not ready.
+ * Return:        0 means that @rpc is ready for attention: either its response
+ *                has been received or it has an unrecoverable error such as
+ *                ETIMEDOUT (in rpc->error). Nonzero means some other error
+ *                (such as EINTR or EINVAL) occurred before @rpc became ready
+ *                for attention; in this case the return value is a negative
+ *                errno.
+ */
+int homa_wait_private(struct homa_rpc *rpc, int nonblocking)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_interest interest;
+	int result;
+
+	if (!(atomic_read(&rpc->flags) & RPC_PRIVATE))
+		return -EINVAL;
+
+	/* Each iteration through this loop waits until rpc needs attention
+	 * in some way (e.g. packets have arrived), then deals with that need
+	 * (e.g. copy to user space). It may take many iterations until the
+	 * RPC is ready for the application.
+	 */
+	while (1) {
+		result = 0;
+		if (!rpc->error)
+			rpc->error = homa_copy_to_user(rpc);
+		if (rpc->error)
+			break;
+		if (rpc->msgin.length >= 0 &&
+		    rpc->msgin.bytes_remaining == 0 &&
+		    skb_queue_len(&rpc->msgin.packets) == 0)
+			break;
+
+		if (nonblocking) {
+			result = -EAGAIN;
+			break;
+		}
+
+		result = homa_interest_init_private(&interest, rpc);
+		if (result != 0)
+			break;
+
+		homa_rpc_unlock(rpc);
+		result = homa_interest_wait(&interest);
+
+		atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+		homa_rpc_lock(rpc);
+		atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
+		homa_interest_unlink_private(&interest);
+
+		/* Abort on error, but if the interest actually got ready
+		 * in the meantime then ignore the error (loop back around
+		 * to process the RPC).
+		 */
+		if (result != 0 && atomic_read(&interest.ready) == 0)
+			break;
+	}
+
+	return result;
+}
+
+/**
+ * homa_wait_shared() - Wait for the completion of any non-private
+ * incoming message on a socket.
+ * @hsk:          Socket on which to wait. Must not be locked.
+ * @nonblocking:  Nonzero means return immediately if no RPC is ready.
+ *
+ * Return:    Pointer to an RPC with a complete incoming message or nonzero
+ *            error field, or a negative errno encoded with ERR_PTR()
+ *            (usually -EINTR). If an RPC is returned it will be locked
+ *            and referenced; the caller must release the lock and the
+ *            reference.
+ */
+struct homa_rpc *homa_wait_shared(struct homa_sock *hsk, int nonblocking)
+	__cond_acquires(rpc->bucket->lock)
+{
+	struct homa_interest interest;
+	struct homa_rpc *rpc;
+	int result;
+
+	INIT_LIST_HEAD(&interest.links);
+	init_waitqueue_head(&interest.wait_queue);
+	/* Each iteration through this loop waits until an RPC needs attention
+	 * in some way (e.g. packets have arrived), then deals with that need
+	 * (e.g. copy to user space). It may take many iterations until an
+	 * RPC is ready for the application.
+	 */
+	while (1) {
+		homa_sock_lock(hsk);
+		if (hsk->shutdown) {
+			rpc = ERR_PTR(-ESHUTDOWN);
+			homa_sock_unlock(hsk);
+			goto done;
+		}
+		if (!list_empty(&hsk->ready_rpcs)) {
+			rpc = list_first_entry(&hsk->ready_rpcs,
+					       struct homa_rpc,
+					       ready_links);
+			homa_rpc_hold(rpc);
+			list_del_init(&rpc->ready_links);
+			if (!list_empty(&hsk->ready_rpcs)) {
+				/* There are still more RPCs available, so
+				 * let Linux know.
+				 */
+				hsk->sock.sk_data_ready(&hsk->sock);
+			}
+			homa_sock_unlock(hsk);
+		} else if (nonblocking) {
+			rpc = ERR_PTR(-EAGAIN);
+			homa_sock_unlock(hsk);
+
+			/* This is a good time to clean up dead RPCs. */
+			homa_rpc_reap(hsk, false);
+			goto done;
+		} else {
+			homa_interest_init_shared(&interest, hsk);
+			homa_sock_unlock(hsk);
+			result = homa_interest_wait(&interest);
+
+			if (result != 0) {
+				int ready;
+
+				/* homa_interest_wait returned an error, so we
+				 * have to do two things. First, unlink the
+				 * interest from the socket. Second, check to
+				 * see if in the meantime the interest received
+				 * a handoff. If so, ignore the error. Very
+				 * important to hold the socket lock while
+				 * checking, in order to eliminate races with
+				 * homa_rpc_handoff.
+				 */
+				homa_sock_lock(hsk);
+				homa_interest_unlink_shared(&interest);
+				ready = atomic_read(&interest.ready);
+				homa_sock_unlock(hsk);
+				if (ready == 0) {
+					rpc = ERR_PTR(result);
+					goto done;
+				}
+			}
+
+			rpc = interest.rpc;
+			if (!rpc) {
+				rpc = ERR_PTR(-ESHUTDOWN);
+				goto done;
+			}
+		}
+
+		atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+		homa_rpc_lock(rpc);
+		atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
+		if (!rpc->error)
+			rpc->error = homa_copy_to_user(rpc);
+		if (rpc->error) {
+			if (rpc->state != RPC_DEAD)
+				break;
+		} else if (rpc->msgin.bytes_remaining == 0 &&
+			   skb_queue_len(&rpc->msgin.packets) == 0) {
+			break;
+		}
+		homa_rpc_put(rpc);
+		homa_rpc_unlock(rpc);
+	}
+
+done:
+	return rpc;
+}
+
+/**
+ * homa_rpc_handoff() - This function is called when the input message for
+ * an RPC is ready for attention from a user thread. It notifies a waiting
+ * reader and/or queues the RPC, as appropriate.
+ * @rpc:                RPC to handoff; must be locked.
+ */
+void homa_rpc_handoff(struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa_sock *hsk = rpc->hsk;
+	struct homa_interest *interest;
+
+	if (atomic_read(&rpc->flags) & RPC_PRIVATE) {
+		homa_interest_notify_private(rpc);
+		return;
+	}
+
+	/* Shared RPC; if there is a waiting thread, hand off the RPC;
+	 * otherwise enqueue it.
+	 */
+	homa_sock_lock(hsk);
+	if (hsk->shutdown) {
+		homa_sock_unlock(hsk);
+		return;
+	}
+	if (!list_empty(&hsk->interests)) {
+		interest = list_first_entry(&hsk->interests,
+					    struct homa_interest, links);
+		list_del_init(&interest->links);
+		interest->rpc = rpc;
+		homa_rpc_hold(rpc);
+		atomic_set_release(&interest->ready, 1);
+		wake_up(&interest->wait_queue);
+	} else if (list_empty(&rpc->ready_links)) {
+		list_add_tail(&rpc->ready_links, &hsk->ready_rpcs);
+		hsk->sock.sk_data_ready(&hsk->sock);
+	}
+	homa_sock_unlock(hsk);
+}
+
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 13/15] net: homa: create homa_timer.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (11 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 12/15] net: homa: create homa_incoming.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-18 20:55 ` [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c John Ousterhout
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

This file contains code that wakes up periodically to check for
missing data, initiate retransmissions, and declare peer nodes
"dead".

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v14:
* Use new homa_rpc_tx_end function

Changes for v11:
* Cleanup sparse annotations.

Changes for v10:
* Refactor resend mechanism

Changes for v9:
* Reflect changes in socket and peer management
* Minor name changes for clarity

Changes for v7:
* Interface changes to homa_sock_start_scan etc.
* Remove locker argument from locking functions
* Use u64 and __u64 properly
---
 net/homa/homa_impl.h  |   3 +
 net/homa/homa_timer.c | 136 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 139 insertions(+)
 create mode 100644 net/homa/homa_timer.c

diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
index 3d91b7f44de9..f28302cb1061 100644
--- a/net/homa/homa_impl.h
+++ b/net/homa/homa_impl.h
@@ -446,6 +446,9 @@ void     homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
 void     homa_rpc_handoff(struct homa_rpc *rpc);
 int      homa_rpc_tx_end(struct homa_rpc *rpc);
 void     homa_spin(int ns);
+void     homa_timer(struct homa *homa);
+void     homa_timer_check_rpc(struct homa_rpc *rpc);
+int      homa_timer_main(void *transport);
 struct sk_buff *homa_tx_data_pkt_alloc(struct homa_rpc *rpc,
 				       struct iov_iter *iter, int offset,
 				       int length, int max_seg_data);
diff --git a/net/homa/homa_timer.c b/net/homa/homa_timer.c
new file mode 100644
index 000000000000..dcfdcc06c8ab
--- /dev/null
+++ b/net/homa/homa_timer.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: BSD-2-Clause OR GPL-2.0+
+
+/* This file handles timing-related functions for Homa, such as retries
+ * and timeouts.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_stub.h"
+
+/**
+ * homa_timer_check_rpc() -  Invoked for each RPC during each timer pass; does
+ * most of the work of checking for time-related actions such as sending
+ * resends, aborting RPCs for which there is no response, and sending
+ * requests for acks. It is separate from homa_timer because homa_timer
+ * got too long and deeply indented.
+ * @rpc:     RPC to check; must be locked by the caller.
+ */
+void homa_timer_check_rpc(struct homa_rpc *rpc)
+	__must_hold(rpc->bucket->lock)
+{
+	struct homa *homa = rpc->hsk->homa;
+	int tx_end = homa_rpc_tx_end(rpc);
+
+	/* See if we need to request an ack for this RPC. */
+	if (!homa_is_client(rpc->id) && rpc->state == RPC_OUTGOING &&
+	    tx_end == rpc->msgout.length) {
+		if (rpc->done_timer_ticks == 0) {
+			rpc->done_timer_ticks = homa->timer_ticks;
+		} else {
+			/* >= comparison that handles tick wrap-around. */
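+			/* Example (illustrative values): with
+			 * done_timer_ticks = 100 and request_ack_ticks = 2
+			 * the expression is 100 + 2 - 1 - timer_ticks, whose
+			 * sign bit first becomes set when timer_ticks
+			 * reaches 102.
+			 */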
+			if ((rpc->done_timer_ticks + homa->request_ack_ticks
+					- 1 - homa->timer_ticks) & 1 << 31) {
+				struct homa_need_ack_hdr h;
+
+				homa_xmit_control(NEED_ACK, &h, sizeof(h), rpc);
+			}
+		}
+	}
+
+	if (rpc->state == RPC_INCOMING) {
+		if (rpc->msgin.num_bpages == 0) {
+			/* Waiting for buffer space, so no problem. */
+			rpc->silent_ticks = 0;
+			return;
+		}
+	} else if (!homa_is_client(rpc->id)) {
+		/* We're the server and we've received the input message;
+		 * no need to worry about retries.
+		 */
+		rpc->silent_ticks = 0;
+		return;
+	}
+
+	if (rpc->state == RPC_OUTGOING) {
+		if (tx_end < rpc->msgout.length) {
+			/* There are bytes that we haven't transmitted yet,
+			 * so no need to be concerned; the ball is in our court.
+			 */
+			rpc->silent_ticks = 0;
+			return;
+		}
+	}
+
+	if (rpc->silent_ticks < homa->resend_ticks)
+		return;
+	if (rpc->silent_ticks >= homa->timeout_ticks) {
+		homa_rpc_abort(rpc, -ETIMEDOUT);
+		return;
+	}
+	if (((rpc->silent_ticks - homa->resend_ticks) % homa->resend_interval)
+			== 0)
+		homa_request_retrans(rpc);
+}
+
+/**
+ * homa_timer() - This function is invoked at regular intervals ("ticks")
+ * to implement retries and aborts for Homa.
+ * @homa:    Overall data about the Homa protocol implementation.
+ */
+void homa_timer(struct homa *homa)
+{
+	struct homa_socktab_scan scan;
+	struct homa_sock *hsk;
+	struct homa_rpc *rpc;
+	int rpc_count = 0;
+
+	homa->timer_ticks++;
+
+	/* Scan all existing RPCs in all sockets. */
+	for (hsk = homa_socktab_start_scan(homa->socktab, &scan);
+			hsk; hsk = homa_socktab_next(&scan)) {
+		while (hsk->dead_skbs >= homa->dead_buffs_limit)
+			/* If we get here, it means that Homa isn't keeping
+			 * up with RPC reaping, so we'll help out.  See
+			 * "RPC Reaping Strategy" in homa_rpc_reap code for
+			 * details.
+			 */
+			if (homa_rpc_reap(hsk, false) == 0)
+				break;
+
+		if (list_empty(&hsk->active_rpcs) || hsk->shutdown)
+			continue;
+
+		if (!homa_protect_rpcs(hsk))
+			continue;
+		rcu_read_lock();
+		list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+			homa_rpc_lock(rpc);
+			if (rpc->state == RPC_IN_SERVICE) {
+				rpc->silent_ticks = 0;
+				homa_rpc_unlock(rpc);
+				continue;
+			}
+			rpc->silent_ticks++;
+			homa_timer_check_rpc(rpc);
+			homa_rpc_unlock(rpc);
+			rpc_count++;
+			if (rpc_count >= 10) {
+				/* Give other kernel threads a chance to run
+				 * on this core.
+				 */
+				rcu_read_unlock();
+				schedule();
+				rcu_read_lock();
+				rpc_count = 0;
+			}
+		}
+		rcu_read_unlock();
+		homa_unprotect_rpcs(hsk);
+	}
+	homa_socktab_end_scan(&scan);
+	homa_skb_release_pages(homa);
+	homa_peer_gc(homa->peertab);
+}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (12 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 13/15] net: homa: create homa_timer.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-26 16:17   ` Paolo Abeni
  2025-08-18 20:55 ` [PATCH net-next v15 15/15] net: homa: create Makefile and Kconfig John Ousterhout
  2025-08-22 15:51 ` [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
  15 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

homa_plumbing.c contains functions that connect Homa to the rest of
the Linux kernel, such as dispatch tables used by Linux and the
top-level functions that Linux invokes from those dispatch tables.
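
For illustration (a user-space sketch, not part of this patch), an
application reaches these entry points through ordinary socket calls,
e.g.:

  int fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_HOMA);

after which recvmsg() on fd arrives at homa_recvmsg() via
inet_recvmsg(). IPPROTO_HOMA comes from Homa's user-visible API
header (patch 1 of this series).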

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>

---
Changes for v13:
* Fix bug in is_homa_pkt: didn't properly handle packets where the
  network header hadn't yet been set.

Changes for v12:
* Fix deadlock in homa_recvmsg (homa_rpc_reap was invoked while holding
  an RPC lock).

Changes for v11:
* Move link_mbps variable from struct homa_pacer back to struct homa.
* Clean up error handing in homa_load.
* Cleanup and simplify use of RPC reference counts.
* Add explicit padding to struct homa_recvmsg_args to fix problems compiling
  on 32-bit machines.

Changes for v10:
* Use the destroy function from struct proto properly (fixes bugs in
  socket cleanup)
* Fix issues from sparse, xmastree, etc.
* Replace __u16 with u16, __u8 with u8, etc.

Changes for v9:
* Add support for homa_net objects
* Various name improvements (e.g. use "alloc" instead of "new" for functions
  that allocate memory)
* Add BUILD_BUG_ON statements to replace _Static_asserts removed from
  header files
* Remove unnecessary/unused functions such as homa_get_port, homa_disconnect,
  and homa_backlog_rcv.

Changes for v8:
* Accommodate homa_pacer and homa_pool refactorings

Changes for v7:
* Remove extraneous code
* Make Homa a pernet subsystem
* Block Homa senders if insufficient tx buffer memory
* Check for missing buffer pool in homa_recvmsg
* Refactor waiting mechanism for incoming packets: simplify wait
  criteria and use standard Linux mechanisms for waiting
* Implement SO_HOMA_SERVER option for setsockopt
* Rename UNKNOWN packet type to RPC_UNKNOWN
* Remove locker argument from locking functions
* Use u64 and __u64 properly
* Use new homa_make_header_avl function
---
 net/homa/homa_impl.h     |   27 +
 net/homa/homa_plumbing.c | 1118 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 1145 insertions(+)
 create mode 100644 net/homa/homa_plumbing.c

diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
index f28302cb1061..4616502b13c4 100644
--- a/net/homa/homa_impl.h
+++ b/net/homa/homa_impl.h
@@ -424,27 +424,52 @@ extern unsigned int homa_net_id;
 void     homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
 		      struct homa_rpc *rpc);
 void     homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb);
+int      homa_bind(struct socket *sk, struct sockaddr *addr,
+		   int addr_len);
+void     homa_close(struct sock *sock, long timeout);
 int      homa_copy_to_user(struct homa_rpc *rpc);
 void     homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
 void     homa_destroy(struct homa *homa);
 void     homa_dispatch_pkts(struct sk_buff *skb);
+int      homa_err_handler_v4(struct sk_buff *skb, u32 info);
+int      homa_err_handler_v6(struct sk_buff *skb,
+			     struct inet6_skb_parm *opt, u8 type,  u8 code,
+			     int offset, __be32 info);
 int      homa_fill_data_interleaved(struct homa_rpc *rpc,
 				    struct sk_buff *skb, struct iov_iter *iter);
 struct homa_gap *homa_gap_alloc(struct list_head *next, int start, int end);
+int      homa_getsockopt(struct sock *sk, int level, int optname,
+			 char __user *optval, int __user *optlen);
+int      homa_hash(struct sock *sk);
+enum hrtimer_restart homa_hrtimer(struct hrtimer *timer);
 int      homa_init(struct homa *homa);
+int      homa_ioctl(struct sock *sk, int cmd, int *karg);
+int      homa_load(void);
 int      homa_message_out_fill(struct homa_rpc *rpc,
 			       struct iov_iter *iter, int xmit);
 void     homa_message_out_init(struct homa_rpc *rpc, int length);
 void     homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
 			   struct homa_rpc *rpc);
 void     homa_net_destroy(struct homa_net *hnet);
+void     homa_net_exit(struct net *net);
 int      homa_net_init(struct homa_net *hnet, struct net *net,
 		       struct homa *homa);
+int      homa_net_start(struct net *net);
+__poll_t homa_poll(struct file *file, struct socket *sock,
+		   struct poll_table_struct *wait);
+int      homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+		      int flags, int *addr_len);
 void     homa_request_retrans(struct homa_rpc *rpc);
 void     homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
 			 struct homa_sock *hsk);
 void     homa_rpc_handoff(struct homa_rpc *rpc);
 int      homa_rpc_tx_end(struct homa_rpc *rpc);
+int      homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
+int      homa_setsockopt(struct sock *sk, int level, int optname,
+			 sockptr_t optval, unsigned int optlen);
+int      homa_shutdown(struct socket *sock, int how);
+int      homa_socket(struct sock *sk);
+int      homa_softirq(struct sk_buff *skb);
 void     homa_spin(int ns);
 void     homa_timer(struct homa *homa);
 void     homa_timer_check_rpc(struct homa_rpc *rpc);
@@ -452,7 +477,9 @@ int      homa_timer_main(void *transport);
 struct sk_buff *homa_tx_data_pkt_alloc(struct homa_rpc *rpc,
 				       struct iov_iter *iter, int offset,
 				       int length, int max_seg_data);
+void     homa_unhash(struct sock *sk);
 void     homa_rpc_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
+void     homa_unload(void);
 int      homa_wait_private(struct homa_rpc *rpc, int nonblocking);
 struct homa_rpc *homa_wait_shared(struct homa_sock *hsk, int nonblocking);
 int      homa_xmit_control(enum homa_packet_type type, void *contents,
diff --git a/net/homa/homa_plumbing.c b/net/homa/homa_plumbing.c
new file mode 100644
index 000000000000..e9c8576f22dd
--- /dev/null
+++ b/net/homa/homa_plumbing.c
@@ -0,0 +1,1118 @@
+// SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+
+
+/* This file consists mostly of "glue" that hooks Homa into the rest of
+ * the Linux kernel. The guts of the protocol are in other files.
+ */
+
+#include "homa_impl.h"
+#include "homa_pacer.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/* Identifier for retrieving Homa-specific data for a struct net. */
+unsigned int homa_net_id;
+
+/* This structure defines functions that allow Homa to be used as a
+ * pernet subsystem.
+ */
+static struct pernet_operations homa_net_ops = {
+	.init = homa_net_start,
+	.exit = homa_net_exit,
+	.id = &homa_net_id,
+	.size = sizeof(struct homa_net)
+};
+
+/* Global data for Homa. Never reference homa_data directly. Always use
+ * the global_homa variable instead (or, even better, a homa pointer
+ * stored in a struct or passed via a parameter); this allows overriding
+ * during unit tests.
+ */
+static struct homa homa_data;
+
+/* This variable contains the address of the statically-allocated struct homa
+ * used throughout Homa. This variable should almost never be used directly:
+ * it should be passed as a parameter to functions that need it. This
+ * variable is used only by a few functions called from Linux where there
+ * is no struct homa* available.
+ */
+static struct homa *global_homa = &homa_data;
+
+/* This structure defines functions that handle various operations on
+ * Homa sockets. These functions are relatively generic: they are called
+ * to implement top-level system calls. Many of these operations can
+ * be implemented by PF_INET6 functions that are independent of the
+ * Homa protocol.
+ */
+static const struct proto_ops homa_proto_ops = {
+	.family		   = PF_INET,
+	.owner		   = THIS_MODULE,
+	.release	   = inet_release,
+	.bind		   = homa_bind,
+	.connect	   = inet_dgram_connect,
+	.socketpair	   = sock_no_socketpair,
+	.accept		   = sock_no_accept,
+	.getname	   = inet_getname,
+	.poll		   = homa_poll,
+	.ioctl		   = inet_ioctl,
+	.listen		   = sock_no_listen,
+	.shutdown	   = homa_shutdown,
+	.setsockopt	   = sock_common_setsockopt,
+	.getsockopt	   = sock_common_getsockopt,
+	.sendmsg	   = inet_sendmsg,
+	.recvmsg	   = inet_recvmsg,
+	.mmap		   = sock_no_mmap,
+	.set_peek_off	   = sk_set_peek_off,
+};
+
+static const struct proto_ops homav6_proto_ops = {
+	.family		   = PF_INET6,
+	.owner		   = THIS_MODULE,
+	.release	   = inet6_release,
+	.bind		   = homa_bind,
+	.connect	   = inet_dgram_connect,
+	.socketpair	   = sock_no_socketpair,
+	.accept		   = sock_no_accept,
+	.getname	   = inet6_getname,
+	.poll		   = homa_poll,
+	.ioctl		   = inet6_ioctl,
+	.listen		   = sock_no_listen,
+	.shutdown	   = homa_shutdown,
+	.setsockopt	   = sock_common_setsockopt,
+	.getsockopt	   = sock_common_getsockopt,
+	.sendmsg	   = inet_sendmsg,
+	.recvmsg	   = inet_recvmsg,
+	.mmap		   = sock_no_mmap,
+	.set_peek_off	   = sk_set_peek_off,
+};
+
+/* This structure also defines functions that handle various operations
+ * on Homa sockets. However, these functions are lower-level than those
+ * in homa_proto_ops: they are specific to the PF_INET or PF_INET6
+ * protocol family, and in many cases they are invoked by functions in
+ * homa_proto_ops. Most of these functions have Homa-specific implementations.
+ */
+static struct proto homa_prot = {
+	.name		   = "HOMA",
+	.owner		   = THIS_MODULE,
+	.close		   = homa_close,
+	.connect	   = ip4_datagram_connect,
+	.ioctl		   = homa_ioctl,
+	.init		   = homa_socket,
+	.destroy	   = homa_sock_destroy,
+	.setsockopt	   = homa_setsockopt,
+	.getsockopt	   = homa_getsockopt,
+	.sendmsg	   = homa_sendmsg,
+	.recvmsg	   = homa_recvmsg,
+	.hash		   = homa_hash,
+	.unhash		   = homa_unhash,
+	.obj_size	   = sizeof(struct homa_sock),
+	.no_autobind       = 1,
+};
+
+static struct proto homav6_prot = {
+	.name		   = "HOMAv6",
+	.owner		   = THIS_MODULE,
+	.close		   = homa_close,
+	.connect	   = ip6_datagram_connect,
+	.ioctl		   = homa_ioctl,
+	.init		   = homa_socket,
+	.destroy	   = homa_sock_destroy,
+	.setsockopt	   = homa_setsockopt,
+	.getsockopt	   = homa_getsockopt,
+	.sendmsg	   = homa_sendmsg,
+	.recvmsg	   = homa_recvmsg,
+	.hash		   = homa_hash,
+	.unhash		   = homa_unhash,
+	.obj_size	   = sizeof(struct homa_v6_sock),
+	.ipv6_pinfo_offset = offsetof(struct homa_v6_sock, inet6),
+
+	.no_autobind       = 1,
+};
+
+/* Top-level structure describing the Homa protocol. */
+static struct inet_protosw homa_protosw = {
+	.type              = SOCK_DGRAM,
+	.protocol          = IPPROTO_HOMA,
+	.prot              = &homa_prot,
+	.ops               = &homa_proto_ops,
+	.flags             = INET_PROTOSW_REUSE,
+};
+
+static struct inet_protosw homav6_protosw = {
+	.type              = SOCK_DGRAM,
+	.protocol          = IPPROTO_HOMA,
+	.prot              = &homav6_prot,
+	.ops               = &homav6_proto_ops,
+	.flags             = INET_PROTOSW_REUSE,
+};
+
+/* This structure is used by IP to deliver incoming Homa packets to us. */
+static struct net_protocol homa_protocol = {
+	.handler =	homa_softirq,
+	.err_handler =	homa_err_handler_v4,
+	.no_policy =     1,
+};
+
+static struct inet6_protocol homav6_protocol = {
+	.handler =	homa_softirq,
+	.err_handler =	homa_err_handler_v6,
+	.flags =        INET6_PROTO_NOPOLICY | INET6_PROTO_FINAL,
+};
+
+/* Sizes of the headers for each Homa packet type, in bytes. */
+static u16 header_lengths[] = {
+	sizeof(struct homa_data_hdr),
+	0,
+	sizeof(struct homa_resend_hdr),
+	sizeof(struct homa_rpc_unknown_hdr),
+	sizeof(struct homa_busy_hdr),
+	0,
+	0,
+	sizeof(struct homa_need_ack_hdr),
+	sizeof(struct homa_ack_hdr)
+};
+
+/* Thread that runs timer code to detect lost packets and crashed peers. */
+static struct task_struct *timer_kthread;
+static DECLARE_COMPLETION(timer_thread_done);
+
+/* Used to wake up timer_kthread at regular intervals. */
+static struct hrtimer hrtimer;
+
+/* Nonzero indicates to the timer thread that it should exit. */
+static int timer_thread_exit;
+
+/**
+ * homa_load() - invoked when this module is loaded into the Linux kernel
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int __init homa_load(void)
+{
+	struct homa *homa = global_homa;
+	bool init_protocol6 = false;
+	bool init_protosw6 = false;
+	bool init_protocol = false;
+	bool init_protosw = false;
+	bool init_net_ops = false;
+	bool init_proto6 = false;
+	bool init_proto = false;
+	bool init_homa = false;
+	int status;
+
+	/* Compile-time validations that no packet header is longer
+	 * than HOMA_MAX_HEADER.
+	 */
+	BUILD_BUG_ON(sizeof(struct homa_data_hdr) > HOMA_MAX_HEADER);
+	BUILD_BUG_ON(sizeof(struct homa_resend_hdr) > HOMA_MAX_HEADER);
+	BUILD_BUG_ON(sizeof(struct homa_rpc_unknown_hdr) > HOMA_MAX_HEADER);
+	BUILD_BUG_ON(sizeof(struct homa_busy_hdr) > HOMA_MAX_HEADER);
+	BUILD_BUG_ON(sizeof(struct homa_need_ack_hdr) > HOMA_MAX_HEADER);
+	BUILD_BUG_ON(sizeof(struct homa_ack_hdr) > HOMA_MAX_HEADER);
+
+	/* Extra constraints on data packets:
+	 * - Ensure minimum header length so Homa doesn't have to worry about
+	 *   padding data packets.
+	 * - Make sure data packet headers are a multiple of 4 bytes (needed
+	 *   for TCP/TSO compatibility).
+	 */
+	BUILD_BUG_ON(sizeof(struct homa_data_hdr) < HOMA_MIN_PKT_LENGTH);
+	BUILD_BUG_ON((sizeof(struct homa_data_hdr) -
+		      sizeof(struct homa_seg_hdr)) & 0x3);
+
+	/* Detect size changes in uAPI structs. */
+	BUILD_BUG_ON(sizeof(struct homa_sendmsg_args) != 24);
+	BUILD_BUG_ON(sizeof(struct homa_recvmsg_args) != 88);
+
+	pr_err("Homa module loading\n");
+	status = proto_register(&homa_prot, 1);
+	if (status != 0) {
+		pr_err("proto_register failed for homa_prot: %d\n", status);
+		goto error;
+	}
+	init_proto = true;
+
+	status = proto_register(&homav6_prot, 1);
+	if (status != 0) {
+		pr_err("proto_register failed for homav6_prot: %d\n", status);
+		goto error;
+	}
+	init_proto6 = true;
+
+	inet_register_protosw(&homa_protosw);
+	init_protosw = true;
+
+	status = inet6_register_protosw(&homav6_protosw);
+	if (status != 0) {
+		pr_err("inet6_register_protosw failed in %s: %d\n", __func__,
+		       status);
+		goto error;
+	}
+	init_protosw6 = true;
+
+	status = inet_add_protocol(&homa_protocol, IPPROTO_HOMA);
+	if (status != 0) {
+		pr_err("inet_add_protocol failed in %s: %d\n", __func__,
+		       status);
+		goto error;
+	}
+	init_protocol = true;
+
+	status = inet6_add_protocol(&homav6_protocol, IPPROTO_HOMA);
+	if (status != 0) {
+		pr_err("inet6_add_protocol failed in %s: %d\n",  __func__,
+		       status);
+		goto error;
+	}
+	init_protocol6 = true;
+
+	status = homa_init(homa);
+	if (status)
+		goto error;
+	init_homa = true;
+
+	status = register_pernet_subsys(&homa_net_ops);
+	if (status != 0) {
+		pr_err("Homa got error from register_pernet_subsys: %d\n",
+		       status);
+		goto error;
+	}
+	init_net_ops = true;
+
+	timer_kthread = kthread_run(homa_timer_main, homa, "homa_timer");
+	if (IS_ERR(timer_kthread)) {
+		status = PTR_ERR(timer_kthread);
+		pr_err("couldn't create Homa timer thread: error %d\n",
+		       status);
+		timer_kthread = NULL;
+		goto error;
+	}
+
+	return 0;
+
+error:
+	if (timer_kthread) {
+		timer_thread_exit = 1;
+		wake_up_process(timer_kthread);
+		wait_for_completion(&timer_thread_done);
+	}
+	if (init_net_ops)
+		unregister_pernet_subsys(&homa_net_ops);
+	if (init_homa)
+		homa_destroy(homa);
+	if (init_protocol)
+		inet_del_protocol(&homa_protocol, IPPROTO_HOMA);
+	if (init_protocol6)
+		inet6_del_protocol(&homav6_protocol, IPPROTO_HOMA);
+	if (init_protosw)
+		inet_unregister_protosw(&homa_protosw);
+	if (init_protosw6)
+		inet6_unregister_protosw(&homav6_protosw);
+	if (init_proto)
+		proto_unregister(&homa_prot);
+	if (init_proto6)
+		proto_unregister(&homav6_prot);
+	return status;
+}
+
+/**
+ * homa_unload() - invoked when this module is unloaded from the Linux kernel.
+ */
+void __exit homa_unload(void)
+{
+	struct homa *homa = global_homa;
+
+	pr_notice("Homa module unloading\n");
+
+	unregister_pernet_subsys(&homa_net_ops);
+	homa_destroy(homa);
+	inet_del_protocol(&homa_protocol, IPPROTO_HOMA);
+	inet_unregister_protosw(&homa_protosw);
+	inet6_del_protocol(&homav6_protocol, IPPROTO_HOMA);
+	inet6_unregister_protosw(&homav6_protosw);
+	proto_unregister(&homa_prot);
+	proto_unregister(&homav6_prot);
+}
+
+module_init(homa_load);
+module_exit(homa_unload);
+
+/**
+ * homa_net_start() - Initialize Homa for a new network namespace.
+ * @net:    The net that Homa will be associated with.
+ * Return:  0 on success, otherwise a negative errno.
+ */
+int homa_net_start(struct net *net)
+{
+	pr_notice("Homa attaching to net namespace\n");
+	return homa_net_init(homa_net_from_net(net), net, global_homa);
+}
+
+/**
+ * homa_net_exit() - Perform Homa cleanup needed when a network namespace
+ * is destroyed.
+ * @net:    The net from which Homa should be removed.
+ */
+void homa_net_exit(struct net *net)
+{
+	pr_notice("Homa detaching from net namespace\n");
+	homa_net_destroy(homa_net_from_net(net));
+}
+
+/**
+ * homa_bind() - Implements the bind system call for Homa sockets: associates
+ * a well-known service port with a socket. Unlike other AF_INET6 protocols,
+ * there is no need to invoke this system call for sockets that are only
+ * used as clients.
+ * @sock:     Socket on which the system call was invoked.
+ * @addr:     Contains the desired port number.
+ * @addr_len: Number of bytes at @addr.
+ * Return:    0 on success, otherwise a negative errno.
+ */
+int homa_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+	union sockaddr_in_union *addr_in = (union sockaddr_in_union *)addr;
+	struct homa_sock *hsk = homa_sk(sock->sk);
+	int port = 0;
+
+	if (unlikely(addr->sa_family != sock->sk->sk_family))
+		return -EAFNOSUPPORT;
+	if (addr_in->in6.sin6_family == AF_INET6) {
+		if (addr_len < sizeof(struct sockaddr_in6))
+			return -EINVAL;
+		port = ntohs(addr_in->in4.sin_port);
+	} else if (addr_in->in4.sin_family == AF_INET) {
+		if (addr_len < sizeof(struct sockaddr_in))
+			return -EINVAL;
+		port = ntohs(addr_in->in6.sin6_port);
+	}
+	return homa_sock_bind(hsk->hnet, hsk, port);
+}
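+
+/* Usage sketch (userspace, illustrative only; not part of this patch):
+ * a server binds a Homa socket to a well-known service port before
+ * accepting requests. The port number 500 is an arbitrary example.
+ *
+ *    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_HOMA);
+ *    struct sockaddr_in addr = { .sin_family = AF_INET,
+ *                                .sin_port = htons(500) };
+ *    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
+ *        ...handle error...
+ */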
+
+/**
+ * homa_close() - Invoked when the close system call is applied to a Homa
+ * socket.
+ * @sk:      Socket being closed.
+ * @timeout: Linger time (ignored by Homa).
+ */
+void homa_close(struct sock *sk, long timeout)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+
+	homa_sock_shutdown(hsk);
+	sk_common_release(sk);
+}
+
+/**
+ * homa_shutdown() - Implements the shutdown system call for Homa sockets.
+ * @sock:    Socket to shut down.
+ * @how:     Ignored: other protocols can shut down sending and receiving
+ *           independently, but for Homa any shutdown shuts down
+ *           everything.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_shutdown(struct socket *sock, int how)
+{
+	homa_sock_shutdown(homa_sk(sock->sk));
+	return 0;
+}
+
+/**
+ * homa_ioctl() - Implements the ioctl system call for Homa sockets.
+ * @sk:    Socket on which the system call was invoked.
+ * @cmd:   Identifier for a particular ioctl operation.
+ * @karg:  Points to a kernel-space copy of the operation-specific
+ *         argument.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_ioctl(struct sock *sk, int cmd, int *karg)
+{
+	return -EINVAL;
+}
+
+/**
+ * homa_socket() - Implements the socket(2) system call for Homa sockets.
+ * @sk:    Socket on which the system call was invoked. The non-Homa
+ *         parts have already been initialized.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_socket(struct sock *sk)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+	int result;
+
+	result = homa_sock_init(hsk);
+	if (result != 0) {
+		homa_sock_shutdown(hsk);
+		homa_sock_destroy(&hsk->sock);
+	}
+	return result;
+}
+
+/**
+ * homa_setsockopt() - Implements the setsockopt system call for Homa sockets.
+ * @sk:      Socket on which the system call was invoked.
+ * @level:   Level at which the operation should be handled; will always
+ *           be IPPROTO_HOMA.
+ * @optname: Identifies a particular setsockopt operation.
+ * @optval:  Address in user space of information about the option.
+ * @optlen:  Number of bytes of data at @optval.
+ * Return:   0 on success, otherwise a negative errno.
+ */
+int homa_setsockopt(struct sock *sk, int level, int optname,
+		    sockptr_t optval, unsigned int optlen)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+	int ret;
+
+	if (level != IPPROTO_HOMA)
+		return -ENOPROTOOPT;
+
+	if (optname == SO_HOMA_RCVBUF) {
+		struct homa_rcvbuf_args args;
+
+		if (optlen != sizeof(struct homa_rcvbuf_args))
+			return -EINVAL;
+
+		if (copy_from_sockptr(&args, optval, optlen))
+			return -EFAULT;
+
+		/* Do a trivial test to make sure we can at least write the
+		 * first page of the region.
+		 */
+		if (copy_to_user(u64_to_user_ptr(args.start), &args,
+				 sizeof(args)))
+			return -EFAULT;
+
+		ret = homa_pool_set_region(hsk, u64_to_user_ptr(args.start),
+					   args.length);
+	} else if (optname == SO_HOMA_SERVER) {
+		int arg;
+
+		if (optlen != sizeof(arg))
+			return -EINVAL;
+
+		if (copy_from_sockptr(&arg, optval, optlen))
+			return -EFAULT;
+
+		if (arg)
+			hsk->is_server = true;
+		else
+			hsk->is_server = false;
+		ret = 0;
+	} else {
+		ret = -ENOPROTOOPT;
+	}
+	return ret;
+}
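+
+/* Usage sketch (userspace, illustrative only; not part of this patch):
+ * registering a receive buffer region before calling recvmsg. The region
+ * size is arbitrary, HOMA_BPAGE_SIZE is assumed to come from the uapi
+ * header, and details such as alignment requirements are omitted.
+ *
+ *    struct homa_rcvbuf_args args;
+ *    size_t size = 1000 * HOMA_BPAGE_SIZE;
+ *    void *region = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ *                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ *    args.start = (__u64)(uintptr_t)region;
+ *    args.length = size;
+ *    setsockopt(fd, IPPROTO_HOMA, SO_HOMA_RCVBUF, &args, sizeof(args));
+ */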
+
+/**
+ * homa_getsockopt() - Implements the getsockopt system call for Homa sockets.
+ * @sk:      Socket on which the system call was invoked.
+ * @level:   Selects level in the network stack to handle the request;
+ *           must be IPPROTO_HOMA.
+ * @optname: Identifies a particular getsockopt operation.
+ * @optval:  Address in user space where the option's value should be stored.
+ * @optlen:  Number of bytes available at optval; will be overwritten with
+ *           actual number of bytes stored.
+ * Return:   0 on success, otherwise a negative errno.
+ */
+int homa_getsockopt(struct sock *sk, int level, int optname,
+		    char __user *optval, int __user *optlen)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+	struct homa_rcvbuf_args rcvbuf_args;
+	int is_server;
+	void *result;
+	int len;
+
+	if (copy_from_sockptr(&len, USER_SOCKPTR(optlen), sizeof(int)))
+		return -EFAULT;
+
+	if (level != IPPROTO_HOMA)
+		return -ENOPROTOOPT;
+	if (optname == SO_HOMA_RCVBUF) {
+		if (len < sizeof(rcvbuf_args))
+			return -EINVAL;
+
+		homa_sock_lock(hsk);
+		homa_pool_get_rcvbuf(hsk->buffer_pool, &rcvbuf_args);
+		homa_sock_unlock(hsk);
+		len = sizeof(rcvbuf_args);
+		result = &rcvbuf_args;
+	} else if (optname == SO_HOMA_SERVER) {
+		if (len < sizeof(is_server))
+			return -EINVAL;
+
+		is_server = hsk->is_server;
+		len = sizeof(is_server);
+		result = &is_server;
+	} else {
+		return -ENOPROTOOPT;
+	}
+
+	if (copy_to_sockptr(USER_SOCKPTR(optlen), &len, sizeof(int)))
+		return -EFAULT;
+
+	if (copy_to_sockptr(USER_SOCKPTR(optval), result, len))
+		return -EFAULT;
+
+	return 0;
+}
+
+/**
+ * homa_sendmsg() - Send a request or response message on a Homa socket.
+ * @sk:     Socket on which the system call was invoked.
+ * @msg:    Structure describing the message to send; the msg_control
+ *          field points to additional information.
+ * @length: Number of bytes of the message.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t length)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+	struct homa_sendmsg_args args;
+	union sockaddr_in_union *addr;
+	struct homa_rpc *rpc = NULL;
+	int result = 0;
+
+	addr = (union sockaddr_in_union *)msg->msg_name;
+	if (!addr) {
+		result = -EINVAL;
+		goto error;
+	}
+
+	if (unlikely(!msg->msg_control_is_user)) {
+		result = -EINVAL;
+		goto error;
+	}
+	if (unlikely(copy_from_user(&args, (void __user *)msg->msg_control,
+				    sizeof(args)))) {
+		result = -EFAULT;
+		goto error;
+	}
+	if (args.flags & ~HOMA_SENDMSG_VALID_FLAGS ||
+	    args.reserved != 0) {
+		result = -EINVAL;
+		goto error;
+	}
+
+	if (!homa_sock_wmem_avl(hsk)) {
+		result = homa_sock_wait_wmem(hsk,
+					     msg->msg_flags & MSG_DONTWAIT);
+		if (result != 0)
+			goto error;
+	}
+
+	if (addr->sa.sa_family != sk->sk_family) {
+		result = -EAFNOSUPPORT;
+		goto error;
+	}
+	if (msg->msg_namelen < sizeof(struct sockaddr_in) ||
+	    (msg->msg_namelen < sizeof(struct sockaddr_in6) &&
+	     addr->in6.sin6_family == AF_INET6)) {
+		result = -EINVAL;
+		goto error;
+	}
+
+	if (!args.id) {
+		/* This is a request message. */
+		rpc = homa_rpc_alloc_client(hsk, addr);
+		if (IS_ERR(rpc)) {
+			result = PTR_ERR(rpc);
+			rpc = NULL;
+			goto error;
+		}
+		homa_rpc_hold(rpc);
+		if (args.flags & HOMA_SENDMSG_PRIVATE)
+			atomic_or(RPC_PRIVATE, &rpc->flags);
+		rpc->completion_cookie = args.completion_cookie;
+		result = homa_message_out_fill(rpc, &msg->msg_iter, 1);
+		if (result)
+			goto error;
+		args.id = rpc->id;
+		homa_rpc_unlock(rpc); /* Locked by homa_rpc_alloc_client. */
+
+		if (unlikely(copy_to_user((void __user *)msg->msg_control,
+					  &args, sizeof(args)))) {
+			homa_rpc_lock(rpc);
+			result = -EFAULT;
+			goto error;
+		}
+		homa_rpc_put(rpc);
+	} else {
+		/* This is a response message. */
+		struct in6_addr canonical_dest;
+
+		if (args.completion_cookie != 0) {
+			result = -EINVAL;
+			goto error;
+		}
+		canonical_dest = canonical_ipv6_addr(addr);
+
+		rpc = homa_rpc_find_server(hsk, &canonical_dest, args.id);
+		if (!rpc)
+			/* Return without an error if the RPC doesn't exist;
+			 * this could be totally valid (e.g. client is
+			 * no longer interested in it).
+			 */
+			return 0;
+		homa_rpc_hold(rpc);
+		if (rpc->error) {
+			result = rpc->error;
+			goto error;
+		}
+		if (rpc->state != RPC_IN_SERVICE) {
+			result = -EINVAL;
+			goto error_dont_end_rpc;
+		}
+		rpc->state = RPC_OUTGOING;
+
+		result = homa_message_out_fill(rpc, &msg->msg_iter, 1);
+		if (result && rpc->state != RPC_DEAD)
+			goto error;
+		homa_rpc_put(rpc);
+		homa_rpc_unlock(rpc); /* Locked by homa_rpc_find_server. */
+	}
+	return 0;
+
+error:
+	if (rpc)
+		homa_rpc_end(rpc);
+
+error_dont_end_rpc:
+	if (rpc) {
+		homa_rpc_put(rpc);
+
+		/* Locked by homa_rpc_find_server or homa_rpc_alloc_client. */
+		homa_rpc_unlock(rpc);
+	}
+	return result;
+}
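+
+/* Usage sketch (userspace, illustrative only; not part of this patch):
+ * issuing a new request. Setting args.id to 0 selects the request path
+ * above; on success the kernel writes the new RPC's id back into args.
+ * 'dest', 'request', and 'length' are assumed to be set up by the caller.
+ *
+ *    struct homa_sendmsg_args args = {};
+ *    struct iovec iov = { .iov_base = request, .iov_len = length };
+ *    struct msghdr msg = {};
+ *    msg.msg_name = &dest;
+ *    msg.msg_namelen = sizeof(dest);
+ *    msg.msg_iov = &iov;
+ *    msg.msg_iovlen = 1;
+ *    msg.msg_control = &args;
+ *    msg.msg_controllen = sizeof(args);
+ *    if (sendmsg(fd, &msg, 0) == 0)
+ *        ...args.id now identifies the outstanding RPC...
+ */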
+
+/**
+ * homa_recvmsg() - Receive a message from a Homa socket.
+ * @sk:          Socket on which the system call was invoked.
+ * @msg:         Controlling information for the receive.
+ * @len:         Total bytes of space available in msg->msg_iov; not used.
+ * @flags:       Flags from system call; only MSG_DONTWAIT is used.
+ * @addr_len:    Store the length of the sender address here
+ * Return:       The length of the message on success, otherwise a negative
+ *               errno.
+ */
+int homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
+		 int *addr_len)
+{
+	struct homa_sock *hsk = homa_sk(sk);
+	struct homa_recvmsg_args control;
+	struct homa_rpc *rpc = NULL;
+	int nonblocking;
+	int result;
+
+	if (unlikely(!msg->msg_control)) {
+		/* This test isn't strictly necessary, but it provides a
+		 * hook for testing kernel call times.
+		 */
+		return -EINVAL;
+	}
+	if (msg->msg_controllen != sizeof(control))
+		return -EINVAL;
+	if (unlikely(copy_from_user(&control, (void __user *)msg->msg_control,
+				    sizeof(control))))
+		return -EFAULT;
+	control.completion_cookie = 0;
+
+	if (control.num_bpages > HOMA_MAX_BPAGES || control.reserved != 0) {
+		result = -EINVAL;
+		goto done;
+	}
+	if (!hsk->buffer_pool) {
+		result = -EINVAL;
+		goto done;
+	}
+	result = homa_pool_release_buffers(hsk->buffer_pool, control.num_bpages,
+					   control.bpage_offsets);
+	control.num_bpages = 0;
+	if (result != 0)
+		goto done;
+
+	nonblocking = flags & MSG_DONTWAIT;
+	if (control.id != 0) {
+		rpc = homa_rpc_find_client(hsk, control.id); /* Locks RPC. */
+		if (!rpc) {
+			result = -EINVAL;
+			goto done;
+		}
+		homa_rpc_hold(rpc);
+		result = homa_wait_private(rpc, nonblocking);
+		if (result != 0) {
+			control.id = 0;
+			goto done;
+		}
+	} else {
+		rpc = homa_wait_shared(hsk, nonblocking);
+		if (IS_ERR(rpc)) {
+			/* If we get here, it means there was an error that
+			 * prevented us from finding an RPC to return. Errors
+			 * in the RPC itself are handled below.
+			 */
+			result = PTR_ERR(rpc);
+			rpc = NULL;
+			goto done;
+		}
+	}
+	result = rpc->error ? rpc->error : rpc->msgin.length;
+
+	/* Collect result information. */
+	control.id = rpc->id;
+	control.completion_cookie = rpc->completion_cookie;
+	if (likely(rpc->msgin.length >= 0)) {
+		control.num_bpages = rpc->msgin.num_bpages;
+		memcpy(control.bpage_offsets, rpc->msgin.bpage_offsets,
+		       sizeof(rpc->msgin.bpage_offsets));
+	}
+	if (sk->sk_family == AF_INET6) {
+		struct sockaddr_in6 *in6 = msg->msg_name;
+
+		in6->sin6_family = AF_INET6;
+		in6->sin6_port = htons(rpc->dport);
+		in6->sin6_addr = rpc->peer->addr;
+		*addr_len = sizeof(*in6);
+	} else {
+		struct sockaddr_in *in4 = msg->msg_name;
+
+		in4->sin_family = AF_INET;
+		in4->sin_port = htons(rpc->dport);
+		in4->sin_addr.s_addr = ipv6_to_ipv4(rpc->peer->addr);
+		*addr_len = sizeof(*in4);
+	}
+
+	/* This indicates that the application now owns the buffers, so
+	 * we won't free them in homa_rpc_end.
+	 */
+	rpc->msgin.num_bpages = 0;
+
+	if (homa_is_client(rpc->id)) {
+		homa_peer_add_ack(rpc);
+		homa_rpc_end(rpc);
+	} else {
+		if (result < 0)
+			homa_rpc_end(rpc);
+		else
+			rpc->state = RPC_IN_SERVICE;
+	}
+
+done:
+	/* Note: must release the RPC lock before calling homa_rpc_reap
+	 * or copying results to user space.
+	 */
+	if (rpc) {
+		homa_rpc_put(rpc);
+
+		/* Locked by homa_rpc_find_client or homa_wait_shared. */
+		homa_rpc_unlock(rpc);
+	}
+
+	if (test_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags)) {
+		/* There are tasks waiting for tx memory, so reap
+		 * immediately.
+		 */
+		homa_rpc_reap(hsk, false);
+	}
+
+	if (unlikely(copy_to_user((__force void __user *)msg->msg_control,
+				  &control, sizeof(control))))
+		result = -EFAULT;
+
+	return result;
+}
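+
+/* Usage sketch (userspace, illustrative only; not part of this patch):
+ * waiting for the next incoming message after SO_HOMA_RCVBUF has been
+ * configured. With control.id set to 0 this returns any shared RPC; the
+ * message contents are left in the receive buffer region at the offsets
+ * returned in control.bpage_offsets, and passing the same control struct
+ * to a later recvmsg call releases those bpages.
+ *
+ *    struct homa_recvmsg_args control = {};
+ *    struct sockaddr_in src;
+ *    struct msghdr msg = {};
+ *    msg.msg_name = &src;
+ *    msg.msg_namelen = sizeof(src);
+ *    msg.msg_control = &control;
+ *    msg.msg_controllen = sizeof(control);
+ *    ssize_t length = recvmsg(fd, &msg, 0);
+ */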
+
+/**
+ * homa_hash() - Not needed for Homa.
+ * @sk:    Socket for the operation
+ * Return: Always 0.
+ */
+int homa_hash(struct sock *sk)
+{
+	return 0;
+}
+
+/**
+ * homa_unhash() - Not needed for Homa.
+ * @sk:    Socket for the operation
+ */
+void homa_unhash(struct sock *sk)
+{
+}
+
+/**
+ * homa_softirq() - This function is invoked at SoftIRQ level to handle
+ * incoming packets.
+ * @skb:   The incoming packet.
+ * Return: Always 0
+ */
+int homa_softirq(struct sk_buff *skb)
+{
+	struct sk_buff *packets, *other_pkts, *next;
+	struct sk_buff **prev_link, **other_link;
+	struct homa_common_hdr *h;
+	int header_offset;
+
+	/* skb may actually contain many distinct packets, linked through
+	 * skb_shinfo(skb)->frag_list by the Homa GRO mechanism. Make a
+	 * pass through the list to process all of the short packets,
+	 * leaving the longer packets in the list. Also, perform various
+	 * prep/cleanup/error checks.
+	 */
+	skb->next = skb_shinfo(skb)->frag_list;
+	skb_shinfo(skb)->frag_list = NULL;
+	packets = skb;
+	prev_link = &packets;
+	for (skb = packets; skb; skb = next) {
+		next = skb->next;
+
+		/* Make the header available at skb->data, even if the packet
+		 * is fragmented. One complication: it's possible that the IP
+		 * header hasn't yet been removed (this happens for GRO packets
+		 * on the frag_list, since they aren't handled explicitly by IP).
+		 */
+		if (!homa_make_header_avl(skb))
+			goto discard;
+		header_offset = skb_transport_header(skb) - skb->data;
+		if (header_offset)
+			__skb_pull(skb, header_offset);
+
+		/* Reject packets that are too short or have bogus types. */
+		h = (struct homa_common_hdr *)skb->data;
+		if (unlikely(skb->len < sizeof(struct homa_common_hdr) ||
+			     h->type < DATA || h->type > MAX_OP ||
+			     skb->len < header_lengths[h->type - DATA]))
+			goto discard;
+
+		/* Process the packet now if it is a control packet or
+		 * if it contains an entire short message.
+		 */
+		if (h->type != DATA || ntohl(((struct homa_data_hdr *)h)
+				->message_length) < 1400) {
+			*prev_link = skb->next;
+			skb->next = NULL;
+			homa_dispatch_pkts(skb);
+		} else {
+			prev_link = &skb->next;
+		}
+		continue;
+
+discard:
+		*prev_link = skb->next;
+		kfree_skb(skb);
+	}
+
+	/* Now process the longer packets. Each iteration of this loop
+	 * collects all of the packets for a particular RPC and dispatches
+	 * them (batching the packets for an RPC allows more efficient
+	 * generation of grants).
+	 */
+	while (packets) {
+		struct in6_addr saddr, saddr2;
+		struct homa_common_hdr *h2;
+		struct sk_buff *skb2;
+
+		skb = packets;
+		prev_link = &skb->next;
+		saddr = skb_canonical_ipv6_saddr(skb);
+		other_pkts = NULL;
+		other_link = &other_pkts;
+		h = (struct homa_common_hdr *)skb->data;
+		for (skb2 = skb->next; skb2; skb2 = next) {
+			next = skb2->next;
+			h2 = (struct homa_common_hdr *)skb2->data;
+			if (h2->sender_id == h->sender_id) {
+				saddr2 = skb_canonical_ipv6_saddr(skb2);
+				if (ipv6_addr_equal(&saddr, &saddr2)) {
+					*prev_link = skb2;
+					prev_link = &skb2->next;
+					continue;
+				}
+			}
+			*other_link = skb2;
+			other_link = &skb2->next;
+		}
+		*prev_link = NULL;
+		*other_link = NULL;
+		homa_dispatch_pkts(packets);
+		packets = other_pkts;
+	}
+
+	return 0;
+}
+
+/**
+ * homa_err_handler_v4() - Invoked by IP to handle an incoming error
+ * packet, such as ICMP UNREACHABLE.
+ * @skb:    The incoming packet; skb->data points to the byte just after
+ *          the ICMP header (the first byte of the embedded packet IP header).
+ * @skb:   The incoming packet.
+ * @info:  Information about the error that occurred?
+ *
+ * Return: zero, or a negative errno if the error couldn't be handled here.
+ */
+int homa_err_handler_v4(struct sk_buff *skb, u32 info)
+{
+	const struct icmphdr *icmp = icmp_hdr(skb);
+	struct homa *homa = homa_from_skb(skb);
+	struct in6_addr daddr;
+	int type = icmp->type;
+	int code = icmp->code;
+	struct iphdr *iph;
+	int error = 0;
+	int port = 0;
+
+	iph = (struct iphdr *)(skb->data);
+	ipv6_addr_set_v4mapped(iph->daddr, &daddr);
+	if (type == ICMP_DEST_UNREACH && code == ICMP_PORT_UNREACH) {
+		struct homa_common_hdr *h = (struct homa_common_hdr *)(skb->data
+				+ iph->ihl * 4);
+
+		port = ntohs(h->dport);
+		error = -ENOTCONN;
+	} else if (type == ICMP_DEST_UNREACH) {
+		if (code == ICMP_PROT_UNREACH)
+			error = -EPROTONOSUPPORT;
+		else
+			error = -EHOSTUNREACH;
+	} else {
+		pr_notice("%s invoked with info %x, ICMP type %d, ICMP code %d\n",
+			  __func__, info, type, code);
+	}
+	if (error != 0)
+		homa_abort_rpcs(homa, &daddr, port, error);
+	return 0;
+}
+
+/**
+ * homa_err_handler_v6() - Invoked by IP to handle an incoming error
+ * packet, such as ICMP UNREACHABLE.
+ * @skb:    The incoming packet; skb->data points to the byte just after
+ *          the ICMP header (the first byte of the embedded packet IP header).
+ * @opt:    Not used.
+ * @type:   Type of ICMP packet.
+ * @code:   Additional information about the error.
+ * @offset: Not used.
+ * @info:   Additional information about the error; its interpretation
+ *          depends on the ICMP type and code.
+ *
+ * Return: zero, or a negative errno if the error couldn't be handled here.
+ */
+int homa_err_handler_v6(struct sk_buff *skb, struct inet6_skb_parm *opt,
+			u8 type,  u8 code,  int offset,  __be32 info)
+{
+	const struct ipv6hdr *iph = (const struct ipv6hdr *)skb->data;
+	struct homa *homa = homa_from_skb(skb);
+	int error = 0;
+	int port = 0;
+
+	if (type == ICMPV6_DEST_UNREACH && code == ICMPV6_PORT_UNREACH) {
+		const struct homa_common_hdr *h;
+
+		h = (struct homa_common_hdr *)(skb->data + sizeof(*iph));
+		port = ntohs(h->dport);
+		error = -ENOTCONN;
+	} else if (type == ICMPV6_DEST_UNREACH && code == ICMPV6_ADDR_UNREACH) {
+		error = -EHOSTUNREACH;
+	} else if (type == ICMPV6_PARAMPROB && code == ICMPV6_UNK_NEXTHDR) {
+		error = -EPROTONOSUPPORT;
+	}
+	if (error != 0)
+		homa_abort_rpcs(homa, &iph->daddr, port, error);
+	return 0;
+}
+
+/**
+ * homa_poll() - Invoked by Linux as part of implementing select, poll,
+ * epoll, etc.
+ * @file:  Open file that is participating in a poll, select, etc.
+ * @sock:  A Homa socket, associated with @file.
+ * @wait:  This table will be registered with the socket, so that it
+ *         is notified when the socket's ready state changes.
+ *
+ * Return: A mask of bits such as EPOLLIN, which indicate the current
+ *         state of the socket.
+ */
+__poll_t homa_poll(struct file *file, struct socket *sock,
+		   struct poll_table_struct *wait)
+{
+	struct homa_sock *hsk = homa_sk(sock->sk);
+	__poll_t mask;
+
+	mask = 0;
+	sock_poll_wait(file, sock, wait);
+	if (homa_sock_wmem_avl(hsk))
+		mask |= EPOLLOUT | EPOLLWRNORM;
+	else
+		set_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
+
+	if (hsk->shutdown)
+		mask |= EPOLLIN;
+
+	if (!list_empty(&hsk->ready_rpcs))
+		mask |= EPOLLIN | EPOLLRDNORM;
+	return mask;
+}
+
+/**
+ * homa_hrtimer() - This function is invoked by the hrtimer mechanism to
+ * wake up the timer thread. Runs at IRQ level.
+ * @timer:   The timer that triggered; not used.
+ *
+ * Return:   Always HRTIMER_NORESTART.
+ */
+enum hrtimer_restart homa_hrtimer(struct hrtimer *timer)
+{
+	wake_up_process(timer_kthread);
+	return HRTIMER_NORESTART;
+}
+
+/**
+ * homa_timer_main() - Top-level function for the timer thread.
+ * @transport:  Pointer to struct homa.
+ *
+ * Return:         Always 0.
+ */
+int homa_timer_main(void *transport)
+{
+	struct homa *homa = (struct homa *)transport;
+	ktime_t tick_interval;
+	u64 nsec;
+
+	hrtimer_setup(&hrtimer, homa_hrtimer, CLOCK_MONOTONIC,
+		      HRTIMER_MODE_REL);
+	nsec = 1000000;                   /* 1 ms */
+	tick_interval = ns_to_ktime(nsec);
+	while (1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!timer_thread_exit) {
+			hrtimer_start(&hrtimer, tick_interval,
+				      HRTIMER_MODE_REL);
+			schedule();
+		}
+		__set_current_state(TASK_RUNNING);
+		if (timer_thread_exit)
+			break;
+		homa_timer(homa);
+	}
+	hrtimer_cancel(&hrtimer);
+	kthread_complete_and_exit(&timer_thread_done, 0);
+	return 0;
+}
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_AUTHOR("John Ousterhout <ouster@cs.stanford.edu>");
+MODULE_DESCRIPTION("Homa transport protocol");
+MODULE_VERSION("1.0");
+
+/* Arrange for this module to be loaded automatically when a Homa socket is
+ * opened. Apparently symbols don't work in the macros below, so we must use
+ * numeric values for IPPROTO_HOMA (146) and SOCK_DGRAM(2).
+ */
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_INET, 146, 2);
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_INET6, 146, 2);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH net-next v15 15/15] net: homa: create Makefile and Kconfig
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (13 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c John Ousterhout
@ 2025-08-18 20:55 ` John Ousterhout
  2025-08-23  5:36   ` kernel test robot
  2025-08-22 15:51 ` [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
  15 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-18 20:55 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout

Before this commit the Homa code is "inert": it won't be compiled
in kernel builds. This commit adds Homa's Makefile and Kconfig, and
also links Homa into net/Makefile and net/Kconfig, so that Homa
will be built during kernel builds if enabled (it is disabled by
default).
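
As an illustration (not part of the patch itself), Homa can then be
built as a loadable module by setting the following in the kernel
.config:

  CONFIG_HOMA=m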

Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
 net/Kconfig       |  1 +
 net/Makefile      |  1 +
 net/homa/Kconfig  | 21 +++++++++++++++++++++
 net/homa/Makefile | 16 ++++++++++++++++
 4 files changed, 39 insertions(+)
 create mode 100644 net/homa/Kconfig
 create mode 100644 net/homa/Makefile

diff --git a/net/Kconfig b/net/Kconfig
index d5865cf19799..92972ff2a78d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -250,6 +250,7 @@ source "net/bridge/netfilter/Kconfig"
 endif # if NETFILTER
 
 source "net/sctp/Kconfig"
+source "net/homa/Kconfig"
 source "net/rds/Kconfig"
 source "net/tipc/Kconfig"
 source "net/atm/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index aac960c41db6..71f740e0dc34 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -43,6 +43,7 @@ ifneq ($(CONFIG_VLAN_8021Q),)
 obj-y				+= 8021q/
 endif
 obj-$(CONFIG_IP_SCTP)		+= sctp/
+obj-$(CONFIG_HOMA)		+= homa/
 obj-$(CONFIG_RDS)		+= rds/
 obj-$(CONFIG_WIRELESS)		+= wireless/
 obj-$(CONFIG_MAC80211)		+= mac80211/
diff --git a/net/homa/Kconfig b/net/homa/Kconfig
new file mode 100644
index 000000000000..16fec3fd52ba
--- /dev/null
+++ b/net/homa/Kconfig
@@ -0,0 +1,21 @@
+# SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+
+#
+# Homa transport protocol
+#
+
+menuconfig HOMA
+	tristate "The Homa transport protocol"
+	depends on INET
+	depends on IPV6
+
+	help
+	  Homa is a network transport protocol for communication within
+	  a datacenter. It provides significantly lower latency than TCP,
+	  particularly for workloads containing a mixture of large and small
+	  messages operating at high network utilization. At present, Homa
+	  has been only partially upstreamed; this version provides bare-bones
+	  functionality but is not performant. For more information, see the
+	  homa(7) man page or check out the Homa Wiki at
+	  https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview.
+
+	  If unsure, say N.
diff --git a/net/homa/Makefile b/net/homa/Makefile
new file mode 100644
index 000000000000..a7ebccd4b56c
--- /dev/null
+++ b/net/homa/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+
+#
+# Makefile for the Linux implementation of the Homa transport protocol.
+
+obj-$(CONFIG_HOMA) := homa.o
+homa-y:=        homa_incoming.o \
+		homa_interest.o \
+		homa_outgoing.o \
+		homa_pacer.o \
+		homa_peer.o \
+		homa_plumbing.o \
+		homa_pool.o \
+		homa_rpc.o \
+		homa_sock.o \
+		homa_timer.o \
+		homa_utils.o
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol
  2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
                   ` (14 preceding siblings ...)
  2025-08-18 20:55 ` [PATCH net-next v15 15/15] net: homa: create Makefile and Kconfig John Ousterhout
@ 2025-08-22 15:51 ` John Ousterhout
  15 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-22 15:51 UTC (permalink / raw)
  To: netdev; +Cc: pabeni, edumazet, horms, kuba

This patch series appears to be stuck in limbo: I have not received
any comments since the v9 patch in early June. Is there anything I can
do to move this series towards closure?

-John-


On Mon, Aug 18, 2025 at 1:56 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> This patch series begins the process of upstreaming the Homa transport
> protocol. Homa is an alternative to TCP for use in datacenter
> environments. It provides 10-100x reductions in tail latency for short
> messages relative to TCP. Its benefits are greatest for mixed workloads
> containing both short and long messages running under high network loads.
> Homa is not API-compatible with TCP: it is connectionless and message-
> oriented (but still reliable and flow-controlled). Homa's new API not
> only contributes to its performance gains, but it also eliminates the
> massive amount of connection state required by TCP for highly connected
> datacenter workloads (Homa uses ~ 1 socket per application, whereas
> TCP requires a separate socket for each peer).
>
> For more details on Homa, please consult the Homa Wiki:
> https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
> The Wiki has pointers to two papers on Homa (one of which describes
> this implementation) as well as man pages describing the application
> API and other information.
>
> There is also a GitHub repo for Homa:
> https://github.com/PlatformLab/HomaModule
> The GitHub repo contains a superset of this patch set, including:
> * Additional source code that will eventually be upstreamed
> * Extensive unit tests (which will also be upstreamed eventually)
> * Application-level library functions (which need to go in glibc?)
> * Man pages (which need to be upstreamed as well)
> * Benchmarking and instrumentation code
>
> For this patch series, Homa has been stripped down to the bare minimum
> functionality capable of actually executing remote procedure calls. (about
> 8000 lines of source code, compared to 15000 in the complete Homa). The
> remaining code will be upstreamed in smaller batches once this patch
> series has been accepted. Note: the code in this patch series is
> functional but its performance is not very interesting (about the same
> as TCP).
>
> The patch series is arranged to introduce the major functional components
> of Homa. Until the last patch has been applied, the code is inert (it
> will not be compiled).
>
> Note: this implementation of Homa supports both IPv4 and IPv6.
>
> Changes for v15:
> * This series is a resubmit of the v14 series to repair broken Author
>   email addresses in the commits. There are no other changes.
>
> Changes for v14:
> * There were no comments on the v13 patch series.
> * Fix a couple of bugs and clean up a few APIs (see individual patches for
>   details).
>
> Changes for v13:
> * Modify all files to include GPL-2.0+ as an option in the SPDX license line
> * Fix a couple of bugs in homa_outgoing.c and one bug in homa_plumbing.c
>
> Major changes for v12:
> * There were no comments on the v11 patch series, so there are no major
>   changes in this version. See individual patch files for a few small
>   local changes.
>
> Major changes for v11 (see individual patches for additional details):
> * There were no comments on the v10 patch series, so there are not many
>   changes in this version
> * Rework the mechanism for waking up RPCs that stalled waiting for
>   buffer pool space (the old approach deprioritized waking RPCs, which
>   led to starvation and server overload).
> * Cleanup and simplify use of RPC reference counts. Before, references were
>   only acquired to bridge gaps in lock ownership; this was complicated and
>   error-prone. Now, reference counts are acquired at the "top level" when
>   an RPC is selected for working on. Any function that receives a homa_rpc as
>   argument can assume it is protected with a reference.
> * Clean up sparse annotations (use name of lock variable, not address)
>
> Major changes for v10 (see individual patches for additional details):
> - Refactor resend mechanism: consolidate code for sending RESEND packets
>   in new function homa_request_retrans (simplifies homa_timer.c); a few
>   bug fixes (updating "granted" field in homa_resend_pkt, etc.)
> - Revise sparse annotations to eliminate __context__ definition
> - Use the destroy function from struct proto properly (fixes races in
>   socket cleanup)
>
> Major changes for v9 (see individual patches for additional details):
> - Introduce homa_net objects; there is now a single global struct homa
>   shared by all network namespaces, with one homa_net per network namespace
>   with netns-specific information. Most info, including socket table and
>   peer table, is stored in the struct homa.
> - Introduce homa_clock as an abstraction layer for the fine-grain clock.
> - Implement limits on the number of active homa_peer objects. This includes
>   adding reference counts in homa_peers and adding code to release peers
>   where there are too many.
> - Switch to using rhashtable to store homa_peers; the table is shared
>   across all network namespaces, though individual peers are namespace-
>   specific.
>
> v8 changes:
> - There were no reviews of the v7 patch series, so there are not many changes
>   in this version
> - Pull out pacer code into separate files pacer.h and pacer.c
> - Refactor homa_pool APIs (move allocation/deallocation into homa_pool.c,
>   move locking responsibility out)
> - Fix various problems from sparse, checkpatch, and kernel-doc
>
> v7 changes:
> - Add documentation files reap.txt and sync.txt.
> - Replace __u64 with _u64 (and __s64 with s64) in non-uapi settings.
> - Replace '__aligned(L1_CACHE_BYTES)' with '____cacheline_aligned_in_smp'.
> - Use alloc_percpu_gfp for homa_pool::cores.
> - Extract bool homa_bpage_available from homa_pool_get_pages.
> - Rename homa_rpc_free to homa_rpc_end.
> - Use skb_queue_purge in homa_rpc_reap instead of hand-coding.
> - Clean up RCU usage in several places:
>   - Eliminate unnecessary use of RCU for homa_sock::dead_rpcs.
>   - Eliminate use of RCU for homa::throttled_rpcs (unnecessary, unclear
>     that it would have worked). Added return value from homa_pacer_xmit.
>   - Call rcu_read_lock/unlock in homa_peer_find (just to be safe; probably
>     isn't necessary)
>   - Eliminate extraneous use of RCU in homa_pool_allocate.
>   - Cleaned up RCU usage around homa_sock::active_rpcs.
>   - Change homa_sock_find to take a reference on the returned socket;
>     caller no longer has to worry about RCU issues.
> - Remove "locker" arguments from homa_lock_rpc, homa_lock_sock,
>   homa_rpc_try_lock, and homa_bucket_lock (shouldn't be needed, given
>   CONFIG_PROVE_LOCKING).
> - Use __GFP_ZERO in *alloc calls instead of initializing individual
>   struct fields to zero.
> - Don't use raw_smp_processor_id; use smp_processor_id instead.
> - Remove homa_peertab_get_peers from this patch series (and also fix
>   problems in it related to RCU usage).
> - Add annotation to homa_peertab_gc_dsts requiring write_lock.
> - Remove "lock_slow" functions, which don't add functionality in this patch
>   series.
> - Remove unused fields from homa_peer structs.
> - Reorder fields in homa_rpc_bucket to squeeze out padding.
> - Refactor homa_sock_start_scan etc.
>   - Take a reference on the current socket to keep it from being freed.
>   - No need now for homa_socktab::active_scans or struct homa_socktab_links.
>   - rcu_read_lock/unlock is now entirely in the homa_sock scan methods;
>     no need for callers to worry about this.
> - Add homa_rpc_hold and homa_rpc_put. Replaces several ad-hoc mechanisms,
>   such as RPC_COPYING_FROM_USER and RPC_COPYING_TO_USER, with a single
>   general-purpose mechanism.
> - Use __skb_queue_purge instead of skb_queue_purge (locking isn't needed
>   because Homa has its own locks).
> - Rename UNKNOWN packet type to RPC_UNKNOWN.
> - Add hsk->is_server plus SO_HOMA_SERVER setsockopt: by default, sockets
>   will not accept incoming RPCs unless they have been bound.
> - Refactor waiting mechanism for incoming packets: simplify wait
>   criteria and use standard mechanisms (wait_event_*) for blocking
>   threads. Create homa_interest.c and homa_interest.h.
> * Add memory accounting for outbound messages (e.g. new sysctl value
>   wmem_max); senders now block when memory limit is exceeded.
> * Made Homa a pernet subsystem (a separate Homa transport for each
>   network namespace).
>
> v6 changes:
> - Make hrtimer variable in homa_timer_main static instead of stack-allocated
>   (avoids complaints when in debug mode).
> - Remove unnecessary cast in homa_dst_refresh.
> - Replace erroneous uses of GFP_KERNEL with GFP_ATOMIC.
> - Check for "all ports in use" in homa_sock_init.
> - Refactor API for homa_rpc_reap to incorporate "reap all" feature,
>   eliminate need for callers to specify exact amount of work to do
>   when in "reap a few" mode.
> - Fix bug in homa_rpc_reap (wasn't resetting rx_frees for each iteration
>   of outer loop).
>
> v5 changes:
> - Change type of start in struct homa_rcvbuf_args from void* to __u64;
>   also add more __user annotations.
> - Refactor homa_interest: replace awkward ready_rpc field with two
>   fields: rpc and rpc_ready. Added new functions homa_interest_get_rpc
>   and homa_interest_set_rpc to encapsulate/clarify access to
>   interest->rpc_ready.
> - Eliminate use of LIST_POISON1 etc. in homa_interests (use list_del_init
>   instead of list_del).
> - Remove homa_next_skb function, which is obsolete, unused, and incorrect
> - Eliminate ipv4_to_ipv6 function (use ipv6_addr_set_v4mapped instead)
> - Eliminate is_mapped_ipv4 function (use ipv6_addr_v4mapped instead)
> - Use __u64 instead of uint64_t in homa.h
> - Remove 'extern "C"' from homa.h
> - Various fixes from patchwork checks (checkpatch.pl, etc.)
> - A few improvements to comments
>
> v4 changes:
> - Remove sport argument for homa_find_server_rpc (unneeded). Also
>   remove client_port field from struct homa_ack
> - Refactor ICMP packet handling (v6 was incorrect)
> - Check for socket shutdown in homa_poll
> - Fix potential for memory garbling in homa_symbol_for_type
> - Remove unused ETHERNET_MAX_PAYLOAD declaration
> - Rename classes in homa_wire.h so they all have "homa_" prefixes
> - Various fixes from patchwork checks (checkpatch.pl, etc.)
> - A few improvements to comments
>
> v3 changes:
> - Fix formatting in Kconfig
> - Set ipv6_pinfo_offset in struct proto
> - Check return value of inet6_register_protosw
> - In homa_load cleanup, don't cleanup things that haven't been
>   initialized
> - Add MODULE_ALIAS_NET_PF_PROTO_TYPE to auto-load module
> - Check return value from kzalloc call in homa_sock_init
> - Change SO_HOMA_SET_BUF to SO_HOMA_RCVBUF
> - Change struct homa_set_buf_args to struct homa_rcvbuf_args
> - Implement getsockopt for SO_HOMA_RCVBUF
> - Return ENOPROTOOPT instead of EINVAL where appropriate in
>   setsockopt and getsockopt
> - Fix crash in homa_pool_check_waiting if pool has no region yet
> - Check for NULL msg->msg_name in homa_sendmsg
> - Change addr->in6.sin6_family to addr->sa.sa_family in homa_sendmsg
>   for clarity
> - For some errors in homa_recvmsg, return directly rather than "goto done"
> - Return error from recvmsg if offsets of returned read buffers are bogus
> - Added comments to clarify lock-unlock pairs for RPCs
> - Renamed homa_try_bucket_lock to homa_try_rpc_lock
> - Fix issues found by test robot and checkpatch.pl
> - Ensure first argument to do_div is 64 bits
> - Remove C++ style comments
> - Removed some code that will only be relevant in future patches that
>   fill in missing Homa functionality
>
> v2 changes:
> - Remove sockaddr_in_union declaration from public API in homa.h
> - Remove kernel wrapper functions (homa_send, etc.) from homa.h
> - Fix many sparse warnings (still more work to do here) and other issues
>   uncovered by test robot
> - Fix checkpatch.pl issues
> - Remove residual code related to unit tests
> - Remove references to tt_record from comments
> - Make it safe to delete sockets during homa_socktab scans
> - Use uintptr_t for portability fo 32-bit platforms
> - Use do_div instead of "/" for portability
> - Remove homa->busy_usecs and homa->gro_busy_usecs (not needed in
>   this stripped down version of Homa)
> - Eliminate usage of cpu_khz, use sched_clock instead of get_cycles
> - Add missing checks of kmalloc return values
> - Remove "inline" qualifier from functions in .c files
> - Document that pad fields must be zero
> - Use more precise type "uint32_t" rather than "int"
> - Remove unneeded #include of linux/version.h
>
> John Ousterhout (15):
>   net: homa: define user-visible API for Homa
>   net: homa: create homa_wire.h
>   net: homa: create shared Homa header files
>   net: homa: create homa_pool.h and homa_pool.c
>   net: homa: create homa_peer.h and homa_peer.c
>   net: homa: create homa_sock.h and homa_sock.c
>   net: homa: create homa_interest.h and homa_interest.c
>   net: homa: create homa_pacer.h and homa_pacer.c
>   net: homa: create homa_rpc.h and homa_rpc.c
>   net: homa: create homa_outgoing.c
>   net: homa: create homa_utils.c
>   net: homa: create homa_incoming.c
>   net: homa: create homa_timer.c
>   net: homa: create homa_plumbing.c
>   net: homa: create Makefile and Kconfig
>
>  include/uapi/linux/homa.h |  158 ++++++
>  net/Kconfig               |    1 +
>  net/Makefile              |    1 +
>  net/homa/Kconfig          |   21 +
>  net/homa/Makefile         |   16 +
>  net/homa/homa_impl.h      |  703 +++++++++++++++++++++++
>  net/homa/homa_incoming.c  |  886 +++++++++++++++++++++++++++++
>  net/homa/homa_interest.c  |  114 ++++
>  net/homa/homa_interest.h  |   93 +++
>  net/homa/homa_outgoing.c  |  599 ++++++++++++++++++++
>  net/homa/homa_pacer.c     |  303 ++++++++++
>  net/homa/homa_pacer.h     |  173 ++++++
>  net/homa/homa_peer.c      |  595 ++++++++++++++++++++
>  net/homa/homa_peer.h      |  373 +++++++++++++
>  net/homa/homa_plumbing.c  | 1118 +++++++++++++++++++++++++++++++++++++
>  net/homa/homa_pool.c      |  483 ++++++++++++++++
>  net/homa/homa_pool.h      |  136 +++++
>  net/homa/homa_rpc.c       |  638 +++++++++++++++++++++
>  net/homa/homa_rpc.h       |  501 +++++++++++++++++
>  net/homa/homa_sock.c      |  432 ++++++++++++++
>  net/homa/homa_sock.h      |  408 ++++++++++++++
>  net/homa/homa_stub.h      |   91 +++
>  net/homa/homa_timer.c     |  136 +++++
>  net/homa/homa_utils.c     |  122 ++++
>  net/homa/homa_wire.h      |  345 ++++++++++++
>  25 files changed, 8446 insertions(+)
>  create mode 100644 include/uapi/linux/homa.h
>  create mode 100644 net/homa/Kconfig
>  create mode 100644 net/homa/Makefile
>  create mode 100644 net/homa/homa_impl.h
>  create mode 100644 net/homa/homa_incoming.c
>  create mode 100644 net/homa/homa_interest.c
>  create mode 100644 net/homa/homa_interest.h
>  create mode 100644 net/homa/homa_outgoing.c
>  create mode 100644 net/homa/homa_pacer.c
>  create mode 100644 net/homa/homa_pacer.h
>  create mode 100644 net/homa/homa_peer.c
>  create mode 100644 net/homa/homa_peer.h
>  create mode 100644 net/homa/homa_plumbing.c
>  create mode 100644 net/homa/homa_pool.c
>  create mode 100644 net/homa/homa_pool.h
>  create mode 100644 net/homa/homa_rpc.c
>  create mode 100644 net/homa/homa_rpc.h
>  create mode 100644 net/homa/homa_sock.c
>  create mode 100644 net/homa/homa_sock.h
>  create mode 100644 net/homa/homa_stub.h
>  create mode 100644 net/homa/homa_timer.c
>  create mode 100644 net/homa/homa_utils.c
>  create mode 100644 net/homa/homa_wire.h
>
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 15/15] net: homa: create Makefile and Kconfig
  2025-08-18 20:55 ` [PATCH net-next v15 15/15] net: homa: create Makefile and Kconfig John Ousterhout
@ 2025-08-23  5:36   ` kernel test robot
  0 siblings, 0 replies; 47+ messages in thread
From: kernel test robot @ 2025-08-23  5:36 UTC (permalink / raw)
  To: John Ousterhout, netdev
  Cc: llvm, oe-kbuild-all, pabeni, edumazet, horms, kuba,
	John Ousterhout

Hi John,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Ousterhout/net-homa-define-user-visible-API-for-Homa/20250819-050052
base:   net-next/main
patch link:    https://lore.kernel.org/r/20250818205551.2082-16-ouster%40cs.stanford.edu
patch subject: [PATCH net-next v15 15/15] net: homa: create Makefile and Kconfig
config: um-allmodconfig (https://download.01.org/0day-ci/archive/20250823/202508231353.SNg0KuDi-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250823/202508231353.SNg0KuDi-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508231353.SNg0KuDi-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from net/homa/homa_incoming.c:5:
   In file included from net/homa/homa_impl.h:13:
   In file included from include/linux/icmp.h:16:
   In file included from include/linux/skbuff.h:17:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:12:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:1175:55: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
    1175 |         return (port > MMIO_UPPER_LIMIT) ? NULL : PCI_IOBASE + port;
         |                                                   ~~~~~~~~~~ ^
   In file included from net/homa/homa_incoming.c:5:
   In file included from net/homa/homa_impl.h:33:
>> arch/x86/include/asm/tsc.h:70:28: error: typedef redefinition with different types ('unsigned long long' vs 'unsigned long')
      70 | typedef unsigned long long cycles_t;
         |                            ^
   include/asm-generic/timex.h:8:23: note: previous definition is here
       8 | typedef unsigned long cycles_t;
         |                       ^
   In file included from net/homa/homa_incoming.c:5:
   In file included from net/homa/homa_impl.h:33:
>> arch/x86/include/asm/tsc.h:77:24: error: redefinition of 'get_cycles'
      77 | static inline cycles_t get_cycles(void)
         |                        ^
   include/asm-generic/timex.h:10:24: note: previous definition is here
      10 | static inline cycles_t get_cycles(void)
         |                        ^
   In file included from net/homa/homa_incoming.c:5:
   In file included from net/homa/homa_impl.h:33:
>> arch/x86/include/asm/tsc.h:80:7: error: call to undeclared function 'DISABLED_MASK_BIT_SET'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
      80 |             !cpu_feature_enabled(X86_FEATURE_TSC))
         |              ^
   arch/um/include/asm/cpufeature.h:52:32: note: expanded from macro 'cpu_feature_enabled'
      52 |         (__builtin_constant_p(bit) && DISABLED_MASK_BIT_SET(bit) ? 0 : static_cpu_has(bit))
         |                                       ^
   1 warning and 3 errors generated.


vim +70 arch/x86/include/asm/tsc.h

288a4ff0ad29d1 arch/x86/include/asm/tsc.h Xin Li (Intel             2025-05-02  66) 
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  67  /*
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  68   * Standard way to access the cycle counter.
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  69   */
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06 @70  typedef unsigned long long cycles_t;
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  71  
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  72  extern unsigned int cpu_khz;
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  73  extern unsigned int tsc_khz;
73018a66e70fa6 include/asm-x86/tsc.h      Glauber de Oliveira Costa 2008-01-30  74  
73018a66e70fa6 include/asm-x86/tsc.h      Glauber de Oliveira Costa 2008-01-30  75  extern void disable_TSC(void);
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  76  
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06 @77  static inline cycles_t get_cycles(void)
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  78  {
3bd4abc07a267e arch/x86/include/asm/tsc.h Jason A. Donenfeld        2022-04-08  79  	if (!IS_ENABLED(CONFIG_X86_TSC) &&
3bd4abc07a267e arch/x86/include/asm/tsc.h Jason A. Donenfeld        2022-04-08 @80  	    !cpu_feature_enabled(X86_FEATURE_TSC))
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  81  		return 0;
4ea1636b04dbd6 arch/x86/include/asm/tsc.h Andy Lutomirski           2015-06-25  82  	return rdtsc();
6d63de8dbcda98 include/asm-x86/tsc.h      Andi Kleen                2008-01-30  83  }
3bd4abc07a267e arch/x86/include/asm/tsc.h Jason A. Donenfeld        2022-04-08  84  #define get_cycles get_cycles
2272b0e03ea573 include/asm-i386/tsc.h     Andres Salomon            2007-03-06  85  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 03/15] net: homa: create shared Homa header files
  2025-08-18 20:55 ` [PATCH net-next v15 03/15] net: homa: create shared Homa header files John Ousterhout
@ 2025-08-26  9:05   ` Paolo Abeni
  2025-08-26 23:10     ` John Ousterhout
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-26  9:05 UTC (permalink / raw)
  To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/**
> + * struct homa_net - Contains Homa information that is specific to a
> + * particular network namespace.
> + */
> +struct homa_net {
> +	/** @net: Network namespace corresponding to this structure. */
> +	struct net *net;
> +
> +	/** @homa: Global Homa information. */
> +	struct homa *homa;

It's not clear why the above 2 fields are needed. You could access the
global struct homa instance directly, and 'struct net' is usually
available when a struct homa_net is available.


> +/**
> + * is_homa_pkt() - Return true if @skb is a Homa packet, false otherwise.
> + * @skb:    Packet buffer to check.
> + * Return:  see above.
> + */
> +static inline bool is_homa_pkt(struct sk_buff *skb)
> +{
> +	int protocol;
> +
> +	/* If the network header hasn't been created yet, assume it's a
> +	 * Homa packet (Homa never generates any non-Homa packets).
> +	 */
> +	if (skb->network_header == 0)
> +		return true;
> +	protocol = (skb_is_ipv6(skb)) ? ipv6_hdr(skb)->nexthdr :
> +					ip_hdr(skb)->protocol;
> +	return protocol == IPPROTO_HOMA;
> +}

This helper is apparently unused in this series; just drop it and add
it back later.

> +#define UNIT_LOG(...)
> +#define UNIT_HOOK(...)

Also apparently unused.

> +extern unsigned int homa_net_id;
> +
> +/**
> + * homa_net_from_net() - Return the struct homa_net associated with a particular
> + * struct net.
> + * @net:     Get the Homa data for this net namespace.
> + * Return:   see above.
> + */
> +static inline struct homa_net *homa_net_from_net(struct net *net)

The customary name for this kind of helper is homa_net()

> +{
> +	return (struct homa_net *)net_generic(net, homa_net_id);
> +}
> +
> +/**
> + * homa_from_skb() - Return the struct homa associated with a particular
> + * sk_buff.
> + * @skb:     Get the struct homa for this packet buffer.
> + * Return:   see above.
> + */
> +static inline struct homa *homa_from_skb(struct sk_buff *skb)
> +{
> +	struct homa_net *hnet;
> +
> +	hnet = net_generic(dev_net(skb->dev), homa_net_id);
> +	return hnet->homa;

You can implement this using homa_net_from_skb(), avoiding some code
duplication.

> +}
> +
> +/**
> + * homa_net_from_skb() - Return the struct homa_net associated with a particular
> + * sk_buff.
> + * @skb:     Get the struct homa for this packet buffer.
> + * Return:   see above.
> + */
> +static inline struct homa_net *homa_net_from_skb(struct sk_buff *skb)
> +{
> +	struct homa_net *hnet;
> +
> +	hnet = net_generic(dev_net(skb->dev), homa_net_id);
> +	return hnet;

You can implement this using homa_net(), avoiding some code duplication.
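
Something along these lines (untested sketch, assuming the homa_net()
rename suggested above):

static inline struct homa_net *homa_net(struct net *net)
{
	return (struct homa_net *)net_generic(net, homa_net_id);
}

static inline struct homa_net *homa_net_from_skb(struct sk_buff *skb)
{
	return homa_net(dev_net(skb->dev));
}

static inline struct homa *homa_from_skb(struct sk_buff *skb)
{
	return homa_net_from_skb(skb)->homa;
}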

> +}
> +
> +/**
> + * homa_clock() - Return a fine-grain clock value that is monotonic and
> + * consistent across cores.
> + * Return: see above.
> + */
> +static inline u64 homa_clock(void)
> +{
> +	/* As of May 2025 there does not appear to be a portable API that
> +	 * meets Homa's needs:
> +	 * - The Intel X86 TSC works well but is not portable.
> +	 * - sched_clock() does not guarantee monotonicity or consistency.
> +	 * - ktime_get_mono_fast_ns and ktime_get_raw_fast_ns are very slow
> +	 *   (27 ns to read, vs 8 ns for TSC)
> +	 * Thus we use a hybrid approach that uses TSC (via get_cycles) where
> +	 * available (which should be just about everywhere Homa runs).
> +	 */
> +#ifdef CONFIG_X86_TSC
> +	return get_cycles();
> +#else
> +	return ktime_get_mono_fast_ns();
> +#endif /* CONFIG_X86_TSC */
> +}

The ktime_get*() variants are fast enough to let e.g. pktgen deal with
millions of packets per second. Both the TSC and ktime_get_mono_fast_ns()
suffer from various inconsistencies which will cause the most unexpected
issues in the most dangerous situations. I strongly advise against this
early optimization.
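
A minimal sketch of what I mean, dropping the arch-specific path
entirely:

static inline u64 homa_clock(void)
{
	/* Plain monotonic clock: consistent across cores and fast
	 * enough until profiling proves otherwise.
	 */
	return ktime_get_ns();
}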

> +/**
> + * homa_usecs_to_cycles() - Convert from units of microseconds to units of
> + * homa_clock().
> + * @usecs:   A time measurement in microseconds
> + * Return:   The time in homa_clock() units corresponding to @usecs.
> + */
> +static inline u64 homa_usecs_to_cycles(u64 usecs)
> +{
> +	u64 tmp;
> +
> +	tmp = usecs * homa_clock_khz();
> +	do_div(tmp, 1000);
> +	return tmp;
> +}

Apparently not used in this series.
FWIW, do_div() would likely be much more costly than fetching the timestamp.

> +
> +/* Homa Locking Strategy:
> + *
> + * (Note: this documentation is referenced in several other places in the
> + * Homa code)
> + *
> + * In the Linux TCP/IP stack the primary locking mechanism is a sleep-lock
> + * per socket. However, per-socket locks aren't adequate for Homa, because
> + * sockets are "larger" in Homa. In TCP, a socket corresponds to a single
> + * connection between two peers; an application can have hundreds or
> + * thousands of sockets open at once, so per-socket locks leave lots of
> + * opportunities for concurrency. With Homa, a single socket can be used for
> + * communicating with any number of peers, so there will typically be just
> + * one socket per thread. As a result, a single Homa socket must support many
> + * concurrent RPCs efficiently, and a per-socket lock would create a bottleneck
> + * (Homa tried this approach initially).
> + *
> + * Thus, the primary locks used in Homa are spinlocks at RPC granularity. This
> + * allows operations on different RPCs for the same socket to proceed
> + * concurrently. Homa also has socket locks (which are spinlocks different
> + * from the official socket sleep-locks) but these are used much less
> + * frequently than RPC locks.
> + *
> + * Lock Ordering:
> + *
> + * There are several other locks in Homa besides RPC locks, all of which
> + * are spinlocks. When multiple locks are held, they must be acquired in a
> + * consistent order in order to prevent deadlock. Here are the rules for Homa:
> + * 1. Except for RPC and socket locks, all locks should be considered
> + *    "leaf" locks: don't acquire other locks while holding them.
> + * 2. The lock order is:
> + *    * RPC lock
> + *    * Socket lock
> + *    * Other lock
> + * 3. It is not safe to wait on an RPC lock while holding any other lock.
> + * 4. It is safe to wait on a socket lock while holding an RPC lock, but
> + *    not while holding any other lock.

The last 2 points are not needed: they are obviously implied by the
previous ones.

> + *
> + * It may seem surprising that RPC locks are acquired *before* socket locks,
> + * but this is essential for high performance. Homa has been designed so that
> + * many common operations (such as processing input packets) can be performed
> + * while holding only an RPC lock; this allows operations on different RPCs
> + * to proceed in parallel. Only a few operations, such as handing off an
> + * incoming message to a waiting thread, require the socket lock. If socket
> + * locks had to be acquired first, any operation that might eventually need
> + * the socket lock would have to acquire it before the RPC lock, which would
> + * severely restrict concurrency.

FWIW, I think the above scheme can offer good performance if and only
if every operation requiring the socket lock is slow-path: multiple
RPCs/cores contending for the same socket lock will experience false
sharing and cache contention/misses, and that will hurt performance badly.

If the operations requiring the socket lock are slow-path, the
RPC/socket lock order should be irrelevant for performance.

[...]
> +static inline void homa_skb_get(struct sk_buff *skb, void *dest, int offset,
> +				int length)
> +{
> +	memcpy(dest, skb_transport_header(skb) + offset, length);
> +}

Apparently unused.

/P


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 05/15] net: homa: create homa_peer.h and homa_peer.c
  2025-08-18 20:55 ` [PATCH net-next v15 05/15] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
@ 2025-08-26  9:32   ` Paolo Abeni
  2025-08-27 23:27     ` John Ousterhout
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-26  9:32 UTC (permalink / raw)
  To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/**
> + * homa_peer_rcu_callback() - This function is invoked as the callback
> + * for an invocation of call_rcu. It just marks a peertab to indicate that
> + * it was invoked.
> + * @head:    Contains information used to locate the peertab.
> + */
> +void homa_peer_rcu_callback(struct rcu_head *head)
> +{
> +	struct homa_peertab *peertab;
> +
> +	peertab = container_of(head, struct homa_peertab, rcu_head);
> +	atomic_set(&peertab->call_rcu_pending, 0);
> +}

The freeing scheme is quite convoluted and different from the usual RCU
handling. Why don't you simply call_rcu() on the given peer once the
refcount reaches zero?
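
Roughly (sketch; assumes an rcu_head field is added to struct homa_peer
and @refs is converted to a refcount_t, as suggested below):

static void homa_peer_rcu_free(struct rcu_head *head)
{
	struct homa_peer *peer = container_of(head, struct homa_peer,
					      rcu_head);

	kfree(peer);
}

static void homa_peer_release(struct homa_peer *peer)
{
	if (!refcount_dec_and_test(&peer->refs))
		return;
	/* Unhash first (rhashtable_remove_fast()), so that lookups --
	 * which must use refcount_inc_not_zero() -- cannot revive the
	 * peer, then free it after a grace period.
	 */
	call_rcu(&peer->rcu_head, homa_peer_rcu_free);
}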

> +
> +/**
> + * homa_peer_free_dead() - Release peers on peertab->dead_peers
> + * if possible.
> + * @peertab:    Check the dead peers here.
> + */
> +void homa_peer_free_dead(struct homa_peertab *peertab)
> +	__must_hold(peertab->lock)
> +{
> +	struct homa_peer *peer, *tmp;
> +
> +	/* A dead peer can be freed only if:
> +	 * (a) there are no call_rcu calls pending (if there are, it's
> +	 *     possible that a new reference might get created for the
> +	 *     peer)
> +	 * (b) the peer's reference count is zero.
> +	 */
> +	if (atomic_read(&peertab->call_rcu_pending))
> +		return;
> +	list_for_each_entry_safe(peer, tmp, &peertab->dead_peers, dead_links) {
> +		if (atomic_read(&peer->refs) == 0) {
> +			list_del_init(&peer->dead_links);
> +			homa_peer_free(peer);
> +		}
> +	}
> +}
> +
> +/**
> + * homa_peer_wait_dead() - Don't return until all of the dead peers have
> + * been freed.
> + * @peertab:    Overall information about peers, which includes a dead list.
> + *
> + */
> +void homa_peer_wait_dead(struct homa_peertab *peertab)
> +{
> +	while (1) {
> +		spin_lock_bh(&peertab->lock);
> +		homa_peer_free_dead(peertab);
> +		if (list_empty(&peertab->dead_peers)) {
> +			spin_unlock_bh(&peertab->lock);
> +			return;
> +		}
> +		spin_unlock_bh(&peertab->lock);
> +	}
> +}

Apparently unused.

> +/**
> + * homa_dst_refresh() - This method is called when the dst for a peer is
> + * obsolete; it releases that dst and creates a new one.
> + * @peertab:  Table containing the peer.
> + * @peer:     Peer whose dst is obsolete.
> + * @hsk:      Socket that will be used to transmit data to the peer.
> + */
> +void homa_dst_refresh(struct homa_peertab *peertab, struct homa_peer *peer,
> +		      struct homa_sock *hsk)
> +{
> +	struct dst_entry *dst;
> +
> +	dst = homa_peer_get_dst(peer, hsk);
> +	if (IS_ERR(dst))
> +		return;
> +	dst_release(peer->dst);
> +	peer->dst = dst;

Why doesn't the above need any lock? Can multiple RPCs race on the same
peer concurrently?

> +/**
> + * struct homa_peer - One of these objects exists for each machine that we
> + * have communicated with (either as client or server).
> + */
> +struct homa_peer {
> +	/** @ht_key: The hash table key for this peer in peertab->ht. */
> +	struct homa_peer_key ht_key;
> +
> +	/**
> +	 * @ht_linkage: Used by the rhashtable implementation to link this peer into
> +	 * peertab->ht.
> +	 */
> +	struct rhash_head ht_linkage;
> +
> +	/** @dead_links: Used to link this peer into peertab->dead_peers. */
> +	struct list_head dead_links;
> +
> +	/**
> +	 * @refs: Number of unmatched calls to homa_peer_hold; it's not safe
> +	 * to free this object until the reference count is zero.
> +	 */
> +	atomic_t refs ____cacheline_aligned_in_smp;

Please use refcount_t instead.

> +/**
> + * homa_peer_hash() - Hash function used for @peertab->ht.
> + * @data:    Pointer to key for which a hash is desired. Must actually
> + *           be a struct homa_peer_key.
> + * @dummy:   Not used
> + * @seed:    Seed for the hash.
> + * Return:   A 32-bit hash value for the given key.
> + */
> +static inline u32 homa_peer_hash(const void *data, u32 dummy, u32 seed)
> +{
> +	/* This is MurmurHash3, used instead of the jhash default because it
> +	 * is faster (25 ns vs. 40 ns as of May 2025).
> +	 */
> +	BUILD_BUG_ON(sizeof(struct homa_peer_key) & 0x3);

It's likely worth placing the hash function implementation in a
standalone header.

> +	const u32 len = sizeof(struct homa_peer_key) >> 2;
> +	const u32 c1 = 0xcc9e2d51;
> +	const u32 c2 = 0x1b873593;
> +	const u32 *key = data;
> +	u32 h = seed;
> +
> +	for (size_t i = 0; i < len; i++) {
> +		u32 k = key[i];
> +
> +		k *= c1;
> +		k = (k << 15) | (k >> (32 - 15));
> +		k *= c2;
> +
> +		h ^= k;
> +		h = (h << 13) | (h >> (32 - 13));
> +		h = h * 5 + 0xe6546b64;
> +	}
> +
> +	h ^= len * 4;  // Total number of input bytes

Please avoid C99 comments

/P


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 06/15] net: homa: create homa_sock.h and homa_sock.c
  2025-08-18 20:55 ` [PATCH net-next v15 06/15] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
@ 2025-08-26 10:10   ` Paolo Abeni
  2025-08-31 23:29     ` John Ousterhout
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-26 10:10 UTC (permalink / raw)
  To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/**
> + * homa_socktab_next() - Return the next socket in an iteration over a socktab.
> + * @scan:      State of the scan.
> + *
> + * Return:     The next socket in the table, or NULL if the iteration has
> + *             returned all of the sockets in the table.  If non-NULL, a
> + *             reference is held on the socket to prevent its deletion.
> + *             Sockets are not returned in any particular order. It's
> + *             possible that the returned socket has been destroyed.
> + */
> +struct homa_sock *homa_socktab_next(struct homa_socktab_scan *scan)
> +{
> +	struct hlist_head *bucket;
> +	struct hlist_node *next;
> +
> +	rcu_read_lock();
> +	if (scan->hsk) {
> +		sock_put(&scan->hsk->sock);
> +		next = rcu_dereference(hlist_next_rcu(&scan->hsk->socktab_links));
> +		if (next)
> +			goto success;
> +	}
> +	for (scan->current_bucket++;
> +	     scan->current_bucket < HOMA_SOCKTAB_BUCKETS;
> +	     scan->current_bucket++) {
> +		bucket = &scan->socktab->buckets[scan->current_bucket];
> +		next = rcu_dereference(hlist_first_rcu(bucket));
> +		if (next)
> +			goto success;
> +	}
> +	scan->hsk = NULL;
> +	rcu_read_unlock();
> +	return NULL;
> +
> +success:
> +	scan->hsk =  hlist_entry(next, struct homa_sock, socktab_links);

Minor nit: double space above.

> +	sock_hold(&scan->hsk->sock);
> +	rcu_read_unlock();
> +	return scan->hsk;
> +}
> +
> +/**
> + * homa_socktab_end_scan() - Must be invoked on completion of each scan
> + * to clean up state associated with the scan.
> + * @scan:      State of the scan.
> + */
> +void homa_socktab_end_scan(struct homa_socktab_scan *scan)
> +{
> +	if (scan->hsk) {
> +		sock_put(&scan->hsk->sock);
> +		scan->hsk = NULL;
> +	}
> +}
> +
> +/**
> + * homa_sock_init() - Constructor for homa_sock objects. This function
> + * initializes only the parts of the socket that are owned by Homa.
> + * @hsk:    Object to initialize. The Homa-specific parts must have been
> + *          initialized to zeroes by the caller.
> + *
> + * Return: 0 for success, otherwise a negative errno.
> + */
> +int homa_sock_init(struct homa_sock *hsk)
> +{
> +	struct homa_pool *buffer_pool;
> +	struct homa_socktab *socktab;
> +	struct homa_sock *other;
> +	struct homa_net *hnet;
> +	struct homa *homa;
> +	int starting_port;
> +	int result = 0;
> +	int i;
> +
> +	hnet = (struct homa_net *)net_generic(sock_net(&hsk->sock),
> +					      homa_net_id);
> +	homa = hnet->homa;
> +	socktab = homa->socktab;
> +
> +	/* Initialize fields outside the Homa part. */
> +	hsk->sock.sk_sndbuf = homa->wmem_max;
> +	sock_set_flag(&hsk->inet.sk, SOCK_RCU_FREE);
> +
> +	/* Do things requiring memory allocation before locking the socket,
> +	 * so that GFP_ATOMIC is not needed.
> +	 */
> +	buffer_pool = homa_pool_alloc(hsk);
> +	if (IS_ERR(buffer_pool))
> +		return PTR_ERR(buffer_pool);
> +
> +	/* Initialize Homa-specific fields. */
> +	hsk->homa = homa;
> +	hsk->hnet = hnet;
> +	hsk->buffer_pool = buffer_pool;
> +
> +	/* Pick a default port. Must keep the socktab locked from now
> +	 * until the new socket is added to the socktab, to ensure that
> +	 * no other socket chooses the same port.
> +	 */
> +	spin_lock_bh(&socktab->write_lock);
> +	starting_port = hnet->prev_default_port;
> +	while (1) {
> +		hnet->prev_default_port++;
> +		if (hnet->prev_default_port < HOMA_MIN_DEFAULT_PORT)
> +			hnet->prev_default_port = HOMA_MIN_DEFAULT_PORT;
> +		other = homa_sock_find(hnet, hnet->prev_default_port);
> +		if (!other)
> +			break;
> +		sock_put(&other->sock);
> +		if (hnet->prev_default_port == starting_port) {
> +			spin_unlock_bh(&socktab->write_lock);
> +			hsk->shutdown = true;
> +			hsk->homa = NULL;
> +			result = -EADDRNOTAVAIL;
> +			goto error;
> +		}

You likely need to add a cond_resched() here (releasing and re-acquiring
the lock as needed).
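
Something like this (sketch; homa_socktab_resched() is a made-up helper):

/* Briefly drop the socktab lock when a reschedule is due, so that a
 * long port scan cannot stall the CPU. The caller must re-validate any
 * state derived from the table after this returns.
 */
static void homa_socktab_resched(struct homa_socktab *socktab)
	__must_hold(&socktab->write_lock)
{
	if (!need_resched())
		return;
	spin_unlock_bh(&socktab->write_lock);
	cond_resched();
	spin_lock_bh(&socktab->write_lock);
}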

> +	}
> +	hsk->port = hnet->prev_default_port;
> +	hsk->inet.inet_num = hsk->port;
> +	hsk->inet.inet_sport = htons(hsk->port);
> +
> +	hsk->is_server = false;
> +	hsk->shutdown = false;
> +	hsk->ip_header_length = (hsk->inet.sk.sk_family == AF_INET) ?
> +				sizeof(struct iphdr) : sizeof(struct ipv6hdr);
> +	spin_lock_init(&hsk->lock);
> +	atomic_set(&hsk->protect_count, 0);
> +	INIT_LIST_HEAD(&hsk->active_rpcs);
> +	INIT_LIST_HEAD(&hsk->dead_rpcs);
> +	hsk->dead_skbs = 0;
> +	INIT_LIST_HEAD(&hsk->waiting_for_bufs);
> +	INIT_LIST_HEAD(&hsk->ready_rpcs);
> +	INIT_LIST_HEAD(&hsk->interests);
> +	for (i = 0; i < HOMA_CLIENT_RPC_BUCKETS; i++) {
> +		struct homa_rpc_bucket *bucket = &hsk->client_rpc_buckets[i];
> +
> +		spin_lock_init(&bucket->lock);
> +		bucket->id = i;
> +		INIT_HLIST_HEAD(&bucket->rpcs);
> +	}
> +	for (i = 0; i < HOMA_SERVER_RPC_BUCKETS; i++) {
> +		struct homa_rpc_bucket *bucket = &hsk->server_rpc_buckets[i];
> +
> +		spin_lock_init(&bucket->lock);
> +		bucket->id = i + 1000000;
> +		INIT_HLIST_HEAD(&bucket->rpcs);
> +	}

Do all the above initialization steps need to be done under the socktab
lock?

> +/**
> + * homa_sock_bind() - Associates a server port with a socket; if there
> + * was a previous server port assignment for @hsk, it is abandoned.
> + * @hnet:      Network namespace with which port is associated.
> + * @hsk:       Homa socket.
> + * @port:      Desired server port for @hsk. If 0, then this call
> + *             becomes a no-op: the socket will continue to use
> + *             its randomly assigned client port.
> + *
> + * Return:  0 for success, otherwise a negative errno.
> + */
> +int homa_sock_bind(struct homa_net *hnet, struct homa_sock *hsk,
> +		   u16 port)
> +{
> +	struct homa_socktab *socktab = hnet->homa->socktab;
> +	struct homa_sock *owner;
> +	int result = 0;
> +
> +	if (port == 0)
> +		return result;
> +	if (port >= HOMA_MIN_DEFAULT_PORT)
> +		return -EINVAL;
> +	homa_sock_lock(hsk);
> +	spin_lock_bh(&socktab->write_lock);
> +	if (hsk->shutdown) {
> +		result = -ESHUTDOWN;
> +		goto done;
> +	}
> +
> +	owner = homa_sock_find(hnet, port);
> +	if (owner) {
> +		sock_put(&owner->sock);

homa_sock_find() is used in multiple places to check for port usage. I
think it would be useful to add a variant of this helper that does not
increment the socket refcount.
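
e.g. something like this (sketch; homa_port_in_use() is a made-up name
and the body is guessed from the quoted bucket/links code):

static bool homa_port_in_use(struct homa_net *hnet, u16 port)
	__must_hold(&hnet->homa->socktab->write_lock)
{
	struct homa_socktab *socktab = hnet->homa->socktab;
	int bucket = homa_socktab_bucket(hnet, port);
	struct homa_sock *hsk;

	hlist_for_each_entry(hsk, &socktab->buckets[bucket], socktab_links)
		if (hsk->port == port && hsk->hnet == hnet)
			return true;
	return false;
}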

> +		if (owner != hsk)
> +			result = -EADDRINUSE;
> +		goto done;
> +	}
> +	hlist_del_rcu(&hsk->socktab_links);
> +	hsk->port = port;
> +	hsk->inet.inet_num = port;
> +	hsk->inet.inet_sport = htons(hsk->port);
> +	hlist_add_head_rcu(&hsk->socktab_links,
> +			   &socktab->buckets[homa_socktab_bucket(hnet, port)]);
> +	hsk->is_server = true;
> +done:
> +	spin_unlock_bh(&socktab->write_lock);
> +	homa_sock_unlock(hsk);
> +	return result;
> +}


> +/**
> + * homa_sock_wait_wmem() - Block the thread until @hsk's usage of tx
> + * packet memory drops below the socket's limit.
> + * @hsk:          Socket of interest.
> + * @nonblocking:  If there's not enough memory, return -EWOULDBLOCK instead
> + *                of blocking.
> + * Return: 0 for success, otherwise a negative errno.
> + */
> +int homa_sock_wait_wmem(struct homa_sock *hsk, int nonblocking)
> +{
> +	long timeo = hsk->sock.sk_sndtimeo;
> +	int result;
> +
> +	if (nonblocking)
> +		timeo = 0;
> +	set_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
> +	result = wait_event_interruptible_timeout(*sk_sleep(&hsk->sock),
> +				homa_sock_wmem_avl(hsk) || hsk->shutdown,
> +				timeo);
> +	if (signal_pending(current))
> +		return -EINTR;
> +	if (result == 0)
> +		return -EWOULDBLOCK;
> +	return 0;
> +}

Perhaps you could use sock_wait_for_wmem()?

> diff --git a/net/homa/homa_sock.h b/net/homa/homa_sock.h
> new file mode 100644
> index 000000000000..1f649c1da628
> --- /dev/null
> +++ b/net/homa/homa_sock.h
> @@ -0,0 +1,408 @@
> +/* SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+ */
> +
> +/* This file defines structs and other things related to Homa sockets.  */
> +
> +#ifndef _HOMA_SOCK_H
> +#define _HOMA_SOCK_H
> +
> +/* Forward declarations. */
> +struct homa;
> +struct homa_pool;
> +
> +/* Number of hash buckets in a homa_socktab. Must be a power of 2. */
> +#define HOMA_SOCKTAB_BUCKET_BITS 10
> +#define HOMA_SOCKTAB_BUCKETS BIT(HOMA_SOCKTAB_BUCKET_BITS)
> +
> +/**
> + * struct homa_socktab - A hash table that maps from port numbers (either
> + * client or server) to homa_sock objects.
> + *
> + * This table is managed exclusively by homa_socktab.c, using RCU to
> + * minimize synchronization during lookups.
> + */
> +struct homa_socktab {
> +	/**
> +	 * @write_lock: Controls all modifications to this object; not needed
> +	 * for socket lookups (RCU is used instead). Also used to
> +	 * synchronize port allocation.
> +	 */
> +	spinlock_t write_lock;
> +
> +	/**
> +	 * @buckets: Heads of chains for hash table buckets. Chains
> +	 * consist of homa_sock objects.
> +	 */
> +	struct hlist_head buckets[HOMA_SOCKTAB_BUCKETS];
> +};
> +
> +/**
> + * struct homa_socktab_scan - Records the state of an iteration over all
> + * the entries in a homa_socktab, in a way that is safe against concurrent
> + * reclamation of sockets.
> + */
> +struct homa_socktab_scan {
> +	/** @socktab: The table that is being scanned. */
> +	struct homa_socktab *socktab;
> +
> +	/**
> +	 * @hsk: Points to the current socket in the iteration, or NULL if
> +	 * we're at the beginning or end of the iteration. If non-NULL then
> +	 * we are holding a reference to this socket.
> +	 */
> +	struct homa_sock *hsk;
> +
> +	/**
> +	 * @current_bucket: The index of the bucket in socktab->buckets
> +	 * currently being scanned (-1 if @hsk == NULL).
> +	 */
> +	int current_bucket;
> +};
> +
> +/**
> + * struct homa_rpc_bucket - One bucket in a hash table of RPCs.
> + */
> +
> +struct homa_rpc_bucket {
> +	/**
> +	 * @lock: serves as a lock both for this bucket (e.g., when
> +	 * adding and removing RPCs) and also for all of the RPCs in
> +	 * the bucket. Must be held whenever looking up an RPC in
> +	 * this bucket or manipulating an RPC in the bucket. This approach
> +	 * has the following properties:
> +	 * 1. An RPC can be looked up and locked (a common operation) with
> +	 *    a single lock acquisition.
> +	 * 2. Looking up and locking are atomic: there is no window of
> +	 *    vulnerability where someone else could delete an RPC after
> +	 *    it has been looked up and before it has been locked.
> +	 * 3. The lookup mechanism does not use RCU.  This is important because
> +	 *    RPCs are created rapidly and typically live only a few tens of
> +	 *    microseconds.  As of May 2027 RCU introduces a lag of about

I'm unable to make predictions about next week; I have no idea what
will happen in two years...

> +	 *    25 ms before objects can be deleted; for RPCs this would result
> +	 *    in hundreds or thousands of RPCs accumulating before RCU allows
> +	 *    them to be deleted.
> +	 * This approach has the disadvantage that RPCs within a bucket share
> +	 * locks and thus may not be able to work concurrently, but there are
> +	 * enough buckets in the table to make such collisions rare.
> +	 *
> +	 * See "Homa Locking Strategy" in homa_impl.h for more info about
> +	 * locking.
> +	 */
> +	spinlock_t lock;
> +
> +	/**
> +	 * @id: identifier for this bucket, used in error messages etc.
> +	 * It's the index of the bucket within its hash table bucket
> +	 * array, with an additional offset to separate server and
> +	 * client RPCs.
> +	 */
> +	int id;
> +
> +	/** @rpcs: list of RPCs that hash to this bucket. */
> +	struct hlist_head rpcs;
> +};
> +
> +/**
> + * define HOMA_CLIENT_RPC_BUCKETS - Number of buckets in hash tables for
> + * client RPCs. Must be a power of 2.
> + */
> +#define HOMA_CLIENT_RPC_BUCKETS 1024
> +
> +/**
> + * define HOMA_SERVER_RPC_BUCKETS - Number of buckets in hash tables for
> + * server RPCs. Must be a power of 2.
> + */
> +#define HOMA_SERVER_RPC_BUCKETS 1024
> +
> +/**
> + * struct homa_sock - Information about an open socket.
> + */
> +struct homa_sock {
> +	/* Info for other network layers. Note: IPv6 info (struct ipv6_pinfo
> +	 * comes at the very end of the struct, *after* Homa's data, if this
> +	 * socket uses IPv6).
> +	 */
> +	union {
> +		/** @sock: generic socket data; must be the first field. */
> +		struct sock sock;
> +
> +		/**
> +		 * @inet: generic Internet socket data; must also be the
> +		 * first field (contains sock as its first member).
> +		 */
> +		struct inet_sock inet;
> +	};
> +
> +	/**
> +	 * @homa: Overall state about the Homa implementation. NULL
> +	 * means this socket was never initialized or has been deleted.
> +	 */
> +	struct homa *homa;
> +
> +	/**
> +	 * @hnet: Overall state specific to the network namespace for
> +	 * this socket.
> +	 */
> +	struct homa_net *hnet;

Both of the above should likely be removed.

> +
> +	/**
> +	 * @buffer_pool: used to allocate buffer space for incoming messages.
> +	 * Storage is dynamically allocated.
> +	 */
> +	struct homa_pool *buffer_pool;
> +
> +	/**
> +	 * @port: Port number: identifies this socket uniquely among all
> +	 * those on this node.
> +	 */
> +	u16 port;
> +
> +	/**
> +	 * @is_server: True means that this socket can act as both client
> +	 * and server; false means the socket is client-only.
> +	 */
> +	bool is_server;
> +
> +	/**
> +	 * @shutdown: True means the socket is no longer usable (either
> +	 * shutdown has already been invoked, or the socket was never
> +	 * properly initialized).
> +	 */
> +	bool shutdown;
> +
> +	/**
> +	 * @ip_header_length: Length of IP headers for this socket (depends
> +	 * on IPv4 vs. IPv6).
> +	 */
> +	int ip_header_length;
> +
> +	/** @socktab_links: Links this socket into a homa_socktab bucket. */
> +	struct hlist_node socktab_links;
> +
> +	/* Information above is (almost) never modified; start a new
> +	 * cache line below for info that is modified frequently.
> +	 */
> +
> +	/**
> +	 * @lock: Must be held when modifying fields such as interests
> +	 * and lists of RPCs. This lock is used in place of sk->sk_lock
> +	 * because it's used differently (it's always used as a simple
> +	 * spin lock).  See "Homa Locking Strategy" in homa_impl.h
> +	 * for more on Homa's synchronization strategy.
> +	 */
> +	spinlock_t lock ____cacheline_aligned_in_smp;
> +
> +	/**
> +	 * @protect_count: counts the number of calls to homa_protect_rpcs
> +	 * for which there have not yet been calls to homa_unprotect_rpcs.
> +	 */
> +	atomic_t protect_count;
> +
> +	/**
> +	 * @active_rpcs: List of all existing RPCs related to this socket,
> +	 * including both client and server RPCs. This list isn't strictly
> +	 * needed, since RPCs are already in one of the hash tables below,
> +	 * but it's more efficient for homa_timer to have this list
> +	 * (so it doesn't have to scan large numbers of hash buckets).
> +	 * The list is sorted, with the oldest RPC first. Manipulate with
> +	 * RCU so timer can access without locking.
> +	 */
> +	struct list_head active_rpcs;
> +
> +	/**
> +	 * @dead_rpcs: Contains RPCs for which homa_rpc_end has been
> +	 * called, but their packet buffers haven't yet been freed.
> +	 */
> +	struct list_head dead_rpcs;
> +
> +	/** @dead_skbs: Total number of socket buffers in RPCs on dead_rpcs. */
> +	int dead_skbs;
> +
> +	/**
> +	 * @waiting_for_bufs: Contains RPCs that are blocked because there
> +	 * wasn't enough space in the buffer pool region for their incoming
> +	 * messages. Sorted in increasing order of message length.
> +	 */
> +	struct list_head waiting_for_bufs;
> +
> +	/**
> +	 * @ready_rpcs: List of all RPCs that are ready for attention from
> +	 * an application thread.
> +	 */
> +	struct list_head ready_rpcs;
> +
> +	/**
> +	 * @interests: List of threads that are currently waiting for
> +	 * incoming messages via homa_wait_shared.
> +	 */
> +	struct list_head interests;
> +
> +	/**
> +	 * @client_rpc_buckets: Hash table for fast lookup of client RPCs.
> +	 * Modifications are synchronized with bucket locks, not
> +	 * the socket lock.
> +	 */
> +	struct homa_rpc_bucket client_rpc_buckets[HOMA_CLIENT_RPC_BUCKETS];
> +
> +	/**
> +	 * @server_rpc_buckets: Hash table for fast lookup of server RPCs.
> +	 * Modifications are synchronized with bucket locks, not
> +	 * the socket lock.
> +	 */
> +	struct homa_rpc_bucket server_rpc_buckets[HOMA_SERVER_RPC_BUCKETS];

The above 2 arrays are quite large and should probably be allocated
separately.
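
i.e. turn them into pointers and allocate the tables in homa_sock_init(),
something along these lines (sketch; the error label is made up):

	/* In struct homa_sock: */
	struct homa_rpc_bucket *client_rpc_buckets;
	struct homa_rpc_bucket *server_rpc_buckets;

	/* In homa_sock_init(), together with the other allocations
	 * done before taking any lock:
	 */
	hsk->client_rpc_buckets = kcalloc(HOMA_CLIENT_RPC_BUCKETS,
					  sizeof(*hsk->client_rpc_buckets),
					  GFP_KERNEL);
	hsk->server_rpc_buckets = kcalloc(HOMA_SERVER_RPC_BUCKETS,
					  sizeof(*hsk->server_rpc_buckets),
					  GFP_KERNEL);
	if (!hsk->client_rpc_buckets || !hsk->server_rpc_buckets)
		goto free_buckets;	/* kfree() both, then bail out */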

> +/**
> + * homa_client_rpc_bucket() - Find the bucket containing a given
> + * client RPC.
> + * @hsk:      Socket associated with the RPC.
> + * @id:       Id of the desired RPC.
> + *
> + * Return:    The bucket in which this RPC will appear, if the RPC exists.
> + */
> +static inline struct homa_rpc_bucket
> +		*homa_client_rpc_bucket(struct homa_sock *hsk, u64 id)
> +{
> +	/* We can use a really simple hash function here because RPC ids
> +	 * are allocated sequentially.
> +	 */
> +	return &hsk->client_rpc_buckets[(id >> 1)
> +			& (HOMA_CLIENT_RPC_BUCKETS - 1)];

Minor nit: '&' should be on the previous line, and please fix the alignment.

> +/**
> + * homa_sock_wakeup_wmem() - Invoked when tx packet memory has been freed;
> + * if memory usage is below the limit and there are tasks waiting for memory,
> + * wake them up.
> + * @hsk:   Socket of interest.
> + */
> +static inline void homa_sock_wakeup_wmem(struct homa_sock *hsk)
> +{
> +	if (test_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags) &&
> +	    homa_sock_wmem_avl(hsk)) {
> +		clear_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
> +		wake_up_interruptible_poll(sk_sleep(&hsk->sock), EPOLLOUT);

Can hsk be orphaned at this point? I think so.
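
Something like the pattern in sock_def_write_space() would be safer
(sketch; the SOCK_NOSPACE bookkeeping is omitted here for brevity):

static inline void homa_sock_wakeup_wmem(struct homa_sock *hsk)
{
	struct socket_wq *wq;

	if (!homa_sock_wmem_avl(hsk))
		return;
	rcu_read_lock();
	wq = rcu_dereference(hsk->sock.sk_wq);
	/* sk_wq is NULL once the socket has been orphaned. */
	if (skwq_has_sleeper(wq))
		wake_up_interruptible_poll(&wq->wait, EPOLLOUT);
	rcu_read_unlock();
}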

/P


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 08/15] net: homa: create homa_pacer.h and homa_pacer.c
  2025-08-18 20:55 ` [PATCH net-next v15 08/15] net: homa: create homa_pacer.h and homa_pacer.c John Ousterhout
@ 2025-08-26 10:53   ` Paolo Abeni
  2025-09-01 16:35     ` John Ousterhout
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-26 10:53 UTC (permalink / raw)
  To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/**
> + * homa_pacer_alloc() - Allocate and initialize a new pacer object, which
> + * will hold pacer-related information for @homa.
> + * @homa:   Homa transport that the pacer will be associated with.
> + * Return:  A pointer to the new struct pacer, or a negative errno.
> + */
> +struct homa_pacer *homa_pacer_alloc(struct homa *homa)
> +{
> +	struct homa_pacer *pacer;
> +	int err;
> +
> +	pacer = kzalloc(sizeof(*pacer), GFP_KERNEL);
> +	if (!pacer)
> +		return ERR_PTR(-ENOMEM);
> +	pacer->homa = homa;
> +	spin_lock_init(&pacer->mutex);
> +	pacer->fifo_count = 1000;
> +	spin_lock_init(&pacer->throttle_lock);
> +	INIT_LIST_HEAD_RCU(&pacer->throttled_rpcs);
> +	pacer->fifo_fraction = 50;
> +	pacer->max_nic_queue_ns = 5000;
> +	pacer->throttle_min_bytes = 1000;
> +	init_waitqueue_head(&pacer->wait_queue);
> +	pacer->kthread = kthread_run(homa_pacer_main, pacer, "homa_pacer");
> +	if (IS_ERR(pacer->kthread)) {
> +		err = PTR_ERR(pacer->kthread);
> +		pr_err("Homa couldn't create pacer thread: error %d\n", err);
> +		goto error;
> +	}
> +	atomic64_set(&pacer->link_idle_time, homa_clock());
> +
> +	homa_pacer_update_sysctl_deps(pacer);

IMHO this is not in mergeable shape:
- the static init (@25 Gbps)
- never updated on link changes
- assumes a single link in the whole system

I think it's better to split the pacer part out of this series;
otherwise the above points would have to be addressed, and it would be
difficult to fit that into a reasonable series size.

Also, a single thread for all the RPC reaping looks like a potentially
highly contended spot.

> +/**
> + * homa_pacer_xmit() - Transmit packets from  the throttled list until
> + * either (a) the throttled list is empty or (b) the NIC queue has
> + * reached maximum allowable length. Note: this function may be invoked
> + * from either process context or softirq (BH) level. This function is
> + * invoked from multiple places, not just in the pacer thread. The reason
> + * for this is that (as of 10/2019) Linux's scheduling of the pacer thread
> + * is unpredictable: the thread may block for long periods of time (e.g.,
> + * because it is assigned to the same CPU as a busy interrupt handler).
> + * This can result in poor utilization of the network link. So, this method
> + * gets invoked from other places as well, to increase the likelihood that we
> + * keep the link busy. Those other invocations are not guaranteed to happen,
> + * so the pacer thread provides a backstop.
> + * @pacer:    Pacer information for a Homa transport.
> + */
> +void homa_pacer_xmit(struct homa_pacer *pacer)
> +{
> +	struct homa_rpc *rpc;
> +	s64 queue_cycles;
> +
> +	/* Make sure only one instance of this function executes at a time. */
> +	if (!spin_trylock_bh(&pacer->mutex))
> +		return;
> +
> +	while (1) {
> +		queue_cycles = atomic64_read(&pacer->link_idle_time) -
> +					     homa_clock();
> +		if (queue_cycles >= pacer->max_nic_queue_cycles)
> +			break;
> +		if (list_empty(&pacer->throttled_rpcs))
> +			break;
> +
> +		/* Select an RPC to transmit (either SRPT or FIFO) and
> +		 * take a reference on it. Must do this while holding the
> +		 * throttle_lock to prevent the RPC from being reaped. Then
> +		 * release the throttle lock and lock the RPC (can't acquire
> +		 * the RPC lock while holding the throttle lock; see "Homa
> +		 * Locking Strategy" in homa_impl.h).
> +		 */
> +		homa_pacer_throttle_lock(pacer);
> +		pacer->fifo_count -= pacer->fifo_fraction;
> +		if (pacer->fifo_count <= 0) {
> +			struct homa_rpc *cur;
> +			u64 oldest = ~0;
> +
> +			pacer->fifo_count += 1000;
> +			rpc = NULL;
> +			list_for_each_entry(cur, &pacer->throttled_rpcs,
> +					    throttled_links) {
> +				if (cur->msgout.init_time < oldest) {
> +					rpc = cur;
> +					oldest = cur->msgout.init_time;
> +				}
> +			}
> +		} else {
> +			rpc = list_first_entry_or_null(&pacer->throttled_rpcs,
> +						       struct homa_rpc,
> +						       throttled_links);
> +		}
> +		if (!rpc) {
> +			homa_pacer_throttle_unlock(pacer);
> +			break;
> +		}
> +		homa_rpc_hold(rpc);

It's unclear what ensures that 'rpc' is valid at this point.

> +		homa_pacer_throttle_unlock(pacer);
> +		homa_rpc_lock(rpc);
> +		homa_xmit_data(rpc, true);
> +
> +		/* Note: rpc->state could be RPC_DEAD here, but the code
> +		 * below should work anyway.
> +		 */
> +		if (!*rpc->msgout.next_xmit)
> +			/* No more data can be transmitted from this message
> +			 * (right now), so remove it from the throttled list.
> +			 */
> +			homa_pacer_unmanage_rpc(rpc);
> +		homa_rpc_unlock(rpc);
> +		homa_rpc_put(rpc);

The whole loop runs in atomic context; you should likely place a
cond_resched() here, releasing and re-acquiring the mutex as needed.
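
e.g. with a helper like this (sketch; made-up name), called at the
bottom of the loop:

/* Give up the CPU without holding pacer->mutex when a reschedule is
 * due. Returns false if the mutex could not be re-acquired (another
 * instance of homa_pacer_xmit() has taken over), in which case the
 * caller must return immediately.
 */
static bool homa_pacer_resched(struct homa_pacer *pacer)
	__must_hold(&pacer->mutex)
{
	if (!need_resched())
		return true;
	spin_unlock_bh(&pacer->mutex);
	cond_resched();
	return spin_trylock_bh(&pacer->mutex);
}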

> +/**
> + * struct homa_pacer - Contains information that the pacer users to
> + * manage packet output. There is one instance of this object stored
> + * in each struct homa.
> + */
> +struct homa_pacer {
> +	/** @homa: Transport that this pacer is associated with. */
> +	struct homa *homa;

Should be removed

> +/**
> + * homa_pacer_check() - This method is invoked at various places in Homa to
> + * see if the pacer needs to transmit more packets and, if so, transmit
> + * them. It's needed because the pacer thread may get descheduled by
> + * Linux, resulting in output stalls.
> + * @pacer:    Pacer information for a Homa transport.
> + */
> +static inline void homa_pacer_check(struct homa_pacer *pacer)
> +{
> +	if (list_empty(&pacer->throttled_rpcs))
> +		return;
> +
> +	/* The ">> 1" in the line below gives homa_pacer_main the first chance
> +	 * to queue new packets; if the NIC queue becomes more than half
> +	 * empty, then we will help out here.
> +	 */
> +	if ((homa_clock() + (pacer->max_nic_queue_cycles >> 1)) <
> +			atomic64_read(&pacer->link_idle_time))
> +		return;
> +	homa_pacer_xmit(pacer);
> +}

Apparently not used in this series.

/P


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 09/15] net: homa: create homa_rpc.h and homa_rpc.c
  2025-08-18 20:55 ` [PATCH net-next v15 09/15] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
@ 2025-08-26 11:31   ` Paolo Abeni
  2025-09-01 20:10     ` John Ousterhout
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-26 11:31 UTC (permalink / raw)
  To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/**
> + * homa_rpc_reap() - Invoked to release resources associated with dead
> + * RPCs for a given socket.
> + * @hsk:      Homa socket that may contain dead RPCs. Must not be locked by the
> + *            caller; this function will lock and release.
> + * @reap_all: False means do a small chunk of work; there may still be
> + *            unreaped RPCs on return. True means reap all dead RPCs for
> + *            hsk.  Will busy-wait if reaping has been disabled for some RPCs.
> + *
> + * Return: A return value of 0 means that we ran out of work to do; calling
> + *         again will do no work (there could be unreaped RPCs, but if so,
> + *         they cannot currently be reaped).  A value greater than zero means
> + *         there is still more reaping work to be done.
> + */
> +int homa_rpc_reap(struct homa_sock *hsk, bool reap_all)
> +{
> +	/* RPC Reaping Strategy:
> +	 *
> +	 * (Note: there are references to this comment elsewhere in the
> +	 * Homa code)
> +	 *
> +	 * Most of the cost of reaping comes from freeing sk_buffs; this can be
> +	 * quite expensive for RPCs with long messages.
> +	 *
> +	 * The natural time to reap is when homa_rpc_end is invoked to
> +	 * terminate an RPC, but this doesn't work for two reasons. First,
> +	 * there may be outstanding references to the RPC; it cannot be reaped
> +	 * until all of those references have been released. Second, reaping
> +	 * is potentially expensive and RPC termination could occur in
> +	 * homa_softirq when there are short messages waiting to be processed.
> +	 * Taking time to reap a long RPC could result in significant delays
> +	 * for subsequent short RPCs.
> +	 *
> +	 * Thus Homa doesn't reap immediately in homa_rpc_end. Instead, dead
> +	 * RPCs are queued up and reaping occurs in this function, which is
> +	 * invoked later when it is less likely to impact latency. The
> +	 * challenge is to do this so that (a) we don't allow large numbers of
> +	 * dead RPCs to accumulate and (b) we minimize the impact of reaping
> +	 * on latency.
> +	 *
> +	 * The primary place where homa_rpc_reap is invoked is when threads
> +	 * are waiting for incoming messages. The thread has nothing else to
> +	 * do (it may even be polling for input), so reaping can be performed
> +	 * with no latency impact on the application.  However, if a machine
> +	 * is overloaded then it may never wait, so this mechanism isn't always
> +	 * sufficient.
> +	 *
> +	 * Homa now reaps in two other places, if reaping while waiting for
> +	 * messages isn't adequate:
> +	 * 1. If too many dead skbs accumulate, then homa_timer will call
> +	 *    homa_rpc_reap.
> +	 * 2. If this timer thread cannot keep up with all the reaping to be
> +	 *    done then as a last resort homa_dispatch_pkts will reap in small
> +	 *    increments (a few sk_buffs or RPCs) for every incoming batch
> +	 *    of packets. This is undesirable because it will impact Homa's
> +	 *    performance.
> +	 *
> +	 * During the introduction of homa_pools for managing input
> +	 * buffers, freeing of packets for incoming messages was moved to
> +	 * homa_copy_to_user under the assumption that this code wouldn't be
> +	 * on the critical path. However, there is evidence that with
> +	 * fast networks (e.g. 100 Gbps) copying to user space is the
> +	 * bottleneck for incoming messages, and packet freeing takes about
> +	 * 20-25% of the total time in homa_copy_to_user. So, it may eventually
> +	 * be desirable to move packet freeing out of homa_copy_to_user.

See skb_attempt_defer_free()
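
i.e. something like this in the rx-free path (sketch; assumes
skb_attempt_defer_free() is usable from a module):

static void homa_defer_free_skbs(struct sk_buff_head *queue)
{
	struct sk_buff *skb;

	/* Hand each skb back to the CPU that allocated it instead of
	 * freeing it inline on the copy path.
	 */
	while ((skb = __skb_dequeue(queue)) != NULL)
		skb_attempt_defer_free(skb);
}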

> +	 */
> +#define BATCH_MAX 20
> +	struct homa_rpc *rpcs[BATCH_MAX];
> +	struct sk_buff *skbs[BATCH_MAX];

A lot of bytes on the stack, and quite a large batch. You should
probably decrease it.

Also, the need for yet another tx free strategy on top of the several
existing caches still feels suspect.

> +	int num_skbs, num_rpcs;
> +	struct homa_rpc *rpc;
> +	struct homa_rpc *tmp;
> +	int i, batch_size;
> +	int skbs_to_reap;
> +	int result = 0;
> +	int rx_frees;
> +
> +	/* Each iteration through the following loop will reap
> +	 * BATCH_MAX skbs.
> +	 */
> +	skbs_to_reap = hsk->homa->reap_limit;
> +	while (skbs_to_reap > 0 && !list_empty(&hsk->dead_rpcs)) {
> +		batch_size = BATCH_MAX;
> +		if (!reap_all) {
> +			if (batch_size > skbs_to_reap)
> +				batch_size = skbs_to_reap;
> +			skbs_to_reap -= batch_size;
> +		}
> +		num_skbs = 0;
> +		num_rpcs = 0;
> +		rx_frees = 0;
> +
> +		homa_sock_lock(hsk);
> +		if (atomic_read(&hsk->protect_count)) {
> +			homa_sock_unlock(hsk);
> +			if (reap_all)
> +				continue;
> +			return 0;
> +		}
> +
> +		/* Collect buffers and freeable RPCs. */
> +		list_for_each_entry_safe(rpc, tmp, &hsk->dead_rpcs,
> +					 dead_links) {
> +			int refs;
> +
> +			/* Make sure that all outstanding uses of the RPC have
> +			 * completed. We can only be sure if the reference
> +			 * count is zero when we're holding the lock. Note:
> +			 * it isn't safe to block while locking the RPC here,
> +			 * since we hold the socket lock.
> +			 */
> +			if (homa_rpc_try_lock(rpc)) {
> +				refs = atomic_read(&rpc->refs);
> +				homa_rpc_unlock(rpc);
> +			} else {
> +				refs = 1;
> +			}
> +			if (refs != 0)
> +				continue;
> +			rpc->magic = 0;
> +
> +			/* For Tx sk_buffs, collect them here but defer
> +			 * freeing until after releasing the socket lock.
> +			 */
> +			if (rpc->msgout.length >= 0) {
> +				while (rpc->msgout.packets) {
> +					skbs[num_skbs] = rpc->msgout.packets;
> +					rpc->msgout.packets = homa_get_skb_info(
> +						rpc->msgout.packets)->next_skb;
> +					num_skbs++;
> +					rpc->msgout.num_skbs--;
> +					if (num_skbs >= batch_size)
> +						goto release;
> +				}
> +			}
> +
> +			/* In the normal case rx sk_buffs will already have been
> +			 * freed before we got here. Thus it's OK to free
> +			 * immediately in rare situations where there are
> +			 * buffers left.
> +			 */
> +			if (rpc->msgin.length >= 0 &&
> +			    !skb_queue_empty_lockless(&rpc->msgin.packets)) {
> +				rx_frees += skb_queue_len(&rpc->msgin.packets);
> +				__skb_queue_purge(&rpc->msgin.packets);
> +			}
> +
> +			/* If we get here, it means all packets have been
> +			 *  removed from the RPC.
> +			 */
> +			rpcs[num_rpcs] = rpc;
> +			num_rpcs++;
> +			list_del(&rpc->dead_links);
> +			WARN_ON(refcount_sub_and_test(rpc->msgout.skb_memory,
> +						      &hsk->sock.sk_wmem_alloc));
> +			if (num_rpcs >= batch_size)
> +				goto release;
> +		}
> +
> +		/* Free all of the collected resources; release the socket
> +		 * lock while doing this.
> +		 */
> +release:
> +		hsk->dead_skbs -= num_skbs + rx_frees;
> +		result = !list_empty(&hsk->dead_rpcs) &&
> +				(num_skbs + num_rpcs) != 0;
> +		homa_sock_unlock(hsk);
> +		homa_skb_free_many_tx(hsk->homa, skbs, num_skbs);
> +		for (i = 0; i < num_rpcs; i++) {
> +			rpc = rpcs[i];
> +
> +			if (unlikely(rpc->msgin.num_bpages))
> +				homa_pool_release_buffers(rpc->hsk->buffer_pool,
> +							  rpc->msgin.num_bpages,
> +							  rpc->msgin.bpage_offsets);
> +			if (rpc->msgin.length >= 0) {
> +				while (1) {
> +					struct homa_gap *gap;
> +
> +					gap = list_first_entry_or_null(
> +							&rpc->msgin.gaps,
> +							struct homa_gap,
> +							links);
> +					if (!gap)
> +						break;
> +					list_del(&gap->links);
> +					kfree(gap);
> +				}
> +			}
> +			if (rpc->peer) {
> +				homa_peer_release(rpc->peer);
> +				rpc->peer = NULL;
> +			}
> +			rpc->state = 0;
> +			kfree(rpc);
> +		}
> +		homa_sock_wakeup_wmem(hsk);

Here num_rpcs can be zero, and you can get spurious wake-ups.
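
i.e. (sketch):

		/* Only wake writers when this pass actually released
		 * tx memory.
		 */
		if (num_skbs + num_rpcs > 0)
			homa_sock_wakeup_wmem(hsk);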

> +/**
> + * homa_rpc_hold() - Increment the reference count on an RPC, which will
> + * prevent it from being freed until homa_rpc_put() is called. References
> + * are taken in two situations:
> + * 1. An RPC is going to be manipulated by a collection of functions. In
> + *    this case the top-most function that identifies the RPC takes the
> + *    reference; any function that receives an RPC as an argument can
> + *    assume that a reference has been taken on the RPC by some higher
> + *    function on the call stack.
> + * 2. A pointer to an RPC is stored in an object for use later, such as
> + *    an interest. A reference must be held as long as the pointer remains
> + *    accessible in the object.
> + * @rpc:      RPC on which to take a reference.
> + */
> +static inline void homa_rpc_hold(struct homa_rpc *rpc)
> +{
> +	atomic_inc(&rpc->refs);

`refs` should be a refcount_t, since it is used as such.

/P


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 10/15] net: homa: create homa_outgoing.c
  2025-08-18 20:55 ` [PATCH net-next v15 10/15] net: homa: create homa_outgoing.c John Ousterhout
@ 2025-08-26 11:50   ` Paolo Abeni
  2025-09-01 20:21     ` John Ousterhout
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-26 11:50 UTC (permalink / raw)
  To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/**
> + * homa_message_out_fill() - Initializes information for sending a message
> + * for an RPC (either request or response); copies the message data from
> + * user space and (possibly) begins transmitting the message.
> + * @rpc:     RPC for which to send message; this function must not
> + *           previously have been called for the RPC. Must be locked. The RPC
> + *           will be unlocked while copying data, but will be locked again
> + *           before returning.
> + * @iter:    Describes location(s) of message data in user space.
> + * @xmit:    Nonzero means this method should start transmitting packets;
> + *           transmission will be overlapped with copying from user space.
> + *           Zero means the caller will initiate transmission after this
> + *           function returns.
> + *
> + * Return:   0 for success, or a negative errno for failure. It is possible
> + *           for the RPC to be freed while this function is active. If that
> + *           happens, copying will cease, -EINVAL will be returned, and
> + *           rpc->state will be RPC_DEAD.
> + */
> +int homa_message_out_fill(struct homa_rpc *rpc, struct iov_iter *iter, int xmit)
> +	__must_hold(rpc->bucket->lock)
> +{
> +	/* Geometry information for packets:
> +	 * mtu:              largest size for an on-the-wire packet (including
> +	 *                   all headers through IP header, but not Ethernet
> +	 *                   header).
> +	 * max_seg_data:     largest amount of Homa message data that fits
> +	 *                   in an on-the-wire packet (after segmentation).
> +	 * max_gso_data:     largest amount of Homa message data that fits
> +	 *                   in a GSO packet (before segmentation).
> +	 */
> +	int mtu, max_seg_data, max_gso_data;
> +
> +	struct sk_buff **last_link;
> +	struct dst_entry *dst;
> +	u64 segs_per_gso;
> +	int overlap_xmit;
> +
> +	/* Bytes of the message that haven't yet been copied into skbs. */
> +	int bytes_left;
> +
> +	int gso_size;
> +	int err;

Please, no empty lines in the variable declaration section.


> +/**
> + * __homa_xmit_control() - Lower-level version of homa_xmit_control: sends
> + * a control packet.
> + * @contents:  Address of buffer containing the contents of the packet.
> + *             The caller must have filled in all of the information,
> + *             including the common header.
> + * @length:    Length of @contents.
> + * @peer:      Destination to which the packet will be sent.
> + * @hsk:       Socket via which the packet will be sent.
> + *
> + * Return:     Either zero (for success), or a negative errno value if there
> + *             was a problem.
> + */
> +int __homa_xmit_control(void *contents, size_t length, struct homa_peer *peer,
> +			struct homa_sock *hsk)
> +{
> +	struct homa_common_hdr *h;
> +	struct sk_buff *skb;
> +	int extra_bytes;
> +	int result;
> +
> +	skb = homa_skb_alloc_tx(HOMA_MAX_HEADER);
> +	if (unlikely(!skb))
> +		return -ENOBUFS;
> +	skb_dst_set(skb, homa_get_dst(peer, hsk));
> +
> +	h = skb_put(skb, length);
> +	memcpy(h, contents, length);
> +	extra_bytes = HOMA_MIN_PKT_LENGTH - length;
> +	if (extra_bytes > 0)
> +		memset(skb_put(skb, extra_bytes), 0, extra_bytes);
> +	skb->ooo_okay = 1;
> +	skb_get(skb);
> +	if (hsk->inet.sk.sk_family == AF_INET6)
> +		result = ip6_xmit(&hsk->inet.sk, skb, &peer->flow.u.ip6, 0,
> +				  NULL, 0, 0);
> +	else
> +		result = ip_queue_xmit(&hsk->inet.sk, skb, &peer->flow);
> +	if (unlikely(result != 0)) {
> +		/* It appears that ip*_xmit frees skbuffs after
> +		 * errors; the following code is to raise an alert if
> +		 * this isn't actually the case. The extra skb_get above
> +		 * and kfree_skb call below are needed to do the check
> +		 * accurately (otherwise the buffer could be freed and
> +		 * its memory used for some other purpose, resulting in
> +		 * a bogus "reference count").
> +		 */
> +		if (refcount_read(&skb->users) > 1) {
> +			if (hsk->inet.sk.sk_family == AF_INET6)
> +				pr_notice("ip6_xmit didn't free Homa control packet (type %d) after error %d\n",
> +					  h->type, result);
> +			else
> +				pr_notice("ip_queue_xmit didn't free Homa control packet (type %d) after error %d\n",
> +					  h->type, result);
> +		}

Please remove the above check and related refcounting.
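
ip6_xmit()/ip_queue_xmit() consume the skb even on error, so the tail
of the function can simply be (sketch):

	if (hsk->inet.sk.sk_family == AF_INET6)
		result = ip6_xmit(&hsk->inet.sk, skb, &peer->flow.u.ip6, 0,
				  NULL, 0, 0);
	else
		result = ip_queue_xmit(&hsk->inet.sk, skb, &peer->flow);
	return result;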

> +	}
> +	kfree_skb(skb);
> +	return result;
> +}
> +
> +/**
> + * homa_xmit_unknown() - Send an RPC_UNKNOWN packet to a peer.
> + * @skb:         Buffer containing an incoming packet; identifies the peer to
> + *               which the RPC_UNKNOWN packet should be sent.
> + * @hsk:         Socket that should be used to send the RPC_UNKNOWN packet.
> + */
> +void homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk)
> +{
> +	struct homa_common_hdr *h = (struct homa_common_hdr *)skb->data;
> +	struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> +	struct homa_rpc_unknown_hdr unknown;
> +	struct homa_peer *peer;
> +
> +	unknown.common.sport = h->dport;
> +	unknown.common.dport = h->sport;
> +	unknown.common.type = RPC_UNKNOWN;
> +	unknown.common.sender_id = cpu_to_be64(homa_local_id(h->sender_id));
> +	peer = homa_peer_get(hsk, &saddr);
> +	if (!IS_ERR(peer))
> +		__homa_xmit_control(&unknown, sizeof(unknown), peer, hsk);
> +	homa_peer_release(peer);
> +}
> +
> +/**
> + * homa_xmit_data() - If an RPC has outbound data packets that are permitted
> + * to be transmitted according to the scheduling mechanism, arrange for
> + * them to be sent (some may be sent immediately; others may be sent
> + * later by the pacer thread).
> + * @rpc:       RPC to check for transmittable packets. Must be locked by
> + *             caller. Note: this function will release the RPC lock while
> + *             passing packets through the RPC stack, then reacquire it
> + *             before returning. It is possible that the RPC gets freed
> + *             when the lock isn't held, in which case the state will
> + *             be RPC_DEAD on return.
> + * @force:     True means send at least one packet, even if the NIC queue
> + *             is too long. False means that zero packets may be sent, if
> + *             the NIC queue is sufficiently long.
> + */
> +void homa_xmit_data(struct homa_rpc *rpc, bool force)
> +	__must_hold(rpc->bucket->lock)
> +{
> +	struct homa *homa = rpc->hsk->homa;
> +	int length;
> +
> +	while (*rpc->msgout.next_xmit && rpc->state != RPC_DEAD) {
> +		struct sk_buff *skb = *rpc->msgout.next_xmit;
> +
> +		if (rpc->msgout.length - rpc->msgout.next_xmit_offset >
> +		    homa->pacer->throttle_min_bytes) {
> +			if (!homa_pacer_check_nic_q(homa->pacer, skb, force)) {
> +				homa_pacer_manage_rpc(rpc);
> +				break;
> +			}
> +		}
> +
> +		rpc->msgout.next_xmit = &(homa_get_skb_info(skb)->next_skb);
> +		length = homa_get_skb_info(skb)->data_bytes;
> +		rpc->msgout.next_xmit_offset += length;
> +
> +		homa_rpc_unlock(rpc);
> +		skb_get(skb);
> +		__homa_xmit_data(skb, rpc);
> +		force = false;
> +		homa_rpc_lock(rpc);
> +	}
> +}
> +
> +/**
> + * __homa_xmit_data() - Handles packet transmission stuff that is common
> + * to homa_xmit_data and homa_resend_data.
> + * @skb:      Packet to be sent. The packet will be freed after transmission
> + *            (and also if errors prevented transmission).
> + * @rpc:      Information about the RPC that the packet belongs to.
> + */
> +void __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc)
> +{
> +	skb_dst_set(skb, homa_get_dst(rpc->peer, rpc->hsk));
> +
> +	skb->ooo_okay = 1;
> +	skb->ip_summed = CHECKSUM_PARTIAL;
> +	skb->csum_start = skb_transport_header(skb) - skb->head;
> +	skb->csum_offset = offsetof(struct homa_common_hdr, checksum);
> +	if (rpc->hsk->inet.sk.sk_family == AF_INET6)
> +		ip6_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow.u.ip6,
> +			 0, NULL, 0, 0);
> +	else
> +		ip_queue_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow);
> +}
> +
> +/**
> + * homa_resend_data() - This function is invoked as part of handling RESEND
> + * requests. It retransmits the packet(s) containing a given range of bytes
> + * from a message.
> + * @rpc:      RPC for which data should be resent.
> + * @start:    Offset within @rpc->msgout of the first byte to retransmit.
> + * @end:      Offset within @rpc->msgout of the byte just after the last one
> + *            to retransmit.
> + */
> +void homa_resend_data(struct homa_rpc *rpc, int start, int end)
> +	__must_hold(rpc->bucket->lock)
> +{
> +	struct homa_skb_info *homa_info;
> +	struct sk_buff *skb;
> +
> +	if (end <= start)
> +		return;
> +
> +	/* Each iteration of this loop checks one packet in the message
> +	 * to see if it contains segments that need to be retransmitted.
> +	 */
> +	for (skb = rpc->msgout.packets; skb; skb = homa_info->next_skb) {
> +		int seg_offset, offset, seg_length, data_left;
> +		struct homa_data_hdr *h;
> +
> +		homa_info = homa_get_skb_info(skb);
> +		offset = homa_info->offset;
> +		if (offset >= end)
> +			break;
> +		if (start >= (offset + homa_info->data_bytes))
> +			continue;
> +
> +		offset = homa_info->offset;
> +		seg_offset = sizeof(struct homa_data_hdr);
> +		data_left = homa_info->data_bytes;
> +		if (skb_shinfo(skb)->gso_segs <= 1) {
> +			seg_length = data_left;
> +		} else {
> +			seg_length = homa_info->seg_length;
> +			h = (struct homa_data_hdr *)skb_transport_header(skb);
> +		}
> +		for ( ; data_left > 0; data_left -= seg_length,
> +		     offset += seg_length,
> +		     seg_offset += skb_shinfo(skb)->gso_size) {
> +			struct homa_skb_info *new_homa_info;
> +			struct sk_buff *new_skb;
> +			int err;
> +
> +			if (seg_length > data_left)
> +				seg_length = data_left;
> +
> +			if (end <= offset)
> +				goto resend_done;
> +			if ((offset + seg_length) <= start)
> +				continue;
> +
> +			/* This segment must be retransmitted. */
> +			new_skb = homa_skb_alloc_tx(sizeof(struct homa_data_hdr)
> +					+ seg_length);

Please fix the alignment above
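
For example, with the continuation aligned to the open parenthesis:

                new_skb = homa_skb_alloc_tx(sizeof(struct homa_data_hdr) +
                                            seg_length);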

[...]
> +/**
> + * homa_rpc_tx_end() - Return the offset of the first byte in an
> + * RPC's outgoing message that has not yet been fully transmitted.
> + * "Fully transmitted" means the message has been transmitted by the
> + * NIC and the skb has been released by the driver. This is different from
> + * rpc->msgout.next_xmit_offset, which computes the first offset that
> + * hasn't yet been passed to the IP stack.
> + * @rpc:    RPC to check
> + * Return:  See above. If the message has been fully transmitted then
> + *          rpc->msgout.length is returned.
> + */
> +int homa_rpc_tx_end(struct homa_rpc *rpc)
> +{
> +	struct sk_buff *skb = rpc->msgout.first_not_tx;
> +
> +	while (skb) {
> +		struct homa_skb_info *homa_info = homa_get_skb_info(skb);
> +
> +		/* next_xmit_offset tells us whether the packet has been
> +		 * passed to the IP stack. Checking the reference count tells
> +		 * us whether the packet has been released by the driver
> +		 * (which only happens after notification from the NIC that
> +		 * transmission is complete).
> +		 */
> +		if (homa_info->offset >= rpc->msgout.next_xmit_offset ||
> +		    refcount_read(&skb->users) > 1)
> +			return homa_info->offset;

Pushing skbs with refcount > 1 into the tx stack is asking for trouble. You
should instead likely clone the tx skb.
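
Something along these lines (sketch; the tx-completion bookkeeping would
then track the clone rather than the original):

        struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

        /* The IP stack owns and frees the clone; Homa keeps the
         * original skb for retransmission.
         */
        if (clone)
                __homa_xmit_data(clone, rpc);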

/P



* Re: [PATCH net-next v15 11/15] net: homa: create homa_utils.c
  2025-08-18 20:55 ` [PATCH net-next v15 11/15] net: homa: create homa_utils.c John Ousterhout
@ 2025-08-26 11:52   ` Paolo Abeni
  2025-09-01 20:30     ` John Ousterhout
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-26 11:52 UTC (permalink / raw)
  To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/**
> + * homa_spin() - Delay (without sleeping) for a given time interval.
> + * @ns:   How long to delay (in nanoseconds)
> + */
> +void homa_spin(int ns)
> +{
> +	u64 end;
> +
> +	end = homa_clock() + homa_ns_to_cycles(ns);
> +	while (homa_clock() < end)
> +		/* Empty loop body.*/

		cpu_relax();
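
i.e. the loop becomes:

        end = homa_clock() + homa_ns_to_cycles(ns);
        while (homa_clock() < end)
                cpu_relax();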

/P



* Re: [PATCH net-next v15 12/15] net: homa: create homa_incoming.c
  2025-08-18 20:55 ` [PATCH net-next v15 12/15] net: homa: create homa_incoming.c John Ousterhout
@ 2025-08-26 12:05   ` Paolo Abeni
  2025-09-01 22:12     ` John Ousterhout
  2025-09-02  7:19   ` Eric Dumazet
  1 sibling, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-26 12:05 UTC (permalink / raw)
  To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/**
> + * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
> + * all related to the same RPC.
> + * @skb:       First packet in the batch, linked through skb->next.
> + */
> +void homa_dispatch_pkts(struct sk_buff *skb)
> +{
> +#define MAX_ACKS 10
> +	const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> +	struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
> +	u64 id = homa_local_id(h->common.sender_id);
> +	int dport = ntohs(h->common.dport);
> +
> +	/* Used to collect acks from data packets so we can process them
> +	 * all at the end (can't process them inline because that may
> +	 * require locking conflicting RPCs). If we run out of space just
> +	 * ignore the extra acks; they'll be regenerated later through the
> +	 * explicit mechanism.
> +	 */
> +	struct homa_ack acks[MAX_ACKS];
> +	struct homa_rpc *rpc = NULL;
> +	struct homa_sock *hsk;
> +	struct homa_net *hnet;
> +	struct sk_buff *next;
> +	int num_acks = 0;

No blank lines in the variable declaration section, and the stack usage
feels a bit too high.

> +
> +	/* Find the appropriate socket.*/
> +	hnet = homa_net_from_skb(skb);
> +	hsk = homa_sock_find(hnet, dport);
> +	if (!hsk || (!homa_is_client(id) && !hsk->is_server)) {
> +		if (skb_is_ipv6(skb))
> +			icmp6_send(skb, ICMPV6_DEST_UNREACH,
> +				   ICMPV6_PORT_UNREACH, 0, NULL, IP6CB(skb));
> +		else
> +			icmp_send(skb, ICMP_DEST_UNREACH,
> +				  ICMP_PORT_UNREACH, 0);
> +		while (skb) {
> +			next = skb->next;
> +			kfree_skb(skb);
> +			skb = next;
> +		}
> +		if (hsk)
> +			sock_put(&hsk->sock);
> +		return;
> +	}
> +
> +	/* Each iteration through the following loop processes one packet. */
> +	for (; skb; skb = next) {
> +		h = (struct homa_data_hdr *)skb->data;
> +		next = skb->next;
> +
> +		/* Relinquish the RPC lock temporarily if it's needed
> +		 * elsewhere.
> +		 */
> +		if (rpc) {
> +			int flags = atomic_read(&rpc->flags);
> +
> +			if (flags & APP_NEEDS_LOCK) {
> +				homa_rpc_unlock(rpc);
> +
> +				/* This short spin is needed to ensure that the
> +				 * other thread gets the lock before this thread
> +				 * grabs it again below (the need for this
> +				 * was confirmed experimentally in 2/2025;
> +				 * without it, the handoff fails 20-25% of the
> +				 * time). Furthermore, the call to homa_spin
> +				 * seems to allow the other thread to acquire
> +				 * the lock more quickly.
> +				 */
> +				homa_spin(100);
> +				homa_rpc_lock(rpc);

This can still fail for a number of reasons, e.g. if multiple threads
are spinning on the rpc lock, or in fully preemptible kernels.

You need to either ensure that:
- the loop works just fine even if the handover fails with high
frequency - even without the homa_spin() call,
or
- there is explicit handover notification.
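
A softirq-safe shape for the second option could be something like this
(sketch; assumes the application thread clears APP_NEEDS_LOCK as soon as
it has acquired the lock):

        if (flags & APP_NEEDS_LOCK) {
                homa_rpc_unlock(rpc);
                /* Wait for the app thread to confirm the handoff. */
                while (atomic_read(&rpc->flags) & APP_NEEDS_LOCK)
                        cpu_relax();
                homa_rpc_lock(rpc);
        }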

/P



* Re: [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c
  2025-08-18 20:55 ` [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c John Ousterhout
@ 2025-08-26 16:17   ` Paolo Abeni
  2025-09-01 22:53     ` John Ousterhout
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-26 16:17 UTC (permalink / raw)
  To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba

On 8/18/25 10:55 PM, John Ousterhout wrote:
> +/* This variable contains the address of the statically-allocated struct homa
> + * used throughout Homa. This variable should almost never be used directly:
> + * it should be passed as a parameter to functions that need it. This
> + * variable is used only by a few functions called from Linux where there
> + * is no struct homa* available.
> + */
> +static struct homa *global_homa = &homa_data;

No need for this, use homa_data directly everywhere.

> +static struct proto homav6_prot = {
> +	.name		   = "HOMAv6",
> +	.owner		   = THIS_MODULE,
> +	.close		   = homa_close,
> +	.connect	   = ip6_datagram_connect,
> +	.ioctl		   = homa_ioctl,
> +	.init		   = homa_socket,
> +	.destroy	   = homa_sock_destroy,
> +	.setsockopt	   = homa_setsockopt,
> +	.getsockopt	   = homa_getsockopt,
> +	.sendmsg	   = homa_sendmsg,
> +	.recvmsg	   = homa_recvmsg,
> +	.hash		   = homa_hash,
> +	.unhash		   = homa_unhash,
> +	.obj_size	   = sizeof(struct homa_v6_sock),
> +	.ipv6_pinfo_offset = offsetof(struct homa_v6_sock, inet6),
> +
> +	.no_autobind       = 1,

Minor nit: no empty line above

> +};
> +
> +/* Top-level structure describing the Homa protocol. */
> +static struct inet_protosw homa_protosw = {
> +	.type              = SOCK_DGRAM,
> +	.protocol          = IPPROTO_HOMA,
> +	.prot              = &homa_prot,
> +	.ops               = &homa_proto_ops,
> +	.flags             = INET_PROTOSW_REUSE,
> +};
> +
> +static struct inet_protosw homav6_protosw = {
> +	.type              = SOCK_DGRAM,
> +	.protocol          = IPPROTO_HOMA,
> +	.prot              = &homav6_prot,
> +	.ops               = &homav6_proto_ops,
> +	.flags             = INET_PROTOSW_REUSE,
> +};
> +
> +/* This structure is used by IP to deliver incoming Homa packets to us. */
> +static struct net_protocol homa_protocol = {
> +	.handler =	homa_softirq,
> +	.err_handler =	homa_err_handler_v4,
> +	.no_policy =     1,
> +};
> +
> +static struct inet6_protocol homav6_protocol = {
> +	.handler =	homa_softirq,
> +	.err_handler =	homa_err_handler_v6,
> +	.flags =        INET6_PROTO_NOPOLICY | INET6_PROTO_FINAL,
> +};
> +
> +/* Sizes of the headers for each Homa packet type, in bytes. */
> +static u16 header_lengths[] = {
> +	sizeof(struct homa_data_hdr),
> +	0,
> +	sizeof(struct homa_resend_hdr),
> +	sizeof(struct homa_rpc_unknown_hdr),
> +	sizeof(struct homa_busy_hdr),
> +	0,
> +	0,
> +	sizeof(struct homa_need_ack_hdr),
> +	sizeof(struct homa_ack_hdr)
> +};
> +
> +/* Thread that runs timer code to detect lost packets and crashed peers. */
> +static struct task_struct *timer_kthread;
> +static DECLARE_COMPLETION(timer_thread_done);
> +
> +/* Used to wakeup timer_kthread at regular intervals. */
> +static struct hrtimer hrtimer;
> +
> +/* Nonzero is an indication to the timer thread that it should exit. */
> +static int timer_thread_exit;
> +
> +/**
> + * homa_load() - invoked when this module is loaded into the Linux kernel
> + * Return: 0 on success, otherwise a negative errno.
> + */
> +int __init homa_load(void)
> +{
> +	struct homa *homa = global_homa;
> +	bool init_protocol6 = false;
> +	bool init_protosw6 = false;
> +	bool init_protocol = false;
> +	bool init_protosw = false;
> +	bool init_net_ops = false;
> +	bool init_proto6 = false;
> +	bool init_proto = false;
> +	bool init_homa = false;
> +	int status;
> +
> +	/* Compile-time validations that no packet header is longer
> +	 * than HOMA_MAX_HEADER.
> +	 */
> +	BUILD_BUG_ON(sizeof(struct homa_data_hdr) > HOMA_MAX_HEADER);
> +	BUILD_BUG_ON(sizeof(struct homa_resend_hdr) > HOMA_MAX_HEADER);
> +	BUILD_BUG_ON(sizeof(struct homa_rpc_unknown_hdr) > HOMA_MAX_HEADER);
> +	BUILD_BUG_ON(sizeof(struct homa_busy_hdr) > HOMA_MAX_HEADER);
> +	BUILD_BUG_ON(sizeof(struct homa_need_ack_hdr) > HOMA_MAX_HEADER);
> +	BUILD_BUG_ON(sizeof(struct homa_ack_hdr) > HOMA_MAX_HEADER);
> +
> +	/* Extra constraints on data packets:
> +	 * - Ensure minimum header length so Homa doesn't have to worry about
> +	 *   padding data packets.
> +	 * - Make sure data packet headers are a multiple of 4 bytes (needed
> +	 *   for TCP/TSO compatibility).
> +	 */
> +	BUILD_BUG_ON(sizeof(struct homa_data_hdr) < HOMA_MIN_PKT_LENGTH);
> +	BUILD_BUG_ON((sizeof(struct homa_data_hdr) -
> +		      sizeof(struct homa_seg_hdr)) & 0x3);
> +
> +	/* Detect size changes in uAPI structs. */
> +	BUILD_BUG_ON(sizeof(struct homa_sendmsg_args) != 24);
> +	BUILD_BUG_ON(sizeof(struct homa_recvmsg_args) != 88);
> +
> +	pr_err("Homa module loading\n");

Please use pr_notice() instead.

> +	status = proto_register(&homa_prot, 1);
> +	if (status != 0) {
> +		pr_err("proto_register failed for homa_prot: %d\n", status);
> +		goto error;
> +	}
> +	init_proto = true;

The standard way of handling error paths is to avoid local flags and
use distinct goto labels.
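
The usual shape, abbreviated here to the first two registrations (sketch):

        status = proto_register(&homa_prot, 1);
        if (status != 0)
                goto out;

        status = proto_register(&homav6_prot, 1);
        if (status != 0)
                goto out_unregister_v4;
        ...
        return 0;

out_unregister_v4:
        proto_unregister(&homa_prot);
out:
        return status;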

> +
> +	status = proto_register(&homav6_prot, 1);
> +	if (status != 0) {
> +		pr_err("proto_register failed for homav6_prot: %d\n", status);
> +		goto error;
> +	}
> +	init_proto6 = true;
> +
> +	inet_register_protosw(&homa_protosw);
> +	init_protosw = true;
> +
> +	status = inet6_register_protosw(&homav6_protosw);
> +	if (status != 0) {
> +		pr_err("inet6_register_protosw failed in %s: %d\n", __func__,
> +		       status);
> +		goto error;
> +	}
> +	init_protosw6 = true;
> +
> +	status = inet_add_protocol(&homa_protocol, IPPROTO_HOMA);
> +	if (status != 0) {
> +		pr_err("inet_add_protocol failed in %s: %d\n", __func__,
> +		       status);
> +		goto error;
> +	}
> +	init_protocol = true;
> +
> +	status = inet6_add_protocol(&homav6_protocol, IPPROTO_HOMA);
> +	if (status != 0) {
> +		pr_err("inet6_add_protocol failed in %s: %d\n",  __func__,
> +		       status);
> +		goto error;
> +	}
> +	init_protocol6 = true;
> +
> +	status = homa_init(homa);
> +	if (status)
> +		goto error;
> +	init_homa = true;

homa_init() should likely be the first call in this function.

> +
> +	status = register_pernet_subsys(&homa_net_ops);
> +	if (status != 0) {
> +		pr_err("Homa got error from register_pernet_subsys: %d\n",
> +		       status);
> +		goto error;
> +	}
> +	init_net_ops = true;
> +
> +	timer_kthread = kthread_run(homa_timer_main, homa, "homa_timer");
> +	if (IS_ERR(timer_kthread)) {
> +		status = PTR_ERR(timer_kthread);
> +		pr_err("couldn't create Homa timer thread: error %d\n",
> +		       status);
> +		timer_kthread = NULL;
> +		goto error;
> +	}
> +
> +	return 0;
> +
> +error:
> +	if (timer_kthread) {
> +		timer_thread_exit = 1;
> +		wake_up_process(timer_kthread);
> +		wait_for_completion(&timer_thread_done);
> +	}
> +	if (init_net_ops)
> +		unregister_pernet_subsys(&homa_net_ops);
> +	if (init_homa)
> +		homa_destroy(homa);
> +	if (init_protocol)
> +		inet_del_protocol(&homa_protocol, IPPROTO_HOMA);
> +	if (init_protocol6)
> +		inet6_del_protocol(&homav6_protocol, IPPROTO_HOMA);
> +	if (init_protosw)
> +		inet_unregister_protosw(&homa_protosw);
> +	if (init_protosw6)
> +		inet6_unregister_protosw(&homav6_protosw);
> +	if (init_proto)
> +		proto_unregister(&homa_prot);
> +	if (init_proto6)
> +		proto_unregister(&homav6_prot);
> +	return status;
> +}
> +
> +/**
> + * homa_unload() - invoked when this module is unloaded from the Linux kernel.
> + */
> +void __exit homa_unload(void)
> +{
> +	struct homa *homa = global_homa;
> +
> +	pr_notice("Homa module unloading\n");
> +
> +	unregister_pernet_subsys(&homa_net_ops);
> +	homa_destroy(homa);

homa_destroy() should likely be the last call of this function.

> +/**
> + * homa_softirq() - This function is invoked at SoftIRQ level to handle
> + * incoming packets.
> + * @skb:   The incoming packet.
> + * Return: Always 0
> + */
> +int homa_softirq(struct sk_buff *skb)
> +{
> +	struct sk_buff *packets, *other_pkts, *next;
> +	struct sk_buff **prev_link, **other_link;
> +	struct homa_common_hdr *h;
> +	int header_offset;
> +
> +	/* skb may actually contain many distinct packets, linked through
> +	 * skb_shinfo(skb)->frag_list by the Homa GRO mechanism. Make a
> +	 * pass through the list to process all of the short packets,
> +	 * leaving the longer packets in the list. Also, perform various
> +	 * prep/cleanup/error checking functions.

It's hard to tell without the GRO/GSO code handy, but I guess the
implementation here could be simplified by invoking __skb_gso_segment()...

> +	 */
> +	skb->next = skb_shinfo(skb)->frag_list;
> +	skb_shinfo(skb)->frag_list = NULL;
> +	packets = skb;
> +	prev_link = &packets;
> +	for (skb = packets; skb; skb = next) {
> +		next = skb->next;
> +
> +		/* Make the header available at skb->data, even if the packet
> +		 * is fragmented. One complication: it's possible that the IP
> +		 * header hasn't yet been removed (this happens for GRO packets
> +		 * on the frag_list, since they aren't handled explicitly by IP.

... at the very least it will avoid this complication and simplify the
list handling.

> +		 */
> +		if (!homa_make_header_avl(skb))
> +			goto discard;

It looks like the above is too aggressive, i.e. pskb_may_pull() may fail
for a correctly formatted homa_ack_hdr - or any other packet with a
header size < HOMA_MAX_HEADER.
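
Something along these lines would pull only what each packet type
actually needs (sketch, reusing the header_lengths[] table quoted above):

        if (!pskb_may_pull(skb, sizeof(struct homa_common_hdr)))
                goto discard;
        h = (struct homa_common_hdr *)skb->data;
        if (h->type < DATA || h->type > MAX_OP)
                goto discard;
        if (!pskb_may_pull(skb, header_lengths[h->type - DATA]))
                goto discard;
        h = (struct homa_common_hdr *)skb->data;  /* data may have moved */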

> +		header_offset = skb_transport_header(skb) - skb->data;
> +		if (header_offset)
> +			__skb_pull(skb, header_offset);
> +
> +		/* Reject packets that are too short or have bogus types. */
> +		h = (struct homa_common_hdr *)skb->data;
> +		if (unlikely(skb->len < sizeof(struct homa_common_hdr) ||
> +			     h->type < DATA || h->type > MAX_OP ||
> +			     skb->len < header_lengths[h->type - DATA]))
> +			goto discard;
> +
> +		/* Process the packet now if it is a control packet or
> +		 * if it contains an entire short message.
> +		 */
> +		if (h->type != DATA || ntohl(((struct homa_data_hdr *)h)
> +				->message_length) < 1400) {

I could not find where `message_length` is validated. AFAICS
data_hdr->message_length could be > skb->len.

Also I don't see how the condition checked above ensures that the pkt
contains the whole message.
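
One possible shape for the missing validation (sketch; assumes the
HOMA_MAX_MESSAGE_LENGTH limit from the uAPI header):

        if (h->type == DATA) {
                u32 mlen = ntohl(((struct homa_data_hdr *)h)->message_length);

                if (mlen == 0 || mlen > HOMA_MAX_MESSAGE_LENGTH)
                        goto discard;
        }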

/P



* Re: [PATCH net-next v15 03/15] net: homa: create shared Homa header files
  2025-08-26  9:05   ` Paolo Abeni
@ 2025-08-26 23:10     ` John Ousterhout
  2025-08-27  7:21       ` Paolo Abeni
  2025-08-27 12:16       ` Eric Dumazet
  0 siblings, 2 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-26 23:10 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Aug 26, 2025 at 2:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 10:55 PM, John Ousterhout wrote:
> > +/**
> > + * struct homa_net - Contains Homa information that is specific to a
> > + * particular network namespace.
> > + */
> > +struct homa_net {
> > +     /** @net: Network namespace corresponding to this structure. */
> > +     struct net *net;
> > +
> > +     /** @homa: Global Homa information. */
> > +     struct homa *homa;
>
> It's not clear why the above 2 fields are needed. You could access
> directly the global struct homa instance, and 'struct net' is usually
> available when struct home_net is avail.

I have eliminated net but would like to retain homa. I have tried very
hard to avoid global variables in Homa, both for general pedagogical
reasons and because it simplifies unit testing. Right now there is no
need for a global homa except a couple of places in homa_plumbing.c,
and I'd like to maintain that encapsulation.

> > +/**
> > + * homa_clock() - Return a fine-grain clock value that is monotonic and
> > + * consistent across cores.
> > + * Return: see above.
> > + */
> > +static inline u64 homa_clock(void)
> > +{
> > +     /* As of May 2025 there does not appear to be a portable API that
> > +      * meets Homa's needs:
> > +      * - The Intel X86 TSC works well but is not portable.
> > +      * - sched_clock() does not guarantee monotonicity or consistency.
> > +      * - ktime_get_mono_fast_ns and ktime_get_raw_fast_ns are very slow
> > +      *   (27 ns to read, vs 8 ns for TSC)
> > +      * Thus we use a hybrid approach that uses TSC (via get_cycles) where
> > +      * available (which should be just about everywhere Homa runs).
> > +      */
> > +#ifdef CONFIG_X86_TSC
> > +     return get_cycles();
> > +#else
> > +     return ktime_get_mono_fast_ns();
> > +#endif /* CONFIG_X86_TSC */
> > +}
>
> The ktime_get*() variants are fast enough to allow e.g. pktgen to handle
> millions of packets per second. Both the TSC and ktime_get_mono_fast_ns()
> suffer from various inconsistencies which will cause unexpected issues
> in the most dangerous situations. I strongly advise against this early
> optimization.

Which ktime_get variant do you recommend instead of ktime_get_mono_fast_ns?

I feel pretty strongly about retaining the use of TSC on Intel
platforms. As I have said before, Homa is attempting to operate in a
much more aggressive latency domain than Linux is used to, and
nanoseconds matter. I have been using TSC on Intel and AMD platforms
for more than 15 years and I have never had any problems. Is there a
specific inconsistency you know of that will cause "unexpected issues
in the most dangerous situations"? If not, I would prefer to retain
the use of TSC until someone can identify a real problem. Note that
the choice of clock is now well encapsulated, so if a change should
become necessary it will be very easy to make.

For all of your comments that I have not responded to explicitly
above, I have implemented the changes you recommended.

-John-


* Re: [PATCH net-next v15 03/15] net: homa: create shared Homa header files
  2025-08-26 23:10     ` John Ousterhout
@ 2025-08-27  7:21       ` Paolo Abeni
  2025-08-29  3:03         ` John Ousterhout
  2025-08-27 12:16       ` Eric Dumazet
  1 sibling, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-27  7:21 UTC (permalink / raw)
  To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba

On 8/27/25 1:10 AM, John Ousterhout wrote:
> On Tue, Aug 26, 2025 at 2:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 8/18/25 10:55 PM, John Ousterhout wrote:
>>> +/**
>>> + * struct homa_net - Contains Homa information that is specific to a
>>> + * particular network namespace.
>>> + */
>>> +struct homa_net {
>>> +     /** @net: Network namespace corresponding to this structure. */
>>> +     struct net *net;
>>> +
>>> +     /** @homa: Global Homa information. */
>>> +     struct homa *homa;
>>
>> It's not clear why the above 2 fields are needed. You could access
>> directly the global struct homa instance, and 'struct net' is usually
>> available when struct home_net is avail.
> 
> I have eliminated net but would like to retain homa. I have tried very
> hard to avoid global variables in Homa, both for general pedagogical
> reasons and because it simplifies unit testing. Right now there is no
> need for a global homa except a couple of places in homa_plumbing.c,
> and I'd like to maintain that encapsulation.

Note that there is no kernel convention against global per-protocol
variables, as long as they do not prevent scaling.

> 
>>> +/**
>>> + * homa_clock() - Return a fine-grain clock value that is monotonic and
>>> + * consistent across cores.
>>> + * Return: see above.
>>> + */
>>> +static inline u64 homa_clock(void)
>>> +{
>>> +     /* As of May 2025 there does not appear to be a portable API that
>>> +      * meets Homa's needs:
>>> +      * - The Intel X86 TSC works well but is not portable.
>>> +      * - sched_clock() does not guarantee monotonicity or consistency.
>>> +      * - ktime_get_mono_fast_ns and ktime_get_raw_fast_ns are very slow
>>> +      *   (27 ns to read, vs 8 ns for TSC)
>>> +      * Thus we use a hybrid approach that uses TSC (via get_cycles) where
>>> +      * available (which should be just about everywhere Homa runs).
>>> +      */
>>> +#ifdef CONFIG_X86_TSC
>>> +     return get_cycles();
>>> +#else
>>> +     return ktime_get_mono_fast_ns();
>>> +#endif /* CONFIG_X86_TSC */
>>> +}
>>
>> The ktime_get*() variants are fast enough to allow e.g. pktgen to handle
>> millions of packets per second. Both the TSC and ktime_get_mono_fast_ns()
>> suffer from various inconsistencies which will cause unexpected issues
>> in the most dangerous situations. I strongly advise against this early
>> optimization.
> 
> Which ktime_get variant do you recommend instead of ktime_get_mono_fast_ns?
> 
> I feel pretty strongly about retaining the use of TSC on Intel
> platforms. As I have said before, Homa is attempting to operate in a
> much more aggressive latency domain than Linux is used to, and
> nanoseconds matter. I have been using TSC on Intel and AMD platforms
> for more than 15 years and I have never had any problems. Is there a
> specific inconsistency you know of that will cause "unexpected issues
> in the most dangerous situations"? 

The TSC raw value depends on the current CPU. According to the relevant
documentation ktime_get_mono_fast_ns() is allowed to jump under certain
conditions: with either of them you can get sudden/unexpected tick
increases.

> If not, I would prefer to retain
> the use of TSC until someone can identify a real problem. Note that
> the choice of clock is now well encapsulated, so if a change should
> become necessary it will be very easy to make.

AFAICS, in the current revision there are several points that could
cause much greater latency - i.e. the long loops under the BH lock with
no reschedule. I'm surprised they don't show up as ms-latency bottlenecks
under stress tests.

I suggest removing such issues before doing micro-optimizations that, at
the very least, use APIs that are explicitly discouraged.

/P


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 03/15] net: homa: create shared Homa header files
  2025-08-26 23:10     ` John Ousterhout
  2025-08-27  7:21       ` Paolo Abeni
@ 2025-08-27 12:16       ` Eric Dumazet
  1 sibling, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2025-08-27 12:16 UTC (permalink / raw)
  To: John Ousterhout; +Cc: Paolo Abeni, netdev, horms, kuba

On Tue, Aug 26, 2025 at 4:11 PM John Ousterhout <ouster@cs.stanford.edu> wrote:

> I feel pretty strongly about retaining the use of TSC on Intel
> platforms. As I have said before, Homa is attempting to operate in a
> much more aggressive latency domain than Linux is used to, and
> nanoseconds matter. I have been using TSC on Intel and AMD platforms
> for more than 15 years and I have never had any problems. Is there a
> specific inconsistency you know of that will cause "unexpected issues
> in the most dangerous situations"? If not, I would prefer to retain
> the use of TSC until someone can identify a real problem. Note that
> the choice of clock is now well encapsulated, so if a change should
> become necessary it will be very easy to make.

The real cost in these helpers on modern CPUs is the rdtscp instruction.

And using rdtsc (rdtsc()) instead of rdtscp (rdtsc_ordered()) does not
measure anything useful, because of speculation.

Using get_cycles() in networking is simply a big no from us.

We do not want to deal with all these #ifdef CONFIG_X86_TSC games.


* Re: [PATCH net-next v15 05/15] net: homa: create homa_peer.h and homa_peer.c
  2025-08-26  9:32   ` Paolo Abeni
@ 2025-08-27 23:27     ` John Ousterhout
  0 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-27 23:27 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Aug 26, 2025 at 2:33 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 10:55 PM, John Ousterhout wrote:
> > +/**
> > + * homa_peer_rcu_callback() - This function is invoked as the callback
> > + * for an invocation of call_rcu. It just marks a peertab to indicate that
> > + * it was invoked.
> > + * @head:    Contains information used to locate the peertab.
> > + */
> > +void homa_peer_rcu_callback(struct rcu_head *head)
> > +{
> > +     struct homa_peertab *peertab;
> > +
> > +     peertab = container_of(head, struct homa_peertab, rcu_head);
> > +     atomic_set(&peertab->call_rcu_pending, 0);
> > +}
>
> The free scheme is quite convoluted and different from the usual RCU
> handling. Why don't you simply call_rcu() on the given peer once the
> refcount reaches zero?

I have no idea why I implemented such a complicated mechanism. I've
switched to your (obvious, in retrospect) approach.
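
That is, something like this (sketch of the new scheme; it assumes an
rcu_head embedded in struct homa_peer and a refcount_t refs field):

        static void homa_peer_free_rcu(struct rcu_head *head)
        {
                struct homa_peer *peer;

                peer = container_of(head, struct homa_peer, rcu_head);
                kfree(peer);
        }

        /* At release time: */
        if (refcount_dec_and_test(&peer->refs))
                call_rcu(&peer->rcu_head, homa_peer_free_rcu);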

> > +/**
> > + * homa_dst_refresh() - This method is called when the dst for a peer is
> > + * obsolete; it releases that dst and creates a new one.
> > + * @peertab:  Table containing the peer.
> > + * @peer:     Peer whose dst is obsolete.
> > + * @hsk:      Socket that will be used to transmit data to the peer.
> > + */
> > +void homa_dst_refresh(struct homa_peertab *peertab, struct homa_peer *peer,
> > +                   struct homa_sock *hsk)
> > +{
> > +     struct dst_entry *dst;
> > +
> > +     dst = homa_peer_get_dst(peer, hsk);
> > +     if (IS_ERR(dst))
> > +             return;
> > +     dst_release(peer->dst);
> > +     peer->dst = dst;
>
> Why does the above not need any lock? Can multiple RPCs race on the
> same peer concurrently?

Yep, that's a bug. I have refactored to use RCU appropriately.

For all of your comments not discussed explicitly above I have
implemented the changes you requested.

And sorry for my first attempt sending this message, which
accidentally used HTML mode.

-John-


* Re: [PATCH net-next v15 03/15] net: homa: create shared Homa header files
  2025-08-27  7:21       ` Paolo Abeni
@ 2025-08-29  3:03         ` John Ousterhout
  2025-08-29  7:53           ` Paolo Abeni
  0 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-29  3:03 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Wed, Aug 27, 2025 at 12:21 AM Paolo Abeni <pabeni@redhat.com> wrote:

> The TSC raw value depends on the current CPU.

This is incorrect. There were problems in the first multi-core Intel
chips in the early 2000s, but they were fixed before I began using TSC
in 2010. The TSC counter is synchronized across cores and increments
at a constant rate independent of core frequency and power state.

You didn't answer my question about which time source I should use,
but after poking around a bit it looks like ktime_get_ns is the best
option? Please let me know if there are any problems with using this.
Interestingly, ktime_get_ns actually uses TSC (RDTSCP) on Intel
platforms. ktime_get_ns takes about 14 ns per call, vs. about 8 ns for
get_cycles. I have measured Homa performance using ktime_get_ns, and
this adds about .04 core to Homa's total core utilization when driving
a 25 Gbps link at 80% utilization bidirectional. I expect the overhead
to scale with network bandwidth, so I would expect the overhead to be
0.16 core at 100 Gbps. I consider this overhead to be significant, but
I have modified homa_clock to use ktime_get_ns in the upstreamed
version.

> > If not, I would prefer to retain
> > the use of TSC until someone can identify a real problem. Note that
> > the choice of clock is now well encapsulated, so if a change should
> > become necessary it will be very easy to make.
>
> AFAICS, in the current revision there are several points that could
> cause much greater latency - i.e. the long loops under the BH lock with
> no reschedule. I'm surprised they don't show up as ms-latency bottlenecks
> under stress tests.

If you see "long loops under BH lock with no reschedule" please let me
know and I will try to fix them. My goal is to avoid such things, but
I may have missed something.

-John-


* Re: [PATCH net-next v15 03/15] net: homa: create shared Homa header files
  2025-08-29  3:03         ` John Ousterhout
@ 2025-08-29  7:53           ` Paolo Abeni
  2025-08-29 17:08             ` John Ousterhout
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-08-29  7:53 UTC (permalink / raw)
  To: John Ousterhout
  Cc: netdev@vger.kernel.org, Eric Dumazet, Simon Horman,
	Jakub Kicinski

On 8/29/25 5:03 AM, John Ousterhout wrote:
> On Wed, Aug 27, 2025 at 12:21 AM Paolo Abeni <pabeni@redhat.com> wrote:
> 
>> The TSC raw value depends on the current CPU.
> 
> This is incorrect. There were problems in the first multi-core Intel
> chips in the early 2000s, but they were fixed before I began using TSC
> in 2010. The TSC counter is synchronized across cores and increments
> at a constant rate independent of core frequency and power state.

Please read:

https://elixir.bootlin.com/linux/v6.17-rc3/source/arch/x86/include/asm/tsc.h#L14

> You didn't answer my question about which time source I should use,
> but after poking around a bit it looks like ktime_get_ns is the best
> option?

yes, ktime_get_ns()

> I have measured Homa performance using ktime_get_ns, and
> this adds about .04 core to Homa's total core utilization when driving
> a 25 Gbps link at 80% utilization bidirectional. 

What is that 0.04? A percent? Of total CPU time? Of CPU time used by
Homa? Absolute time?

If that is a percent of total CPU time for a single core, such a value is
inconsistent with my benchmarking, where a couple of timestamp() reads
per aggregate packet are well below noise level.

> I expect the overhead
> to scale with network bandwidth, 

Actually, it need not, if the protocol does proper aggregation.

> so I would expect the overhead to be
> 0.16 core at 100 Gbps. I consider this overhead to be significant, but
> I have modified homa_clock to use ktime_get_ns in the upstreamed
> version.

My not-so-wild guess is that other bottlenecks will hit much more, much
earlier.

/P



* Re: [PATCH net-next v15 03/15] net: homa: create shared Homa header files
  2025-08-29  7:53           ` Paolo Abeni
@ 2025-08-29 17:08             ` John Ousterhout
  2025-09-01  7:59               ` Paolo Abeni
  0 siblings, 1 reply; 47+ messages in thread
From: John Ousterhout @ 2025-08-29 17:08 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev@vger.kernel.org, Eric Dumazet, Simon Horman,
	Jakub Kicinski

On Fri, Aug 29, 2025 at 12:53 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/29/25 5:03 AM, John Ousterhout wrote:
> > On Wed, Aug 27, 2025 at 12:21 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> >> The TSC raw value depends on the current CPU.
> >
> > This is incorrect. There were problems in the first multi-core Intel
> > chips in the early 2000s, but they were fixed before I began using TSC
> > in 2010. The TSC counter is synchronized across cores and increments
> > at a constant rate independent of core frequency and power state.
>
> Please read:
>
> https://elixir.bootlin.com/linux/v6.17-rc3/source/arch/x86/include/asm/tsc.h#L14

This does not contradict my assertion, but maybe we are talking about
different things.

First, the statement "the results can be non-monotonic if compared on
different CPUs" in the link you sent doesn't really make sense as
written. There is no way to execute RDTSC instructions at exactly the
same moment on two CPUs and compare the results. Maybe the comment is
referring to a situation like this:
* Execute RDTSC on core A.
* Increment a shared variable on core A.
* Read the variable's value on core B.
* Execute RDTSC on core B.

In this situation, it is possible that the time returned by RDTSC on
core B could precede that observed on core A, while the value of the
variable read by core B reflects the increment. Perhaps this is what
you meant by your statement "The TSC raw value depends on the current
CPU"? I interpreted your words to mean that each CPU has its own
independent TSC counter, which was the case in the early 2000's but is
not the case today.

There are two different issues here:
* Is the TSC clock itself consistent across CPUs? Yes it is. It does
not depend on which CPU reads it.
* When are TSC values read relative to the execution of nearby
instructions? This is also well-defined: with RDTSC, the time is read
as soon as the instruction is decoded. Of course, this may be before
some previous instructions have been retired, so the time could appear
to have been read out-of-order. This means you shouldn't use RDTSC
values to deduce the order of operations on different cores.

> > I have measured Homa performance using ktime_get_ns, and
> > this adds about .04 core to Homa's total core utilization when driving
> > a 25 Gbps link at 80% utilization bidirectional.
>
> What is that 0.04? A percent? Of total CPU time? Of CPU time used by
> Homa? Absolute time?

It's 0.04 of a core. In other words, Homa uses 40 ms more execution time
every second with ktime_get_ns than it did with get_cycles when running
this particular workload.

> If that is a percent of total CPU time for a single core, such a value is
> inconsistent with my benchmarking, where a couple of timestamp() reads
> per aggregate packet are well below noise level.

Homa is doing a lot more than a couple of timestamp() reads per
aggregate packet. The version of Homa that I measured (my default
version, even for benchmarking) is heavily instrumented; you will see
the instrumentation in a later patch series. So far, I've been able to
afford the instrumentation without significant performance penalty,
and I'd like to keep it that way if possible.

-John-


* Re: [PATCH net-next v15 06/15] net: homa: create homa_sock.h and homa_sock.c
  2025-08-26 10:10   ` Paolo Abeni
@ 2025-08-31 23:29     ` John Ousterhout
  0 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-08-31 23:29 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Aug 26, 2025 at 3:11 AM Paolo Abeni <pabeni@redhat.com> wrote:

> > +/**
> > + * homa_sock_init() - Constructor for homa_sock objects. This function
> > + * initializes only the parts of the socket that are owned by Homa.
> > + * @hsk:    Object to initialize. The Homa-specific parts must have been
> > + *          initialized to zeroes by the caller.
> > + *
> > + * Return: 0 for success, otherwise a negative errno.
> > + */
> > +int homa_sock_init(struct homa_sock *hsk)
> > +{
> > +    ...
> > +     /* Pick a default port. Must keep the socktab locked from now
> > +      * until the new socket is added to the socktab, to ensure that
> > +      * no other socket chooses the same port.
> > +      */
> > +     spin_lock_bh(&socktab->write_lock);
> > +     starting_port = hnet->prev_default_port;
> > +     while (1) {
> > +             hnet->prev_default_port++;
> > +             if (hnet->prev_default_port < HOMA_MIN_DEFAULT_PORT)
> > +                     hnet->prev_default_port = HOMA_MIN_DEFAULT_PORT;
> > +             other = homa_sock_find(hnet, hnet->prev_default_port);
> > +             if (!other)
> > +                     break;
> > +             sock_put(&other->sock);
> > +             if (hnet->prev_default_port == starting_port) {
> > +                     spin_unlock_bh(&socktab->write_lock);
> > +                     hsk->shutdown = true;
> > +                     hsk->homa = NULL;
> > +                     result = -EADDRNOTAVAIL;
> > +                     goto error;
> > +             }
>
You likely need to add a cond_resched() here (releasing and re-acquiring
the lock as needed).
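
For example (sketch):

        if (need_resched()) {
                spin_unlock_bh(&socktab->write_lock);
                cond_resched();
                spin_lock_bh(&socktab->write_lock);
        }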

Done.

> Do all the above initialization steps need to be done under the socktab
> lock?

No; I have now reorganized to minimize the amount of time the socktab
lock is held. Socket creation is a pretty rare event in Homa
(typically once per process) so this optimization probably doesn't
matter much...

> > +int homa_sock_bind(struct homa_net *hnet, struct homa_sock *hsk,
> > +                u16 port)
> > +{
> > +     ...
> > +     owner = homa_sock_find(hnet, port);
> > +     if (owner) {
> > +             sock_put(&owner->sock);
>
homa_sock_find() is used in multiple places to check for port usage. I
think it would be useful to add a variant of this helper that does not
increment the socket refcount.

It's only used this way in 2 places, both in this file. The
alternatives (either add another parameter to homa_sock_find or create
a separate method homa_port_in_use) both seem like they would add more
complexity than the current approach. My preference is to leave it as
is. If you feel strongly about this, let me know which option you
prefer and I'll implement it (note also that adding another parameter
to homa_sock_find would be awkward because it could result in a socket
being returned without its reference count incremented, meaning that
it isn't really safe for the caller to use it; I worry that this will
lead people to write buggy code).
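
For concreteness, the homa_port_in_use option would be little more than a
wrapper around the existing lookup (sketch):

        /* Returns true if @port is already in use in @hnet; no socket
         * reference is handed back to the caller.
         */
        static bool homa_port_in_use(struct homa_net *hnet, u16 port)
        {
                struct homa_sock *hsk = homa_sock_find(hnet, port);

                if (!hsk)
                        return false;
                sock_put(&hsk->sock);
                return true;
        }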

> > +/**
> > + * homa_sock_wait_wmem() - Block the thread until @hsk's usage of tx
> > + * packet memory drops below the socket's limit.
> > + * @hsk:          Socket of interest.
> > + * @nonblocking:  If there's not enough memory, return -EWOULDBLOCK instead
> > + *                of blocking.
> > + * Return: 0 for success, otherwise a negative errno.
> > + */
> > +int homa_sock_wait_wmem(struct homa_sock *hsk, int nonblocking)
> > +{
> > +     long timeo = hsk->sock.sk_sndtimeo;
> > +     int result;
> > +
> > +     if (nonblocking)
> > +             timeo = 0;
> > +     set_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
> > +     result = wait_event_interruptible_timeout(*sk_sleep(&hsk->sock),
> > +                             homa_sock_wmem_avl(hsk) || hsk->shutdown,
> > +                             timeo);
> > +     if (signal_pending(current))
> > +             return -EINTR;
> > +     if (result == 0)
> > +             return -EWOULDBLOCK;
> > +     return 0;
> > +}
>
> Perhaps you could use sock_wait_for_wmem()?

sock_wait_for_wmem is not accessible to modules (it's declared "static").

> > diff --git a/net/homa/homa_sock.h b/net/homa/homa_sock.h
>
> > +/**
> > + * struct homa_sock - Information about an open socket.
> > + */
> > +struct homa_sock {
> > +     ...
> > +
> > +     /**
> > +      * @homa: Overall state about the Homa implementation. NULL
> > +      * means this socket was never initialized or has been deleted.
> > +      */
> > +     struct homa *homa;
> > +
> > +     /**
> > +      * @hnet: Overall state specific to the network namespace for
> > +      * this socket.
> > +      */
> > +     struct homa_net *hnet;
>
> Both the above should likely be removed

What is the motivation for removing them? The homa field can be
accessed through hnet, so I suppose it could be removed, but that will
result in extra instructions (both time and icache space) every time
it is accessed (and there are a bunch of accesses). hnet is more
expensive to remove: it can be accessed through the socket, but the
code path is longer, which, again, wastes time and icache space.

> > +     /**
> > +      * @client_rpc_buckets: Hash table for fast lookup of client RPCs.
> > +      * Modifications are synchronized with bucket locks, not
> > +      * the socket lock.
> > +      */
> > +     struct homa_rpc_bucket client_rpc_buckets[HOMA_CLIENT_RPC_BUCKETS];
> > +
> > +     /**
> > +      * @server_rpc_buckets: Hash table for fast lookup of server RPCs.
> > +      * Modifications are synchronized with bucket locks, not
> > +      * the socket lock.
> > +      */
> > +     struct homa_rpc_bucket server_rpc_buckets[HOMA_SERVER_RPC_BUCKETS];
>
> The above 2 arrays are quite large and should probably be allocated
> separately.

What's the benefit of doing multiple allocations? The individual
arrays will still be 16KB, which isn't exactly small. Is there general
advice on how to decide whether large objects should be split up into
smaller ones?

> > +/**
> > + * homa_sock_wakeup_wmem() - Invoked when tx packet memory has been freed;
> > + * if memory usage is below the limit and there are tasks waiting for memory,
> > + * wake them up.
> > + * @hsk:   Socket of interest.
> > + */
> > +static inline void homa_sock_wakeup_wmem(struct homa_sock *hsk)
> > +{
> > +     if (test_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags) &&
> > +         homa_sock_wmem_avl(hsk)) {
> > +             clear_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
> > +             wake_up_interruptible_poll(sk_sleep(&hsk->sock), EPOLLOUT);
>
> Can hsk be orphaned at this point? I think so.

I don't think so. This function is invoked only from homa_rpc_reap,
and I don't believe homa_rpc_reap can be invoked once a socket is
orphaned (that would be problematic on its own). Also, as part of
socket shutdown all RPCs are deleted and homa_rpc_reap is invoked to
free their resources, so it will wake up anyone waiting for wmem. By
the time socket cleanup has completed (a) there are no RPCs, and (b)
there should be no-one waiting for wmem.

Do you have a particular pathway in mind by which
homa_sock_wakeup_wmem could be invoked after a socket has been
orphaned?

-John-


* Re: [PATCH net-next v15 03/15] net: homa: create shared Homa header files
  2025-08-29 17:08             ` John Ousterhout
@ 2025-09-01  7:59               ` Paolo Abeni
  0 siblings, 0 replies; 47+ messages in thread
From: Paolo Abeni @ 2025-09-01  7:59 UTC (permalink / raw)
  To: John Ousterhout
  Cc: netdev@vger.kernel.org, Eric Dumazet, Simon Horman,
	Jakub Kicinski

On 8/29/25 7:08 PM, John Ousterhout wrote:
> On Fri, Aug 29, 2025 at 12:53 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> If that is a percent of total CPU time for a single core, such a value is
>> inconsistent with my benchmarking, where a couple of timestamp() reads
>> per aggregate packet are well below noise level.
> 
> Homa is doing a lot more than a couple of timestamp() reads per
> aggregate packet. 

Then it looks like this is the problem. Data processing should require
roughly a single timestamp per packet. If you need more for
instrumentation, you should likely put such code behind a compile-time
conditional and enable it only in devel/debug builds.

Or even better you could use ftrace/BPF tracing for that.
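
The compile-time gate could be as simple as (sketch;
CONFIG_HOMA_INSTRUMENTATION is a hypothetical Kconfig symbol):

        #ifdef CONFIG_HOMA_INSTRUMENTATION
        #define homa_timestamp(t)       ((t) = homa_clock())
        #else
        #define homa_timestamp(t)       do {} while (0)
        #endif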

/P



* Re: [PATCH net-next v15 08/15] net: homa: create homa_pacer.h and homa_pacer.c
  2025-08-26 10:53   ` Paolo Abeni
@ 2025-09-01 16:35     ` John Ousterhout
  0 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-09-01 16:35 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Aug 26, 2025 at 3:54 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 10:55 PM, John Ousterhout wrote:
> > +/**
> > + * homa_pacer_alloc() - Allocate and initialize a new pacer object, which
> > + * will hold pacer-related information for @homa.
> > + * @homa:   Homa transport that the pacer will be associated with.
> > + * Return:  A pointer to the new struct pacer, or a negative errno.
> > + */
> > +struct homa_pacer *homa_pacer_alloc(struct homa *homa)
> > +{
> > +     struct homa_pacer *pacer;
> > +     int err;
> > +
> > +     pacer = kzalloc(sizeof(*pacer), GFP_KERNEL);
> > +     if (!pacer)
> > +             return ERR_PTR(-ENOMEM);
> > +     pacer->homa = homa;
> > +     spin_lock_init(&pacer->mutex);
> > +     pacer->fifo_count = 1000;
> > +     spin_lock_init(&pacer->throttle_lock);
> > +     INIT_LIST_HEAD_RCU(&pacer->throttled_rpcs);
> > +     pacer->fifo_fraction = 50;
> > +     pacer->max_nic_queue_ns = 5000;
> > +     pacer->throttle_min_bytes = 1000;
> > +     init_waitqueue_head(&pacer->wait_queue);
> > +     pacer->kthread = kthread_run(homa_pacer_main, pacer, "homa_pacer");
> > +     if (IS_ERR(pacer->kthread)) {
> > +             err = PTR_ERR(pacer->kthread);
> > +             pr_err("Homa couldn't create pacer thread: error %d\n", err);
> > +             goto error;
> > +     }
> > +     atomic64_set(&pacer->link_idle_time, homa_clock());
> > +
> > +     homa_pacer_update_sysctl_deps(pacer);
>
> IMHO this does not fit mergeable status:
> - the static init (@25Gbs)
> - never updated on link changes
> - assumes a single link in the whole system
>
> I think it's better to split the pacer part out of this series, or the
> above points should be addressed and it would be difficult fitting a
> reasonable series size.

I have removed the pacer from the patch series.

-John-


* Re: [PATCH net-next v15 09/15] net: homa: create homa_rpc.h and homa_rpc.c
  2025-08-26 11:31   ` Paolo Abeni
@ 2025-09-01 20:10     ` John Ousterhout
  0 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-09-01 20:10 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Aug 26, 2025 at 4:31 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 10:55 PM, John Ousterhout wrote:
> > +/**
> > + * homa_rpc_reap() - Invoked to release resources associated with dead
> > + * RPCs for a given socket.
> > + * @hsk:      Homa socket that may contain dead RPCs. Must not be locked by the
> > + *            caller; this function will lock and release.
> > + * @reap_all: False means do a small chunk of work; there may still be
> > + *            unreaped RPCs on return. True means reap all dead RPCs for
> > + *            hsk.  Will busy-wait if reaping has been disabled for some RPCs.
> > + *
> > + * Return: A return value of 0 means that we ran out of work to do; calling
> > + *         again will do no work (there could be unreaped RPCs, but if so,
> > + *         they cannot currently be reaped).  A value greater than zero means
> > + *         there is still more reaping work to be done.
> > + */
> > +int homa_rpc_reap(struct homa_sock *hsk, bool reap_all)
> > +{
> > +     /* RPC Reaping Strategy:
> > +      *
> > +      * (Note: there are references to this comment elsewhere in the
> > +      * Homa code)
> > +      *
> > +      * Most of the cost of reaping comes from freeing sk_buffs; this can be
> > +      * quite expensive for RPCs with long messages.
> > +      *
> > +      * The natural time to reap is when homa_rpc_end is invoked to
> > +      * terminate an RPC, but this doesn't work for two reasons. First,
> > +      * there may be outstanding references to the RPC; it cannot be reaped
> > +      * until all of those references have been released. Second, reaping
> > +      * is potentially expensive and RPC termination could occur in
> > +      * homa_softirq when there are short messages waiting to be processed.
> > +      * Taking time to reap a long RPC could result in significant delays
> > +      * for subsequent short RPCs.
> > +      *
> > +      * Thus Homa doesn't reap immediately in homa_rpc_end. Instead, dead
> > +      * RPCs are queued up and reaping occurs in this function, which is
> > +      * invoked later when it is less likely to impact latency. The
> > +      * challenge is to do this so that (a) we don't allow large numbers of
> > +      * dead RPCs to accumulate and (b) we minimize the impact of reaping
> > +      * on latency.
> > +      *
> > +      * The primary place where homa_rpc_reap is invoked is when threads
> > +      * are waiting for incoming messages. The thread has nothing else to
> > +      * do (it may even be polling for input), so reaping can be performed
> > +      * with no latency impact on the application.  However, if a machine
> > +      * is overloaded then it may never wait, so this mechanism isn't always
> > +      * sufficient.
> > +      *
> > +      * Homa now reaps in two other places, if reaping while waiting for
> > +      * messages isn't adequate:
> > +      * 1. If too many dead skbs accumulate, then homa_timer will call
> > +      *    homa_rpc_reap.
> > +      * 2. If this timer thread cannot keep up with all the reaping to be
> > +      *    done then as a last resort homa_dispatch_pkts will reap in small
> > +      *    increments (a few sk_buffs or RPCs) for every incoming batch
> > +      *    of packets. This is undesirable because it will impact Homa's
> > +      *    performance.
> > +      *
> > +      * During the introduction of homa_pools for managing input
> > +      * buffers, freeing of packets for incoming messages was moved to
> > +      * homa_copy_to_user under the assumption that this code wouldn't be
> > +      * on the critical path. However, there is evidence that with
> > +      * fast networks (e.g. 100 Gbps) copying to user space is the
> > +      * bottleneck for incoming messages, and packet freeing takes about
> > +      * 20-25% of the total time in homa_copy_to_user. So, it may eventually
> > +      * be desirable to remove packet freeing out of homa_copy_to_user.
>
> See skb_attempt_defer_free()

I wasn't previously aware of this. It looks useful, but unfortunately
its symbol isn't currently EXPORTed so Homa can't use it. I submitted
a patch to export that symbol, but that patch was rejected because the
patch didn't also include a use of the symbol.

I'm going to wait until this series is accepted, then submit a smaller
patch that adds the EXPORT and uses it in Homa (or maybe I'll wait
until I upstream Homa's GRO support, as Eric suggested).

> > +      */
> > +#define BATCH_MAX 20
> > +     struct homa_rpc *rpcs[BATCH_MAX];
> > +     struct sk_buff *skbs[BATCH_MAX];
>
> A lot of bytes on the stack, and quite a large batch. You should
> probably decrease it.

I have reduced the batch size to 10. Note also that this is a
"near-leaf" function, so it should be safe for it to have a larger
footprint than Homa functions that invoke the IP/driver stack, which
presumably takes a lot of stack space.

> Also, the need for yet another tx free strategy on top of the several
> existing caches still feels suspect.

I wasn't able to identify an existing cache mechanism that could meet
Homa's needs (and given the association Homa introduces between skb's
and RPCs, which are Homa-specific, it seems unlikely that any existing
mechanism would work for Homa). But, if you have something in mind
that you think might work for Homa, let me know and I'll take a look.

> > +             homa_sock_wakeup_wmem(hsk);
>
> Here num_rpcs can be zero, and you can have spurious wake-ups

I agree that num_rpcs can be zero, but homa_sock_wakeup_wmem won't
actually perform a wakeup unless (a) there are tasks waiting and (b)
there is available memory. So I don't see how there can be a spurious
wakeup. Is there something I'm missing?

> > +static inline void homa_rpc_hold(struct homa_rpc *rpc)
> > +{
> > +     atomic_inc(&rpc->refs);
>
> `refs` should be a refcount_t, since it is used as such.

Done.

-John-


* Re: [PATCH net-next v15 10/15] net: homa: create homa_outgoing.c
  2025-08-26 11:50   ` Paolo Abeni
@ 2025-09-01 20:21     ` John Ousterhout
  0 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-09-01 20:21 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Aug 26, 2025 at 4:50 AM Paolo Abeni <pabeni@redhat.com> wrote:

> > +/**
> > + * homa_rpc_tx_end() - Return the offset of the first byte in an
> > + * RPC's outgoing message that has not yet been fully transmitted.
> > + * "Fully transmitted" means the message has been transmitted by the
> > + * NIC and the skb has been released by the driver. This is different from
> > + * rpc->msgout.next_xmit_offset, which computes the first offset that
> > + * hasn't yet been passed to the IP stack.
> > + * @rpc:    RPC to check
> > + * Return:  See above. If the message has been fully transmitted then
> > + *          rpc->msgout.length is returned.
> > + */
> > +int homa_rpc_tx_end(struct homa_rpc *rpc)
> > +{
> > +     struct sk_buff *skb = rpc->msgout.first_not_tx;
> > +
> > +     while (skb) {
> > +             struct homa_skb_info *homa_info = homa_get_skb_info(skb);
> > +
> > +             /* next_xmit_offset tells us whether the packet has been
> > +              * passed to the IP stack. Checking the reference count tells
> > +              * us whether the packet has been released by the driver
> > +              * (which only happens after notification from the NIC that
> > +              * transmission is complete).
> > +              */
> > +             if (homa_info->offset >= rpc->msgout.next_xmit_offset ||
> > +                 refcount_read(&skb->users) > 1)
> > +                     return homa_info->offset;
>
> Pushing skbs with refcount > 1 into the tx stack calls for trouble. You
> should instead likely clone the tx skb.

Can you say more about what the problems are? So far I have not
encountered any issues and this approach is pretty useful (it will
become even more useful with additional Homa mechanisms that aren't in
this patch).
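
Just so I understand the suggestion: I assume you mean something like
the sketch below (not current Homa code; the flow field name is from
memory), which would cost an extra clone for every retained skb:

static int homa_xmit_clone(struct homa_rpc *rpc, struct sk_buff *skb)
{
	struct sk_buff *clone;

	/* Keep 'skb' for retransmission bookkeeping; give the IP
	 * stack a private copy so it never sees users > 1.
	 */
	clone = skb_clone(skb, GFP_ATOMIC);
	if (!clone)
		return -ENOBUFS;
	return ip_queue_xmit(&rpc->hsk->sock, clone, &rpc->peer->flow);
}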

I have addressed all of the other comments on this patch in the way
you suggested.

-John-

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 11/15] net: homa: create homa_utils.c
  2025-08-26 11:52   ` Paolo Abeni
@ 2025-09-01 20:30     ` John Ousterhout
  0 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-09-01 20:30 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Aug 26, 2025 at 4:53 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 10:55 PM, John Ousterhout wrote:
> > +/**
> > + * homa_spin() - Delay (without sleeping) for a given time interval.
> > + * @ns:   How long to delay (in nanoseconds)
> > + */
> > +void homa_spin(int ns)
> > +{
> > +     u64 end;
> > +
> > +     end = homa_clock() + homa_ns_to_cycles(ns);
> > +     while (homa_clock() < end)
> > +             /* Empty loop body.*/
>
>                 cpu_relax();

Done; I have found at least one other place to use this as well.
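
With that change the function reads:

void homa_spin(int ns)
{
	u64 end;

	end = homa_clock() + homa_ns_to_cycles(ns);
	while (homa_clock() < end)
		cpu_relax();
}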

-John-

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 12/15] net: homa: create homa_incoming.c
  2025-08-26 12:05   ` Paolo Abeni
@ 2025-09-01 22:12     ` John Ousterhout
  0 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-09-01 22:12 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Aug 26, 2025 at 5:05 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 10:55 PM, John Ousterhout wrote:
> > +/**
> > + * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
> > + * all related to the same RPC.
> > + * @skb:       First packet in the batch, linked through skb->next.
> > + */
> > +void homa_dispatch_pkts(struct sk_buff *skb)
> > +{
> > +#define MAX_ACKS 10
> > +     const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> > +     struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
> > +     u64 id = homa_local_id(h->common.sender_id);
> > +     int dport = ntohs(h->common.dport);
> > +
> > +     /* Used to collect acks from data packets so we can process them
> > +      * all at the end (can't process them inline because that may
> > +      * require locking conflicting RPCs). If we run out of space just
> > +      * ignore the extra acks; they'll be regenerated later through the
> > +      * explicit mechanism.
> > +      */
> > +     struct homa_ack acks[MAX_ACKS];
> > +     struct homa_rpc *rpc = NULL;
> > +     struct homa_sock *hsk;
> > +     struct homa_net *hnet;
> > +     struct sk_buff *next;
> > +     int num_acks = 0;
>
> No black lines in the variable declaration section, and the stack usage
> feel a bit too high.

I have eliminated "acks" and "num_acks" (there's a cleaner way to
handle acks now that RPCs have real reference counts).

> > +     /* Each iteration through the following loop processes one packet. */
> > +     for (; skb; skb = next) {
> > +             h = (struct homa_data_hdr *)skb->data;
> > +             next = skb->next;
> > +
> > +             /* Relinquish the RPC lock temporarily if it's needed
> > +              * elsewhere.
> > +              */
> > +             if (rpc) {
> > +                     int flags = atomic_read(&rpc->flags);
> > +
> > +                     if (flags & APP_NEEDS_LOCK) {
> > +                             homa_rpc_unlock(rpc);
> > +
> > +                             /* This short spin is needed to ensure that the
> > +                              * other thread gets the lock before this thread
> > +                              * grabs it again below (the need for this
> > +                              * was confirmed experimentally in 2/2025;
> > +                              * without it, the handoff fails 20-25% of the
> > +                              * time). Furthermore, the call to homa_spin
> > +                              * seems to allow the other thread to acquire
> > +                              * the lock more quickly.
> > +                              */
> > +                             homa_spin(100);
> > +                             homa_rpc_lock(rpc);
>
> This can still fail due to a number of reasons, e.g. if multiple threads
> are spinning on the rpc lock, or in fully preemptable kernels.

Yes, but that's not a problem; working most of the time gets most of
the benefit.

> You need to either ensure that:
> - the loop works just fine even if the handover fails with high

I've already done this: earlier versions of Homa had no handover at
all and the system worked fine except that tail latency was higher.

-John-

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c
  2025-08-26 16:17   ` Paolo Abeni
@ 2025-09-01 22:53     ` John Ousterhout
  2025-09-01 23:03       ` Andrew Lunn
  2025-09-02  8:12       ` Paolo Abeni
  0 siblings, 2 replies; 47+ messages in thread
From: John Ousterhout @ 2025-09-01 22:53 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Aug 26, 2025 at 9:17 AM Paolo Abeni <pabeni@redhat.com> wrote:

> > +     status = proto_register(&homa_prot, 1);
> > +     if (status != 0) {
> > +             pr_err("proto_register failed for homa_prot: %d\n", status);
> > +             goto error;
> > +     }
> > +     init_proto = true;
>
> The standard way of handling the error paths is to avoid local flags and
> use different goto labels.

I initially implemented this with different goto labels, but there
were so many different labels that the code became unmanageable (very
difficult to figure out what to change when adding or removing
initializers). The current approach is *way* cleaner and more obvious,
so I hope I can keep it. The label approach works best when there is
only one label that collects all errors.
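
For comparison, here is a condensed sketch of the flag-based pattern
(homa_unload and the flag here are illustrative, not the exact
homa_plumbing.c code):

static bool init_proto;

static void homa_unload(void)
{
	/* Safe even after partial initialization: each step is undone
	 * only if its flag was set.
	 */
	if (init_proto)
		proto_unregister(&homa_prot);
}

static int __init homa_load(void)
{
	int status;

	status = proto_register(&homa_prot, 1);
	if (status != 0)
		goto error;
	init_proto = true;
	/* ... more steps, each setting its own flag ... */
	return 0;

error:
	homa_unload();
	return status;
}

Adding or removing an initializer touches only homa_load and the
corresponding flag test in homa_unload.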

> > +/**
> > + * homa_softirq() - This function is invoked at SoftIRQ level to handle
> > + * incoming packets.
> > + * @skb:   The incoming packet.
> > + * Return: Always 0
> > + */
> > +int homa_softirq(struct sk_buff *skb)
> > +{
> > +     struct sk_buff *packets, *other_pkts, *next;
> > +     struct sk_buff **prev_link, **other_link;
> > +     struct homa_common_hdr *h;
> > +     int header_offset;
> > +
> > +     /* skb may actually contain many distinct packets, linked through
> > +      * skb_shinfo(skb)->frag_list by the Homa GRO mechanism. Make a
> > +      * pass through the list to process all of the short packets,
> > +      * leaving the longer packets in the list. Also, perform various
> > +      * prep/cleanup/error checking functions.
>
> It's hard to tell without the GRO/GSO code handy, but I guess the
> implementation here could be simplified by invoking __skb_gso_segment()...

This mechanism relates to GRO, not GSO. I suggest we hold off on this
discussion until I submit the GRO patch; I'm pretty sure there will be
a *lot* of discussion about that :-)

> > +      */
> > +     skb->next = skb_shinfo(skb)->frag_list;
> > +     skb_shinfo(skb)->frag_list = NULL;
> > +     packets = skb;
> > +     prev_link = &packets;
> > +     for (skb = packets; skb; skb = next) {
> > +             next = skb->next;
> > +
> > +             /* Make the header available at skb->data, even if the packet
> > +              * is fragmented. One complication: it's possible that the IP
> > +              * header hasn't yet been removed (this happens for GRO packets
> > +              * on the frag_list, since they aren't handled explicitly by IP.
>
> ... at the very least it will avoid this complication and simplify the
> list handling.

As with the comment above, let's defer until you see the GRO mechanism
(a preview: Homa aggregates out-of-order packets in GRO, or even
packets from different RPCs, so it has to retain header information in
the aggregated data).

> > +              */
> > +             if (!homa_make_header_avl(skb))
> > +                     goto discard;
>
> It looks like the above is too aggressive, i.e. pskb_may_pull() may fail
> for a correctly formatted homa_ack_hdr - or any other packet with hdr
> size < HOMA_MAX_HEADER

I think it's OK: homa_make_header_avl pulls the min of HOMA_MAX_HEADER
and the packet length. This may pull some bytes that aren't in the
header, but is that a problem? (This approach seemed simpler/faster
than trying to compute the header length on a packet-by-packet basis.)
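
To make that concrete, the check has roughly this shape (a sketch from
memory; the real homa_impl.h code may differ in details):

static inline bool homa_make_header_avl(struct sk_buff *skb)
{
	unsigned int pull = HOMA_MAX_HEADER;

	if (pull > skb->len)
		pull = skb->len;
	return pskb_may_pull(skb, pull);
}

Because the pull length is capped at skb->len, pskb_may_pull() can't
fail for a short-but-valid packet such as a homa_ack_hdr; genuinely
short packets are rejected later by the per-type length check.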

> > +             header_offset = skb_transport_header(skb) - skb->data;
> > +             if (header_offset)
> > +                     __skb_pull(skb, header_offset);
> > +
> > +             /* Reject packets that are too short or have bogus types. */
> > +             h = (struct homa_common_hdr *)skb->data;
> > +             if (unlikely(skb->len < sizeof(struct homa_common_hdr) ||
> > +                          h->type < DATA || h->type > MAX_OP ||
> > +                          skb->len < header_lengths[h->type - DATA]))
> > +                     goto discard;
> > +
> > +             /* Process the packet now if it is a control packet or
> > +              * if it contains an entire short message.
> > +              */
> > +             if (h->type != DATA || ntohl(((struct homa_data_hdr *)h)
> > +                             ->message_length) < 1400) {
>
> I could not find where `message_length` is validated. AFAICS
> data_hdr->message_length could be > skb->len.
>
> Also I don't see how the condition checked above ensures that the pkt
> contains the whole message.

Long messages consist of multiple packets, so it is fine if
data_hdr->message_length > skb->len. That said, Homa does not fragment
a message into multiple packets unless necessary, so if the condition
above is met, then the message is contained in a single packet (if for
some reason a sender fragments a short message, that won't cause
problems).

The message length is validated in homa_message_in_init, invoked via
homa_softirq -> homa_dispatch_pkts -> homa_data_pkt ->
homa_message_in_init.

For comments that I haven't responded to explicitly here, I have
implemented your suggested fix.

-John-

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c
  2025-09-01 22:53     ` John Ousterhout
@ 2025-09-01 23:03       ` Andrew Lunn
  2025-09-02  4:54         ` John Ousterhout
  2025-09-02  8:12       ` Paolo Abeni
  1 sibling, 1 reply; 47+ messages in thread
From: Andrew Lunn @ 2025-09-01 23:03 UTC (permalink / raw)
  To: John Ousterhout; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba

On Mon, Sep 01, 2025 at 03:53:35PM -0700, John Ousterhout wrote:
> On Tue, Aug 26, 2025 at 9:17 AM Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > > +     status = proto_register(&homa_prot, 1);
> > > +     if (status != 0) {
> > > +             pr_err("proto_register failed for homa_prot: %d\n", status);
> > > +             goto error;
> > > +     }
> > > +     init_proto = true;
> >
> > The standard way of handling the error paths is to avoid local flags and
> > use different goto labels.
> 
> I initially implemented this with different goto labels, but there
> were so many different labels that the code became unmanageable (very
> difficult to figure out what to change when adding or removing
> initializers). The current approach is *way* cleaner and more obvious,
> so I hope I can keep it. The label approach works best when there is
> only one label that collects all errors.

This _might_ mean you need to split it into a number of helper
functions, with each helper using a goto, and the main function calling
the helpers also using a goto when a helper returns an error code (see
the sketch after the style quote below).

https://www.kernel.org/doc/html/v4.10/process/coding-style.html
says

6) Functions

Functions should be short and sweet, and do just one thing. They
should fit on one or two screenfuls of text (the ISO/ANSI screen size
is 80x24, as we all know), and do one thing and do that well.

The maximum length of a function is inversely proportional to the
complexity and indentation level of that function. So, if you have a
conceptually simple function that is just one long (but simple)
case-statement, where you have to do lots of small things for a lot of
different cases, it’s OK to have a longer function.

However, if you have a complex function, and you suspect that a
less-than-gifted first-year high-school student might not even
understand what the function is all about, you should adhere to the
maximum limits all the more closely. Use helper functions with
descriptive names (you can ask the compiler to in-line them if you
think it’s performance-critical, and it will probably do a better job
of it than you would have done).
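
Concretely, the shape might be something like this (all names below
are hypothetical, just to illustrate the pattern):

static int homa_register_protos(void)
{
	int err;

	err = proto_register(&homa_prot, 1);
	if (err)
		return err;
	err = proto_register(&homav6_prot, 1);
	if (err)
		goto err_v4;
	return 0;

err_v4:
	proto_unregister(&homa_prot);
	return err;
}

static int __init homa_load(void)
{
	int err;

	err = homa_register_protos();
	if (err)
		return err;
	err = homa_init_sysctl();	/* hypothetical helper */
	if (err)
		goto err_protos;
	return 0;

err_protos:
	homa_unregister_protos();	/* hypothetical helper */
	return err;
}

Each helper stays small enough that its unwind labels remain obvious.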

	Andrew

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c
  2025-09-01 23:03       ` Andrew Lunn
@ 2025-09-02  4:54         ` John Ousterhout
  0 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-09-02  4:54 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba

On Mon, Sep 1, 2025 at 4:03 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Mon, Sep 01, 2025 at 03:53:35PM -0700, John Ousterhout wrote:
> > On Tue, Aug 26, 2025 at 9:17 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > > > +     status = proto_register(&homa_prot, 1);
> > > > +     if (status != 0) {
> > > > +             pr_err("proto_register failed for homa_prot: %d\n", status);
> > > > +             goto error;
> > > > +     }
> > > > +     init_proto = true;
> > >
> > > The standard way of handling the error paths is to avoid local flags and
> > > use different goto labels.
> >
> > I initially implemented this with different goto labels, but there
> > were so many different labels that the code became unmanageable (very
> > difficult to figure out what to change when adding or removing
> > initializers). The current approach is *way* cleaner and more obvious,
> > so I hope I can keep it. The label approach works best when there is
> > only one label that collects all errors.
>
> This _might_ mean you need to split it into a number of helper
> functions, with each helper using a goto, and the main function calling
> the helpers also using a goto when a helper returns an error code.

Unfortunately helpers don't help. There are already separate functions
for the individual initializations. The problem is with handling
errors in the parent. If a child returns an error, the parent must
reverse all of the initializations that completed before that child
was invoked, so error handling is slightly different for every child
invocation.

-John-

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next v15 12/15] net: homa: create homa_incoming.c
  2025-08-18 20:55 ` [PATCH net-next v15 12/15] net: homa: create homa_incoming.c John Ousterhout
  2025-08-26 12:05   ` Paolo Abeni
@ 2025-09-02  7:19   ` Eric Dumazet
  1 sibling, 0 replies; 47+ messages in thread
From: Eric Dumazet @ 2025-09-02  7:19 UTC (permalink / raw)
  To: John Ousterhout; +Cc: netdev, pabeni, horms, kuba

On Mon, Aug 18, 2025 at 1:56 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> This file contains most of the code for handling incoming packets,
> including top-level dispatching code plus specific handlers for each
> pack type. It also contains code for dispatching fully-received
> messages to waiting application threads.
>
> Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
>
> ---
> Changes for v14:
> * Use new homa_rpc_tx_end function
> * Fix race in homa_wait_shared (an RPC could get lost if it became
>   ready at the same time that homa_interest_wait returned with an error)
> * Handle nonblocking behavior here, rather than in homa_interest.c
> * Change API for homa_wait_private to distinguish errors in an RPC from
>   errors that prevented the wait operation from completing.
>
> Changes for v11:
> * Cleanup and simplify use of RPC reference counts.
> * Cleanup sparse annotations.
> * Rework the mechanism for waking up RPCs that stalled waiting for
>   buffer pool space.
>
> Changes for v10:
> * Revise sparse annotations to eliminate __context__ definition
> * Refactor resend mechanism (new function homa_request_retrans replaces
>   homa_gap_retry)
> * Remove log messages after alloc errors
> * Fix socket cleanup race
>
> Changes for v9:
> * Add support for homa_net objects
> * Use new homa_clock abstraction layer
> * Various name improvements (e.g. use "alloc" instead of "new" for functions
>   that allocate memory)
>
> Changes for v7:
> * API change for homa_rpc_handoff
> * Refactor waiting mechanism for incoming packets: simplify wait
>   criteria and use standard Linux mechanisms for waiting, use
>   new homa_interest struct
> * Reject unauthorized incoming request messages
> * Improve documentation for code that spins (and reduce spin length)
> * Use RPC reference counts, eliminate RPC_HANDING_OFF flag
> * Replace erroneous use of "safe" list iteration with "rcu" version
> * Remove locker argument from locking functions
> * Check incoming messages against HOMA_MAX_MESSAGE_LENGTH
> * Use u64 and __u64 properly
> ---
>  net/homa/homa_impl.h     |  15 +
>  net/homa/homa_incoming.c | 886 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 901 insertions(+)
>  create mode 100644 net/homa/homa_incoming.c
>
> diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
> index 49ca4abfb50b..3d91b7f44de9 100644
> --- a/net/homa/homa_impl.h
> +++ b/net/homa/homa_impl.h
> @@ -421,22 +421,37 @@ static inline bool homa_make_header_avl(struct sk_buff *skb)
>
>  extern unsigned int homa_net_id;
>
> +void     homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> +                     struct homa_rpc *rpc);
> +void     homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb);
> +int      homa_copy_to_user(struct homa_rpc *rpc);
> +void     homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
>  void     homa_destroy(struct homa *homa);
> +void     homa_dispatch_pkts(struct sk_buff *skb);
>  int      homa_fill_data_interleaved(struct homa_rpc *rpc,
>                                     struct sk_buff *skb, struct iov_iter *iter);
> +struct homa_gap *homa_gap_alloc(struct list_head *next, int start, int end);
>  int      homa_init(struct homa *homa);
>  int      homa_message_out_fill(struct homa_rpc *rpc,
>                                struct iov_iter *iter, int xmit);
>  void     homa_message_out_init(struct homa_rpc *rpc, int length);
> +void     homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> +                          struct homa_rpc *rpc);
>  void     homa_net_destroy(struct homa_net *hnet);
>  int      homa_net_init(struct homa_net *hnet, struct net *net,
>                        struct homa *homa);
> +void     homa_request_retrans(struct homa_rpc *rpc);
> +void     homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
> +                        struct homa_sock *hsk);
>  void     homa_rpc_handoff(struct homa_rpc *rpc);
>  int      homa_rpc_tx_end(struct homa_rpc *rpc);
>  void     homa_spin(int ns);
>  struct sk_buff *homa_tx_data_pkt_alloc(struct homa_rpc *rpc,
>                                        struct iov_iter *iter, int offset,
>                                        int length, int max_seg_data);
> +void     homa_rpc_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
> +int      homa_wait_private(struct homa_rpc *rpc, int nonblocking);
> +struct homa_rpc *homa_wait_shared(struct homa_sock *hsk, int nonblocking);
>  int      homa_xmit_control(enum homa_packet_type type, void *contents,
>                            size_t length, struct homa_rpc *rpc);
>  int      __homa_xmit_control(void *contents, size_t length,
> diff --git a/net/homa/homa_incoming.c b/net/homa/homa_incoming.c
> new file mode 100644
> index 000000000000..c485dd98cba9
> --- /dev/null
> +++ b/net/homa/homa_incoming.c
> @@ -0,0 +1,886 @@
> +// SPDX-License-Identifier: BSD-2-Clause or GPL-2.0+
> +
> +/* This file contains functions that handle incoming Homa messages. */
> +
> +#include "homa_impl.h"
> +#include "homa_interest.h"
> +#include "homa_peer.h"
> +#include "homa_pool.h"
> +
> +/**
> + * homa_message_in_init() - Constructor for homa_message_in.
> + * @rpc:          RPC whose msgin structure should be initialized. The
> + *                msgin struct is assumed to be zeroes.
> + * @length:       Total number of bytes in message.
> + * Return:        Zero for successful initialization, or a negative errno
> + *                if rpc->msgin could not be initialized.
> + */
> +int homa_message_in_init(struct homa_rpc *rpc, int length)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       int err;
> +
> +       if (length > HOMA_MAX_MESSAGE_LENGTH)
> +               return -EINVAL;
> +
> +       rpc->msgin.length = length;
> +       skb_queue_head_init(&rpc->msgin.packets);

Do you need the lock, or can you use __skb_queue_head_init() here for clarity ?
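
For context: skb_queue_head_init() also initializes the queue's
internal spinlock, while __skb_queue_head_init() sets up only the list
head and is meant for queues serialized by an external lock (here,
presumably the RPC bucket lock), i.e.:

	__skb_queue_head_init(&rpc->msgin.packets);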

> +       INIT_LIST_HEAD(&rpc->msgin.gaps);
> +       rpc->msgin.bytes_remaining = length;
> +       err = homa_pool_alloc_msg(rpc);
> +       if (err != 0) {
> +               rpc->msgin.length = -1;
> +               return err;
> +       }
> +       return 0;
> +}
> +
> +/**
> + * homa_gap_alloc() - Allocate a new gap and add it to a gap list.
> + * @next:   Add the new gap just before this list element.
> + * @start:  Offset of first byte covered by the gap.
> + * @end:    Offset of byte just after the last one covered by the gap.
> + * Return:  Pointer to the new gap, or NULL if memory couldn't be allocated
> + *          for the gap object.
> + */
> +struct homa_gap *homa_gap_alloc(struct list_head *next, int start, int end)
> +{
> +       struct homa_gap *gap;
> +
> +       gap = kmalloc(sizeof(*gap), GFP_ATOMIC);
> +       if (!gap)
> +               return NULL;
> +       gap->start = start;
> +       gap->end = end;
> +       gap->time = homa_clock();
> +       list_add_tail(&gap->links, next);
> +       return gap;
> +}
> +
> +/**
> + * homa_request_retrans() - The function is invoked when it appears that
> + * data packets for a message have been lost. It issues RESEND requests
> + * as appropriate and may modify the state of the RPC.
> + * @rpc:     RPC for which incoming data is delinquent; must be locked by
> + *           caller.
> + */
> +void homa_request_retrans(struct homa_rpc *rpc)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       struct homa_resend_hdr resend;
> +       struct homa_gap *gap;
> +       int offset, length;
> +
> +       if (rpc->msgin.length >= 0) {
> +               /* Issue RESENDS for any gaps in incoming data. */
> +               list_for_each_entry(gap, &rpc->msgin.gaps, links) {
> +                       resend.offset = htonl(gap->start);
> +                       resend.length = htonl(gap->end - gap->start);
> +                       homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
> +               }
> +
> +               /* Issue a RESEND for any granted data after the last gap. */
> +               offset = rpc->msgin.recv_end;
> +               length = rpc->msgin.length - rpc->msgin.recv_end;
> +               if (length <= 0)
> +                       return;
> +       } else {
> +               /* No data has been received for the RPC. Ask the sender to
> +                * resend everything it has sent so far.
> +                */
> +               offset = 0;
> +               length = -1;
> +       }
> +
> +       resend.offset = htonl(offset);
> +       resend.length = htonl(length);
> +       homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
> +}
> +
> +/**
> + * homa_add_packet() - Add an incoming packet to the contents of a
> + * partially received message.
> + * @rpc:   Add the packet to the msgin for this RPC.
> + * @skb:   The new packet. This function takes ownership of the packet
> + *         (the packet will either be freed or added to rpc->msgin.packets).
> + */
> +void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
> +       struct homa_gap *gap, *dummy, *gap2;
> +       int start = ntohl(h->seg.offset);
> +       int length = homa_data_len(skb);
> +       int end = start + length;
> +
> +       if ((start + length) > rpc->msgin.length)
> +               goto discard;
> +
> +       if (start == rpc->msgin.recv_end) {
> +               /* Common case: packet is sequential. */
> +               rpc->msgin.recv_end += length;
> +               goto keep;
> +       }
> +
> +       if (start > rpc->msgin.recv_end) {
> +               /* Packet creates a new gap. */
> +               if (!homa_gap_alloc(&rpc->msgin.gaps,
> +                                   rpc->msgin.recv_end, start))
> +                       goto discard;
> +               rpc->msgin.recv_end = end;
> +               goto keep;
> +       }
> +
> +       /* Must now check to see if the packet fills in part or all of
> +        * an existing gap.
> +        */
> +       list_for_each_entry_safe(gap, dummy, &rpc->msgin.gaps, links) {
> +               /* Is packet at the start of this gap? */
> +               if (start <= gap->start) {
> +                       if (end <= gap->start)
> +                               continue;
> +                       if (start < gap->start)
> +                               goto discard;
> +                       if (end > gap->end)
> +                               goto discard;
> +                       gap->start = end;
> +                       if (gap->start >= gap->end) {
> +                               list_del(&gap->links);
> +                               kfree(gap);
> +                       }
> +                       goto keep;
> +               }
> +
> +               /* Is packet at the end of this gap? BTW, at this point we know
> +                * the packet can't cover the entire gap.
> +                */
> +               if (end >= gap->end) {
> +                       if (start >= gap->end)
> +                               continue;
> +                       if (end > gap->end)
> +                               goto discard;
> +                       gap->end = start;
> +                       goto keep;
> +               }
> +
> +               /* Packet is in the middle of the gap; must split the gap. */
> +               gap2 = homa_gap_alloc(&gap->links, gap->start, start);
> +               if (!gap2)
> +                       goto discard;
> +               gap2->time = gap->time;
> +               gap->start = end;
> +               goto keep;
> +       }
> +
> +discard:
> +       kfree_skb(skb);
> +       return;
> +
> +keep:
> +       __skb_queue_tail(&rpc->msgin.packets, skb);
> +       rpc->msgin.bytes_remaining -= length;
> +}
> +
> +/**
> + * homa_copy_to_user() - Copy as much data as possible from incoming
> + * packet buffers to buffers in user space.
> + * @rpc:     RPC for which data should be copied. Must be locked by caller.
> + * Return:   Zero for success or a negative errno if there is an error.
> + *           It is possible for the RPC to be freed while this function
> + *           executes (it releases and reacquires the RPC lock). If that
> + *           happens, -EINVAL will be returned and the state of @rpc
> + *           will be RPC_DEAD. Clears the RPC_PKTS_READY bit in @rpc->flags
> + *           if all available packets have been copied out.
> + */
> +int homa_copy_to_user(struct homa_rpc *rpc)
> +       __must_hold(rpc->bucket->lock)
> +{
> +#define MAX_SKBS 20
> +       struct sk_buff *skbs[MAX_SKBS];
> +       int error = 0;
> +       int n = 0;             /* Number of filled entries in skbs. */
> +       int i;
> +
> +       /* Tricky note: we can't hold the RPC lock while we're actually
> +        * copying to user space, because (a) it's illegal to hold a spinlock
> +        * while copying to user space and (b) we'd like for homa_softirq
> +        * to add more packets to the RPC while we're copying these out.
> +        * So, collect a bunch of packets to copy, then release the lock,
> +        * copy them, and reacquire the lock.
> +        */
> +       while (true) {
> +               struct sk_buff *skb;
> +
> +               if (rpc->state == RPC_DEAD) {
> +                       error = -EINVAL;
> +                       break;
> +               }
> +
> +               skb = __skb_dequeue(&rpc->msgin.packets);
> +               if (skb) {
> +                       skbs[n] = skb;
> +                       n++;
> +                       if (n < MAX_SKBS)
> +                               continue;
> +               }
> +               if (n == 0) {
> +                       atomic_andnot(RPC_PKTS_READY, &rpc->flags);

All networking uses clear_bit() instead...
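
That is, with the flags word as an unsigned long bitmap rather than an
atomic_t; RPC_PKTS_READY_BIT here is a hypothetical bit number, not
the existing mask:

	clear_bit(RPC_PKTS_READY_BIT, &rpc->flags);

clear_bit() is atomic, so this avoids open-coding atomic_andnot().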

> +                       break;
> +               }
> +
> +               /* At this point we've collected a batch of packets (or
> +                * run out of packets); copy any available packets out to
> +                * user space.
> +                */
> +               homa_rpc_unlock(rpc);
> +
> +               /* Each iteration of this loop copies out one skb. */
> +               for (i = 0; i < n; i++) {
> +                       struct homa_data_hdr *h = (struct homa_data_hdr *)
> +                                       skbs[i]->data;
> +                       int pkt_length = homa_data_len(skbs[i]);
> +                       int offset = ntohl(h->seg.offset);
> +                       int buf_bytes, chunk_size;
> +                       struct iov_iter iter;
> +                       int copied = 0;
> +                       char __user *dst;
> +
> +                       /* Each iteration of this loop copies to one
> +                        * user buffer.
> +                        */
> +                       while (copied < pkt_length) {
> +                               chunk_size = pkt_length - copied;
> +                               dst = homa_pool_get_buffer(rpc, offset + copied,
> +                                                          &buf_bytes);
> +                               if (buf_bytes < chunk_size) {
> +                                       if (buf_bytes == 0) {
> +                                               /* skb has data beyond message
> +                                                * end?
> +                                                */
> +                                               break;
> +                                       }
> +                                       chunk_size = buf_bytes;
> +                               }
> +                               error = import_ubuf(READ, dst, chunk_size,
> +                                                   &iter);
> +                               if (error)
> +                                       goto free_skbs;
> +                               error = skb_copy_datagram_iter(skbs[i],
> +                                                              sizeof(*h) +
> +                                                              copied,  &iter,
> +                                                              chunk_size);
> +                               if (error)
> +                                       goto free_skbs;
> +                               copied += chunk_size;
> +                       }
> +               }
> +
> +free_skbs:
> +               for (i = 0; i < n; i++)
> +                       kfree_skb(skbs[i]);

There is a big difference between kfree_skb() and consume_skb()
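
consume_skb() is for packets that were processed successfully; it does
not fire the kfree_skb drop tracepoint, so drop-monitoring tools won't
report these as losses. Here, successfully copied packets would be
released with something like:

	for (i = 0; i < n; i++)
		consume_skb(skbs[i]);	/* delivered, not dropped */

keeping kfree_skb() for genuine discards.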

> +               n = 0;

> +               atomic_or(APP_NEEDS_LOCK, &rpc->flags);
> +               homa_rpc_lock(rpc);
> +               atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);

This construct would probably need a helper.
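
Something like the following, presumably (helper name made up):

static inline void homa_rpc_lock_preempt(struct homa_rpc *rpc)
{
	/* Ask the packet-processing path to yield the RPC lock,
	 * acquire it, then clear the request.
	 */
	atomic_or(APP_NEEDS_LOCK, &rpc->flags);
	homa_rpc_lock(rpc);
	atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
}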

> +               if (error)
> +                       break;
> +       }
> +       return error;
> +}
> +
> +/**
> + * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
> + * all related to the same RPC.
> + * @skb:       First packet in the batch, linked through skb->next.
> + */
> +void homa_dispatch_pkts(struct sk_buff *skb)
> +{
> +#define MAX_ACKS 10
> +       const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> +       struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
> +       u64 id = homa_local_id(h->common.sender_id);
> +       int dport = ntohs(h->common.dport);
> +
> +       /* Used to collect acks from data packets so we can process them
> +        * all at the end (can't process them inline because that may
> +        * require locking conflicting RPCs). If we run out of space just
> +        * ignore the extra acks; they'll be regenerated later through the
> +        * explicit mechanism.
> +        */
> +       struct homa_ack acks[MAX_ACKS];
> +       struct homa_rpc *rpc = NULL;
> +       struct homa_sock *hsk;
> +       struct homa_net *hnet;
> +       struct sk_buff *next;
> +       int num_acks = 0;
> +
> +       /* Find the appropriate socket.*/
> +       hnet = homa_net_from_skb(skb);
> +       hsk = homa_sock_find(hnet, dport);
> +       if (!hsk || (!homa_is_client(id) && !hsk->is_server)) {
> +               if (skb_is_ipv6(skb))
> +                       icmp6_send(skb, ICMPV6_DEST_UNREACH,
> +                                  ICMPV6_PORT_UNREACH, 0, NULL, IP6CB(skb));
> +               else
> +                       icmp_send(skb, ICMP_DEST_UNREACH,
> +                                 ICMP_PORT_UNREACH, 0);
> +               while (skb) {
> +                       next = skb->next;
> +                       kfree_skb(skb);
> +                       skb = next;
> +               }
> +               if (hsk)
> +                       sock_put(&hsk->sock);
> +               return;
> +       }
> +
> +       /* Each iteration through the following loop processes one packet. */
> +       for (; skb; skb = next) {
> +               h = (struct homa_data_hdr *)skb->data;
> +               next = skb->next;
> +
> +               /* Relinquish the RPC lock temporarily if it's needed
> +                * elsewhere.
> +                */
> +               if (rpc) {
> +                       int flags = atomic_read(&rpc->flags);
> +
> +                       if (flags & APP_NEEDS_LOCK) {
> +                               homa_rpc_unlock(rpc);
> +
> +                               /* This short spin is needed to ensure that the
> +                                * other thread gets the lock before this thread
> +                                * grabs it again below (the need for this
> +                                * was confirmed experimentally in 2/2025;
> +                                * without it, the handoff fails 20-25% of the
> +                                * time). Furthermore, the call to homa_spin
> +                                * seems to allow the other thread to acquire
> +                                * the lock more quickly.
> +                                */
> +                               homa_spin(100);
> +                               homa_rpc_lock(rpc);
> +                       }
> +               }
> +
> +               /* If we don't already have an RPC, find it, lock it,
> +                * and create a reference on it.
> +                */
> +               if (!rpc) {
> +                       if (!homa_is_client(id)) {
> +                               /* We are the server for this RPC. */
> +                               if (h->common.type == DATA) {
> +                                       int created;
> +
> +                                       /* Create a new RPC if one doesn't
> +                                        * already exist.
> +                                        */
> +                                       rpc = homa_rpc_alloc_server(hsk, &saddr,
> +                                                                   h,
> +                                                                   &created);
> +                                       if (IS_ERR(rpc)) {
> +                                               rpc = NULL;
> +                                               goto discard;
> +                                       }
> +                               } else {
> +                                       rpc = homa_rpc_find_server(hsk, &saddr,
> +                                                                  id);
> +                               }
> +                       } else {
> +                               rpc = homa_rpc_find_client(hsk, id);
> +                       }
> +                       if (rpc)
> +                               homa_rpc_hold(rpc);
> +               }
> +               if (unlikely(!rpc)) {
> +                       if (h->common.type != NEED_ACK &&
> +                           h->common.type != ACK &&
> +                           h->common.type != RESEND)
> +                               goto discard;
> +               } else {
> +                       if (h->common.type == DATA ||
> +                           h->common.type == BUSY)
> +                               rpc->silent_ticks = 0;
> +                       rpc->peer->outstanding_resends = 0;
> +               }
> +
> +               switch (h->common.type) {
> +               case DATA:
> +                       if (h->ack.client_id) {
> +                               /* Save the ack for processing later, when we
> +                                * have released the RPC lock.
> +                                */
> +                               if (num_acks < MAX_ACKS) {
> +                                       acks[num_acks] = h->ack;
> +                                       num_acks++;
> +                               }
> +                       }
> +                       homa_data_pkt(skb, rpc);
> +                       break;
> +               case RESEND:
> +                       homa_resend_pkt(skb, rpc, hsk);
> +                       break;
> +               case RPC_UNKNOWN:
> +                       homa_rpc_unknown_pkt(skb, rpc);
> +                       break;
> +               case BUSY:
> +                       /* Nothing to do for these packets except reset
> +                        * silent_ticks, which happened above.
> +                        */
> +                       goto discard;
> +               case NEED_ACK:
> +                       homa_need_ack_pkt(skb, hsk, rpc);
> +                       break;
> +               case ACK:
> +                       homa_ack_pkt(skb, hsk, rpc);
> +                       break;
> +                       goto discard;
> +               }
> +               continue;
> +
> +discard:
> +               kfree_skb(skb);
> +       }
> +       if (rpc) {
> +               homa_rpc_put(rpc);
> +               homa_rpc_unlock(rpc);
> +       }
> +
> +       while (num_acks > 0) {
> +               num_acks--;
> +               homa_rpc_acked(hsk, &saddr, &acks[num_acks]);
> +       }
> +
> +       if (hsk->dead_skbs >= 2 * hsk->homa->dead_buffs_limit)
> +               /* We get here if other approaches are not keeping up with
> +                * reaping dead RPCs. See "RPC Reaping Strategy" in
> +                * homa_rpc_reap code for details.
> +                */
> +               homa_rpc_reap(hsk, false);
> +       sock_put(&hsk->sock);
> +}
> +
> +/**
> + * homa_data_pkt() - Handler for incoming DATA packets
> + * @skb:     Incoming packet; size known to be large enough for the header.
> + *           This function now owns the packet.
> + * @rpc:     Information about the RPC corresponding to this packet.
> + *           Must be locked by the caller.
> + */
> +void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
> +
> +       if (rpc->state != RPC_INCOMING && homa_is_client(rpc->id)) {
> +               if (unlikely(rpc->state != RPC_OUTGOING))
> +                       goto discard;
> +               rpc->state = RPC_INCOMING;
> +               if (homa_message_in_init(rpc, ntohl(h->message_length)) != 0)
> +                       goto discard;
> +       } else if (rpc->state != RPC_INCOMING) {
> +               /* Must be server; note that homa_rpc_alloc_server already
> +                * initialized msgin and allocated buffers.
> +                */
> +               if (unlikely(rpc->msgin.length >= 0))
> +                       goto discard;
> +       }
> +
> +       if (rpc->msgin.num_bpages == 0)
> +               /* Drop packets that arrive when we can't allocate buffer
> +                * space. If we keep them around, packet buffer usage can
> +                * exceed available cache space, resulting in poor
> +                * performance.
> +                */
> +               goto discard;
> +
> +       homa_add_packet(rpc, skb);
> +
> +       if (skb_queue_len(&rpc->msgin.packets) != 0 &&
> +           !(atomic_read(&rpc->flags) & RPC_PKTS_READY)) {
> +               atomic_or(RPC_PKTS_READY, &rpc->flags);
> +               homa_rpc_handoff(rpc);
> +       }
> +
> +       return;
> +
> +discard:
> +       kfree_skb(skb);
> +}
> +
> +/**
> + * homa_resend_pkt() - Handler for incoming RESEND packets
> + * @skb:     Incoming packet; size already verified large enough for header.
> + *           This function now owns the packet.
> + * @rpc:     Information about the RPC corresponding to this packet; must
> + *           be locked by caller, but may be NULL if there is no RPC matching
> + *           this packet
> + * @hsk:     Socket on which the packet was received.
> + */
> +void homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
> +                    struct homa_sock *hsk)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       struct homa_resend_hdr *h = (struct homa_resend_hdr *)skb->data;
> +       int offset = ntohl(h->offset);
> +       int length = ntohl(h->length);
> +       int end = offset + length;
> +       struct homa_busy_hdr busy;
> +       int tx_end;
> +
> +       if (!rpc) {
> +               homa_xmit_unknown(skb, hsk);
> +               goto done;
> +       }
> +
> +       tx_end = homa_rpc_tx_end(rpc);
> +       if (!homa_is_client(rpc->id) && rpc->state != RPC_OUTGOING) {
> +               /* We are the server for this RPC and don't yet have a
> +                * response message, so send BUSY to keep the client
> +                * waiting.
> +                */
> +               homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
> +               goto done;
> +       }
> +
> +       if (length == -1)
> +               end = tx_end;
> +
> +       homa_resend_data(rpc, offset, (end > tx_end) ? tx_end : end);
> +
> +       if (offset >= tx_end)  {
> +               /* We have chosen not to transmit any of the requested data;
> +                * send BUSY so the receiver knows we are alive.
> +                */
> +               homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
> +               goto done;
> +       }
> +
> +done:
> +       kfree_skb(skb);
> +}
> +
> +/**
> + * homa_rpc_unknown_pkt() - Handler for incoming RPC_UNKNOWN packets.
> + * @skb:     Incoming packet; size known to be large enough for the header.
> + *           This function now owns the packet.
> + * @rpc:     Information about the RPC corresponding to this packet. Must
> + *           be locked by caller.
> + */
> +void homa_rpc_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       if (homa_is_client(rpc->id)) {
> +               if (rpc->state == RPC_OUTGOING) {
> +                       int tx_end = homa_rpc_tx_end(rpc);
> +
> +                       /* It appears that everything we've already transmitted
> +                        * has been lost; retransmit it.
> +                        */
> +                       homa_resend_data(rpc, 0, tx_end);
> +                       goto done;
> +               }
> +       } else {
> +               homa_rpc_end(rpc);
> +       }
> +done:
> +       kfree_skb(skb);
> +}
> +
> +/**
> + * homa_need_ack_pkt() - Handler for incoming NEED_ACK packets
> + * @skb:     Incoming packet; size already verified large enough for header.
> + *           This function now owns the packet.
> + * @hsk:     Socket on which the packet was received.
> + * @rpc:     The RPC named in the packet header, or NULL if no such
> + *           RPC exists. The RPC has been locked by the caller.
> + */
> +void homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> +                      struct homa_rpc *rpc)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       struct homa_common_hdr *h = (struct homa_common_hdr *)skb->data;
> +       const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> +       u64 id = homa_local_id(h->sender_id);
> +       struct homa_ack_hdr ack;
> +       struct homa_peer *peer;
> +
> +       /* Don't ack if it's not safe for the peer to purge its state
> +        * for this RPC (the RPC still exists and we haven't received
> +        * the entire response), or if we can't find peer info.
> +        */
> +       if (rpc && (rpc->state != RPC_INCOMING ||
> +                   rpc->msgin.bytes_remaining)) {
> +               homa_request_retrans(rpc);
> +               goto done;
> +       } else {
> +               peer = homa_peer_get(hsk, &saddr);
> +               if (IS_ERR(peer))
> +                       goto done;
> +       }
> +
> +       /* Send an ACK for this RPC. At the same time, include all of the
> +        * other acks available for the peer. Note: can't use rpc below,
> +        * since it may be NULL.
> +        */
> +       ack.common.type = ACK;
> +       ack.common.sport = h->dport;
> +       ack.common.dport = h->sport;
> +       ack.common.sender_id = cpu_to_be64(id);
> +       ack.num_acks = htons(homa_peer_get_acks(peer,
> +                                               HOMA_MAX_ACKS_PER_PKT,
> +                                               ack.acks));
> +       __homa_xmit_control(&ack, sizeof(ack), peer, hsk);
> +       homa_peer_release(peer);
> +
> +done:
> +       kfree_skb(skb);


Please double check all your kfree_skb() vs consume_skb()

perf record -a -e skb:kfree_skb  sleep 60
vs
perf record -a -e skb:consume_skb  sleep 60

As a bonus, you can use kfree_skb_reason(skb, some_reason) for future
bug hunting
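
e.g.:

	kfree_skb_reason(skb, SKB_DROP_REASON_NOT_SPECIFIED);

with Homa-specific drop reasons added as they become useful.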

> +}
> +
> +/**
> + * homa_ack_pkt() - Handler for incoming ACK packets
> + * @skb:     Incoming packet; size already verified large enough for header.
> + *           This function now owns the packet.
> + * @hsk:     Socket on which the packet was received.
> + * @rpc:     The RPC named in the packet header, or NULL if no such
> + *           RPC exists. The RPC lock will be dead on return.
> + */
> +void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> +                 struct homa_rpc *rpc)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> +       struct homa_ack_hdr *h = (struct homa_ack_hdr *)skb->data;
> +       int i, count;
> +
> +       if (rpc)
> +               homa_rpc_end(rpc);
> +
> +       count = ntohs(h->num_acks);
> +       if (count > 0) {
> +               if (rpc) {
> +                       /* Must temporarily release rpc's lock because
> +                        * homa_rpc_acked needs to acquire RPC locks.
> +                        */
> +                       homa_rpc_unlock(rpc);
> +                       for (i = 0; i < count; i++)
> +                               homa_rpc_acked(hsk, &saddr, &h->acks[i]);
> +                       homa_rpc_lock(rpc);
> +               } else {
> +                       for (i = 0; i < count; i++)
> +                               homa_rpc_acked(hsk, &saddr, &h->acks[i]);
> +               }
> +       }
> +       kfree_skb(skb);
> +}
> +
> +/**
> + * homa_wait_private() - Waits until the response has been received for
> + * a specific RPC or the RPC has failed with an error.
> + * @rpc:          RPC to wait for; an error will be returned if the RPC is
> + *                not a client RPC or not private. Must be locked by caller.
> + * @nonblocking:  Nonzero means return immediately if @rpc not ready.
> + * Return:        0 means that @rpc is ready for attention: either its response
> + *                has been received or it has an unrecoverable error such as
> + *                ETIMEDOUT (in rpc->error). Nonzero means some other error
> + *                (such as EINTR or EINVAL) occurred before @rpc became ready
> + *                for attention; in this case the return value is a negative
> + *                errno.
> + */
> +int homa_wait_private(struct homa_rpc *rpc, int nonblocking)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       struct homa_interest interest;
> +       int result;
> +
> +       if (!(atomic_read(&rpc->flags) & RPC_PRIVATE))
> +               return -EINVAL;
> +
> +       /* Each iteration through this loop waits until rpc needs attention
> +        * in some way (e.g. packets have arrived), then deals with that need
> +        * (e.g. copy to user space). It may take many iterations until the
> +        * RPC is ready for the application.
> +        */
> +       while (1) {
> +               result = 0;
> +               if (!rpc->error)
> +                       rpc->error = homa_copy_to_user(rpc);
> +               if (rpc->error)
> +                       break;
> +               if (rpc->msgin.length >= 0 &&
> +                   rpc->msgin.bytes_remaining == 0 &&
> +                   skb_queue_len(&rpc->msgin.packets) == 0)
> +                       break;
> +
> +               if (nonblocking) {
> +                       result = -EAGAIN;
> +                       break;
> +               }
> +
> +               result = homa_interest_init_private(&interest, rpc);
> +               if (result != 0)
> +                       break;
> +
> +               homa_rpc_unlock(rpc);
> +               result = homa_interest_wait(&interest);
> +
> +               atomic_or(APP_NEEDS_LOCK, &rpc->flags);
> +               homa_rpc_lock(rpc);
> +               atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);

reuse the helper.

> +               homa_interest_unlink_private(&interest);
> +
> +               /* Abort on error, but if the interest actually got ready
> +                * in the meantime the ignore the error (loop back around
> +                * to process the RPC).
> +                */
> +               if (result != 0 && atomic_read(&interest.ready) == 0)
> +                       break;
> +       }
> +
> +       return result;
> +}
> +
> +/**
> + * homa_wait_shared() - Wait for the completion of any non-private
> + * incoming message on a socket.
> + * @hsk:          Socket on which to wait. Must not be locked.
> + * @nonblocking:  Nonzero means return immediately if no RPC is ready.
> + *
> + * Return:    Pointer to an RPC with a complete incoming message or nonzero
> + *            error field, or a negative errno (usually -EINTR). If an RPC
> + *            is returned it will be locked and referenced; the caller
> + *            must release the lock and the reference.
> + */
> +struct homa_rpc *homa_wait_shared(struct homa_sock *hsk, int nonblocking)
> +       __cond_acquires(rpc->bucket->lock)
> +{
> +       struct homa_interest interest;
> +       struct homa_rpc *rpc;
> +       int result;
> +
> +       INIT_LIST_HEAD(&interest.links);
> +       init_waitqueue_head(&interest.wait_queue);
> +       /* Each iteration through this loop waits until an RPC needs attention
> +        * in some way (e.g. packets have arrived), then deals with that need
> +        * (e.g. copy to user space). It may take many iterations until an
> +        * RPC is ready for the application.
> +        */
> +       while (1) {
> +               homa_sock_lock(hsk);
> +               if (hsk->shutdown) {
> +                       rpc = ERR_PTR(-ESHUTDOWN);
> +                       homa_sock_unlock(hsk);
> +                       goto done;
> +               }
> +               if (!list_empty(&hsk->ready_rpcs)) {
> +                       rpc = list_first_entry(&hsk->ready_rpcs,
> +                                              struct homa_rpc,
> +                                              ready_links);
> +                       homa_rpc_hold(rpc);
> +                       list_del_init(&rpc->ready_links);
> +                       if (!list_empty(&hsk->ready_rpcs)) {
> +                               /* There are still more RPCs available, so
> +                                * let Linux know.
> +                                */
> +                               hsk->sock.sk_data_ready(&hsk->sock);
> +                       }
> +                       homa_sock_unlock(hsk);
> +               } else if (nonblocking) {
> +                       rpc = ERR_PTR(-EAGAIN);
> +                       homa_sock_unlock(hsk);
> +
> +                       /* This is a good time to cleanup dead RPCS. */
> +                       homa_rpc_reap(hsk, false);
> +                       goto done;
> +               } else {
> +                       homa_interest_init_shared(&interest, hsk);
> +                       homa_sock_unlock(hsk);
> +                       result = homa_interest_wait(&interest);
> +
> +                       if (result != 0) {
> +                               int ready;
> +
> +                               /* homa_interest_wait returned an error, so we
> +                                * have to do two things. First, unlink the
> +                                * interest from the socket. Second, check to
> +                                * see if in the meantime the interest received
> +                                * a handoff. If so, ignore the error. Very
> +                                * important to hold the socket lock while
> +                                * checking, in order to eliminate races with
> +                                * homa_rpc_handoff.
> +                                */
> +                               homa_sock_lock(hsk);
> +                               homa_interest_unlink_shared(&interest);
> +                               ready = atomic_read(&interest.ready);
> +                               homa_sock_unlock(hsk);
> +                               if (ready == 0) {
> +                                       rpc = ERR_PTR(result);
> +                                       goto done;
> +                               }
> +                       }
> +
> +                       rpc = interest.rpc;
> +                       if (!rpc) {
> +                               rpc = ERR_PTR(-ESHUTDOWN);
> +                               goto done;
> +                       }
> +               }
> +
> +               atomic_or(APP_NEEDS_LOCK, &rpc->flags);
> +               homa_rpc_lock(rpc);
> +               atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);

Reuse the helper here.

> +               if (!rpc->error)
> +                       rpc->error = homa_copy_to_user(rpc);
> +               if (rpc->error) {
> +                       if (rpc->state != RPC_DEAD)
> +                               break;
> +               } else if (rpc->msgin.bytes_remaining == 0 &&
> +                   skb_queue_len(&rpc->msgin.packets) == 0)
> +                       break;
> +               homa_rpc_put(rpc);
> +               homa_rpc_unlock(rpc);
> +       }
> +
> +done:
> +       return rpc;
> +}
> +
> +/**
> + * homa_rpc_handoff() - This function is called when the input message for
> + * an RPC is ready for attention from a user thread. It notifies a waiting
> + * reader and/or queues the RPC, as appropriate.
> + * @rpc:                RPC to handoff; must be locked.
> + */
> +void homa_rpc_handoff(struct homa_rpc *rpc)
> +       __must_hold(rpc->bucket->lock)
> +{
> +       struct homa_sock *hsk = rpc->hsk;
> +       struct homa_interest *interest;
> +
> +       if (atomic_read(&rpc->flags) & RPC_PRIVATE) {
> +               homa_interest_notify_private(rpc);
> +               return;
> +       }
> +
> +       /* Shared RPC; if there is a waiting thread, hand off the RPC;
> +        * otherwise enqueue it.
> +        */
> +       homa_sock_lock(hsk);
> +       if (hsk->shutdown) {
> +               homa_sock_unlock(hsk);
> +               return;
> +       }
> +       if (!list_empty(&hsk->interests)) {
> +               interest = list_first_entry(&hsk->interests,
> +                                           struct homa_interest, links);
> +               list_del_init(&interest->links);
> +               interest->rpc = rpc;
> +               homa_rpc_hold(rpc);
> +               atomic_set_release(&interest->ready, 1);
> +               wake_up(&interest->wait_queue);
> +       } else if (list_empty(&rpc->ready_links)) {
> +               list_add_tail(&rpc->ready_links, &hsk->ready_rpcs);
> +               hsk->sock.sk_data_ready(&hsk->sock);
> +       }
> +       homa_sock_unlock(hsk);
> +}
> +
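
If I'm reading this correctly, the atomic_set_release() above pairs
with an acquire read on the waiting side, roughly like this (a
simplified sketch of what homa_interest_wait() presumably does, not
the actual code):

        err = wait_event_interruptible(interest->wait_queue,
                        atomic_read_acquire(&interest->ready) != 0);

so once a waiter observes ready != 0 it is also guaranteed to see
interest->rpc and the reference taken via homa_rpc_hold(), without any
extra locking.
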
> --
> 2.43.0
>

* Re: [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c
  2025-09-01 22:53     ` John Ousterhout
  2025-09-01 23:03       ` Andrew Lunn
@ 2025-09-02  8:12       ` Paolo Abeni
  2025-09-02 23:15         ` John Ousterhout
  1 sibling, 1 reply; 47+ messages in thread
From: Paolo Abeni @ 2025-09-02  8:12 UTC (permalink / raw)
  To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba

On 9/2/25 12:53 AM, John Ousterhout wrote:
> On Tue, Aug 26, 2025 at 9:17 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>> +             header_offset = skb_transport_header(skb) - skb->data;
>>> +             if (header_offset)
>>> +                     __skb_pull(skb, header_offset);
>>> +
>>> +             /* Reject packets that are too short or have bogus types. */
>>> +             h = (struct homa_common_hdr *)skb->data;
>>> +             if (unlikely(skb->len < sizeof(struct homa_common_hdr) ||
>>> +                          h->type < DATA || h->type > MAX_OP ||
>>> +                          skb->len < header_lengths[h->type - DATA]))
>>> +                     goto discard;
>>> +
>>> +             /* Process the packet now if it is a control packet or
>>> +              * if it contains an entire short message.
>>> +              */
>>> +             if (h->type != DATA || ntohl(((struct homa_data_hdr *)h)
>>> +                             ->message_length) < 1400) {
>>
>> I could not find where `message_length` is validated. AFAICS
>> data_hdr->message_length could be > skb->len.
>>
>> Also I don't see how the condition checked above ensures that the pkt
>> contains the whole message.
> 
> Long messages consist of multiple packets, so it is fine if
> data_hdr->message_length > skb->len. That said, Homa does not fragment
> a message into multiple packets unless necessary, so if the condition
> above is met, then the message is contained in a single packet (if for
> some reason a sender fragments a short message, that won't cause
> problems).

Let me rephrase: why 1400? Is that MRU-dependent, or just an arbitrary
threshold? What if the NIC can receive 8K frames (or only frames at
most 1024 bytes long)? What if the stack adds a long encapsulation?

What if an evil or buggy peer sets message_length to a random value
(larger than the number of bytes actually sent, or smaller)?
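
To make the question concrete: before message_length feeds any
decision, I would have expected a sanity check along these lines (a
sketch; I'm assuming the protocol defines some upper bound such as
HOMA_MAX_MESSAGE_LENGTH for a message):

        u32 mlen = ntohl(((struct homa_data_hdr *)h)->message_length);

        if (mlen == 0 || mlen > HOMA_MAX_MESSAGE_LENGTH)
                goto discard;

but I could not find an equivalent anywhere in the series.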

Cheers,

Paolo


* Re: [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c
  2025-09-02  8:12       ` Paolo Abeni
@ 2025-09-02 23:15         ` John Ousterhout
  0 siblings, 0 replies; 47+ messages in thread
From: John Ousterhout @ 2025-09-02 23:15 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba

On Tue, Sep 2, 2025 at 1:12 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 9/2/25 12:53 AM, John Ousterhout wrote:
> > On Tue, Aug 26, 2025 at 9:17 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>> +             header_offset = skb_transport_header(skb) - skb->data;
> >>> +             if (header_offset)
> >>> +                     __skb_pull(skb, header_offset);
> >>> +
> >>> +             /* Reject packets that are too short or have bogus types. */
> >>> +             h = (struct homa_common_hdr *)skb->data;
> >>> +             if (unlikely(skb->len < sizeof(struct homa_common_hdr) ||
> >>> +                          h->type < DATA || h->type > MAX_OP ||
> >>> +                          skb->len < header_lengths[h->type - DATA]))
> >>> +                     goto discard;
> >>> +
> >>> +             /* Process the packet now if it is a control packet or
> >>> +              * if it contains an entire short message.
> >>> +              */
> >>> +             if (h->type != DATA || ntohl(((struct homa_data_hdr *)h)
> >>> +                             ->message_length) < 1400) {
> >>
> >> I could not find where `message_length` is validated. AFAICS
> >> data_hdr->message_length could be > skb->len.
> >>
> >> Also I don't see how the condition checked above ensures that the pkt
> >> contains the whole message.
> >
> > Long messages consist of multiple packets, so it is fine if
> > data_hdr->message_length > skb->len. That said, Homa does not fragment
> > a message into multiple packets unless necessary, so if the condition
> > above is met, then the message is contained in a single packet (if for
> > some reason a sender fragments a short message, that won't cause
> > problems).
>
> Let me rephrase: why 1400? Is that MRU-dependent, or just an arbitrary
> threshold? What if the NIC can receive 8K frames (or only frames at
> most 1024 bytes long)? What if the stack adds a long encapsulation?
>
> What if an evil or buggy peer sets message_length to a random value
> (larger than the number of bytes actually sent, or smaller)?

1400 is an arbitrary threshold. It has no impact on functionality or
correctness; it is simply used to reorder the packets in a batch so
that shorter messages get processed first. If the NIC can receive 8K
frames, that won't change this threshold; only messages declared
shorter than 1400 bytes get the scheduling boost. If a message shorter
than 1400 bytes arrives in multiple packets, all of its packets get
the boost.

A sender could cheat the mechanism by declaring a message length of
less than 1400 bytes when the message is really longer than that. This
would give the message's packets priority for SoftIRQ processing, but
all of the data in the message beyond the stated length would be
discarded, so I'm not sure how the sender would benefit from this.
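
(For what it's worth, a later revision could replace the magic number
with a named constant so the intent is self-documenting. A sketch,
with an invented name:

        /* Messages declared shorter than this get the SoftIRQ
         * scheduling boost described above; this is purely a latency
         * heuristic, not a correctness bound.
         */
        #define HOMA_SHORT_MSG_THRESHOLD 1400

        if (h->type != DATA ||
            ntohl(((struct homa_data_hdr *)h)->message_length) <
            HOMA_SHORT_MSG_THRESHOLD) {
                /* ... process the packet immediately ... */
        }

This would behave identically to the current code.)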

-John-

end of thread (newest message: 2025-09-02 23:15 UTC)

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
2025-08-18 20:55 [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 01/15] net: homa: define user-visible API for Homa John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 02/15] net: homa: create homa_wire.h John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 03/15] net: homa: create shared Homa header files John Ousterhout
2025-08-26  9:05   ` Paolo Abeni
2025-08-26 23:10     ` John Ousterhout
2025-08-27  7:21       ` Paolo Abeni
2025-08-29  3:03         ` John Ousterhout
2025-08-29  7:53           ` Paolo Abeni
2025-08-29 17:08             ` John Ousterhout
2025-09-01  7:59               ` Paolo Abeni
2025-08-27 12:16       ` Eric Dumazet
2025-08-18 20:55 ` [PATCH net-next v15 04/15] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 05/15] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
2025-08-26  9:32   ` Paolo Abeni
2025-08-27 23:27     ` John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 06/15] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
2025-08-26 10:10   ` Paolo Abeni
2025-08-31 23:29     ` John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 07/15] net: homa: create homa_interest.h and homa_interest.c John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 08/15] net: homa: create homa_pacer.h and homa_pacer.c John Ousterhout
2025-08-26 10:53   ` Paolo Abeni
2025-09-01 16:35     ` John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 09/15] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
2025-08-26 11:31   ` Paolo Abeni
2025-09-01 20:10     ` John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 10/15] net: homa: create homa_outgoing.c John Ousterhout
2025-08-26 11:50   ` Paolo Abeni
2025-09-01 20:21     ` John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 11/15] net: homa: create homa_utils.c John Ousterhout
2025-08-26 11:52   ` Paolo Abeni
2025-09-01 20:30     ` John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 12/15] net: homa: create homa_incoming.c John Ousterhout
2025-08-26 12:05   ` Paolo Abeni
2025-09-01 22:12     ` John Ousterhout
2025-09-02  7:19   ` Eric Dumazet
2025-08-18 20:55 ` [PATCH net-next v15 13/15] net: homa: create homa_timer.c John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 14/15] net: homa: create homa_plumbing.c John Ousterhout
2025-08-26 16:17   ` Paolo Abeni
2025-09-01 22:53     ` John Ousterhout
2025-09-01 23:03       ` Andrew Lunn
2025-09-02  4:54         ` John Ousterhout
2025-09-02  8:12       ` Paolo Abeni
2025-09-02 23:15         ` John Ousterhout
2025-08-18 20:55 ` [PATCH net-next v15 15/15] net: homa: create Makefile and Kconfig John Ousterhout
2025-08-23  5:36   ` kernel test robot
2025-08-22 15:51 ` [PATCH net-next v15 00/15] Begin upstreaming Homa transport protocol John Ousterhout
