* [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol
@ 2025-05-02 23:37 John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa John Ousterhout
` (15 more replies)
0 siblings, 16 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This patch series begins the process of upstreaming the Homa transport
protocol. Homa is an alternative to TCP for use in datacenter
environments. It provides 10-100x reductions in tail latency for short
messages relative to TCP. Its benefits are greatest for mixed workloads
containing both short and long messages running under high network loads.
Homa is not API-compatible with TCP: it is connectionless and message-
oriented (but still reliable and flow-controlled). Homa's new API not
only contributes to its performance gains but also eliminates the
massive amount of connection state required by TCP for highly connected
datacenter workloads (Homa uses ~1 socket per application, whereas
TCP requires a separate socket for each peer).
For more details on Homa, please consult the Homa Wiki:
https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
The Wiki has pointers to two papers on Homa (one of which describes
this implementation) as well as man pages describing the application
API and other information.
There is also a GitHub repo for Homa:
https://github.com/PlatformLab/HomaModule
The GitHub repo contains a superset of this patch set, including:
* Additional source code that will eventually be upstreamed
* Extensive unit tests (which will also be upstreamed eventually)
* Application-level library functions (which need to go in glibc?)
* Man pages (which need to be upstreamed as well)
* Benchmarking and instrumentation code
For this patch series, Homa has been stripped down to the bare minimum
functionality needed to actually execute remote procedure calls (about
8000 lines of source code, compared to 15000 in the complete Homa). The
remaining code will be upstreamed in smaller batches once this patch
series has been accepted. Note: the code in this patch series is
functional but its performance is not very interesting (about the same
as TCP).
The patch series is arranged to introduce the major functional components
of Homa. Until the last patch has been applied, the code is inert (it
will not be compiled).
Note: this implementation of Homa supports both IPv4 and IPv6.
v8 changes:
- There were no reviews of the v7 patch series, so there are not many changes
in this version
- Pull out pacer code into separate files homa_pacer.h and homa_pacer.c
- Refactor homa_pool APIs (move allocation/deallocation into homa_pool.c,
move locking responsibility out)
- Fix various problems from sparse, checkpatch, and kernel-doc
v7 changes:
- Add documentation files reap.txt and sync.txt.
- Replace __u64 with u64 (and __s64 with s64) in non-uapi settings.
- Replace '__aligned(L1_CACHE_BYTES)' with '____cacheline_aligned_in_smp'.
- Use alloc_percpu_gfp for homa_pool::cores.
- Extract bool homa_bpage_available from homa_pool_get_pages.
- Rename homa_rpc_free to homa_rpc_end.
- Use skb_queue_purge in homa_rpc_reap instead of hand-coding.
- Clean up RCU usage in several places:
- Eliminate unnecessary use of RCU for homa_sock::dead_rpcs.
- Eliminate use of RCU for homa::throttled_rpcs (unnecessary, unclear
that it would have worked). Added return value from homa_pacer_xmit.
- Call rcu_read_lock/unlock in homa_peer_find (just to be safe; probably
isn't necessary)
- Eliminate extraneous use of RCU in homa_pool_allocate.
- Cleaned up RCU usage around homa_sock::active_rpcs.
- Change homa_sock_find to take a reference on the returned socket;
caller no longer has to worry about RCU issues.
- Remove "locker" arguments from homa_lock_rpc, homa_lock_sock,
homa_rpc_try_lock, and homa_bucket_lock (shouldn't be needed, given
CONFIG_PROVE_LOCKING).
- Use __GFP_ZERO in *alloc calls instead of initializing individual
struct fields to zero.
- Don't use raw_smp_processor_id; use smp_processor_id instead.
- Remove homa_peertab_get_peers from this patch series (and also fix
problems in it related to RCU usage).
- Add annotation to homa_peertab_gc_dsts requiring write_lock.
- Remove "lock_slow" functions, which don't add functionality in this patch
series.
- Remove unused fields from homa_peer structs.
- Reorder fields in homa_rpc_bucket to squeeze out padding.
- Refactor homa_sock_start_scan etc.
- Take a reference on the current socket to keep it from being freed.
- No need now for homa_socktab::active_scans or struct homa_socktab_links.
- rcu_read_lock/unlock is now entirely in the homa_sock scan methods;
no need for callers to worry about this.
- Add homa_rpc_hold and homa_rpc_put. Replaces several ad-hoc mechanisms,
such as RPC_COPYING_FROM_USER and RPC_COPYING_TO_USER, with a single
general-purpose mechanism.
- Use __skb_queue_purge instead of skb_queue_purge (locking isn't needed
because Homa has its own locks).
- Rename UNKNOWN packet type to RPC_UNKNOWN.
- Add hsk->is_server plus SO_HOMA_SERVER setsockopt: by default, sockets
will not accept incoming RPCs unless they have been bound.
- Refactor waiting mechanism for incoming packets: simplify wait
criteria and use standard mechanisms (wait_event_*) for blocking
threads. Create homa_interest.c and homa_interest.h.
- Add memory accounting for outbound messages (e.g. new sysctl value
wmem_max); senders now block when memory limit is exceeded.
- Make Homa a pernet subsystem (a separate Homa transport for each
network namespace).
v6 changes:
- Make hrtimer variable in homa_timer_main static instead of stack-allocated
(avoids complaints when in debug mode).
- Remove unnecessary cast in homa_dst_refresh.
- Replace erroneous uses of GFP_KERNEL with GFP_ATOMIC.
- Check for "all ports in use" in homa_sock_init.
- Refactor API for homa_rpc_reap to incorporate "reap all" feature,
eliminate need for callers to specify exact amount of work to do
when in "reap a few" mode.
- Fix bug in homa_rpc_reap (wasn't resetting rx_frees for each iteration
of outer loop).
v5 changes:
- Change type of start in struct homa_rcvbuf_args from void* to __u64;
also add more __user annotations.
- Refactor homa_interest: replace awkward ready_rpc field with two
fields: rpc and rpc_ready. Added new functions homa_interest_get_rpc
and homa_interest_set_rpc to encapsulate/clarify access to
interest->rpc_ready.
- Eliminate use of LIST_POISON1 etc. in homa_interests (use list_del_init
instead of list_del).
- Remove homa_next_skb function, which is obsolete, unused, and incorrect
- Eliminate ipv4_to_ipv6 function (use ipv6_addr_set_v4mapped instead)
- Eliminate is_mapped_ipv4 function (use ipv6_addr_v4mapped instead)
- Use __u64 instead of uint64_t in homa.h
- Remove 'extern "C"' from homa.h
- Various fixes from patchwork checks (checkpatch.pl, etc.)
- A few improvements to comments
v4 changes:
- Remove sport argument for homa_find_server_rpc (unneeded). Also
remove client_port field from struct homa_ack
- Refactor ICMP packet handling (v6 was incorrect)
- Check for socket shutdown in homa_poll
- Fix potential for memory garbling in homa_symbol_for_type
- Remove unused ETHERNET_MAX_PAYLOAD declaration
- Rename classes in homa_wire.h so they all have "homa_" prefixes
- Various fixes from patchwork checks (checkpatch.pl, etc.)
- A few improvements to comments
v3 changes:
- Fix formatting in Kconfig
- Set ipv6_pinfo_offset in struct proto
- Check return value of inet6_register_protosw
- In homa_load cleanup, don't cleanup things that haven't been
initialized
- Add MODULE_ALIAS_NET_PF_PROTO_TYPE to auto-load module
- Check return value from kzalloc call in homa_sock_init
- Change SO_HOMA_SET_BUF to SO_HOMA_RCVBUF
- Change struct homa_set_buf_args to struct homa_rcvbuf_args
- Implement getsockopt for SO_HOMA_RCVBUF
- Return ENOPROTOOPT instead of EINVAL where appropriate in
setsockopt and getsockopt
- Fix crash in homa_pool_check_waiting if pool has no region yet
- Check for NULL msg->msg_name in homa_sendmsg
- Change addr->in6.sin6_family to addr->sa.sa_family in homa_sendmsg
for clarity
- For some errors in homa_recvmsg, return directly rather than "goto done"
- Return error from recvmsg if offsets of returned read buffers are bogus
- Add comments to clarify lock-unlock pairs for RPCs
- Rename homa_try_bucket_lock to homa_try_rpc_lock
- Fix issues found by test robot and checkpatch.pl
- Ensure first argument to do_div is 64 bits
- Remove C++ style comments
- Remove some code that will only be relevant in future patches that
fill in missing Homa functionality
v2 changes:
- Remove sockaddr_in_union declaration from public API in homa.h
- Remove kernel wrapper functions (homa_send, etc.) from homa.h
- Fix many sparse warnings (still more work to do here) and other issues
uncovered by test robot
- Fix checkpatch.pl issues
- Remove residual code related to unit tests
- Remove references to tt_record from comments
- Make it safe to delete sockets during homa_socktab scans
- Use uintptr_t for portability to 32-bit platforms
- Use do_div instead of "/" for portability
- Remove homa->busy_usecs and homa->gro_busy_usecs (not needed in
this stripped down version of Homa)
- Eliminate usage of cpu_khz, use sched_clock instead of get_cycles
- Add missing checks of kmalloc return values
- Remove "inline" qualifier from functions in .c files
- Document that pad fields must be zero
- Use more precise type "uint32_t" rather than "int"
- Remove unneeded #include of linux/version.h
John Ousterhout (15):
net: homa: define user-visible API for Homa
net: homa: create homa_wire.h
net: homa: create shared Homa header files
net: homa: create homa_pool.h and homa_pool.c
net: homa: create homa_peer.h and homa_peer.c
net: homa: create homa_sock.h and homa_sock.c
net: homa: create homa_interest.h and homa_interest.c
net: homa: create homa_pacer.h and homa_pacer.c
net: homa: create homa_rpc.h and homa_rpc.c
net: homa: create homa_outgoing.c
net: homa: create homa_utils.c
net: homa: create homa_incoming.c
net: homa: create homa_timer.c
net: homa: create homa_plumbing.c
net: homa: create Makefile and Kconfig
include/uapi/linux/homa.h | 196 +++++++
net/Kconfig | 1 +
net/Makefile | 1 +
net/homa/Kconfig | 21 +
net/homa/Makefile | 16 +
net/homa/homa_impl.h | 481 ++++++++++++++++
net/homa/homa_incoming.c | 922 +++++++++++++++++++++++++++++++
net/homa/homa_interest.c | 122 +++++
net/homa/homa_interest.h | 100 ++++
net/homa/homa_outgoing.c | 575 +++++++++++++++++++
net/homa/homa_pacer.c | 310 +++++++++++
net/homa/homa_pacer.h | 185 +++++++
net/homa/homa_peer.c | 308 +++++++++++
net/homa/homa_peer.h | 211 +++++++
net/homa/homa_plumbing.c | 1098 +++++++++++++++++++++++++++++++++++++
net/homa/homa_pool.c | 470 ++++++++++++++++
net/homa/homa_pool.h | 149 +++++
net/homa/homa_rpc.c | 484 ++++++++++++++++
net/homa/homa_rpc.h | 484 ++++++++++++++++
net/homa/homa_sock.c | 393 +++++++++++++
net/homa/homa_sock.h | 386 +++++++++++++
net/homa/homa_stub.h | 91 +++
net/homa/homa_timer.c | 156 ++++++
net/homa/homa_utils.c | 103 ++++
net/homa/homa_wire.h | 362 ++++++++++++
net/homa/reap.txt | 50 ++
net/homa/sync.txt | 77 +++
27 files changed, 7752 insertions(+)
create mode 100644 include/uapi/linux/homa.h
create mode 100644 net/homa/Kconfig
create mode 100644 net/homa/Makefile
create mode 100644 net/homa/homa_impl.h
create mode 100644 net/homa/homa_incoming.c
create mode 100644 net/homa/homa_interest.c
create mode 100644 net/homa/homa_interest.h
create mode 100644 net/homa/homa_outgoing.c
create mode 100644 net/homa/homa_pacer.c
create mode 100644 net/homa/homa_pacer.h
create mode 100644 net/homa/homa_peer.c
create mode 100644 net/homa/homa_peer.h
create mode 100644 net/homa/homa_plumbing.c
create mode 100644 net/homa/homa_pool.c
create mode 100644 net/homa/homa_pool.h
create mode 100644 net/homa/homa_rpc.c
create mode 100644 net/homa/homa_rpc.h
create mode 100644 net/homa/homa_sock.c
create mode 100644 net/homa/homa_sock.h
create mode 100644 net/homa/homa_stub.h
create mode 100644 net/homa/homa_timer.c
create mode 100644 net/homa/homa_utils.c
create mode 100644 net/homa/homa_wire.h
create mode 100644 net/homa/reap.txt
create mode 100644 net/homa/sync.txt
--
2.43.0
^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-05 7:54 ` Paolo Abeni
2025-05-05 8:03 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 02/15] net: homa: create homa_wire.h John Ousterhout
` (14 subsequent siblings)
15 siblings, 2 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
Note: for man pages, see the Homa Wiki at:
https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v7:
* Add HOMA_SENDMSG_NONBLOCKING flag for sendmsg
* API changes for new mechanism for waiting for incoming messages
* Add setsockopt SO_HOMA_SERVER (enable incoming requests)
* Use u64 and __u64 properly
---
include/uapi/linux/homa.h | 196 ++++++++++++++++++++++++++++++++++++++
1 file changed, 196 insertions(+)
create mode 100644 include/uapi/linux/homa.h
diff --git a/include/uapi/linux/homa.h b/include/uapi/linux/homa.h
new file mode 100644
index 000000000000..0c0a29ea8076
--- /dev/null
+++ b/include/uapi/linux/homa.h
@@ -0,0 +1,196 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines the kernel call interface for the Homa
+ * transport protocol.
+ */
+
+#ifndef _UAPI_LINUX_HOMA_H
+#define _UAPI_LINUX_HOMA_H
+
+#include <linux/types.h>
+#ifndef __KERNEL__
+#include <netinet/in.h>
+#include <sys/socket.h>
+#endif
+
+/* IANA-assigned Internet Protocol number for Homa. */
+#define IPPROTO_HOMA 146
+
+/**
+ * define HOMA_MAX_MESSAGE_LENGTH - Maximum bytes of payload in a Homa
+ * request or response message.
+ */
+#define HOMA_MAX_MESSAGE_LENGTH 1000000
+
+/**
+ * define HOMA_BPAGE_SIZE - Number of bytes in pages used for receive
+ * buffers. Must be power of two.
+ */
+#define HOMA_BPAGE_SIZE (1 << HOMA_BPAGE_SHIFT)
+#define HOMA_BPAGE_SHIFT 16
+
+/**
+ * define HOMA_MAX_BPAGES - The largest number of bpages that will be required
+ * to store an incoming message.
+ */
+#define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) \
+ >> HOMA_BPAGE_SHIFT)
+
+/**
+ * define HOMA_MIN_DEFAULT_PORT - The 16 bit port space is divided into
+ * two nonoverlapping regions. Ports 1-32767 are reserved exclusively
+ * for well-defined server ports. The remaining ports are used for client
+ * ports; these are allocated automatically by Homa. Port 0 is reserved.
+ */
+#define HOMA_MIN_DEFAULT_PORT 0x8000
+
+/**
+ * struct homa_sendmsg_args - Provides information needed by Homa's
+ * sendmsg; passed to sendmsg using the msg_control field.
+ */
+struct homa_sendmsg_args {
+ /**
+ * @id: (in/out) An initial value of 0 means a new request is
+ * being sent; nonzero means the message is a reply to the given
+ * id. If the message is a request, then the value is modified to
+ * hold the id of the new RPC.
+ */
+ __u64 id;
+
+ /**
+ * @completion_cookie: (in) Used only for request messages; will be
+ * returned by recvmsg when the RPC completes. Typically used to
+ * locate app-specific info about the RPC.
+ */
+ __u64 completion_cookie;
+
+ /**
+ * @flags: (in) OR-ed combination of bits that control the operation.
+ * See below for values.
+ */
+ __u32 flags;
+
+ /** @reserved: Not currently used. */
+ __u32 reserved;
+};
+
+#if !defined(__cplusplus)
+_Static_assert(sizeof(struct homa_sendmsg_args) >= 24,
+ "homa_sendmsg_args shrunk");
+_Static_assert(sizeof(struct homa_sendmsg_args) <= 24,
+ "homa_sendmsg_args grew");
+#endif
+
+/* Flag bits for homa_sendmsg_args.flags (see man page for documentation):
+ */
+#define HOMA_SENDMSG_PRIVATE 0x01
+#define HOMA_SENDMSG_NONBLOCKING 0x02
+#define HOMA_SENDMSG_VALID_FLAGS 0x03
+
+/**
+ * struct homa_recvmsg_args - Provides information needed by Homa's
+ * recvmsg; passed to recvmsg using the msg_control field.
+ */
+struct homa_recvmsg_args {
+ /**
+ * @id: (in/out) Initial value is 0 to wait for any shared RPC;
+ * nonzero means wait for that specific (private) RPC. Returns
+ * the id of the RPC received.
+ */
+ __u64 id;
+
+ /**
+ * @completion_cookie: (out) If the incoming message is a response,
+ * this will return the completion cookie specified when the
+ * request was sent. For requests this will always be zero.
+ */
+ __u64 completion_cookie;
+
+ /**
+ * @flags: (in) OR-ed combination of bits that control the operation.
+ * See below for values.
+ */
+ __u32 flags;
+
+ /**
+ * @num_bpages: (in/out) Number of valid entries in @bpage_offsets.
+ * Passes in bpages from previous messages that can now be
+ * recycled; returns bpages from the new message.
+ */
+ __u32 num_bpages;
+
+ /**
+ * @bpage_offsets: (in/out) Each entry is an offset into the buffer
+ * region for the socket pool. When returned from recvmsg, the
+ * offsets indicate where fragments of the new message are stored. All
+ * entries but the last refer to full buffer pages (HOMA_BPAGE_SIZE
+ * bytes) and are bpage-aligned. The last entry may refer to a bpage
+ * fragment and is not necessarily aligned. The application now owns
+ * these bpages and must eventually return them to Homa, using
+ * bpage_offsets in a future recvmsg invocation.
+ */
+ __u32 bpage_offsets[HOMA_MAX_BPAGES];
+};
+
+#if !defined(__cplusplus)
+_Static_assert(sizeof(struct homa_recvmsg_args) >= 88,
+ "homa_recvmsg_args shrunk");
+_Static_assert(sizeof(struct homa_recvmsg_args) <= 88,
+ "homa_recvmsg_args grew");
+#endif
+
+/* Flag bits for homa_recvmsg_args.flags (see man page for documentation):
+ */
+#define HOMA_RECVMSG_NONBLOCKING 0x01
+#define HOMA_RECVMSG_VALID_FLAGS 0x01
+
+/** define SO_HOMA_RCVBUF: setsockopt option for specifying buffer region. */
+#define SO_HOMA_RCVBUF 10
+
+/**
+ * define SO_HOMA_SERVER: setsockopt option for specifying whether a
+ * socket will act as server.
+ */
+#define SO_HOMA_SERVER 11
+
+/** struct homa_rcvbuf_args - setsockopt argument for SO_HOMA_RCVBUF. */
+struct homa_rcvbuf_args {
+ /** @start: Address of first byte of buffer region in user space. */
+ __u64 start;
+
+ /** @length: Total number of bytes available at @start. */
+ size_t length;
+};
+
+/* Meanings of the bits in Homa's flag word, which can be set using
+ * "sysctl /net/homa/flags".
+ */
+
+/**
+ * define HOMA_FLAG_DONT_THROTTLE - disable the output throttling mechanism
+ * (always send all packets immediately).
+ */
+#define HOMA_FLAG_DONT_THROTTLE 2
+
+/* I/O control calls on Homa sockets. These are mapped into the
+ * SIOCPROTOPRIVATE range of 0x89e0 through 0x89ef.
+ */
+
+#define HOMAIOCFREEZE _IO(0x89, 0xef)
+
+int homa_send(int sockfd, const void *message_buf,
+ size_t length, const struct sockaddr *dest_addr,
+ __u32 addrlen, __u64 *id, __u64 completion_cookie,
+ int flags);
+int homa_sendv(int sockfd, const struct iovec *iov,
+ int iovcnt, const struct sockaddr *dest_addr,
+ __u32 addrlen, __u64 *id, __u64 completion_cookie,
+ int flags);
+ssize_t homa_reply(int sockfd, const void *message_buf,
+ size_t length, const struct sockaddr *dest_addr,
+ __u32 addrlen, __u64 id);
+ssize_t homa_replyv(int sockfd, const struct iovec *iov,
+ int iovcnt, const struct sockaddr *dest_addr,
+ __u32 addrlen, __u64 id);
+
+#endif /* _UAPI_LINUX_HOMA_H */
--
2.43.0
* [PATCH net-next v8 02/15] net: homa: create homa_wire.h
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-05 8:28 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 03/15] net: homa: create shared Homa header files John Ousterhout
` (13 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This file defines the on-the-wire packet formats for Homa.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v7:
* Rename UNKNOWN packet type to RPC_UNKNOWN
* Use u64 and __u64 properly
---
net/homa/homa_wire.h | 362 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 362 insertions(+)
create mode 100644 net/homa/homa_wire.h
diff --git a/net/homa/homa_wire.h b/net/homa/homa_wire.h
new file mode 100644
index 000000000000..47693244c3ec
--- /dev/null
+++ b/net/homa/homa_wire.h
@@ -0,0 +1,362 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines the on-the-wire format of Homa packets. */
+
+#ifndef _HOMA_WIRE_H
+#define _HOMA_WIRE_H
+
+#include <linux/skbuff.h>
+
+/* Defines the possible types of Homa packets.
+ *
+ * See the xxx_header structs below for more information about each type.
+ */
+enum homa_packet_type {
+ DATA = 0x10,
+ RESEND = 0x12,
+ RPC_UNKNOWN = 0x13,
+ BUSY = 0x14,
+ NEED_ACK = 0x17,
+ ACK = 0x18,
+ BOGUS = 0x19, /* Used only in unit tests. */
+ /* If you add a new type here, you must also do the following:
+ * 1. Change BOGUS so it is the highest opcode
+ * 2. Add support for the new opcode in homa_print_packet,
+ * homa_print_packet_short, homa_symbol_for_type, and mock_skb_new.
+ * 3. Add the header length to header_lengths in homa_plumbing.c.
+ */
+};
+
+/** define HOMA_IPV6_HEADER_LENGTH - Size of IP header (V6). */
+#define HOMA_IPV6_HEADER_LENGTH 40
+
+/** define HOMA_IPV4_HEADER_LENGTH - Size of IP header (V4). */
+#define HOMA_IPV4_HEADER_LENGTH 20
+
+/**
+ * define HOMA_SKB_EXTRA - How many bytes of additional space to allow at the
+ * beginning of each sk_buff, before the IP header. This includes room for a
+ * VLAN header and also includes some extra space, "just to be safe" (not
+ * really sure if this is needed).
+ */
+#define HOMA_SKB_EXTRA 40
+
+/**
+ * define HOMA_ETH_OVERHEAD - Number of bytes per Ethernet packet for Ethernet
+ * header, CRC, preamble, and inter-packet gap.
+ */
+#define HOMA_ETH_OVERHEAD 42
+
+/**
+ * define HOMA_MIN_PKT_LENGTH - Every Homa packet must be padded to at least
+ * this length to meet Ethernet frame size limitations. This number includes
+ * Homa headers and data, but not IP or Ethernet headers.
+ */
+#define HOMA_MIN_PKT_LENGTH 26
+
+/**
+ * define HOMA_MAX_HEADER - Number of bytes in the largest Homa header.
+ */
+#define HOMA_MAX_HEADER 90
+
+/**
+ * struct homa_common_hdr - Wire format for the first bytes in every Homa
+ * packet. This must (mostly) match the format of a TCP header to enable
+ * Homa packets to actually be transmitted as TCP packets (and thereby
+ * take advantage of TSO and other features).
+ */
+struct homa_common_hdr {
+ /**
+ * @sport: Port on source machine from which packet was sent.
+ * Must be in the same position as in a TCP header.
+ */
+ __be16 sport;
+
+ /**
+ * @dport: Port on destination that is to receive packet. Must be
+ * in the same position as in a TCP header.
+ */
+ __be16 dport;
+
+ /**
+ * @sequence: corresponds to the sequence number field in TCP headers;
+ * used in DATA packets to hold the offset in the message of the first
+ * byte of data. However, when TSO is used without TCP hijacking, this
+ * value will only be correct in the first segment of a GSO packet.
+ */
+ __be32 sequence;
+
+ /**
+ * @ack: Corresponds to the high-order bits of the acknowledgment
+ * field in TCP headers; not used by Homa.
+ */
+ char ack[3];
+
+ /**
+ * @type: Homa packet type (one of the values of the homa_packet_type
+ * enum). Corresponds to the low-order byte of the ack in TCP.
+ */
+ __u8 type;
+
+ /**
+ * @doff: High order 4 bits holds the number of 4-byte chunks in a
+ * homa_data_hdr (low-order bits unused). Used only for DATA packets;
+ * must be in the same position as the data offset in a TCP header.
+ * Used by TSO to determine where the replicated header portion ends.
+ */
+ __u8 doff;
+
+ /** @reserved1: Not used (corresponds to TCP flags). */
+ __u8 reserved1;
+
+ /**
+ * @window: Corresponds to the window field in TCP headers. Not used
+ * by Homa.
+ */
+ __be16 window;
+
+ /**
+ * @checksum: Not used by Homa, but must occupy the same bytes as
+ * the checksum in a TCP header (TSO may modify this?).
+ */
+ __be16 checksum;
+
+ /** @reserved2: Not used (corresponds to TCP urgent field). */
+ __be16 reserved2;
+
+ /**
+ * @sender_id: the identifier of this RPC as used on the sender (i.e.,
+ * if the low-order bit is set, then the sender is the server for
+ * this RPC).
+ */
+ __be64 sender_id;
+} __packed;
+
+/**
+ * struct homa_ack - Identifies an RPC that can be safely deleted by its
+ * server. After sending the response for an RPC, the server must retain its
+ * state for the RPC until it knows that the client has successfully
+ * received the entire response. An ack indicates this. Clients will
+ * piggyback acks on future data packets, but if a client doesn't send
+ * any data to the server, the server will eventually request an ack
+ * explicitly with a NEED_ACK packet, in which case the client will
+ * return an explicit ACK.
+ */
+struct homa_ack {
+ /**
+ * @client_id: The client's identifier for the RPC. 0 means this ack
+ * is invalid.
+ */
+ __be64 client_id;
+
+ /** @server_port: The server-side port for the RPC. */
+ __be16 server_port;
+} __packed;
+
+/* struct homa_data_hdr - Contains data for part or all of a Homa message.
+ * An incoming packet consists of a homa_data_hdr followed by message data.
+ * An outgoing packet can have this simple format as well, or it can be
+ * structured as a GSO packet with the following format:
+ *
+ * |-----------------------|
+ * | |
+ * | data_header |
+ * | |
+ * |-----------------------|
+ * | |
+ * | |
+ * | segment data |
+ * | |
+ * | |
+ * |-----------------------|
+ * | seg_header |
+ * |-----------------------|
+ * | |
+ * | |
+ * | segment data |
+ * | |
+ * | |
+ * |-----------------------|
+ * | seg_header |
+ * |-----------------------|
+ * | |
+ * | |
+ * | segment data |
+ * | |
+ * | |
+ * |-----------------------|
+ *
+ * TSO will not adjust @homa_common_hdr.sequence in the segments, so Homa
+ * sprinkles correct offsets (in homa_seg_hdrs) throughout the segment data;
+ * TSO/GSO will include a different homa_seg_hdr in each generated packet.
+ */
+
+struct homa_seg_hdr {
+ /**
+ * @offset: Offset within message of the first byte of data in
+ * this segment.
+ */
+ __be32 offset;
+} __packed;
+
+struct homa_data_hdr {
+ struct homa_common_hdr common;
+
+ /** @message_length: Total #bytes in the message. */
+ __be32 message_length;
+
+ __be32 reserved1;
+
+ /** @ack: If the @client_id field of this is nonzero, provides info
+ * about an RPC that the recipient can now safely free. Note: in
+ * TSO packets this will get duplicated in each of the segments;
+ * in order to avoid repeated attempts to ack the same RPC,
+ * homa_gro_receive will clear this field in all segments but the
+ * first.
+ */
+ struct homa_ack ack;
+
+ __be16 reserved2;
+
+ /**
+ * @retransmit: 1 means this packet was sent in response to a RESEND
+ * (it has already been sent previously).
+ */
+ __u8 retransmit;
+
+ char pad[3];
+
+ /** @seg: First of possibly many segments. */
+ struct homa_seg_hdr seg;
+} __packed;
+_Static_assert(sizeof(struct homa_data_hdr) <= HOMA_MAX_HEADER,
+ "homa_data_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+_Static_assert(sizeof(struct homa_data_hdr) >= HOMA_MIN_PKT_LENGTH,
+ "homa_data_hdr too small: Homa doesn't currently have code to pad data packets");
+_Static_assert(((sizeof(struct homa_data_hdr) - sizeof(struct homa_seg_hdr)) &
+ 0x3) == 0,
"homa_data_hdr length not a multiple of 4 bytes (required for TCP/TSO compatibility)");
+
+/**
+ * homa_data_len() - Returns the total number of bytes in a DATA packet
+ * after the homa_data_hdr. Note: if the packet is a GSO packet, the result
+ * may include metadata as well as packet data.
+ * @skb: Incoming data packet
+ * Return: see above
+ */
+static inline int homa_data_len(struct sk_buff *skb)
+{
+ return skb->len - skb_transport_offset(skb) -
+ sizeof(struct homa_data_hdr);
+}
+
+/**
+ * struct homa_resend_hdr - Wire format for RESEND packets.
+ *
+ * A RESEND is sent by the receiver when it believes that message data may
+ * have been lost in transmission (or if it is concerned that the sender may
+ * have crashed). The receiver should resend the specified portion of the
+ * message, even if it already sent it previously.
+ */
+struct homa_resend_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+
+ /**
+ * @offset: Offset within the message of the first byte of data that
+ * should be retransmitted.
+ */
+ __be32 offset;
+
+ /**
+ * @length: Number of bytes of data to retransmit; this could specify
+ * a range longer than the total message size. Zero is a special case
+ * used by servers; in this case, there is no need to actually resend
+ * anything; the purpose of this packet is to trigger an RPC_UNKNOWN
+ * response if the client no longer cares about this RPC.
+ */
+ __be32 length;
+
+} __packed;
+_Static_assert(sizeof(struct homa_resend_hdr) <= HOMA_MAX_HEADER,
+ "homa_resend_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct homa_rpc_unknown_hdr - Wire format for RPC_UNKNOWN packets.
+ *
+ * An RPC_UNKNOWN packet is sent by either server or client when it receives a
+ * packet for an RPC that is unknown to it. When a client receives an
+ * RPC_UNKNOWN packet it will typically restart the RPC from the beginning;
+ * when a server receives an RPC_UNKNOWN packet it will typically discard its
+ * state for the RPC.
+ */
+struct homa_rpc_unknown_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+} __packed;
+_Static_assert(sizeof(struct homa_rpc_unknown_hdr) <= HOMA_MAX_HEADER,
+ "homa_rpc_unknown_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct homa_busy_hdr - Wire format for BUSY packets.
+ *
+ * These packets tell the recipient that the sender is still alive (even if
+ * it isn't sending data expected by the recipient).
+ */
+struct homa_busy_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+} __packed;
+_Static_assert(sizeof(struct homa_busy_hdr) <= HOMA_MAX_HEADER,
+ "homa_busy_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct homa_need_ack_hdr - Wire format for NEED_ACK packets.
+ *
+ * These packets ask the recipient (a client) to return an ACK message if
+ * the packet's RPC is no longer active.
+ */
+struct homa_need_ack_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+} __packed;
+_Static_assert(sizeof(struct homa_need_ack_hdr) <= HOMA_MAX_HEADER,
+ "homa_need_ack_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct homa_ack_hdr - Wire format for ACK packets.
+ *
+ * These packets are sent from a client to a server to indicate that
+ * a set of RPCs is no longer active on the client, so the server can
+ * free any state it may have for them.
+ */
+struct homa_ack_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+
+ /** @num_acks: Number of (leading) elements in @acks that are valid. */
+ __be16 num_acks;
+
+#define HOMA_MAX_ACKS_PER_PKT 5
+ /** @acks: Info about RPCs that are no longer active. */
+ struct homa_ack acks[HOMA_MAX_ACKS_PER_PKT];
+} __packed;
+_Static_assert(sizeof(struct homa_ack_hdr) <= HOMA_MAX_HEADER,
+ "homa_ack_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * homa_local_id() - Given an RPC identifier from an input packet (which
+ * is in network byte order), return the decoded id we should use for that
+ * RPC on this machine.
+ * @sender_id: RPC id from an incoming packet, such as h->common.sender_id
+ * Return: see above
+ */
+static inline u64 homa_local_id(__be64 sender_id)
+{
+ /* If the client bit was set on the sender side it must be
+ * cleared here, and vice versa.
+ */
+ return be64_to_cpu(sender_id) ^ 1;
+}
+
+#endif /* _HOMA_WIRE_H */
--
2.43.0
* [PATCH net-next v8 03/15] net: homa: create shared Homa header files
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 02/15] net: homa: create homa_wire.h John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-05 10:20 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 04/15] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
` (12 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
homa_impl.h defines "struct homa", which contains overall information
about the Homa transport, plus various odds and ends that are used
throughout the Homa implementation.
homa_stub.h is a temporary header file that provides stubs for
facilities that have been omitted from this first patch series. This file
will go away once Homa is fully upstreamed.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v8:
* Pull out pacer-related fields into separate struct homa_pacer in homa_pacer.h
Changes for v7:
* Make Homa a per-net subsystem
* Track tx buffer memory usage
* Refactor waiting mechanism for incoming packets: simplify wait
criteria and use standard Linux mechanisms for waiting
* Remove "lock_slow" functions, which don't add functionality in this
patch series
* Rename homa_rpc_free to homa_rpc_end
* Add homa_make_header_avl function
* Use u64 and __u64 properly
---
net/homa/homa_impl.h | 481 +++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_stub.h | 91 ++++++++
2 files changed, 572 insertions(+)
create mode 100644 net/homa/homa_impl.h
create mode 100644 net/homa/homa_stub.h
diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
new file mode 100644
index 000000000000..5a022a96473e
--- /dev/null
+++ b/net/homa/homa_impl.h
@@ -0,0 +1,481 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions that are shared across the files
+ * that implement Homa for Linux.
+ */
+
+#ifndef _HOMA_IMPL_H
+#define _HOMA_IMPL_H
+
+#include <linux/bug.h>
+
+#include <linux/audit.h>
+#include <linux/icmp.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/completion.h>
+#include <linux/proc_fs.h>
+#include <linux/sched/clock.h>
+#include <linux/sched/signal.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/vmalloc.h>
+#include <net/icmp.h>
+#include <net/ip.h>
+#include <net/netns/generic.h>
+#include <net/protocol.h>
+#include <net/inet_common.h>
+#include <net/gro.h>
+#include <net/rps.h>
+
+#include <linux/homa.h>
+#include "homa_wire.h"
+
+/* Forward declarations. */
+struct homa;
+struct homa_peer;
+struct homa_rpc;
+struct homa_sock;
+
+#define sizeof32(type) ((int)(sizeof(type)))
+
+#ifdef __CHECKER__
+#define __context__(x, y, z) __attribute__((context(x, y, z)))
+#else
+#define __context__(...)
+#endif /* __CHECKER__ */
+
+/**
+ * union sockaddr_in_union - Holds either an IPv4 or IPv6 address (smaller
+ * and easier to use than sockaddr_storage).
+ */
+union sockaddr_in_union {
+ /** @sa: Used to access as a generic sockaddr. */
+ struct sockaddr sa;
+
+ /** @in4: Used to access as IPv4 socket. */
+ struct sockaddr_in in4;
+
+ /** @in6: Used to access as IPv6 socket. */
+ struct sockaddr_in6 in6;
+};
+
+/**
+ * struct homa - Stores overall information about an implementation of
+ * the Homa transport. One of these objects exists for each network namespace.
+ */
+struct homa {
+ /**
+ * @next_outgoing_id: Id to use for next outgoing RPC request.
+ * This is always even: it's used only to generate client-side ids.
+ * Accessed without locks. Note: RPC ids are unique within a
+ * single client machine.
+ */
+ atomic64_t next_outgoing_id;
+
+ /**
+ * @pacer: Information related to the pacer; managed by homa_pacer.c.
+ */
+ struct homa_pacer *pacer;
+
+ /**
+ * @prev_default_port: The most recent port number assigned from
+ * the range of default ports.
+ */
+ __u16 prev_default_port ____cacheline_aligned_in_smp;
+
+ /**
+ * @port_map: Information about all open sockets. Dynamically
+ * allocated; must be kfreed.
+ */
+ struct homa_socktab *port_map ____cacheline_aligned_in_smp;
+
+ /**
+ * @peers: Info about all the other hosts we have communicated with.
+ * Dynamically allocated; must be kfreed.
+ */
+ struct homa_peertab *peers;
+
+ /** @max_numa: Highest NUMA node id in use by any core. */
+ int max_numa;
+
+ /**
+ * @resend_ticks: When an RPC's @silent_ticks reaches this value,
+ * start sending RESEND requests.
+ */
+ int resend_ticks;
+
+ /**
+ * @resend_interval: minimum number of homa timer ticks between
+ * RESENDs for the same RPC.
+ */
+ int resend_interval;
+
+ /**
+ * @timeout_ticks: abort an RPC if its silent_ticks reaches this value.
+ */
+ int timeout_ticks;
+
+ /**
+ * @timeout_resends: Assume that a server is dead if it has not
+ * responded after this many RESENDs have been sent to it.
+ */
+ int timeout_resends;
+
+ /**
+ * @request_ack_ticks: How many timer ticks we'll wait for the
+ * client to ack an RPC before explicitly requesting an ack.
+ * Set externally via sysctl.
+ */
+ int request_ack_ticks;
+
+ /**
+ * @reap_limit: Maximum number of packet buffers to free in a
+ * single call to homa_rpc_reap.
+ */
+ int reap_limit;
+
+ /**
+ * @dead_buffs_limit: If the number of packet buffers in dead but
+ * not yet reaped RPCs is less than this number, then Homa reaps
+ * RPCs in a way that minimizes impact on performance but may permit
+ * dead RPCs to accumulate. If the number of dead packet buffers
+ * exceeds this value, then Homa switches to a more aggressive approach
+ * to reaping RPCs. Set externally via sysctl.
+ */
+ int dead_buffs_limit;
+
+ /**
+ * @max_dead_buffs: The largest aggregate number of packet buffers
+ * in dead (but not yet reaped) RPCs that has existed so far in a
+ * single socket. Readable via sysctl, and may be reset via sysctl
+ * to begin recalculating.
+ */
+ int max_dead_buffs;
+
+ /**
+ * @max_gso_size: Maximum number of bytes that will be included
+ * in a single output packet that Homa passes to Linux. Can be set
+ * externally via sysctl to lower the limit already enforced by Linux.
+ */
+ int max_gso_size;
+
+ /**
+ * @gso_force_software: A non-zero value will cause Homa to perform
+ * segmentation in software using GSO; zero means ask the NIC to
+ * perform TSO. Set externally via sysctl.
+ */
+ int gso_force_software;
+
+ /**
+ * @wmem_max: Limit on the value of sk_sndbuf for any socket. Set
+ * externally via sysctl.
+ */
+ int wmem_max;
+
+ /**
+ * @timer_ticks: number of times that homa_timer has been invoked
+ * (may wrap around, which is safe).
+ */
+ u32 timer_ticks;
+
+ /**
+ * @flags: a collection of bits that can be set using sysctl
+ * to trigger various behaviors.
+ */
+ int flags;
+
+ /**
+ * @bpage_lease_usecs: how long a core can own a bpage (microseconds)
+ * before its ownership can be revoked to reclaim the page.
+ */
+ int bpage_lease_usecs;
+
+ /**
+ * @next_id: Set via sysctl; causes next_outgoing_id to be set to
+ * this value; always reads as zero. Typically used while debugging to
+ * ensure that different nodes use different ranges of ids.
+ */
+ int next_id;
+
+ /**
+ * @timer_kthread: Thread that runs timer code to detect lost
+ * packets and crashed peers.
+ */
+ struct task_struct *timer_kthread;
+
+ /** @hrtimer: Used to wake up @timer_kthread at regular intervals. */
+ struct hrtimer hrtimer;
+
+ /**
+ * @destroyed: True means that this structure is being destroyed
+ * so everyone should clean up.
+ */
+ bool destroyed;
+};
+
+/**
+ * struct homa_skb_info - Additional information needed by Homa for each
+ * outbound DATA packet. Space is allocated for this at the very end of the
+ * linear part of the skb.
+ */
+struct homa_skb_info {
+ /**
+ * @next_skb: used to link together all of the skb's for a Homa
+ * message (in order of offset).
+ */
+ struct sk_buff *next_skb;
+
+ /**
+ * @wire_bytes: total number of bytes of network bandwidth that
+ * will be consumed by this packet. This includes everything,
+ * including additional headers added by GSO, IP header, Ethernet
+ * header, CRC, preamble, and inter-packet gap.
+ */
+ int wire_bytes;
+
+ /**
+ * @data_bytes: total bytes of message data across all of the
+ * segments in this packet.
+ */
+ int data_bytes;
+
+ /** @seg_length: maximum number of data bytes in each GSO segment. */
+ int seg_length;
+
+ /**
+ * @offset: offset within the message of the first byte of data in
+ * this packet.
+ */
+ int offset;
+};
+
+/**
+ * homa_get_skb_info() - Return the address of Homa's private information
+ * for an sk_buff.
+ * @skb: Socket buffer whose info is needed.
+ * Return: address of Homa's private information for @skb.
+ */
+static inline struct homa_skb_info *homa_get_skb_info(struct sk_buff *skb)
+{
+ return (struct homa_skb_info *)(skb_end_pointer(skb)) - 1;
+}
+
+/**
+ * homa_set_doff() - Fills in the doff TCP header field for a Homa packet.
+ * @h: Packet header whose doff field is to be set.
+ * @size: Size of the "header", in bytes (must be a multiple of 4). This
+ * information is used only for TSO; it's the number of bytes
+ * that should be replicated in each segment. The bytes after
+ * this will be distributed among segments.
+ */
+static inline void homa_set_doff(struct homa_data_hdr *h, int size)
+{
+ /* Drop the 2 low-order bits from size and set the 4 high-order
+ * bits of doff from what's left.
+ */
+ h->common.doff = size << 2;
+}
+
+/**
+ * skb_is_ipv6() - Check whether a packet is encapsulated with IPv6.
+ * @skb: Packet to check.
+ * Return: true if @skb is IPv6, false otherwise (presumably it's IPv4).
+ */
+static inline bool skb_is_ipv6(const struct sk_buff *skb)
+{
+ return ipv6_hdr(skb)->version == 6;
+}
+
+/**
+ * ipv6_to_ipv4() - Given an IPv6 address produced by
+ * ipv6_addr_set_v4mapped, return the original IPv4 address (in network
+ * byte order).
+ * @ip6: IPv6 address; assumed to be a mapped IPv4 address.
+ * Return: IPv4 address stored in @ip6.
+ */
+static inline __be32 ipv6_to_ipv4(const struct in6_addr ip6)
+{
+ return ip6.in6_u.u6_addr32[3];
+}
+
+/**
+ * canonical_ipv6_addr() - Convert a socket address to the "standard"
+ * form used in Homa, which is always an IPv6 address; if the original address
+ * was IPv4, convert it to an IPv4-mapped IPv6 address.
+ * @addr: Address to canonicalize (if NULL, "any" is returned).
+ * Return: IPv6 address corresponding to @addr.
+ */
+static inline struct in6_addr canonical_ipv6_addr(const union sockaddr_in_union
+ *addr)
+{
+ struct in6_addr mapped;
+
+ if (addr) {
+ if (addr->sa.sa_family == AF_INET6)
+ return addr->in6.sin6_addr;
+ ipv6_addr_set_v4mapped(addr->in4.sin_addr.s_addr, &mapped);
+ return mapped;
+ }
+ return in6addr_any;
+}
+
+/**
+ * skb_canonical_ipv6_saddr() - Given a packet buffer, return its source
+ * address in the "standard" form used in Homa, which is always an IPv6
+ * address; if the original address was IPv4, convert it to an IPv4-mapped
+ * IPv6 address.
+ * @skb: The source address will be extracted from this packet buffer.
+ * Return: IPv6 address for @skb's source machine.
+ */
+static inline struct in6_addr skb_canonical_ipv6_saddr(struct sk_buff *skb)
+{
+ struct in6_addr mapped;
+
+ if (skb_is_ipv6(skb))
+ return ipv6_hdr(skb)->saddr;
+ ipv6_addr_set_v4mapped(ip_hdr(skb)->saddr, &mapped);
+ return mapped;
+}
+
+/**
+ * is_homa_pkt() - Check whether a packet carries the Homa protocol.
+ * @skb: Packet to check (assumed to be IPv4).
+ * Return: true if @skb's IP header specifies IPPROTO_HOMA.
+ */
+static inline bool is_homa_pkt(struct sk_buff *skb)
+{
+ struct iphdr *iph = ip_hdr(skb);
+
+ return (iph->protocol == IPPROTO_HOMA);
+}
+
+/**
+ * homa_make_header_avl() - Invokes pskb_may_pull to make sure that all the
+ * Homa header information for a packet is in the linear part of the skb
+ * where it can be addressed using skb_transport_header.
+ * @skb: Packet for which header is needed.
+ * Return: The result of pskb_may_pull (true for success)
+ */
+static inline bool homa_make_header_avl(struct sk_buff *skb)
+{
+ int pull_length;
+
+ pull_length = skb_transport_header(skb) - skb->data + HOMA_MAX_HEADER;
+ if (pull_length > skb->len)
+ pull_length = skb->len;
+ return pskb_may_pull(skb, pull_length);
+}
+
+#define UNIT_LOG(...)
+#define UNIT_HOOK(...)
+
+extern unsigned int homa_net_id;
+
+void homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
+ int port, int error);
+void homa_abort_sock_rpcs(struct homa_sock *hsk, int error);
+void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+ struct homa_rpc *rpc);
+void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb);
+int homa_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+int homa_bind(struct socket *sk, struct sockaddr *addr,
+ int addr_len);
+void homa_close(struct sock *sock, long timeout);
+int homa_copy_to_user(struct homa_rpc *rpc);
+void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
+void homa_destroy(struct homa *homa);
+int homa_disconnect(struct sock *sk, int flags);
+void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa);
+int homa_err_handler_v4(struct sk_buff *skb, u32 info);
+int homa_err_handler_v6(struct sk_buff *skb,
+ struct inet6_skb_parm *opt, u8 type, u8 code,
+ int offset, __be32 info);
+int homa_fill_data_interleaved(struct homa_rpc *rpc,
+ struct sk_buff *skb, struct iov_iter *iter);
+struct homa_gap *homa_gap_new(struct list_head *next, int start, int end);
+void homa_gap_retry(struct homa_rpc *rpc);
+int homa_get_port(struct sock *sk, unsigned short snum);
+int homa_getsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, int __user *optlen);
+int homa_hash(struct sock *sk);
+enum hrtimer_restart homa_hrtimer(struct hrtimer *timer);
+int homa_init(struct homa *homa, struct net *net);
+int homa_ioctl(struct sock *sk, int cmd, int *karg);
+int homa_load(void);
+int homa_message_out_fill(struct homa_rpc *rpc,
+ struct iov_iter *iter, int xmit);
+void homa_message_out_init(struct homa_rpc *rpc, int length);
+void homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+ struct homa_rpc *rpc);
+struct sk_buff *homa_new_data_packet(struct homa_rpc *rpc,
+ struct iov_iter *iter, int offset,
+ int length, int max_seg_data);
+int homa_net_init(struct net *net);
+void homa_net_exit(struct net *net);
+__poll_t homa_poll(struct file *file, struct socket *sock,
+ struct poll_table_struct *wait);
+int homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+ int flags, int *addr_len);
+void homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
+ struct homa_sock *hsk);
+void homa_rpc_abort(struct homa_rpc *crpc, int error);
+void homa_rpc_acked(struct homa_sock *hsk,
+ const struct in6_addr *saddr, struct homa_ack *ack);
+void homa_rpc_end(struct homa_rpc *rpc);
+void homa_rpc_handoff(struct homa_rpc *rpc);
+int homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
+int homa_setsockopt(struct sock *sk, int level, int optname,
+ sockptr_t optval, unsigned int optlen);
+int homa_shutdown(struct socket *sock, int how);
+int homa_softirq(struct sk_buff *skb);
+void homa_spin(int ns);
+void homa_timer(struct homa *homa);
+int homa_timer_main(void *transport);
+void homa_unhash(struct sock *sk);
+void homa_rpc_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
+void homa_unload(void);
+int homa_wait_private(struct homa_rpc *rpc, int nonblocking);
+struct homa_rpc
+ *homa_wait_shared(struct homa_sock *hsk, int nonblocking);
+int homa_xmit_control(enum homa_packet_type type, void *contents,
+ size_t length, struct homa_rpc *rpc);
+int __homa_xmit_control(void *contents, size_t length,
+ struct homa_peer *peer, struct homa_sock *hsk);
+void homa_xmit_data(struct homa_rpc *rpc, bool force);
+void homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk);
+
+int homa_message_in_init(struct homa_rpc *rpc, int unsched);
+void homa_resend_data(struct homa_rpc *rpc, int start, int end);
+void __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc);
+
+/**
+ * homa_from_net() - Return the struct homa associated with a particular
+ * struct net.
+ * @net: Get the struct homa for this net namespace.
+ * Return: see above
+ */
+static inline struct homa *homa_from_net(struct net *net)
+{
+ return (struct homa *)net_generic(net, homa_net_id);
+}
+
+/**
+ * homa_from_sock() - Return the struct homa associated with a particular
+ * struct sock.
+ * @sock: Get the struct homa for this socket.
+ * Return: see above
+ */
+static inline struct homa *homa_from_sock(struct sock *sock)
+{
+ return (struct homa *)net_generic(sock_net(sock), homa_net_id);
+}
+
+/**
+ * homa_from_skb() - Return the struct homa associated with a particular
+ * sk_buff.
+ * @skb: Get the struct homa for this packet buffer.
+ * Return: see above
+ */
+static inline struct homa *homa_from_skb(struct sk_buff *skb)
+{
+ return (struct homa *)net_generic(dev_net(skb->dev), homa_net_id);
+}
+
+#endif /* _HOMA_IMPL_H */
diff --git a/net/homa/homa_stub.h b/net/homa/homa_stub.h
new file mode 100644
index 000000000000..3bfe7b8b5b42
--- /dev/null
+++ b/net/homa/homa_stub.h
@@ -0,0 +1,91 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains stripped-down replacements for facilities that
+ * have been temporarily removed from Homa during the Linux upstreaming
+ * process. By the time upstreaming is complete this file will
+ * have gone away.
+ */
+
+#ifndef _HOMA_STUB_H
+#define _HOMA_STUB_H
+
+#include "homa_impl.h"
+
+static inline int homa_skb_init(struct homa *homa)
+{
+ return 0;
+}
+
+static inline void homa_skb_cleanup(struct homa *homa)
+{}
+
+static inline void homa_skb_release_pages(struct homa *homa)
+{}
+
+static inline int homa_skb_append_from_iter(struct homa *homa,
+ struct sk_buff *skb,
+ struct iov_iter *iter, int length)
+{
+ char *dst = skb_put(skb, length);
+
+ if (copy_from_iter(dst, length, iter) != length)
+ return -EFAULT;
+ return 0;
+}
+
+static inline int homa_skb_append_to_frag(struct homa *homa,
+ struct sk_buff *skb, void *buf,
+ int length)
+{
+ char *dst = skb_put(skb, length);
+
+ memcpy(dst, buf, length);
+ return 0;
+}
+
+static inline int homa_skb_append_from_skb(struct homa *homa,
+ struct sk_buff *dst_skb,
+ struct sk_buff *src_skb,
+ int offset, int length)
+{
+ return homa_skb_append_to_frag(homa, dst_skb,
+ skb_transport_header(src_skb) + offset, length);
+}
+
+static inline void homa_skb_free_tx(struct homa *homa, struct sk_buff *skb)
+{
+ kfree_skb(skb);
+}
+
+static inline void homa_skb_free_many_tx(struct homa *homa,
+ struct sk_buff **skbs, int count)
+{
+ int i;
+
+ for (i = 0; i < count; i++)
+ kfree_skb(skbs[i]);
+}
+
+static inline void homa_skb_get(struct sk_buff *skb, void *dest, int offset,
+ int length)
+{
+ memcpy(dest, skb_transport_header(skb) + offset, length);
+}
+
+static inline struct sk_buff *homa_skb_new_tx(int length)
+{
+ struct sk_buff *skb;
+
+ skb = alloc_skb(HOMA_SKB_EXTRA + HOMA_IPV6_HEADER_LENGTH +
+ sizeof(struct homa_skb_info) + length, GFP_ATOMIC);
+ if (likely(skb)) {
+ skb_reserve(skb, HOMA_SKB_EXTRA + HOMA_IPV6_HEADER_LENGTH);
+ skb_reset_transport_header(skb);
+ }
+ return skb;
+}
+
+static inline void homa_skb_stash_pages(struct homa *homa, int length)
+{}
+
+#endif /* _HOMA_STUB_H */
--
2.43.0
* [PATCH net-next v8 04/15] net: homa: create homa_pool.h and homa_pool.c
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (2 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 03/15] net: homa: create shared Homa header files John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-05 9:51 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
` (11 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
These files implement Homa's mechanism for managing application-level
buffer space for incoming messages. This mechanism allows Homa to copy
data out to user space in parallel with receiving packets; it was
discussed in a talk at NetDev 0x17.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v8:
* Refactor homa_pool APIs (move allocation/deallocation into homa_pool.c,
move locking responsibility out)
Changes for v7:
* Use u64 and __u64 properly
* Eliminate extraneous use of RCU
* Refactor pool->cores to use percpu variable
* Use smp_processor_id instead of raw_smp_processor_id
---
net/homa/homa_pool.c | 470 +++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_pool.h | 149 ++++++++++++++
net/homa/sync.txt | 77 +++++++
3 files changed, 696 insertions(+)
create mode 100644 net/homa/homa_pool.c
create mode 100644 net/homa/homa_pool.h
create mode 100644 net/homa/sync.txt
diff --git a/net/homa/homa_pool.c b/net/homa/homa_pool.c
new file mode 100644
index 000000000000..d2dfcc6f38bb
--- /dev/null
+++ b/net/homa/homa_pool.c
@@ -0,0 +1,470 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+#include "homa_impl.h"
+#include "homa_pool.h"
+
+/* This file contains functions that manage user-space buffer pools. */
+
+/* Pools must always have at least this many bpages (no particular
+ * reasoning behind this value).
+ */
+#define MIN_POOL_SIZE 2
+
+/* Used when determining how many bpages to consider for allocation. */
+#define MIN_EXTRA 4
+
+/**
+ * set_bpages_needed() - Set the bpages_needed field of @pool based
+ * on the length of the first RPC that's waiting for buffer space.
+ * The caller must own the lock for @pool->hsk.
+ * @pool: Pool to update.
+ */
+static void set_bpages_needed(struct homa_pool *pool)
+{
+ struct homa_rpc *rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+ struct homa_rpc, buf_links);
+ pool->bpages_needed = (rpc->msgin.length + HOMA_BPAGE_SIZE - 1)
+ >> HOMA_BPAGE_SHIFT;
+}
+
+/**
+ * homa_pool_new() - Allocate and initialize a new homa_pool (it will have
+ * no region associated with it until homa_pool_set_region is invoked).
+ * @hsk: Socket the pool will be associated with.
+ * Return: A pointer to the new pool, or a negative errno encoded with
+ * ERR_PTR().
+ */
+struct homa_pool *homa_pool_new(struct homa_sock *hsk)
+{
+ struct homa_pool *pool;
+
+ pool = kzalloc(sizeof(*pool), GFP_ATOMIC);
+ if (!pool)
+ return ERR_PTR(-ENOMEM);
+ pool->hsk = hsk;
+ return pool;
+}
+
+/**
+ * homa_pool_set_region() - Associate a region of memory with a pool.
+ * @pool: Pool the region will be associated with. Must not currently
+ * have a region associated with it.
+ * @region: First byte of the memory region for the pool, allocated
+ * by the application; must be page-aligned.
+ * @region_size: Total number of bytes available at @region.
+ * Return: Either zero (for success) or a negative errno for failure.
+ */
+int homa_pool_set_region(struct homa_pool *pool, void __user *region,
+ u64 region_size)
+{
+ int i, result;
+
+ if (pool->region)
+ return -EINVAL;
+
+ if (((uintptr_t)region) & ~PAGE_MASK)
+ return -EINVAL;
+ pool->region = (char __user *)region;
+ pool->num_bpages = region_size >> HOMA_BPAGE_SHIFT;
+ pool->descriptors = NULL;
+ pool->cores = NULL;
+ if (pool->num_bpages < MIN_POOL_SIZE) {
+ result = -EINVAL;
+ goto error;
+ }
+ pool->descriptors = kmalloc_array(pool->num_bpages,
+ sizeof(struct homa_bpage),
+ GFP_ATOMIC | __GFP_ZERO);
+ if (!pool->descriptors) {
+ result = -ENOMEM;
+ goto error;
+ }
+ for (i = 0; i < pool->num_bpages; i++) {
+ struct homa_bpage *bp = &pool->descriptors[i];
+
+ spin_lock_init(&bp->lock);
+ bp->owner = -1;
+ }
+ atomic_set(&pool->free_bpages, pool->num_bpages);
+ pool->bpages_needed = INT_MAX;
+
+ /* Allocate and initialize core-specific data. */
+ pool->cores = alloc_percpu_gfp(struct homa_pool_core,
+ GFP_ATOMIC | __GFP_ZERO);
+ if (!pool->cores) {
+ result = -ENOMEM;
+ goto error;
+ }
+ pool->num_cores = nr_cpu_ids;
+ pool->check_waiting_invoked = 0;
+
+ return 0;
+
+error:
+ kfree(pool->descriptors);
+ free_percpu(pool->cores);
+ pool->region = NULL;
+ return result;
+}
+
+/**
+ * homa_pool_destroy() - Destructor for homa_pool. After this method
+ * returns, the object should not be used (it will be freed here).
+ * @pool: Pool to destroy.
+ */
+void homa_pool_destroy(struct homa_pool *pool)
+{
+ if (pool->region) {
+ kfree(pool->descriptors);
+ free_percpu(pool->cores);
+ pool->region = NULL;
+ }
+ kfree(pool);
+}
+
+/**
+ * homa_pool_get_rcvbuf() - Return information needed to handle getsockopt
+ * for HOMA_SO_RCVBUF.
+ * @pool: Pool for which information is needed.
+ * @args: Store info here.
+ */
+void homa_pool_get_rcvbuf(struct homa_pool *pool,
+ struct homa_rcvbuf_args *args)
+{
+ args->start = (uintptr_t)pool->region;
+ args->length = pool->num_bpages << HOMA_BPAGE_SHIFT;
+}
+
+/**
+ * homa_bpage_available() - Check whether a bpage is available for use.
+ * @bpage: Bpage to check
+ * @now: Current time (sched_clock() units)
+ * Return: True if the bpage is free or if it can be stolen, otherwise
+ * false.
+ */
+bool homa_bpage_available(struct homa_bpage *bpage, u64 now)
+{
+ int ref_count = atomic_read(&bpage->refs);
+
+ return ref_count == 0 || (ref_count == 1 && bpage->owner >= 0 &&
+ bpage->expiration <= now);
+}
+
+/**
+ * homa_pool_get_pages() - Allocate one or more full pages from the pool.
+ * @pool: Pool from which to allocate pages
+ * @num_pages: Number of pages needed
+ * @pages: The indices of the allocated pages are stored here; caller
+ * must ensure this array is big enough. Reference counts have
+ * been set to 1 on all of these pages (or 2 if set_owner
+ * was specified).
+ * @set_owner: If nonzero, the current core is marked as owner of all
+ * of the allocated pages (and the expiration time is also
+ * set). Otherwise the pages are left unowned.
+ * Return: 0 for success, -1 if there wasn't enough free space in the pool.
+ */
+int homa_pool_get_pages(struct homa_pool *pool, int num_pages, u32 *pages,
+ int set_owner)
+{
+ int core_num = smp_processor_id();
+ struct homa_pool_core *core;
+ u64 now = sched_clock();
+ int alloced = 0;
+ int limit = 0;
+
+ core = this_cpu_ptr(pool->cores);
+ if (atomic_sub_return(num_pages, &pool->free_bpages) < 0) {
+ atomic_add(num_pages, &pool->free_bpages);
+ return -1;
+ }
+
+ /* Once we get to this point we know we will be able to find
+ * enough free pages; now we just have to find them.
+ */
+ while (alloced != num_pages) {
+ struct homa_bpage *bpage;
+ int cur;
+
+ /* If we don't need to use all of the bpages in the pool,
+ * then try to use only the ones with low indexes. This
+ * will reduce the cache footprint for the pool by reusing
+ * a few bpages over and over. Specifically this code will
+ * not consider any candidate page whose index is >= limit.
+ * Limit is chosen to make sure there are a reasonable
+ * number of free pages in the range, so we won't have to
+ * check a huge number of pages.
+ */
+ if (limit == 0) {
+ int extra;
+
+ limit = pool->num_bpages
+ - atomic_read(&pool->free_bpages);
+ extra = limit >> 2;
+ limit += (extra < MIN_EXTRA) ? MIN_EXTRA : extra;
+ if (limit > pool->num_bpages)
+ limit = pool->num_bpages;
+ }
+
+ cur = core->next_candidate;
+ core->next_candidate++;
+ if (cur >= limit) {
+ core->next_candidate = 0;
+
+ /* Must recompute the limit for each new loop through
+ * the bpage array: we may need to consider a larger
+ * range of pages because of concurrent allocations.
+ */
+ limit = 0;
+ continue;
+ }
+ bpage = &pool->descriptors[cur];
+
+ /* Figure out whether this candidate is free (or can be
+ * stolen). Do a quick check without locking the page, and
+ * if the page looks promising, then lock it and check again
+ * (must check again in case someone else snuck in and
+ * grabbed the page).
+ */
+ if (!homa_bpage_available(bpage, now))
+ continue;
+ if (!spin_trylock_bh(&bpage->lock))
+ continue;
+ if (!homa_bpage_available(bpage, now)) {
+ spin_unlock_bh(&bpage->lock);
+ continue;
+ }
+ if (bpage->owner >= 0)
+ atomic_inc(&pool->free_bpages);
+ if (set_owner) {
+ atomic_set(&bpage->refs, 2);
+ bpage->owner = core_num;
+ bpage->expiration = now + 1000 *
+ pool->hsk->homa->bpage_lease_usecs;
+ } else {
+ atomic_set(&bpage->refs, 1);
+ bpage->owner = -1;
+ }
+ spin_unlock_bh(&bpage->lock);
+ pages[alloced] = cur;
+ alloced++;
+ }
+ return 0;
+}
+
+/**
+ * homa_pool_allocate() - Allocate buffer space for an RPC.
+ * @rpc: RPC that needs space allocated for its incoming message (space must
+ * not already have been allocated). The fields @msgin->num_buffers
+ * and @msgin->buffers are filled in. Must be locked by caller.
+ * Return: The return value is normally 0, which means either buffer space
+ * was allocated or the @rpc was queued on @hsk->waiting_for_bufs. If a fatal
+ * error occurred, such as no buffer pool present, then a negative errno is
+ * returned.
+ */
+int homa_pool_allocate(struct homa_rpc *rpc)
+ __must_hold(&rpc->bucket->lock)
+{
+ struct homa_pool *pool = rpc->hsk->buffer_pool;
+ int full_pages, partial, i, core_id;
+ u32 pages[HOMA_MAX_BPAGES];
+ struct homa_pool_core *core;
+ struct homa_bpage *bpage;
+ struct homa_rpc *other;
+
+ if (!pool->region)
+ return -ENOMEM;
+
+ /* First allocate any full bpages that are needed. */
+ full_pages = rpc->msgin.length >> HOMA_BPAGE_SHIFT;
+ if (unlikely(full_pages)) {
+ if (homa_pool_get_pages(pool, full_pages, pages, 0) != 0)
+ goto out_of_space;
+ for (i = 0; i < full_pages; i++)
+ rpc->msgin.bpage_offsets[i] = pages[i] <<
+ HOMA_BPAGE_SHIFT;
+ }
+ rpc->msgin.num_bpages = full_pages;
+
+ /* The last chunk may be less than a full bpage; for this we use
+ * the bpage that we own (and reuse it for multiple messages).
+ */
+ partial = rpc->msgin.length & (HOMA_BPAGE_SIZE - 1);
+ if (unlikely(partial == 0))
+ goto success;
+ core_id = smp_processor_id();
+ core = this_cpu_ptr(pool->cores);
+ bpage = &pool->descriptors[core->page_hint];
+ if (!spin_trylock_bh(&bpage->lock))
+ spin_lock_bh(&bpage->lock);
+ if (bpage->owner != core_id) {
+ spin_unlock_bh(&bpage->lock);
+ goto new_page;
+ }
+ if ((core->allocated + partial) > HOMA_BPAGE_SIZE) {
+ if (atomic_read(&bpage->refs) == 1) {
+ /* Bpage is totally free, so we can reuse it. */
+ core->allocated = 0;
+ } else {
+ bpage->owner = -1;
+
+ /* We know the reference count can't reach zero here
+ * because of the check above, so we won't have to decrement
+ * pool->free_bpages.
+ */
+ atomic_dec_return(&bpage->refs);
+ spin_unlock_bh(&bpage->lock);
+ goto new_page;
+ }
+ }
+ bpage->expiration = sched_clock() +
+ 1000 * pool->hsk->homa->bpage_lease_usecs;
+ atomic_inc(&bpage->refs);
+ spin_unlock_bh(&bpage->lock);
+ goto allocate_partial;
+
+ /* Can't use the current page; get another one. */
+new_page:
+ if (homa_pool_get_pages(pool, 1, pages, 1) != 0) {
+ homa_pool_release_buffers(pool, rpc->msgin.num_bpages,
+ rpc->msgin.bpage_offsets);
+ rpc->msgin.num_bpages = 0;
+ goto out_of_space;
+ }
+ core->page_hint = pages[0];
+ core->allocated = 0;
+
+allocate_partial:
+ rpc->msgin.bpage_offsets[rpc->msgin.num_bpages] = core->allocated
+ + (core->page_hint << HOMA_BPAGE_SHIFT);
+ rpc->msgin.num_bpages++;
+ core->allocated += partial;
+
+success:
+ return 0;
+
+ /* We get here if there wasn't enough buffer space for this
+ * message; add the RPC to hsk->waiting_for_bufs.
+ */
+out_of_space:
+ homa_sock_lock(pool->hsk);
+ list_for_each_entry(other, &pool->hsk->waiting_for_bufs, buf_links) {
+ if (other->msgin.length > rpc->msgin.length) {
+ list_add_tail(&rpc->buf_links, &other->buf_links);
+ goto queued;
+ }
+ }
+ list_add_tail(&rpc->buf_links, &pool->hsk->waiting_for_bufs);
+
+queued:
+ set_bpages_needed(pool);
+ homa_sock_unlock(pool->hsk);
+ return 0;
+}
+
+/**
+ * homa_pool_get_buffer() - Given an RPC, figure out where to store incoming
+ * message data.
+ * @rpc: RPC for which incoming message data is being processed; its
+ * msgin must be properly initialized and buffer space must have
+ * been allocated for the message.
+ * @offset: Offset within @rpc's incoming message.
+ * @available: Will be filled in with the number of bytes of space available
+ * at the returned address (could be zero if offset is
+ * (erroneously) past the end of the message).
+ * Return: The application's virtual address for buffer space corresponding
+ * to @offset in the incoming message for @rpc, or NULL if @offset
+ * is past the end of the message.
+ */
+void __user *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
+ int *available)
+{
+ int bpage_index, bpage_offset;
+
+ bpage_index = offset >> HOMA_BPAGE_SHIFT;
+ if (offset >= rpc->msgin.length) {
+ WARN_ONCE(true, "%s got offset %d >= message length %d\n",
+ __func__, offset, rpc->msgin.length);
+ *available = 0;
+ return NULL;
+ }
+ bpage_offset = offset & (HOMA_BPAGE_SIZE - 1);
+ *available = (bpage_index < (rpc->msgin.num_bpages - 1))
+ ? HOMA_BPAGE_SIZE - bpage_offset
+ : rpc->msgin.length - offset;
+ return rpc->hsk->buffer_pool->region +
+ rpc->msgin.bpage_offsets[bpage_index] + bpage_offset;
+}
+
+/**
+ * homa_pool_release_buffers() - Release buffer space so that it can be
+ * reused.
+ * @pool: Pool that the buffer space belongs to. Doesn't need to
+ * be locked.
+ * @num_buffers: How many buffers to release.
+ * @buffers: Points to @num_buffers values, each of which is an offset
+ * from the start of the pool to the buffer to be released.
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_pool_release_buffers(struct homa_pool *pool, int num_buffers,
+ u32 *buffers)
+{
+ int result = 0;
+ int i;
+
+ if (!pool->region)
+ return result;
+ for (i = 0; i < num_buffers; i++) {
+ u32 bpage_index = buffers[i] >> HOMA_BPAGE_SHIFT;
+
+ if (bpage_index < pool->num_bpages) {
+ struct homa_bpage *bpage = &pool->descriptors[bpage_index];
+
+ if (atomic_dec_return(&bpage->refs) == 0)
+ atomic_inc(&pool->free_bpages);
+ } else {
+ result = -EINVAL;
+ }
+ }
+ return result;
+}
+
+/**
+ * homa_pool_check_waiting() - Checks to see if there are enough free
+ * bpages to wake up any RPCs that were blocked. Whenever
+ * homa_pool_release_buffers is invoked, this function must be invoked later,
+ * at a point when the caller holds no locks (homa_pool_release_buffers may
+ * be invoked with locks held, so it can't safely invoke this function).
+ * This is regrettably tricky, but I can't think of a better solution.
+ * @pool: Information about the buffer pool.
+ */
+void homa_pool_check_waiting(struct homa_pool *pool)
+{
+ if (!pool->region)
+ return;
+ while (atomic_read(&pool->free_bpages) >= pool->bpages_needed) {
+ struct homa_rpc *rpc;
+
+ homa_sock_lock(pool->hsk);
+ if (list_empty(&pool->hsk->waiting_for_bufs)) {
+ pool->bpages_needed = INT_MAX;
+ homa_sock_unlock(pool->hsk);
+ break;
+ }
+ rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+ struct homa_rpc, buf_links);
+ if (!homa_rpc_try_lock(rpc)) {
+ /* Can't just spin on the RPC lock because we're
+ * holding the socket lock (see sync.txt). Instead,
+ * release the socket lock and try the entire
+ * operation again.
+ */
+ homa_sock_unlock(pool->hsk);
+ continue;
+ }
+ list_del_init(&rpc->buf_links);
+ if (list_empty(&pool->hsk->waiting_for_bufs))
+ pool->bpages_needed = INT_MAX;
+ else
+ set_bpages_needed(pool);
+ homa_sock_unlock(pool->hsk);
+ homa_pool_allocate(rpc);
+ homa_rpc_unlock(rpc);
+ }
+}
diff --git a/net/homa/homa_pool.h b/net/homa/homa_pool.h
new file mode 100644
index 000000000000..d52d61afa557
--- /dev/null
+++ b/net/homa/homa_pool.h
@@ -0,0 +1,149 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions used to manage user-space buffer pools.
+ */
+
+#ifndef _HOMA_POOL_H
+#define _HOMA_POOL_H
+
+#include <linux/percpu.h>
+
+#include "homa_rpc.h"
+
+/**
+ * struct homa_bpage - Contains information about a single page in
+ * a buffer pool.
+ */
+struct homa_bpage {
+ union {
+ /**
+ * @cache_line: Ensures that each homa_bpage object
+ * is exactly one cache line long.
+ */
+ char cache_line[L1_CACHE_BYTES];
+ struct {
+ /** @lock: to synchronize shared access. */
+ spinlock_t lock;
+
+ /**
+ * @refs: Counts number of distinct uses of this
+ * bpage (1 tick for each message that is using
+ * this page, plus an additional tick if the @owner
+ * field is set).
+ */
+ atomic_t refs;
+
+ /**
+ * @owner: kernel core that currently owns this page
+ * (< 0 if none).
+ */
+ int owner;
+
+ /**
+ * @expiration: time (in sched_clock() units) after
+ * which it's OK to steal this page from its current
+ * owner (if @refs is 1).
+ */
+ u64 expiration;
+ };
+ };
+};
+
+/**
+ * struct homa_pool_core - Holds core-specific data for a homa_pool (a bpage
+ * out of which that core is allocating small chunks).
+ */
+struct homa_pool_core {
+ /**
+ * @page_hint: Index of bpage in pool->descriptors,
+ * which may be owned by this core. If so, we'll use it
+ * for allocating partial pages.
+ */
+ int page_hint;
+
+ /**
+ * @allocated: if the page given by @page_hint is
+ * owned by this core, this variable gives the number of
+ * (initial) bytes that have already been allocated
+ * from the page.
+ */
+ int allocated;
+
+ /**
+ * @next_candidate: when searching for free bpages,
+ * check this index next.
+ */
+ int next_candidate;
+};
+
+/**
+ * struct homa_pool - Describes a pool of buffer space for incoming
+ * messages for a particular socket; managed by homa_pool.c. The pool is
+ * divided up into "bpages", which are a multiple of the hardware page size.
+ * A bpage may be owned by a particular core so that it can more efficiently
+ * allocate space for small messages.
+ */
+struct homa_pool {
+ /**
+ * @hsk: the socket that this pool belongs to.
+ */
+ struct homa_sock *hsk;
+
+ /**
+ * @region: beginning of the pool's region (in the app's virtual
+ * memory). Divided into bpages. 0 means the pool hasn't yet been
+ * initialized.
+ */
+ char __user *region;
+
+ /** @num_bpages: total number of bpages in the pool. */
+ int num_bpages;
+
+ /** @descriptors: kmalloced area containing one entry for each bpage. */
+ struct homa_bpage *descriptors;
+
+ /**
+ * @free_bpages: the number of pages still available for allocation
+ * by homa_pool_get_pages. This equals the number of pages with zero
+ * reference counts, minus the number of pages that have been claimed
+ * by homa_pool_get_pages but not yet allocated.
+ */
+ atomic_t free_bpages;
+
+ /**
+ * @bpages_needed: the number of free bpages required to satisfy the
+ * needs of the first RPC on @hsk->waiting_for_bufs, or INT_MAX if
+ * that queue is empty.
+ */
+ int bpages_needed;
+
+ /** @cores: core-specific info; dynamically allocated. */
+ struct homa_pool_core __percpu *cores;
+
+ /** @num_cores: number of elements in @cores. */
+ int num_cores;
+
+ /**
+ * @check_waiting_invoked: incremented during unit tests when
+ * homa_pool_check_waiting is invoked.
+ */
+ int check_waiting_invoked;
+};
+
+bool homa_bpage_available(struct homa_bpage *bpage, u64 now);
+int homa_pool_allocate(struct homa_rpc *rpc);
+void homa_pool_check_waiting(struct homa_pool *pool);
+void homa_pool_destroy(struct homa_pool *pool);
+void __user *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
+ int *available);
+int homa_pool_get_pages(struct homa_pool *pool, int num_pages,
+ u32 *pages, int leave_locked);
+void homa_pool_get_rcvbuf(struct homa_pool *pool,
+ struct homa_rcvbuf_args *args);
+struct homa_pool *homa_pool_new(struct homa_sock *hsk);
+int homa_pool_release_buffers(struct homa_pool *pool,
+ int num_buffers, u32 *buffers);
+int homa_pool_set_region(struct homa_pool *pool, void __user *region,
+ u64 region_size);
+
+#endif /* _HOMA_POOL_H */
diff --git a/net/homa/sync.txt b/net/homa/sync.txt
new file mode 100644
index 000000000000..eb3c6ffb19ee
--- /dev/null
+++ b/net/homa/sync.txt
@@ -0,0 +1,77 @@
+This file describes the synchronization strategy used for Homa.
+
+* In the Linux TCP/IP stack, the primary locking mechanism is a lock
+ per socket. However, per-socket locks aren't adequate for Homa, because
+ sockets are "larger" in Homa. In TCP, a socket corresponds to a single
+ connection between the source and destination; an application can have
+ hundreds or thousands of sockets open at once, so per-socket locks leave
+ lots of opportunities for concurrency. With Homa, a single socket can be
+ used for communicating with any number of peers, so there will typically
+ be no more than one socket per thread. As a result, a single Homa socket
+ must support many concurrent RPCs efficiently, and a per-socket lock would
+ create a bottleneck (Homa tried this approach initially).
+
+* Thus, the primary lock used in Homa is a per-RPC spinlock. This allows
+ operations on different RPCs to proceed concurrently. RPC locks are actually
+ stored in the hash table buckets used to look them up. This is important
+ because it makes looking up an RPC and locking it atomic; without this,
+ an RPC could be deleted after it was looked up but before it was locked.
+
+* Certain operations are not permitted while holding spinlocks, such as memory
+ allocation and copying data to/from user space (spinlocks disable
+ interrupts, so the holder must not block). RPC locks are spinlocks,
+ and that results in awkward code in several places to move prohibited
+ operations outside the locked regions. In particular, there is extra
+ complexity to make sure that RPCs are not garbage-collected while these
+ operations are occurring without a lock.
+
+* There are several other locks in Homa besides RPC locks. When multiple
+ locks are held, they must always be acquired in a consistent order to
+ prevent deadlock. For each lock, here are the other locks that may be
+ acquired while holding the given lock:
+ * RPC: socket, grant, throttle, peer->ack_lock
+ * Socket: port_map.write_lock
+ Any lock not listed above must be a "leaf" lock: no other lock will be
+ acquired while holding the lock.
+
+* Homa's approach means that socket shutdown and deletion can potentially
+ occur while operations are underway that hold RPC locks but not the socket
+ lock. This creates several potential problems:
+ * A socket might be deleted and its memory reclaimed while an RPC still
+ has access to it. Homa assumes that Linux will prevent socket deletion
+ while the kernel call is executing. In situations outside kernel call
+ handling, Homa uses rcu_read_lock and/or socket references to prevent
+ socket deletion.
+ * A socket might be shut down while there are active operations on
+ RPCs. For example, creation of a new RPC might be underway when a socket
+ is shut down, which could add the new RPC to the socket after all of its
+ RPCs have supposedly been deleted. Handling this requires careful ordering
+ of operations during shutdown, plus the rest of Homa must be careful
+ never to add new RPCs to a socket that has been shut down.
+
+* There are a few places where Homa needs to process RPCs on lists
+ associated with a socket, such as the timer. Such code must first lock
+ the socket (to synchronize access to the link pointers) then lock
+ individual RPCs on the list. However, this violates the rules for locking
+ order. It isn't safe to unlock the socket before locking the RPC, because
+ the RPC could be deleted and its memory recycled between the unlock of the
+ socket lock and the lock of the RPC; this could result in corruption. Homa
+ uses a few different ways to handle this situation:
+ * Use homa_protect_rpcs to prevent RPC reaping for a socket. RPCs can still
+ be deleted, but their memory won't go away until homa_unprotect_rpcs is
+ invoked. This allows the socket lock to be released before acquiring
+ the RPC lock; after acquiring the RPC lock check to see if it has been
+ deleted; if so, skip it. Note: the Linux RCU mechanism could have been
+ used to achieve the same effect, but it results in *very* long delays
+ before final reclamation (tens of ms), even without contention, which
+ means that a large number of dead RPCs could accumulate.
+ * Use spin_trylock_bh to acquire the RPC lock, while still holding the
+ socket lock. If this fails, then release the socket lock, then retry
+ both the socket lock and the RPC lock.
+
+* There are also a few places where Homa is doing something related to an
+ RPC (such as copying message data to user space) and needs the RPC to stay
+ around, but it isn't holding the RPC lock. In these situations, Homa sets
+ a bit in rpc->flags; homa_rpc_reap will not reap RPCs with any of these
+ flags set.
\ No newline at end of file
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (3 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 04/15] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-05 11:06 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 06/15] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
` (10 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
Homa needs to keep a small amount of information for each peer that
it has communicated with. These files define that state and provide
functions for storing and accessing it.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v7:
* Remove homa_peertab_get_peers
* Remove "lock_slow" functions, which don't add functionality in this
patch
* Remove unused fields from homa_peer structs
* Use u64 and __u64 properly
* Add lock annotations
* Refactor homa_peertab_get_peers
* Use __GFP_ZERO in kmalloc calls
---
net/homa/homa_peer.c | 308 +++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_peer.h | 211 +++++++++++++++++++++++++++++
2 files changed, 519 insertions(+)
create mode 100644 net/homa/homa_peer.c
create mode 100644 net/homa/homa_peer.h
diff --git a/net/homa/homa_peer.c b/net/homa/homa_peer.c
new file mode 100644
index 000000000000..72742cecb3dd
--- /dev/null
+++ b/net/homa/homa_peer.c
@@ -0,0 +1,308 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file provides functions related to homa_peer and homa_peertab
+ * objects.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+
+/**
+ * homa_peertab_init() - Constructor for homa_peertabs.
+ * @peertab: The object to initialize; previous contents are discarded.
+ *
+ * Return: 0 in the normal case, or a negative errno if there was a problem.
+ */
+int homa_peertab_init(struct homa_peertab *peertab)
+{
+ /* Note: when we return, the object must be initialized so it's
+ * safe to call homa_peertab_destroy, even if this function returns
+ * an error.
+ */
+ int i;
+
+ spin_lock_init(&peertab->write_lock);
+ INIT_LIST_HEAD(&peertab->dead_dsts);
+ peertab->buckets = vmalloc(HOMA_PEERTAB_BUCKETS *
+ sizeof(*peertab->buckets));
+ if (!peertab->buckets)
+ return -ENOMEM;
+ for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++)
+ INIT_HLIST_HEAD(&peertab->buckets[i]);
+ return 0;
+}
+
+/**
+ * homa_peertab_destroy() - Destructor for homa_peertabs. After this
+ * function returns, it is unsafe to use any results from previous calls
+ * to homa_peer_find, since all existing homa_peer objects will have been
+ * destroyed.
+ * @peertab: The table to destroy.
+ */
+void homa_peertab_destroy(struct homa_peertab *peertab)
+{
+ struct hlist_node *next;
+ struct homa_peer *peer;
+ int i;
+
+ if (!peertab->buckets)
+ return;
+
+ spin_lock_bh(&peertab->write_lock);
+ for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
+ hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
+ peertab_links) {
+ dst_release(peer->dst);
+ kfree(peer);
+ }
+ }
+ vfree(peertab->buckets);
+ homa_peertab_gc_dsts(peertab, ~0);
+ spin_unlock_bh(&peertab->write_lock);
+}
+
+/**
+ * homa_peertab_gc_dsts() - Invoked to free unused dst_entries, if it is
+ * safe to do so.
+ * @peertab: The table in which to free entries.
+ * @now: Current time, in sched_clock() units; entries with expiration
+ * dates no later than this will be freed. Specify ~0 to
+ * free all entries.
+ */
+void homa_peertab_gc_dsts(struct homa_peertab *peertab, u64 now)
+ __must_hold(&peertab->write_lock)
+{
+ while (!list_empty(&peertab->dead_dsts)) {
+ struct homa_dead_dst *dead =
+ list_first_entry(&peertab->dead_dsts,
+ struct homa_dead_dst, dst_links);
+ if (dead->gc_time > now)
+ break;
+ dst_release(dead->dst);
+ list_del(&dead->dst_links);
+ kfree(dead);
+ }
+}
+
+/**
+ * homa_peer_find() - Returns the peer associated with a given host; creates
+ * a new homa_peer if one doesn't already exist.
+ * @peertab: Peer table in which to perform lookup.
+ * @addr: Address of the desired host: IPv4 addresses are represented
+ * as IPv4-mapped IPv6 addresses.
+ * @inet: Socket that will be used for sending packets.
+ *
+ * Return: The peer associated with @addr, or a negative errno if an
+ * error occurred. The caller can retain this pointer
+ * indefinitely: peer entries are never deleted except in
+ * homa_peertab_destroy.
+ */
+struct homa_peer *homa_peer_find(struct homa_peertab *peertab,
+ const struct in6_addr *addr,
+ struct inet_sock *inet)
+{
+ struct homa_peer *peer;
+ struct dst_entry *dst;
+
+ u32 bucket = hash_32((__force u32)addr->in6_u.u6_addr32[0],
+ HOMA_PEERTAB_BUCKET_BITS);
+
+ bucket ^= hash_32((__force u32)addr->in6_u.u6_addr32[1],
+ HOMA_PEERTAB_BUCKET_BITS);
+ bucket ^= hash_32((__force u32)addr->in6_u.u6_addr32[2],
+ HOMA_PEERTAB_BUCKET_BITS);
+ bucket ^= hash_32((__force u32)addr->in6_u.u6_addr32[3],
+ HOMA_PEERTAB_BUCKET_BITS);
+
+ /* Use RCU operators to ensure safety even if a concurrent call is
+ * adding a new entry. The calls to rcu_read_lock and rcu_read_unlock
+ * shouldn't actually be needed, since we don't need to protect
+ * against concurrent deletion.
+ */
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
+ peertab_links) {
+ if (ipv6_addr_equal(&peer->addr, addr)) {
+ rcu_read_unlock();
+ return peer;
+ }
+ }
+ rcu_read_unlock();
+
+ /* No existing entry; create a new one.
+ *
+ * Note: after we acquire the lock, we have to check again to
+ * make sure the entry still doesn't exist (it might have been
+ * created by a concurrent invocation of this function).
+ */
+ spin_lock_bh(&peertab->write_lock);
+ hlist_for_each_entry(peer, &peertab->buckets[bucket],
+ peertab_links) {
+ if (ipv6_addr_equal(&peer->addr, addr))
+ goto done;
+ }
+ peer = kmalloc(sizeof(*peer), GFP_ATOMIC | __GFP_ZERO);
+ if (!peer) {
+ peer = ERR_PTR(-ENOMEM);
+ goto done;
+ }
+ peer->addr = *addr;
+ dst = homa_peer_get_dst(peer, inet);
+ if (IS_ERR(dst)) {
+ kfree(peer);
+ peer = ERR_CAST(dst);
+ goto done;
+ }
+ peer->dst = dst;
+ hlist_add_head_rcu(&peer->peertab_links, &peertab->buckets[bucket]);
+ peer->current_ticks = -1;
+ spin_lock_init(&peer->ack_lock);
+
+done:
+ spin_unlock_bh(&peertab->write_lock);
+ return peer;
+}
+
+/**
+ * homa_dst_refresh() - This method is called when the dst for a peer is
+ * obsolete; it releases that dst and creates a new one.
+ * @peertab: Table containing the peer.
+ * @peer: Peer whose dst is obsolete.
+ * @hsk: Socket that will be used to transmit data to the peer.
+ */
+void homa_dst_refresh(struct homa_peertab *peertab, struct homa_peer *peer,
+ struct homa_sock *hsk)
+{
+ struct homa_dead_dst *save_dead;
+ struct dst_entry *dst;
+ u64 now;
+
+ /* Need to keep around the current entry for a while in case
+ * someone is using it. If we can't do that, then don't update
+ * the entry.
+ */
+ save_dead = kmalloc(sizeof(*save_dead), GFP_ATOMIC);
+ if (unlikely(!save_dead))
+ return;
+
+ dst = homa_peer_get_dst(peer, &hsk->inet);
+ if (IS_ERR(dst)) {
+ kfree(save_dead);
+ return;
+ }
+
+ spin_lock_bh(&peertab->write_lock);
+ now = sched_clock();
+ save_dead->dst = peer->dst;
+ save_dead->gc_time = now + 100000000; /* 100 ms */
+ list_add_tail(&save_dead->dst_links, &peertab->dead_dsts);
+ homa_peertab_gc_dsts(peertab, now);
+ peer->dst = dst;
+ spin_unlock_bh(&peertab->write_lock);
+}
+
+/**
+ * homa_peer_get_dst() - Find an appropriate dst structure (either IPv4
+ * or IPv6) for a peer.
+ * @peer: The peer for which a dst is needed. Note: this peer's flow
+ * struct will be overwritten.
+ * @inet: Socket that will be used for sending packets.
+ * Return: The dst structure (or an ERR_PTR).
+ */
+struct dst_entry *homa_peer_get_dst(struct homa_peer *peer,
+ struct inet_sock *inet)
+{
+ memset(&peer->flow, 0, sizeof(peer->flow));
+ if (inet->sk.sk_family == AF_INET) {
+ struct rtable *rt;
+
+ flowi4_init_output(&peer->flow.u.ip4, inet->sk.sk_bound_dev_if,
+ inet->sk.sk_mark, inet->tos,
+ RT_SCOPE_UNIVERSE, inet->sk.sk_protocol, 0,
+ peer->addr.in6_u.u6_addr32[3],
+ inet->inet_saddr, 0, 0, inet->sk.sk_uid);
+ security_sk_classify_flow(&inet->sk, &peer->flow.u.__fl_common);
+ rt = ip_route_output_flow(sock_net(&inet->sk),
+ &peer->flow.u.ip4, &inet->sk);
+ if (IS_ERR(rt))
+ return ERR_CAST(rt);
+ return &rt->dst;
+ }
+ peer->flow.u.ip6.flowi6_oif = inet->sk.sk_bound_dev_if;
+ peer->flow.u.ip6.flowi6_iif = LOOPBACK_IFINDEX;
+ peer->flow.u.ip6.flowi6_mark = inet->sk.sk_mark;
+ peer->flow.u.ip6.flowi6_scope = RT_SCOPE_UNIVERSE;
+ peer->flow.u.ip6.flowi6_proto = inet->sk.sk_protocol;
+ peer->flow.u.ip6.flowi6_flags = 0;
+ peer->flow.u.ip6.flowi6_secid = 0;
+ peer->flow.u.ip6.flowi6_tun_key.tun_id = 0;
+ peer->flow.u.ip6.flowi6_uid = inet->sk.sk_uid;
+ peer->flow.u.ip6.daddr = peer->addr;
+ peer->flow.u.ip6.saddr = inet->pinet6->saddr;
+ peer->flow.u.ip6.fl6_dport = 0;
+ peer->flow.u.ip6.fl6_sport = 0;
+ peer->flow.u.ip6.mp_hash = 0;
+ peer->flow.u.ip6.__fl_common.flowic_tos = inet->tos;
+ peer->flow.u.ip6.flowlabel = ip6_make_flowinfo(inet->tos, 0);
+ security_sk_classify_flow(&inet->sk, &peer->flow.u.__fl_common);
+ return ip6_dst_lookup_flow(sock_net(&inet->sk), &inet->sk,
+ &peer->flow.u.ip6, NULL);
+}
+
+/**
+ * homa_peer_add_ack() - Add a given RPC to the list of unacked
+ * RPCs for its server. Once this method has been invoked, it's safe
+ * to delete the RPC, since it will eventually be acked to the server.
+ * @rpc: Client RPC that has now completed.
+ */
+void homa_peer_add_ack(struct homa_rpc *rpc)
+{
+ struct homa_peer *peer = rpc->peer;
+ struct homa_ack_hdr ack;
+
+ homa_peer_lock(peer);
+ if (peer->num_acks < HOMA_MAX_ACKS_PER_PKT) {
+ peer->acks[peer->num_acks].client_id = cpu_to_be64(rpc->id);
+ peer->acks[peer->num_acks].server_port = htons(rpc->dport);
+ peer->num_acks++;
+ homa_peer_unlock(peer);
+ return;
+ }
+
+ /* The peer's ack array has filled up; send an ACK message to
+ * empty it. The RPC in the message header will also be considered
+ * ACKed.
+ */
+ memcpy(ack.acks, peer->acks, sizeof(peer->acks));
+ ack.num_acks = htons(peer->num_acks);
+ peer->num_acks = 0;
+ homa_peer_unlock(peer);
+ homa_xmit_control(ACK, &ack, sizeof(ack), rpc);
+}
+
+/**
+ * homa_peer_get_acks() - Copy acks out of a peer, and remove them from the
+ * peer.
+ * @peer: Peer to check for possible unacked RPCs.
+ * @count: Maximum number of acks to return.
+ * @dst: The acks are copied to this location.
+ *
+ * Return: The number of acks extracted from the peer (<= count).
+ */
+int homa_peer_get_acks(struct homa_peer *peer, int count, struct homa_ack *dst)
+{
+ /* Don't waste time acquiring the lock if there are no acks available. */
+ if (peer->num_acks == 0)
+ return 0;
+
+ homa_peer_lock(peer);
+
+ if (count > peer->num_acks)
+ count = peer->num_acks;
+ memcpy(dst, &peer->acks[peer->num_acks - count],
+ count * sizeof(peer->acks[0]));
+ peer->num_acks -= count;
+
+ homa_peer_unlock(peer);
+ return count;
+}
diff --git a/net/homa/homa_peer.h b/net/homa/homa_peer.h
new file mode 100644
index 000000000000..7a34c5c3e31a
--- /dev/null
+++ b/net/homa/homa_peer.h
@@ -0,0 +1,211 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions related to managing peers (homa_peer
+ * and homa_peertab).
+ */
+
+#ifndef _HOMA_PEER_H
+#define _HOMA_PEER_H
+
+#include "homa_wire.h"
+#include "homa_sock.h"
+
+struct homa_rpc;
+
+/**
+ * struct homa_dead_dst - Used to retain dst_entries that are no longer
+ * needed, until it is safe to delete them (I'm not confident that the RCU
+ * mechanism will be safe for these: the reference count could get incremented
+ * after it's on the RCU list?).
+ */
+struct homa_dead_dst {
+ /** @dst: Entry that is no longer used by a struct homa_peer. */
+ struct dst_entry *dst;
+
+ /**
+ * @gc_time: Time (in units of sched_clock()) when it is safe
+ * to free @dst.
+ */
+ u64 gc_time;
+
+ /** @dst_links: Used to link together entries in peertab->dead_dsts. */
+ struct list_head dst_links;
+};
+
+/**
+ * define HOMA_PEERTAB_BUCKET_BITS - Number of bits in the bucket index for a
+ * homa_peertab. Should be large enough to hold an entry for every server
+ * in a datacenter without long hash chains.
+ */
+#define HOMA_PEERTAB_BUCKET_BITS 16
+
+/** define HOMA_PEERTAB_BUCKETS - Number of buckets in a homa_peertab. */
+#define HOMA_PEERTAB_BUCKETS BIT(HOMA_PEERTAB_BUCKET_BITS)
+
+/**
+ * struct homa_peertab - A hash table that maps from IPv6 addresses
+ * to homa_peer objects. IPv4 entries are encapsulated as IPv6 addresses.
+ * Entries are gradually added to this table, but they are never removed
+ * except when the entire table is deleted. We can't safely delete because
+ * results returned by homa_peer_find may be retained indefinitely.
+ *
+ * This table is managed exclusively by homa_peertab.c, using RCU to
+ * permit efficient lookups.
+ */
+struct homa_peertab {
+ /**
+ * @write_lock: Synchronizes addition of new entries; not needed
+ * for lookups (RCU is used instead).
+ */
+ spinlock_t write_lock;
+
+ /**
+ * @dead_dsts: List of dst_entries that are waiting to be deleted.
+ * Hold @write_lock when manipulating.
+ */
+ struct list_head dead_dsts;
+
+ /**
+ * @buckets: Pointer to heads of chains of homa_peers for each bucket.
+ * Malloc-ed, and must eventually be freed. NULL means this structure
+ * has not been initialized.
+ */
+ struct hlist_head *buckets;
+};
+
+/**
+ * struct homa_peer - One of these objects exists for each machine that we
+ * have communicated with (either as client or server).
+ */
+struct homa_peer {
+ /**
+ * @addr: IPv6 address for the machine (IPv4 addresses are stored
+ * as IPv4-mapped IPv6 addresses).
+ */
+ struct in6_addr addr;
+
+ /** @flow: Addressing info needed to send packets. */
+ struct flowi flow;
+
+ /**
+ * @dst: Used to route packets to this peer; we own a reference
+ * to this, which we must eventually release.
+ */
+ struct dst_entry *dst;
+
+ /**
+ * @peertab_links: Links this object into a bucket of its
+ * homa_peertab.
+ */
+ struct hlist_node peertab_links;
+
+ /**
+ * @outstanding_resends: the number of resend requests we have
+ * sent to this server (spaced @homa.resend_interval apart) since
+ * we received a packet from this peer.
+ */
+ int outstanding_resends;
+
+ /**
+ * @most_recent_resend: @homa->timer_ticks when the most recent
+ * resend was sent to this peer.
+ */
+ int most_recent_resend;
+
+ /**
+ * @least_recent_rpc: of all the RPCs for this peer scanned at
+ * @current_ticks, this is the RPC whose @resend_timer_ticks
+ * is farthest in the past.
+ */
+ struct homa_rpc *least_recent_rpc;
+
+ /**
+ * @least_recent_ticks: the @resend_timer_ticks value for
+ * @least_recent_rpc.
+ */
+ u32 least_recent_ticks;
+
+ /**
+ * @current_ticks: the value of @homa->timer_ticks the last time
+ * that @least_recent_rpc and @least_recent_ticks were computed.
+ * Used to detect the start of a new homa_timer pass.
+ */
+ u32 current_ticks;
+
+ /**
+ * @resend_rpc: the value of @least_recent_rpc computed in the
+ * previous homa_timer pass. This RPC will be issued a RESEND
+ * in the current pass, if it still needs one.
+ */
+ struct homa_rpc *resend_rpc;
+
+ /**
+ * @num_acks: the number of (initial) entries in @acks that
+ * currently hold valid information.
+ */
+ int num_acks;
+
+ /**
+ * @acks: info about client RPCs whose results have been completely
+ * received.
+ */
+ struct homa_ack acks[HOMA_MAX_ACKS_PER_PKT];
+
+ /**
+ * @ack_lock: used to synchronize access to @num_acks and @acks.
+ */
+ spinlock_t ack_lock;
+};
+
+void homa_dst_refresh(struct homa_peertab *peertab,
+ struct homa_peer *peer, struct homa_sock *hsk);
+void homa_peertab_destroy(struct homa_peertab *peertab);
+int homa_peertab_init(struct homa_peertab *peertab);
+void homa_peer_add_ack(struct homa_rpc *rpc);
+struct homa_peer
+ *homa_peer_find(struct homa_peertab *peertab,
+ const struct in6_addr *addr,
+ struct inet_sock *inet);
+int homa_peer_get_acks(struct homa_peer *peer, int count,
+ struct homa_ack *dst);
+struct dst_entry
+ *homa_peer_get_dst(struct homa_peer *peer,
+ struct inet_sock *inet);
+void homa_peertab_gc_dsts(struct homa_peertab *peertab, u64 now);
+
+/**
+ * homa_peer_lock() - Acquire a peer's @ack_lock.
+ * @peer: Peer to lock.
+ */
+static inline void homa_peer_lock(struct homa_peer *peer)
+ __acquires(&peer->ack_lock)
+{
+ spin_lock_bh(&peer->ack_lock);
+}
+
+/**
+ * homa_peer_unlock() - Release a peer's @ack_lock.
+ * @peer: Peer to unlock.
+ */
+static inline void homa_peer_unlock(struct homa_peer *peer)
+ __releases(&peer->ack_lock)
+{
+ spin_unlock_bh(&peer->ack_lock);
+}
+
+/**
+ * homa_get_dst() - Returns destination information associated with a peer,
+ * updating it if the cached information is stale.
+ * @peer: Peer whose destination information is desired.
+ * @hsk: Homa socket; needed by lower-level code to recreate the dst.
+ * Return: Up-to-date destination for peer.
+ */
+static inline struct dst_entry *homa_get_dst(struct homa_peer *peer,
+ struct homa_sock *hsk)
+{
+ if (unlikely(peer->dst->obsolete > 0))
+ homa_dst_refresh(hsk->homa->peers, peer, hsk);
+ return peer->dst;
+}
+
+#endif /* _HOMA_PEER_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH net-next v8 06/15] net: homa: create homa_sock.h and homa_sock.c
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (4 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-05 16:46 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 07/15] net: homa: create homa_interest.h and homa_interest John Ousterhout
` (9 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
These files provide functions for managing the state that Homa keeps
for each open Homa socket.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v8:
* Update for new homa_pool APIs
Changes for v7:
* Refactor homa_sock_start_scan etc. (take a reference on the socket, so
homa_socktab::active_scans and struct homa_socktab_links are no longer
needed; encapsulate RCU usage entirely in homa_sock.c).
* Add functions for tx memory accounting
* Refactor waiting mechanism for incoming messages
* Add hsk->is_server, setsockopt SO_HOMA_SERVER
* Remove "lock_slow" functions, which don't add functionality in this
patch series
* Remove locker argument from locking functions
* Use u64 and __u64 properly
* Take a reference to the socket in homa_sock_find
---
net/homa/homa_sock.c | 393 +++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_sock.h | 386 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 779 insertions(+)
create mode 100644 net/homa/homa_sock.c
create mode 100644 net/homa/homa_sock.h
diff --git a/net/homa/homa_sock.c b/net/homa/homa_sock.c
new file mode 100644
index 000000000000..3b3f99dfa7e9
--- /dev/null
+++ b/net/homa/homa_sock.c
@@ -0,0 +1,393 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file manages homa_sock and homa_socktab objects. */
+
+#include "homa_impl.h"
+#include "homa_interest.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/**
+ * homa_socktab_init() - Constructor for homa_socktabs.
+ * @socktab: The object to initialize; previous contents are discarded.
+ */
+void homa_socktab_init(struct homa_socktab *socktab)
+{
+ int i;
+
+ spin_lock_init(&socktab->write_lock);
+ for (i = 0; i < HOMA_SOCKTAB_BUCKETS; i++)
+ INIT_HLIST_HEAD(&socktab->buckets[i]);
+}
+
+/**
+ * homa_socktab_destroy() - Destructor for homa_socktabs.
+ * @socktab: The object to destroy.
+ */
+void homa_socktab_destroy(struct homa_socktab *socktab)
+{
+ struct homa_socktab_scan scan;
+ struct homa_sock *hsk;
+
+ for (hsk = homa_socktab_start_scan(socktab, &scan); hsk;
+ hsk = homa_socktab_next(&scan)) {
+ homa_sock_destroy(hsk);
+ }
+ homa_socktab_end_scan(&scan);
+}
+
+/**
+ * homa_socktab_start_scan() - Begin an iteration over all of the sockets
+ * in a socktab.
+ * @socktab: Socktab to scan.
+ * @scan: Will hold the current state of the scan; any existing
+ * contents are discarded. The caller must eventually pass this
+ * to homa_socktab_end_scan.
+ *
+ * Return: The first socket in the table, or NULL if the table is
+ * empty. If non-NULL, a reference is held on the socket to
+ * prevent its deletion.
+ *
+ * Each call to homa_socktab_next will return the next socket in the table.
+ * All sockets that are present in the table at the time this function is
+ * invoked will eventually be returned, as long as they are not removed
+ * from the table. It is safe to remove sockets from the table while the
+ * scan is in progress. If a socket is removed from the table during the scan,
+ * it may or may not be returned by homa_socktab_next. New entries added
+ * during the scan may or may not be returned.
+ */
+struct homa_sock *homa_socktab_start_scan(struct homa_socktab *socktab,
+ struct homa_socktab_scan *scan)
+{
+ scan->socktab = socktab;
+ scan->hsk = NULL;
+ scan->current_bucket = -1;
+
+ return homa_socktab_next(scan);
+}
+
+/**
+ * homa_socktab_next() - Return the next socket in an iteration over a socktab.
+ * @scan: State of the scan.
+ *
+ * Return: The next socket in the table, or NULL if the iteration has
+ * returned all of the sockets in the table. If non-NULL, a
+ * reference is held on the socket to prevent its deletion.
+ * Sockets are not returned in any particular order. It's
+ * possible that the returned socket has been destroyed.
+ */
+struct homa_sock *homa_socktab_next(struct homa_socktab_scan *scan)
+{
+ struct hlist_head *bucket;
+ struct hlist_node *next;
+
+ rcu_read_lock();
+ if (scan->hsk) {
+ sock_put(&scan->hsk->sock);
+ next = rcu_dereference(hlist_next_rcu(&scan->hsk->socktab_links));
+ if (next)
+ goto success;
+ }
+ while (scan->current_bucket < HOMA_SOCKTAB_BUCKETS - 1) {
+ scan->current_bucket++;
+ bucket = &scan->socktab->buckets[scan->current_bucket];
+ next = rcu_dereference(hlist_first_rcu(bucket));
+ if (next)
+ goto success;
+ }
+ scan->hsk = NULL;
+ rcu_read_unlock();
+ return NULL;
+
+success:
+ scan->hsk = hlist_entry(next, struct homa_sock, socktab_links);
+ sock_hold(&scan->hsk->sock);
+ rcu_read_unlock();
+ return scan->hsk;
+}
+
+/**
+ * homa_socktab_end_scan() - Must be invoked on completion of each scan
+ * to clean up state associated with the scan.
+ * @scan: State of the scan.
+ */
+void homa_socktab_end_scan(struct homa_socktab_scan *scan)
+{
+ if (scan->hsk) {
+ sock_put(&scan->hsk->sock);
+ scan->hsk = NULL;
+ }
+}
+
+/**
+ * homa_sock_init() - Constructor for homa_sock objects. This function
+ * initializes only the parts of the socket that are owned by Homa.
+ * @hsk: Object to initialize.
+ * @homa: Homa implementation that will manage the socket.
+ *
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_sock_init(struct homa_sock *hsk, struct homa *homa)
+{
+ struct homa_socktab *socktab = homa->port_map;
+ struct homa_sock *other;
+ int starting_port;
+ int result = 0;
+ int i;
+
+ /* Initialize fields outside the Homa part. */
+ hsk->sock.sk_sndbuf = homa->wmem_max;
+
+ /* Initialize Homa-specific fields. */
+ spin_lock_bh(&socktab->write_lock);
+ atomic_set(&hsk->protect_count, 0);
+ spin_lock_init(&hsk->lock);
+ hsk->homa = homa;
+ hsk->ip_header_length = (hsk->inet.sk.sk_family == AF_INET)
+ ? HOMA_IPV4_HEADER_LENGTH : HOMA_IPV6_HEADER_LENGTH;
+ hsk->is_server = false;
+ hsk->shutdown = false;
+ starting_port = homa->prev_default_port;
+ while (1) {
+ homa->prev_default_port++;
+ if (homa->prev_default_port < HOMA_MIN_DEFAULT_PORT)
+ homa->prev_default_port = HOMA_MIN_DEFAULT_PORT;
+ other = homa_sock_find(socktab, homa->prev_default_port);
+ if (!other)
+ break;
+ sock_put(&other->sock);
+ if (homa->prev_default_port == starting_port) {
+ spin_unlock_bh(&socktab->write_lock);
+ hsk->shutdown = true;
+ return -EADDRNOTAVAIL;
+ }
+ }
+ hsk->port = homa->prev_default_port;
+ hsk->inet.inet_num = hsk->port;
+ hsk->inet.inet_sport = htons(hsk->port);
+ hlist_add_head_rcu(&hsk->socktab_links,
+ &socktab->buckets[homa_port_hash(hsk->port)]);
+ INIT_LIST_HEAD(&hsk->active_rpcs);
+ INIT_LIST_HEAD(&hsk->dead_rpcs);
+ hsk->dead_skbs = 0;
+ INIT_LIST_HEAD(&hsk->waiting_for_bufs);
+ INIT_LIST_HEAD(&hsk->ready_rpcs);
+ INIT_LIST_HEAD(&hsk->interests);
+ for (i = 0; i < HOMA_CLIENT_RPC_BUCKETS; i++) {
+ struct homa_rpc_bucket *bucket = &hsk->client_rpc_buckets[i];
+
+ spin_lock_init(&bucket->lock);
+ bucket->id = i;
+ INIT_HLIST_HEAD(&bucket->rpcs);
+ }
+ for (i = 0; i < HOMA_SERVER_RPC_BUCKETS; i++) {
+ struct homa_rpc_bucket *bucket = &hsk->server_rpc_buckets[i];
+
+ spin_lock_init(&bucket->lock);
+ bucket->id = i + 1000000;
+ INIT_HLIST_HEAD(&bucket->rpcs);
+ }
+ hsk->buffer_pool = homa_pool_new(hsk);
+ if (IS_ERR(hsk->buffer_pool)) {
+ result = PTR_ERR(hsk->buffer_pool);
+ hsk->buffer_pool = NULL;
+ }
+ spin_unlock_bh(&socktab->write_lock);
+ return result;
+}
+
+/**
+ * homa_sock_unlink() - Unlinks a socket from its socktab and does
+ * related cleanups. Once this method returns, the socket will not be
+ * discoverable through the socktab.
+ * @hsk: Socket to unlink.
+ */
+void homa_sock_unlink(struct homa_sock *hsk)
+{
+ struct homa_socktab *socktab = hsk->homa->port_map;
+
+ spin_lock_bh(&socktab->write_lock);
+ hlist_del_rcu(&hsk->socktab_links);
+ spin_unlock_bh(&socktab->write_lock);
+}
+
+/**
+ * homa_sock_shutdown() - Disable a socket so that it can no longer
+ * be used for either sending or receiving messages. Any system calls
+ * currently waiting to send or receive messages will be aborted.
+ * @hsk: Socket to shut down.
+ */
+void homa_sock_shutdown(struct homa_sock *hsk)
+{
+ struct homa_interest *interest;
+ struct homa_rpc *rpc;
+ u64 tx_memory;
+
+ homa_sock_lock(hsk);
+ if (hsk->shutdown) {
+ homa_sock_unlock(hsk);
+ return;
+ }
+
+ /* The order of cleanup is very important, because there could be
+ * active operations that hold RPC locks but not the socket lock.
+ * 1. Set @shutdown; this ensures that no new RPCs will be created for
+ * this socket (though some creations might already be in progress).
+ * 2. Remove the socket from its socktab: this ensures that
+ * incoming packets for the socket will be dropped.
+ * 3. Go through all of the RPCs and delete them; this will
+ * synchronize with any operations in progress.
+ * 4. Perform other socket cleanup: at this point we know that
+ * there will be no concurrent activities on individual RPCs.
+ * 5. Don't delete the buffer pool until after all of the RPCs
+ * have been reaped.
+ * See sync.txt for additional information about locking.
+ */
+ hsk->shutdown = true;
+ homa_sock_unlink(hsk);
+ homa_sock_unlock(hsk);
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+ homa_rpc_lock(rpc);
+ homa_rpc_end(rpc);
+ homa_rpc_unlock(rpc);
+ }
+ rcu_read_unlock();
+
+ homa_sock_lock(hsk);
+ while (!list_empty(&hsk->interests)) {
+ interest = list_first_entry(&hsk->interests,
+ struct homa_interest, links);
+ __list_del_entry(&interest->links);
+ atomic_set_release(&interest->ready, 1);
+ wake_up(&interest->wait_queue);
+ }
+ homa_sock_unlock(hsk);
+
+ while (!list_empty(&hsk->dead_rpcs))
+ homa_rpc_reap(hsk, 1000);
+
+ tx_memory = refcount_read(&hsk->sock.sk_wmem_alloc);
+ if (tx_memory != 1) {
+ pr_err("%s found sk_wmem_alloc %llu bytes, port %d\n",
+ __func__, tx_memory, hsk->port);
+ }
+
+ if (hsk->buffer_pool) {
+ homa_pool_destroy(hsk->buffer_pool);
+ hsk->buffer_pool = NULL;
+ }
+}
+
+/**
+ * homa_sock_destroy() - Destructor for homa_sock objects. This function
+ * only cleans up the parts of the object that are owned by Homa.
+ * @hsk: Socket to destroy.
+ */
+void homa_sock_destroy(struct homa_sock *hsk)
+{
+ homa_sock_shutdown(hsk);
+ sock_set_flag(&hsk->inet.sk, SOCK_RCU_FREE);
+}
+
+/**
+ * homa_sock_bind() - Associates a server port with a socket; if there
+ * was a previous server port assignment for @hsk, it is abandoned.
+ * @socktab: Hash table in which the binding will be recorded.
+ * @hsk: Homa socket.
+ * @port: Desired server port for @hsk. If 0, then this call
+ * becomes a no-op: the socket will continue to use
+ * its randomly assigned client port.
+ *
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_sock_bind(struct homa_socktab *socktab, struct homa_sock *hsk,
+ __u16 port)
+{
+ struct homa_sock *owner;
+ int result = 0;
+
+ if (port == 0)
+ return result;
+ if (port >= HOMA_MIN_DEFAULT_PORT)
+ return -EINVAL;
+ homa_sock_lock(hsk);
+ spin_lock_bh(&socktab->write_lock);
+ if (hsk->shutdown) {
+ result = -ESHUTDOWN;
+ goto done;
+ }
+
+ owner = homa_sock_find(socktab, port);
+ if (owner) {
+ sock_put(&owner->sock);
+ if (owner != hsk)
+ result = -EADDRINUSE;
+ goto done;
+ }
+ hlist_del_rcu(&hsk->socktab_links);
+ hsk->port = port;
+ hsk->inet.inet_num = port;
+ hsk->inet.inet_sport = htons(hsk->port);
+ hlist_add_head_rcu(&hsk->socktab_links,
+ &socktab->buckets[homa_port_hash(port)]);
+ hsk->is_server = true;
+done:
+ spin_unlock_bh(&socktab->write_lock);
+ homa_sock_unlock(hsk);
+ return result;
+}
+
+/**
+ * homa_sock_find() - Returns the socket associated with a given port.
+ * @socktab: Hash table in which to perform lookup.
+ * @port: The port of interest.
+ * Return: The socket that owns @port, or NULL if none. If non-NULL
+ * then this method has taken a reference on the socket and
+ * the caller must call sock_put to release it.
+ */
+struct homa_sock *homa_sock_find(struct homa_socktab *socktab, __u16 port)
+{
+ struct homa_sock *hsk;
+ struct homa_sock *result = NULL;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(hsk, &socktab->buckets[homa_port_hash(port)],
+ socktab_links) {
+ if (hsk->port == port) {
+ result = hsk;
+ sock_hold(&hsk->sock);
+ break;
+ }
+ }
+ rcu_read_unlock();
+ return result;
+}
+
+/**
+ * homa_sock_wait_wmem() - Block the thread until @hsk's usage of tx
+ * packet memory drops below the socket's limit.
+ * @hsk: Socket of interest.
+ * @nonblocking: If there's not enough memory, return -EWOULDBLOCK instead
+ * of blocking.
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_sock_wait_wmem(struct homa_sock *hsk, int nonblocking)
+{
+ long timeo = hsk->sock.sk_sndtimeo;
+ int result;
+
+ if (nonblocking)
+ timeo = 0;
+ set_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
+ result = wait_event_interruptible_timeout(*sk_sleep(&hsk->sock),
+ homa_sock_wmem_avl(hsk) || hsk->shutdown,
+ timeo);
+ if (signal_pending(current))
+ return -EINTR;
+ if (result == 0)
+ return -EWOULDBLOCK;
+ return 0;
+}
diff --git a/net/homa/homa_sock.h b/net/homa/homa_sock.h
new file mode 100644
index 000000000000..d98f89709e6f
--- /dev/null
+++ b/net/homa/homa_sock.h
@@ -0,0 +1,386 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines structs and other things related to Homa sockets. */
+
+#ifndef _HOMA_SOCK_H
+#define _HOMA_SOCK_H
+
+/* Forward declarations. */
+struct homa;
+struct homa_pool;
+
+/**
+ * define HOMA_SOCKTAB_BUCKETS - Number of hash buckets in a homa_socktab.
+ * Must be a power of 2.
+ */
+#define HOMA_SOCKTAB_BUCKETS 1024
+
+/**
+ * struct homa_socktab - A hash table that maps from port numbers (either
+ * client or server) to homa_sock objects.
+ *
+ * This table is managed exclusively by homa_socktab.c, using RCU to
+ * minimize synchronization during lookups.
+ */
+struct homa_socktab {
+ /**
+ * @write_lock: Controls all modifications to this object; not needed
+ * for socket lookups (RCU is used instead). Also used to
+ * synchronize port allocation.
+ */
+ spinlock_t write_lock;
+
+ /**
+ * @buckets: Heads of chains for hash table buckets. Chains
+ * consist of homa_sock objects.
+ */
+ struct hlist_head buckets[HOMA_SOCKTAB_BUCKETS];
+};
+
+/**
+ * struct homa_socktab_scan - Records the state of an iteration over all
+ * the entries in a homa_socktab, in a way that is safe against concurrent
+ * reclamation of sockets.
+ */
+struct homa_socktab_scan {
+ /** @socktab: The table that is being scanned. */
+ struct homa_socktab *socktab;
+
+ /**
+ * @hsk: Points to the current socket in the iteration, or NULL if
+ * we're at the beginning or end of the iteration. If non-NULL then
+ * we are holding a reference to this socket.
+ */
+ struct homa_sock *hsk;
+
+ /**
+ * @current_bucket: The index of the bucket in socktab->buckets
+ * currently being scanned.
+ */
+ int current_bucket;
+};
+
+/**
+ * struct homa_rpc_bucket - One bucket in a hash table of RPCs.
+ */
+struct homa_rpc_bucket {
+ /**
+ * @lock: serves as a lock both for this bucket (e.g., when
+ * adding and removing RPCs) and also for all of the RPCs in
+ * the bucket. Must be held whenever manipulating an RPC in
+ * this bucket. This dual purpose permits clean and safe
+ * deletion and garbage collection of RPCs.
+ */
+ spinlock_t lock __context__(rpc_bucket_lock, 1, 1);
+
+ /**
+ * @id: identifier for this bucket, used in error messages etc.
+ * It's the index of the bucket within its hash table bucket
+ * array, with an additional offset to separate server and
+ * client RPCs.
+ */
+ int id;
+
+ /** @rpcs: list of RPCs that hash to this bucket. */
+ struct hlist_head rpcs;
+};
+
+/**
+ * define HOMA_CLIENT_RPC_BUCKETS - Number of buckets in hash tables for
+ * client RPCs. Must be a power of 2.
+ */
+#define HOMA_CLIENT_RPC_BUCKETS 1024
+
+/**
+ * define HOMA_SERVER_RPC_BUCKETS - Number of buckets in hash tables for
+ * server RPCs. Must be a power of 2.
+ */
+#define HOMA_SERVER_RPC_BUCKETS 1024
+
+/**
+ * struct homa_sock - Information about an open socket.
+ */
+struct homa_sock {
+ /* Info for other network layers. Note: if this socket uses IPv6,
+ * its IPv6 info (struct ipv6_pinfo) comes at the very end of the
+ * struct, *after* Homa's data.
+ */
+ union {
+ /** @sock: generic socket data; must be the first field. */
+ struct sock sock;
+
+ /**
+ * @inet: generic Internet socket data; must also be the
+ * first field (contains sock as its first member).
+ */
+ struct inet_sock inet;
+ };
+
+ /**
+ * @lock: Must be held when modifying fields such as interests
+ * and lists of RPCs. This lock is used in place of sk->sk_lock
+ * because it's used differently (it's always used as a simple
+ * spin lock). See sync.txt for more on Homa's synchronization
+ * strategy.
+ */
+ spinlock_t lock;
+
+ /**
+ * @protect_count: counts the number of calls to homa_protect_rpcs
+ * for which there have not yet been calls to homa_unprotect_rpcs.
+ * See sync.txt for more info.
+ */
+ atomic_t protect_count;
+
+ /**
+ * @homa: Overall state about the Homa implementation. NULL
+ * means this socket has been deleted.
+ */
+ struct homa *homa;
+
+ /**
+ * @is_server: True means that this socket can act as both client
+ * and server; false means the socket is client-only.
+ */
+ bool is_server;
+
+ /**
+ * @shutdown: True means the socket is no longer usable (either
+ * shutdown has already been invoked, or the socket was never
+ * properly initialized).
+ */
+ bool shutdown;
+
+ /**
+ * @port: Port number: identifies this socket uniquely among all
+ * those on this node.
+ */
+ __u16 port;
+
+ /**
+ * @ip_header_length: Length of IP headers for this socket (depends
+ * on IPv4 vs. IPv6).
+ */
+ int ip_header_length;
+
+ /** @socktab_links: Links this socket into a homa_socktab bucket. */
+ struct hlist_node socktab_links;
+
+ /**
+ * @active_rpcs: List of all existing RPCs related to this socket,
+ * including both client and server RPCs. This list isn't strictly
+ * needed, since RPCs are already in one of the hash tables below,
+ * but it's more efficient for homa_timer to have this list
+ * (so it doesn't have to scan large numbers of hash buckets).
+ * The list is sorted, with the oldest RPC first. Manipulate with
+ * RCU so timer can access without locking.
+ */
+ struct list_head active_rpcs;
+
+ /**
+ * @dead_rpcs: Contains RPCs for which homa_rpc_end has been
+ * called, but their packet buffers haven't yet been freed.
+ */
+ struct list_head dead_rpcs;
+
+ /** @dead_skbs: Total number of socket buffers in RPCs on dead_rpcs. */
+ int dead_skbs;
+
+ /**
+ * @waiting_for_bufs: Contains RPCs that are blocked because there
+ * wasn't enough space in the buffer pool region for their incoming
+ * messages. Sorted in increasing order of message length.
+ */
+ struct list_head waiting_for_bufs;
+
+ /**
+ * @ready_rpcs: List of all RPCs that are ready for attention from
+ * an application thread.
+ */
+ struct list_head ready_rpcs;
+
+ /**
+ * @interests: List of threads that are currently waiting for
+ * incoming messages via homa_wait_shared.
+ */
+ struct list_head interests;
+
+ /**
+ * @client_rpc_buckets: Hash table for fast lookup of client RPCs.
+ * Modifications are synchronized with bucket locks, not
+ * the socket lock.
+ */
+ struct homa_rpc_bucket client_rpc_buckets[HOMA_CLIENT_RPC_BUCKETS];
+
+ /**
+ * @server_rpc_buckets: Hash table for fast lookup of server RPCs.
+ * Modifications are synchronized with bucket locks, not
+ * the socket lock.
+ */
+ struct homa_rpc_bucket server_rpc_buckets[HOMA_SERVER_RPC_BUCKETS];
+
+ /**
+ * @buffer_pool: used to allocate buffer space for incoming messages.
+ * Storage is dynamically allocated.
+ */
+ struct homa_pool *buffer_pool;
+};
+
+/**
+ * struct homa_v6_sock - For IPv6, additional IPv6-specific information
+ * is present in the socket struct after Homa-specific information.
+ */
+struct homa_v6_sock {
+ /** @homa: All socket info except for IPv6-specific stuff. */
+ struct homa_sock homa;
+
+ /** @inet6: Socket info specific to IPv6. */
+ struct ipv6_pinfo inet6;
+};
+
+int homa_sock_bind(struct homa_socktab *socktab,
+ struct homa_sock *hsk, __u16 port);
+void homa_sock_destroy(struct homa_sock *hsk);
+struct homa_sock *homa_sock_find(struct homa_socktab *socktab, __u16 port);
+int homa_sock_init(struct homa_sock *hsk, struct homa *homa);
+void homa_sock_shutdown(struct homa_sock *hsk);
+void homa_sock_unlink(struct homa_sock *hsk);
+int homa_sock_wait_wmem(struct homa_sock *hsk, int nonblocking);
+int homa_socket(struct sock *sk);
+void homa_socktab_destroy(struct homa_socktab *socktab);
+void homa_socktab_end_scan(struct homa_socktab_scan *scan);
+void homa_socktab_init(struct homa_socktab *socktab);
+struct homa_sock *homa_socktab_next(struct homa_socktab_scan *scan);
+struct homa_sock *homa_socktab_start_scan(struct homa_socktab *socktab,
+ struct homa_socktab_scan *scan);
+
+/**
+ * homa_sock_lock() - Acquire the lock for a socket.
+ * @hsk: Socket to lock.
+ */
+static inline void homa_sock_lock(struct homa_sock *hsk)
+ __acquires(&hsk->lock)
+{
+ spin_lock_bh(&hsk->lock);
+}
+
+/**
+ * homa_sock_unlock() - Release the lock for a socket.
+ * @hsk: Socket to unlock.
+ */
+static inline void homa_sock_unlock(struct homa_sock *hsk)
+ __releases(&hsk->lock)
+{
+ spin_unlock_bh(&hsk->lock);
+}
+
+/**
+ * homa_port_hash() - Hash function for port numbers.
+ * @port: Port number being looked up.
+ *
+ * Return: The index of the bucket in which this port will be found (if
+ * it exists).
+ */
+static inline int homa_port_hash(__u16 port)
+{
+ /* We can use a really simple hash function here because client
+ * port numbers are allocated sequentially and server port numbers
+ * are unpredictable.
+ */
+ return port & (HOMA_SOCKTAB_BUCKETS - 1);
+}
+
+/**
+ * homa_client_rpc_bucket() - Find the bucket containing a given
+ * client RPC.
+ * @hsk: Socket associated with the RPC.
+ * @id: Id of the desired RPC.
+ *
+ * Return: The bucket in which this RPC will appear, if the RPC exists.
+ */
+static inline struct homa_rpc_bucket
+ *homa_client_rpc_bucket(struct homa_sock *hsk, u64 id)
+{
+ /* We can use a really simple hash function here because RPC ids
+ * are allocated sequentially.
+ */
+ return &hsk->client_rpc_buckets[(id >> 1)
+ & (HOMA_CLIENT_RPC_BUCKETS - 1)];
+}
+
+/**
+ * homa_server_rpc_bucket() - Find the bucket containing a given
+ * server RPC.
+ * @hsk: Socket associated with the RPC.
+ * @id: Id of the desired RPC.
+ *
+ * Return: The bucket in which this RPC will appear, if the RPC exists.
+ */
+static inline struct homa_rpc_bucket
+ *homa_server_rpc_bucket(struct homa_sock *hsk, u64 id)
+{
+ /* Each client allocates RPC ids sequentially, so they will
+ * naturally distribute themselves across the hash space.
+ * Thus we can use the id directly as hash.
+ */
+ return &hsk->server_rpc_buckets[(id >> 1)
+ & (HOMA_SERVER_RPC_BUCKETS - 1)];
+}
+
+/**
+ * homa_bucket_lock() - Acquire the lock for an RPC hash table bucket.
+ * @bucket: Bucket to lock.
+ * @id: Id of the RPC on whose behalf the bucket is being locked.
+ * Used only for metrics.
+ */
+static inline void homa_bucket_lock(struct homa_rpc_bucket *bucket, u64 id)
+ __acquires(rpc_bucket_lock)
+{
+ spin_lock_bh(&bucket->lock);
+}
+
+/**
+ * homa_bucket_unlock() - Release the lock for an RPC hash table bucket.
+ * @bucket: Bucket to unlock.
+ * @id: ID of the RPC that was using the lock.
+ */
+static inline void homa_bucket_unlock(struct homa_rpc_bucket *bucket, u64 id)
+ __releases(rpc_bucket_lock)
+{
+ spin_unlock_bh(&bucket->lock);
+}
+
+static inline struct homa_sock *homa_sk(const struct sock *sk)
+{
+ return (struct homa_sock *)sk;
+}
+
+/**
+ * homa_sock_wmem_avl() - Returns true if the socket is within its limit
+ * for output memory usage. False means that no new messages should be sent
+ * until memory is freed.
+ * @hsk: Socket of interest.
+ * Return: See above.
+ */
+static inline bool homa_sock_wmem_avl(struct homa_sock *hsk)
+{
+ return refcount_read(&hsk->sock.sk_wmem_alloc) < hsk->sock.sk_sndbuf;
+}
+
+/**
+ * homa_sock_wakeup_wmem() - Invoked when tx packet memory has been freed;
+ * if memory usage is below the limit and there are tasks waiting for memory,
+ * wake them up.
+ * @hsk: Socket of interest.
+ */
+static inline void homa_sock_wakeup_wmem(struct homa_sock *hsk)
+{
+ if (test_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags) &&
+ homa_sock_wmem_avl(hsk)) {
+ clear_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
+ wake_up_interruptible_poll(sk_sleep(&hsk->sock), EPOLLOUT);
+ }
+}
+
+#endif /* _HOMA_SOCK_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH net-next v8 07/15] net: homa: create homa_interest.h and homa_interest.
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (5 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 06/15] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-06 13:53 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c John Ousterhout
` (8 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
These files implement the homa_interest struct, which is used to
wait for incoming messages.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_interest.c | 122 +++++++++++++++++++++++++++++++++++++++
net/homa/homa_interest.h | 100 ++++++++++++++++++++++++++++++++
2 files changed, 222 insertions(+)
create mode 100644 net/homa/homa_interest.c
create mode 100644 net/homa/homa_interest.h
diff --git a/net/homa/homa_interest.c b/net/homa/homa_interest.c
new file mode 100644
index 000000000000..71bf70d7073b
--- /dev/null
+++ b/net/homa/homa_interest.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions for managing homa_interest structs. */
+
+#include "homa_impl.h"
+#include "homa_interest.h"
+#include "homa_rpc.h"
+#include "homa_sock.h"
+
+/**
+ * homa_interest_init_shared() - Initialize an interest and queue it up on a socket.
+ * @interest: Interest to initialize
+ * @hsk: Socket on which the interests should be queued. Must be locked
+ * by caller.
+ */
+void homa_interest_init_shared(struct homa_interest *interest,
+ struct homa_sock *hsk)
+ __must_hold(&hsk->lock)
+{
+ interest->rpc = NULL;
+ atomic_set(&interest->ready, 0);
+ interest->core = raw_smp_processor_id();
+ interest->blocked = 0;
+ init_waitqueue_head(&interest->wait_queue);
+ interest->hsk = hsk;
+ list_add(&interest->links, &hsk->interests);
+}
+
+/**
+ * homa_interest_init_private() - Initialize an interest that will wait
+ * on a particular (private) RPC, and link it to that RPC.
+ * @interest: Interest to initialize.
+ * @rpc: RPC to associate with the interest. Must be private, and
+ * caller must have locked it.
+ *
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_interest_init_private(struct homa_interest *interest,
+ struct homa_rpc *rpc)
+ __must_hold(&rpc->bucket->lock)
+{
+ if (rpc->private_interest)
+ return -EINVAL;
+
+ interest->rpc = rpc;
+ atomic_set(&interest->ready, 0);
+ interest->core = raw_smp_processor_id();
+ interest->blocked = 0;
+ init_waitqueue_head(&interest->wait_queue);
+ interest->hsk = rpc->hsk;
+ rpc->private_interest = interest;
+ return 0;
+}
+
+/**
+ * homa_interest_wait() - Wait for an interest to have an actionable RPC,
+ * or for an error to occur.
+ * @interest: Interest to wait for; must previously have been initialized
+ * and linked to a socket or RPC. On return, the interest
+ * will have been unlinked if its ready flag is set; otherwise
+ * it may still be linked.
+ * @nonblocking: Nonzero means return without blocking if the interest
+ * doesn't become ready immediately.
+ *
+ * Return: 0 for success (there is an actionable RPC in the interest), or
+ * a negative errno.
+ */
+int homa_interest_wait(struct homa_interest *interest, int nonblocking)
+{
+ struct homa_sock *hsk = interest->hsk;
+ int result = 0;
+ int iteration;
+ int wait_err;
+
+ interest->blocked = 0;
+
+ /* This loop iterates in order to poll and/or reap dead RPCs. */
+ for (iteration = 0; ; iteration++) {
+ if (iteration != 0)
+ /* Give NAPI/SoftIRQ tasks a chance to run. */
+ schedule();
+
+ if (atomic_read_acquire(&interest->ready) != 0)
+ goto done;
+
+ /* See if we can cleanup dead RPCs while waiting. */
+ if (homa_rpc_reap(hsk, false) != 0)
+ continue;
+
+ if (nonblocking) {
+ result = -EAGAIN;
+ goto done;
+ }
+
+ break;
+ }
+
+ interest->blocked = 1;
+ wait_err = wait_event_interruptible_exclusive(interest->wait_queue,
+ atomic_read_acquire(&interest->ready) != 0);
+ if (wait_err == -ERESTARTSYS)
+ result = -EINTR;
+
+done:
+ return result;
+}
+
+/**
+ * homa_interest_notify_private() - If a thread is waiting on the private
+ * interest for an RPC, wake it up.
+ * @rpc: RPC that may have a private interest. Must be
+ * locked by the caller.
+ */
+void homa_interest_notify_private(struct homa_rpc *rpc)
+ __must_hold(&rpc->bucket->lock)
+{
+ if (rpc->private_interest) {
+ atomic_set_release(&rpc->private_interest->ready, 1);
+ wake_up(&rpc->private_interest->wait_queue);
+ }
+}
diff --git a/net/homa/homa_interest.h b/net/homa/homa_interest.h
new file mode 100644
index 000000000000..78111bf68b6c
--- /dev/null
+++ b/net/homa/homa_interest.h
@@ -0,0 +1,100 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines struct homa_interest and related functions. */
+
+#ifndef _HOMA_INTEREST_H
+#define _HOMA_INTEREST_H
+
+#include "homa_rpc.h"
+#include "homa_sock.h"
+
+/**
+ * struct homa_interest - Used by homa_wait_private and homa_wait_shared to
+ * wait for incoming message data to arrive for an RPC. An interest can
+ * be either private (if referenced by an rpc->private_interest) or shared
+ * (if present on hsk->interests).
+ */
+struct homa_interest {
+ /**
+ * @rpc: If ready is set, then this holds an RPC that needs
+ * attention, or NULL if this is a shared interest and hsk has
+ * been shutdown. If ready is not set, this will be NULL if the
+ * interest is shared; if it's private, it holds the RPC the
+ * interest is associated with. If non-NULL, a reference has been
+ * taken on the RPC.
+ */
+ struct homa_rpc *rpc;
+
+ /**
+ * @ready: Nonzero means the interest is ready for attention: either
+ * there is an RPC that needs attention or @hsk has been shutdown.
+ */
+ atomic_t ready;
+
+ /**
+ * @core: Core on which homa_wait_* was invoked. This is a hint
+ * used for load balancing (see balance.txt).
+ */
+ int core;
+
+ /**
+ * @blocked: Zero means a handoff was received without the thread
+ * needing to block; nonzero means the thread blocked.
+ */
+ int blocked;
+
+ /**
+ * @wait_queue: Used to block the thread while waiting (will never
+ * have more than one queued thread).
+ */
+ struct wait_queue_head wait_queue;
+
+ /** @hsk: Socket that the interest is associated with. */
+ struct homa_sock *hsk;
+
+ /**
+ * @links: If the interest is shared, used to link this object into
+ * @hsk->interests.
+ */
+ struct list_head links;
+};
+
+/**
+ * homa_interest_unlink_shared() - Remove an interest from the list for a
+ * socket. Note: this can race with homa_rpc_handoff, so on return it's
+ * possible that the interest is ready.
+ * @interest: Interest to remove. Must have been initialized with
+ * homa_interest_init_shared.
+ */
+static inline void homa_interest_unlink_shared(struct homa_interest *interest)
+{
+ if (!list_empty(&interest->links)) {
+ homa_sock_lock(interest->hsk);
+ list_del_init(&interest->links);
+ homa_sock_unlock(interest->hsk);
+ }
+}
+
+/**
+ * homa_interest_unlink_private() - Detach a private interest from its
+ * RPC. Note: this can race with homa_rpc_handoff, so on return it's
+ * possible that the interest is ready.
+ * @interest: Interest to remove. Must have been initialized with
+ * homa_interest_init_private. Its RPC must be locked by
+ * the caller.
+ */
+static inline void homa_interest_unlink_private(struct homa_interest *interest)
+ __must_hold(&interest->rpc->bucket->lock)
+{
+ if (interest == interest->rpc->private_interest)
+ interest->rpc->private_interest = NULL;
+}
+
+void homa_interest_init_shared(struct homa_interest *interest,
+ struct homa_sock *hsk);
+int homa_interest_init_private(struct homa_interest *interest,
+ struct homa_rpc *rpc);
+void homa_interest_notify_private(struct homa_rpc *rpc);
+int homa_interest_wait(struct homa_interest *interest, int nonblocking);
+
+#endif /* _HOMA_INTEREST_H */
--
2.43.0
* [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (6 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 07/15] net: homa: create homa_interest.h and homa_interest John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-06 14:05 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 09/15] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
` (7 subsequent siblings)
15 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
These files provide facilities to pace packet output in order to prevent
queue buildup in the NIC. This functionality is needed to implement SRPT
on output, so short messages don't get stuck in long NIC queues.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v8:
* This file is new in v8 (functionality extracted from other files)
---
net/homa/homa_pacer.c | 310 ++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_pacer.h | 185 +++++++++++++++++++++++++
2 files changed, 495 insertions(+)
create mode 100644 net/homa/homa_pacer.c
create mode 100644 net/homa/homa_pacer.h
diff --git a/net/homa/homa_pacer.c b/net/homa/homa_pacer.c
new file mode 100644
index 000000000000..12715a6aed10
--- /dev/null
+++ b/net/homa/homa_pacer.c
@@ -0,0 +1,310 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file implements the Homa pacer, which implements SRPT for packet
+ * output. In order to do that, it throttles packet transmission to prevent
+ * the buildup of large queues in the NIC.
+ */
+
+#include "homa_pacer.h"
+#include "homa_rpc.h"
+
+/**
+ * homa_pacer_new() - Allocate and initialize a new pacer object, which
+ * will hold pacer-related information for @homa.
+ * @homa: Homa transport that the pacer will be associated with.
+ * @net: Network namespace that @homa is associated with.
+ * Return: A pointer to the new struct homa_pacer, or an ERR_PTR on error.
+ */
+struct homa_pacer *homa_pacer_new(struct homa *homa, struct net *net)
+{
+ struct homa_pacer *pacer;
+ int err;
+
+ pacer = kmalloc(sizeof(*pacer), GFP_KERNEL | __GFP_ZERO);
+ if (!pacer)
+ return ERR_PTR(-ENOMEM);
+ pacer->homa = homa;
+ spin_lock_init(&pacer->mutex);
+ pacer->fifo_count = 1000;
+ spin_lock_init(&pacer->throttle_lock);
+ INIT_LIST_HEAD_RCU(&pacer->throttled_rpcs);
+ pacer->fifo_fraction = 50;
+ pacer->max_nic_queue_ns = 5000;
+ pacer->link_mbps = 25000;
+ pacer->throttle_min_bytes = 1000;
+ pacer->exit = false;
+ init_waitqueue_head(&pacer->wait_queue);
+ init_completion(&pacer->kthread_done);
+ atomic64_set(&pacer->link_idle_time, sched_clock());
+ pacer->kthread = kthread_run(homa_pacer_main, pacer, "homa_pacer");
+ if (IS_ERR(pacer->kthread)) {
+ err = PTR_ERR(pacer->kthread);
+ pacer->kthread = NULL;
+ pr_err("Homa couldn't create pacer thread: error %d\n", err);
+ goto error;
+ }
+
+ homa_pacer_update_sysctl_deps(pacer);
+ return pacer;
+
+error:
+ homa_pacer_destroy(pacer);
+ return ERR_PTR(err);
+}
+
+/**
+ * homa_pacer_destroy() - Cleanup and destroy the pacer object for a Homa
+ * transport.
+ * @pacer: Object to destroy; caller must not reference the object
+ * again once this function returns.
+ */
+void homa_pacer_destroy(struct homa_pacer *pacer)
+{
+ pacer->exit = true;
+ if (pacer->kthread) {
+ wake_up(&pacer->wait_queue);
+ kthread_stop(pacer->kthread);
+ wait_for_completion(&pacer->kthread_done);
+ pacer->kthread = NULL;
+ }
+ kfree(pacer);
+}
+
+/**
+ * homa_pacer_check_nic_q() - This function is invoked before passing a
+ * packet to the NIC for transmission. It serves two purposes. First, it
+ * maintains an estimate of the NIC queue length. Second, it indicates to
+ * the caller whether the NIC queue is so full that no new packets should be
+ * queued (Homa's SRPT depends on keeping the NIC queue short).
+ * @pacer: Pacer information for a Homa transport.
+ * @skb: Packet that is about to be transmitted.
+ * @force: True means this packet is going to be transmitted
+ * regardless of the queue length.
+ * Return: Nonzero is returned if either the NIC queue length is
+ * acceptably short or @force was specified. 0 means that the
+ * NIC queue is at capacity or beyond, so the caller should delay
+ * the transmission of @skb. If nonzero is returned, then the
+ * queue estimate is updated to reflect the transmission of @skb.
+ */
+int homa_pacer_check_nic_q(struct homa_pacer *pacer, struct sk_buff *skb,
+ bool force)
+{
+ u64 idle, new_idle, clock, ns_for_packet;
+ int bytes;
+
+ bytes = homa_get_skb_info(skb)->wire_bytes;
+ ns_for_packet = pacer->ns_per_mbyte;
+ ns_for_packet *= bytes;
+ do_div(ns_for_packet, 1000000);
+ while (1) {
+ clock = sched_clock();
+ idle = atomic64_read(&pacer->link_idle_time);
+ if ((clock + pacer->max_nic_queue_ns) < idle && !force &&
+ !(pacer->homa->flags & HOMA_FLAG_DONT_THROTTLE))
+ return 0;
+ if (idle < clock)
+ new_idle = clock + ns_for_packet;
+ else
+ new_idle = idle + ns_for_packet;
+
+ /* This method must be thread-safe. */
+ if (atomic64_cmpxchg_relaxed(&pacer->link_idle_time, idle,
+ new_idle) == idle)
+ break;
+ }
+ return 1;
+}
+
+/**
+ * homa_pacer_main() - Top-level function for the pacer thread.
+ * @arg: Pointer to pacer struct.
+ *
+ * Return: Always 0.
+ */
+int homa_pacer_main(void *arg)
+{
+ struct homa_pacer *pacer = arg;
+
+ while (1) {
+ if (pacer->exit)
+ break;
+ pacer->wake_time = sched_clock();
+ homa_pacer_xmit(pacer);
+ pacer->wake_time = 0;
+ if (!list_empty(&pacer->throttled_rpcs)) {
+ /* NIC queue is full; before calling pacer again,
+ * give other threads a chance to run (otherwise
+ * low-level packet processing such as softirq could
+ * get locked out).
+ */
+ schedule();
+ continue;
+ }
+
+ wait_event(pacer->wait_queue, pacer->exit ||
+ !list_empty(&pacer->throttled_rpcs));
+ }
+ kthread_complete_and_exit(&pacer->kthread_done, 0);
+ return 0;
+}
+
+/**
+ * homa_pacer_xmit() - Transmit packets from the throttled list until
+ * either (a) the throttled list is empty or (b) the NIC queue has
+ * reached maximum allowable length. Note: this function may be invoked
+ * from either process context or softirq (BH) level. This function is
+ * invoked from multiple places, not just in the pacer thread. The reason
+ * for this is that (as of 10/2019) Linux's scheduling of the pacer thread
+ * is unpredictable: the thread may block for long periods of time (e.g.,
+ * because it is assigned to the same CPU as a busy interrupt handler).
+ * This can result in poor utilization of the network link. So, this method
+ * gets invoked from other places as well, to increase the likelihood that we
+ * keep the link busy. Those other invocations are not guaranteed to happen,
+ * so the pacer thread provides a backstop.
+ * @pacer: Pacer information for a Homa transport.
+ */
+void homa_pacer_xmit(struct homa_pacer *pacer)
+{
+ struct homa_rpc *rpc;
+ s64 queue_ns;
+
+ /* Make sure only one instance of this function executes at a time. */
+ if (!spin_trylock_bh(&pacer->mutex))
+ return;
+
+ while (1) {
+ queue_ns = atomic64_read(&pacer->link_idle_time) - sched_clock();
+ if (queue_ns >= pacer->max_nic_queue_ns)
+ break;
+ if (list_empty(&pacer->throttled_rpcs))
+ break;
+
+ /* Lock the first throttled RPC. This may not be possible
+ * because we have to hold throttle_lock while locking
+ * the RPC; that means we can't wait for the RPC lock because
+ * of lock ordering constraints (see sync.txt). Thus, if
+ * the RPC lock isn't available, do nothing. Holding the
+ * throttle lock while locking the RPC is important because
+ * it keeps the RPC from being deleted before it can be locked.
+ */
+ homa_pacer_throttle_lock(pacer);
+ pacer->fifo_count -= pacer->fifo_fraction;
+ if (pacer->fifo_count <= 0) {
+ struct homa_rpc *cur;
+ u64 oldest = ~0;
+
+ pacer->fifo_count += 1000;
+ rpc = NULL;
+ list_for_each_entry(cur, &pacer->throttled_rpcs,
+ throttled_links) {
+ if (cur->msgout.init_ns < oldest) {
+ rpc = cur;
+ oldest = cur->msgout.init_ns;
+ }
+ }
+ } else {
+ rpc = list_first_entry_or_null(&pacer->throttled_rpcs,
+ struct homa_rpc,
+ throttled_links);
+ }
+ if (!rpc) {
+ homa_pacer_throttle_unlock(pacer);
+ break;
+ }
+ if (!homa_rpc_try_lock(rpc)) {
+ homa_pacer_throttle_unlock(pacer);
+ break;
+ }
+ homa_pacer_throttle_unlock(pacer);
+
+ homa_xmit_data(rpc, true);
+
+ /* Note: rpc->state could be RPC_DEAD here, but the code
+ * below should work anyway.
+ */
+ if (!*rpc->msgout.next_xmit)
+ /* No more data can be transmitted from this message
+ * (right now), so remove it from the throttled list.
+ */
+ homa_pacer_unmanage_rpc(rpc);
+ homa_rpc_unlock(rpc);
+ }
+ spin_unlock_bh(&pacer->mutex);
+}
+
+/**
+ * homa_pacer_manage_rpc() - Arrange for the pacer to transmit packets
+ * from this RPC (make sure that an RPC is on the throttled list and wake up
+ * the pacer thread if necessary).
+ * @rpc: RPC with outbound packets that have been granted but can't be
+ * sent because of NIC queue restrictions. Must be locked by caller.
+ */
+void homa_pacer_manage_rpc(struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_pacer *pacer = rpc->hsk->homa->pacer;
+ struct homa_rpc *candidate;
+ int bytes_left;
+ int checks = 0;
+
+ if (!list_empty(&rpc->throttled_links))
+ return;
+ bytes_left = rpc->msgout.length - rpc->msgout.next_xmit_offset;
+ homa_pacer_throttle_lock(pacer);
+ list_for_each_entry(candidate, &pacer->throttled_rpcs,
+ throttled_links) {
+ int bytes_left_cand;
+
+ checks++;
+
+ /* Watch out: the pacer might have just transmitted the last
+ * packet from candidate.
+ */
+ bytes_left_cand = candidate->msgout.length -
+ candidate->msgout.next_xmit_offset;
+ if (bytes_left_cand > bytes_left) {
+ list_add_tail(&rpc->throttled_links,
+ &candidate->throttled_links);
+ goto done;
+ }
+ }
+ list_add_tail(&rpc->throttled_links, &pacer->throttled_rpcs);
+done:
+ homa_pacer_throttle_unlock(pacer);
+ wake_up(&pacer->wait_queue);
+}
+
+/**
+ * homa_pacer_unmanage_rpc() - Make sure that an RPC is no longer managed
+ * by the pacer.
+ * @rpc: RPC of interest.
+ */
+void homa_pacer_unmanage_rpc(struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_pacer *pacer = rpc->hsk->homa->pacer;
+
+ if (unlikely(!list_empty(&rpc->throttled_links))) {
+ homa_pacer_throttle_lock(pacer);
+ list_del_init(&rpc->throttled_links);
+ homa_pacer_throttle_unlock(pacer);
+ }
+}
+
+/**
+ * homa_pacer_update_sysctl_deps() - Update any pacer fields that depend
+ * on values set by sysctl. This function is invoked anytime a pacer sysctl
+ * value is updated.
+ * @pacer: Pacer to update.
+ */
+void homa_pacer_update_sysctl_deps(struct homa_pacer *pacer)
+{
+ u64 tmp;
+
+ tmp = 8 * 1000ULL * 1000ULL * 1000ULL;
+
+ /* Underestimate link bandwidth (overestimate time) by 1%. */
+ tmp = tmp * 101 / 100;
+ do_div(tmp, pacer->link_mbps);
+ pacer->ns_per_mbyte = tmp;
+}
diff --git a/net/homa/homa_pacer.h b/net/homa/homa_pacer.h
new file mode 100644
index 000000000000..c6318dfe8878
--- /dev/null
+++ b/net/homa/homa_pacer.h
@@ -0,0 +1,185 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines structs and functions related to the Homa pacer,
+ * which implements SRPT for packet output. In order to do that, it
+ * throttles packet transmission to prevent the buildup of
+ * large queues in the NIC.
+ */
+
+#ifndef _HOMA_PACER_H
+#define _HOMA_PACER_H
+
+#include "homa_impl.h"
+
+/**
+ * struct homa_pacer - Contains information that the pacer uses to
+ * manage packet output. There is one instance of this object stored
+ * in each struct homa.
+ */
+struct homa_pacer {
+ /** @homa: Transport that this pacer is associated with. */
+ struct homa *homa;
+
+ /**
+ * @mutex: Ensures that only one instance of homa_pacer_xmit
+ * runs at a time. Only used in "try" mode: never block on this.
+ */
+ spinlock_t mutex;
+
+ /**
+ * @fifo_count: When this becomes <= zero, it's time for the
+ * pacer to allow the oldest RPC to transmit.
+ */
+ int fifo_count;
+
+ /**
+ * @wake_time: Time (in sched_clock units) when the pacer last
+ * woke up (if the pacer is running) or 0 if the pacer is sleeping.
+ */
+ u64 wake_time;
+
+ /**
+ * @throttle_lock: Used to synchronize access to @throttled_rpcs. Must
+ * be held when inserting or removing an RPC from @throttled_rpcs.
+ */
+ spinlock_t throttle_lock;
+
+ /**
+ * @throttled_rpcs: Contains all homa_rpcs that have bytes ready
+ * for transmission, but which couldn't be sent without exceeding
+ * the NIC queue limit.
+ */
+ struct list_head throttled_rpcs;
+
+ /**
+ * @fifo_fraction: Out of every 1000 packets transmitted by the
+ * pacer, this number will be transmitted from the oldest message
+ * rather than the highest-priority message. Set externally via
+ * sysctl.
+ */
+ int fifo_fraction;
+
+ /**
+ * @max_nic_queue_ns: Limits the NIC queue length: we won't queue
+ * up a packet for transmission if link_idle_time is this many
+ * nanoseconds in the future (or more). Set externally via sysctl.
+ */
+ int max_nic_queue_ns;
+
+ /**
+ * @link_mbps: The raw bandwidth of the network uplink, in
+ * units of 1e06 bits per second. Set externally via sysctl.
+ */
+ int link_mbps;
+
+ /**
+ * @ns_per_mbyte: The number of ns that it takes to transmit
+ * 10**6 bytes on our uplink. This is actually a slight overestimate
+ * of the value, to ensure that we don't underestimate NIC queue
+ * length and queue too many packets.
+ */
+ u32 ns_per_mbyte;
+
+ /**
+ * @throttle_min_bytes: If a packet has fewer bytes than this, then it
+ * bypasses the throttle mechanism and is transmitted immediately.
+ * We have this limit because for very small packets CPU overheads
+ * make it impossible to keep up with the NIC so (a) the NIC queue
+ * can't grow and (b) using the pacer would serialize all of these
+ * packets through a single core, which makes things even worse.
+ * Set externally via sysctl.
+ */
+ int throttle_min_bytes;
+
+ /**
+ * @exit: true means that the pacer thread should exit as
+ * soon as possible.
+ */
+ bool exit;
+
+ /**
+ * @wait_queue: Used to block the pacer thread when there
+ * are no throttled RPCs.
+ */
+ struct wait_queue_head wait_queue;
+
+ /**
+ * @kthread: Kernel thread that transmits packets from
+ * throttled_rpcs in a way that limits queue buildup in the
+ * NIC.
+ */
+ struct task_struct *kthread;
+
+ /**
+ * @kthread_done: Used to wait for @kthread to exit.
+ */
+ struct completion kthread_done;
+
+ /**
+ * @link_idle_time: The time, measured by sched_clock, at which we
+ * estimate that all of the packets we have passed to the NIC for
+ * transmission will have been transmitted. May be in the past.
+ * This estimate assumes that only Homa is transmitting data, so
+ * it could be a severe underestimate if there is competing traffic
+ * from, say, TCP. Access only with atomic ops.
+ */
+ atomic64_t link_idle_time ____cacheline_aligned_in_smp;
+};
+
+int homa_pacer_check_nic_q(struct homa_pacer *pacer,
+ struct sk_buff *skb, bool force);
+int homa_pacer_dointvec(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos);
+void homa_pacer_destroy(struct homa_pacer *pacer);
+void homa_pacer_unmanage_rpc(struct homa_rpc *rpc);
+void homa_pacer_log_throttled(struct homa_pacer *pacer);
+int homa_pacer_main(void *transport);
+void homa_pacer_manage_rpc(struct homa_rpc *rpc);
+struct homa_pacer *homa_pacer_new(struct homa *homa, struct net *net);
+void homa_pacer_throttle_lock_slow(struct homa_pacer *pacer);
+void homa_pacer_update_sysctl_deps(struct homa_pacer *pacer);
+void homa_pacer_xmit(struct homa_pacer *pacer);
+
+/**
+ * homa_pacer_check() - This method is invoked at various places in Homa to
+ * see if the pacer needs to transmit more packets and, if so, transmit
+ * them. It's needed because the pacer thread may get descheduled by
+ * Linux, resulting in output stalls.
+ * @pacer: Pacer information for a Homa transport.
+ */
+static inline void homa_pacer_check(struct homa_pacer *pacer)
+{
+ if (list_empty(&pacer->throttled_rpcs))
+ return;
+
+ /* The ">> 1" in the line below gives homa_pacer_main the first chance
+ * to queue new packets; if the NIC queue becomes more than half
+ * empty, then we will help out here.
+ */
+ if ((sched_clock() + (pacer->max_nic_queue_ns >> 1)) <
+ atomic64_read(&pacer->link_idle_time))
+ return;
+ homa_pacer_xmit(pacer);
+}
+
+/**
+ * homa_pacer_throttle_lock() - Acquire the throttle lock.
+ * @pacer: Pacer information for a Homa transport.
+ */
+static inline void homa_pacer_throttle_lock(struct homa_pacer *pacer)
+ __acquires(&pacer->throttle_lock)
+{
+ spin_lock_bh(&pacer->throttle_lock);
+}
+
+/**
+ * homa_pacer_throttle_unlock() - Release the throttle lock.
+ * @pacer: Pacer information for a Homa transport.
+ */
+static inline void homa_pacer_throttle_unlock(struct homa_pacer *pacer)
+ __releases(&pacer->throttle_lock)
+{
+ spin_unlock_bh(&pacer->throttle_lock);
+}
+
+#endif /* _HOMA_PACER_H */
--
2.43.0
* [PATCH net-next v8 09/15] net: homa: create homa_rpc.h and homa_rpc.c
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (7 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 10/15] net: homa: create homa_outgoing.c John Ousterhout
` (6 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
These files provide basic functions for managing remote procedure calls,
which are the fundamental entities managed by Homa. Each RPC consists
of a request message from a client to a server, followed by a response
message returned from the server to the client.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v8:
* Updates to reflect pacer refactoring
Changes for v7:
* Implement accounting for bytes in tx skbs
* Fix potential races related to homa->active_rpcs
* Refactor waiting mechanism for incoming packets: simplify wait
criteria and use standard Linux mechanisms for waiting
* Add reference counting for RPCs (homa_rpc_hold, homa_rpc_put)
* Remove locker argument from locking functions
* Rename homa_rpc_free to homa_rpc_end
* Use u64 and __u64 properly
* Use __skb_queue_purge instead of skb_queue_purge
* Use __GFP_ZERO in kmalloc calls
* Eliminate spurious RCU usage
---
net/homa/homa_rpc.c | 484 ++++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_rpc.h | 484 ++++++++++++++++++++++++++++++++++++++++++++
net/homa/reap.txt | 50 +++++
3 files changed, 1018 insertions(+)
create mode 100644 net/homa/homa_rpc.c
create mode 100644 net/homa/homa_rpc.h
create mode 100644 net/homa/reap.txt
diff --git a/net/homa/homa_rpc.c b/net/homa/homa_rpc.c
new file mode 100644
index 000000000000..f21cd4e66e87
--- /dev/null
+++ b/net/homa/homa_rpc.c
@@ -0,0 +1,484 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions for managing homa_rpc structs. */
+
+#include "homa_impl.h"
+#include "homa_interest.h"
+#include "homa_pacer.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+#include "homa_stub.h"
+
+/**
+ * homa_rpc_new_client() - Allocate and construct a client RPC (one that is used
+ * to issue an outgoing request). Doesn't send any packets. Invoked with no
+ * locks held.
+ * @hsk: Socket to which the RPC belongs.
+ * @dest: Address of host (ip and port) to which the RPC will be sent.
+ *
+ * Return: A pointer to the newly allocated object, or a negative
+ * errno if an error occurred. The RPC will be locked; the
+ * caller must eventually unlock it.
+ */
+struct homa_rpc *homa_rpc_new_client(struct homa_sock *hsk,
+ const union sockaddr_in_union *dest)
+ __acquires(rpc_bucket_lock)
+{
+ struct in6_addr dest_addr_as_ipv6 = canonical_ipv6_addr(dest);
+ struct homa_rpc_bucket *bucket;
+ struct homa_rpc *crpc;
+ int err;
+
+ crpc = kmalloc(sizeof(*crpc), GFP_KERNEL | __GFP_ZERO);
+ if (unlikely(!crpc))
+ return ERR_PTR(-ENOMEM);
+
+ /* Initialize fields that don't require the socket lock. */
+ crpc->hsk = hsk;
+ crpc->id = atomic64_fetch_add(2, &hsk->homa->next_outgoing_id);
+ bucket = homa_client_rpc_bucket(hsk, crpc->id);
+ crpc->bucket = bucket;
+ crpc->state = RPC_OUTGOING;
+ crpc->peer = homa_peer_find(hsk->homa->peers, &dest_addr_as_ipv6,
+ &hsk->inet);
+ if (IS_ERR(crpc->peer)) {
+ err = PTR_ERR(crpc->peer);
+ goto error;
+ }
+ crpc->dport = ntohs(dest->in6.sin6_port);
+ crpc->msgin.length = -1;
+ crpc->msgout.length = -1;
+ INIT_LIST_HEAD(&crpc->ready_links);
+ INIT_LIST_HEAD(&crpc->buf_links);
+ INIT_LIST_HEAD(&crpc->dead_links);
+ INIT_LIST_HEAD(&crpc->throttled_links);
+ crpc->resend_timer_ticks = hsk->homa->timer_ticks;
+ crpc->magic = HOMA_RPC_MAGIC;
+ crpc->start_ns = sched_clock();
+
+ /* Initialize fields that require locking. This allows the most
+ * expensive work, such as copying in the message from user space,
+ * to be performed without holding locks. Also, can't hold spin
+ * locks while doing things that could block, such as memory allocation.
+ */
+ homa_bucket_lock(bucket, crpc->id);
+ homa_sock_lock(hsk);
+ if (hsk->shutdown) {
+ homa_sock_unlock(hsk);
+ homa_rpc_unlock(crpc);
+ err = -ESHUTDOWN;
+ goto error;
+ }
+ hlist_add_head(&crpc->hash_links, &bucket->rpcs);
+ rcu_read_lock();
+ list_add_tail_rcu(&crpc->active_links, &hsk->active_rpcs);
+ rcu_read_unlock();
+ homa_sock_unlock(hsk);
+
+ return crpc;
+
+error:
+ kfree(crpc);
+ return ERR_PTR(err);
+}
+
+/**
+ * homa_rpc_new_server() - Allocate and construct a server RPC (one that is
+ * used to manage an incoming request). If appropriate, the RPC will also
+ * be handed off (we do it here, while we have the socket locked, to avoid
+ * acquiring the socket lock a second time later for the handoff).
+ * @hsk: Socket that owns this RPC.
+ * @source: IP address (network byte order) of the RPC's client.
+ * @h: Header for the first data packet received for this RPC; used
+ * to initialize the RPC.
+ * @created: Will be set to 1 if a new RPC was created and 0 if an
+ * existing RPC was found.
+ *
+ * Return: A pointer to a new RPC, which is locked, or a negative errno
+ * if an error occurred. If there is already an RPC corresponding
+ * to h, then it is returned instead of creating a new RPC.
+ */
+struct homa_rpc *homa_rpc_new_server(struct homa_sock *hsk,
+ const struct in6_addr *source,
+ struct homa_data_hdr *h, int *created)
+ __acquires(rpc_bucket_lock)
+{
+ u64 id = homa_local_id(h->common.sender_id);
+ struct homa_rpc_bucket *bucket;
+ struct homa_rpc *srpc = NULL;
+ int err;
+
+ /* Lock the bucket, and make sure no-one else has already created
+ * the desired RPC.
+ */
+ bucket = homa_server_rpc_bucket(hsk, id);
+ homa_bucket_lock(bucket, id);
+ hlist_for_each_entry(srpc, &bucket->rpcs, hash_links) {
+ if (srpc->id == id &&
+ srpc->dport == ntohs(h->common.sport) &&
+ ipv6_addr_equal(&srpc->peer->addr, source)) {
+ /* RPC already exists; just return it instead
+ * of creating a new RPC.
+ */
+ *created = 0;
+ return srpc;
+ }
+ }
+
+ /* Initialize fields that don't require the socket lock. */
+ srpc = kmalloc(sizeof(*srpc), GFP_ATOMIC | __GFP_ZERO);
+ if (!srpc) {
+ err = -ENOMEM;
+ goto error;
+ }
+ srpc->hsk = hsk;
+ srpc->bucket = bucket;
+ srpc->state = RPC_INCOMING;
+ srpc->peer = homa_peer_find(hsk->homa->peers, source, &hsk->inet);
+ if (IS_ERR(srpc->peer)) {
+ err = PTR_ERR(srpc->peer);
+ goto error;
+ }
+ srpc->dport = ntohs(h->common.sport);
+ srpc->id = id;
+ srpc->msgin.length = -1;
+ srpc->msgout.length = -1;
+ INIT_LIST_HEAD(&srpc->ready_links);
+ INIT_LIST_HEAD(&srpc->buf_links);
+ INIT_LIST_HEAD(&srpc->dead_links);
+ INIT_LIST_HEAD(&srpc->throttled_links);
+ srpc->resend_timer_ticks = hsk->homa->timer_ticks;
+ srpc->magic = HOMA_RPC_MAGIC;
+ srpc->start_ns = sched_clock();
+ err = homa_message_in_init(srpc, ntohl(h->message_length));
+ if (err != 0)
+ goto error;
+
+ /* Initialize fields that require socket to be locked. */
+ homa_sock_lock(hsk);
+ if (hsk->shutdown) {
+ homa_sock_unlock(hsk);
+ err = -ESHUTDOWN;
+ goto error;
+ }
+ hlist_add_head(&srpc->hash_links, &bucket->rpcs);
+ list_add_tail_rcu(&srpc->active_links, &hsk->active_rpcs);
+ homa_sock_unlock(hsk);
+ if (ntohl(h->seg.offset) == 0 && srpc->msgin.num_bpages > 0) {
+ atomic_or(RPC_PKTS_READY, &srpc->flags);
+ homa_rpc_handoff(srpc);
+ }
+ *created = 1;
+ return srpc;
+
+error:
+ homa_bucket_unlock(bucket, id);
+ kfree(srpc);
+ return ERR_PTR(err);
+}
+
+/**
+ * homa_rpc_acked() - This function is invoked when an ack is received
+ * for an RPC; if the RPC still exists, it is freed.
+ * @hsk: Socket on which the ack was received. May or may not correspond
+ * to the RPC, but can sometimes be used to avoid a socket lookup.
+ * @saddr: Source address from which the ack was received (the client
+ * node for the RPC).
+ * @ack: Information about an RPC from @saddr that may now be deleted
+ * safely.
+ */
+void homa_rpc_acked(struct homa_sock *hsk, const struct in6_addr *saddr,
+ struct homa_ack *ack)
+{
+ __u16 server_port = ntohs(ack->server_port);
+ u64 id = homa_local_id(ack->client_id);
+ struct homa_sock *hsk2 = hsk;
+ struct homa_rpc *rpc;
+
+ if (hsk->port != server_port) {
+ /* Without RCU, sockets other than hsk can be deleted
+ * out from under us.
+ */
+ hsk2 = homa_sock_find(hsk->homa->port_map, server_port);
+ if (!hsk2)
+ return;
+ }
+ rpc = homa_find_server_rpc(hsk2, saddr, id);
+ if (rpc) {
+ homa_rpc_end(rpc);
+ homa_rpc_unlock(rpc); /* Locked by homa_find_server_rpc. */
+ }
+ if (hsk->port != server_port)
+ sock_put(&hsk2->sock);
+}
+
+/**
+ * homa_rpc_end() - Stop all activity on an RPC and begin the process of
+ * releasing its resources; this process will continue in the background
+ * until homa_rpc_reap eventually completes it.
+ * @rpc: Structure to clean up, or NULL. Must be locked. Its socket must
+ * not be locked. Once this function returns the caller should not
+ * use the RPC except to unlock it.
+ */
+void homa_rpc_end(struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+ /* The goal for this function is to make the RPC inaccessible,
+ * so that no other code will ever access it again. However, don't
+ * actually release resources; leave that to homa_rpc_reap, which
+ * runs later. There are two reasons for this. First, releasing
+ * resources may be expensive, so we don't want to keep the caller
+ * waiting; homa_rpc_reap will run in situations where there is time
+ * to spare. Second, there may be other code that currently has
+ * pointers to this RPC but temporarily released the lock (e.g. to
+ * copy data to/from user space). It isn't safe to clean up until
+ * that code has finished its work and released any pointers to the
+ * RPC (homa_rpc_reap will ensure that this has happened). So, this
+ * function should only make changes needed to make the RPC
+ * inaccessible.
+ */
+ if (!rpc || rpc->state == RPC_DEAD)
+ return;
+ rpc->state = RPC_DEAD;
+ rpc->error = -EINVAL;
+
+ /* Unlink from all lists, so no-one will ever find this RPC again. */
+ homa_sock_lock(rpc->hsk);
+ __hlist_del(&rpc->hash_links);
+ list_del_rcu(&rpc->active_links);
+ list_add_tail(&rpc->dead_links, &rpc->hsk->dead_rpcs);
+ __list_del_entry(&rpc->ready_links);
+ __list_del_entry(&rpc->buf_links);
+ homa_interest_notify_private(rpc);
+
+ if (rpc->msgin.length >= 0) {
+ rpc->hsk->dead_skbs += skb_queue_len(&rpc->msgin.packets);
+ while (1) {
+ struct homa_gap *gap;
+
+ gap = list_first_entry_or_null(&rpc->msgin.gaps,
+ struct homa_gap, links);
+ if (!gap)
+ break;
+ list_del(&gap->links);
+ kfree(gap);
+ }
+ }
+ rpc->hsk->dead_skbs += rpc->msgout.num_skbs;
+ if (rpc->hsk->dead_skbs > rpc->hsk->homa->max_dead_buffs)
+ /* This update isn't thread-safe; it's just a
+ * statistic so it's OK if updates occasionally get
+ * missed.
+ */
+ rpc->hsk->homa->max_dead_buffs = rpc->hsk->dead_skbs;
+
+ homa_sock_unlock(rpc->hsk);
+ homa_pacer_unmanage_rpc(rpc);
+}
+
+/**
+ * homa_rpc_reap() - Invoked to release resources associated with dead
+ * RPCs for a given socket. For a large RPC, it can take a long time to
+ * free all of its packet buffers, so we try to perform this work
+ * off the critical path where it won't delay applications. Each call to
+ * this function normally does a small chunk of work (unless reap_all is
+ * true). See the file reap.txt for more information.
+ * @hsk: Homa socket that may contain dead RPCs. Must not be locked by the
+ * caller; this function will lock and release.
+ * @reap_all: False means do a small chunk of work; there may still be
+ * unreaped RPCs on return. True means reap all dead rpcs for
+ * hsk. Will busy-wait if reaping has been disabled for some RPCs.
+ *
+ * Return: A return value of 0 means that we ran out of work to do; calling
+ * again will do no work (there could be unreaped RPCs, but if so,
+ * reaping has been disabled for them). A value greater than
+ * zero means there is still more reaping work to be done.
+ */
+int homa_rpc_reap(struct homa_sock *hsk, bool reap_all)
+{
+#define BATCH_MAX 20
+ struct homa_rpc *rpcs[BATCH_MAX];
+ struct sk_buff *skbs[BATCH_MAX];
+ int num_skbs, num_rpcs;
+ struct homa_rpc *rpc;
+ struct homa_rpc *tmp;
+ int i, batch_size;
+ int skbs_to_reap;
+ int rx_frees;
+ int result = 0;
+
+ /* Each iteration through the following loop will reap
+ * BATCH_MAX skbs.
+ */
+ skbs_to_reap = hsk->homa->reap_limit;
+ while (skbs_to_reap > 0 && !list_empty(&hsk->dead_rpcs)) {
+ batch_size = BATCH_MAX;
+ if (!reap_all) {
+ if (batch_size > skbs_to_reap)
+ batch_size = skbs_to_reap;
+ skbs_to_reap -= batch_size;
+ }
+ num_skbs = 0;
+ num_rpcs = 0;
+ rx_frees = 0;
+
+ homa_sock_lock(hsk);
+ if (atomic_read(&hsk->protect_count)) {
+ homa_sock_unlock(hsk);
+ if (reap_all)
+ continue;
+ return 0;
+ }
+
+ /* Collect buffers and freeable RPCs. */
+ list_for_each_entry_safe(rpc, tmp, &hsk->dead_rpcs,
+ dead_links) {
+ int refs;
+
+ /* Make sure that all outstanding uses of the RPC have
+ * completed. We can only be sure if the reference
+ * count is zero when we're holding the lock. Note:
+ * it isn't safe to block while locking the RPC here,
+ * since we hold the socket lock.
+ */
+ if (homa_rpc_try_lock(rpc)) {
+ refs = atomic_read(&rpc->refs);
+ homa_rpc_unlock(rpc);
+ } else {
+ refs = 1;
+ }
+ if (refs != 0)
+ continue;
+ rpc->magic = 0;
+
+ /* For Tx sk_buffs, collect them here but defer
+ * freeing until after releasing the socket lock.
+ */
+ if (rpc->msgout.length >= 0) {
+ while (rpc->msgout.packets) {
+ skbs[num_skbs] = rpc->msgout.packets;
+ rpc->msgout.packets = homa_get_skb_info(
+ rpc->msgout.packets)->next_skb;
+ num_skbs++;
+ rpc->msgout.num_skbs--;
+ if (num_skbs >= batch_size)
+ goto release;
+ }
+ }
+
+ /* In the normal case rx sk_buffs will already have been
+ * freed before we got here. Thus it's OK to free
+ * immediately in rare situations where there are
+ * buffers left.
+ */
+ if (rpc->msgin.length >= 0 &&
+ !skb_queue_empty_lockless(&rpc->msgin.packets)) {
+ rx_frees += skb_queue_len(&rpc->msgin.packets);
+ __skb_queue_purge(&rpc->msgin.packets);
+ }
+
+ /* If we get here, it means all packets have been
+ * removed from the RPC.
+ */
+ rpcs[num_rpcs] = rpc;
+ num_rpcs++;
+ list_del(&rpc->dead_links);
+ WARN_ON(refcount_sub_and_test(rpc->msgout.skb_memory,
+ &hsk->sock.sk_wmem_alloc));
+ if (num_rpcs >= batch_size)
+ goto release;
+ }
+
+ /* Free all of the collected resources; release the socket
+ * lock while doing this.
+ */
+release:
+ hsk->dead_skbs -= num_skbs + rx_frees;
+ result = !list_empty(&hsk->dead_rpcs) &&
+ (num_skbs + num_rpcs) != 0;
+ homa_sock_unlock(hsk);
+ homa_skb_free_many_tx(hsk->homa, skbs, num_skbs);
+ for (i = 0; i < num_rpcs; i++) {
+ rpc = rpcs[i];
+
+ if (unlikely(rpc->msgin.num_bpages))
+ homa_pool_release_buffers(rpc->hsk->buffer_pool,
+ rpc->msgin.num_bpages,
+ rpc->msgin.bpage_offsets);
+ if (rpc->msgin.length >= 0) {
+ while (1) {
+ struct homa_gap *gap;
+
+ gap = list_first_entry_or_null(
+ &rpc->msgin.gaps,
+ struct homa_gap,
+ links);
+ if (!gap)
+ break;
+ list_del(&gap->links);
+ kfree(gap);
+ }
+ }
+ rpc->state = 0;
+ kfree(rpc);
+ }
+ homa_sock_wakeup_wmem(hsk);
+ if (!result && !reap_all)
+ break;
+ }
+ homa_pool_check_waiting(hsk->buffer_pool);
+ return result;
+}
+
+/**
+ * homa_find_client_rpc() - Locate client-side information about the RPC that
+ * a packet belongs to, if there is any. Thread-safe without socket lock.
+ * @hsk: Socket via which packet was received.
+ * @id: Unique identifier for the RPC.
+ *
+ * Return: A pointer to the homa_rpc for this id, or NULL if none.
+ * The RPC will be locked; the caller must eventually unlock it
+ * by invoking homa_rpc_unlock.
+ */
+struct homa_rpc *homa_find_client_rpc(struct homa_sock *hsk, u64 id)
+ __cond_acquires(rpc_bucket_lock)
+{
+ struct homa_rpc_bucket *bucket = homa_client_rpc_bucket(hsk, id);
+ struct homa_rpc *crpc;
+
+ homa_bucket_lock(bucket, id);
+ hlist_for_each_entry(crpc, &bucket->rpcs, hash_links) {
+ if (crpc->id == id)
+ return crpc;
+ }
+ homa_bucket_unlock(bucket, id);
+ return NULL;
+}
+
+/**
+ * homa_find_server_rpc() - Locate server-side information about the RPC that
+ * a packet belongs to, if there is any. Thread-safe without socket lock.
+ * @hsk: Socket via which packet was received.
+ * @saddr: Address from which the packet was sent.
+ * @id: Unique identifier for the RPC (must have server bit set).
+ *
+ * Return: A pointer to the homa_rpc matching the arguments, or NULL
+ * if none. The RPC will be locked; the caller must eventually
+ * unlock it by invoking homa_rpc_unlock.
+ */
+struct homa_rpc *homa_find_server_rpc(struct homa_sock *hsk,
+ const struct in6_addr *saddr, u64 id)
+ __cond_acquires(rpc_bucket_lock)
+{
+ struct homa_rpc_bucket *bucket = homa_server_rpc_bucket(hsk, id);
+ struct homa_rpc *srpc;
+
+ homa_bucket_lock(bucket, id);
+ hlist_for_each_entry(srpc, &bucket->rpcs, hash_links) {
+ if (srpc->id == id && ipv6_addr_equal(&srpc->peer->addr, saddr))
+ return srpc;
+ }
+ homa_bucket_unlock(bucket, id);
+ return NULL;
+}
+
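Both homa_find_client_rpc and homa_find_server_rpc above follow the same pattern: hash the id to a bucket, lock the bucket, and scan its chain for a match. A minimal user-space sketch of that pattern is below; the bucket count, struct names, and helper names are invented for illustration (the real code also holds the bucket lock across the scan and matches on peer address for server RPCs):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space sketch of the bucket-lookup pattern used by
 * homa_find_client_rpc: RPCs live in hash buckets indexed by id,
 * and lookup walks the bucket's chain. Everything here is an
 * illustrative stand-in, not the kernel definitions.
 */
#define NUM_BUCKETS 64 /* assumed; the kernel's table size differs */

struct fake_rpc {
	uint64_t id;
	struct fake_rpc *next; /* next RPC in the same bucket */
};

static struct fake_rpc *buckets[NUM_BUCKETS];

/* Map an RPC id to its bucket. */
static size_t bucket_index(uint64_t id)
{
	return (size_t)(id % NUM_BUCKETS);
}

static void insert_rpc(struct fake_rpc *rpc)
{
	size_t i = bucket_index(rpc->id);

	rpc->next = buckets[i];
	buckets[i] = rpc;
}

/* Analogous to homa_find_client_rpc, minus the bucket locking. */
static struct fake_rpc *find_rpc(uint64_t id)
{
	struct fake_rpc *rpc;

	for (rpc = buckets[bucket_index(id)]; rpc; rpc = rpc->next) {
		if (rpc->id == id)
			return rpc;
	}
	return NULL;
}
```

Note that ids 2 and 66 collide in bucket 2 with the assumed table size, so a lookup must traverse the chain rather than stop at the bucket head.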
diff --git a/net/homa/homa_rpc.h b/net/homa/homa_rpc.h
new file mode 100644
index 000000000000..fb71c5c87266
--- /dev/null
+++ b/net/homa/homa_rpc.h
@@ -0,0 +1,484 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines homa_rpc and related structs. */
+
+#ifndef _HOMA_RPC_H
+#define _HOMA_RPC_H
+
+#include <linux/percpu-defs.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+
+#include "homa_sock.h"
+#include "homa_wire.h"
+
+/* Forward references. */
+struct homa_ack;
+
+/**
+ * struct homa_message_out - Describes a message (either request or response)
+ * for which this machine is the sender.
+ */
+struct homa_message_out {
+ /**
+ * @length: Total bytes in message (excluding headers). A value
+ * less than 0 means this structure is uninitialized and therefore
+ * not in use (all other fields will be zero in this case).
+ */
+ int length;
+
+ /** @num_skbs: Total number of buffers currently in @packets. */
+ int num_skbs;
+
+ /**
+ * @skb_memory: Total number of bytes of memory occupied by
+ * the sk_buffs for this message.
+ */
+ int skb_memory;
+
+ /**
+ * @copied_from_user: Number of bytes of the message that have
+ * been copied from user space into skbs in @packets.
+ */
+ int copied_from_user;
+
+ /**
+ * @packets: Singly-linked list of all packets in message, linked
+ * using homa_next_skb. The list is in order of offset in the message
+ * (offset 0 first); each sk_buff can potentially contain multiple
+ * data_segments, which will be split into separate packets by GSO.
+ * This list grows gradually as data is copied in from user space,
+ * so it may not be complete.
+ */
+ struct sk_buff *packets;
+
+ /**
+ * @next_xmit: Pointer to pointer to next packet to transmit (will
+ * either refer to @packets or homa_next_skb(skb) for some skb
+ * in @packets).
+ */
+ struct sk_buff **next_xmit;
+
+ /**
+ * @next_xmit_offset: All bytes in the message, up to but not
+ * including this one, have been transmitted.
+ */
+ int next_xmit_offset;
+
+ /**
+ * @init_ns: Time in sched_clock units when this structure was
+ * initialized. Used to find the oldest outgoing message.
+ */
+ u64 init_ns;
+};
+
+/**
+ * struct homa_gap - Represents a range of bytes within a message that have
+ * not yet been received.
+ */
+struct homa_gap {
+ /** @start: offset of first byte in this gap. */
+ int start;
+
+ /** @end: offset of byte just after last one in this gap. */
+ int end;
+
+ /**
+ * @time: time (in sched_clock units) when the gap was first detected.
+ * As of 7/2024 this isn't used for anything.
+ */
+ u64 time;
+
+ /** @links: for linking into list in homa_message_in. */
+ struct list_head links;
+};
+
+/**
+ * struct homa_message_in - Holds the state of a message received by
+ * this machine; used for both requests and responses.
+ */
+struct homa_message_in {
+ /**
+ * @length: Payload size in bytes. A value less than 0 means this
+ * structure is uninitialized and therefore not in use.
+ */
+ int length;
+
+ /**
+ * @packets: DATA packets for this message that have been received but
+ * not yet copied to user space (no particular order).
+ */
+ struct sk_buff_head packets;
+
+ /**
+ * @recv_end: Offset of the byte just after the highest one that
+ * has been received so far.
+ */
+ int recv_end;
+
+ /**
+ * @gaps: List of homa_gaps describing all of the bytes with
+ * offsets less than @recv_end that have not yet been received.
+ */
+ struct list_head gaps;
+
+ /**
+ * @bytes_remaining: Amount of data for this message that has
+ * not yet been received; will determine the message's priority.
+ */
+ int bytes_remaining;
+
+ /**
+ * @num_bpages: The number of entries in @bpage_offsets used for this
+ * message (0 means buffers not allocated yet).
+ */
+ u32 num_bpages;
+
+ /**
+ * @bpage_offsets: Describes buffer space allocated for this message.
+ * Each entry is an offset from the start of the buffer region.
+ * All but the last entry refer to areas of size HOMA_BPAGE_SIZE.
+ */
+ u32 bpage_offsets[HOMA_MAX_BPAGES];
+};
+
+/**
+ * struct homa_rpc - One of these structures exists for each active
+ * RPC. The same structure is used to manage both outgoing RPCs on
+ * clients and incoming RPCs on servers.
+ */
+struct homa_rpc {
+ /** @hsk: Socket that owns the RPC. */
+ struct homa_sock *hsk;
+
+ /**
+ * @bucket: Pointer to the bucket in hsk->client_rpc_buckets or
+ * hsk->server_rpc_buckets where this RPC is linked. Used primarily
+ * for locking the RPC (which is done by locking its bucket).
+ */
+ struct homa_rpc_bucket *bucket;
+
+ /**
+ * @state: The current state of this RPC:
+ *
+ * @RPC_OUTGOING: The RPC is waiting for @msgout to be transmitted
+ * to the peer.
+ * @RPC_INCOMING: The RPC is waiting for data to be received
+ * into @msgin from the peer; at least one packet
+ * has already been received.
+ * @RPC_IN_SERVICE: Used only for server RPCs: the request message
+ * has been read from the socket, but the response
+ * message has not yet been presented to the kernel.
+ * @RPC_DEAD: RPC has been deleted and is waiting to be
+ * reaped. In some cases, information in the RPC
+ * structure may be accessed in this state.
+ *
+ * Client RPCs pass through states in the following order:
+ * RPC_OUTGOING, RPC_INCOMING, RPC_DEAD.
+ *
+ * Server RPCs pass through states in the following order:
+ * RPC_INCOMING, RPC_IN_SERVICE, RPC_OUTGOING, RPC_DEAD.
+ */
+ enum {
+ RPC_OUTGOING = 5,
+ RPC_INCOMING = 6,
+ RPC_IN_SERVICE = 8,
+ RPC_DEAD = 9
+ } state;
+
+ /**
+ * @flags: Additional state information: an OR'ed combination of
+ * various single-bit flags. See below for definitions. Must be
+ * manipulated with atomic operations because some of the manipulations
+ * occur without holding the RPC lock.
+ */
+ atomic_t flags;
+
+ /* Valid bits for @flags:
+ * RPC_PKTS_READY - The RPC has input packets ready to be
+ * copied to user space.
+ * APP_NEEDS_LOCK - Means that code in the application thread
+ * needs the RPC lock (e.g. so it can start
+ * copying data to user space) so others
+ * (e.g. SoftIRQ processing) should relinquish
+ * the lock ASAP. Without this, SoftIRQ can
+ * lock out the application for a long time,
+ * preventing data copies to user space from
+ * starting (and those copies limit throughput
+ * at high network speeds).
+ * RPC_PRIVATE - This RPC will be waited on in "private" mode,
+ * where the app explicitly requests the
+ * response from this particular RPC.
+ */
+#define RPC_PKTS_READY 1
+#define APP_NEEDS_LOCK 4
+#define RPC_PRIVATE 8
+
+ /**
+ * @refs: Number of unmatched calls to homa_rpc_hold; it's not safe
+ * to free the RPC until this is zero.
+ */
+ atomic_t refs;
+
+ /**
+ * @peer: Information about the other machine (the server, if
+ * this is a client RPC, or the client, if this is a server RPC).
+ */
+ struct homa_peer *peer;
+
+ /** @dport: Port number on @peer that will handle packets. */
+ __u16 dport;
+
+ /**
+ * @id: Unique identifier for the RPC among all those issued
+ * from its port. The low-order bit indicates whether we are
+ * server (1) or client (0) for this RPC.
+ */
+ u64 id;
+
+ /**
+ * @completion_cookie: Only used on clients. Contains identifying
+ * information about the RPC provided by the application; returned to
+ * the application with the RPC's result.
+ */
+ u64 completion_cookie;
+
+ /**
+ * @error: Only used on clients. If nonzero, then the RPC has
+ * failed and the value is a negative errno that describes the
+ * problem.
+ */
+ int error;
+
+ /**
+ * @msgin: Information about the message we receive for this RPC
+ * (for server RPCs this is the request, for client RPCs this is the
+ * response).
+ */
+ struct homa_message_in msgin;
+
+ /**
+ * @msgout: Information about the message we send for this RPC
+ * (for client RPCs this is the request, for server RPCs this is the
+ * response).
+ */
+ struct homa_message_out msgout;
+
+ /**
+ * @hash_links: Used to link this object into a hash bucket for
+ * either @hsk->client_rpc_buckets (for a client RPC), or
+ * @hsk->server_rpc_buckets (for a server RPC).
+ */
+ struct hlist_node hash_links;
+
+ /**
+ * @ready_links: Used to link this object into @hsk->ready_rpcs.
+ */
+ struct list_head ready_links;
+
+ /**
+ * @buf_links: Used to link this RPC into @hsk->waiting_for_bufs.
+ * If the RPC isn't on @hsk->waiting_for_bufs, this is an empty
+ * list pointing to itself.
+ */
+ struct list_head buf_links;
+
+ /**
+ * @active_links: For linking this object into @hsk->active_rpcs.
+ * The next field will be LIST_POISON1 if this RPC hasn't yet been
+ * linked into @hsk->active_rpcs. Access with RCU.
+ */
+ struct list_head active_links;
+
+ /** @dead_links: For linking this object into @hsk->dead_rpcs. */
+ struct list_head dead_links;
+
+ /**
+ * @private_interest: If there is a thread waiting for this RPC in
+ * homa_wait_private, then this points to that thread's interest.
+ */
+ struct homa_interest *private_interest;
+
+ /**
+ * @throttled_links: Used to link this RPC into
+ * homa->pacer.throttled_rpcs. If this RPC isn't in
+ * homa->pacer.throttled_rpcs, this is an empty
+ * list pointing to itself.
+ */
+ struct list_head throttled_links;
+
+ /**
+ * @silent_ticks: Number of times homa_timer has been invoked
+ * since the last time a packet indicating progress was received
+ * for this RPC, so we don't need to send a resend for a while.
+ */
+ int silent_ticks;
+
+ /**
+ * @resend_timer_ticks: Value of homa->timer_ticks the last time
+ * we sent a RESEND for this RPC.
+ */
+ u32 resend_timer_ticks;
+
+ /**
+ * @done_timer_ticks: The value of homa->timer_ticks the first
+ * time we noticed that this (server) RPC is done (all response
+ * packets have been transmitted), so we're ready for an ack.
+ * Zero means we haven't reached that point yet.
+ */
+ u32 done_timer_ticks;
+
+ /**
+ * @magic: when the RPC is alive, this holds a distinct value that
+ * is unlikely to occur naturally. The value is cleared when the
+ * RPC is reaped, so we can detect accidental use of an RPC after
+ * it has been reaped.
+ */
+#define HOMA_RPC_MAGIC 0xdeadbeef
+ int magic;
+
+ /**
+ * @start_ns: time (from sched_clock()) when this RPC was created.
+ * Used (sometimes) for testing.
+ */
+ u64 start_ns;
+};
+
+void homa_check_rpc(struct homa_rpc *rpc);
+struct homa_rpc
+ *homa_find_client_rpc(struct homa_sock *hsk, u64 id);
+struct homa_rpc
+ *homa_find_server_rpc(struct homa_sock *hsk,
+ const struct in6_addr *saddr, u64 id);
+void homa_rpc_acked(struct homa_sock *hsk, const struct in6_addr *saddr,
+ struct homa_ack *ack);
+void homa_rpc_end(struct homa_rpc *rpc);
+struct homa_rpc
+ *homa_rpc_new_client(struct homa_sock *hsk,
+ const union sockaddr_in_union *dest);
+struct homa_rpc
+ *homa_rpc_new_server(struct homa_sock *hsk,
+ const struct in6_addr *source,
+ struct homa_data_hdr *h, int *created);
+int homa_rpc_reap(struct homa_sock *hsk, bool reap_all);
+
+/**
+ * homa_rpc_lock() - Acquire the lock for an RPC.
+ * @rpc: RPC to lock. Note: this function is only safe under
+ * limited conditions (in most cases homa_bucket_lock should be
+ * used). The caller must ensure that the RPC cannot be reaped
+ * before the lock is acquired, such as by taking a reference on
+ * the rpc with homa_rpc_hold or calling homa_protect_rpcs.
+ * Don't use this function unless you are very sure what you are
+ * doing! See sync.txt for more info on locking.
+ */
+static inline void homa_rpc_lock(struct homa_rpc *rpc)
+ __acquires(rpc_bucket_lock)
+{
+ homa_bucket_lock(rpc->bucket, rpc->id);
+}
+
+/**
+ * homa_rpc_try_lock() - Acquire the lock for an RPC if it is available.
+ * @rpc: RPC to lock.
+ * Return: Nonzero if lock was successfully acquired, zero if it is
+ * currently owned by someone else.
+ */
+static inline int homa_rpc_try_lock(struct homa_rpc *rpc)
+ __cond_acquires(rpc_bucket_lock)
+{
+ return spin_trylock_bh(&rpc->bucket->lock);
+}
+
+/**
+ * homa_rpc_unlock() - Release the lock for an RPC.
+ * @rpc: RPC to unlock.
+ */
+static inline void homa_rpc_unlock(struct homa_rpc *rpc)
+ __releases(rpc_bucket_lock)
+{
+ homa_bucket_unlock(rpc->bucket, rpc->id);
+}
+
+/**
+ * homa_protect_rpcs() - Ensures that no RPCs will be reaped for a given
+ * socket until homa_sock_unprotect is called. Typically used by functions
+ * that want to scan the active RPCs for a socket without holding the socket
+ * lock. Multiple calls to this function may be in effect at once.
+ * @hsk: Socket whose RPCs should be protected. Must not be locked
+ * by the caller; will be locked here.
+ *
+ * Return: 1 for success, 0 if the socket has been shutdown, in which
+ * case its RPCs cannot be protected.
+ */
+static inline int homa_protect_rpcs(struct homa_sock *hsk)
+{
+ int result;
+
+ homa_sock_lock(hsk);
+ result = !hsk->shutdown;
+ if (result)
+ atomic_inc(&hsk->protect_count);
+ homa_sock_unlock(hsk);
+ return result;
+}
+
+/**
+ * homa_unprotect_rpcs() - Cancel the effect of a previous call to
+ * homa_protect_rpcs(), so that RPCs can once again be reaped.
+ * @hsk: Socket whose RPCs should be unprotected.
+ */
+static inline void homa_unprotect_rpcs(struct homa_sock *hsk)
+{
+ atomic_dec(&hsk->protect_count);
+}
+
+/**
+ * homa_rpc_hold() - Increment the reference count on an RPC, which will
+ * prevent it from being freed until homa_rpc_put() is called. Used in
+ * situations where a pointer to the RPC needs to be retained during a
+ * period where it is unprotected by locks.
+ * @rpc: RPC on which to take a reference.
+ */
+static inline void homa_rpc_hold(struct homa_rpc *rpc)
+{
+ atomic_inc(&rpc->refs);
+}
+
+/**
+ * homa_rpc_put() - Release a reference on an RPC (cancels the effect of
+ * a previous call to homa_rpc_hold()).
+ * @rpc: RPC to release.
+ */
+static inline void homa_rpc_put(struct homa_rpc *rpc)
+{
+ atomic_dec(&rpc->refs);
+}
+
+/**
+ * homa_is_client() - Returns true if we are the client for a particular RPC,
+ * false if we are the server.
+ * @id: Id of the RPC in question.
+ * Return: true if we are the client for RPC id, false otherwise
+ */
+static inline bool homa_is_client(u64 id)
+{
+ return (id & 1) == 0;
+}
+
+/**
+ * homa_rpc_needs_attention() - Returns true if @rpc has failed or if
+ * its incoming message is ready for attention by an application thread
+ * (e.g., packets are ready to copy to user space).
+ * @rpc: RPC to check.
+ * Return: See above
+ */
+static inline bool homa_rpc_needs_attention(struct homa_rpc *rpc)
+{
+ return (rpc->error != 0 || atomic_read(&rpc->flags) & RPC_PKTS_READY);
+}
+
+#endif /* _HOMA_RPC_H */
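The header above defines two independent guards against premature reaping: a per-RPC reference count (homa_rpc_hold/homa_rpc_put) and a per-socket protect count (homa_protect_rpcs/homa_unprotect_rpcs). The single-threaded user-space model below sketches how the two guards combine; plain ints stand in for the kernel's atomics, and all names here are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the two reap guards: a per-RPC reference count and a
 * per-socket protect count. Not the kernel code; plain ints stand
 * in for atomics, so this is valid only single-threaded.
 */
struct model_rpc {
	int refs;          /* models rpc->refs */
};

struct model_sock {
	int protect_count; /* models hsk->protect_count */
};

static void rpc_hold(struct model_rpc *rpc)       { rpc->refs++; }
static void rpc_put(struct model_rpc *rpc)        { rpc->refs--; }
static void protect_rpcs(struct model_sock *sk)   { sk->protect_count++; }
static void unprotect_rpcs(struct model_sock *sk) { sk->protect_count--; }

/* A dead RPC may be freed only when both guards are clear, which is
 * the condition homa_rpc_reap checks before freeing anything.
 */
static bool can_reap(const struct model_sock *sk,
		     const struct model_rpc *rpc)
{
	return sk->protect_count == 0 && rpc->refs == 0;
}
```

The design choice this models: a hold pins one specific RPC cheaply, while protect blocks reaping for the whole socket so a scan of its RPC lists can proceed without taking the socket lock for every RPC.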
diff --git a/net/homa/reap.txt b/net/homa/reap.txt
new file mode 100644
index 000000000000..a5956039a22e
--- /dev/null
+++ b/net/homa/reap.txt
@@ -0,0 +1,50 @@
+This file discusses issues related to freeing resources for completed RPCs
+("reaping").
+
+* Most of the cost of reaping comes from freeing skbuffs; this can be
+ quite expensive for RPCs with long messages.
+
+* The natural time to reap is when homa_rpc_end is invoked to mark an
+  RPC completed, but reaping there can cause severe performance hiccups:
+  homa_rpc_end may be invoked in homa_softirq at a time when there are
+  short messages waiting to be processed, and freeing a long RPC could
+  significantly delay a subsequent short RPC.
+
+* Thus Homa doesn't reap immediately in homa_rpc_end. Instead, dead RPCs
+ are queued up and reaping occurs later, at a more convenient time where
+ it is less likely to impact latency. The challenge is to figure out how to
+ do this so that (a) we keep up with dead RPCs and (b) we minimize
+ the impact of reaping on latency.
+
+* The ideal time to reap is when threads are waiting for incoming messages
+ in homa_wait_for_message. The thread has nothing else to do, so reaping
+ can be performed with no latency impact on the application. However,
+ if a machine is overloaded then it may never wait, so this mechanism
+ isn't always sufficient.
+
+* Homa now reaps in two other places, if homa_wait_for_message can't
+ keep up:
+ * If more than dead_buffs_limit dead skbs accumulate, then homa_timer
+ will reap to get back down to that limit. However, it seems possible that
+ there may be cases where a single thread cannot keep up with all
+ the reaping to be done.
+ * If homa_timer can't keep up, then as a last resort, homa_dispatch_pkts
+ will reap a few buffers for every incoming data packet. This is undesirable
+ because it will impact Homa's performance.
+
+* During the conversion to the new input buffering scheme, freeing of packets
+ for incoming messages was moved to homa_copy_to_user, under the assumption
+ that this code wouldn't be on the critical path. However, right now the
+ packet freeing is taking 20-25% of the total time in that function, and
+ with faster networks it's quite possible that this code will indeed be on
+ the critical path. So, it may eventually be necessary to remove
+ packet freeing from homa_copy_to_user.
+
+* Here are some approaches that have been tried and eventually abandoned:
+ * Occasionally when data packets arrive, reap if too much dead info has
+ accumulated. This will cause a latency impact. The amount to reap is
+ chosen dynamically (by homa_timer) to be as small as possible while
+ gradually working through the backlog. Unfortunately, the formula for
+ computing how much to reap was fragile and resulted in situations where
+ the backlog of dead RPCs grew without bound. This approach was abandoned
+ in October 2021.
--
2.43.0
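The policy described in reap.txt (defer at homa_rpc_end, free later in small batches) can be illustrated with a tiny user-space model. This is a sketch of the policy only, not the kernel implementation; the list, names, and batch handling are invented, and the real homa_rpc_reap additionally honors the reference/protect guards and the socket lock:

```c
#include <assert.h>
#include <stddef.h>

/* Model of deferred, batched reaping: ending an RPC only queues it
 * on a dead list; a later call (e.g. from a thread waiting in
 * homa_wait_for_message, or from homa_timer) frees at most a small
 * batch per call so no single caller absorbs a long stall.
 * All names and sizes here are illustrative.
 */
#define DEAD_MAX 128

static int dead_ids[DEAD_MAX]; /* stand-in for hsk->dead_rpcs */
static size_t num_dead;

/* Models homa_rpc_end: defer, don't free. */
static void end_rpc(int id)
{
	if (num_dead < DEAD_MAX)
		dead_ids[num_dead++] = id;
}

/* Models one call to homa_rpc_reap: free up to @batch dead RPCs and
 * return how many were actually reaped; 0 means nothing left.
 */
static size_t reap(size_t batch)
{
	size_t n = batch < num_dead ? batch : num_dead;

	num_dead -= n; /* "freeing" is just dropping them here */
	return n;
}
```

The batch size bounds the latency impact of any one reap call; callers that can afford more work (an idle waiting thread) simply call reap repeatedly.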
* [PATCH net-next v8 10/15] net: homa: create homa_outgoing.c
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (8 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 09/15] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 11/15] net: homa: create homa_utils.c John Ousterhout
` (5 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This file does most of the work of transmitting outgoing messages.
It is responsible for copying data from user space into skbs and
it also implements the "pacer", which throttles output if necessary
to prevent queue buildup in the NIC. Note: the pacer eventually
needs to be replaced with a Homa-specific qdisc, which can better
manage simultaneous transmissions by Homa and TCP. The current
implementation can coexist with TCP and doesn't harm TCP, but
Homa's latency suffers when TCP runs concurrently.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v7:
* Implement accounting for bytes in tx skbs
* Rename UNKNOWN packet type to RPC_UNKNOWN
* Use new RPC reference counts; eliminates need for RCU
* Remove locker argument from locking functions
* Use u64 and __u64 properly
* Fix incorrect skb check in homa_message_out_fill
---
net/homa/homa_outgoing.c | 575 +++++++++++++++++++++++++++++++++++++++
1 file changed, 575 insertions(+)
create mode 100644 net/homa/homa_outgoing.c
diff --git a/net/homa/homa_outgoing.c b/net/homa/homa_outgoing.c
new file mode 100644
index 000000000000..c2d32af56d8a
--- /dev/null
+++ b/net/homa/homa_outgoing.c
@@ -0,0 +1,575 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions related to the sender side of message
+ * transmission. It also contains utility functions for sending packets.
+ */
+
+#include "homa_impl.h"
+#include "homa_pacer.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_wire.h"
+#include "homa_stub.h"
+
+/**
+ * homa_message_out_init() - Initialize rpc->msgout.
+ * @rpc: RPC whose output message should be initialized. Must be
+ * locked by caller.
+ * @length: Number of bytes that will eventually be in rpc->msgout.
+ */
+void homa_message_out_init(struct homa_rpc *rpc, int length)
+ __must_hold(rpc_bucket_lock)
+{
+ memset(&rpc->msgout, 0, sizeof(rpc->msgout));
+ rpc->msgout.length = length;
+ rpc->msgout.next_xmit = &rpc->msgout.packets;
+ rpc->msgout.init_ns = sched_clock();
+}
+
+/**
+ * homa_fill_data_interleaved() - This function is invoked to fill in the
+ * part of a data packet after the initial header, when GSO is being used.
+ * homa_seg_hdrs must be interleaved with the data to provide the correct
+ * offset for each segment.
+ * @rpc: RPC whose output message is being created. Must be
+ * locked by caller.
+ * @skb: The packet being filled. The initial homa_data_hdr was
+ * created and initialized by the caller and the
+ * homa_skb_info has been filled in with the packet geometry.
+ * @iter: Describes location(s) of (remaining) message data in user
+ * space.
+ * Return: Either a negative errno or 0 (for success).
+ */
+int homa_fill_data_interleaved(struct homa_rpc *rpc, struct sk_buff *skb,
+ struct iov_iter *iter)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_skb_info *homa_info = homa_get_skb_info(skb);
+ int seg_length = homa_info->seg_length;
+ int bytes_left = homa_info->data_bytes;
+ int offset = homa_info->offset;
+ int err;
+
+ /* Each iteration of the following loop adds info for one packet,
+ * which includes a homa_seg_hdr followed by the data for that
+ * segment. The first homa_seg_hdr was already added by the caller.
+ */
+ while (1) {
+ struct homa_seg_hdr seg;
+
+ if (bytes_left < seg_length)
+ seg_length = bytes_left;
+ err = homa_skb_append_from_iter(rpc->hsk->homa, skb, iter,
+ seg_length);
+ if (err != 0)
+ return err;
+ bytes_left -= seg_length;
+ offset += seg_length;
+
+ if (bytes_left == 0)
+ break;
+
+ seg.offset = htonl(offset);
+ err = homa_skb_append_to_frag(rpc->hsk->homa, skb, &seg,
+ sizeof(seg));
+ if (err != 0)
+ return err;
+ }
+ return 0;
+}
+
+/**
+ * homa_new_data_packet() - Allocate a new sk_buff and fill it with a Homa
+ * data packet. The resulting packet will be a GSO packet that will eventually
+ * be segmented by the NIC.
+ * @rpc: RPC that packet will belong to (msgout must have been
+ * initialized). Must be locked by caller.
+ * @iter: Describes location(s) of (remaining) message data in user
+ * space.
+ * @offset: Offset in the message of the first byte of data in this
+ * packet.
+ * @length: How many bytes of data to include in the skb. Caller must
+ * ensure that this amount of data isn't too much for a
+ * well-formed GSO packet, and that iter has at least this
+ * much data.
+ * @max_seg_data: Maximum number of bytes of message data that can go in
+ * a single segment of the GSO packet.
+ * Return: A pointer to the new packet, or a negative errno.
+ */
+struct sk_buff *homa_new_data_packet(struct homa_rpc *rpc,
+ struct iov_iter *iter, int offset,
+ int length, int max_seg_data)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_skb_info *homa_info;
+ struct homa_data_hdr *h;
+ struct sk_buff *skb;
+ int err, gso_size;
+ u64 segs;
+
+ segs = length + max_seg_data - 1;
+ do_div(segs, max_seg_data);
+
+ /* Initialize the overall skb. */
+ skb = homa_skb_new_tx(sizeof32(struct homa_data_hdr) + length +
+ (segs - 1) * sizeof32(struct homa_seg_hdr));
+ if (!skb)
+ return ERR_PTR(-ENOMEM);
+
+ /* Fill in the Homa header (which will be replicated in every
+ * network packet by GSO).
+ */
+ h = (struct homa_data_hdr *)skb_put(skb, sizeof(struct homa_data_hdr));
+ h->common.sport = htons(rpc->hsk->port);
+ h->common.dport = htons(rpc->dport);
+ h->common.sequence = htonl(offset);
+ h->common.type = DATA;
+ homa_set_doff(h, sizeof(struct homa_data_hdr));
+ h->common.checksum = 0;
+ h->common.sender_id = cpu_to_be64(rpc->id);
+ h->message_length = htonl(rpc->msgout.length);
+ h->ack.client_id = 0;
+ homa_peer_get_acks(rpc->peer, 1, &h->ack);
+ h->retransmit = 0;
+ h->seg.offset = htonl(offset);
+
+ homa_info = homa_get_skb_info(skb);
+ homa_info->next_skb = NULL;
+ homa_info->wire_bytes = length + segs * (sizeof(struct homa_data_hdr)
+ + rpc->hsk->ip_header_length + HOMA_ETH_OVERHEAD);
+ homa_info->data_bytes = length;
+ homa_info->seg_length = max_seg_data;
+ homa_info->offset = offset;
+
+ if (segs > 1) {
+ homa_set_doff(h, sizeof(struct homa_data_hdr) -
+ sizeof32(struct homa_seg_hdr));
+ gso_size = max_seg_data + sizeof(struct homa_seg_hdr);
+ err = homa_fill_data_interleaved(rpc, skb, iter);
+ } else {
+ gso_size = max_seg_data;
+ err = homa_skb_append_from_iter(rpc->hsk->homa, skb, iter,
+ length);
+ }
+ if (err)
+ goto error;
+
+ if (segs > 1) {
+ skb_shinfo(skb)->gso_segs = segs;
+ skb_shinfo(skb)->gso_size = gso_size;
+
+ /* It's unclear what gso_type should be used to force software
+ * GSO; the value below seems to work...
+ */
+ skb_shinfo(skb)->gso_type =
+ rpc->hsk->homa->gso_force_software ? 0xd : SKB_GSO_TCPV6;
+ }
+ return skb;
+
+error:
+ homa_skb_free_tx(rpc->hsk->homa, skb);
+ return ERR_PTR(err);
+}
+
+/**
+ * homa_message_out_fill() - Initializes information for sending a message
+ * for an RPC (either request or response); copies the message data from
+ * user space and (possibly) begins transmitting the message.
+ * @rpc: RPC for which to send message; this function must not
+ * previously have been called for the RPC. Must be locked. The RPC
+ * will be unlocked while copying data, but will be locked again
+ * before returning.
+ * @iter: Describes location(s) of message data in user space.
+ * @xmit: Nonzero means this method should start transmitting packets;
+ * transmission will be overlapped with copying from user space.
+ * Zero means the caller will initiate transmission after this
+ * function returns.
+ *
+ * Return: 0 for success, or a negative errno for failure. It is possible
+ * for the RPC to be freed while this function is active. If that
+ * happens, copying will cease, -EINVAL will be returned, and
+ * rpc->state will be RPC_DEAD.
+ */
+int homa_message_out_fill(struct homa_rpc *rpc, struct iov_iter *iter, int xmit)
+ __must_hold(rpc_bucket_lock)
+{
+ /* Geometry information for packets:
+ * mtu: largest size for an on-the-wire packet (including
+ * all headers through IP header, but not Ethernet
+ * header).
+ * max_seg_data: largest amount of Homa message data that fits
+ * in an on-the-wire packet (after segmentation).
+ * max_gso_data: largest amount of Homa message data that fits
+ * in a GSO packet (before segmentation).
+ */
+ int mtu, max_seg_data, max_gso_data;
+
+ struct sk_buff **last_link;
+ struct dst_entry *dst;
+ u64 segs_per_gso;
+ int overlap_xmit;
+
+ /* Bytes of the message that haven't yet been copied into skbs. */
+ int bytes_left;
+
+ int gso_size;
+ int err;
+
+ homa_rpc_hold(rpc);
+ if (unlikely(iter->count > HOMA_MAX_MESSAGE_LENGTH ||
+ iter->count == 0)) {
+ err = -EINVAL;
+ goto error;
+ }
+ homa_message_out_init(rpc, iter->count);
+
+ /* Compute the geometry of packets. */
+ dst = homa_get_dst(rpc->peer, rpc->hsk);
+ mtu = dst_mtu(dst);
+ max_seg_data = mtu - rpc->hsk->ip_header_length
+ - sizeof(struct homa_data_hdr);
+ gso_size = dst->dev->gso_max_size;
+ if (gso_size > rpc->hsk->homa->max_gso_size)
+ gso_size = rpc->hsk->homa->max_gso_size;
+
+ /* Round gso_size down to an even # of mtus. */
+ segs_per_gso = gso_size - rpc->hsk->ip_header_length -
+ sizeof(struct homa_data_hdr) +
+ sizeof(struct homa_seg_hdr);
+ do_div(segs_per_gso, max_seg_data +
+ sizeof(struct homa_seg_hdr));
+ if (segs_per_gso == 0)
+ segs_per_gso = 1;
+ max_gso_data = segs_per_gso * max_seg_data;
+
+ overlap_xmit = rpc->msgout.length > 2 * max_gso_data;
+ homa_skb_stash_pages(rpc->hsk->homa, rpc->msgout.length);
+
+ /* Each iteration of the loop below creates one GSO packet. */
+ last_link = &rpc->msgout.packets;
+ for (bytes_left = rpc->msgout.length; bytes_left > 0; ) {
+ int skb_data_bytes, offset;
+ struct sk_buff *skb;
+
+ homa_rpc_unlock(rpc);
+ skb_data_bytes = max_gso_data;
+ offset = rpc->msgout.length - bytes_left;
+ if (skb_data_bytes > bytes_left)
+ skb_data_bytes = bytes_left;
+ skb = homa_new_data_packet(rpc, iter, offset, skb_data_bytes,
+ max_seg_data);
+ if (IS_ERR(skb)) {
+ err = PTR_ERR(skb);
+ homa_rpc_lock(rpc);
+ goto error;
+ }
+ bytes_left -= skb_data_bytes;
+
+ homa_rpc_lock(rpc);
+ if (rpc->state == RPC_DEAD) {
+ /* RPC was freed while we were copying. */
+ err = -EINVAL;
+ homa_skb_free_tx(rpc->hsk->homa, skb);
+ goto error;
+ }
+ *last_link = skb;
+ last_link = &(homa_get_skb_info(skb)->next_skb);
+ *last_link = NULL;
+ rpc->msgout.num_skbs++;
+ rpc->msgout.skb_memory += skb->truesize;
+ rpc->msgout.copied_from_user = rpc->msgout.length - bytes_left;
+ if (overlap_xmit && list_empty(&rpc->throttled_links) &&
+ xmit)
+ homa_pacer_manage_rpc(rpc);
+ }
+ refcount_add(rpc->msgout.skb_memory, &rpc->hsk->sock.sk_wmem_alloc);
+ homa_rpc_put(rpc);
+ if (!overlap_xmit && xmit)
+ homa_xmit_data(rpc, false);
+ return 0;
+
+error:
+ refcount_add(rpc->msgout.skb_memory, &rpc->hsk->sock.sk_wmem_alloc);
+ homa_rpc_put(rpc);
+ return err;
+}
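The packet-geometry computation at the top of homa_message_out_fill can be checked in isolation. The sketch below reproduces its arithmetic with invented header sizes (IP_HDR, DATA_HDR, and SEG_HDR are illustrative constants; the real values come from ip_header_length and sizeof(struct homa_data_hdr)/sizeof(struct homa_seg_hdr), which differ):

```c
#include <assert.h>

/* Reproduces the geometry arithmetic from homa_message_out_fill with
 * assumed header sizes; these constants are stand-ins, not the
 * kernel's actual sizeof values.
 */
#define IP_HDR   40 /* assumed IP header length (IPv6-like) */
#define DATA_HDR 60 /* assumed sizeof(struct homa_data_hdr) */
#define SEG_HDR   8 /* assumed sizeof(struct homa_seg_hdr) */

struct geometry {
	int max_seg_data; /* message bytes per on-the-wire packet */
	int segs_per_gso; /* wire packets per GSO packet */
	int max_gso_data; /* message bytes per GSO packet */
};

static struct geometry compute_geometry(int mtu, int gso_size)
{
	struct geometry g;

	/* Largest payload that fits in one wire packet. */
	g.max_seg_data = mtu - IP_HDR - DATA_HDR;

	/* Round gso_size down to a whole number of segments; each
	 * segment after the first costs an extra homa_seg_hdr because
	 * seg headers are interleaved with the data.
	 */
	g.segs_per_gso = (gso_size - IP_HDR - DATA_HDR + SEG_HDR) /
			 (g.max_seg_data + SEG_HDR);
	if (g.segs_per_gso == 0)
		g.segs_per_gso = 1;
	g.max_gso_data = g.segs_per_gso * g.max_seg_data;
	return g;
}
```

With the assumed sizes, an mtu of 1500 and a 64 KB GSO limit yield 1400 payload bytes per wire packet and 46 segments per GSO packet; a GSO limit smaller than one segment degenerates to one segment per packet, matching the segs_per_gso == 0 fallback in the patch.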
+
+/**
+ * homa_xmit_control() - Send a control packet to the other end of an RPC.
+ * @type: Packet type, such as DATA.
+ * @contents: Address of buffer containing the contents of the packet.
+ * Only information after the common header must be valid;
+ * the common header will be filled in by this function.
+ * @length: Length of @contents (including the common header).
+ * @rpc: The packet will go to the socket that handles the other end
+ * of this RPC. Addressing info for the packet, including all of
+ * the fields of homa_common_hdr except type, will be set from this.
+ * Caller must hold either the lock or a reference.
+ *
+ * Return: Either zero (for success), or a negative errno value if there
+ * was a problem.
+ */
+int homa_xmit_control(enum homa_packet_type type, void *contents,
+ size_t length, struct homa_rpc *rpc)
+{
+ struct homa_common_hdr *h = contents;
+
+ h->type = type;
+ h->sport = htons(rpc->hsk->port);
+ h->dport = htons(rpc->dport);
+ h->sender_id = cpu_to_be64(rpc->id);
+ return __homa_xmit_control(contents, length, rpc->peer, rpc->hsk);
+}
+
+/**
+ * __homa_xmit_control() - Lower-level version of homa_xmit_control: sends
+ * a control packet.
+ * @contents: Address of buffer containing the contents of the packet.
+ * The caller must have filled in all of the information,
+ * including the common header.
+ * @length: Length of @contents.
+ * @peer: Destination to which the packet will be sent.
+ * @hsk: Socket via which the packet will be sent.
+ *
+ * Return: Either zero (for success), or a negative errno value if there
+ * was a problem.
+ */
+int __homa_xmit_control(void *contents, size_t length, struct homa_peer *peer,
+ struct homa_sock *hsk)
+{
+ struct homa_common_hdr *h;
+ struct dst_entry *dst;
+ struct sk_buff *skb;
+ int extra_bytes;
+ int result;
+
+ dst = homa_get_dst(peer, hsk);
+ skb = homa_skb_new_tx(HOMA_MAX_HEADER);
+ if (unlikely(!skb))
+ return -ENOBUFS;
+ dst_hold(dst);
+ skb_dst_set(skb, dst);
+
+ h = skb_put(skb, length);
+ memcpy(h, contents, length);
+ extra_bytes = HOMA_MIN_PKT_LENGTH - length;
+ if (extra_bytes > 0)
+ memset(skb_put(skb, extra_bytes), 0, extra_bytes);
+ skb->ooo_okay = 1;
+ skb_get(skb);
+ if (hsk->inet.sk.sk_family == AF_INET6)
+ result = ip6_xmit(&hsk->inet.sk, skb, &peer->flow.u.ip6, 0,
+ NULL, 0, 0);
+ else
+ result = ip_queue_xmit(&hsk->inet.sk, skb, &peer->flow);
+ if (unlikely(result != 0)) {
+ /* It appears that ip*_xmit frees skbuffs after
+ * errors; the following code is to raise an alert if
+ * this isn't actually the case. The extra skb_get above
+ * and kfree_skb call below are needed to do the check
+ * accurately (otherwise the buffer could be freed and
+ * its memory used for some other purpose, resulting in
+ * a bogus "reference count").
+ */
+ if (refcount_read(&skb->users) > 1) {
+ if (hsk->inet.sk.sk_family == AF_INET6)
+ pr_notice("ip6_xmit didn't free Homa control packet (type %d) after error %d\n",
+ h->type, result);
+ else
+ pr_notice("ip_queue_xmit didn't free Homa control packet (type %d) after error %d\n",
+ h->type, result);
+ }
+ }
+ kfree_skb(skb);
+ return result;
+}
+
+/**
+ * homa_xmit_unknown() - Send an RPC_UNKNOWN packet to a peer.
+ * @skb: Buffer containing an incoming packet; identifies the peer to
+ * which the RPC_UNKNOWN packet should be sent.
+ * @hsk: Socket that should be used to send the RPC_UNKNOWN packet.
+ */
+void homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk)
+{
+ struct homa_common_hdr *h = (struct homa_common_hdr *)skb->data;
+ struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+ struct homa_rpc_unknown_hdr unknown;
+ struct homa_peer *peer;
+
+ unknown.common.sport = h->dport;
+ unknown.common.dport = h->sport;
+ unknown.common.type = RPC_UNKNOWN;
+ unknown.common.sender_id = cpu_to_be64(homa_local_id(h->sender_id));
+ peer = homa_peer_find(hsk->homa->peers, &saddr, &hsk->inet);
+ if (!IS_ERR(peer))
+ __homa_xmit_control(&unknown, sizeof(unknown), peer, hsk);
+}
+
+/**
+ * homa_xmit_data() - If an RPC has outbound data packets that are permitted
+ * to be transmitted according to the scheduling mechanism, arrange for
+ * them to be sent (some may be sent immediately; others may be sent
+ * later by the pacer thread).
+ * @rpc: RPC to check for transmittable packets. Must be locked by
+ * caller. Note: this function will release the RPC lock while
+ * passing packets through the RPC stack, then reacquire it
+ * before returning. It is possible that the RPC gets freed
+ * when the lock isn't held, in which case the state will
+ * be RPC_DEAD on return.
+ * @force: True means send at least one packet, even if the NIC queue
+ * is too long. False means that zero packets may be sent, if
+ * the NIC queue is sufficiently long.
+ */
+void homa_xmit_data(struct homa_rpc *rpc, bool force)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa *homa = rpc->hsk->homa;
+
+ homa_rpc_hold(rpc);
+ while (*rpc->msgout.next_xmit) {
+ struct sk_buff *skb = *rpc->msgout.next_xmit;
+
+ if ((rpc->msgout.length - rpc->msgout.next_xmit_offset)
+ >= homa->pacer->throttle_min_bytes) {
+ if (!homa_pacer_check_nic_q(homa->pacer, skb, force)) {
+ homa_pacer_manage_rpc(rpc);
+ break;
+ }
+ }
+
+ rpc->msgout.next_xmit = &(homa_get_skb_info(skb)->next_skb);
+ rpc->msgout.next_xmit_offset +=
+ homa_get_skb_info(skb)->data_bytes;
+
+ homa_rpc_hold(rpc);
+ homa_rpc_unlock(rpc);
+ skb_get(skb);
+ __homa_xmit_data(skb, rpc);
+ force = false;
+ homa_rpc_lock(rpc);
+ homa_rpc_put(rpc);
+ if (rpc->state == RPC_DEAD)
+ break;
+ }
+ homa_rpc_put(rpc);
+}
+
+/**
+ * __homa_xmit_data() - Handles the packet-transmission work that is common
+ * to homa_xmit_data and homa_resend_data.
+ * @skb: Packet to be sent. The packet will be freed after transmission
+ * (and also if errors prevented transmission).
+ * @rpc: Information about the RPC that the packet belongs to.
+ */
+void __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc)
+{
+ struct dst_entry *dst;
+
+ dst = homa_get_dst(rpc->peer, rpc->hsk);
+ dst_hold(dst);
+ skb_dst_set(skb, dst);
+
+ skb->ooo_okay = 1;
+ skb->ip_summed = CHECKSUM_PARTIAL;
+ skb->csum_start = skb_transport_header(skb) - skb->head;
+ skb->csum_offset = offsetof(struct homa_common_hdr, checksum);
+ if (rpc->hsk->inet.sk.sk_family == AF_INET6)
+ ip6_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow.u.ip6,
+ 0, NULL, 0, 0);
+ else
+ ip_queue_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow);
+}
+
+/**
+ * homa_resend_data() - This function is invoked as part of handling RESEND
+ * requests. It retransmits the packet(s) containing a given range of bytes
+ * from a message.
+ * @rpc: RPC for which data should be resent.
+ * @start: Offset within @rpc->msgout of the first byte to retransmit.
+ * @end: Offset within @rpc->msgout of the byte just after the last one
+ * to retransmit.
+ */
+void homa_resend_data(struct homa_rpc *rpc, int start, int end)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_skb_info *homa_info;
+ struct sk_buff *skb;
+
+ if (end <= start)
+ return;
+
+ /* Each iteration of this loop checks one packet in the message
+ * to see if it contains segments that need to be retransmitted.
+ */
+ for (skb = rpc->msgout.packets; skb; skb = homa_info->next_skb) {
+ int seg_offset, offset, seg_length, data_left;
+ struct homa_data_hdr *h;
+
+ homa_info = homa_get_skb_info(skb);
+ offset = homa_info->offset;
+ if (offset >= end)
+ break;
+ if (start >= (offset + homa_info->data_bytes))
+ continue;
+
+ offset = homa_info->offset;
+ seg_offset = sizeof32(struct homa_data_hdr);
+ data_left = homa_info->data_bytes;
+ if (skb_shinfo(skb)->gso_segs <= 1) {
+ seg_length = data_left;
+ } else {
+ seg_length = homa_info->seg_length;
+ h = (struct homa_data_hdr *)skb_transport_header(skb);
+ }
+ for ( ; data_left > 0; data_left -= seg_length,
+ offset += seg_length,
+ seg_offset += skb_shinfo(skb)->gso_size) {
+ struct homa_skb_info *new_homa_info;
+ struct sk_buff *new_skb;
+ int err;
+
+ if (seg_length > data_left)
+ seg_length = data_left;
+
+ if (end <= offset)
+ goto resend_done;
+ if ((offset + seg_length) <= start)
+ continue;
+
+ /* This segment must be retransmitted. */
+ new_skb = homa_skb_new_tx(sizeof(struct homa_data_hdr)
+ + seg_length);
+ if (unlikely(!new_skb))
+ goto resend_done;
+ h = __skb_put_data(new_skb, skb_transport_header(skb),
+ sizeof32(struct homa_data_hdr));
+ h->common.sequence = htonl(offset);
+ h->seg.offset = htonl(offset);
+ h->retransmit = 1;
+ err = homa_skb_append_from_skb(rpc->hsk->homa, new_skb,
+ skb, seg_offset,
+ seg_length);
+ if (err != 0) {
+ pr_err("%s got error %d from homa_skb_append_from_skb\n",
+ __func__, err);
+ kfree_skb(new_skb);
+ goto resend_done;
+ }
+
+ new_homa_info = homa_get_skb_info(new_skb);
+ new_homa_info->wire_bytes = rpc->hsk->ip_header_length
+ + sizeof(struct homa_data_hdr)
+ + seg_length + HOMA_ETH_OVERHEAD;
+ new_homa_info->data_bytes = seg_length;
+ new_homa_info->seg_length = seg_length;
+ new_homa_info->offset = offset;
+ homa_pacer_check_nic_q(rpc->hsk->homa->pacer,
+ new_skb, true);
+ __homa_xmit_data(new_skb, rpc);
+ }
+ }
+
+resend_done:
+ return;
+}
--
2.43.0
* [PATCH net-next v8 11/15] net: homa: create homa_utils.c
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (9 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 10/15] net: homa: create homa_outgoing.c John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 12/15] net: homa: create homa_incoming.c John Ousterhout
` (4 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This file contains functions for constructing and destroying
homa structs.
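As the kernel-doc for homa_init() in this patch notes, homa_destroy() must be safe (and necessary) to call even when homa_init() failed partway through construction. A minimal user-space sketch of that partial-construction-safe pattern (the `ctx_*` names are illustrative, not from the patch):

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for a struct owning several sub-objects. */
struct ctx {
	int *port_map;
	int *peers;
};

/* Returns 0 on success or -1 on error. Even on failure, the struct is
 * left in a state where ctx_destroy() can be called safely.
 */
static int ctx_init(struct ctx *c, int fail_second)
{
	memset(c, 0, sizeof(*c));
	c->port_map = malloc(sizeof(int));
	if (!c->port_map)
		return -1;
	if (fail_second)
		return -1;	/* simulate the second allocation failing */
	c->peers = malloc(sizeof(int));
	if (!c->peers)
		return -1;
	return 0;
}

/* Safe to call no matter how far ctx_init() got: each pointer is
 * checked implicitly (free(NULL) is a no-op) and re-NULLed, so
 * calling this twice is also harmless.
 */
static void ctx_destroy(struct ctx *c)
{
	free(c->port_map);
	c->port_map = NULL;
	free(c->peers);
	c->peers = NULL;
}
```

This mirrors why homa_destroy() NULL-checks and re-NULLs port_map, pacer, and peers before freeing each one.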
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v8:
* Accommodate homa_pacer refactoring
Changes for v7:
* Make Homa a pernet subsystem
* Add support for tx memory accounting
* Remove "lock_slow" functions, which don't add functionality in this
patch series
* Use u64 and __u64 properly
---
net/homa/homa_utils.c | 103 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 103 insertions(+)
create mode 100644 net/homa/homa_utils.c
diff --git a/net/homa/homa_utils.c b/net/homa/homa_utils.c
new file mode 100644
index 000000000000..01716014afaa
--- /dev/null
+++ b/net/homa/homa_utils.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains miscellaneous utility functions for Homa, such
+ * as initializing and destroying homa structs.
+ */
+
+#include "homa_impl.h"
+#include "homa_pacer.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_stub.h"
+
+/**
+ * homa_init() - Constructor for homa objects.
+ * @homa: Object to initialize.
+ * @net: Network namespace that @homa is associated with.
+ *
+ * Return: 0 on success, or a negative errno if there was an error. Even
+ * if an error occurs, it is safe (and necessary) to call
+ * homa_destroy at some point.
+ */
+int homa_init(struct homa *homa, struct net *net)
+{
+ int err;
+
+ memset(homa, 0, sizeof(*homa));
+ atomic64_set(&homa->next_outgoing_id, 2);
+ homa->pacer = homa_pacer_new(homa, net);
+ if (IS_ERR(homa->pacer)) {
+ err = PTR_ERR(homa->pacer);
+ homa->pacer = NULL;
+ return err;
+ }
+ homa->prev_default_port = HOMA_MIN_DEFAULT_PORT - 1;
+ homa->port_map = kmalloc(sizeof(*homa->port_map), GFP_KERNEL);
+ if (!homa->port_map) {
+ pr_err("%s couldn't create port_map: kmalloc failure\n",
+ __func__);
+ return -ENOMEM;
+ }
+ homa_socktab_init(homa->port_map);
+ homa->peers = kmalloc(sizeof(*homa->peers), GFP_KERNEL);
+ if (!homa->peers) {
+ pr_err("%s couldn't create peers: kmalloc failure\n", __func__);
+ return -ENOMEM;
+ }
+ err = homa_peertab_init(homa->peers);
+ if (err) {
+ pr_err("%s couldn't initialize peer table (errno %d)\n",
+ __func__, -err);
+ return err;
+ }
+
+ /* Wild guesses to initialize configuration values... */
+ homa->resend_ticks = 5;
+ homa->resend_interval = 5;
+ homa->timeout_ticks = 100;
+ homa->timeout_resends = 5;
+ homa->request_ack_ticks = 2;
+ homa->reap_limit = 10;
+ homa->dead_buffs_limit = 5000;
+ homa->max_gso_size = 10000;
+ homa->wmem_max = 100000000;
+ homa->bpage_lease_usecs = 10000;
+ return 0;
+}
+
+/**
+ * homa_destroy() - Destructor for homa objects.
+ * @homa: Object to destroy.
+ */
+void homa_destroy(struct homa *homa)
+{
+ /* The order of the following statements matters! */
+ if (homa->port_map) {
+ homa_socktab_destroy(homa->port_map);
+ kfree(homa->port_map);
+ homa->port_map = NULL;
+ }
+ if (homa->pacer) {
+ homa_pacer_destroy(homa->pacer);
+ homa->pacer = NULL;
+ }
+ if (homa->peers) {
+ homa_peertab_destroy(homa->peers);
+ kfree(homa->peers);
+ homa->peers = NULL;
+ }
+}
+
+/**
+ * homa_spin() - Delay (without sleeping) for a given time interval.
+ * @ns: How long to delay (in nanoseconds)
+ */
+void homa_spin(int ns)
+{
+ u64 end;
+
+ end = sched_clock() + ns;
+ while (sched_clock() < end)
+ /* Empty loop body. */
+ ;
+}
--
2.43.0
* [PATCH net-next v8 12/15] net: homa: create homa_incoming.c
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (10 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 11/15] net: homa: create homa_utils.c John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 13/15] net: homa: create homa_timer.c John Ousterhout
` (3 subsequent siblings)
15 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This file contains most of the code for handling incoming packets,
including top-level dispatching code plus specific handlers for each
packet type. It also contains code for dispatching fully-received
messages to waiting application threads.
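The dispatch loop in this patch walks a singly linked batch of packets (chained through skb->next), saving the next pointer before handing each packet to a type-specific handler that may consume it. A self-contained sketch of that shape (types and counters here are illustrative, not Homa's):

```c
#include <stddef.h>

/* Illustrative packet types and a minimal singly linked batch,
 * mimicking the skb->next chaining used by homa_dispatch_pkts().
 */
enum pkt_type { PKT_DATA, PKT_ACK, PKT_UNKNOWN };

struct pkt {
	enum pkt_type type;
	struct pkt *next;
};

static int data_seen, ack_seen, dropped;

/* Each iteration of this loop processes one packet; ->next is saved
 * first because a handler may free or consume the packet.
 */
static void dispatch(struct pkt *p)
{
	struct pkt *next;

	for (; p; p = next) {
		next = p->next;
		switch (p->type) {
		case PKT_DATA:
			data_seen++;
			break;
		case PKT_ACK:
			ack_seen++;
			break;
		default:
			dropped++;	/* Unrecognized type: discard. */
		}
	}
}
```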
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v7:
* API change for homa_rpc_handoff
* Refactor waiting mechanism for incoming packets: simplify wait
criteria and use standard Linux mechanisms for waiting, use
new homa_interest struct
* Reject unauthorized incoming request messages
* Improve documentation for code that spins (and reduce spin length)
* Use RPC reference counts, eliminate RPC_HANDING_OFF flag
* Replace erroneous use of "safe" list iteration with "rcu" version
* Remove locker argument from locking functions
* Check incoming messages against HOMA_MAX_MESSAGE_LENGTH
* Use u64 and __u64 properly
---
net/homa/homa_incoming.c | 922 +++++++++++++++++++++++++++++++++++++++
1 file changed, 922 insertions(+)
create mode 100644 net/homa/homa_incoming.c
diff --git a/net/homa/homa_incoming.c b/net/homa/homa_incoming.c
new file mode 100644
index 000000000000..608f8067cf1a
--- /dev/null
+++ b/net/homa/homa_incoming.c
@@ -0,0 +1,922 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions that handle incoming Homa messages. */
+
+#include "homa_impl.h"
+#include "homa_interest.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/**
+ * homa_message_in_init() - Constructor for homa_message_in.
+ * @rpc: RPC whose msgin structure should be initialized. The
+ * msgin struct is assumed to be zeroed.
+ * @length: Total number of bytes in message.
+ * Return: Zero for successful initialization, or a negative errno
+ * if rpc->msgin could not be initialized.
+ */
+int homa_message_in_init(struct homa_rpc *rpc, int length)
+ __must_hold(rpc_bucket_lock)
+{
+ int err;
+
+ if (length > HOMA_MAX_MESSAGE_LENGTH)
+ return -EINVAL;
+
+ rpc->msgin.length = length;
+ skb_queue_head_init(&rpc->msgin.packets);
+ INIT_LIST_HEAD(&rpc->msgin.gaps);
+ rpc->msgin.bytes_remaining = length;
+ err = homa_pool_allocate(rpc);
+ if (err != 0) {
+ rpc->msgin.length = -1;
+ return err;
+ }
+ return 0;
+}
+
+/**
+ * homa_gap_new() - Create a new gap and add it to a list.
+ * @next: Add the new gap just before this list element.
+ * @start: Offset of first byte covered by the gap.
+ * @end: Offset of byte just after the last one covered by the gap.
+ * Return: Pointer to the new gap, or NULL if memory couldn't be allocated
+ * for the gap object.
+ */
+struct homa_gap *homa_gap_new(struct list_head *next, int start, int end)
+{
+ struct homa_gap *gap;
+
+ gap = kmalloc(sizeof(*gap), GFP_ATOMIC);
+ if (!gap)
+ return NULL;
+ gap->start = start;
+ gap->end = end;
+ gap->time = sched_clock();
+ list_add_tail(&gap->links, next);
+ return gap;
+}
+
+/**
+ * homa_gap_retry() - Send RESEND requests for all of the unreceived
+ * gaps in a message.
+ * @rpc: RPC to check; must be locked by caller.
+ */
+void homa_gap_retry(struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_resend_hdr resend;
+ struct homa_gap *gap;
+
+ list_for_each_entry(gap, &rpc->msgin.gaps, links) {
+ resend.offset = htonl(gap->start);
+ resend.length = htonl(gap->end - gap->start);
+ homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
+ }
+}
+
+/**
+ * homa_add_packet() - Add an incoming packet to the contents of a
+ * partially received message.
+ * @rpc: Add the packet to the msgin for this RPC.
+ * @skb: The new packet. This function takes ownership of the packet
+ * (the packet will either be freed or added to rpc->msgin.packets).
+ */
+void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
+ struct homa_gap *gap, *dummy, *gap2;
+ int start = ntohl(h->seg.offset);
+ int length = homa_data_len(skb);
+ int end = start + length;
+
+ if ((start + length) > rpc->msgin.length)
+ goto discard;
+
+ if (start == rpc->msgin.recv_end) {
+ /* Common case: packet is sequential. */
+ rpc->msgin.recv_end += length;
+ goto keep;
+ }
+
+ if (start > rpc->msgin.recv_end) {
+ /* Packet creates a new gap. */
+ if (!homa_gap_new(&rpc->msgin.gaps,
+ rpc->msgin.recv_end, start)) {
+ pr_err("Homa couldn't allocate gap: insufficient memory\n");
+ goto discard;
+ }
+ rpc->msgin.recv_end = end;
+ goto keep;
+ }
+
+ /* Must now check to see if the packet fills in part or all of
+ * an existing gap.
+ */
+ list_for_each_entry_safe(gap, dummy, &rpc->msgin.gaps, links) {
+ /* Is packet at the start of this gap? */
+ if (start <= gap->start) {
+ if (end <= gap->start)
+ continue;
+ if (start < gap->start)
+ goto discard;
+ if (end > gap->end)
+ goto discard;
+ gap->start = end;
+ if (gap->start >= gap->end) {
+ list_del(&gap->links);
+ kfree(gap);
+ }
+ goto keep;
+ }
+
+ /* Is packet at the end of this gap? Note that at this point
+ * we know the packet can't cover the entire gap.
+ */
+ if (end >= gap->end) {
+ if (start >= gap->end)
+ continue;
+ if (end > gap->end)
+ goto discard;
+ gap->end = start;
+ goto keep;
+ }
+
+ /* Packet is in the middle of the gap; must split the gap. */
+ gap2 = homa_gap_new(&gap->links, gap->start, start);
+ if (!gap2) {
+ pr_err("Homa couldn't allocate gap for split: insufficient memory\n");
+ goto discard;
+ }
+ gap2->time = gap->time;
+ gap->start = end;
+ goto keep;
+ }
+
+discard:
+ kfree_skb(skb);
+ return;
+
+keep:
+ __skb_queue_tail(&rpc->msgin.packets, skb);
+ rpc->msgin.bytes_remaining -= length;
+}
+
+/**
+ * homa_copy_to_user() - Copy as much data as possible from incoming
+ * packet buffers to buffers in user space.
+ * @rpc: RPC for which data should be copied. Must be locked by caller.
+ * Return: Zero for success or a negative errno if there is an error.
+ * It is possible for the RPC to be freed while this function
+ * executes (it releases and reacquires the RPC lock). If that
+ * happens, -EINVAL will be returned and the state of @rpc
+ * will be RPC_DEAD. Clears the RPC_PKTS_READY bit in @rpc->flags
+ * if all available packets have been copied out.
+ */
+int homa_copy_to_user(struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+#define MAX_SKBS 20
+ struct sk_buff *skbs[MAX_SKBS];
+ int error = 0;
+ int n = 0; /* Number of filled entries in skbs. */
+ int i;
+
+ /* Tricky note: we can't hold the RPC lock while we're actually
+ * copying to user space, because (a) it's illegal to hold a spinlock
+ * while copying to user space and (b) we'd like for homa_softirq
+ * to add more packets to the RPC while we're copying these out.
+ * So, collect a bunch of packets to copy, then release the lock,
+ * copy them, and reacquire the lock.
+ */
+ while (true) {
+ struct sk_buff *skb;
+
+ if (rpc->state == RPC_DEAD) {
+ error = -EINVAL;
+ break;
+ }
+
+ skb = __skb_dequeue(&rpc->msgin.packets);
+ if (skb) {
+ skbs[n] = skb;
+ n++;
+ if (n < MAX_SKBS)
+ continue;
+ }
+ if (n == 0) {
+ atomic_andnot(RPC_PKTS_READY, &rpc->flags);
+ break;
+ }
+
+ /* At this point we've collected a batch of packets (or
+ * run out of packets); copy any available packets out to
+ * user space.
+ */
+ homa_rpc_hold(rpc);
+ homa_rpc_unlock(rpc);
+
+ /* Each iteration of this loop copies out one skb. */
+ for (i = 0; i < n; i++) {
+ struct homa_data_hdr *h = (struct homa_data_hdr *)
+ skbs[i]->data;
+ int pkt_length = homa_data_len(skbs[i]);
+ int offset = ntohl(h->seg.offset);
+ int buf_bytes, chunk_size;
+ struct iov_iter iter;
+ int copied = 0;
+ char __user *dst;
+
+ /* Each iteration of this loop copies to one
+ * user buffer.
+ */
+ while (copied < pkt_length) {
+ chunk_size = pkt_length - copied;
+ dst = homa_pool_get_buffer(rpc, offset + copied,
+ &buf_bytes);
+ if (buf_bytes < chunk_size) {
+ if (buf_bytes == 0) {
+ /* skb has data beyond message
+ * end?
+ */
+ break;
+ }
+ chunk_size = buf_bytes;
+ }
+ error = import_ubuf(READ, dst, chunk_size,
+ &iter);
+ if (error)
+ goto free_skbs;
+ error = skb_copy_datagram_iter(skbs[i],
+ sizeof(*h) +
+ copied, &iter,
+ chunk_size);
+ if (error)
+ goto free_skbs;
+ copied += chunk_size;
+ }
+ }
+
+free_skbs:
+ for (i = 0; i < n; i++)
+ kfree_skb(skbs[i]);
+ n = 0;
+ atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+ homa_rpc_lock(rpc);
+ atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
+ homa_rpc_put(rpc);
+ if (error)
+ break;
+ }
+ return error;
+}
+
+/**
+ * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
+ * all related to the same RPC.
+ * @skb: First packet in the batch, linked through skb->next.
+ * @homa: Overall information about the Homa transport.
+ */
+void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa)
+{
+#define MAX_ACKS 10
+ const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+ struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
+ u64 id = homa_local_id(h->common.sender_id);
+ int dport = ntohs(h->common.dport);
+
+ /* Used to collect acks from data packets so we can process them
+ * all at the end (can't process them inline because that may
+ * require locking conflicting RPCs). If we run out of space just
+ * ignore the extra acks; they'll be regenerated later through the
+ * explicit mechanism.
+ */
+ struct homa_ack acks[MAX_ACKS];
+ struct homa_rpc *rpc = NULL;
+ struct homa_sock *hsk;
+ struct sk_buff *next;
+ int num_acks = 0;
+
+ /* Find the appropriate socket. */
+ hsk = homa_sock_find(homa->port_map, dport);
+ if (!hsk || (!homa_is_client(id) && !hsk->is_server)) {
+ if (skb_is_ipv6(skb))
+ icmp6_send(skb, ICMPV6_DEST_UNREACH,
+ ICMPV6_PORT_UNREACH, 0, NULL, IP6CB(skb));
+ else
+ icmp_send(skb, ICMP_DEST_UNREACH,
+ ICMP_PORT_UNREACH, 0);
+ while (skb) {
+ next = skb->next;
+ kfree_skb(skb);
+ skb = next;
+ }
+ if (hsk)
+ sock_put(&hsk->sock);
+ return;
+ }
+
+ /* Each iteration through the following loop processes one packet. */
+ for (; skb; skb = next) {
+ h = (struct homa_data_hdr *)skb->data;
+ next = skb->next;
+
+ /* Relinquish the RPC lock temporarily if it's needed
+ * elsewhere.
+ */
+ if (rpc) {
+ int flags = atomic_read(&rpc->flags);
+
+ if (flags & APP_NEEDS_LOCK) {
+ homa_rpc_unlock(rpc);
+
+ /* This short spin is needed to ensure that the
+ * other thread gets the lock before this thread
+ * grabs it again below (the need for this
+ * was confirmed experimentally in 2/2025;
+ * without it, the handoff fails 20-25% of the
+ * time). Furthermore, the call to homa_spin
+ * seems to allow the other thread to acquire
+ * the lock more quickly.
+ */
+ homa_spin(100);
+ rpc = NULL;
+ }
+ }
+
+ /* Find and lock the RPC if we haven't already done so. */
+ if (!rpc) {
+ if (!homa_is_client(id)) {
+ /* We are the server for this RPC. */
+ if (h->common.type == DATA) {
+ int created;
+
+ /* Create a new RPC if one doesn't
+ * already exist.
+ */
+ rpc = homa_rpc_new_server(hsk, &saddr,
+ h, &created);
+ if (IS_ERR(rpc)) {
+ pr_warn("%s couldn't create server rpc: error %ld\n",
+ __func__, -PTR_ERR(rpc));
+ rpc = NULL;
+ goto discard;
+ }
+ } else {
+ rpc = homa_find_server_rpc(hsk, &saddr,
+ id);
+ }
+ } else {
+ rpc = homa_find_client_rpc(hsk, id);
+ }
+ }
+ if (unlikely(!rpc)) {
+ if (h->common.type != NEED_ACK &&
+ h->common.type != ACK &&
+ h->common.type != RESEND)
+ goto discard;
+ } else {
+ if (h->common.type == DATA ||
+ h->common.type == BUSY ||
+ h->common.type == NEED_ACK)
+ rpc->silent_ticks = 0;
+ rpc->peer->outstanding_resends = 0;
+ }
+
+ switch (h->common.type) {
+ case DATA:
+ if (h->ack.client_id) {
+ /* Save the ack for processing later, when we
+ * have released the RPC lock.
+ */
+ if (num_acks < MAX_ACKS) {
+ acks[num_acks] = h->ack;
+ num_acks++;
+ }
+ }
+ homa_data_pkt(skb, rpc);
+ break;
+ case RESEND:
+ homa_resend_pkt(skb, rpc, hsk);
+ break;
+ case RPC_UNKNOWN:
+ homa_rpc_unknown_pkt(skb, rpc);
+ break;
+ case BUSY:
+ /* Nothing to do for these packets except reset
+ * silent_ticks, which happened above.
+ */
+ goto discard;
+ case NEED_ACK:
+ homa_need_ack_pkt(skb, hsk, rpc);
+ break;
+ case ACK:
+ homa_ack_pkt(skb, hsk, rpc);
+ break;
+ default:
+ goto discard;
+ }
+ continue;
+
+discard:
+ kfree_skb(skb);
+ }
+ if (rpc)
+ homa_rpc_unlock(rpc);
+
+ while (num_acks > 0) {
+ num_acks--;
+ homa_rpc_acked(hsk, &saddr, &acks[num_acks]);
+ }
+
+ if (hsk->dead_skbs >= 2 * hsk->homa->dead_buffs_limit)
+ /* We get here if neither homa_wait_for_message
+ * nor homa_timer can keep up with reaping dead
+ * RPCs. See reap.txt for details.
+ */
+ homa_rpc_reap(hsk, false);
+ sock_put(&hsk->sock);
+}
+
+/**
+ * homa_data_pkt() - Handler for incoming DATA packets
+ * @skb: Incoming packet; size known to be large enough for the header.
+ * This function now owns the packet.
+ * @rpc: Information about the RPC corresponding to this packet.
+ * Must be locked by the caller.
+ */
+void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
+
+ if (rpc->state != RPC_INCOMING && homa_is_client(rpc->id)) {
+ if (unlikely(rpc->state != RPC_OUTGOING))
+ goto discard;
+ rpc->state = RPC_INCOMING;
+ if (homa_message_in_init(rpc, ntohl(h->message_length)) != 0)
+ goto discard;
+ } else if (rpc->state != RPC_INCOMING) {
+ /* Must be server; note that homa_rpc_new_server already
+ * initialized msgin and allocated buffers.
+ */
+ if (unlikely(rpc->msgin.length >= 0))
+ goto discard;
+ }
+
+ if (rpc->msgin.num_bpages == 0)
+ /* Drop packets that arrive when we can't allocate buffer
+ * space. If we keep them around, packet buffer usage can
+ * exceed available cache space, resulting in poor
+ * performance.
+ */
+ goto discard;
+
+ homa_add_packet(rpc, skb);
+
+ if (skb_queue_len(&rpc->msgin.packets) != 0 &&
+ !(atomic_read(&rpc->flags) & RPC_PKTS_READY)) {
+ atomic_or(RPC_PKTS_READY, &rpc->flags);
+ homa_rpc_handoff(rpc);
+ }
+
+ return;
+
+discard:
+ kfree_skb(skb);
+}
+
+/**
+ * homa_resend_pkt() - Handler for incoming RESEND packets
+ * @skb: Incoming packet; size already verified large enough for header.
+ * This function now owns the packet.
+ * @rpc: Information about the RPC corresponding to this packet; must
+ * be locked by caller, but may be NULL if there is no RPC matching
+ * this packet
+ * @hsk: Socket on which the packet was received.
+ */
+void homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
+ struct homa_sock *hsk)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_resend_hdr *h = (struct homa_resend_hdr *)skb->data;
+ struct homa_busy_hdr busy;
+
+ if (!rpc) {
+ homa_xmit_unknown(skb, hsk);
+ goto done;
+ }
+
+ if (!homa_is_client(rpc->id) && rpc->state != RPC_OUTGOING) {
+ /* We are the server for this RPC and don't yet have a
+ * response packet, so just send BUSY.
+ */
+ homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
+ goto done;
+ }
+ if (ntohl(h->length) == 0)
+ /* This RESEND is from a server just trying to determine
+ * whether the client still cares about the RPC; return
+ * BUSY so the server doesn't time us out.
+ */
+ homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
+ homa_resend_data(rpc, ntohl(h->offset),
+ ntohl(h->offset) + ntohl(h->length));
+
+done:
+ kfree_skb(skb);
+}
+
+/**
+ * homa_rpc_unknown_pkt() - Handler for incoming RPC_UNKNOWN packets.
+ * @skb: Incoming packet; size known to be large enough for the header.
+ * This function now owns the packet.
+ * @rpc: Information about the RPC corresponding to this packet. Must
+ * be locked by caller.
+ */
+void homa_rpc_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+ if (homa_is_client(rpc->id)) {
+ if (rpc->state == RPC_OUTGOING) {
+ /* It appears that everything we've already transmitted
+ * has been lost; retransmit it.
+ */
+ homa_resend_data(rpc, 0, rpc->msgout.next_xmit_offset);
+ goto done;
+ }
+ } else {
+ homa_rpc_end(rpc);
+ }
+done:
+ kfree_skb(skb);
+}
+
+/**
+ * homa_need_ack_pkt() - Handler for incoming NEED_ACK packets
+ * @skb: Incoming packet; size already verified large enough for header.
+ * This function now owns the packet.
+ * @hsk: Socket on which the packet was received.
+ * @rpc: The RPC named in the packet header, or NULL if no such
+ * RPC exists. The RPC has been locked by the caller.
+ */
+void homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+ struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_common_hdr *h = (struct homa_common_hdr *)skb->data;
+ const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+ u64 id = homa_local_id(h->sender_id);
+ struct homa_peer *peer;
+ struct homa_ack_hdr ack;
+
+ /* Return if it's not safe for the peer to purge its state
+ * for this RPC (the RPC still exists and we haven't received
+ * the entire response), or if we can't find peer info.
+ */
+ if (rpc && (rpc->state != RPC_INCOMING ||
+ rpc->msgin.bytes_remaining)) {
+ goto done;
+ } else {
+ peer = homa_peer_find(hsk->homa->peers, &saddr, &hsk->inet);
+ if (IS_ERR(peer))
+ goto done;
+ }
+
+ /* Send an ACK for this RPC. At the same time, include all of the
+ * other acks available for the peer. Note: can't use rpc below,
+ * since it may be NULL.
+ */
+ ack.common.type = ACK;
+ ack.common.sport = h->dport;
+ ack.common.dport = h->sport;
+ ack.common.sender_id = cpu_to_be64(id);
+ ack.num_acks = htons(homa_peer_get_acks(peer,
+ HOMA_MAX_ACKS_PER_PKT,
+ ack.acks));
+ __homa_xmit_control(&ack, sizeof(ack), peer, hsk);
+
+done:
+ kfree_skb(skb);
+}
+
+/**
+ * homa_ack_pkt() - Handler for incoming ACK packets
+ * @skb: Incoming packet; size already verified large enough for header.
+ * This function now owns the packet.
+ * @hsk: Socket on which the packet was received.
+ * @rpc: The RPC named in the packet header, or NULL if no such
+ * RPC exists. If non-NULL, the RPC (still locked by the
+ * caller) will be in state RPC_DEAD on return.
+ */
+void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+ struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+ const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+ struct homa_ack_hdr *h = (struct homa_ack_hdr *)skb->data;
+ int i, count;
+
+ if (rpc)
+ homa_rpc_end(rpc);
+
+ count = ntohs(h->num_acks);
+ if (count > 0) {
+ if (rpc) {
+ /* Must temporarily release rpc's lock because
+ * homa_rpc_acked needs to acquire RPC locks.
+ */
+ homa_rpc_hold(rpc);
+ homa_rpc_unlock(rpc);
+ for (i = 0; i < count; i++)
+ homa_rpc_acked(hsk, &saddr, &h->acks[i]);
+ homa_rpc_lock(rpc);
+ homa_rpc_put(rpc);
+ } else {
+ for (i = 0; i < count; i++)
+ homa_rpc_acked(hsk, &saddr, &h->acks[i]);
+ }
+ }
+ kfree_skb(skb);
+}
+
+/**
+ * homa_rpc_abort() - Terminate an RPC.
+ * @rpc: RPC to be terminated. Must be locked by caller.
+ * @error: A negative errno value indicating the error that caused the abort.
+ * If this is a client RPC, the error will be returned to the
+ * application; if it's a server RPC, the error is ignored and
+ * we just free the RPC.
+ */
+void homa_rpc_abort(struct homa_rpc *rpc, int error)
+ __must_hold(rpc_bucket_lock)
+{
+ if (!homa_is_client(rpc->id)) {
+ homa_rpc_end(rpc);
+ return;
+ }
+ rpc->error = error;
+ homa_rpc_handoff(rpc);
+}
+
+/**
+ * homa_abort_rpcs() - Abort all RPCs to/from a particular peer.
+ * @homa: Overall data about the Homa protocol implementation.
+ * @addr: Address (network order) of the destination whose RPCs are
+ * to be aborted.
+ * @port: If nonzero, then RPCs will only be aborted if they were
+ * targeted at this server port.
+ * @error: Negative errno value indicating the reason for the abort.
+ */
+void homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
+ int port, int error)
+{
+ struct homa_socktab_scan scan;
+ struct homa_rpc *rpc;
+ struct homa_sock *hsk;
+
+ for (hsk = homa_socktab_start_scan(homa->port_map, &scan); hsk;
+ hsk = homa_socktab_next(&scan)) {
+ /* Skip the (expensive) lock acquisition if there's no
+ * work to do.
+ */
+ if (list_empty(&hsk->active_rpcs))
+ continue;
+ if (!homa_protect_rpcs(hsk))
+ continue;
+ rcu_read_lock();
+ list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+ if (!ipv6_addr_equal(&rpc->peer->addr, addr))
+ continue;
+ if (port && rpc->dport != port)
+ continue;
+ homa_rpc_lock(rpc);
+ homa_rpc_abort(rpc, error);
+ homa_rpc_unlock(rpc);
+ }
+ rcu_read_unlock();
+ homa_unprotect_rpcs(hsk);
+ }
+ homa_socktab_end_scan(&scan);
+}
+
+/**
+ * homa_abort_sock_rpcs() - Abort all outgoing (client-side) RPCs on a given
+ * socket.
+ * @hsk: Socket whose RPCs should be aborted.
+ * @error: Zero means that the aborted RPCs should be freed immediately.
+ * A nonzero value means that the RPCs should be marked
+ * complete, so that they can be returned to the application;
+ * this value (a negative errno) will be returned from
+ * recvmsg.
+ */
+void homa_abort_sock_rpcs(struct homa_sock *hsk, int error)
+{
+ struct homa_rpc *rpc;
+
+ if (list_empty(&hsk->active_rpcs))
+ return;
+ if (!homa_protect_rpcs(hsk))
+ return;
+ rcu_read_lock();
+ list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+ if (!homa_is_client(rpc->id))
+ continue;
+ homa_rpc_lock(rpc);
+ if (rpc->state == RPC_DEAD) {
+ homa_rpc_unlock(rpc);
+ continue;
+ }
+ if (error)
+ homa_rpc_abort(rpc, error);
+ else
+ homa_rpc_end(rpc);
+ homa_rpc_unlock(rpc);
+ }
+ rcu_read_unlock();
+ homa_unprotect_rpcs(hsk);
+}
+
+/**
+ * homa_wait_private() - Waits until the response has been received for
+ * a specific RPC or the RPC has failed with an error.
+ * @rpc: RPC to wait for; an error will be returned if the RPC is
+ * not a client RPC or not private. Must be locked by caller.
+ * @nonblocking: Nonzero means return immediately if @rpc not ready.
+ * Return: 0 if the response has been successfully received, otherwise
+ * a negative errno.
+ */
+int homa_wait_private(struct homa_rpc *rpc, int nonblocking)
+ __must_hold(&rpc->bucket->lock)
+{
+ struct homa_interest interest;
+ int result = 0;
+
+ if (!(atomic_read(&rpc->flags) & RPC_PRIVATE))
+ return -EINVAL;
+
+ homa_rpc_hold(rpc);
+
+ /* Each iteration through this loop waits until rpc needs attention
+ * in some way (e.g. packets have arrived), then deals with that need
+ * (e.g. copy to user space). It may take many iterations until the
+ * RPC is ready for the application.
+ */
+ while (1) {
+ if (!rpc->error)
+ rpc->error = homa_copy_to_user(rpc);
+ if (rpc->error) {
+ result = rpc->error;
+ break;
+ }
+ if (rpc->msgin.length >= 0 &&
+ rpc->msgin.bytes_remaining == 0 &&
+ skb_queue_len(&rpc->msgin.packets) == 0)
+ break;
+
+ result = homa_interest_init_private(&interest, rpc);
+ if (result != 0)
+ break;
+
+ homa_rpc_unlock(rpc);
+ result = homa_interest_wait(&interest, nonblocking);
+
+ atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+ homa_rpc_lock(rpc);
+ atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
+ homa_interest_unlink_private(&interest);
+
+ /* If homa_interest_wait returned an error but the interest
+ * actually got ready, then ignore the error.
+ */
+ if (result != 0 && atomic_read(&interest.ready) == 0)
+ break;
+ }
+
+ homa_rpc_put(rpc);
+ return result;
+}
+
+/**
+ * homa_wait_shared() - Wait for the completion of any non-private
+ * incoming message on a socket.
+ * @hsk: Socket on which to wait. Must not be locked.
+ * @nonblocking: Nonzero means return immediately if no RPC is ready.
+ *
+ * Return: Pointer to an RPC with a complete incoming message or nonzero
+ * error field, or a negative errno (usually -EINTR). If an RPC
+ * is returned it will be locked and the caller must unlock.
+ */
+struct homa_rpc *homa_wait_shared(struct homa_sock *hsk, int nonblocking)
+{
+ struct homa_interest interest;
+ struct homa_rpc *rpc;
+ int result;
+
+ INIT_LIST_HEAD(&interest.links);
+ init_waitqueue_head(&interest.wait_queue);
+ /* Each iteration through this loop waits until an RPC needs attention
+ * in some way (e.g. packets have arrived), then deals with that need
+ * (e.g. copy to user space). It may take many iterations until an
+ * RPC is ready for the application.
+ */
+ while (1) {
+ homa_sock_lock(hsk);
+ if (hsk->shutdown) {
+ rpc = ERR_PTR(-ESHUTDOWN);
+ homa_sock_unlock(hsk);
+ goto done;
+ }
+ if (!list_empty(&hsk->ready_rpcs)) {
+ rpc = list_first_entry(&hsk->ready_rpcs,
+ struct homa_rpc,
+ ready_links);
+ homa_rpc_hold(rpc);
+ list_del_init(&rpc->ready_links);
+ if (!list_empty(&hsk->ready_rpcs)) {
+ /* There are still more RPCs available, so
+ * let Linux know.
+ */
+ hsk->sock.sk_data_ready(&hsk->sock);
+ }
+ homa_sock_unlock(hsk);
+ } else {
+ homa_interest_init_shared(&interest, hsk);
+ homa_sock_unlock(hsk);
+ result = homa_interest_wait(&interest, nonblocking);
+ homa_interest_unlink_shared(&interest);
+
+ if (result != 0) {
+ /* If homa_interest_wait returned an error
+ * (e.g. -EAGAIN) but in the meantime the
+ * interest received a handoff, ignore the
+ * error.
+ */
+ if (atomic_read(&interest.ready) == 0) {
+ rpc = ERR_PTR(result);
+ goto done;
+ }
+ }
+
+ rpc = interest.rpc;
+ if (!rpc) {
+ rpc = ERR_PTR(-ESHUTDOWN);
+ goto done;
+ }
+ }
+
+ atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+ homa_rpc_lock(rpc);
+ atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
+ homa_rpc_put(rpc);
+ if (!rpc->error)
+ rpc->error = homa_copy_to_user(rpc);
+ if (rpc->error) {
+ if (rpc->state != RPC_DEAD)
+ break;
+ } else if (rpc->msgin.bytes_remaining == 0 &&
+ skb_queue_len(&rpc->msgin.packets) == 0)
+ break;
+ homa_rpc_unlock(rpc);
+ }
+
+done:
+ return rpc;
+}
+
+/**
+ * homa_rpc_handoff() - This function is called when the input message for
+ * an RPC is ready for attention from a user thread. It notifies a waiting
+ * reader and/or queues the RPC, as appropriate.
+ * @rpc: RPC to handoff; must be locked.
+ */
+void homa_rpc_handoff(struct homa_rpc *rpc)
+ __must_hold(rpc_bucket_lock)
+{
+ struct homa_sock *hsk = rpc->hsk;
+ struct homa_interest *interest;
+
+ if (atomic_read(&rpc->flags) & RPC_PRIVATE) {
+ homa_interest_notify_private(rpc);
+ return;
+ }
+
+ /* Shared RPC; if there is a waiting thread, hand off the RPC;
+ * otherwise enqueue it.
+ */
+ homa_sock_lock(hsk);
+ if (!list_empty(&hsk->interests)) {
+ interest = list_first_entry(&hsk->interests,
+ struct homa_interest, links);
+ list_del_init(&interest->links);
+ interest->rpc = rpc;
+ homa_rpc_hold(rpc);
+ atomic_set_release(&interest->ready, 1);
+ wake_up(&interest->wait_queue);
+
+ } else if (list_empty(&rpc->ready_links)) {
+ list_add_tail(&rpc->ready_links, &hsk->ready_rpcs);
+ hsk->sock.sk_data_ready(&hsk->sock);
+ }
+ homa_sock_unlock(hsk);
+}
+
--
2.43.0
* [PATCH net-next v8 13/15] net: homa: create homa_timer.c
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This file contains code that wakes up periodically to check for
missing data, initiate retransmissions, and declare peer nodes
"dead".
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v7:
* Interface changes to homa_sock_start_scan etc.
* Remove locker argument from locking functions
* Use u64 and __u64 properly
---
net/homa/homa_timer.c | 156 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 156 insertions(+)
create mode 100644 net/homa/homa_timer.c
diff --git a/net/homa/homa_timer.c b/net/homa/homa_timer.c
new file mode 100644
index 000000000000..5bb29310f7c7
--- /dev/null
+++ b/net/homa/homa_timer.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file handles timing-related functions for Homa, such as retries
+ * and timeouts.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_stub.h"
+
+/**
+ * homa_check_rpc() - Invoked for each RPC during each timer pass; does
+ * most of the work of checking for time-related actions such as sending
+ * resends, aborting RPCs for which there is no response, and sending
+ * requests for acks. It is separate from homa_timer because homa_timer
+ * got too long and deeply indented.
+ * @rpc: RPC to check; must be locked by the caller.
+ */
+void homa_check_rpc(struct homa_rpc *rpc)
+ __must_hold(&rpc->bucket->lock)
+{
+ struct homa *homa = rpc->hsk->homa;
+ struct homa_resend_hdr resend;
+
+ /* See if we need to request an ack for this RPC. */
+ if (!homa_is_client(rpc->id) && rpc->state == RPC_OUTGOING &&
+ rpc->msgout.next_xmit_offset >= rpc->msgout.length) {
+ if (rpc->done_timer_ticks == 0) {
+ rpc->done_timer_ticks = homa->timer_ticks;
+ } else {
+ /* >= comparison that handles tick wrap-around. */
+ if ((rpc->done_timer_ticks + homa->request_ack_ticks
+ - 1 - homa->timer_ticks) & 1 << 31) {
+ struct homa_need_ack_hdr h;
+
+ homa_xmit_control(NEED_ACK, &h, sizeof(h), rpc);
+ }
+ }
+ }
+
+ if (rpc->state == RPC_INCOMING) {
+ if (rpc->msgin.num_bpages == 0) {
+ /* Waiting for buffer space, so no problem. */
+ rpc->silent_ticks = 0;
+ return;
+ }
+ } else if (!homa_is_client(rpc->id)) {
+ /* We're the server and we've received the input message;
+ * no need to worry about retries.
+ */
+ rpc->silent_ticks = 0;
+ return;
+ }
+
+ if (rpc->state == RPC_OUTGOING) {
+ if (rpc->msgout.next_xmit_offset < rpc->msgout.length) {
+ /* There are granted bytes that we haven't transmitted,
+ * so no need to be concerned; the ball is in our court.
+ */
+ rpc->silent_ticks = 0;
+ return;
+ }
+ }
+
+ if (rpc->silent_ticks < homa->resend_ticks)
+ return;
+ if (rpc->silent_ticks >= homa->timeout_ticks) {
+ homa_rpc_abort(rpc, -ETIMEDOUT);
+ return;
+ }
+ if (((rpc->silent_ticks - homa->resend_ticks) % homa->resend_interval)
+ != 0)
+ return;
+
+ /* Issue a resend for the bytes just after the last ones received
+ * (gaps in the middle are handled by homa_gap_retry below).
+ */
+ if (rpc->msgin.length < 0) {
+ /* Haven't received any data for this message; request
+ * retransmission of just the first packet (the sender
+ * will send at least one full packet, regardless of
+ * the length below).
+ */
+ resend.offset = htonl(0);
+ resend.length = htonl(100);
+ } else {
+ homa_gap_retry(rpc);
+ resend.offset = htonl(rpc->msgin.recv_end);
+ resend.length = htonl(rpc->msgin.length - rpc->msgin.recv_end);
+ if (resend.length == 0)
+ return;
+ }
+ homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
+}
+
+/**
+ * homa_timer() - This function is invoked at regular intervals ("ticks")
+ * to implement retries and aborts for Homa.
+ * @homa: Overall data about the Homa protocol implementation.
+ */
+void homa_timer(struct homa *homa)
+{
+ struct homa_socktab_scan scan;
+ struct homa_sock *hsk;
+ struct homa_rpc *rpc;
+ int total_rpcs = 0;
+ int rpc_count = 0;
+
+ homa->timer_ticks++;
+
+ /* Scan all existing RPCs in all sockets. */
+ for (hsk = homa_socktab_start_scan(homa->port_map, &scan);
+ hsk; hsk = homa_socktab_next(&scan)) {
+ while (hsk->dead_skbs >= homa->dead_buffs_limit)
+ /* If we get here, it means that homa_wait_for_message
+ * isn't keeping up with RPC reaping, so we'll help
+ * out. See reap.txt for more info.
+ */
+ if (homa_rpc_reap(hsk, false) == 0)
+ break;
+
+ if (list_empty(&hsk->active_rpcs) || hsk->shutdown)
+ continue;
+
+ if (!homa_protect_rpcs(hsk))
+ continue;
+ rcu_read_lock();
+ list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+ total_rpcs++;
+ homa_rpc_lock(rpc);
+ if (rpc->state == RPC_IN_SERVICE) {
+ rpc->silent_ticks = 0;
+ homa_rpc_unlock(rpc);
+ continue;
+ }
+ rpc->silent_ticks++;
+ homa_check_rpc(rpc);
+ homa_rpc_unlock(rpc);
+ rpc_count++;
+ if (rpc_count >= 10) {
+ /* Give other kernel threads a chance to run
+ * on this core.
+ */
+ rcu_read_unlock();
+ schedule();
+ rcu_read_lock();
+ rpc_count = 0;
+ }
+ }
+ rcu_read_unlock();
+ homa_unprotect_rpcs(hsk);
+ }
+ homa_socktab_end_scan(&scan);
+ homa_skb_release_pages(homa);
+}
--
2.43.0
* [PATCH net-next v8 14/15] net: homa: create homa_plumbing.c
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
homa_plumbing.c contains functions that connect Homa to the rest of
the Linux kernel, such as dispatch tables used by Linux and the
top-level functions that Linux invokes from those dispatch tables.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
Changes for v8:
* Accommodate homa_pacer and homa_pool refactorings
Changes for v7:
* Remove extraneous code
* Make Homa a pernet subsystem
* Block Homa senders if insufficient tx buffer memory
* Check for missing buffer pool in homa_recvmsg
* Refactor waiting mechanism for incoming packets: simplify wait
criteria and use standard Linux mechanisms for waiting
* Implement SO_HOMA_SERVER option for setsockopt
* Rename UNKNOWN packet type to RPC_UNKNOWN
* Remove locker argument from locking functions
* Use u64 and __u64 properly
* Use new homa_make_header_avl function
---
net/homa/homa_plumbing.c | 1098 ++++++++++++++++++++++++++++++++++++++
1 file changed, 1098 insertions(+)
create mode 100644 net/homa/homa_plumbing.c
diff --git a/net/homa/homa_plumbing.c b/net/homa/homa_plumbing.c
new file mode 100644
index 000000000000..c8de38ece52a
--- /dev/null
+++ b/net/homa/homa_plumbing.c
@@ -0,0 +1,1098 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file consists mostly of "glue" that hooks Homa into the rest of
+ * the Linux kernel. The guts of the protocol are in other files.
+ */
+
+#include "homa_impl.h"
+#include "homa_pacer.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/* Identifier for retrieving Homa-specific data for a struct net. */
+unsigned int homa_net_id;
+
+/* This structure defines functions that allow Homa to be used as a
+ * pernet subsystem.
+ */
+static struct pernet_operations homa_net_ops = {
+ .init = homa_net_init,
+ .exit = homa_net_exit,
+ .id = &homa_net_id,
+ .size = sizeof(struct homa)
+};
+
+/* This structure defines functions that handle various operations on
+ * Homa sockets. These functions are relatively generic: they are called
+ * to implement top-level system calls. Many of these operations can
+ * be implemented by PF_INET6 functions that are independent of the
+ * Homa protocol.
+ */
+static const struct proto_ops homa_proto_ops = {
+ .family = PF_INET,
+ .owner = THIS_MODULE,
+ .release = inet_release,
+ .bind = homa_bind,
+ .connect = inet_dgram_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = sock_no_accept,
+ .getname = inet_getname,
+ .poll = homa_poll,
+ .ioctl = inet_ioctl,
+ .listen = sock_no_listen,
+ .shutdown = homa_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = inet_sendmsg,
+ .recvmsg = inet_recvmsg,
+ .mmap = sock_no_mmap,
+ .set_peek_off = sk_set_peek_off,
+};
+
+static const struct proto_ops homav6_proto_ops = {
+ .family = PF_INET6,
+ .owner = THIS_MODULE,
+ .release = inet6_release,
+ .bind = homa_bind,
+ .connect = inet_dgram_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = sock_no_accept,
+ .getname = inet6_getname,
+ .poll = homa_poll,
+ .ioctl = inet6_ioctl,
+ .listen = sock_no_listen,
+ .shutdown = homa_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = inet_sendmsg,
+ .recvmsg = inet_recvmsg,
+ .mmap = sock_no_mmap,
+ .set_peek_off = sk_set_peek_off,
+};
+
+/* This structure also defines functions that handle various operations
+ * on Homa sockets. However, these functions are lower-level than those
+ * in homa_proto_ops: they are specific to the PF_INET or PF_INET6
+ * protocol family, and in many cases they are invoked by functions in
+ * homa_proto_ops. Most of these functions have Homa-specific implementations.
+ */
+static struct proto homa_prot = {
+ .name = "HOMA",
+ .owner = THIS_MODULE,
+ .close = homa_close,
+ .connect = ip4_datagram_connect,
+ .disconnect = homa_disconnect,
+ .ioctl = homa_ioctl,
+ .init = homa_socket,
+ .destroy = NULL,
+ .setsockopt = homa_setsockopt,
+ .getsockopt = homa_getsockopt,
+ .sendmsg = homa_sendmsg,
+ .recvmsg = homa_recvmsg,
+ .backlog_rcv = homa_backlog_rcv,
+ .hash = homa_hash,
+ .unhash = homa_unhash,
+ .get_port = homa_get_port,
+ .obj_size = sizeof(struct homa_sock),
+ .no_autobind = 1,
+};
+
+static struct proto homav6_prot = {
+ .name = "HOMAv6",
+ .owner = THIS_MODULE,
+ .close = homa_close,
+ .connect = ip6_datagram_connect,
+ .disconnect = homa_disconnect,
+ .ioctl = homa_ioctl,
+ .init = homa_socket,
+ .destroy = NULL,
+ .setsockopt = homa_setsockopt,
+ .getsockopt = homa_getsockopt,
+ .sendmsg = homa_sendmsg,
+ .recvmsg = homa_recvmsg,
+ .backlog_rcv = homa_backlog_rcv,
+ .hash = homa_hash,
+ .unhash = homa_unhash,
+ .get_port = homa_get_port,
+ .obj_size = sizeof(struct homa_v6_sock),
+ .ipv6_pinfo_offset = offsetof(struct homa_v6_sock, inet6),
+
+ .no_autobind = 1,
+};
+
+/* Top-level structure describing the Homa protocol. */
+static struct inet_protosw homa_protosw = {
+ .type = SOCK_DGRAM,
+ .protocol = IPPROTO_HOMA,
+ .prot = &homa_prot,
+ .ops = &homa_proto_ops,
+ .flags = INET_PROTOSW_REUSE,
+};
+
+static struct inet_protosw homav6_protosw = {
+ .type = SOCK_DGRAM,
+ .protocol = IPPROTO_HOMA,
+ .prot = &homav6_prot,
+ .ops = &homav6_proto_ops,
+ .flags = INET_PROTOSW_REUSE,
+};
+
+/* This structure is used by IP to deliver incoming Homa packets to us. */
+static struct net_protocol homa_protocol = {
+ .handler = homa_softirq,
+ .err_handler = homa_err_handler_v4,
+ .no_policy = 1,
+};
+
+static struct inet6_protocol homav6_protocol = {
+ .handler = homa_softirq,
+ .err_handler = homa_err_handler_v6,
+ .flags = INET6_PROTO_NOPOLICY | INET6_PROTO_FINAL,
+};
+
+/* Sizes of the headers for each Homa packet type, in bytes. */
+static __u16 header_lengths[] = {
+ sizeof32(struct homa_data_hdr),
+ 0,
+ sizeof32(struct homa_resend_hdr),
+ sizeof32(struct homa_rpc_unknown_hdr),
+ sizeof32(struct homa_busy_hdr),
+ 0,
+ 0,
+ sizeof32(struct homa_need_ack_hdr),
+ sizeof32(struct homa_ack_hdr)
+};
+
+static DECLARE_COMPLETION(timer_thread_done);
+
+/**
+ * homa_load() - invoked when this module is loaded into the Linux kernel
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int __init homa_load(void)
+{
+ int status;
+
+ pr_notice("Homa module loading\n");
+ status = proto_register(&homa_prot, 1);
+ if (status != 0) {
+ pr_err("proto_register failed for homa_prot: %d\n", status);
+ goto proto_register_err;
+ }
+ status = proto_register(&homav6_prot, 1);
+ if (status != 0) {
+ pr_err("proto_register failed for homav6_prot: %d\n", status);
+ goto proto_register_v6_err;
+ }
+ inet_register_protosw(&homa_protosw);
+ status = inet6_register_protosw(&homav6_protosw);
+ if (status != 0) {
+ pr_err("inet6_register_protosw failed in %s: %d\n", __func__,
+ status);
+ goto register_protosw_v6_err;
+ }
+ status = inet_add_protocol(&homa_protocol, IPPROTO_HOMA);
+ if (status != 0) {
+ pr_err("inet_add_protocol failed in %s: %d\n", __func__,
+ status);
+ goto add_protocol_err;
+ }
+ status = inet6_add_protocol(&homav6_protocol, IPPROTO_HOMA);
+ if (status != 0) {
+ pr_err("inet6_add_protocol failed in %s: %d\n", __func__,
+ status);
+ goto add_protocol_v6_err;
+ }
+
+ status = register_pernet_subsys(&homa_net_ops);
+ if (status != 0) {
+ pr_err("Homa got error from register_pernet_subsys: %d\n",
+ status);
+ goto net_err;
+ }
+
+ return 0;
+
+net_err:
+ inet6_del_protocol(&homav6_protocol, IPPROTO_HOMA);
+add_protocol_v6_err:
+ inet_del_protocol(&homa_protocol, IPPROTO_HOMA);
+add_protocol_err:
+ inet6_unregister_protosw(&homav6_protosw);
+register_protosw_v6_err:
+ inet_unregister_protosw(&homa_protosw);
+ proto_unregister(&homav6_prot);
+proto_register_v6_err:
+ proto_unregister(&homa_prot);
+proto_register_err:
+ return status;
+}
+
+/**
+ * homa_unload() - invoked when this module is unloaded from the Linux kernel.
+ */
+void __exit homa_unload(void)
+{
+ pr_notice("Homa module unloading\n");
+
+ unregister_pernet_subsys(&homa_net_ops);
+
+ inet_del_protocol(&homa_protocol, IPPROTO_HOMA);
+ inet_unregister_protosw(&homa_protosw);
+ inet6_del_protocol(&homav6_protocol, IPPROTO_HOMA);
+ inet6_unregister_protosw(&homav6_protosw);
+ proto_unregister(&homa_prot);
+ proto_unregister(&homav6_prot);
+}
+
+module_init(homa_load);
+module_exit(homa_unload);
+
+/**
+ * homa_net_init() - Initialize a new struct homa as a per-net subsystem.
+ * @net: The net that Homa will be associated with.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_net_init(struct net *net)
+{
+ struct homa *homa = homa_from_net(net);
+ int status;
+
+ pr_notice("Homa attaching to net namespace\n");
+
+ status = homa_init(homa, net);
+ if (status)
+ goto homa_init_err;
+
+ homa->timer_kthread = kthread_run(homa_timer_main, homa, "homa_timer");
+ if (IS_ERR(homa->timer_kthread)) {
+ status = PTR_ERR(homa->timer_kthread);
+ pr_err("couldn't create homa timer thread: error %d\n",
+ status);
+ homa->timer_kthread = NULL;
+ goto timer_err;
+ }
+
+ return 0;
+
+timer_err:
+ homa_destroy(homa);
+homa_init_err:
+ return status;
+}
+
+/**
+ * homa_net_exit() - Remove Homa from a net.
+ * @net: The net from which Homa should be removed.
+ */
+void homa_net_exit(struct net *net)
+{
+ struct homa *homa = homa_from_net(net);
+
+ pr_notice("Homa detaching from net namespace\n");
+
+ homa->destroyed = true;
+ if (homa->timer_kthread)
+ wake_up_process(homa->timer_kthread);
+ wait_for_completion(&timer_thread_done);
+
+ homa_destroy(homa);
+}
+
+/**
+ * homa_bind() - Implements the bind system call for Homa sockets: associates
+ * a well-known service port with a socket. Unlike other AF_INET6 protocols,
+ * there is no need to invoke this system call for sockets that are only
+ * used as clients.
+ * @sock: Socket on which the system call was invoked.
+ * @addr: Contains the desired port number.
+ * @addr_len: Number of bytes in @addr.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+ union sockaddr_in_union *addr_in = (union sockaddr_in_union *)addr;
+ struct homa_sock *hsk = homa_sk(sock->sk);
+ int port = 0;
+
+ if (unlikely(addr->sa_family != sock->sk->sk_family))
+ return -EAFNOSUPPORT;
+ if (addr_in->in6.sin6_family == AF_INET6) {
+ if (addr_len < sizeof(struct sockaddr_in6))
+ return -EINVAL;
+ port = ntohs(addr_in->in6.sin6_port);
+ } else if (addr_in->in4.sin_family == AF_INET) {
+ if (addr_len < sizeof(struct sockaddr_in))
+ return -EINVAL;
+ port = ntohs(addr_in->in4.sin_port);
+ }
+ return homa_sock_bind(hsk->homa->port_map, hsk, port);
+}
+
+/**
+ * homa_close() - Invoked when close system call is invoked on a Homa socket.
+ * @sk: Socket being closed
+ * @timeout: Ignored.
+ */
+void homa_close(struct sock *sk, long timeout)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+
+ homa_sock_destroy(hsk);
+ sk_common_release(sk);
+}
+
+/**
+ * homa_shutdown() - Implements the shutdown system call for Homa sockets.
+ * @sock: Socket to shut down.
+ * @how: Ignored: for other sockets, can independently shut down
+ * sending and receiving, but for Homa any shutdown will
+ * shut down everything.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_shutdown(struct socket *sock, int how)
+{
+ homa_sock_shutdown(homa_sk(sock->sk));
+ return 0;
+}
+
+/**
+ * homa_disconnect() - Invoked when disconnect system call is invoked on a
+ * Homa socket.
+ * @sk: Socket to disconnect
+ * @flags: Ignored.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_disconnect(struct sock *sk, int flags)
+{
+ pr_warn("unimplemented disconnect invoked on Homa socket\n");
+ return -EINVAL;
+}
+
+/**
+ * homa_ioctl() - Implements the ioctl system call for Homa sockets.
+ * @sk: Socket on which the system call was invoked.
+ * @cmd: Identifier for a particular ioctl operation.
+ * @karg: Operation-specific argument; typically the address of a block
+ * of data in user address space.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_ioctl(struct sock *sk, int cmd, int *karg)
+{
+ return -EINVAL;
+}
+
+/**
+ * homa_socket() - Implements the socket(2) system call for Homa sockets.
+ * @sk: Socket on which the system call was invoked. The non-Homa
+ * parts have already been initialized.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_socket(struct sock *sk)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ struct homa *homa = homa_from_sock(sk);
+ int result;
+
+ result = homa_sock_init(hsk, homa);
+ if (result != 0)
+ homa_sock_destroy(hsk);
+ return result;
+}
+
+/**
+ * homa_setsockopt() - Implements the setsockopt system call for Homa sockets.
+ * @sk: Socket on which the system call was invoked.
+ * @level: Level at which the operation should be handled; will always
+ * be IPPROTO_HOMA.
+ * @optname: Identifies a particular setsockopt operation.
+ * @optval: Address in user space of information about the option.
+ * @optlen: Number of bytes of data at @optval.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_setsockopt(struct sock *sk, int level, int optname,
+ sockptr_t optval, unsigned int optlen)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ int ret;
+
+ if (level != IPPROTO_HOMA)
+ return -ENOPROTOOPT;
+
+ if (optname == SO_HOMA_RCVBUF) {
+ struct homa_rcvbuf_args args;
+
+ if (optlen != sizeof(struct homa_rcvbuf_args))
+ return -EINVAL;
+
+ if (copy_from_sockptr(&args, optval, optlen))
+ return -EFAULT;
+
+ /* Do a trivial test to make sure we can at least write the
+ * first page of the region.
+ */
+ if (copy_to_user(u64_to_user_ptr(args.start), &args,
+ sizeof(args)))
+ return -EFAULT;
+
+ homa_sock_lock(hsk);
+ ret = homa_pool_set_region(hsk->buffer_pool,
+ u64_to_user_ptr(args.start),
+ args.length);
+ homa_sock_unlock(hsk);
+ } else if (optname == SO_HOMA_SERVER) {
+ int arg;
+
+ if (optlen != sizeof(arg))
+ return -EINVAL;
+
+ if (copy_from_sockptr(&arg, optval, optlen))
+ return -EFAULT;
+
+ if (arg)
+ hsk->is_server = true;
+ else
+ hsk->is_server = false;
+ ret = 0;
+ } else {
+ ret = -ENOPROTOOPT;
+ }
+ return ret;
+}
+
+/**
+ * homa_getsockopt() - Implements the getsockopt system call for Homa sockets.
+ * @sk: Socket on which the system call was invoked.
+ * @level: Selects level in the network stack to handle the request;
+ * must be IPPROTO_HOMA.
+ * @optname: Identifies a particular getsockopt operation.
+ * @optval: Address in user space where the option's value should be stored.
+ * @optlen: Number of bytes available at optval; will be overwritten with
+ * actual number of bytes stored.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_getsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, int __user *optlen)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ struct homa_rcvbuf_args rcvbuf_args;
+ void *result;
+ int is_server;
+ int len;
+
+ if (copy_from_sockptr(&len, USER_SOCKPTR(optlen), sizeof(int)))
+ return -EFAULT;
+
+ if (level != IPPROTO_HOMA)
+ return -ENOPROTOOPT;
+ if (optname == SO_HOMA_RCVBUF) {
+ if (len < sizeof(rcvbuf_args))
+ return -EINVAL;
+
+ homa_sock_lock(hsk);
+ homa_pool_get_rcvbuf(hsk->buffer_pool, &rcvbuf_args);
+ homa_sock_unlock(hsk);
+ len = sizeof(rcvbuf_args);
+ result = &rcvbuf_args;
+ } else if (optname == SO_HOMA_SERVER) {
+ if (len < sizeof(is_server))
+ return -EINVAL;
+
+ is_server = hsk->is_server;
+ len = sizeof(is_server);
+ result = &is_server;
+ } else {
+ return -ENOPROTOOPT;
+ }
+
+ if (copy_to_sockptr(USER_SOCKPTR(optlen), &len, sizeof(int)))
+ return -EFAULT;
+
+ if (copy_to_sockptr(USER_SOCKPTR(optval), result, len))
+ return -EFAULT;
+
+ return 0;
+}
+
+/**
+ * homa_sendmsg() - Send a request or response message on a Homa socket.
+ * @sk: Socket on which the system call was invoked.
+ * @msg: Structure describing the message to send; the msg_control
+ * field points to additional information.
+ * @length: Number of bytes of the message.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t length)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ struct homa_sendmsg_args args;
+ union sockaddr_in_union *addr;
+ struct homa_rpc *rpc = NULL;
+ int result = 0;
+
+ addr = (union sockaddr_in_union *)msg->msg_name;
+ if (!addr) {
+ result = -EINVAL;
+ goto error;
+ }
+
+ if (unlikely(!msg->msg_control_is_user)) {
+ result = -EINVAL;
+ goto error;
+ }
+ if (unlikely(copy_from_user(&args, (void __user *)msg->msg_control,
+ sizeof(args)))) {
+ result = -EFAULT;
+ goto error;
+ }
+ if (args.flags & ~HOMA_SENDMSG_VALID_FLAGS ||
+ (args.reserved != 0)) {
+ result = -EINVAL;
+ goto error;
+ }
+
+ if (!homa_sock_wmem_avl(hsk)) {
+ result = homa_sock_wait_wmem(hsk,
+ msg->msg_flags & MSG_DONTWAIT ||
+ args.flags &
+ HOMA_SENDMSG_NONBLOCKING);
+ if (result != 0)
+ goto error;
+ }
+
+ if (addr->sa.sa_family != sk->sk_family) {
+ result = -EAFNOSUPPORT;
+ goto error;
+ }
+ if (msg->msg_namelen < sizeof(struct sockaddr_in) ||
+ (msg->msg_namelen < sizeof(struct sockaddr_in6) &&
+ addr->in6.sin6_family == AF_INET6)) {
+ result = -EINVAL;
+ goto error;
+ }
+
+ if (!args.id) {
+ /* This is a request message. */
+ rpc = homa_rpc_new_client(hsk, addr);
+ if (IS_ERR(rpc)) {
+ result = PTR_ERR(rpc);
+ rpc = NULL;
+ goto error;
+ }
+ if (args.flags & HOMA_SENDMSG_PRIVATE)
+ atomic_or(RPC_PRIVATE, &rpc->flags);
+ rpc->completion_cookie = args.completion_cookie;
+ result = homa_message_out_fill(rpc, &msg->msg_iter, 1);
+ if (result)
+ goto error;
+ args.id = rpc->id;
+ homa_rpc_unlock(rpc); /* Locked by homa_rpc_new_client. */
+ rpc = NULL;
+
+ if (unlikely(copy_to_user((void __user *)msg->msg_control,
+ &args, sizeof(args)))) {
+ rpc = homa_find_client_rpc(hsk, args.id);
+ result = -EFAULT;
+ goto error;
+ }
+ } else {
+ /* This is a response message. */
+ struct in6_addr canonical_dest;
+
+ if (args.completion_cookie != 0) {
+ result = -EINVAL;
+ goto error;
+ }
+ canonical_dest = canonical_ipv6_addr(addr);
+
+ rpc = homa_find_server_rpc(hsk, &canonical_dest, args.id);
+ if (!rpc)
+ /* Return without an error if the RPC doesn't exist;
+ * this could be totally valid (e.g. client is
+ * no longer interested in it).
+ */
+ return 0;
+ if (rpc->error) {
+ result = rpc->error;
+ goto error;
+ }
+ if (rpc->state != RPC_IN_SERVICE) {
+ /* Locked by homa_find_server_rpc. */
+ homa_rpc_unlock(rpc);
+ rpc = NULL;
+ result = -EINVAL;
+ goto error;
+ }
+ rpc->state = RPC_OUTGOING;
+
+ result = homa_message_out_fill(rpc, &msg->msg_iter, 1);
+ if (result && rpc->state != RPC_DEAD)
+ goto error;
+ homa_rpc_unlock(rpc); /* Locked by homa_find_server_rpc. */
+ }
+ return 0;
+
+error:
+ if (rpc) {
+ homa_rpc_end(rpc);
+ homa_rpc_unlock(rpc); /* Locked by homa_rpc_new_client,
+ * homa_find_client_rpc, or
+ * homa_find_server_rpc.
+ */
+ }
+ return result;
+}
+
+/**
+ * homa_recvmsg() - Receive a message from a Homa socket.
+ * @sk: Socket on which the system call was invoked.
+ * @msg: Controlling information for the receive.
+ * @len: Total bytes of space available in msg->msg_iov; not used.
+ * @flags: Flags from system call; only MSG_DONTWAIT is used.
+ * @addr_len: Store the length of the sender address here
+ * Return: The length of the message on success, otherwise a negative
+ * errno.
+ */
+int homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
+ int *addr_len)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ struct homa_recvmsg_args control;
+ struct homa_rpc *rpc;
+ int nonblocking;
+ int result;
+
+ if (unlikely(!msg->msg_control)) {
+ /* This test isn't strictly necessary, but it provides a
+ * hook for testing kernel call times.
+ */
+ return -EINVAL;
+ }
+ if (msg->msg_controllen != sizeof(control))
+ return -EINVAL;
+ if (unlikely(copy_from_user(&control, (void __user *)msg->msg_control,
+ sizeof(control))))
+ return -EFAULT;
+ control.completion_cookie = 0;
+
+ if (control.num_bpages > HOMA_MAX_BPAGES ||
+ (control.flags & ~HOMA_RECVMSG_VALID_FLAGS)) {
+ result = -EINVAL;
+ goto done;
+ }
+ if (!hsk->buffer_pool) {
+ result = -EINVAL;
+ goto done;
+ }
+ result = homa_pool_release_buffers(hsk->buffer_pool, control.num_bpages,
+ control.bpage_offsets);
+ control.num_bpages = 0;
+ if (result != 0)
+ goto done;
+
+ nonblocking = ((flags & MSG_DONTWAIT) ||
+ (control.flags & HOMA_RECVMSG_NONBLOCKING));
+ if (control.id != 0) {
+ rpc = homa_find_client_rpc(hsk, control.id); /* Locks RPC. */
+ if (!rpc) {
+ result = -EINVAL;
+ goto done;
+ }
+ result = homa_wait_private(rpc, nonblocking);
+ if (result != 0) {
+ homa_rpc_unlock(rpc);
+ control.id = 0;
+ goto done;
+ }
+ } else {
+ rpc = homa_wait_shared(hsk, nonblocking);
+ if (IS_ERR(rpc)) {
+ /* If we get here, it means there was an error that
+ * prevented us from finding an RPC to return. Errors
+ * in the RPC itself are handled below.
+ */
+ result = PTR_ERR(rpc);
+ goto done;
+ }
+ }
+ result = rpc->error ? rpc->error : rpc->msgin.length;
+
+ /* Collect result information. */
+ control.id = rpc->id;
+ control.completion_cookie = rpc->completion_cookie;
+ if (likely(rpc->msgin.length >= 0)) {
+ control.num_bpages = rpc->msgin.num_bpages;
+ memcpy(control.bpage_offsets, rpc->msgin.bpage_offsets,
+ sizeof(rpc->msgin.bpage_offsets));
+ }
+ if (sk->sk_family == AF_INET6) {
+ struct sockaddr_in6 *in6 = msg->msg_name;
+
+ in6->sin6_family = AF_INET6;
+ in6->sin6_port = htons(rpc->dport);
+ in6->sin6_addr = rpc->peer->addr;
+ *addr_len = sizeof(*in6);
+ } else {
+ struct sockaddr_in *in4 = msg->msg_name;
+
+ in4->sin_family = AF_INET;
+ in4->sin_port = htons(rpc->dport);
+ in4->sin_addr.s_addr = ipv6_to_ipv4(rpc->peer->addr);
+ *addr_len = sizeof(*in4);
+ }
+
+ /* This indicates that the application now owns the buffers, so
+ * we won't free them in homa_rpc_end.
+ */
+ rpc->msgin.num_bpages = 0;
+
+ /* Must release the RPC lock (and potentially free the RPC) before
+ * copying the results back to user space.
+ */
+ if (homa_is_client(rpc->id)) {
+ homa_peer_add_ack(rpc);
+ homa_rpc_end(rpc);
+ } else {
+ if (result < 0)
+ homa_rpc_end(rpc);
+ else
+ rpc->state = RPC_IN_SERVICE;
+ }
+ homa_rpc_unlock(rpc); /* Locked by homa_find_client_rpc or
+ * homa_wait_shared.
+ */
+
+ if (test_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags)) {
+ /* There are tasks waiting for tx memory, so reap
+ * immediately.
+ */
+ homa_rpc_reap(hsk, true);
+ }
+
+done:
+ if (unlikely(copy_to_user((__force void __user *)msg->msg_control,
+ &control, sizeof(control))))
+ result = -EFAULT;
+
+ return result;
+}
+
+/**
+ * homa_hash() - Not needed for Homa.
+ * @sk: Socket for the operation
+ * Return: Always 0.
+ */
+int homa_hash(struct sock *sk)
+{
+ return 0;
+}
+
+/**
+ * homa_unhash() - Not needed for Homa.
+ * @sk: Socket for the operation
+ */
+void homa_unhash(struct sock *sk)
+{
+}
+
+/**
+ * homa_get_port() - Invoked (e.g. during bind) to assign a default
+ * port for a socket; a no-op for Homa.
+ * @sk: Socket for the operation
+ * @snum: Requested port number (ignored: Homa assigns ports when
+ * sockets are created).
+ * Return: Zero for success, or a negative errno for an error.
+ */
+int homa_get_port(struct sock *sk, unsigned short snum)
+{
+ /* Homa always assigns ports immediately when a socket is created,
+ * so there is nothing to do here.
+ */
+ return 0;
+}
+
+/**
+ * homa_softirq() - This function is invoked at SoftIRQ level to handle
+ * incoming packets.
+ * @skb: The incoming packet.
+ * Return: Always 0
+ */
+int homa_softirq(struct sk_buff *skb)
+{
+ struct sk_buff *packets, *other_pkts, *next;
+ struct sk_buff **prev_link, **other_link;
+ struct homa *homa = homa_from_skb(skb);
+ struct homa_common_hdr *h;
+ int header_offset;
+
+ /* skb may actually contain many distinct packets, linked through
+ * skb_shinfo(skb)->frag_list by the Homa GRO mechanism. Make a
+ * pass through the list to process all of the short packets,
+ * leaving the longer packets in the list. Also, perform various
+ * prep/cleanup/error checking functions.
+ */
+ skb->next = skb_shinfo(skb)->frag_list;
+ skb_shinfo(skb)->frag_list = NULL;
+ packets = skb;
+ prev_link = &packets;
+ for (skb = packets; skb; skb = next) {
+ next = skb->next;
+
+ /* Make the header available at skb->data, even if the packet
+ * is fragmented. One complication: it's possible that the IP
+ * header hasn't yet been removed (this happens for GRO packets
+ * on the frag_list, since they aren't handled explicitly by IP).
+ */
+ if (!homa_make_header_avl(skb))
+ goto discard;
+ header_offset = skb_transport_header(skb) - skb->data;
+ if (header_offset)
+ __skb_pull(skb, header_offset);
+
+ /* Reject packets that are too short or have bogus types. */
+ h = (struct homa_common_hdr *)skb->data;
+ if (unlikely(skb->len < sizeof(struct homa_common_hdr) ||
+ h->type < DATA || h->type >= BOGUS ||
+ skb->len < header_lengths[h->type - DATA]))
+ goto discard;
+
+ /* Process the packet now if it is a control packet or
+ * if it contains an entire short message.
+ */
+ if (h->type != DATA || ntohl(((struct homa_data_hdr *)h)
+ ->message_length) < 1400) {
+ *prev_link = skb->next;
+ skb->next = NULL;
+ homa_dispatch_pkts(skb, homa);
+ } else {
+ prev_link = &skb->next;
+ }
+ continue;
+
+discard:
+ *prev_link = skb->next;
+ kfree_skb(skb);
+ }
+
+ /* Now process the longer packets. Each iteration of this loop
+ * collects all of the packets for a particular RPC and dispatches
+ * them (batching the packets for an RPC allows more efficient
+ * generation of grants).
+ */
+ while (packets) {
+ struct in6_addr saddr, saddr2;
+ struct homa_common_hdr *h2;
+ struct sk_buff *skb2;
+
+ skb = packets;
+ prev_link = &skb->next;
+ saddr = skb_canonical_ipv6_saddr(skb);
+ other_pkts = NULL;
+ other_link = &other_pkts;
+ h = (struct homa_common_hdr *)skb->data;
+ for (skb2 = skb->next; skb2; skb2 = next) {
+ next = skb2->next;
+ h2 = (struct homa_common_hdr *)skb2->data;
+ if (h2->sender_id == h->sender_id) {
+ saddr2 = skb_canonical_ipv6_saddr(skb2);
+ if (ipv6_addr_equal(&saddr, &saddr2)) {
+ *prev_link = skb2;
+ prev_link = &skb2->next;
+ continue;
+ }
+ }
+ *other_link = skb2;
+ other_link = &skb2->next;
+ }
+ *prev_link = NULL;
+ *other_link = NULL;
+ homa_dispatch_pkts(packets, homa);
+ packets = other_pkts;
+ }
+
+ return 0;
+}
+
+/**
+ * homa_backlog_rcv() - Invoked to handle packets saved on a socket's
+ * backlog because it was locked when the packets first arrived.
+ * @sk: Homa socket that owns the packet's destination port.
+ * @skb: The incoming packet. This function takes ownership of the packet
+ * (we'll delete it).
+ *
+ * Return: Always returns 0.
+ */
+int homa_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ pr_warn_once("unimplemented backlog_rcv invoked on Homa socket\n");
+ kfree_skb(skb);
+ return 0;
+}
+
+/**
+ * homa_err_handler_v4() - Invoked by IP to handle an incoming error
+ * packet, such as ICMP UNREACHABLE.
+ * @skb: The incoming packet.
+ * @info: Additional information about the error; its meaning depends
+ * on the error type.
+ *
+ * Return: zero, or a negative errno if the error couldn't be handled here.
+ */
+int homa_err_handler_v4(struct sk_buff *skb, u32 info)
+{
+ const struct icmphdr *icmp = icmp_hdr(skb);
+ struct homa *homa = homa_from_skb(skb);
+ struct in6_addr daddr;
+ int type = icmp->type;
+ int code = icmp->code;
+ struct iphdr *iph;
+ int error = 0;
+ int port = 0;
+
+ iph = (struct iphdr *)(skb->data);
+ ipv6_addr_set_v4mapped(iph->daddr, &daddr);
+ if (type == ICMP_DEST_UNREACH && code == ICMP_PORT_UNREACH) {
+ struct homa_common_hdr *h = (struct homa_common_hdr *)(skb->data
+ + iph->ihl * 4);
+
+ port = ntohs(h->dport);
+ error = -ENOTCONN;
+ } else if (type == ICMP_DEST_UNREACH) {
+ if (code == ICMP_PROT_UNREACH)
+ error = -EPROTONOSUPPORT;
+ else
+ error = -EHOSTUNREACH;
+ } else {
+ pr_notice("%s invoked with info %x, ICMP type %d, ICMP code %d\n",
+ __func__, info, type, code);
+ }
+ if (error != 0)
+ homa_abort_rpcs(homa, &daddr, port, error);
+ return 0;
+}
+
+/**
+ * homa_err_handler_v6() - Invoked by IP to handle an incoming error
+ * packet, such as ICMP UNREACHABLE.
+ * @skb: The incoming packet.
+ * @opt: Not used.
+ * @type: Type of ICMP packet.
+ * @code: Additional information about the error.
+ * @offset: Not used.
+ * @info: Additional information about the error; its meaning depends
+ * on the error type.
+ *
+ * Return: zero, or a negative errno if the error couldn't be handled here.
+ */
+int homa_err_handler_v6(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ u8 type, u8 code, int offset, __be32 info)
+{
+ const struct ipv6hdr *iph = (const struct ipv6hdr *)skb->data;
+ struct homa *homa = homa_from_skb(skb);
+ int error = 0;
+ int port = 0;
+
+ if (type == ICMPV6_DEST_UNREACH && code == ICMPV6_PORT_UNREACH) {
+ const struct homa_common_hdr *h;
+
+ h = (struct homa_common_hdr *)(skb->data + sizeof(*iph));
+ port = ntohs(h->dport);
+ error = -ENOTCONN;
+ } else if (type == ICMPV6_DEST_UNREACH && code == ICMPV6_ADDR_UNREACH) {
+ error = -EHOSTUNREACH;
+ } else if (type == ICMPV6_PARAMPROB && code == ICMPV6_UNK_NEXTHDR) {
+ error = -EPROTONOSUPPORT;
+ }
+ if (error != 0)
+ homa_abort_rpcs(homa, &iph->daddr, port, error);
+ return 0;
+}
+
+/**
+ * homa_poll() - Invoked by Linux as part of implementing select, poll,
+ * epoll, etc.
+ * @file: Open file that is participating in a poll, select, etc.
+ * @sock: A Homa socket, associated with @file.
+ * @wait: This table will be registered with the socket, so that it
+ * is notified when the socket's ready state changes.
+ *
+ * Return: A mask of bits such as EPOLLIN, which indicate the current
+ * state of the socket.
+ */
+__poll_t homa_poll(struct file *file, struct socket *sock,
+ struct poll_table_struct *wait)
+{
+ struct homa_sock *hsk = homa_sk(sock->sk);
+ __poll_t mask;
+
+ mask = 0;
+ sock_poll_wait(file, sock, wait);
+ if (homa_sock_wmem_avl(hsk))
+ mask |= EPOLLOUT | EPOLLWRNORM;
+ else
+ set_bit(SOCK_NOSPACE, &hsk->sock.sk_socket->flags);
+
+ if (hsk->shutdown)
+ mask |= EPOLLIN;
+
+ if (!list_empty(&hsk->ready_rpcs))
+ mask |= EPOLLIN | EPOLLRDNORM;
+ return mask;
+}
+
+/**
+ * homa_hrtimer() - This function is invoked by the hrtimer mechanism to
+ * wake up the timer thread. Runs at IRQ level.
+ * @timer: The timer that triggered; not used.
+ *
+ * Return: Always HRTIMER_NORESTART.
+ */
+enum hrtimer_restart homa_hrtimer(struct hrtimer *timer)
+{
+ struct homa *homa;
+
+ homa = container_of(timer, struct homa, hrtimer);
+ wake_up_process(homa->timer_kthread);
+ return HRTIMER_NORESTART;
+}
+
+/**
+ * homa_timer_main() - Top-level function for the timer thread.
+ * @transport: Pointer to struct homa.
+ *
+ * Return: Always 0.
+ */
+int homa_timer_main(void *transport)
+{
+ struct homa *homa = (struct homa *)transport;
+ ktime_t tick_interval;
+ u64 nsec;
+
+ hrtimer_setup(&homa->hrtimer, homa_hrtimer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ nsec = 1000000; /* 1 ms */
+ tick_interval = ns_to_ktime(nsec);
+ while (1) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (!homa->destroyed) {
+ hrtimer_start(&homa->hrtimer, tick_interval,
+ HRTIMER_MODE_REL);
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+ if (homa->destroyed)
+ break;
+ homa_timer(homa);
+ }
+ hrtimer_cancel(&homa->hrtimer);
+ kthread_complete_and_exit(&timer_thread_done, 0);
+ return 0;
+}
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_AUTHOR("John Ousterhout <ouster@cs.stanford.edu>");
+MODULE_DESCRIPTION("Homa transport protocol");
+MODULE_VERSION("1.0");
+
+/* Arrange for this module to be loaded automatically when a Homa socket is
+ * opened. Symbolic constants don't work in the macros below, so the
+ * numeric values for IPPROTO_HOMA (146) and SOCK_DGRAM (2) are used.
+ */
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_INET, 146, 2);
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_INET6, 146, 2);
--
2.43.0
^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH net-next v8 15/15] net: homa: create Makefile and Kconfig
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (13 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 14/15] net: homa: create homa_plumbing.c John Ousterhout
@ 2025-05-02 23:37 ` John Ousterhout
2025-05-06 5:04 ` [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
15 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-02 23:37 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
Before this commit the Homa code is "inert": it won't be compiled
in kernel builds. This commit adds Homa's Makefile and Kconfig, and
also links Homa into net/Makefile and net/Kconfig, so that Homa
will be built during kernel builds if enabled (it is disabled by
default).
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/Kconfig | 1 +
net/Makefile | 1 +
net/homa/Kconfig | 21 +++++++++++++++++++++
net/homa/Makefile | 16 ++++++++++++++++
4 files changed, 39 insertions(+)
create mode 100644 net/homa/Kconfig
create mode 100644 net/homa/Makefile
diff --git a/net/Kconfig b/net/Kconfig
index c3fca69a7c83..d6df0595d1d5 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -247,6 +247,7 @@ endif
source "net/dccp/Kconfig"
source "net/sctp/Kconfig"
+source "net/homa/Kconfig"
source "net/rds/Kconfig"
source "net/tipc/Kconfig"
source "net/atm/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 60ed5190eda8..516b17d0bc6f 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -44,6 +44,7 @@ obj-y += 8021q/
endif
obj-$(CONFIG_IP_DCCP) += dccp/
obj-$(CONFIG_IP_SCTP) += sctp/
+obj-$(CONFIG_HOMA) += homa/
obj-$(CONFIG_RDS) += rds/
obj-$(CONFIG_WIRELESS) += wireless/
obj-$(CONFIG_MAC80211) += mac80211/
diff --git a/net/homa/Kconfig b/net/homa/Kconfig
new file mode 100644
index 000000000000..8ce5fbf08258
--- /dev/null
+++ b/net/homa/Kconfig
@@ -0,0 +1,21 @@
+# SPDX-License-Identifier: BSD-2-Clause
+#
+# Homa transport protocol
+#
+
+menuconfig HOMA
+ tristate "The Homa transport protocol"
+ depends on INET
+ depends on IPV6
+
+ help
+ Homa is a network transport protocol for communication within
+ a datacenter. It provides significantly lower latency than TCP,
+ particularly for workloads containing a mixture of large and small
+ messages operating at high network utilization. At present, Homa
+ has been only partially upstreamed; this version provides bare-bones
+ functionality but is not performant. For more information see the
+ homa(7) man page or check out the Homa Wiki at
+ https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview.
+
+ If unsure, say N.
diff --git a/net/homa/Makefile b/net/homa/Makefile
new file mode 100644
index 000000000000..ed894ebab176
--- /dev/null
+++ b/net/homa/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: BSD-2-Clause
+#
+# Makefile for the Linux implementation of the Homa transport protocol.
+
+obj-$(CONFIG_HOMA) := homa.o
+homa-y:= homa_incoming.o \
+ homa_interest.o \
+ homa_outgoing.o \
+ homa_pacer.o \
+ homa_peer.o \
+ homa_plumbing.o \
+ homa_pool.o \
+ homa_rpc.o \
+ homa_sock.o \
+ homa_timer.o \
+ homa_utils.o
--
2.43.0
* Re: [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa
2025-05-02 23:37 ` [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa John Ousterhout
@ 2025-05-05 7:54 ` Paolo Abeni
2025-05-05 16:14 ` John Ousterhout
2025-05-05 8:03 ` Paolo Abeni
1 sibling, 1 reply; 46+ messages in thread
From: Paolo Abeni @ 2025-05-05 7:54 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/3/25 1:37 AM, John Ousterhout wrote:
> +/**
> + * define HOMA_MAX_BPAGES - The largest number of bpages that will be required
> + * to store an incoming message.
> + */
> +#define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) \
> + >> HOMA_BPAGE_SHIFT)
Minor nit: the above indentation is somewhat uncommon, the preferred
style is:
#define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) >> \
			 HOMA_BPAGE_SHIFT)
> +
> +/**
> + * define HOMA_MIN_DEFAULT_PORT - The 16 bit port space is divided into
> + * two nonoverlapping regions. Ports 1-32767 are reserved exclusively
> + * for well-defined server ports. The remaining ports are used for client
> + * ports; these are allocated automatically by Homa. Port 0 is reserved.
> + */
> +#define HOMA_MIN_DEFAULT_PORT 0x8000
> +
> +/**
> + * struct homa_sendmsg_args - Provides information needed by Homa's
> + * sendmsg; passed to sendmsg using the msg_control field.
> + */
> +struct homa_sendmsg_args {
> + /**
> + * @id: (in/out) An initial value of 0 means a new request is
> + * being sent; nonzero means the message is a reply to the given
> + * id. If the message is a request, then the value is modified to
> + * hold the id of the new RPC.
> + */
> + __u64 id;
> +
> + /**
> + * @completion_cookie: (in) Used only for request messages; will be
> + * returned by recvmsg when the RPC completes. Typically used to
> + * locate app-specific info about the RPC.
> + */
> + __u64 completion_cookie;
> +
> + /**
> + * @flags: (in) OR-ed combination of bits that control the operation.
> + * See below for values.
> + */
> + __u32 flags;
> +
> + /** @reserved: Not currently used. */
> + __u32 reserved;
> +};
> +
> +#if !defined(__cplusplus)
> +_Static_assert(sizeof(struct homa_sendmsg_args) >= 24,
> + "homa_sendmsg_args shrunk");
> +_Static_assert(sizeof(struct homa_sendmsg_args) <= 24,
> + "homa_sendmsg_args grew");
> +#endif
I think these assertions don't belong here; they should be BUILD_BUG_ON() in C
files. Even better, they could be avoided entirely with explicit alignment on
the message struct.
[...]
> +int homa_send(int sockfd, const void *message_buf,
> + size_t length, const struct sockaddr *dest_addr,
> + __u32 addrlen, __u64 *id, __u64 completion_cookie,
> + int flags);
> +int homa_sendv(int sockfd, const struct iovec *iov,
> + int iovcnt, const struct sockaddr *dest_addr,
> + __u32 addrlen, __u64 *id, __u64 completion_cookie,
> + int flags);
> +ssize_t homa_reply(int sockfd, const void *message_buf,
> + size_t length, const struct sockaddr *dest_addr,
> + __u32 addrlen, __u64 id);
> +ssize_t homa_replyv(int sockfd, const struct iovec *iov,
> + int iovcnt, const struct sockaddr *dest_addr,
> + __u32 addrlen, __u64 id);
I assume the above are user-space function declarations? If so, they
don't belong here.
/P
* Re: [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa
2025-05-02 23:37 ` [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa John Ousterhout
2025-05-05 7:54 ` Paolo Abeni
@ 2025-05-05 8:03 ` Paolo Abeni
2025-05-05 22:53 ` John Ousterhout
1 sibling, 1 reply; 46+ messages in thread
From: Paolo Abeni @ 2025-05-05 8:03 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/3/25 1:37 AM, John Ousterhout wrote:
> +/* Flag bits for homa_sendmsg_args.flags (see man page for documentation):
> + */
> +#define HOMA_SENDMSG_PRIVATE 0x01
> +#define HOMA_SENDMSG_NONBLOCKING 0x02
It's unclear why you need to define a new mechanism instead of using
MSG_DONTWAIT. This is possibly not needed and deserves at least a good
motivation in the patch introducing it.
/P
* Re: [PATCH net-next v8 02/15] net: homa: create homa_wire.h
2025-05-02 23:37 ` [PATCH net-next v8 02/15] net: homa: create homa_wire.h John Ousterhout
@ 2025-05-05 8:28 ` Paolo Abeni
2025-05-05 23:54 ` John Ousterhout
2025-05-22 5:31 ` John Ousterhout
0 siblings, 2 replies; 46+ messages in thread
From: Paolo Abeni @ 2025-05-05 8:28 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/3/25 1:37 AM, John Ousterhout wrote:
> diff --git a/net/homa/homa_wire.h b/net/homa/homa_wire.h
> new file mode 100644
> index 000000000000..47693244c3ec
> --- /dev/null
> +++ b/net/homa/homa_wire.h
I'm wondering why you keep the wire-struct definition outside the uAPI -
the opposite of what other protocols do.
> @@ -0,0 +1,362 @@
> +/* SPDX-License-Identifier: BSD-2-Clause */
> +
> +/* This file defines the on-the-wire format of Homa packets. */
> +
> +#ifndef _HOMA_WIRE_H
> +#define _HOMA_WIRE_H
> +
> +#include <linux/skbuff.h>
> +
> +/* Defines the possible types of Homa packets.
> + *
> + * See the xxx_header structs below for more information about each type.
> + */
> +enum homa_packet_type {
> + DATA = 0x10,
> + RESEND = 0x12,
> + RPC_UNKNOWN = 0x13,
> + BUSY = 0x14,
> + NEED_ACK = 0x17,
> + ACK = 0x18,
> + BOGUS = 0x19, /* Used only in unit tests. */
> + /* If you add a new type here, you must also do the following:
> + * 1. Change BOGUS so it is the highest opcode
If you instead define a 'MAX' value, the required update policy would be
self-explanatory and you would not need to expose test details.
> + * 2. Add support for the new opcode in homa_print_packet,
> + * homa_print_packet_short, homa_symbol_for_type, and mock_skb_new.
> + * 3. Add the header length to header_lengths in homa_plumbing.c.
> + */
> +};
> +
> +/** define HOMA_IPV6_HEADER_LENGTH - Size of IP header (V6). */
> +#define HOMA_IPV6_HEADER_LENGTH 40
> +
> +/** define HOMA_IPV4_HEADER_LENGTH - Size of IP header (V4). */
> +#define HOMA_IPV4_HEADER_LENGTH 20
I suspect you will be better off using sizeof(<relevant struct>). Making
protocol-specific definitions for common/global constants is somewhat
confusing and unexpected.
> +
> +/**
> + * define HOMA_SKB_EXTRA - How many bytes of additional space to allow at the
> + * beginning of each sk_buff, before the IP header. This includes room for a
> + * VLAN header and also includes some extra space, "just to be safe" (not
> + * really sure if this is needed).
> + */
> +#define HOMA_SKB_EXTRA 40
You could use:
#define MAX_HOMA_HEADER MAX_TCP_HEADER
to leverage a consolidated value covering most use-cases and kernel configs.
> +
> +/**
> + * define HOMA_ETH_OVERHEAD - Number of bytes per Ethernet packet for Ethernet
> + * header, CRC, preamble, and inter-packet gap.
> + */
> +#define HOMA_ETH_OVERHEAD 42
It's not clear why the protocol should be interested in MAC-specific
details. What if the MAC is not Ethernet?
[...]
> + /**
> + * @type: Homa packet type (one of the values of the homa_packet_type
> + * enum). Corresponds to the low-order byte of the ack in TCP.
> + */
> + __u8 type;
If you keep this outside uAPI you should use 'u8'
[...]
> +_Static_assert(sizeof(struct homa_data_hdr) <= HOMA_MAX_HEADER,
> + "homa_data_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
> +_Static_assert(sizeof(struct homa_data_hdr) >= HOMA_MIN_PKT_LENGTH,
> + "homa_data_hdr too small: Homa doesn't currently have code to pad data packets");
> +_Static_assert(((sizeof(struct homa_data_hdr) - sizeof(struct homa_seg_hdr)) &
> + 0x3) == 0,
> + " homa_data_hdr length not a multiple of 4 bytes (required for TCP/TSO compatibility");
Please use BUILD_BUG_ON() in a .c file instead. Many other cases below.
/P
* Re: [PATCH net-next v8 04/15] net: homa: create homa_pool.h and homa_pool.c
2025-05-02 23:37 ` [PATCH net-next v8 04/15] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
@ 2025-05-05 9:51 ` Paolo Abeni
2025-05-06 23:28 ` John Ousterhout
0 siblings, 1 reply; 46+ messages in thread
From: Paolo Abeni @ 2025-05-05 9:51 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/3/25 1:37 AM, John Ousterhout wrote:
[...]
> +/**
> + * set_bpages_needed() - Set the bpages_needed field of @pool based
> + * on the length of the first RPC that's waiting for buffer space.
> + * The caller must own the lock for @pool->hsk.
> + * @pool: Pool to update.
> + */
> +static void set_bpages_needed(struct homa_pool *pool)
> +{
> + struct homa_rpc *rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
> + struct homa_rpc, buf_links);
Minor nit: please insert an empty line between variable declaration and
code.
> + pool->bpages_needed = (rpc->msgin.length + HOMA_BPAGE_SIZE - 1)
> + >> HOMA_BPAGE_SHIFT;
Minor nit: please fix the indentation above
> +}
> +
> +/**
> + * homa_pool_new() - Allocate and initialize a new homa_pool (it will have
> + * no region associated with it until homa_pool_set_region is invoked).
> + * @hsk: Socket the pool will be associated with.
> + * Return: A pointer to the new pool or a negative errno.
> + */
> +struct homa_pool *homa_pool_new(struct homa_sock *hsk)
The preferred name for an allocator includes 'alloc', not 'new'.
> +{
> + struct homa_pool *pool;
> +
> + pool = kzalloc(sizeof(*pool), GFP_ATOMIC);
You should try to use GFP_KERNEL allocation as much as you can, and use
GFP_ATOMIC only in atomic context. If needed, try to move the function
outside the atomic scope doing the allocation before acquiring the
lock/rcu.
> + if (!pool)
> + return ERR_PTR(-ENOMEM);
> + pool->hsk = hsk;
> + return pool;
> +}
> +
> +/**
> + * homa_pool_set_region() - Associate a region of memory with a pool.
> + * @pool: Pool the region will be associated with. Must not currently
> + * have a region associated with it.
> + * @region: First byte of the memory region for the pool, allocated
> + * by the application; must be page-aligned.
> + * @region_size: Total number of bytes available at @buf_region.
> + * Return: Either zero (for success) or a negative errno for failure.
> + */
> +int homa_pool_set_region(struct homa_pool *pool, void __user *region,
> + u64 region_size)
> +{
> + int i, result;
> +
> + if (pool->region)
> + return -EINVAL;
> +
> + if (((uintptr_t)region) & ~PAGE_MASK)
> + return -EINVAL;
> + pool->region = (char __user *)region;
> + pool->num_bpages = region_size >> HOMA_BPAGE_SHIFT;
> + pool->descriptors = NULL;
> + pool->cores = NULL;
> + if (pool->num_bpages < MIN_POOL_SIZE) {
> + result = -EINVAL;
> + goto error;
> + }
> + pool->descriptors = kmalloc_array(pool->num_bpages,
> + sizeof(struct homa_bpage),
> + GFP_ATOMIC | __GFP_ZERO);
> + if (!pool->descriptors) {
> + result = -ENOMEM;
> + goto error;
> + }
> + for (i = 0; i < pool->num_bpages; i++) {
> + struct homa_bpage *bp = &pool->descriptors[i];
> +
> + spin_lock_init(&bp->lock);
> + bp->owner = -1;
> + }
> + atomic_set(&pool->free_bpages, pool->num_bpages);
> + pool->bpages_needed = INT_MAX;
> +
> + /* Allocate and initialize core-specific data. */
> + pool->cores = alloc_percpu_gfp(struct homa_pool_core,
> + GFP_ATOMIC | __GFP_ZERO);
> + if (!pool->cores) {
> + result = -ENOMEM;
> + goto error;
> + }
> + pool->num_cores = nr_cpu_ids;
The 'num_cores' field is likely not needed, and it's never used in this
series.
> + pool->check_waiting_invoked = 0;
> +
> + return 0;
> +
> +error:
> + kfree(pool->descriptors);
> + free_percpu(pool->cores);
The above assumes that 'pool' will be zeroed at allocation time, but the
allocator does not do that. You should probably add the __GFP_ZERO flag
to the pool allocator.
> + pool->region = NULL;
> + return result;
> +}
> +
> +/**
> + * homa_pool_destroy() - Destructor for homa_pool. After this method
> + * returns, the object should not be used (it will be freed here).
> + * @pool: Pool to destroy.
> + */
> +void homa_pool_destroy(struct homa_pool *pool)
> +{
> + if (pool->region) {
> + kfree(pool->descriptors);
> + free_percpu(pool->cores);
> + pool->region = NULL;
> + }
> + kfree(pool);
> +}
> +
> +/**
> + * homa_pool_get_rcvbuf() - Return information needed to handle getsockopt
> + * for HOMA_SO_RCVBUF.
> + * @pool: Pool for which information is needed.
> + * @args: Store info here.
> + */
> +void homa_pool_get_rcvbuf(struct homa_pool *pool,
> + struct homa_rcvbuf_args *args)
> +{
> + args->start = (uintptr_t)pool->region;
> + args->length = pool->num_bpages << HOMA_BPAGE_SHIFT;
> +}
> +
> +/**
> + * homa_bpage_available() - Check whether a bpage is available for use.
> + * @bpage: Bpage to check
> + * @now: Current time (sched_clock() units)
> + * Return: True if the bpage is free or if it can be stolen, otherwise
> + * false.
> + */
> +bool homa_bpage_available(struct homa_bpage *bpage, u64 now)
> +{
> + int ref_count = atomic_read(&bpage->refs);
> +
> + return ref_count == 0 || (ref_count == 1 && bpage->owner >= 0 &&
> + bpage->expiration <= now);
Minor nit: please fix the indentation above. Other cases below. Please
validate your patch with the checkpatch.pl script.
> +}
> +
> +/**
> + * homa_pool_get_pages() - Allocate one or more full pages from the pool.
> + * @pool: Pool from which to allocate pages
> + * @num_pages: Number of pages needed
> + * @pages: The indices of the allocated pages are stored here; caller
> + * must ensure this array is big enough. Reference counts have
> + * been set to 1 on all of these pages (or 2 if set_owner
> + * was specified).
> + * @set_owner: If nonzero, the current core is marked as owner of all
> + * of the allocated pages (and the expiration time is also
> + * set). Otherwise the pages are left unowned.
> + * Return: 0 for success, -1 if there wasn't enough free space in the pool.
> + */
> +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, u32 *pages,
> + int set_owner)
> +{
> + int core_num = smp_processor_id();
> + struct homa_pool_core *core;
> + u64 now = sched_clock();
From sched_clock() documentation:
sched_clock() has no promise of monotonicity or bounded drift between
CPUs, use (which you should not) requires disabling IRQs.
Can't be used for an expiration time. You could use 'jiffies' instead.
> + int alloced = 0;
> + int limit = 0;
> +
> + core = this_cpu_ptr(pool->cores);
> + if (atomic_sub_return(num_pages, &pool->free_bpages) < 0) {
> + atomic_add(num_pages, &pool->free_bpages);
> + return -1;
> + }
> +
> + /* Once we get to this point we know we will be able to find
> + * enough free pages; now we just have to find them.
> + */
> + while (alloced != num_pages) {
> + struct homa_bpage *bpage;
> + int cur;
> +
> + /* If we don't need to use all of the bpages in the pool,
> + * then try to use only the ones with low indexes. This
> + * will reduce the cache footprint for the pool by reusing
> + * a few bpages over and over. Specifically this code will
> + * not consider any candidate page whose index is >= limit.
> + * Limit is chosen to make sure there are a reasonable
> + * number of free pages in the range, so we won't have to
> + * check a huge number of pages.
> + */
> + if (limit == 0) {
> + int extra;
> +
> + limit = pool->num_bpages
> + - atomic_read(&pool->free_bpages);
Nit: bad indentation above; the operator should stay on the first line.
> + extra = limit >> 2;
> + limit += (extra < MIN_EXTRA) ? MIN_EXTRA : extra;
> + if (limit > pool->num_bpages)
> + limit = pool->num_bpages;
> + }
> +
> + cur = core->next_candidate;
> + core->next_candidate++;
> + if (cur >= limit) {
> + core->next_candidate = 0;
> +
> + /* Must recompute the limit for each new loop through
> + * the bpage array: we may need to consider a larger
> + * range of pages because of concurrent allocations.
> + */
> + limit = 0;
> + continue;
> + }
> + bpage = &pool->descriptors[cur];
> +
> + /* Figure out whether this candidate is free (or can be
> + * stolen). Do a quick check without locking the page, and
> + * if the page looks promising, then lock it and check again
> + * (must check again in case someone else snuck in and
> + * grabbed the page).
> + */
> + if (!homa_bpage_available(bpage, now))
> + continue;
homa_bpage_available() accesses bpage without holding the lock, so it needs
READ_ONCE() annotations on the relevant fields, and you need to add paired
WRITE_ONCE() calls when updating them.
> + if (!spin_trylock_bh(&bpage->lock))
Why only trylock? I have a vague memory of some discussion on this point
in a previous revision. You should at least add a comment here or in the
commit message explaining why a plain spin_lock does not fit.
> + continue;
> + if (!homa_bpage_available(bpage, now)) {
> + spin_unlock_bh(&bpage->lock);
> + continue;
> + }
> + if (bpage->owner >= 0)
> + atomic_inc(&pool->free_bpages);
> + if (set_owner) {
> + atomic_set(&bpage->refs, 2);
> + bpage->owner = core_num;
> + bpage->expiration = now + 1000 *
> + pool->hsk->homa->bpage_lease_usecs;
> + } else {
> + atomic_set(&bpage->refs, 1);
> + bpage->owner = -1;
> + }
> + spin_unlock_bh(&bpage->lock);
> + pages[alloced] = cur;
> + alloced++;
> + }
> + return 0;
> +}
> +
> +/**
> + * homa_pool_allocate() - Allocate buffer space for an RPC.
> + * @rpc: RPC that needs space allocated for its incoming message (space must
> + * not already have been allocated). The fields @msgin->num_buffers
> + * and @msgin->buffers are filled in. Must be locked by caller.
> + * Return: The return value is normally 0, which means either buffer space
> + * was allocated or the @rpc was queued on @hsk->waiting. If a fatal error
> + * occurred, such as no buffer pool present, then a negative errno is
> + * returned.
> + */
> +int homa_pool_allocate(struct homa_rpc *rpc)
> + __must_hold(&rpc->bucket->lock)
> +{
> + struct homa_pool *pool = rpc->hsk->buffer_pool;
> + int full_pages, partial, i, core_id;
> + u32 pages[HOMA_MAX_BPAGES];
> + struct homa_pool_core *core;
> + struct homa_bpage *bpage;
> + struct homa_rpc *other;
> +
> + if (!pool->region)
> + return -ENOMEM;
> +
> + /* First allocate any full bpages that are needed. */
> + full_pages = rpc->msgin.length >> HOMA_BPAGE_SHIFT;
> + if (unlikely(full_pages)) {
> + if (homa_pool_get_pages(pool, full_pages, pages, 0) != 0)
> + goto out_of_space;
> + for (i = 0; i < full_pages; i++)
> + rpc->msgin.bpage_offsets[i] = pages[i] <<
> + HOMA_BPAGE_SHIFT;
> + }
> + rpc->msgin.num_bpages = full_pages;
> +
> + /* The last chunk may be less than a full bpage; for this we use
> + * the bpage that we own (and reuse it for multiple messages).
> + */
> + partial = rpc->msgin.length & (HOMA_BPAGE_SIZE - 1);
> + if (unlikely(partial == 0))
> + goto success;
> + core_id = smp_processor_id();
Is this code running in a non-preemptible scope? Otherwise you need to use
get_cpu() here and put_cpu() when you are done with 'core_id'.
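If preemption is not already disabled, the usual shape is (untested sketch):

```c
	core_id = get_cpu();		/* disables preemption */
	core = this_cpu_ptr(pool->cores);
	/* ... everything that relies on staying on this CPU ... */
	put_cpu();			/* re-enables preemption */
```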
> + core = this_cpu_ptr(pool->cores);
> + bpage = &pool->descriptors[core->page_hint];
> + if (!spin_trylock_bh(&bpage->lock))
> + spin_lock_bh(&bpage->lock);
I think I already commented on this pattern. Please don't use it.
> + if (bpage->owner != core_id) {
> + spin_unlock_bh(&bpage->lock);
> + goto new_page;
> + }
> + if ((core->allocated + partial) > HOMA_BPAGE_SIZE) {
> + if (atomic_read(&bpage->refs) == 1) {
> + /* Bpage is totally free, so we can reuse it. */
> + core->allocated = 0;
> + } else {
> + bpage->owner = -1;
> +
> + /* We know the reference count can't reach zero here
> + * because of check above, so we won't have to decrement
> + * pool->free_bpages.
> + */
> + atomic_dec_return(&bpage->refs);
> + spin_unlock_bh(&bpage->lock);
> + goto new_page;
> + }
> + }
> + bpage->expiration = sched_clock() +
> + 1000 * pool->hsk->homa->bpage_lease_usecs;
> + atomic_inc(&bpage->refs);
> + spin_unlock_bh(&bpage->lock);
> + goto allocate_partial;
> +
> + /* Can't use the current page; get another one. */
> +new_page:
> + if (homa_pool_get_pages(pool, 1, pages, 1) != 0) {
> + homa_pool_release_buffers(pool, rpc->msgin.num_bpages,
> + rpc->msgin.bpage_offsets);
> + rpc->msgin.num_bpages = 0;
> + goto out_of_space;
> + }
> + core->page_hint = pages[0];
> + core->allocated = 0;
> +
> +allocate_partial:
> + rpc->msgin.bpage_offsets[rpc->msgin.num_bpages] = core->allocated
> + + (core->page_hint << HOMA_BPAGE_SHIFT);
> + rpc->msgin.num_bpages++;
> + core->allocated += partial;
> +
> +success:
> + return 0;
> +
> + /* We get here if there wasn't enough buffer space for this
> + * message; add the RPC to hsk->waiting_for_bufs.
Please also add a comment describing why waiting RPCs are sorted by
message size.
> + */
> +out_of_space:
> + homa_sock_lock(pool->hsk);
> + list_for_each_entry(other, &pool->hsk->waiting_for_bufs, buf_links) {
> + if (other->msgin.length > rpc->msgin.length) {
> + list_add_tail(&rpc->buf_links, &other->buf_links);
> + goto queued;
> + }
> + }
> + list_add_tail(&rpc->buf_links, &pool->hsk->waiting_for_bufs);
> +
> +queued:
> + set_bpages_needed(pool);
> + homa_sock_unlock(pool->hsk);
> + return 0;
> +}
> +
> +/**
> + * homa_pool_get_buffer() - Given an RPC, figure out where to store incoming
> + * message data.
> + * @rpc: RPC for which incoming message data is being processed; its
> + * msgin must be properly initialized and buffer space must have
> + * been allocated for the message.
> + * @offset: Offset within @rpc's incoming message.
> + * @available: Will be filled in with the number of bytes of space available
> + * at the returned address (could be zero if offset is
> + * (erroneously) past the end of the message).
> + * Return: The application's virtual address for buffer space corresponding
> + * to @offset in the incoming message for @rpc.
> + */
> +void __user *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
> + int *available)
> +{
> + int bpage_index, bpage_offset;
> +
> + bpage_index = offset >> HOMA_BPAGE_SHIFT;
> + if (offset >= rpc->msgin.length) {
> + WARN_ONCE(true, "%s got offset %d >= message length %d\n",
> + __func__, offset, rpc->msgin.length);
> + *available = 0;
> + return NULL;
> + }
> + bpage_offset = offset & (HOMA_BPAGE_SIZE - 1);
> + *available = (bpage_index < (rpc->msgin.num_bpages - 1))
> + ? HOMA_BPAGE_SIZE - bpage_offset
> + : rpc->msgin.length - offset;
> + return rpc->hsk->buffer_pool->region +
> + rpc->msgin.bpage_offsets[bpage_index] + bpage_offset;
> +}
> +
> +/**
> + * homa_pool_release_buffers() - Release buffer space so that it can be
> + * reused.
> + * @pool: Pool that the buffer space belongs to. Doesn't need to
> + * be locked.
> + * @num_buffers: How many buffers to release.
> + * @buffers: Points to @num_buffers values, each of which is an offset
> + * from the start of the pool to the buffer to be released.
> + * Return: 0 for success, otherwise a negative errno.
> + */
> +int homa_pool_release_buffers(struct homa_pool *pool, int num_buffers,
> + u32 *buffers)
> +{
> + int result = 0;
> + int i;
> +
> + if (!pool->region)
> + return result;
> + for (i = 0; i < num_buffers; i++) {
> + u32 bpage_index = buffers[i] >> HOMA_BPAGE_SHIFT;
> + struct homa_bpage *bpage = &pool->descriptors[bpage_index];
> +
> + if (bpage_index < pool->num_bpages) {
> + if (atomic_dec_return(&bpage->refs) == 0)
> + atomic_inc(&pool->free_bpages);
> + } else {
> + result = -EINVAL;
> + }
> + }
> + return result;
> +}
> +
> +/**
> + * homa_pool_check_waiting() - Checks to see if there are enough free
> + * bpages to wake up any RPCs that were blocked. Whenever
> + * homa_pool_release_buffers is invoked, this function must be invoked later,
> + * at a point when the caller holds no locks (homa_pool_release_buffers may
> + * be invoked with locks held, so it can't safely invoke this function).
> + * This is regrettably tricky, but I can't think of a better solution.
> + * @pool: Information about the buffer pool.
> + */
> +void homa_pool_check_waiting(struct homa_pool *pool)
> +{
> + if (!pool->region)
> + return;
> + while (atomic_read(&pool->free_bpages) >= pool->bpages_needed) {
> + struct homa_rpc *rpc;
> +
> + homa_sock_lock(pool->hsk);
> + if (list_empty(&pool->hsk->waiting_for_bufs)) {
> + pool->bpages_needed = INT_MAX;
> + homa_sock_unlock(pool->hsk);
> + break;
> + }
> + rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
> + struct homa_rpc, buf_links);
> + if (!homa_rpc_try_lock(rpc)) {
> + /* Can't just spin on the RPC lock because we're
> + * holding the socket lock (see sync.txt). Instead,
The documentation should live under:
Documentation/networking/
likely in its own subdir, and must be in reStructuredText format.
Here you should just mention that the lock acquisition order is rpc lock
-> homa socket lock.
> + * release the socket lock and try the entire
> + * operation again.
> + */
> + homa_sock_unlock(pool->hsk);
> + continue;
> + }
> + list_del_init(&rpc->buf_links);
> + if (list_empty(&pool->hsk->waiting_for_bufs))
> + pool->bpages_needed = INT_MAX;
> + else
> + set_bpages_needed(pool);
> + homa_sock_unlock(pool->hsk);
> + homa_pool_allocate(rpc);
Why don't you need to check the allocation return value here?
> + homa_rpc_unlock(rpc);
> + }
> +}
> diff --git a/net/homa/homa_pool.h b/net/homa/homa_pool.h
> new file mode 100644
> index 000000000000..d52d61afa557
> --- /dev/null
> +++ b/net/homa/homa_pool.h
> @@ -0,0 +1,149 @@
> +/* SPDX-License-Identifier: BSD-2-Clause */
> +
> +/* This file contains definitions used to manage user-space buffer pools.
> + */
> +
> +#ifndef _HOMA_POOL_H
> +#define _HOMA_POOL_H
> +
> +#include <linux/percpu.h>
> +
> +#include "homa_rpc.h"
> +
> +/**
> + * struct homa_bpage - Contains information about a single page in
> + * a buffer pool.
> + */
> +struct homa_bpage {
> + union {
> + /**
> + * @cache_line: Ensures that each homa_bpage object
> + * is exactly one cache line long.
> + */
> + char cache_line[L1_CACHE_BYTES];
Instead of the struct/union nesting just use ____cacheline_aligned
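i.e. something like this (field list abbreviated from the code above,
untested):

```c
struct homa_bpage {
	spinlock_t lock;
	atomic_t refs;
	int owner;
	u64 expiration;
} ____cacheline_aligned;
```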
[...]
> +* Homa's approach means that socket shutdown and deletion can potentially
> + occur while operations are underway that hold RPC locks but not the socket
> + lock. This creates several potential problems:
> + * A socket might be deleted and its memory reclaimed while an RPC still
> + has access to it. Homa assumes that Linux will prevent socket deletion
> + while the kernel call is executing.
This last sentence is not clear to me. Do you mean that the kernel
ensures that the socket is freed after the close() syscall?
/P
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 03/15] net: homa: create shared Homa header files
2025-05-02 23:37 ` [PATCH net-next v8 03/15] net: homa: create shared Homa header files John Ousterhout
@ 2025-05-05 10:20 ` Paolo Abeni
2025-05-06 17:45 ` John Ousterhout
0 siblings, 1 reply; 46+ messages in thread
From: Paolo Abeni @ 2025-05-05 10:20 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/3/25 1:37 AM, John Ousterhout wrote:
[...]
> +/* Forward declarations. */
> +struct homa;
> +struct homa_peer;
> +struct homa_rpc;
> +struct homa_sock;
> +
> +#define sizeof32(type) ((int)(sizeof(type)))
(u32) instead of (int). I think you should try to avoid using this
define, which is not very nice by itself.
> +
> +#ifdef __CHECKER__
> +#define __context__(x, y, z) __attribute__((context(x, y, z)))
> +#else
> +#define __context__(...)
> +#endif /* __CHECKER__ */
Why do you need to fiddle with the sparse annotation? Very likely this
should be dropped.
[...]
> +/**
> + * homa_get_skb_info() - Return the address of Homa's private information
> + * for an sk_buff.
> + * @skb: Socket buffer whose info is needed.
> + * Return: address of Homa's private information for @skb.
> + */
> +static inline struct homa_skb_info *homa_get_skb_info(struct sk_buff *skb)
> +{
> + return (struct homa_skb_info *)(skb_end_pointer(skb)) - 1;
This looks fragile. Why can't you use the skb control buffer here?
> +}
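The usual pattern is the one TCP uses (TCP_SKB_CB); a rough, untested sketch:

```c
struct homa_skb_info {
	/* ... Homa's per-packet private state ... */
};

#define HOMA_SKB_CB(__skb) ((struct homa_skb_info *)&((__skb)->cb[0]))

/* And somewhere at init time, to catch overflow of the 48-byte cb[]:
 *	BUILD_BUG_ON(sizeof(struct homa_skb_info) >
 *		     sizeof_field(struct sk_buff, cb));
 */
```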
> +
> +/**
> + * homa_set_doff() - Fills in the doff TCP header field for a Homa packet.
> + * @h: Packet header whose doff field is to be set.
> + * @size: Size of the "header", bytes (must be a multiple of 4). This
> + * information is used only for TSO; it's the number of bytes
> + * that should be replicated in each segment. The bytes after
> + * this will be distributed among segments.
> + */
> +static inline void homa_set_doff(struct homa_data_hdr *h, int size)
> +{
> + /* Drop the 2 low-order bits from size and set the 4 high-order
> + * bits of doff from what's left.
> + */
> + h->common.doff = size << 2;
> +}
> +
> +/** skb_is_ipv6() - Return true if the packet is encapsulated with IPv6,
> + * false otherwise (presumably it's IPv4).
> + */
> +static inline bool skb_is_ipv6(const struct sk_buff *skb)
> +{
> + return ipv6_hdr(skb)->version == 6;
> +}
> +
> +/**
> + * ipv6_to_ipv4() - Given an IPv6 address produced by ipv4_to_ipv6, return
> + * the original IPv4 address (in network byte order).
> + * @ip6: IPv6 address; assumed to be a mapped IPv4 address.
> + * Return: IPv4 address stored in @ip6.
> + */
> +static inline __be32 ipv6_to_ipv4(const struct in6_addr ip6)
> +{
> + return ip6.in6_u.u6_addr32[3];
> +}
> +
> +/**
> + * canonical_ipv6_addr() - Convert a socket address to the "standard"
> + * form used in Homa, which is always an IPv6 address; if the original address
> + * was IPv4, convert it to an IPv4-mapped IPv6 address.
> + * @addr: Address to canonicalize (if NULL, "any" is returned).
> + * Return: IPv6 address corresponding to @addr.
> + */
> +static inline struct in6_addr canonical_ipv6_addr(const union sockaddr_in_union
> + *addr)
> +{
> + struct in6_addr mapped;
> +
> + if (addr) {
> + if (addr->sa.sa_family == AF_INET6)
> + return addr->in6.sin6_addr;
> + ipv6_addr_set_v4mapped(addr->in4.sin_addr.s_addr, &mapped);
> + return mapped;
> + }
> + return in6addr_any;
> +}
> +
> +/**
> + * skb_canonical_ipv6_saddr() - Given a packet buffer, return its source
> + * address in the "standard" form used in Homa, which is always an IPv6
> + * address; if the original address was IPv4, convert it to an IPv4-mapped
> + * IPv6 address.
> + * @skb: The source address will be extracted from this packet buffer.
> + * Return: IPv6 address for @skb's source machine.
> + */
> +static inline struct in6_addr skb_canonical_ipv6_saddr(struct sk_buff *skb)
> +{
> + struct in6_addr mapped;
> +
> + if (skb_is_ipv6(skb))
> + return ipv6_hdr(skb)->saddr;
> + ipv6_addr_set_v4mapped(ip_hdr(skb)->saddr, &mapped);
> + return mapped;
> +}
> +
> +static inline bool is_homa_pkt(struct sk_buff *skb)
> +{
> + struct iphdr *iph = ip_hdr(skb);
> +
> + return (iph->protocol == IPPROTO_HOMA);
What if this is an ipv6 packet? Also I don't see any use of this
function later on.
> +}
> +
> +/**
> + * homa_make_header_avl() - Invokes pskb_may_pull to make sure that all the
> + * Homa header information for a packet is in the linear part of the skb
> + * where it can be addressed using skb_transport_header.
> + * @skb: Packet for which header is needed.
> + * Return: The result of pskb_may_pull (true for success)
> + */
> +static inline bool homa_make_header_avl(struct sk_buff *skb)
> +{
> + int pull_length;
> +
> + pull_length = skb_transport_header(skb) - skb->data + HOMA_MAX_HEADER;
> + if (pull_length > skb->len)
> + pull_length = skb->len;
> + return pskb_may_pull(skb, pull_length);
> +}
> +
> +#define UNIT_LOG(...)
> +#define UNIT_HOOK(...)
It looks like the above 2 defines are unused later on.
> +extern unsigned int homa_net_id;
> +
> +void homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
> + int port, int error);
> +void homa_abort_sock_rpcs(struct homa_sock *hsk, int error);
> +void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> + struct homa_rpc *rpc);
> +void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb);
> +int homa_backlog_rcv(struct sock *sk, struct sk_buff *skb);
> +int homa_bind(struct socket *sk, struct sockaddr *addr,
> + int addr_len);
> +void homa_close(struct sock *sock, long timeout);
> +int homa_copy_to_user(struct homa_rpc *rpc);
> +void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
> +void homa_destroy(struct homa *homa);
> +int homa_disconnect(struct sock *sk, int flags);
> +void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa);
> +int homa_err_handler_v4(struct sk_buff *skb, u32 info);
> +int homa_err_handler_v6(struct sk_buff *skb,
> + struct inet6_skb_parm *opt, u8 type, u8 code,
> + int offset, __be32 info);
> +int homa_fill_data_interleaved(struct homa_rpc *rpc,
> + struct sk_buff *skb, struct iov_iter *iter);
> +struct homa_gap *homa_gap_new(struct list_head *next, int start, int end);
> +void homa_gap_retry(struct homa_rpc *rpc);
> +int homa_get_port(struct sock *sk, unsigned short snum);
> +int homa_getsockopt(struct sock *sk, int level, int optname,
> + char __user *optval, int __user *optlen);
> +int homa_hash(struct sock *sk);
> +enum hrtimer_restart homa_hrtimer(struct hrtimer *timer);
> +int homa_init(struct homa *homa, struct net *net);
> +int homa_ioctl(struct sock *sk, int cmd, int *karg);
> +int homa_load(void);
> +int homa_message_out_fill(struct homa_rpc *rpc,
> + struct iov_iter *iter, int xmit);
> +void homa_message_out_init(struct homa_rpc *rpc, int length);
> +void homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> + struct homa_rpc *rpc);
> +struct sk_buff *homa_new_data_packet(struct homa_rpc *rpc,
> + struct iov_iter *iter, int offset,
> + int length, int max_seg_data);
> +int homa_net_init(struct net *net);
> +void homa_net_exit(struct net *net);
> +__poll_t homa_poll(struct file *file, struct socket *sock,
> + struct poll_table_struct *wait);
> +int homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> + int flags, int *addr_len);
> +void homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
> + struct homa_sock *hsk);
> +void homa_rpc_abort(struct homa_rpc *crpc, int error);
> +void homa_rpc_acked(struct homa_sock *hsk,
> + const struct in6_addr *saddr, struct homa_ack *ack);
> +void homa_rpc_end(struct homa_rpc *rpc);
> +void homa_rpc_handoff(struct homa_rpc *rpc);
> +int homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
> +int homa_setsockopt(struct sock *sk, int level, int optname,
> + sockptr_t optval, unsigned int optlen);
> +int homa_shutdown(struct socket *sock, int how);
> +int homa_softirq(struct sk_buff *skb);
> +void homa_spin(int ns);
> +void homa_timer(struct homa *homa);
> +int homa_timer_main(void *transport);
> +void homa_unhash(struct sock *sk);
> +void homa_rpc_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
> +void homa_unload(void);
> +int homa_wait_private(struct homa_rpc *rpc, int nonblocking);
> +struct homa_rpc
> + *homa_wait_shared(struct homa_sock *hsk, int nonblocking);
> +int homa_xmit_control(enum homa_packet_type type, void *contents,
> + size_t length, struct homa_rpc *rpc);
> +int __homa_xmit_control(void *contents, size_t length,
> + struct homa_peer *peer, struct homa_sock *hsk);
> +void homa_xmit_data(struct homa_rpc *rpc, bool force);
> +void homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk);
> +
> +int homa_message_in_init(struct homa_rpc *rpc, int unsched);
> +void homa_resend_data(struct homa_rpc *rpc, int start, int end);
> +void __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc);
You should introduce the declaration of a given function in the same
patch that introduces the implementation. That means the patches should be
sorted from the lowest-level helpers towards the upper layers.
[...]
> +static inline int homa_skb_append_to_frag(struct homa *homa,
> + struct sk_buff *skb, void *buf,
> + int length)
> +{
> + char *dst = skb_put(skb, length);
> +
> + memcpy(dst, buf, length);
> + return 0;
The name is misleading, as it does not append to an skb frag but to the
skb linear part.
> +}
> +
> +static inline int homa_skb_append_from_skb(struct homa *homa,
> + struct sk_buff *dst_skb,
> + struct sk_buff *src_skb,
> + int offset, int length)
> +{
> + return homa_skb_append_to_frag(homa, dst_skb,
> + skb_transport_header(src_skb) + offset, length);
> +}
> +
> +static inline void homa_skb_free_tx(struct homa *homa, struct sk_buff *skb)
> +{
> + kfree_skb(skb);
> +}
> +
> +static inline void homa_skb_free_many_tx(struct homa *homa,
> + struct sk_buff **skbs, int count)
> +{
> + int i;
> +
> + for (i = 0; i < count; i++)
> + kfree_skb(skbs[i]);
'homa' is unused here.
> +}
> +
> +static inline void homa_skb_get(struct sk_buff *skb, void *dest, int offset,
> + int length)
> +{
> + memcpy(dest, skb_transport_header(skb) + offset, length);
> +}
> +
> +static inline struct sk_buff *homa_skb_new_tx(int length)
Please use 'alloc' in the name of an allocator function.
/P
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c
2025-05-02 23:37 ` [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
@ 2025-05-05 11:06 ` Paolo Abeni
2025-05-07 16:11 ` John Ousterhout
0 siblings, 1 reply; 46+ messages in thread
From: Paolo Abeni @ 2025-05-05 11:06 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/3/25 1:37 AM, John Ousterhout wrote:
[...]
> +{
> + /* Note: when we return, the object must be initialized so it's
> + * safe to call homa_peertab_destroy, even if this function returns
> + * an error.
> + */
> + int i;
> +
> + spin_lock_init(&peertab->write_lock);
> + INIT_LIST_HEAD(&peertab->dead_dsts);
> + peertab->buckets = vmalloc(HOMA_PEERTAB_BUCKETS *
> + sizeof(*peertab->buckets));
This struct looks way too big to be allocated on a per-netns basis. You
should use a global table and include the netns in the lookup key.
> + if (!peertab->buckets)
> + return -ENOMEM;
> + for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++)
> + INIT_HLIST_HEAD(&peertab->buckets[i]);
> + return 0;
> +}
> +
> +/**
> + * homa_peertab_destroy() - Destructor for homa_peertabs. After this
> + * function returns, it is unsafe to use any results from previous calls
> + * to homa_peer_find, since all existing homa_peer objects will have been
> + * destroyed.
> + * @peertab: The table to destroy.
> + */
> +void homa_peertab_destroy(struct homa_peertab *peertab)
> +{
> + struct hlist_node *next;
> + struct homa_peer *peer;
> + int i;
> +
> + if (!peertab->buckets)
> + return;
> +
> + spin_lock_bh(&peertab->write_lock);
> + for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
> + hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
> + peertab_links) {
> + dst_release(peer->dst);
> + kfree(peer);
> + }
> + }
> + vfree(peertab->buckets);
> + homa_peertab_gc_dsts(peertab, ~0);
> + spin_unlock_bh(&peertab->write_lock);
> +}
> +
> +/**
> + * homa_peertab_gc_dsts() - Invoked to free unused dst_entries, if it is
> + * safe to do so.
> + * @peertab: The table in which to free entries.
> + * @now: Current time, in sched_clock() units; entries with expiration
> + * dates no later than this will be freed. Specify ~0 to
> + * free all entries.
> + */
> +void homa_peertab_gc_dsts(struct homa_peertab *peertab, u64 now)
> + __must_hold(&peer_tab->write_lock)
> +{
> + while (!list_empty(&peertab->dead_dsts)) {
> + struct homa_dead_dst *dead =
> + list_first_entry(&peertab->dead_dsts,
> + struct homa_dead_dst, dst_links);
> + if (dead->gc_time > now)
> + break;
> + dst_release(dead->dst);
> + list_del(&dead->dst_links);
> + kfree(dead);
> + }
> +}
> +
> +/**
> + * homa_peer_find() - Returns the peer associated with a given host; creates
> + * a new homa_peer if one doesn't already exist.
> + * @peertab: Peer table in which to perform lookup.
> + * @addr: Address of the desired host: IPv4 addresses are represented
> + * as IPv4-mapped IPv6 addresses.
> + * @inet: Socket that will be used for sending packets.
> + *
> + * Return: The peer associated with @addr, or a negative errno if an
> + * error occurred. The caller can retain this pointer
> + * indefinitely: peer entries are never deleted except in
> + * homa_peertab_destroy.
> + */
> +struct homa_peer *homa_peer_find(struct homa_peertab *peertab,
> + const struct in6_addr *addr,
> + struct inet_sock *inet)
> +{
> + struct homa_peer *peer;
> + struct dst_entry *dst;
> +
> + u32 bucket = hash_32((__force u32)addr->in6_u.u6_addr32[0],
> + HOMA_PEERTAB_BUCKET_BITS);
> +
> + bucket ^= hash_32((__force u32)addr->in6_u.u6_addr32[1],
> + HOMA_PEERTAB_BUCKET_BITS);
> + bucket ^= hash_32((__force u32)addr->in6_u.u6_addr32[2],
> + HOMA_PEERTAB_BUCKET_BITS);
> + bucket ^= hash_32((__force u32)addr->in6_u.u6_addr32[3],
> + HOMA_PEERTAB_BUCKET_BITS);
> +
> + /* Use RCU operators to ensure safety even if a concurrent call is
> + * adding a new entry. The calls to rcu_read_lock and rcu_read_unlock
> + * shouldn't actually be needed, since we don't need to protect
> + * against concurrent deletion.
> + */
> + rcu_read_lock();
> + hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
> + peertab_links) {
> + if (ipv6_addr_equal(&peer->addr, addr)) {
> + rcu_read_unlock();
> + return peer;
> + }
> + }
> + rcu_read_unlock();
> +
> + /* No existing entry; create a new one.
> + *
> + * Note: after we acquire the lock, we have to check again to
> + * make sure the entry still doesn't exist (it might have been
> + * created by a concurrent invocation of this function).
> + */
> + spin_lock_bh(&peertab->write_lock);
> + hlist_for_each_entry(peer, &peertab->buckets[bucket],
> + peertab_links) {
> + if (ipv6_addr_equal(&peer->addr, addr))
> + goto done;
> + }
> + peer = kmalloc(sizeof(*peer), GFP_ATOMIC | __GFP_ZERO);
Please, move the allocation outside the atomic scope and use GFP_KERNEL.
> + if (!peer) {
> + peer = (struct homa_peer *)ERR_PTR(-ENOMEM);
> + goto done;
> + }
> + peer->addr = *addr;
> + dst = homa_peer_get_dst(peer, inet);
> + if (IS_ERR(dst)) {
> + kfree(peer);
> + peer = (struct homa_peer *)PTR_ERR(dst);
> + goto done;
> + }
> + peer->dst = dst;
> + hlist_add_head_rcu(&peer->peertab_links, &peertab->buckets[bucket]);
At this point another CPU can look up 'peer'. Since there are no memory
barriers it could observe a NULL peer->dst.
Also AFAICS new peers are always added when a lookup for a new address
fails, and are deleted only at netns shutdown time (never for the init ns).
You need to:
- account the memory used for peers
- enforce an upper bound on the total number of peers (per netns),
eventually freeing existing old ones.
Note that freeing peers at 'runtime' will require additional changes:
likely refcounting will be needed, otherwise at lookup time, after the
rcu unlock, the code could hit a use-after-free while accessing the
looked-up peer.
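i.e. the lookup would need to become something like this (untested sketch,
helper names invented):

```c
	rcu_read_lock();
	hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
				 peertab_links) {
		if (ipv6_addr_equal(&peer->addr, addr) &&
		    refcount_inc_not_zero(&peer->refs)) {
			rcu_read_unlock();
			return peer;	/* caller must homa_peer_put() */
		}
	}
	rcu_read_unlock();
```

with the freeing side deferring the actual kfree() via kfree_rcu() once the
refcount drops to zero.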
> + peer->current_ticks = -1;
> + spin_lock_init(&peer->ack_lock);
> +
> +done:
> + spin_unlock_bh(&peertab->write_lock);
> + return peer;
> +}
> +
> +/**
> + * homa_dst_refresh() - This method is called when the dst for a peer is
> + * obsolete; it releases that dst and creates a new one.
> + * @peertab: Table containing the peer.
> + * @peer: Peer whose dst is obsolete.
> + * @hsk: Socket that will be used to transmit data to the peer.
> + */
> +void homa_dst_refresh(struct homa_peertab *peertab, struct homa_peer *peer,
> + struct homa_sock *hsk)
> +{
> + struct homa_dead_dst *save_dead;
> + struct dst_entry *dst;
> + u64 now;
> +
> + /* Need to keep around the current entry for a while in case
> + * someone is using it. If we can't do that, then don't update
> + * the entry.
> + */
> + save_dead = kmalloc(sizeof(*save_dead), GFP_ATOMIC);
> + if (unlikely(!save_dead))
> + return;
> +
> + dst = homa_peer_get_dst(peer, &hsk->inet);
> + if (IS_ERR(dst)) {
> + kfree(save_dead);
> + return;
> + }
> +
> + spin_lock_bh(&peertab->write_lock);
> + now = sched_clock();
Use jiffies instead.
> + save_dead->dst = peer->dst;
> + save_dead->gc_time = now + 100000000; /* 100 ms */
> + list_add_tail(&save_dead->dst_links, &peertab->dead_dsts);
> + homa_peertab_gc_dsts(peertab, now);
> + peer->dst = dst;
> + spin_unlock_bh(&peertab->write_lock);
It's unclear to me why you need this additional GC layer on top of the
core one.
[...]
> +static inline struct dst_entry *homa_get_dst(struct homa_peer *peer,
> + struct homa_sock *hsk)
> +{
> + if (unlikely(peer->dst->obsolete > 0))
You need to additionally call dst->ops->check().
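'obsolete > 0' alone doesn't tell you whether the entry is still usable;
a rough sketch of the intended pattern (untested, call shape assumed):

```c
	struct dst_entry *dst = peer->dst;

	/* For obsolete entries, ops->check() returns the dst if it is
	 * still valid and NULL otherwise.
	 */
	if (unlikely(dst->obsolete > 0 && !dst->ops->check(dst, 0)))
		homa_dst_refresh(peertab, peer, hsk);
```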
/P
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa
2025-05-05 7:54 ` Paolo Abeni
@ 2025-05-05 16:14 ` John Ousterhout
2025-05-05 16:48 ` Andrew Lunn
0 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-05 16:14 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, May 5, 2025 at 12:55 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> > +/**
> > + * define HOMA_MAX_BPAGES - The largest number of bpages that will be required
> > + * to store an incoming message.
> > + */
> > +#define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) \
> > + >> HOMA_BPAGE_SHIFT)
>
> Minor nit: the above indentation is somewhat uncommon, the preferred
> style is:
>
> #define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) >> \
>			 HOMA_BPAGE_SHIFT)
Fixed (I only recently learned of this convention and have been fixing
noncompliant code as I find it)
> > +
> > +#if !defined(__cplusplus)
> > +_Static_assert(sizeof(struct homa_sendmsg_args) >= 24,
> > + "homa_sendmsg_args shrunk");
> > +_Static_assert(sizeof(struct homa_sendmsg_args) <= 24,
> > + "homa_sendmsg_args grew");
> > +#endif
>
> I think these assertions don't belong here; they should be BUILD_BUG_ON() in
> C files. Even better, they could be avoided with explicit alignment on the
> message struct.
This is a user-facing header; is BUILD_BUG_ON OK there (ChatGPT
doesn't seem to think so)? Also, what do you mean about "explicit
alignment on the message struct"?
>
> [...]
> > +int homa_send(int sockfd, const void *message_buf,
> > + size_t length, const struct sockaddr *dest_addr,
> > + __u32 addrlen, __u64 *id, __u64 completion_cookie,
> > + int flags);
> > +int homa_sendv(int sockfd, const struct iovec *iov,
> > + int iovcnt, const struct sockaddr *dest_addr,
> > + __u32 addrlen, __u64 *id, __u64 completion_cookie,
> > + int flags);
> > +ssize_t homa_reply(int sockfd, const void *message_buf,
> > + size_t length, const struct sockaddr *dest_addr,
> > + __u32 addrlen, __u64 id);
> > +ssize_t homa_replyv(int sockfd, const struct iovec *iov,
> > + int iovcnt, const struct sockaddr *dest_addr,
> > + __u32 addrlen, __u64 id);
>
> I assume the above are user-space functions definition ??? If so, they
> don't belong here.
Yes, these are declarations for user-space functions that wrap the
sendmsg and recvmsg kernel calls. If not here, where should they go?
Are you suggesting a second header file (suggestions for what it
should be called?)? These are very thin wrappers, which I expect
people will almost always use instead of invoking raw sendmsg and
recvmsg, so I thought it would be cleanest to put them here, next to
other info related to the Homa kernel calls.
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 06/15] net: homa: create homa_sock.h and homa_sock.c
2025-05-02 23:37 ` [PATCH net-next v8 06/15] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
@ 2025-05-05 16:46 ` Paolo Abeni
2025-05-07 18:30 ` John Ousterhout
0 siblings, 1 reply; 46+ messages in thread
From: Paolo Abeni @ 2025-05-05 16:46 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/3/25 1:37 AM, John Ousterhout wrote:
[...]
> +/**
> + * homa_sock_init() - Constructor for homa_sock objects. This function
> + * initializes only the parts of the socket that are owned by Homa.
> + * @hsk: Object to initialize.
> + * @homa: Homa implementation that will manage the socket.
> + *
> + * Return: 0 for success, otherwise a negative errno.
> + */
> +int homa_sock_init(struct homa_sock *hsk, struct homa *homa)
> +{
> + struct homa_socktab *socktab = homa->port_map;
> + struct homa_sock *other;
> + int starting_port;
> + int result = 0;
> + int i;
> +
> + /* Initialize fields outside the Homa part. */
> + hsk->sock.sk_sndbuf = homa->wmem_max;
> +
> + /* Initialize Homa-specific fields. */
> + spin_lock_bh(&socktab->write_lock);
> + atomic_set(&hsk->protect_count, 0);
> + spin_lock_init(&hsk->lock);
> + atomic_set(&hsk->protect_count, 0);
Duplicate 'atomic_set(&hsk->protect_count, 0);' statement above
> + hsk->homa = homa;
> + hsk->ip_header_length = (hsk->inet.sk.sk_family == AF_INET)
> + ? HOMA_IPV4_HEADER_LENGTH : HOMA_IPV6_HEADER_LENGTH;
> + hsk->is_server = false;
> + hsk->shutdown = false;
> + starting_port = homa->prev_default_port;
> + while (1) {
> + homa->prev_default_port++;
> + if (homa->prev_default_port < HOMA_MIN_DEFAULT_PORT)
> + homa->prev_default_port = HOMA_MIN_DEFAULT_PORT;
> + other = homa_sock_find(socktab, homa->prev_default_port);
> + if (!other)
> + break;
> + sock_put(&other->sock);
> + if (homa->prev_default_port == starting_port) {
> + spin_unlock_bh(&socktab->write_lock);
> + hsk->shutdown = true;
> + return -EADDRNOTAVAIL;
> + }
> + }
> + hsk->port = homa->prev_default_port;
> + hsk->inet.inet_num = hsk->port;
> + hsk->inet.inet_sport = htons(hsk->port);
The above code looks like a bind() operation, but it's unclear why it's
performed at init time.
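The quoted loop is a wrap-around search for a free default port. A standalone userspace sketch of the same algorithm (the minimum-port constant is assumed; the real HOMA_MIN_DEFAULT_PORT value does not appear in this excerpt):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value; the real HOMA_MIN_DEFAULT_PORT is defined elsewhere. */
#define SKETCH_MIN_DEFAULT_PORT 0x8000

static uint8_t sketch_port_in_use[0x10000];
static uint16_t sketch_prev_default_port = SKETCH_MIN_DEFAULT_PORT;

/* Round-robin search mirroring the quoted loop: advance
 * prev_default_port, wrap values below the minimum back up to the
 * minimum, and fail only after a full cycle finds no free port. */
static int sketch_pick_default_port(void)
{
	uint16_t starting_port = sketch_prev_default_port;

	while (1) {
		sketch_prev_default_port++; /* uint16_t wraps 0xffff -> 0 */
		if (sketch_prev_default_port < SKETCH_MIN_DEFAULT_PORT)
			sketch_prev_default_port = SKETCH_MIN_DEFAULT_PORT;
		if (!sketch_port_in_use[sketch_prev_default_port]) {
			sketch_port_in_use[sketch_prev_default_port] = 1;
			return sketch_prev_default_port;
		}
		if (sketch_prev_default_port == starting_port)
			return -1; /* maps to -EADDRNOTAVAIL */
	}
}
```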
> + hlist_add_head_rcu(&hsk->socktab_links,
> + &socktab->buckets[homa_port_hash(hsk->port)]);
> + INIT_LIST_HEAD(&hsk->active_rpcs);
> + INIT_LIST_HEAD(&hsk->dead_rpcs);
> + hsk->dead_skbs = 0;
> + INIT_LIST_HEAD(&hsk->waiting_for_bufs);
> + INIT_LIST_HEAD(&hsk->ready_rpcs);
> + INIT_LIST_HEAD(&hsk->interests);
> + for (i = 0; i < HOMA_CLIENT_RPC_BUCKETS; i++) {
> + struct homa_rpc_bucket *bucket = &hsk->client_rpc_buckets[i];
> +
> + spin_lock_init(&bucket->lock);
> + bucket->id = i;
> + INIT_HLIST_HEAD(&bucket->rpcs);
> + }
> + for (i = 0; i < HOMA_SERVER_RPC_BUCKETS; i++) {
> + struct homa_rpc_bucket *bucket = &hsk->server_rpc_buckets[i];
> +
> + spin_lock_init(&bucket->lock);
> + bucket->id = i + 1000000;
> + INIT_HLIST_HEAD(&bucket->rpcs);
> + }
I'm under the impression that using an rhashtable for both the client and
the server RPCs would deliver both more efficient memory usage and better
performance.
> + hsk->buffer_pool = homa_pool_new(hsk);
> + if (IS_ERR(hsk->buffer_pool)) {
> + result = PTR_ERR(hsk->buffer_pool);
> + hsk->buffer_pool = NULL;
> + }
> + spin_unlock_bh(&socktab->write_lock);
> + return result;
> +}
> +
> +/*
> + * homa_sock_unlink() - Unlinks a socket from its socktab and does
> + * related cleanups. Once this method returns, the socket will not be
> + * discoverable through the socktab.
> + * @hsk: Socket to unlink.
> + */
> +void homa_sock_unlink(struct homa_sock *hsk)
> +{
> + struct homa_socktab *socktab = hsk->homa->port_map;
> +
> + spin_lock_bh(&socktab->write_lock);
> + hlist_del_rcu(&hsk->socktab_links);
> + spin_unlock_bh(&socktab->write_lock);
> +}
> +
> +/**
> + * homa_sock_shutdown() - Disable a socket so that it can no longer
> + * be used for either sending or receiving messages. Any system calls
> + * currently waiting to send or receive messages will be aborted.
> + * @hsk: Socket to shut down.
> + */
> +void homa_sock_shutdown(struct homa_sock *hsk)
> +{
> + struct homa_interest *interest;
> + struct homa_rpc *rpc;
> + u64 tx_memory;
> +
> + homa_sock_lock(hsk);
> + if (hsk->shutdown) {
> + homa_sock_unlock(hsk);
> + return;
> + }
> +
> + /* The order of cleanup is very important, because there could be
> + * active operations that hold RPC locks but not the socket lock.
> + * 1. Set @shutdown; this ensures that no new RPCs will be created for
> + * this socket (though some creations might already be in progress).
> + * 2. Remove the socket from its socktab: this ensures that
> + * incoming packets for the socket will be dropped.
> + * 3. Go through all of the RPCs and delete them; this will
> + * synchronize with any operations in progress.
> + * 4. Perform other socket cleanup: at this point we know that
> + * there will be no concurrent activities on individual RPCs.
> + * 5. Don't delete the buffer pool until after all of the RPCs
> + * have been reaped.
> + * See sync.txt for additional information about locking.
> + */
> + hsk->shutdown = true;
> + homa_sock_unlink(hsk);
> + homa_sock_unlock(hsk);
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
> + homa_rpc_lock(rpc);
> + homa_rpc_end(rpc);
> + homa_rpc_unlock(rpc);
> + }
> + rcu_read_unlock();
> +
> + homa_sock_lock(hsk);
> + while (!list_empty(&hsk->interests)) {
> + interest = list_first_entry(&hsk->interests,
> + struct homa_interest, links);
> + __list_del_entry(&interest->links);
> + atomic_set_release(&interest->ready, 1);
> + wake_up(&interest->wait_queue);
> + }
> + homa_sock_unlock(hsk);
> +
> + while (!list_empty(&hsk->dead_rpcs))
> + homa_rpc_reap(hsk, 1000);
> +
> + tx_memory = refcount_read(&hsk->sock.sk_wmem_alloc);
> + if (tx_memory != 1) {
> + pr_err("%s found sk_wmem_alloc %llu bytes, port %d\n",
> + __func__, tx_memory, hsk->port);
> + }
Just:
WARN_ON_ONCE(refcount_read(&sk->sk_wmem_alloc) != 1);
> +
> + if (hsk->buffer_pool) {
> + homa_pool_destroy(hsk->buffer_pool);
> + hsk->buffer_pool = NULL;
> + }
> +}
> +
> +/**
> + * homa_sock_destroy() - Destructor for homa_sock objects. This function
> + * only cleans up the parts of the object that are owned by Homa.
> + * @hsk: Socket to destroy.
> + */
> +void homa_sock_destroy(struct homa_sock *hsk)
> +{
> + homa_sock_shutdown(hsk);
> + sock_set_flag(&hsk->inet.sk, SOCK_RCU_FREE);
Why is the flag set only now and not at creation time?
[...]
> +/**
> + * struct homa_socktab - A hash table that maps from port numbers (either
> + * client or server) to homa_sock objects.
> + *
> + * This table is managed exclusively by homa_socktab.c, using RCU to
> + * minimize synchronization during lookups.
> + */
> +struct homa_socktab {
> + /**
> + * @write_lock: Controls all modifications to this object; not needed
> + * for socket lookups (RCU is used instead). Also used to
> + * synchronize port allocation.
> + */
> + spinlock_t write_lock;
> +
> + /**
> + * @buckets: Heads of chains for hash table buckets. Chains
> + * consist of homa_sock objects.
> + */
> + struct hlist_head buckets[HOMA_SOCKTAB_BUCKETS];
> +};
This is probably a bit too large to be unconditionally allocated for
each netns. You are probably better off with a global hash table, with
the lookup key including the netns itself.
[...]
> +/**
> + * homa_sock_lock() - Acquire the lock for a socket.
> + * @hsk: Socket to lock.
> + */
> +static inline void homa_sock_lock(struct homa_sock *hsk)
> + __acquires(&hsk->lock)
> +{
> + spin_lock_bh(&hsk->lock);
I was wondering how the hsk socket lock could be nested under a
spinlock... The above can't work unless you prevent any core and
inet-related operations on hsk sockets. That in turn means either
duplicating a lot of code entirely or preventing a lot of basic stuff
from working on Homa sockets.
Homa sockets are still inet ones. It's expected that SOL_SOCKET and
SOL_IP[V6] socket options work on them (or at least could be implemented).
I think this point is very critical.
Somewhat related: the patch order makes IMHO the review complex, because
I often need to look to the following patches to get needed context.
/P
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa
2025-05-05 16:14 ` John Ousterhout
@ 2025-05-05 16:48 ` Andrew Lunn
2025-05-05 17:52 ` John Ousterhout
0 siblings, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2025-05-05 16:48 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
> > > +int homa_send(int sockfd, const void *message_buf,
> > > + size_t length, const struct sockaddr *dest_addr,
> > > + __u32 addrlen, __u64 *id, __u64 completion_cookie,
> > > + int flags);
> > > +int homa_sendv(int sockfd, const struct iovec *iov,
> > > + int iovcnt, const struct sockaddr *dest_addr,
> > > + __u32 addrlen, __u64 *id, __u64 completion_cookie,
> > > + int flags);
> > > +ssize_t homa_reply(int sockfd, const void *message_buf,
> > > + size_t length, const struct sockaddr *dest_addr,
> > > + __u32 addrlen, __u64 id);
> > > +ssize_t homa_replyv(int sockfd, const struct iovec *iov,
> > > + int iovcnt, const struct sockaddr *dest_addr,
> > > + __u32 addrlen, __u64 id);
> >
> > I assume the above are user-space function definitions? If so, they
> > don't belong here.
>
> Yes, these are declarations for user-space functions that wrap the
> sendmsg and recvmsg kernel calls. If not here, where should they go?
> Are you suggesting a second header file (suggestions for what it
> should be called?)? These are very thin wrappers, which I expect
> people will almost always use instead of invoking raw sendmsg and
> recvmsg, so I thought it would be cleanest to put them here, next to
> other info related to the Homa kernel calls.
Maybe put the whole library into tools/lib/homa.
Andrew
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa
2025-05-05 16:48 ` Andrew Lunn
@ 2025-05-05 17:52 ` John Ousterhout
0 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-05 17:52 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
On Mon, May 5, 2025 at 9:48 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > > +int homa_send(int sockfd, const void *message_buf,
> > > > + size_t length, const struct sockaddr *dest_addr,
> > > > + __u32 addrlen, __u64 *id, __u64 completion_cookie,
> > > > + int flags);
> > > > +int homa_sendv(int sockfd, const struct iovec *iov,
> > > > + int iovcnt, const struct sockaddr *dest_addr,
> > > > + __u32 addrlen, __u64 *id, __u64 completion_cookie,
> > > > + int flags);
> > > > +ssize_t homa_reply(int sockfd, const void *message_buf,
> > > > + size_t length, const struct sockaddr *dest_addr,
> > > > + __u32 addrlen, __u64 id);
> > > > +ssize_t homa_replyv(int sockfd, const struct iovec *iov,
> > > > + int iovcnt, const struct sockaddr *dest_addr,
> > > > + __u32 addrlen, __u64 id);
> > >
> > > I assume the above are user-space function definitions? If so, they
> > > don't belong here.
> >
> > Yes, these are declarations for user-space functions that wrap the
> > sendmsg and recvmsg kernel calls. If not here, where should they go?
> > Are you suggesting a second header file (suggestions for what it
> > should be called?)? These are very thin wrappers, which I expect
> > people will almost always use instead of invoking raw sendmsg and
> > recvmsg, so I thought it would be cleanest to put them here, next to
> > other info related to the Homa kernel calls.
>
> Maybe put the whole library into tools/lib/homa.
After thinking about this some more, I think I'm going to just delete
these functions. They don't add much value and they create some
awkwardness (e.g. there would need to be a new user-level library with
their code).
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa
2025-05-05 8:03 ` Paolo Abeni
@ 2025-05-05 22:53 ` John Ousterhout
0 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-05 22:53 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, May 5, 2025 at 1:03 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> > +/* Flag bits for homa_sendmsg_args.flags (see man page for documentation):
> > + */
> > +#define HOMA_SENDMSG_PRIVATE 0x01
> > +#define HOMA_SENDMSG_NONBLOCKING 0x02
>
> It's unclear why you need to define a new mechanism instead of using
> MSG_DONTWAIT. This is possibly not needed and deserves at least a good
> motivation in the patch introducing it.
I'm not sure why that flag still exists (or HOMA_RECVMSG_NONBLOCKING
either), since Homa supports MSG_DONTWAIT and that seems to provide
the same functionality. Maybe these are leftover from earlier Homa
versions that weren't layered on sendmsg and recvmsg.
I have removed both HOMA_SENDMSG_NONBLOCKING and HOMA_RECVMSG_NONBLOCKING.
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 02/15] net: homa: create homa_wire.h
2025-05-05 8:28 ` Paolo Abeni
@ 2025-05-05 23:54 ` John Ousterhout
2025-05-22 5:31 ` John Ousterhout
1 sibling, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-05 23:54 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, May 5, 2025 at 1:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> > diff --git a/net/homa/homa_wire.h b/net/homa/homa_wire.h
> > new file mode 100644
> > index 000000000000..47693244c3ec
> > --- /dev/null
> > +++ b/net/homa/homa_wire.h
>
> I'm wondering why you keep the wire-struct definition outside the uAPI -
> the opposite of what other protocols do.
By 'uAPI' I assume you mean homa.h? I didn't consider putting it
there, and would prefer not to, because Homa users should not need to
know anything about the wire format. But, I will combine the files if
you think it is better to do that for consistency with other
protocols. Please confirm?
> > +/* Defines the possible types of Homa packets.
> > + *
> > + * See the xxx_header structs below for more information about each type.
> > + */
> > +enum homa_packet_type {
> > + DATA = 0x10,
> > + RESEND = 0x12,
> > + RPC_UNKNOWN = 0x13,
> > + BUSY = 0x14,
> > + NEED_ACK = 0x17,
> > + ACK = 0x18,
> > + BOGUS = 0x19, /* Used only in unit tests. */
> > + /* If you add a new type here, you must also do the following:
> > + * 1. Change BOGUS so it is the highest opcode
>
> If you instead define 'MAX' value, the required update policy would be
> self-explained and you will not need to expose tests details.
Done.
>
> > + * 2. Add support for the new opcode in homa_print_packet,
> > + * homa_print_packet_short, homa_symbol_for_type, and mock_skb_new.
> > + * 3. Add the header length to header_lengths in homa_plumbing.c.
> > + */
> > +};
> > +
> > +/** define HOMA_IPV6_HEADER_LENGTH - Size of IP header (V6). */
> > +#define HOMA_IPV6_HEADER_LENGTH 40
> > +
> > +/** define HOMA_IPV4_HEADER_LENGTH - Size of IP header (V4). */
> > +#define HOMA_IPV4_HEADER_LENGTH 20
>
I suspect you will be better off using sizeof(<relevant struct>). Making
protocol-specific definitions for common/global constants is somewhat
confusing and unexpected.
Done.
> > +/**
> > + * define HOMA_SKB_EXTRA - How many bytes of additional space to allow at the
> > + * beginning of each sk_buff, before the IP header. This includes room for a
> > + * VLAN header and also includes some extra space, "just to be safe" (not
> > + * really sure if this is needed).
> > + */
> > +#define HOMA_SKB_EXTRA 40
>
> You could use:
>
> #define MAX_HOMA_HEADER MAX_TCP_HEADER
>
> to leverage a consolidated value covering most use-cases and kernel configs.
Done (I wasn't aware of MAX_TCP_HEADER).
> > +/**
> > + * define HOMA_ETH_OVERHEAD - Number of bytes per Ethernet packet for Ethernet
> > + * header, CRC, preamble, and inter-packet gap.
> > + */
> > +#define HOMA_ETH_OVERHEAD 42
>
> It's not clear why the protocol should be interested in MAC-specific
> details. What if the the MAC is not ethernet?
This is used by the pacer in order to get the most accurate possible
estimate of exactly how much time it will take to transmit a packet
(so that Homa can pass packets to the NIC at a rate that runs the
uplink at utilization just under 100% without generating queue buildup
in the NIC). Is there a way to get this information from the dev in a
way that reflects the specific hardware more accurately than guessing
based on Ethernet?
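The pacer arithmetic John describes can be sketched as follows; the 42-byte per-packet overhead comes from the quoted HOMA_ETH_OVERHEAD define, while the nanosecond integer math and function shape are assumptions, not the actual pacer code:

```c
#include <assert.h>

/* Per-packet Ethernet overhead from the quoted define: header, CRC,
 * preamble, and inter-packet gap. */
#define SKETCH_ETH_OVERHEAD 42

/* Estimate how long a packet occupies the wire, in nanoseconds.
 * link_mbps is the uplink speed in units of 1e6 bits/second; the pacer
 * uses such estimates to feed the NIC at just under 100% utilization
 * without building a queue. */
static long sketch_wire_time_ns(long ip_bytes, long link_mbps)
{
	long wire_bits = (ip_bytes + SKETCH_ETH_OVERHEAD) * 8;

	/* bits / (mbps * 1e6 bits/s) = bits * 1000 / mbps nanoseconds */
	return wire_bits * 1000 / link_mbps;
}
```

For example, a 1458-byte IP packet on a 100 Gbps link (link_mbps = 100000) occupies 1500 * 8 = 12000 wire bits, i.e. 120 ns; an error in the fixed overhead constant shifts every per-packet estimate, which is why Paolo's question about non-Ethernet MACs matters.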
> > + /**
> > + * @type: Homa packet type (one of the values of the homa_packet_type
> > + * enum). Corresponds to the low-order byte of the ack in TCP.
> > + */
> > + __u8 type;
>
> If you keep this outside uAPI you should use 'u8'
Will do.
> > +_Static_assert(sizeof(struct homa_data_hdr) <= HOMA_MAX_HEADER,
> > + "homa_data_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
> > +_Static_assert(sizeof(struct homa_data_hdr) >= HOMA_MIN_PKT_LENGTH,
> > + "homa_data_hdr too small: Homa doesn't currently have code to pad data packets");
> > +_Static_assert(((sizeof(struct homa_data_hdr) - sizeof(struct homa_seg_hdr)) &
> > + 0x3) == 0,
> > + " homa_data_hdr length not a multiple of 4 bytes (required for TCP/TSO compatibility");
>
> Please use BUILD_BUG_ON() in a .c file instead. Many other cases below.
Will do, if this info doesn't move to uAPI (BTW, BUILD_BUG_ON feels
more awkward because it doesn't allow an error message).
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
` (14 preceding siblings ...)
2025-05-02 23:37 ` [PATCH net-next v8 15/15] net: homa: create Makefile and Kconfig John Ousterhout
@ 2025-05-06 5:04 ` John Ousterhout
2025-05-06 12:01 ` Paolo Abeni
15 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-06 5:04 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba
On Fri, May 2, 2025 at 4:38 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> This patch series begins the process of upstreaming the Homa transport
> protocol....
For some reason patchwork has not run any tests on this version of the
patch series. Is there something I need to do (differently?) to get
the tests to run?
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol
2025-05-06 5:04 ` [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
@ 2025-05-06 12:01 ` Paolo Abeni
0 siblings, 0 replies; 46+ messages in thread
From: Paolo Abeni @ 2025-05-06 12:01 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/6/25 7:04 AM, John Ousterhout wrote:
> On Fri, May 2, 2025 at 4:38 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
>>
>> This patch series begins the process of upstreaming the Homa transport
>> protocol....
>
> For some reason patchwork has not run any tests on this version of the
> patch series.
There has been a transient bad state for the nipa CI infra. A bunch of
series did not enter testing, among them the homa one.
> Is there something I need to do (differently?) to get
> the tests to run?
The series should be (hopefully) tested on the next submission.
Important note: do not rely on the nipa CI for testing: the patches are
expected to go through similar checks before the submission.
Cheers,
Paolo
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 07/15] net: homa: create homa_interest.h and homa_interest.c
2025-05-02 23:37 ` [PATCH net-next v8 07/15] net: homa: create homa_interest.h and homa_interest.c John Ousterhout
@ 2025-05-06 13:53 ` Paolo Abeni
2025-05-07 18:45 ` John Ousterhout
0 siblings, 1 reply; 46+ messages in thread
From: Paolo Abeni @ 2025-05-06 13:53 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/3/25 1:37 AM, John Ousterhout wrote:
> +/**
> + * homa_interest_init_shared() - Initialize an interest and queue it up on a socket.
What is an 'interest'? An event, like input available or output
unblocked? If so, 'event' could be a more idiomatic name.
> + * @interest: Interest to initialize
> + * @hsk: Socket on which the interests should be queued. Must be locked
> + * by caller.
> + */
> +void homa_interest_init_shared(struct homa_interest *interest,
> + struct homa_sock *hsk)
> + __must_hold(&hsk->lock)
> +{
> + interest->rpc = NULL;
> + atomic_set(&interest->ready, 0);
> + interest->core = raw_smp_processor_id();
I don't see this 'core' field used later on in this series. If that's
the case, please avoid introducing it until it's really used.
/P
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c
2025-05-02 23:37 ` [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c John Ousterhout
@ 2025-05-06 14:05 ` Paolo Abeni
2025-05-07 18:55 ` John Ousterhout
0 siblings, 1 reply; 46+ messages in thread
From: Paolo Abeni @ 2025-05-06 14:05 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 5/3/25 1:37 AM, John Ousterhout wrote:
> + /**
> + * @link_mbps: The raw bandwidth of the network uplink, in
> + * units of 1e06 bits per second. Set externally via sysctl.
> + */
> + int link_mbps;
This will be extremely problematic. In practice nobody will set this
correctly, and in some cases the info is not even available (VM) or will
change dynamically due to policing/shaping.
I think you need to build your own estimator of the available B/W. I'm
unsure whether you can re-use the BQL info here; I don't think so.
/P
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 03/15] net: homa: create shared Homa header files
2025-05-05 10:20 ` Paolo Abeni
@ 2025-05-06 17:45 ` John Ousterhout
0 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-06 17:45 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, May 5, 2025 at 3:20 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > +
> > +#define sizeof32(type) ((int)(sizeof(type)))
>
> (u32) instead of (int). I think you should try to avoid using this
> define, which is not very nice by itself.
I have removed that #define and switched to sizeof everywhere.
> > +#ifdef __CHECKER__
> > +#define __context__(x, y, z) __attribute__((context(x, y, z)))
> > +#else
> > +#define __context__(...)
> > +#endif /* __CHECKER__ */
>
> Why do you need to fiddle with the sparse annotation? Very likely this
> should be dropped.
Without this I couldn't get the code to compile. Homa declares
"__context__" for some spinlocks to handle cases where a lock is
acquired for the return value of a function (so '__acquires' can't
name the lock otherwise). For an example, search for 'rpc_bucket_lock'
in homa_sock.h. I'm still pretty much a newbie with sparse, so maybe
there's a better way to do this? In general I'm having a difficult
time getting useful information out of sparse...
> > +/**
> > + * homa_get_skb_info() - Return the address of Homa's private information
> > + * for an sk_buff.
> > + * @skb: Socket buffer whose info is needed.
> > + * Return: address of Homa's private information for @skb.
> > + */
> > +static inline struct homa_skb_info *homa_get_skb_info(struct sk_buff *skb)
> > +{
> > + return (struct homa_skb_info *)(skb_end_pointer(skb)) - 1;
>
> This looks fragile. Why can't you use the skb control buffer here?
I was not aware of the skb control buffer. After poking around and
asking ChatGPT, it appears that information in the control buffer is
not guaranteed to survive across networking layers? Homa depends on
the information being persistent. For example, it's used to link
together all of the skb's in a Homa message, which will be used if
parts of a message need to be retransmitted. Why does this look
fragile to you? It's pretty much equivalent to skb_shinfo, except with
Homa information.
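The skb_end_pointer() trick John defends (a private struct carved out of the space just before the end of the buffer, analogous to skb_shinfo) can be mimicked in plain userspace C; the struct contents here are invented, since the real homa_skb_info fields are not shown in this excerpt:

```c
#include <assert.h>

/* Illustrative private-info struct; the real homa_skb_info carries
 * per-message linkage used for retransmission. */
struct sketch_skb_info {
	int message_offset;
};

/* Mimic homa_get_skb_info(): the info lives immediately before the
 * end of the buffer and is found purely by pointer arithmetic. */
static struct sketch_skb_info *sketch_get_info(void *buf_end)
{
	return (struct sketch_skb_info *)buf_end - 1;
}

/* Demonstrate that writes through one lookup are visible through
 * another: both compute the same tail-of-buffer address. */
static int sketch_demo(void)
{
	long buf[8]; /* 64-byte, suitably aligned stand-in for skb data */
	char *end = (char *)buf + sizeof(buf);

	sketch_get_info(end)->message_offset = 7;
	return sketch_get_info(end)->message_offset;
}
```

The technique is only as safe as the guarantee that nothing else claims those tail bytes, which is the crux of the "fragile" concern versus the skb control buffer.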
> > +static inline bool is_homa_pkt(struct sk_buff *skb)
> > +{
> > + struct iphdr *iph = ip_hdr(skb);
> > +
> > + return (iph->protocol == IPPROTO_HOMA);
>
> What if this is an ipv6 packet? Also I don't see any use of this
> function later on.
This function isn't used anymore, so I have deleted it. It's probably
leftover from before the addition of IPv6 support.
> > +#define UNIT_LOG(...)
> > +#define UNIT_HOOK(...)
>
> It looks like the above 2 define are unused later on.
Oops, that code was supposed to get stripped out of the upstream
version of Homa. I've now fixed the stripper to keep this code out of
the upstream version.
> > +extern unsigned int homa_net_id;
> > +
> > +void homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
> > + int port, int error);
> > +void homa_abort_sock_rpcs(struct homa_sock *hsk, int error);
> > +void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> > + struct homa_rpc *rpc);
> > +void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb);
> > +int homa_backlog_rcv(struct sock *sk, struct sk_buff *skb);
> > +int homa_bind(struct socket *sk, struct sockaddr *addr,
> > + int addr_len);
> > +void homa_close(struct sock *sock, long timeout);
> > +int homa_copy_to_user(struct homa_rpc *rpc);
> > +void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
> > +void homa_destroy(struct homa *homa);
> > +int homa_disconnect(struct sock *sk, int flags);
> > +void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa);
> > +int homa_err_handler_v4(struct sk_buff *skb, u32 info);
> > +int homa_err_handler_v6(struct sk_buff *skb,
> > + struct inet6_skb_parm *opt, u8 type, u8 code,
> > + int offset, __be32 info);
> > +int homa_fill_data_interleaved(struct homa_rpc *rpc,
> > + struct sk_buff *skb, struct iov_iter *iter);
> > +struct homa_gap *homa_gap_new(struct list_head *next, int start, int end);
> > +void homa_gap_retry(struct homa_rpc *rpc);
> > +int homa_get_port(struct sock *sk, unsigned short snum);
> > +int homa_getsockopt(struct sock *sk, int level, int optname,
> > + char __user *optval, int __user *optlen);
> > +int homa_hash(struct sock *sk);
> > +enum hrtimer_restart homa_hrtimer(struct hrtimer *timer);
> > +int homa_init(struct homa *homa, struct net *net);
> > +int homa_ioctl(struct sock *sk, int cmd, int *karg);
> > +int homa_load(void);
> > +int homa_message_out_fill(struct homa_rpc *rpc,
> > + struct iov_iter *iter, int xmit);
> > +void homa_message_out_init(struct homa_rpc *rpc, int length);
> > +void homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> > + struct homa_rpc *rpc);
> > +struct sk_buff *homa_new_data_packet(struct homa_rpc *rpc,
> > + struct iov_iter *iter, int offset,
> > + int length, int max_seg_data);
> > +int homa_net_init(struct net *net);
> > +void homa_net_exit(struct net *net);
> > +__poll_t homa_poll(struct file *file, struct socket *sock,
> > + struct poll_table_struct *wait);
> > +int homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> > + int flags, int *addr_len);
> > +void homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
> > + struct homa_sock *hsk);
> > +void homa_rpc_abort(struct homa_rpc *crpc, int error);
> > +void homa_rpc_acked(struct homa_sock *hsk,
> > + const struct in6_addr *saddr, struct homa_ack *ack);
> > +void homa_rpc_end(struct homa_rpc *rpc);
> > +void homa_rpc_handoff(struct homa_rpc *rpc);
> > +int homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
> > +int homa_setsockopt(struct sock *sk, int level, int optname,
> > + sockptr_t optval, unsigned int optlen);
> > +int homa_shutdown(struct socket *sock, int how);
> > +int homa_softirq(struct sk_buff *skb);
> > +void homa_spin(int ns);
> > +void homa_timer(struct homa *homa);
> > +int homa_timer_main(void *transport);
> > +void homa_unhash(struct sock *sk);
> > +void homa_rpc_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
> > +void homa_unload(void);
> > +int homa_wait_private(struct homa_rpc *rpc, int nonblocking);
> > +struct homa_rpc
> > + *homa_wait_shared(struct homa_sock *hsk, int nonblocking);
> > +int homa_xmit_control(enum homa_packet_type type, void *contents,
> > + size_t length, struct homa_rpc *rpc);
> > +int __homa_xmit_control(void *contents, size_t length,
> > + struct homa_peer *peer, struct homa_sock *hsk);
> > +void homa_xmit_data(struct homa_rpc *rpc, bool force);
> > +void homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk);
> > +
> > +int homa_message_in_init(struct homa_rpc *rpc, int unsched);
> > +void homa_resend_data(struct homa_rpc *rpc, int start, int end);
> > +void __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc);
>
> You should introduce the declaration of a given function in the same
> patch introducing the implementation. That means the patches should be
> sorted from the lowest level helper towards the upper layer.
The patches are already sorted from lower layers to upper layers, and
after your last round of comments I tested and reorganized so that the
code compiles cumulatively after the addition of each new patch in the
series (with one exception where there are mutual dependencies between
files in successive patches). homa_impl.h is a grab-bag for things
that don't fit logically elsewhere, so it has declarations for things
in several patches. I will try to distribute the function declarations
over the patches that contain the implementations.
> [...]
> > +static inline int homa_skb_append_to_frag(struct homa *homa,
> > + struct sk_buff *skb, void *buf,
> > + int length)
> > +{
> > + char *dst = skb_put(skb, length);
> > +
> > + memcpy(dst, buf, length);
> > + return 0;
>
> The name is misleading as it does not append to an skb frag but to the
> skb linear part
This file (homa_stub.h) is a temporary file during the upstreaming
process (see the comment at the beginning) in order to reduce the size
of the initial patch series. In a later patch series the
implementation will be replaced with a full-blown implementation of
skb management that actually uses skb frags; this version just puts
the entire skb in the linear part. So, the name reflects what will
eventually happen (which is implemented in the GitHub repo). I'd
prefer not to have to rewind the API back to what it was before "good"
skb management was introduced into Homa. Would you rather I just pull
the full skb management code into this first patch series?
> > +static inline void homa_skb_free_many_tx(struct homa *homa,
> > + struct sk_buff **skbs, int count)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < count; i++)
> > + kfree_skb(skbs[i]);
>
> 'homa' is unused here.
Same issue as the previous comment: it will be used in the "full"
version of skb management.
> > +static inline struct sk_buff *homa_skb_new_tx(int length)
>
> please use 'alloc' for allocator.
Done.
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 04/15] net: homa: create homa_pool.h and homa_pool.c
2025-05-05 9:51 ` Paolo Abeni
@ 2025-05-06 23:28 ` John Ousterhout
2025-05-08 8:35 ` Paolo Abeni
0 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-06 23:28 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, May 5, 2025 at 2:51 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> [...]
> > +/**
> > + * set_bpages_needed() - Set the bpages_needed field of @pool based
> > + * on the length of the first RPC that's waiting for buffer space.
> > + * The caller must own the lock for @pool->hsk.
> > + * @pool: Pool to update.
> > + */
> > +static void set_bpages_needed(struct homa_pool *pool)
> > +{
> > + struct homa_rpc *rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
> > + struct homa_rpc, buf_links);
>
> Minor nit: please insert an empty line between variable declaration and
> code.
Done. For some reason checkpatch.pl doesn't complain about this (or
the next comment below). Until yesterday I wasn't aware of the
--strict argument to checkpatch.pl, which may explain why patchwork
was finding checkpatch errors even though I was running checkpatch. I
have now made a pass over all the Homa code to clean up --strict
issues. But even with --strict, checkpatch.pl doesn't complain about
the indentation problem above. Are there additional switches I should
be giving to checkpatch.pl in addition to --strict?
> > + pool->bpages_needed = (rpc->msgin.length + HOMA_BPAGE_SIZE - 1)
> > + >> HOMA_BPAGE_SHIFT;
>
> Minor nit: please fix the indentation above
Fixed.
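For reference, the expression being discussed is a standard ceiling division: it rounds the message length up to a whole number of bpages. A rough userspace sketch (the HOMA_BPAGE_SHIFT value here is illustrative, not necessarily the kernel's):

```c
#include <assert.h>

/* Userspace sketch of the bpages_needed computation. HOMA_BPAGE_SHIFT
 * is assumed to define the bpage size as a power of two; 16 is just an
 * illustrative value.
 */
#define HOMA_BPAGE_SHIFT 16
#define HOMA_BPAGE_SIZE (1 << HOMA_BPAGE_SHIFT)

static int bpages_needed(int length)
{
	/* Round up: a partial trailing bpage still consumes a full page. */
	return (length + HOMA_BPAGE_SIZE - 1) >> HOMA_BPAGE_SHIFT;
}
```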
> > +/**
> > + * homa_pool_new() - Allocate and initialize a new homa_pool (it will have
> > + * no region associated with it until homa_pool_set_region is invoked).
> > + * @hsk: Socket the pool will be associated with.
> > + * Return: A pointer to the new pool or a negative errno.
> > + */
> > +struct homa_pool *homa_pool_new(struct homa_sock *hsk)
>
> The preferred name for an allocator includes 'alloc', not 'new'.
Got it. I have scanned the code base and replaced 'new' everywhere
with 'alloc'. I also replaced 'destroy' with 'free'.
> > +{
> > + struct homa_pool *pool;
> > +
> > + pool = kzalloc(sizeof(*pool), GFP_ATOMIC);
>
> You should try to use GFP_KERNEL allocation as much as you can, and use
> GFP_ATOMIC only in atomic context. If needed, try to move the function
> outside the atomic scope doing the allocation before acquiring the
> lock/rcu.
Will do. I was able to refactor the homa_pool code so that it doesn't
need GFP_ATOMIC.
> > + pool->num_cores = nr_cpu_ids;
>
> The 'num_cores' field is likely not needed, and it's never used in this
> series.
Yep, that field is no longer used. I have deleted it.
> > + pool->check_waiting_invoked = 0;
> > +
> > + return 0;
> > +
> > +error:
> > + kfree(pool->descriptors);
> > + free_percpu(pool->cores);
>
> The above assumes that 'pool' will be zeroed at allocation time, but the
> allocator does not do that. You should probably add the __GFP_ZERO flag
> to the pool allocator.
The pool is allocated with kzalloc; that zeroes it, no?
> > +bool homa_bpage_available(struct homa_bpage *bpage, u64 now)
> > +{
> > + int ref_count = atomic_read(&bpage->refs);
> > +
> > + return ref_count == 0 || (ref_count == 1 && bpage->owner >= 0 &&
> > + bpage->expiration <= now);
>
> Minor nit: please fix the indentation above. Other cases below. Please
> validate your patch with the checkpatch.pl script.
I have been running checkpatch.pl, but as I mentioned above it doesn't
seem to be reporting everything.
> > +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, u32 *pages,
> > + int set_owner)
> > +{
> > + int core_num = smp_processor_id();
> > + struct homa_pool_core *core;
> > + u64 now = sched_clock();
>
> From sched_clock() documentation:
>
> sched_clock() has no promise of monotonicity or bounded drift between
> CPUs, use (which you should not) requires disabling IRQs.
>
> Can't be used for an expiration time. You could use 'jiffies' instead,
Jiffies are *really* coarse grain (4 ms on my servers). It's possible
that I could make them work in this situation, but in general jiffies
won't work for Homa. Homa needs to make decisions at
microsecond-scale, and an RPC that takes one jiffy to complete is
completely unacceptable. Homa needs a fine-grain (e.g. cycle level)
clock that is monotonic and synchronous across cores, and as far as I
know, such a clock is available on every server where Homa is likely
to run. For example, I believe that the TSC counters on both Intel and
AMD chips have had the right properties for at least 10-15 years. And
isn't sched_clock based on TSC where it's available? So even though
sched_clock makes no official promises, isn't the reality actually
fine? Can I simply stipulate that Homa is not appropriate for any
machine where sched_clock doesn't have the properties Homa needs (this
won't be a significant limitation in practice)?
Ultimately I think Linux needs to bite the bullet and provide an
official fine-grain clock with ns precision.
> > +
> > + limit = pool->num_bpages
> > + - atomic_read(&pool->free_bpages);
>
> Nit: indentation above, the operator should stay on the first line.
Fixed but, again, checkpatch.pl didn't report it.
> > +
> > + /* Figure out whether this candidate is free (or can be
> > + * stolen). Do a quick check without locking the page, and
> > + * if the page looks promising, then lock it and check again
> > + * (must check again in case someone else snuck in and
> > + * grabbed the page).
> > + */
> > + if (!homa_bpage_available(bpage, now))
> > + continue;
>
> homa_bpage_available() accesses bpage without lock, so it needs READ_ONCE()
> annotations on the relevant fields, and you need to add paired
> WRITE_ONCE() when updating them.
I think the code is safe as is. Even though some of the fields
accessed by homa_bpage_available are not atomic, it's not a disaster
if they return stale values, since homa_bpage_available is invoked
again after acquiring the lock before making any final decisions. The
worst that can happen is (a) skipping over a bpage that's actually
available or (b) acquiring the lock only to discover the bpage wasn't
actually available (and then skipping it). Neither of these is
problematic.
Also, I'm not sure what you mean by "READ_ONCE() annotations on the
relevant fields". Do I need something additional in the field
declaration, in addition to using READ_ONCE() and WRITE_ONCE() to
access the field?
>
> > + if (!spin_trylock_bh(&bpage->lock))
>
> Why only trylock? I have a vague memory of some discussion on this point
> in a previous revision. You should at least add a comment here or in the
> commit message explaining why a plain spin_lock does not fit.
I think the reasoning is different here than in other situations we
may have discussed. I have added the following comment:
"Rather than wait for a locked page to become free, just go on to the
next page. If the page is locked, it probably won't turn out to be
available anyway."
> > + /* The last chunk may be less than a full bpage; for this we use
> > + * the bpage that we own (and reuse it for multiple messages).
> > + */
> > + partial = rpc->msgin.length & (HOMA_BPAGE_SIZE - 1);
> > + if (unlikely(partial == 0))
> > + goto success;
> > + core_id = smp_processor_id();
>
> Is this code running in non-preemptible scope? otherwise you need to use
> get_cpu() here and put_cpu() when you are done with 'core_id'.
Yes, it's non-preemptible since a spinlock is being held on the RPC.
>
> > + (pool->cores);
> > + bpage = &pool->descriptors[core->page_hint];
> > + if (!spin_trylock_bh(&bpage->lock))
> > + spin_lock_bh(&bpage->lock);
>
> I think I already commented on this pattern. Please don't use it.
Sorry, this is not intentional. It came about because the patches for
upstreaming are generated by extracting code from the "full" version
of Homa, removing things such as instrumentation code and
functionality that is not part of this patch series. The stripper is
not smart enough to recognize situations like this where the stripped
code, though technically correct, is nonsensical. I have to go in by
hand and add extra annotations to the source code so that the output
looks reasonable. I have now done that for this situation.
> > +
> > + /* We get here if there wasn't enough buffer space for this
> > + * message; add the RPC to hsk->waiting_for_bufs.
>
> Please also add a comment describing why waiting RPCs are sorted by
> message size.
Done. The list is sorted in order to implement the SRPT policy (give
priority to the shortest messages).
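A rough userspace sketch of what SRPT-ordered insertion into such a wait list looks like (the names and the standalone list type are illustrative, not the actual patch code, which uses the kernel's list_head):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model of the waiting-for-buffers queue: RPCs are kept
 * sorted by message length so the shortest waiting message is granted
 * buffer space first (the SRPT policy).
 */
struct waiting_rpc {
	int msgin_length;
	struct waiting_rpc *next;
};

static void insert_srpt(struct waiting_rpc **head, struct waiting_rpc *rpc)
{
	struct waiting_rpc **pp = head;

	/* Walk past entries with shorter or equal messages; inserting
	 * after equals preserves FIFO order among same-length waiters.
	 */
	while (*pp && (*pp)->msgin_length <= rpc->msgin_length)
		pp = &(*pp)->next;
	rpc->next = *pp;
	*pp = rpc;
}
```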
> > + rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
> > + struct homa_rpc, buf_links);
> > + if (!homa_rpc_try_lock(rpc)) {
> > + /* Can't just spin on the RPC lock because we're
> > + * holding the socket lock (see sync.txt). Instead,
>
> The documentation should live under:
>
> Documentation/networking/
>
> likely in its own subdir, and must be in restructured format.
>
> Here you should just mention that the lock acquiring order is rpc ->
> home sock lock.
I have updated the comment as you requested, and I'll reformat the
.txt files and move them to Documentation/networking/homa.
>
> > + * release the socket lock and try the entire
> > + * operation again.
> > + */
> > + homa_sock_unlock(pool->hsk);
> > + continue;
> > + }
> > + list_del_init(&rpc->buf_links);
> > + if (list_empty(&pool->hsk->waiting_for_bufs))
> > + pool->bpages_needed = INT_MAX;
> > + else
> > + set_bpages_needed(pool);
> > + homa_sock_unlock(pool->hsk);
> > + homa_pool_allocate(rpc);
>
> Why you don't need to check the allocation return value here?
There's no need to check the return value because if the allocation
couldn't be made, homa_pool_allocate automatically requeues the RPC.
The only time it returns an "error" is if there is no allocation
region. This should never happen in the first place, and if it does
the right response is simply to ignore the error and continue.
> > + * struct homa_bpage - Contains information about a single page in
> > + * a buffer pool.
> > + */
> > +struct homa_bpage {
> > + union {
> > + /**
> > + * @cache_line: Ensures that each homa_bpage object
> > + * is exactly one cache line long.
> > + */
> > + char cache_line[L1_CACHE_BYTES];
>
> Instead of the struct/union nesting just use ____cacheline_aligned
Done.
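For readers following along, the suggested simplification replaces the union/char-array padding trick with the alignment attribute; the compiler then rounds the struct size up to a multiple of the cache line by itself. A userspace model (the macro stand-ins and field set are illustrative):

```c
#include <assert.h>
#include <stdalign.h>

/* Userspace stand-ins; in the kernel these come from <linux/cache.h>. */
#define L1_CACHE_BYTES 64
#define ____cacheline_aligned __attribute__((aligned(L1_CACHE_BYTES)))

/* The attribute both aligns instances on a cache-line boundary and
 * rounds sizeof up to a multiple of the line, so arrays of these
 * structs don't share lines.
 */
struct homa_bpage_model {
	int owner;
	long long expiration;
	int refs;
} ____cacheline_aligned;
```

Note the slight semantic difference from the original union: the attribute guarantees the size is a *multiple* of a cache line, not exactly one line, so a struct that outgrows one line will occupy two rather than fail to compile.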
> [...]
> > +* Homa's approach means that socket shutdown and deletion can potentially
> > + occur while operations are underway that hold RPC locks but not the socket
> > + lock. This creates several potential problems:
> > + * A socket might be deleted and its memory reclaimed while an RPC still
> > + has access to it. Homa assumes that Linux will prevent socket deletion
> > + while the kernel call is executing.
>
> This last sentence is not clear to me. Do you mean that the kernel
> ensures that the socket is freed after the close() syscall?
Apologies... this text is no longer accurate. A socket cannot have its
memory reclaimed until all RPCs associated with the socket have been
ended and reaped. I've revised that documentation so it now looks like
this:
* Homa's approach means that socket shutdown and deletion can potentially
begin while operations are underway that hold RPC locks but not the socket
lock. For example, a new RPC creation might be underway when a socket
is shut down, which could attempt to add the new RPC after homa_sock_shutdown
thinks it has deleted all RPCs. Handling this requires careful checking
of hsk->shutdown. For example, during new RPC creation the socket lock
must be acquired to add the new RPC to those for the socket; after acquiring
the lock, it must check hsk->shutdown and abort the RPC creation if the
socket has been shutdown.
A question for you: do socket-related kernel calls such as recvmsg
automatically take a reference on the socket or do something else to
protect it? I've been assuming that sockets can't go away during
callbacks such as those for recvmsg and sendmsg.
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c
2025-05-05 11:06 ` Paolo Abeni
@ 2025-05-07 16:11 ` John Ousterhout
2025-05-07 17:25 ` Andrew Lunn
2025-05-08 8:46 ` Paolo Abeni
0 siblings, 2 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-07 16:11 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, May 5, 2025 at 4:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> [...]
> > +{
> > + /* Note: when we return, the object must be initialized so it's
> > + * safe to call homa_peertab_destroy, even if this function returns
> > + * an error.
> > + */
> > + int i;
> > +
> > + spin_lock_init(&peertab->write_lock);
> > + INIT_LIST_HEAD(&peertab->dead_dsts);
> > + peertab->buckets = vmalloc(HOMA_PEERTAB_BUCKETS *
> > + sizeof(*peertab->buckets));
>
> This struct looks way too big to be allocated on per netns basis. You
> should use a global table and include the netns in the lookup key.
Are there likely to be lots of netns's in a system? I thought I read
someplace that a hardware NIC must belong exclusively to a single
netns, so from that I assumed there couldn't be more than a few
netns's. Can there be virtual NICs, leading to lots of netns's? Can
you give me a ballpark number for how many netns's there might be in a
system with "lots" of them? This will be useful in making design
tradeoffs.
> > + /* No existing entry; create a new one.
> > + *
> > + * Note: after we acquire the lock, we have to check again to
> > + * make sure the entry still doesn't exist (it might have been
> > + * created by a concurrent invocation of this function).
> > + */
> > + spin_lock_bh(&peertab->write_lock);
> > + hlist_for_each_entry(peer, &peertab->buckets[bucket],
> > + peertab_links) {
> > + if (ipv6_addr_equal(&peer->addr, addr))
> > + goto done;
> > + }
> > + peer = kmalloc(sizeof(*peer), GFP_ATOMIC | __GFP_ZERO);
>
> Please, move the allocation outside the atomic scope and use GFP_KERNEL.
I don't think I can do that because homa_peer_find is invoked in
softirq code, which is atomic, no? It's not disastrous if the
allocation fails; the worst that happens is that an incoming packet
must be discarded (it will be retried later).
> > + if (!peer) {
> > + peer = (struct homa_peer *)ERR_PTR(-ENOMEM);
> > + goto done;
> > + }
> > + peer->addr = *addr;
> > + dst = homa_peer_get_dst(peer, inet);
> > + if (IS_ERR(dst)) {
> > + kfree(peer);
> > + peer = (struct homa_peer *)PTR_ERR(dst);
> > + goto done;
> > + }
> > + peer->dst = dst;
> > + hlist_add_head_rcu(&peer->peertab_links, &peertab->buckets[bucket]);
>
> At this point another CPU can lookup 'peer'. Since there are no memory
> barriers it could observe a NULL peer->dst.
Oops, good catch. I need to add 'smp_wmb()' just before the
hlist_add_head_rcu line?
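For context, the usual kernel idiom is initialize-fully-then-publish; a sketch (not the actual patch code) of how the ordering is normally obtained:

```c
/* Sketch only: write every field of the new peer first, then publish.
 * hlist_add_head_rcu() uses rcu_assign_pointer() internally, which
 * provides release ordering for the publication, so RCU readers that
 * traverse the chain with hlist_for_each_entry_rcu() observe the prior
 * initialization; an explicit smp_wmb() before the add would also
 * work, but is typically redundant when the node is reachable only
 * through the RCU-protected list.
 */
peer->addr = *addr;
peer->dst = dst;			/* all fields written before... */
hlist_add_head_rcu(&peer->peertab_links,
		   &peertab->buckets[bucket]);	/* ...the publish. */
```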
> Also AFAICS new peers are always added when lookup for a different
> address fail and deleted only at netns shutdown time (never for the initns).
Correct.
> You need to:
> - account the memory used for peer
> - enforce an upper bound for the total number of peers (per netns),
> eventually freeing existing old ones.
OK, will do.
> Note that freeing the peer at 'runtime' will require additional changes:
> i.e. likely refcounting will be needed, or at lookup time, after the
> rcu unlock, the code could hit HaF while accessing the looked-up peer.
I understand about reference counting, but I couldn't parse the last
1.5 lines above. What is HaF?
> > + dst = homa_peer_get_dst(peer, &hsk->inet);
> > + if (IS_ERR(dst)) {
> > + kfree(save_dead);
> > + return;
> > + }
> > +
> > + spin_lock_bh(&peertab->write_lock);
> > + now = sched_clock();
>
> Use jiffies instead.
Will do, but this code will probably go away with the refactor to
manage homa_peer memory usage.
> > + save_dead->dst = peer->dst;
> > + save_dead->gc_time = now + 100000000; /* 100 ms */
> > + list_add_tail(&save_dead->dst_links, &peertab->dead_dsts);
> > + homa_peertab_gc_dsts(peertab, now);
> > + peer->dst = dst;
> > + spin_unlock_bh(&peertab->write_lock);
>
> It's unclear to me why you need this additional GC layer on top's of the
> core one.
Now that you mention it, it's unclear to me as well. I think this will
go away in the refactor.
> [...]
> > +static inline struct dst_entry *homa_get_dst(struct homa_peer *peer,
> > + struct homa_sock *hsk)
> > +{
> > + if (unlikely(peer->dst->obsolete > 0))
>
> you need to additionally call dst->ops->check
I wasn't aware of dst->ops->check, and I'm a little confused by it
(usage in the kernel doesn't seem totally consistent):
* If I call dst->ops->check(), do I also need to check obsolete
(perhaps only call check if obsolete is true?)?
* What is the 'cookie' argument to dst->ops->check? Can I just use 0 safely?
* It looks like dst->ops->check now returns a struct dst_entry
pointer. What is the meaning of this? ChatGPT suggests that it is a
replacement dst_entry, if the original is no longer valid. If so, did
the check function release a reference on the original dst_entry
and/or take a reference on the new one? It looks like the return value
is just ignored in many cases, which would suggest that no references
have been taken or released.
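For reference, the pattern used elsewhere in the stack (e.g. __sk_dst_check() in net/core/sock.c) looks roughly like the sketch below; the function name and surrounding details are illustrative, not Homa code. There, check() is consulted only when dst->obsolete is nonzero, it returns the dst if still usable and NULL otherwise, and it does not transfer references, so the caller drops its own reference and re-resolves on NULL:

```c
/* Sketch modeled on __sk_dst_check(); homa_dst_check is a hypothetical
 * name. The cookie is a validity token (e.g. the IPv6 flow cookie);
 * callers that don't track one pass 0.
 */
static struct dst_entry *homa_dst_check(struct homa_peer *peer, u32 cookie)
{
	struct dst_entry *dst = peer->dst;

	if (dst->obsolete && !dst->ops->check(dst, cookie)) {
		dst_release(dst);	/* caller's reference */
		return NULL;		/* caller must re-create the route */
	}
	return dst;
}
```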
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c
2025-05-07 16:11 ` John Ousterhout
@ 2025-05-07 17:25 ` Andrew Lunn
2025-05-08 8:46 ` Paolo Abeni
1 sibling, 0 replies; 46+ messages in thread
From: Andrew Lunn @ 2025-05-07 17:25 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
On Wed, May 07, 2025 at 09:11:01AM -0700, John Ousterhout wrote:
> On Mon, May 5, 2025 at 4:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> > On 5/3/25 1:37 AM, John Ousterhout wrote:
> > [...]
> > > +{
> > > + /* Note: when we return, the object must be initialized so it's
> > > + * safe to call homa_peertab_destroy, even if this function returns
> > > + * an error.
> > > + */
> > > + int i;
> > > +
> > > + spin_lock_init(&peertab->write_lock);
> > > + INIT_LIST_HEAD(&peertab->dead_dsts);
> > > + peertab->buckets = vmalloc(HOMA_PEERTAB_BUCKETS *
> > > + sizeof(*peertab->buckets));
> >
> > This struct looks way too big to be allocated on per netns basis. You
> > should use a global table and include the netns in the lookup key.
>
> Are there likely to be lots of netns's in a system? I thought I read
> someplace that a hardware NIC must belong exclusively to a single
> netns, so from that I assumed there couldn't be more than a few
> netns's.
You might want to read up about PF and VF, as part of SR-IOV
https://www.intel.com/content/www/us/en/developer/articles/technical/configure-sr-iov-network-virtual-functions-in-linux-kvm.html
https://doc.dpdk.org/guides/_images/single_port_nic.png
You can have one NIC support a number of Virtual Functions, each of
which is a PCIe device on the bus and gets its own Linux
interface. You can move those interfaces between network namespaces,
or pass them through into virtual machines. Below these Virtual
Functions is an embedded switch, often called eswitch. That allows
traffic to flow between the VFs, for e.g. VM to VM, or out the media
to the link peer.
I've used some Intel NICs which support 32 VFs, but other Intel NICs
support 64 VFs.
Andrew
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 06/15] net: homa: create homa_sock.h and homa_sock.c
2025-05-05 16:46 ` Paolo Abeni
@ 2025-05-07 18:30 ` John Ousterhout
0 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-07 18:30 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, May 5, 2025 at 9:46 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> > + /* Initialize Homa-specific fields. */
> > + spin_lock_bh(&socktab->write_lock);
> > + atomic_set(&hsk->protect_count, 0);
> > + spin_lock_init(&hsk->lock);
> > + atomic_set(&hsk->protect_count, 0);
>
> Duplicate 'atomic_set(&hsk->protect_count, 0);' statement above
Oops; fixed now.
> > + hsk->port = homa->prev_default_port;
> > + hsk->inet.inet_num = hsk->port;
> > + hsk->inet.inet_sport = htons(hsk->port);
>
> The above code looks like a bind() operation, but it's unclear why it's
> performed at init time.
All Homa sockets are automatically assigned a port at creation time,
so there's no need for them to call bind in the common case where they
are being used for the client side only. Bind only needs to be called
if the application wants to use a well-known port number.
> > + for (i = 0; i < HOMA_CLIENT_RPC_BUCKETS; i++) {
> > + struct homa_rpc_bucket *bucket = &hsk->client_rpc_buckets[i];
> > +
> > + spin_lock_init(&bucket->lock);
> > + bucket->id = i;
> > + INIT_HLIST_HEAD(&bucket->rpcs);
> > + }
> > + for (i = 0; i < HOMA_SERVER_RPC_BUCKETS; i++) {
> > + struct homa_rpc_bucket *bucket = &hsk->server_rpc_buckets[i];
> > +
> > + spin_lock_init(&bucket->lock);
> > + bucket->id = i + 1000000;
> > + INIT_HLIST_HEAD(&bucket->rpcs);
> > + }
>
> I'm under the impression that using rhashtable for both the client and
> the server rpcs will deliver both more efficient memory usage and better
> performance.
I wasn't aware of rhashtable; I'll take a look.
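For readers unfamiliar with it, an rhashtable-based RPC table might be declared roughly as below (illustrative names; see include/linux/rhashtable.h). The table resizes itself, so the fixed per-socket bucket arrays go away:

```c
/* Sketch only: a hypothetical rhashtable setup for per-socket RPCs. */
struct homa_rpc {
	u64 id;				/* lookup key */
	struct rhash_head hash_node;	/* linkage inside the table */
	/* ... */
};

static const struct rhashtable_params homa_rpc_params = {
	.key_len = sizeof(u64),
	.key_offset = offsetof(struct homa_rpc, id),
	.head_offset = offsetof(struct homa_rpc, hash_node),
	.automatic_shrinking = true,
};

/* Usage sketch:
 *	rhashtable_init(&hsk->rpcs, &homa_rpc_params);
 *	rhashtable_lookup_fast(&hsk->rpcs, &id, homa_rpc_params);
 */
```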
> > +
> > + tx_memory = refcount_read(&hsk->sock.sk_wmem_alloc);
> > + if (tx_memory != 1) {
> > + pr_err("%s found sk_wmem_alloc %llu bytes, port %d\n",
> > + __func__, tx_memory, hsk->port);
> > + }
>
> Just:
> WARN_ON_ONCE(refcount_read(&sk->sk_wmem_alloc) != 1);
Done (but this is a bit unsatisfying because it generates less useful
information in the log).
> > +/**
> > + * homa_sock_destroy() - Destructor for homa_sock objects. This function
> > + * only cleans up the parts of the object that are owned by Homa.
> > + * @hsk: Socket to destroy.
> > + */
> > +void homa_sock_destroy(struct homa_sock *hsk)
> > +{
> > + homa_sock_shutdown(hsk);
> > + sock_set_flag(&hsk->inet.sk, SOCK_RCU_FREE);
>
> Why the flag is set only now and not at creation time?
No reason that I can think of; I've now moved it to creation time.
After asking ChatGPT about this flag, I'm no longer certain that Homa
needs it. Can you help me understand the conditions that would make
the flag necessary/unnecessary?
> [...]
> > +/**
> > + * struct homa_socktab - A hash table that maps from port numbers (either
> > + * client or server) to homa_sock objects.
> > + *
> > + * This table is managed exclusively by homa_socktab.c, using RCU to
> > + * minimize synchronization during lookups.
> > + */
> > +struct homa_socktab {
> > + /**
> > + * @write_lock: Controls all modifications to this object; not needed
> > + * for socket lookups (RCU is used instead). Also used to
> > + * synchronize port allocation.
> > + */
> > + spinlock_t write_lock;
> > +
> > + /**
> > + * @buckets: Heads of chains for hash table buckets. Chains
> > + * consist of homa_sock objects.
> > + */
> > + struct hlist_head buckets[HOMA_SOCKTAB_BUCKETS];
> > +};
>
> This is probably a bit too large to be unconditionally allocated for
> each netns. You are probably better off with a global hash table, with
> the lookup key including the netns itself.
OK, will do.
> [...]
> > +/**
> > + * homa_sock_lock() - Acquire the lock for a socket.
> > + * @hsk: Socket to lock.
> > + */
> > +static inline void homa_sock_lock(struct homa_sock *hsk)
> > + __acquires(&hsk->lock)
> > +{
> > + spin_lock_bh(&hsk->lock);
>
> I was wondering how the hsk socket lock could be nested under a
> spinlock... The above can't work, unless you prevent any core and
> inet-related operations on hsk sockets. That in turn means duplicate
> entirely a lot of code or preventing a lot of basic stuff from working
> on homa sockets.
>
> Homa sockets are still inet ones. It's expected that SOL_SOCKET and
> SOL_IP[V6] socket options work on them (or at least could be implemented).
>
> I think this point is very critical.
Can you provide an example of a specific situation that you think will
be problematic? My belief (hope?) is that Homa does not use any socket
operations that require the official socket lock. Homa implements
socket options using its own spinlock.
> Somewhat related: the patch order makes IMHO the review complex, because
> I often need to look to the following patches to get needed context.
Unfortunately I don't know how to fix this problem. I'm a bit
skeptical that it is even possible to understand this much code in a
purely linear fashion. If you have suggestions for how I can organize
the patches to make them easier to review, I'd be happy to hear them.
At the same time, it's been a struggle for me to extract these patches
cleanly from the full Homa source and evolve them while allowing
concurrent development of the full source; this has led to some of the
awkward chunks of code you have noticed.
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 07/15] net: homa: create homa_interest.h and homa_interest.
2025-05-06 13:53 ` Paolo Abeni
@ 2025-05-07 18:45 ` John Ousterhout
0 siblings, 0 replies; 46+ messages in thread
From: John Ousterhout @ 2025-05-07 18:45 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Tue, May 6, 2025 at 6:53 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> > +/**
> > + * homa_interest_init_shared() - Initialize an interest and queue it up on a socket.
>
> What is an 'interest'? An event like input avail or unblocking output
> possible? If so, 'event' could be a possible/more idiomatic name.
I've revised the class header for struct homa_interest to be more
helpful (I hope). Here's the new version:
/**
* struct homa_interest - Holds info that allows applications to wait for
* incoming RPC messages. An interest can be either private, in which case
* the application is waiting for a single specific RPC response and the
* interest is referenced by an rpc->private_interest, or shared, in which
* case the application is waiting for any incoming message that isn't
* private and the interest is present on hsk->interests.
*/
> > + * @interest: Interest to initialize
> > + * @hsk: Socket on which the interests should be queued. Must be locked
> > + * by caller.
> > + */
> > +void homa_interest_init_shared(struct homa_interest *interest,
> > + struct homa_sock *hsk)
> > + __must_hold(&hsk->lock)
> > +{
> > + interest->rpc = NULL;
> > + atomic_set(&interest->ready, 0);
> > + interest->core = raw_smp_processor_id();
>
> I don't see this 'core' field used later on in this series. If so,
> please avoid introducing it until it's really used.
Yep, that field is not needed in this patch series (it will be needed
when additional Homa functionality is upstreamed). I've arranged for
it not to appear in future revisions of the patch. Do you know if
there are any tools available that can identify code and declarations
that are no longer relevant when functionality is removed from a
system? I've been doing this by hand, but as you have noticed it's a
pretty error-prone process.
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c
2025-05-06 14:05 ` Paolo Abeni
@ 2025-05-07 18:55 ` John Ousterhout
2025-05-07 20:31 ` Andrew Lunn
0 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-07 18:55 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Tue, May 6, 2025 at 7:05 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 5/3/25 1:37 AM, John Ousterhout wrote:
> > + /**
> > + * @link_mbps: The raw bandwidth of the network uplink, in
> > + * units of 1e06 bits per second. Set externally via sysctl.
> > + */
> > + int link_mbps;
>
> This will be extremely problematic. In practice nobody will set this
> correctly and in some cases the info is not even available (VM) or will
> change dynamically due to policing/shaping.
>
> I think you need to build your own estimator of the available B/W. I'm
> unsure/I don't think you can re-use bql info here.
I agree about the issues, but I'd like to defer addressing them. I
have begun working on a new Homa-specific qdisc, which will improve
performance when there is concurrent TCP and Homa traffic. It
retrieves link speed from the net_device, which will eliminate the
need for the link_mbps configuration option. Hopefully this will be
ready by the time Homa upstreaming is complete.
There are additional issues that the qdisc will not address, such as
VMs. VMs raise a bunch of complications for Homa, because Homa
believes that it sees all of the traffic going out on the uplink (so
that it can pace properly) and this isn't the case in VMs. Again, I'd
like to defer dealing with this.
-John-
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c
2025-05-07 18:55 ` John Ousterhout
@ 2025-05-07 20:31 ` Andrew Lunn
2025-05-07 20:46 ` John Ousterhout
0 siblings, 1 reply; 46+ messages in thread
From: Andrew Lunn @ 2025-05-07 20:31 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
On Wed, May 07, 2025 at 11:55:23AM -0700, John Ousterhout wrote:
> On Tue, May 6, 2025 at 7:05 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 5/3/25 1:37 AM, John Ousterhout wrote:
> > > + /**
> > > + * @link_mbps: The raw bandwidth of the network uplink, in
> > > + * units of 1e06 bits per second. Set externally via sysctl.
> > > + */
> > > + int link_mbps;
> >
> > This will be extremely problematic. In practice nobody will set this
> > correctly and in some cases the info is not even available (VM) or will
> > change dynamically due to policing/shaping.
> >
> > I think you need to build your own estimator of the available B/W. I'm
> > unsure/I don't think you can re-use bql info here.
>
> I agree about the issues, but I'd like to defer addressing them. I
> have begun working on a new Homa-specific qdisc, which will improve
> performance when there is concurrent TCP and Homa traffic. It
> retrieves link speed from the net_device, which will eliminate the
> need for the link_mbps configuration option.
I would be sceptical of the link speed, if you mean to use ethtool
get_link_ksettings(). Not all switches have sufficient core bandwidth
to allow all their ports to operate at line rate at the same
time. There could be pause frames being sent back to slow the link
down. And there could be FEC reducing the actual bandwidth you can get
over the media. You also need to consider congestion on switch egress,
when multiple sources are sending to one sink etc.
BQL gives you a better idea of what the link is actually capable of,
over the last few seconds, to the first switch. But after that,
further hops across the network, it does not help.
Andrew
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c
2025-05-07 20:31 ` Andrew Lunn
@ 2025-05-07 20:46 ` John Ousterhout
2025-05-22 7:49 ` Paolo Abeni
0 siblings, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-07 20:46 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
get_link_ksettings is what I was thinking of. Some of the issues you
mentioned, such as switch egress contention, are explicitly handled by
Homa, so those needn't (and shouldn't) be factored into the link
"speed". And don't pretty much all modern datacenter switches allow
all of their links to operate at full speed?
-John-
On Wed, May 7, 2025 at 1:31 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Wed, May 07, 2025 at 11:55:23AM -0700, John Ousterhout wrote:
> > On Tue, May 6, 2025 at 7:05 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > >
> > > On 5/3/25 1:37 AM, John Ousterhout wrote:
> > > > + /**
> > > > + * @link_mbps: The raw bandwidth of the network uplink, in
> > > > + * units of 1e06 bits per second. Set externally via sysctl.
> > > > + */
> > > > + int link_mbps;
> > >
> > > This will be extremely problematic. In practice nobody will set this
> > > correctly and in some cases the info is not even available (VM) or will
> > > change dynamically due to policing/shaping.
> > >
> > > I think you need to build your own estimator of the available B/W. I'm
> > > unsure/I don't think you can re-use bql info here.
> >
> > I agree about the issues, but I'd like to defer addressing them. I
> > have begun working on a new Homa-specific qdisc, which will improve
> > performance when there is concurrent TCP and Homa traffic. It
> > retrieves link speed from the net_device, which will eliminate the
> > need for the link_mbps configuration option.
>
> I would be sceptical of the link speed, if you mean to use ethtool
> get_link_ksettings(). Not all switches have sufficient core bandwidth
> to allow all their ports to operate at line rate at the same
> time. There could be pause frames being sent back to slow the link
> down. And there could be FEC reducing the actual bandwidth you can get
> over the media. You also need to consider congestion on switch egress,
> when multiple sources are sending to one sink etc.
>
> BQL gives you a better idea of what the link is actually capable of,
> over the last few seconds, to the first switch. But after that,
> further hops across the network, it does not help.
>
> Andrew
* Re: [PATCH net-next v8 04/15] net: homa: create homa_pool.h and homa_pool.c
2025-05-06 23:28 ` John Ousterhout
@ 2025-05-08 8:35 ` Paolo Abeni
0 siblings, 0 replies; 46+ messages in thread
From: Paolo Abeni @ 2025-05-08 8:35 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 5/7/25 1:28 AM, John Ousterhout wrote:
> On Mon, May 5, 2025 at 2:51 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 5/3/25 1:37 AM, John Ousterhout wrote:
[...]
>>> + pool->check_waiting_invoked = 0;
>>> +
>>> + return 0;
>>> +
>>> +error:
>>> + kfree(pool->descriptors);
>>> + free_percpu(pool->cores);
>>
>> The above assumes that 'pool' will be zeroed at allocation time, but the
>> allocator does not do that. You should probably add the __GFP_ZERO flag
>> to the pool allocator.
>
> The pool is allocated with kzalloc; that zeroes it, no?
You are right, I misread the alloc function used.
>>> +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, u32 *pages,
>>> + int set_owner)
>>> +{
>>> + int core_num = smp_processor_id();
>>> + struct homa_pool_core *core;
>>> + u64 now = sched_clock();
>>
>> From sched_clock() documentation:
>>
>> sched_clock() has no promise of monotonicity or bounded drift between
>> CPUs, use (which you should not) requires disabling IRQs.
>>
>> Can't be used for an expiration time. You could use 'jiffies' instead,
>
> Jiffies are *really* coarse grain (4 ms on my servers). It's possible
> that I could make them work in this situation, but in general jiffies
> won't work for Homa. Homa needs to make decisions at
> microsecond-scale, and an RPC that takes one jiffy to complete is
> completely unacceptable. Homa needs a fine-grain (e.g. cycle level)
> clock that is monotonic and synchronous across cores, and as far as I
> know, such a clock is available on every server where Homa is likely
> to run. For example, I believe that the TSC counters on both Intel and
> AMD chips have had the right properties for at least 10-15 years. And
> isn't sched_clock based on TSC where it's available? So even though
> sched_clock makes no official promises, isn't the reality actually
> fine? Can I simply stipulate that Homa is not appropriate for any
> machine where sched_clock doesn't have the properties Homa needs (this
> won't be a significant limitation in practice)?
>
> Ultimately I think Linux needs to bite the bullet and provide an
> official fine-grain clock with ns precision.
If you need ns precision, use a ktime_get() variant. The point here is
that you should not use sched_clock().
>>> +
>>> + /* Figure out whether this candidate is free (or can be
>>> + * stolen). Do a quick check without locking the page, and
>>> + * if the page looks promising, then lock it and check again
>>> + * (must check again in case someone else snuck in and
>>> + * grabbed the page).
>>> + */
>>> + if (!homa_bpage_available(bpage, now))
>>> + continue;
>>
>> homa_bpage_available() accesses bpage without lock, so needs READ_ONCE()
>> annotations on the relevant fields, and you need to add paired
>> WRITE_ONCE() when updating them.
>
> I think the code is safe as is. Even though some of the fields
> accessed by homa_bpage_available are not atomic, it's not a disaster
> if they return stale values, since homa_bpage_available is invoked
> again after acquiring the lock before making any final decisions. The
> worst that can happen is (a) skipping over a bpage that's actually
> available or (b) acquiring the lock only to discover the bpage wasn't
> actually available (and then skipping it). Neither of these is
> problematic.
>
> Also, I'm not sure what you mean by "READ_ONCE() annotations on the
> relevant fields". Do I need something additional in the field
> declaration, in addition to using READ_ONCE() and WRITE_ONCE() to
> access the field?
If you have concurrent access without any lock protection, fuzzers will
complain. If that is safe - I'm not sure at this point; it looks like
double-read anti-patterns could be possible - you should document it with:
READ_ONCE(<field>)
to read the relevant field outside the lock and:
WRITE_ONCE(<field>)
to write such field. Have a look at Documentation/memory-barriers.txt
for more details.
>>> + if (!spin_trylock_bh(&bpage->lock))
>>
>> Why only trylock? I have a vague memory of some discussion on this point
>> in a previous revision. You should at least add a comment here or in the
>> commit message explaining why a plain spin_lock does not fit.
>
> I think the reasoning is different here than in other situations we
> may have discussed. I have added the following comment:
> "Rather than wait for a locked page to become free, just go on to the
> next page. If the page is locked, it probably won't turn out to be
> available anyway."
The main point here (and elsewhere) is that unusual constructs that
sparked a question/comment on a prior revision should be documented, see:
https://elixir.bootlin.com/linux/v6.14.5/source/Documentation/process/6.Followthrough.rst#L80
I'm sorry, I have spent a considerable slice of my time on this series
recently, starving a lot of other patches. I can't spend more time here
for a while.
I would like to outline that there are critical points outstanding.
The locking schema still looks problematic. I *think* that moving the
RPCs into a global hash (as opposed to per-socket containers) with its own
lock could simplify the problem and help avoid the "AB"/"BA" deadlocks.
Kernel APIs usage and integration with the socket interface should be
improved.
I haven't yet looked in detail at patches after 8/15, but I suspect
relevant rework is needed there, too.
/P
* Re: [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c
2025-05-07 16:11 ` John Ousterhout
2025-05-07 17:25 ` Andrew Lunn
@ 2025-05-08 8:46 ` Paolo Abeni
1 sibling, 0 replies; 46+ messages in thread
From: Paolo Abeni @ 2025-05-08 8:46 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 5/7/25 6:11 PM, John Ousterhout wrote:
> On Mon, May 5, 2025 at 4:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
>> On 5/3/25 1:37 AM, John Ousterhout wrote:
>> [...]
>>> +{
>>> + /* Note: when we return, the object must be initialized so it's
>>> + * safe to call homa_peertab_destroy, even if this function returns
>>> + * an error.
>>> + */
>>> + int i;
>>> +
>>> + spin_lock_init(&peertab->write_lock);
>>> + INIT_LIST_HEAD(&peertab->dead_dsts);
>>> + peertab->buckets = vmalloc(HOMA_PEERTAB_BUCKETS *
>>> + sizeof(*peertab->buckets));
>>
>> This struct looks way too big to be allocated on a per-netns basis. You
>> should use a global table and include the netns in the lookup key.
>
> Are there likely to be lots of netns's in a system?
Yes. In fact a relevant performance metric is the time to create and
destroy thousands of them.
> I thought I read
> someplace that a hardware NIC must belong exclusively to a single
> netns, so from that I assumed there couldn't be more than a few
> netns's. Can there be virtual NICs, leading to lots of netns's?
Yes, veth devices are pretty ubiquitous in containerization setups.
> Can
> you give me a ballpark number for how many netns's there might be in a
> system with "lots" of them? This will be useful in making design
> tradeoffs.
You should consider at least 1K as a target number, but a large system
should just work with 10K or more.
>>> + /* No existing entry; create a new one.
>>> + *
>>> + * Note: after we acquire the lock, we have to check again to
>>> + * make sure the entry still doesn't exist (it might have been
>>> + * created by a concurrent invocation of this function).
>>> + */
>>> + spin_lock_bh(&peertab->write_lock);
>>> + hlist_for_each_entry(peer, &peertab->buckets[bucket],
>>> + peertab_links) {
>>> + if (ipv6_addr_equal(&peer->addr, addr))
>>> + goto done;
>>> + }
>>> + peer = kmalloc(sizeof(*peer), GFP_ATOMIC | __GFP_ZERO);
>>
>> Please, move the allocation outside the atomic scope and use GFP_KERNEL.
>
> I don't think I can do that because homa_peer_find is invoked in
> softirq code, which is atomic, no? It's not disastrous if the
> allocation fails; the worst that happens is that an incoming packet
> must be discarded (it will be retried later).
IMHO a _find() helper that allocates things has a misleading name.
Usually the RX path does only lookups, and state allocation is done in
the control path, which avoids atomic issues.
>
>>> + if (!peer) {
>>> + peer = (struct homa_peer *)ERR_PTR(-ENOMEM);
>>> + goto done;
>>> + }
>>> + peer->addr = *addr;
>>> + dst = homa_peer_get_dst(peer, inet);
>>> + if (IS_ERR(dst)) {
>>> + kfree(peer);
>>> + peer = (struct homa_peer *)PTR_ERR(dst);
>>> + goto done;
>>> + }
>>> + peer->dst = dst;
>>> + hlist_add_head_rcu(&peer->peertab_links, &peertab->buckets[bucket]);
>>
>> At this point another CPU can lookup 'peer'. Since there are no memory
>> barriers it could observe a NULL peer->dst.
>
> Oops, good catch. I need to add 'smp_wmb()' just before the
> hlist_add_head_rcu line?
Barriers go in pairs, one here and one in the lookup. See the relevant
documentation for the gory details.
[...]
>> Note that freeing the peer at 'runtime' will require additional changes:
>> i.e. likely refcounting will be beeded or the at lookup time, after the
>> rcu unlock the code could hit HaF while accessing the looked-up peer.
>
> I understand about reference counting, but I couldn't parse the last
> 1.5 lines above. What is HaF?
A lot of typos on my side, sorry.
You likely need to introduce refcounting, or after the RCU unlock (end
of RCU grace period) the peer could be freed causing Use after Free.
>> [...]
>>> +static inline struct dst_entry *homa_get_dst(struct homa_peer *peer,
>>> + struct homa_sock *hsk)
>>> +{
>>> + if (unlikely(peer->dst->obsolete > 0))
>>
>> you need to additionally call dst->ops->check
>
> I wasn't aware of dst->ops->check, and I'm a little confused by it
> (usage in the kernel doesn't seem totally consistent):
> * If I call dst->ops->check(), do I also need to check obsolete
> (perhaps only call check if obsolete is true?)?
> * What is the 'cookie' argument to dst->ops->check? Can I just use 0 safely?
> * It looks like dst->ops->check now returns a struct dst_entry
> pointer. What is the meaning of this? ChatGPT suggests that it is a
> replacement dst_entry, if the original is no longer valid.
Luckily (or unfortunately, depending on your PoV), the tool you mentioned
simply does not work (yet?) for kernel code. You could (and should)
review with extreme care basically any output about such topic.
dst_check() is the reference code. You should use that helper.
/P
* Re: [PATCH net-next v8 02/15] net: homa: create homa_wire.h
2025-05-05 8:28 ` Paolo Abeni
2025-05-05 23:54 ` John Ousterhout
@ 2025-05-22 5:31 ` John Ousterhout
2025-05-22 7:36 ` Paolo Abeni
1 sibling, 1 reply; 46+ messages in thread
From: John Ousterhout @ 2025-05-22 5:31 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
One small follow-up:
On Mon, May 5, 2025 at 1:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
> [...]
> > +_Static_assert(sizeof(struct homa_data_hdr) <= HOMA_MAX_HEADER,
> > + "homa_data_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
> > +_Static_assert(sizeof(struct homa_data_hdr) >= HOMA_MIN_PKT_LENGTH,
> > + "homa_data_hdr too small: Homa doesn't currently have code to pad data packets");
> > +_Static_assert(((sizeof(struct homa_data_hdr) - sizeof(struct homa_seg_hdr)) &
> > + 0x3) == 0,
> > + " homa_data_hdr length not a multiple of 4 bytes (required for TCP/TSO compatibility");
>
> Please use BUILD_BUG_ON() in a .c file instead. Many other cases below.
BUILD_BUG_ON expands to code, so it only works in contexts where there
can be code. I see that you said to put this in a .c file, but these
assertions are closely related to the structure declaration, so they
really belong right next to the structure (there's no natural place to
put them in a .c file).
I see that "static_assert" is used in several places in the kernel,
and it will work in headers; is it OK if I switch from _Static_assert
to static_assert?
-John-
* Re: [PATCH net-next v8 02/15] net: homa: create homa_wire.h
2025-05-22 5:31 ` John Ousterhout
@ 2025-05-22 7:36 ` Paolo Abeni
0 siblings, 0 replies; 46+ messages in thread
From: Paolo Abeni @ 2025-05-22 7:36 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 5/22/25 7:31 AM, John Ousterhout wrote:
> One small follow-up:
>
> On Mon, May 5, 2025 at 1:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> [...]
>>> +_Static_assert(sizeof(struct homa_data_hdr) <= HOMA_MAX_HEADER,
>>> + "homa_data_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
>>> +_Static_assert(sizeof(struct homa_data_hdr) >= HOMA_MIN_PKT_LENGTH,
>>> + "homa_data_hdr too small: Homa doesn't currently have code to pad data packets");
>>> +_Static_assert(((sizeof(struct homa_data_hdr) - sizeof(struct homa_seg_hdr)) &
>>> + 0x3) == 0,
>>> + " homa_data_hdr length not a multiple of 4 bytes (required for TCP/TSO compatibility");
>>
>> Please use BUILD_BUG_ON() in a .c file instead. Many other cases below.
>
> BUILD_BUG_ON expands to code, so it only works in contexts where there
> can be code. I see that you said to put this in a .c file, but these
> assertions are closely related to the structure declaration, so they
> really belong right next to the structure (there's no natural place to
> put them in a .c file).
The customary practice is to add this kind of check in the relevant
_init function, see as a random example:
https://elixir.bootlin.com/linux/v6.14.7/source/net/ipv4/tcp_bbr.c#L1178
Possibly a good location could be the homa per netns init.
/P
* Re: [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c
2025-05-07 20:46 ` John Ousterhout
@ 2025-05-22 7:49 ` Paolo Abeni
0 siblings, 0 replies; 46+ messages in thread
From: Paolo Abeni @ 2025-05-22 7:49 UTC (permalink / raw)
To: John Ousterhout, Andrew Lunn; +Cc: netdev, edumazet, horms, kuba
On 5/7/25 10:46 PM, John Ousterhout wrote:
> get_link_ksettings is what I was thinking of.
Note that dst->dev and the actual egress link could be different. The
first could be a virtual/stacked device, or many kinds of redirection
could be in place.
> Some of the issues you
> mentioned, such as switch egress contention, are explicitly handled by
> Homa, so those needn't (and shouldn't) be factored into the link
> "speed". And don't pretty much all modern datacenter switches allow
> all of their links to operate at full speed?
If you mean negotiating the link speed, likely yes.
If you mean actually allowing the connected peer to send data at link
speed and forwarding it without packet loss across the switch, no
(e.g. if 2 or more ports are sending traffic at full speed towards a 3rd one :)
Side note: please avoid top posting.
Thanks,
Paolo
end of thread, other threads:[~2025-05-22 7:49 UTC | newest]
Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-02 23:37 [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 01/15] net: homa: define user-visible API for Homa John Ousterhout
2025-05-05 7:54 ` Paolo Abeni
2025-05-05 16:14 ` John Ousterhout
2025-05-05 16:48 ` Andrew Lunn
2025-05-05 17:52 ` John Ousterhout
2025-05-05 8:03 ` Paolo Abeni
2025-05-05 22:53 ` John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 02/15] net: homa: create homa_wire.h John Ousterhout
2025-05-05 8:28 ` Paolo Abeni
2025-05-05 23:54 ` John Ousterhout
2025-05-22 5:31 ` John Ousterhout
2025-05-22 7:36 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 03/15] net: homa: create shared Homa header files John Ousterhout
2025-05-05 10:20 ` Paolo Abeni
2025-05-06 17:45 ` John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 04/15] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
2025-05-05 9:51 ` Paolo Abeni
2025-05-06 23:28 ` John Ousterhout
2025-05-08 8:35 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 05/15] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
2025-05-05 11:06 ` Paolo Abeni
2025-05-07 16:11 ` John Ousterhout
2025-05-07 17:25 ` Andrew Lunn
2025-05-08 8:46 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 06/15] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
2025-05-05 16:46 ` Paolo Abeni
2025-05-07 18:30 ` John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 07/15] net: homa: create homa_interest.h and homa_interest John Ousterhout
2025-05-06 13:53 ` Paolo Abeni
2025-05-07 18:45 ` John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 08/15] net: homa: create homa_pacer.h and homa_pacer.c John Ousterhout
2025-05-06 14:05 ` Paolo Abeni
2025-05-07 18:55 ` John Ousterhout
2025-05-07 20:31 ` Andrew Lunn
2025-05-07 20:46 ` John Ousterhout
2025-05-22 7:49 ` Paolo Abeni
2025-05-02 23:37 ` [PATCH net-next v8 09/15] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 10/15] net: homa: create homa_outgoing.c John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 11/15] net: homa: create homa_utils.c John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 12/15] net: homa: create homa_incoming.c John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 13/15] net: homa: create homa_timer.c John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 14/15] net: homa: create homa_plumbing.c John Ousterhout
2025-05-02 23:37 ` [PATCH net-next v8 15/15] net: homa: create Makefile and Kconfig John Ousterhout
2025-05-06 5:04 ` [PATCH net-next v8 00/15] Begin upstreaming Homa transport protocol John Ousterhout
2025-05-06 12:01 ` Paolo Abeni