* [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol
@ 2025-01-15 18:59 John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 01/12] net: homa: define user-visible API for Homa John Ousterhout
` (12 more replies)
0 siblings, 13 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This patch series begins the process of upstreaming the Homa transport
protocol. Homa is an alternative to TCP for use in datacenter
environments. It provides 10-100x reductions in tail latency for short
messages relative to TCP. Its benefits are greatest for mixed workloads
containing both short and long messages running under high network loads.
Homa is not API-compatible with TCP: it is connectionless and message-
oriented (but still reliable and flow-controlled). Homa's new API not
only contributes to its performance gains, but it also eliminates the
massive amount of connection state required by TCP for highly connected
datacenter workloads.
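To give a feel for the API, here is a rough user-space sketch of issuing a
single RPC. It is illustrative only: the socket type and the convention of
passing struct homa_sendmsg_args directly in msg_control are assumptions
here; see the man pages on the Homa Wiki and include/uapi/linux/homa.h
(patch 1) for the authoritative definitions.

#include <stdint.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <linux/homa.h>

/* Illustrative only: send one request on an already-created Homa socket
 * (assumed to be socket(AF_INET6, SOCK_DGRAM, IPPROTO_HOMA)) and record
 * the id that Homa assigns to the new RPC. A receive buffer region must
 * already have been registered with SO_HOMA_RCVBUF before the response
 * can be received (see patch 4).
 */
static int example_send_request(int fd, struct sockaddr_in6 *server,
                                void *req, size_t req_len,
                                uint64_t cookie, uint64_t *rpc_id)
{
        struct homa_sendmsg_args args = {
                .id = 0,                        /* 0 means "new request" */
                .completion_cookie = cookie,    /* returned by recvmsg */
        };
        struct iovec iov = { .iov_base = req, .iov_len = req_len };
        struct msghdr msg = {
                .msg_name = server,
                .msg_namelen = sizeof(*server),
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = &args,
                .msg_controllen = sizeof(args),
        };

        if (sendmsg(fd, &msg, 0) < 0)
                return -1;
        *rpc_id = args.id;      /* Homa fills in the new RPC's id. */
        return 0;
}

The matching recvmsg call passes a struct homa_recvmsg_args in msg_control;
on return it identifies the RPC, echoes the completion cookie, and describes
where in the registered buffer region the message data was placed.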
For more details on Homa, please consult the Homa Wiki:
https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
The Wiki has pointers to two papers on Homa (one of which describes
this implementation) as well as man pages describing the application
API and other information.
There is also a GitHub repo for Homa:
https://github.com/PlatformLab/HomaModule
The GitHub repo contains a superset of this patch set, including:
* Additional source code that will eventually be upstreamed
* Extensive unit tests (which will also be upstreamed eventually)
* Application-level library functions (which need to go in glibc?)
* Man pages (which need to be upstreamed as well)
* Benchmarking and instrumentation code
For this patch series, Homa has been stripped down to the bare minimum
functionality capable of actually executing remote procedure calls (about
8000 lines of source code, compared to 15000 in the complete Homa). The
remaining code will be upstreamed in smaller batches once this patch
series has been accepted. Note: the code in this patch series is
functional but its performance is not very interesting (about the same
as TCP).
The patch series is arranged to introduce the major functional components
of Homa. Until the last patch has been applied, the code is inert (it
will not be compiled).
Note: this implementation of Homa supports both IPv4 and IPv6.
v6 changes:
- Make hrtimer variable in homa_timer_main static instead of stack-allocated
(avoids complaints when in debug mode).
- Remove unnecessary cast in homa_dst_refresh.
- Replace erroneous uses of GFP_KERNEL with GFP_ATOMIC.
- Check for "all ports in use" in homa_sock_init.
- Refactor API for homa_rpc_reap to incorporate "reap all" feature,
eliminate need for callers to specify exact amount of work to do
when in "reap a few" mode.
- Fix bug in homa_rpc_reap (wasn't resetting rx_frees for each iteration
of outer loop).
v5 changes:
- Change type of start in struct homa_rcvbuf_args from void* to __u64;
also add more __user annotations.
- Refactor homa_interest: replace awkward ready_rpc field with two
fields: rpc and rpc_ready. Added new functions homa_interest_get_rpc
and homa_interest_set_rpc to encapsulate/clarify access to
interest->rpc_ready.
- Eliminate use of LIST_POISON1 etc. in homa_interests (use list_del_init
instead of list_del).
- Remove homa_next_skb function, which is obsolete, unused, and incorrect
- Eliminate ipv4_to_ipv6 function (use ipv6_addr_set_v4mapped instead)
- Eliminate is_mapped_ipv4 function (use ipv6_addr_v4mapped instead)
- Use __u64 instead of uint64_t in homa.h
- Remove 'extern "C"' from homa.h
- Various fixes from patchwork checks (checkpatch.pl, etc.)
- A few improvements to comments
v4 changes:
- Remove sport argument for homa_find_server_rpc (unneeded). Also
remove client_port field from struct homa_ack
- Refactor ICMP packet handling (v6 was incorrect)
- Check for socket shutdown in homa_poll
- Fix potential for memory garbling in homa_symbol_for_type
- Remove unused ETHERNET_MAX_PAYLOAD declaration
- Rename classes in homa_wire.h so they all have "homa_" prefixes
- Various fixes from patchwork checks (checkpatch.pl, etc.)
- A few improvements to comments
v3 changes:
- Fix formatting in Kconfig
- Set ipv6_pinfo_offset in struct proto
- Check return value of inet6_register_protosw
- In homa_load cleanup, don't cleanup things that haven't been
initialized
- Add MODULE_ALIAS_NET_PF_PROTO_TYPE to auto-load module
- Check return value from kzalloc call in homa_sock_init
- Change SO_HOMA_SET_BUF to SO_HOMA_RCVBUF
- Change struct homa_set_buf_args to struct homa_rcvbuf_args
- Implement getsockopt for SO_HOMA_RCVBUF
- Return ENOPROTOOPT instead of EINVAL where appropriate in
setsockopt and getsockopt
- Fix crash in homa_pool_check_waiting if pool has no region yet
- Check for NULL msg->msg_name in homa_sendmsg
- Change addr->in6.sin6_family to addr->sa.sa_family in homa_sendmsg
for clarity
- For some errors in homa_recvmsg, return directly rather than "goto done"
- Return error from recvmsg if offsets of returned read buffers are bogus
- Added comments to clarify lock-unlock pairs for RPCs
- Renamed homa_try_bucket_lock to homa_try_rpc_lock
- Fix issues found by test robot and checkpatch.pl
- Ensure first argument to do_div is 64 bits
- Remove C++ style comments
- Removed some code that will only be relevant in future patches that
fill in missing Homa functionality
v2 changes:
- Remove sockaddr_in_union declaration from public API in homa.h
- Remove kernel wrapper functions (homa_send, etc.) from homa.h
- Fix many sparse warnings (still more work to do here) and other issues
uncovered by test robot
- Fix checkpatch.pl issues
- Remove residual code related to unit tests
- Remove references to tt_record from comments
- Make it safe to delete sockets during homa_socktab scans
- Use uintptr_t for portability to 32-bit platforms
- Use do_div instead of "/" for portability
- Remove homa->busy_usecs and homa->gro_busy_usecs (not needed in
this stripped down version of Homa)
- Eliminate usage of cpu_khz, use sched_clock instead of get_cycles
- Add missing checks of kmalloc return values
- Remove "inline" qualifier from functions in .c files
- Document that pad fields must be zero
- Use more precise type "uint32_t" rather than "int"
- Remove unneeded #include of linux/version.h
John Ousterhout (12):
net: homa: define user-visible API for Homa
net: homa: create homa_wire.h
net: homa: create shared Homa header files
net: homa: create homa_pool.h and homa_pool.c
net: homa: create homa_rpc.h and homa_rpc.c
net: homa: create homa_peer.h and homa_peer.c
net: homa: create homa_sock.h and homa_sock.c
net: homa: create homa_incoming.c
net: homa: create homa_outgoing.c
net: homa: create homa_timer.c
net: homa: create homa_plumbing.c and homa_utils.c
net: homa: create Makefile and Kconfig
MAINTAINERS | 7 +
include/uapi/linux/homa.h | 161 ++++++
net/Kconfig | 1 +
net/Makefile | 1 +
net/homa/Kconfig | 19 +
net/homa/Makefile | 14 +
net/homa/homa_impl.h | 711 ++++++++++++++++++++++++
net/homa/homa_incoming.c | 1076 +++++++++++++++++++++++++++++++++++++
net/homa/homa_outgoing.c | 855 +++++++++++++++++++++++++++++
net/homa/homa_peer.c | 366 +++++++++++++
net/homa/homa_peer.h | 233 ++++++++
net/homa/homa_plumbing.c | 1004 ++++++++++++++++++++++++++++++++++
net/homa/homa_pool.c | 453 ++++++++++++++++
net/homa/homa_pool.h | 154 ++++++
net/homa/homa_rpc.c | 494 +++++++++++++++++
net/homa/homa_rpc.h | 458 ++++++++++++++++
net/homa/homa_sock.c | 388 +++++++++++++
net/homa/homa_sock.h | 410 ++++++++++++++
net/homa/homa_stub.h | 81 +++
net/homa/homa_timer.c | 157 ++++++
net/homa/homa_utils.c | 166 ++++++
net/homa/homa_wire.h | 367 +++++++++++++
22 files changed, 7576 insertions(+)
create mode 100644 include/uapi/linux/homa.h
create mode 100644 net/homa/Kconfig
create mode 100644 net/homa/Makefile
create mode 100644 net/homa/homa_impl.h
create mode 100644 net/homa/homa_incoming.c
create mode 100644 net/homa/homa_outgoing.c
create mode 100644 net/homa/homa_peer.c
create mode 100644 net/homa/homa_peer.h
create mode 100644 net/homa/homa_plumbing.c
create mode 100644 net/homa/homa_pool.c
create mode 100644 net/homa/homa_pool.h
create mode 100644 net/homa/homa_rpc.c
create mode 100644 net/homa/homa_rpc.h
create mode 100644 net/homa/homa_sock.c
create mode 100644 net/homa/homa_sock.h
create mode 100644 net/homa/homa_stub.h
create mode 100644 net/homa/homa_timer.c
create mode 100644 net/homa/homa_utils.c
create mode 100644 net/homa/homa_wire.h
--
2.34.1
^ permalink raw reply [flat|nested] 68+ messages in thread
* [PATCH net-next v6 01/12] net: homa: define user-visible API for Homa
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 02/12] net: homa: create homa_wire.h John Ousterhout
` (11 subsequent siblings)
12 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
Note: for man pages, see the Homa Wiki at:
https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
MAINTAINERS | 7 ++
include/uapi/linux/homa.h | 161 ++++++++++++++++++++++++++++++++++++++
2 files changed, 168 insertions(+)
create mode 100644 include/uapi/linux/homa.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 30cbc3d44cd5..2709dfae9995 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10479,6 +10479,13 @@ F: lib/test_hmm*
F: mm/hmm*
F: tools/testing/selftests/mm/*hmm*
+HOMA TRANSPORT PROTOCOL
+M: John Ousterhout <ouster@cs.stanford.edu>
+S: Maintained
+W: https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
+F: include/uapi/linux/homa.h
+F: net/homa/
+
HONEYWELL HSC030PA PRESSURE SENSOR SERIES IIO DRIVER
M: Petre Rodan <petre.rodan@subdimension.ro>
L: linux-iio@vger.kernel.org
diff --git a/include/uapi/linux/homa.h b/include/uapi/linux/homa.h
new file mode 100644
index 000000000000..df873a88512f
--- /dev/null
+++ b/include/uapi/linux/homa.h
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines the kernel call interface for the Homa
+ * transport protocol.
+ */
+
+#ifndef _UAPI_LINUX_HOMA_H
+#define _UAPI_LINUX_HOMA_H
+
+#include <linux/types.h>
+#ifndef __KERNEL__
+#include <netinet/in.h>
+#include <sys/socket.h>
+#endif
+
+/* IANA-assigned Internet Protocol number for Homa. */
+#define IPPROTO_HOMA 146
+
+/**
+ * define HOMA_MAX_MESSAGE_LENGTH - Maximum bytes of payload in a Homa
+ * request or response message.
+ */
+#define HOMA_MAX_MESSAGE_LENGTH 1000000
+
+/**
+ * define HOMA_BPAGE_SIZE - Number of bytes in pages used for receive
+ * buffers. Must be power of two.
+ */
+#define HOMA_BPAGE_SIZE (1 << HOMA_BPAGE_SHIFT)
+#define HOMA_BPAGE_SHIFT 16
+
+/**
+ * define HOMA_MAX_BPAGES - The largest number of bpages that will be required
+ * to store an incoming message.
+ */
+#define HOMA_MAX_BPAGES ((HOMA_MAX_MESSAGE_LENGTH + HOMA_BPAGE_SIZE - 1) \
+ >> HOMA_BPAGE_SHIFT)
+
+/**
+ * define HOMA_MIN_DEFAULT_PORT - The 16 bit port space is divided into
+ * two nonoverlapping regions. Ports 1-32767 are reserved exclusively
+ * for well-defined server ports. The remaining ports are used for client
+ * ports; these are allocated automatically by Homa. Port 0 is reserved.
+ */
+#define HOMA_MIN_DEFAULT_PORT 0x8000
+
+/**
+ * struct homa_sendmsg_args - Provides information needed by Homa's
+ * sendmsg; passed to sendmsg using the msg_control field.
+ */
+struct homa_sendmsg_args {
+ /**
+ * @id: (in/out) An initial value of 0 means a new request is
+ * being sent; nonzero means the message is a reply to the given
+ * id. If the message is a request, then the value is modified to
+ * hold the id of the new RPC.
+ */
+ __u64 id;
+
+ /**
+ * @completion_cookie: (in) Used only for request messages; will be
+ * returned by recvmsg when the RPC completes. Typically used to
+ * locate app-specific info about the RPC.
+ */
+ __u64 completion_cookie;
+};
+
+#if !defined(__cplusplus)
+_Static_assert(sizeof(struct homa_sendmsg_args) >= 16,
+ "homa_sendmsg_args shrunk");
+_Static_assert(sizeof(struct homa_sendmsg_args) <= 16,
+ "homa_sendmsg_args grew");
+#endif
+
+/**
+ * struct homa_recvmsg_args - Provides information needed by Homa's
+ * recvmsg; passed to recvmsg using the msg_control field.
+ */
+struct homa_recvmsg_args {
+ /**
+ * @id: (in/out) Initially specifies the id of the desired RPC, or 0
+ * if any RPC is OK; returns the actual id received.
+ */
+ __u64 id;
+
+ /**
+ * @completion_cookie: (out) If the incoming message is a response,
+ * this will return the completion cookie specified when the
+ * request was sent. For requests this will always be zero.
+ */
+ __u64 completion_cookie;
+
+ /**
+ * @flags: (in) OR-ed combination of bits that control the operation.
+ * See below for values.
+ */
+ __u32 flags;
+
+ /**
+ * @num_bpages: (in/out) Number of valid entries in @bpage_offsets.
+ * Passes in bpages from previous messages that can now be
+ * recycled; returns bpages from the new message.
+ */
+ __u32 num_bpages;
+
+ /**
+ * @bpage_offsets: (in/out) Each entry is an offset into the buffer
+ * region for the socket pool. When returned from recvmsg, the
+ * offsets indicate where fragments of the new message are stored. All
+ * entries but the last refer to full buffer pages (HOMA_BPAGE_SIZE
+ * bytes) and are bpage-aligned. The last entry may refer to a bpage
+ * fragment and is not necessarily aligned. The application now owns
+ * these bpages and must eventually return them to Homa, using
+ * bpage_offsets in a future recvmsg invocation.
+ */
+ __u32 bpage_offsets[HOMA_MAX_BPAGES];
+};
+
+#if !defined(__cplusplus)
+_Static_assert(sizeof(struct homa_recvmsg_args) >= 88,
+ "homa_recvmsg_args shrunk");
+_Static_assert(sizeof(struct homa_recvmsg_args) <= 88,
+ "homa_recvmsg_args grew");
+#endif
+
+/* Flag bits for homa_recvmsg_args.flags (see man page for documentation):
+ */
+#define HOMA_RECVMSG_REQUEST 0x01
+#define HOMA_RECVMSG_RESPONSE 0x02
+#define HOMA_RECVMSG_NONBLOCKING 0x04
+#define HOMA_RECVMSG_VALID_FLAGS 0x07
+
+/** define SO_HOMA_RCVBUF - setsockopt option for specifying buffer region. */
+#define SO_HOMA_RCVBUF 10
+
+/** struct homa_rcvbuf_args - setsockopt argument for SO_HOMA_RCVBUF. */
+struct homa_rcvbuf_args {
+ /** @start: Address of first byte of buffer region in user space. */
+ __u64 start;
+
+ /** @length: Total number of bytes available at @start. */
+ size_t length;
+};
+
+/* Meanings of the bits in Homa's flag word, which can be set using
+ * "sysctl /net/homa/flags".
+ */
+
+/**
+ * define HOMA_FLAG_DONT_THROTTLE - disable the output throttling mechanism
+ * (always send all packets immediately).
+ */
+#define HOMA_FLAG_DONT_THROTTLE 2
+
+/* I/O control calls on Homa sockets. These are mapped into the
+ * SIOCPROTOPRIVATE range of 0x89e0 through 0x89ef.
+ */
+
+#define HOMAIOCFREEZE _IO(0x89, 0xef)
+
+#endif /* _UAPI_LINUX_HOMA_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH net-next v6 02/12] net: homa: create homa_wire.h
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 01/12] net: homa: define user-visible API for Homa John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 03/12] net: homa: create shared Homa header files John Ousterhout
` (10 subsequent siblings)
12 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This file defines the on-the-wire packet formats for Homa.
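One convention worth noting for reviewers: RPC ids are assigned by clients
and are always even (see struct homa in patch 3), and the low-order bit of
homa_common_hdr.sender_id tells the receiver whether the sending peer is the
server for that RPC; homa_local_id() below simply flips that bit. A minimal
illustration (not part of this patch; assumes homa_wire.h is included):

/* Illustrative only: classify a received packet and derive the local id
 * for its RPC, following the sender_id convention described above.
 */
static inline bool example_sender_is_server(const struct homa_common_hdr *h)
{
        /* The low-order bit of sender_id is set iff the sender is the
         * server for this RPC.
         */
        return (be64_to_cpu(h->sender_id) & 1) != 0;
}

static inline __u64 example_local_id_for(const struct homa_common_hdr *h)
{
        /* Equivalent to homa_local_id(): flip the client/server bit. */
        return be64_to_cpu(h->sender_id) ^ 1;
}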
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_wire.h | 367 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 367 insertions(+)
create mode 100644 net/homa/homa_wire.h
diff --git a/net/homa/homa_wire.h b/net/homa/homa_wire.h
new file mode 100644
index 000000000000..df0242c4f00b
--- /dev/null
+++ b/net/homa/homa_wire.h
@@ -0,0 +1,367 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines the on-the-wire format of Homa packets. */
+
+#ifndef _HOMA_WIRE_H
+#define _HOMA_WIRE_H
+
+#include <linux/skbuff.h>
+
+/* Defines the possible types of Homa packets.
+ *
+ * See the homa_*_hdr structs below for more information about each type.
+ */
+enum homa_packet_type {
+ DATA = 0x10,
+ RESEND = 0x12,
+ UNKNOWN = 0x13,
+ BUSY = 0x14,
+ NEED_ACK = 0x17,
+ ACK = 0x18,
+ BOGUS = 0x19, /* Used only in unit tests. */
+ /* If you add a new type here, you must also do the following:
+ * 1. Change BOGUS so it is the highest opcode
+ * 2. Add support for the new opcode in homa_print_packet,
+ * homa_print_packet_short, homa_symbol_for_type, and mock_skb_new.
+ * 3. Add the header length to header_lengths in homa_plumbing.c.
+ */
+};
+
+/** define HOMA_IPV6_HEADER_LENGTH - Size of IP header (V6). */
+#define HOMA_IPV6_HEADER_LENGTH 40
+
+/** define HOMA_IPV4_HEADER_LENGTH - Size of IP header (V4). */
+#define HOMA_IPV4_HEADER_LENGTH 20
+
+/**
+ * define HOMA_SKB_EXTRA - How many bytes of additional space to allow at the
+ * beginning of each sk_buff, before the IP header. This includes room for a
+ * VLAN header and also includes some extra space, "just to be safe" (not
+ * really sure if this is needed).
+ */
+#define HOMA_SKB_EXTRA 40
+
+/**
+ * define HOMA_ETH_OVERHEAD - Number of bytes per Ethernet packet for Ethernet
+ * header, CRC, preamble, and inter-packet gap.
+ */
+#define HOMA_ETH_OVERHEAD 42
+
+/**
+ * define HOMA_MIN_PKT_LENGTH - Every Homa packet must be padded to at least
+ * this length to meet Ethernet frame size limitations. This number includes
+ * Homa headers and data, but not IP or Ethernet headers.
+ */
+#define HOMA_MIN_PKT_LENGTH 26
+
+/**
+ * define HOMA_MAX_HEADER - Number of bytes in the largest Homa header.
+ */
+#define HOMA_MAX_HEADER 90
+
+/**
+ * struct homa_common_hdr - Wire format for the first bytes in every Homa
+ * packet. This must (mostly) match the format of a TCP header to enable
+ * Homa packets to actually be transmitted as TCP packets (and thereby
+ * take advantage of TSO and other features).
+ */
+struct homa_common_hdr {
+ /**
+ * @sport: Port on source machine from which packet was sent.
+ * Must be in the same position as in a TCP header.
+ */
+ __be16 sport;
+
+ /**
+ * @dport: Port on destination that is to receive packet. Must be
+ * in the same position as in a TCP header.
+ */
+ __be16 dport;
+
+ /**
+ * @sequence: corresponds to the sequence number field in TCP headers;
+ * used in DATA packets to hold the offset in the message of the first
+ * byte of data. This value will only be correct in the first segment
+ * of a GSO packet.
+ */
+ __be32 sequence;
+
+ /**
+ * @ack: Corresponds to the high-order bits of the acknowledgment
+ * field in TCP headers; not used by Homa.
+ */
+ char ack[3];
+
+ /**
+ * @type: Homa packet type (one of the values of the homa_packet_type
+ * enum). Corresponds to the low-order byte of the ack in TCP.
+ */
+ __u8 type;
+
+ /**
+ * @doff: High order 4 bits holds the number of 4-byte chunks in a
+ * homa_data_hdr (low-order bits unused). Used only for DATA packets;
+ * must be in the same position as the data offset in a TCP header.
+ * Used by TSO to determine where the replicated header portion ends.
+ */
+ __u8 doff;
+
+ /**
+ * @reserved1: Corresponds to flag bits in TCP; currently unused
+ * by Homa.
+ */
+ __u8 reserved1;
+
+ /**
+ * @window: Corresponds to the window field in TCP headers. Not used
+ * by HOMA.
+ */
+ __be16 window;
+
+ /**
+ * @checksum: Not used by Homa, but must occupy the same bytes as
+ * the checksum in a TCP header (TSO may modify this?).
+ */
+ __be16 checksum;
+
+ /**
+ * @reserved2: Corresponds to the urgent pointer in TCP; not used
+ * by Homa.
+ */
+ __be16 reserved2;
+
+ /**
+ * @sender_id: the identifier of this RPC as used on the sender (i.e.,
+ * if the low-order bit is set, then the sender is the server for
+ * this RPC).
+ */
+ __be64 sender_id;
+} __packed;
+
+/**
+ * struct homa_ack - Identifies an RPC that can be safely deleted by its
+ * server. After sending the response for an RPC, the server must retain its
+ * state for the RPC until it knows that the client has successfully
+ * received the entire response. An ack indicates this. Clients will
+ * piggyback acks on future data packets, but if a client doesn't send
+ * any data to the server, the server will eventually request an ack
+ * explicitly with a NEED_ACK packet, in which case the client will
+ * return an explicit ACK.
+ */
+struct homa_ack {
+ /**
+ * @client_id: The client's identifier for the RPC. 0 means this ack
+ * is invalid.
+ */
+ __be64 client_id;
+
+ /** @server_port: The server-side port for the RPC. */
+ __be16 server_port;
+} __packed;
+
+/* struct homa_data_hdr - Contains data for part or all of a Homa message.
+ * An incoming packet consists of a homa_data_hdr followed by message data.
+ * An outgoing packet can have this simple format as well, or it can be
+ * structured as a GSO packet with the following format:
+ *
+ * |-----------------------|
+ * | |
+ * | data_header |
+ * | |
+ * |-----------------------|
+ * | |
+ * | |
+ * | segment data |
+ * | |
+ * | |
+ * |-----------------------|
+ * | seg_header |
+ * |-----------------------|
+ * | |
+ * | |
+ * | segment data |
+ * | |
+ * | |
+ * |-----------------------|
+ * | seg_header |
+ * |-----------------------|
+ * | |
+ * | |
+ * | segment data |
+ * | |
+ * | |
+ * |-----------------------|
+ *
+ * TSO will not adjust @homa_common_hdr.sequence in the segments, so Homa
+ * sprinkles correct offsets (in homa_seg_hdrs) throughout the segment data;
+ * TSO/GSO will include a different homa_seg_hdr in each generated packet.
+ */
+
+struct homa_seg_hdr {
+ /**
+ * @offset: Offset within message of the first byte of data in
+ * this segment.
+ */
+ __be32 offset;
+} __packed;
+
+struct homa_data_hdr {
+ struct homa_common_hdr common;
+
+ /** @message_length: Total #bytes in the message. */
+ __be32 message_length;
+
+ __be32 reserved1;
+
+ /** @ack: If the @client_id field of this is nonzero, provides info
+ * about an RPC that the recipient can now safely free. Note: in
+ * TSO packets this will get duplicated in each of the segments;
+ * in order to avoid repeated attempts to ack the same RPC,
+ * homa_gro_receive will clear this field in all segments but the
+ * first.
+ */
+ struct homa_ack ack;
+
+ __be16 reserved2;
+
+ /**
+ * @retransmit: 1 means this packet was sent in response to a RESEND
+ * (it has already been sent previously).
+ */
+ __u8 retransmit;
+
+ char pad[3];
+
+ /** @seg: First of possibly many segments. */
+ struct homa_seg_hdr seg;
+} __packed;
+_Static_assert(sizeof(struct homa_data_hdr) <= HOMA_MAX_HEADER,
+ "homa_data_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+_Static_assert(sizeof(struct homa_data_hdr) >= HOMA_MIN_PKT_LENGTH,
+ "homa_data_hdr too small: Homa doesn't currently have code to pad data packets");
+_Static_assert(((sizeof(struct homa_data_hdr) - sizeof(struct homa_seg_hdr)) &
+ 0x3) == 0,
+ " homa_data_hdr length not a multiple of 4 bytes (required for TCP/TSO compatibility");
+
+/**
+ * homa_data_len() - Returns the total number of bytes in a DATA packet
+ * after the homa_data_hdr. Note: if the packet is a GSO packet, the result
+ * may include metadata as well as packet data.
+ * @skb: Incoming data packet
+ * Return: see above
+ */
+static inline int homa_data_len(struct sk_buff *skb)
+{
+ return skb->len - skb_transport_offset(skb) -
+ sizeof(struct homa_data_hdr);
+}
+
+/**
+ * struct homa_resend_hdr - Wire format for RESEND packets.
+ *
+ * A RESEND is sent by the receiver when it believes that message data may
+ * have been lost in transmission (or if it is concerned that the sender may
+ * have crashed). The receiver should resend the specified portion of the
+ * message, even if it already sent it previously.
+ */
+struct homa_resend_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+
+ /**
+ * @offset: Offset within the message of the first byte of data that
+ * should be retransmitted.
+ */
+ __be32 offset;
+
+ /**
+ * @length: Number of bytes of data to retransmit; this could specify
+ * a range longer than the total message size. Zero is a special case
+ * used by servers; in this case, there is no need to actually resend
+ * anything; the purpose of this packet is to trigger an UNKNOWN
+ * response if the client no longer cares about this RPC.
+ */
+ __be32 length;
+} __packed;
+_Static_assert(sizeof(struct homa_resend_hdr) <= HOMA_MAX_HEADER,
+ "homa_resend_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct homa_unknown_hdr - Wire format for UNKNOWN packets.
+ *
+ * An UNKNOWN packet is sent by either server or client when it receives a
+ * packet for an RPC that is unknown to it. When a client receives an
+ * UNKNOWN packet it will typically restart the RPC from the beginning;
+ * when a server receives an UNKNOWN packet it will typically discard its
+ * state for the RPC.
+ */
+struct homa_unknown_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+} __packed;
+_Static_assert(sizeof(struct homa_unknown_hdr) <= HOMA_MAX_HEADER,
+ "homa_unknown_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct homa_busy_hdr - Wire format for BUSY packets.
+ *
+ * These packets tell the recipient that the sender is still alive (even if
+ * it isn't sending data expected by the recipient).
+ */
+struct homa_busy_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+} __packed;
+_Static_assert(sizeof(struct homa_busy_hdr) <= HOMA_MAX_HEADER,
+ "homa_busy_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct homa_need_ack_hdr - Wire format for NEED_ACK packets.
+ *
+ * These packets ask the recipient (a client) to return an ACK message if
+ * the packet's RPC is no longer active.
+ */
+struct homa_need_ack_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+} __packed;
+_Static_assert(sizeof(struct homa_need_ack_hdr) <= HOMA_MAX_HEADER,
+ "homa_need_ack_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * struct homa_ack_hdr - Wire format for ACK packets.
+ *
+ * These packets are sent from a client to a server to indicate that
+ * a set of RPCs is no longer active on the client, so the server can
+ * free any state it may have for them.
+ */
+struct homa_ack_hdr {
+ /** @common: Fields common to all packet types. */
+ struct homa_common_hdr common;
+
+ /** @num_acks: Number of (leading) elements in @acks that are valid. */
+ __be16 num_acks;
+
+#define HOMA_MAX_ACKS_PER_PKT 5
+ /** @acks: Info about RPCs that are no longer active. */
+ struct homa_ack acks[HOMA_MAX_ACKS_PER_PKT];
+} __packed;
+_Static_assert(sizeof(struct homa_ack_hdr) <= HOMA_MAX_HEADER,
+ "homa_ack_hdr too large for HOMA_MAX_HEADER; must adjust HOMA_MAX_HEADER");
+
+/**
+ * homa_local_id(): given an RPC identifier from an input packet (which
+ * is network-encoded), return the decoded id we should use for that
+ * RPC on this machine.
+ * @sender_id: RPC id from an incoming packet, such as h->common.sender_id
+ * Return: see above
+ */
+static inline __u64 homa_local_id(__be64 sender_id)
+{
+ /* If the client bit was set on the sender side, it needs to be
+ * removed here, and conversely.
+ */
+ return be64_to_cpu(sender_id) ^ 1;
+}
+
+#endif /* _HOMA_WIRE_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH net-next v6 03/12] net: homa: create shared Homa header files
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 01/12] net: homa: define user-visible API for Homa John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 02/12] net: homa: create homa_wire.h John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-23 11:01 ` Paolo Abeni
2025-01-15 18:59 ` [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
` (9 subsequent siblings)
12 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
homa_impl.h defines "struct homa", which contains overall information
about the Homa transport, plus various odds and ends that are used
throughout the Homa implementation.
homa_stub.h is a temporary header file that provides stubs for
facilities that have been omitted from this first patch series. This file
will go away once Homa is fully upstreamed.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_impl.h | 711 +++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_stub.h | 81 +++++
2 files changed, 792 insertions(+)
create mode 100644 net/homa/homa_impl.h
create mode 100644 net/homa/homa_stub.h
diff --git a/net/homa/homa_impl.h b/net/homa/homa_impl.h
new file mode 100644
index 000000000000..4dfe2f5beb82
--- /dev/null
+++ b/net/homa/homa_impl.h
@@ -0,0 +1,711 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions that are shared across the files
+ * that implement Homa for Linux.
+ */
+
+#ifndef _HOMA_IMPL_H
+#define _HOMA_IMPL_H
+
+#include <linux/bug.h>
+
+#include <linux/audit.h>
+#include <linux/icmp.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/completion.h>
+#include <linux/proc_fs.h>
+#include <linux/sched/clock.h>
+#include <linux/sched/signal.h>
+#include <linux/skbuff.h>
+#include <linux/socket.h>
+#include <linux/vmalloc.h>
+#include <net/icmp.h>
+#include <net/ip.h>
+#include <net/protocol.h>
+#include <net/inet_common.h>
+#include <net/gro.h>
+#include <net/rps.h>
+
+#include <uapi/linux/homa.h>
+#include "homa_wire.h"
+
+/* Forward declarations. */
+struct homa_peer;
+struct homa_sock;
+struct homa;
+
+/* Declarations used in this file, so they can't be made at the end. */
+void homa_throttle_lock_slow(struct homa *homa);
+
+#define sizeof32(type) ((int)(sizeof(type)))
+
+/**
+ * union sockaddr_in_union - Holds either an IPv4 or IPv6 address (smaller
+ * and easier to use than sockaddr_storage).
+ */
+union sockaddr_in_union {
+ /** @sa: Used to access as a generic sockaddr. */
+ struct sockaddr sa;
+
+ /** @in4: Used to access as IPv4 socket. */
+ struct sockaddr_in in4;
+
+ /** @in6: Used to access as IPv6 socket. */
+ struct sockaddr_in6 in6;
+};
+
+/**
+ * struct homa_interest - Contains various information used while waiting
+ * for incoming messages (indicates what kinds of messages a particular
+ * thread is interested in receiving).
+ */
+struct homa_interest {
+ /**
+ * @thread: Thread that would like to receive a message. Will get
+ * woken up when a suitable message becomes available.
+ */
+ struct task_struct *thread;
+
+ /**
+ * @rpc_ready: Non-zero means an appropriate incoming message has
+ * been assigned to this interest, and @rpc and @locked are valid
+ * (they must be set before setting this variable).
+ */
+ atomic_t rpc_ready;
+
+ /**
+ * @rpc: If @rpc_ready is non-zero, points to an RPC with a ready
+ * incoming message that meets the requirements of this interest.
+ */
+ struct homa_rpc *rpc;
+
+ /**
+ * @locked: Nonzero means that @rpc is locked; only valid if
+ * @rpc_ready is non-zero.
+ */
+ int locked;
+
+ /**
+ * @core: Core on which @thread was executing when it registered
+ * its interest. Used for load balancing (see balance.txt).
+ */
+ int core;
+
+ /**
+ * @reg_rpc: RPC whose @interest field points here, or
+ * NULL if none.
+ */
+ struct homa_rpc *reg_rpc;
+
+ /**
+ * @request_links: For linking this object into
+ * &homa_sock.request_interests. The interest must not be linked
+ * on either this list or @response_links if @id is nonzero.
+ */
+ struct list_head request_links;
+
+ /**
+ * @response_links: For linking this object into
+ * &homa_sock.request_interests.
+ */
+ struct list_head response_links;
+};
+
+/**
+ * homa_interest_init() - Fill in default values for all of the fields
+ * of a struct homa_interest.
+ * @interest: Struct to initialize.
+ */
+static inline void homa_interest_init(struct homa_interest *interest)
+{
+ interest->thread = current;
+ atomic_set(&interest->rpc_ready, 0);
+ interest->rpc = NULL;
+ interest->locked = 0;
+ interest->core = raw_smp_processor_id();
+ interest->reg_rpc = NULL;
+ INIT_LIST_HEAD(&interest->request_links);
+ INIT_LIST_HEAD(&interest->response_links);
+}
+
+/**
+ * homa_interest_get_rpc() - Return the ready RPC stored in an interest,
+ * if there is one.
+ * @interest: Struct to check
+ * Return: the ready RPC, or NULL if none.
+ */
+static inline struct homa_rpc *homa_interest_get_rpc(struct homa_interest *interest)
+{
+ if (atomic_read(&interest->rpc_ready))
+ return interest->rpc;
+ return NULL;
+}
+
+/**
+ * homa_interest_set_rpc() - Hand off a ready RPC to an interest from a
+ * waiting receiver thread. @interest->rpc and @interest->locked are
+ * filled in before @rpc_ready is set, so they are valid once it is nonzero.
+ * @interest: Belongs to a thread that is waiting for an incoming message.
+ * @rpc: Ready rpc to assign to @interest.
+ * @locked: 1 means @rpc is locked, 0 means unlocked.
+ */
+static inline void homa_interest_set_rpc(struct homa_interest *interest,
+ struct homa_rpc *rpc,
+ int locked)
+{
+ interest->rpc = rpc;
+ interest->locked = locked;
+ atomic_set_release(&interest->rpc_ready, 1);
+}
+
+/**
+ * struct homa - Overall information about the Homa protocol implementation.
+ *
+ * There will typically only exist one of these at a time, except during
+ * unit tests.
+ */
+struct homa {
+ /**
+ * @next_outgoing_id: Id to use for next outgoing RPC request.
+ * This is always even: it's used only to generate client-side ids.
+ * Accessed without locks. Note: RPC ids are unique within a
+ * single client machine.
+ */
+ atomic64_t next_outgoing_id;
+
+ /**
+ * @link_idle_time: The time, measured by sched_clock, at which we
+ * estimate that all of the packets we have passed to Linux for
+ * transmission will have been transmitted. May be in the past.
+ * This estimate assumes that only Homa is transmitting data, so
+ * it could be a severe underestimate if there is competing traffic
+ * from, say, TCP. Access only with atomic ops.
+ */
+ atomic64_t link_idle_time __aligned(L1_CACHE_BYTES);
+
+ /**
+ * @pacer_mutex: Ensures that only one instance of homa_pacer_xmit
+ * runs at a time. Only used in "try" mode: never block on this.
+ */
+ spinlock_t pacer_mutex __aligned(L1_CACHE_BYTES);
+
+ /**
+ * @pacer_fifo_fraction: The fraction of time (in thousandths) when
+ * the pacer should transmit next from the oldest message, rather
+ * than the highest-priority message. Set externally via sysctl.
+ */
+ int pacer_fifo_fraction;
+
+ /**
+ * @pacer_fifo_count: When this becomes <= zero, it's time for the
+ * pacer to allow the oldest RPC to transmit.
+ */
+ int pacer_fifo_count;
+
+ /**
+ * @pacer_wake_time: time (in sched_clock units) when the pacer last
+ * woke up (if the pacer is running) or 0 if the pacer is sleeping.
+ */
+ __u64 pacer_wake_time;
+
+ /**
+ * @throttle_lock: Used to synchronize access to @throttled_rpcs. To
+ * insert or remove an RPC from throttled_rpcs, must first acquire
+ * the RPC's socket lock, then this lock.
+ */
+ spinlock_t throttle_lock;
+
+ /**
+ * @throttled_rpcs: Contains all homa_rpcs that have bytes ready
+ * for transmission, but which couldn't be sent without exceeding
+ * the queue limits for transmission. Manipulate only with "_rcu"
+ * functions.
+ */
+ struct list_head throttled_rpcs;
+
+ /**
+ * @throttle_add: The time (in sched_clock() units) when the most
+ * recent RPC was added to @throttled_rpcs.
+ */
+ __u64 throttle_add;
+
+ /**
+ * @throttle_min_bytes: If a packet has fewer bytes than this, then it
+ * bypasses the throttle mechanism and is transmitted immediately.
+ * We have this limit because for very small packets we can't keep
+ * up with the NIC (we're limited by CPU overheads); there's no
+ * need for throttling and going through the throttle mechanism
+ * adds overhead, which slows things down. At least, that's the
+ * hypothesis (needs to be verified experimentally!). Set externally
+ * via sysctl.
+ */
+ int throttle_min_bytes;
+
+ /**
+ * @prev_default_port: The most recent port number assigned from
+ * the range of default ports.
+ */
+ __u16 prev_default_port __aligned(L1_CACHE_BYTES);
+
+ /**
+ * @port_map: Information about all open sockets. Dynamically
+ * allocated; must be kfreed.
+ */
+ struct homa_socktab *port_map __aligned(L1_CACHE_BYTES);
+
+ /**
+ * @peers: Info about all the other hosts we have communicated with.
+ * Dynamically allocated; must be kfreed.
+ */
+ struct homa_peertab *peers;
+
+ /** @max_numa: Highest NUMA node id in use by any core. */
+ int max_numa;
+
+ /**
+ * @link_mbps: The raw bandwidth of the network uplink, in
+ * units of 1e06 bits per second. Set externally via sysctl.
+ */
+ int link_mbps;
+
+ /**
+ * @resend_ticks: When an RPC's @silent_ticks reaches this value,
+ * start sending RESEND requests.
+ */
+ int resend_ticks;
+
+ /**
+ * @resend_interval: minimum number of homa timer ticks between
+ * RESENDs for the same RPC.
+ */
+ int resend_interval;
+
+ /**
+ * @timeout_ticks: abort an RPC if its silent_ticks reaches this value.
+ */
+ int timeout_ticks;
+
+ /**
+ * @timeout_resends: Assume that a server is dead if it has not
+ * responded after this many RESENDs have been sent to it.
+ */
+ int timeout_resends;
+
+ /**
+ * @request_ack_ticks: How many timer ticks we'll wait for the
+ * client to ack an RPC before explicitly requesting an ack.
+ * Set externally via sysctl.
+ */
+ int request_ack_ticks;
+
+ /**
+ * @reap_limit: Maximum number of packet buffers to free in a
+ * single call to homa_rpc_reap.
+ */
+ int reap_limit;
+
+ /**
+ * @dead_buffs_limit: If the number of packet buffers in dead but
+ * not yet reaped RPCs is less than this number, then Homa reaps
+ * RPCs in a way that minimizes impact on performance but may permit
+ * dead RPCs to accumulate. If the number of dead packet buffers
+ * exceeds this value, then Homa switches to a more aggressive approach
+ * to reaping RPCs. Set externally via sysctl.
+ */
+ int dead_buffs_limit;
+
+ /**
+ * @max_dead_buffs: The largest aggregate number of packet buffers
+ * in dead (but not yet reaped) RPCs that has existed so far in a
+ * single socket. Readable via sysctl, and may be reset via sysctl
+ * to begin recalculating.
+ */
+ int max_dead_buffs;
+
+ /**
+ * @pacer_kthread: Kernel thread that transmits packets from
+ * throttled_rpcs in a way that limits queue buildup in the
+ * NIC.
+ */
+ struct task_struct *pacer_kthread;
+
+ /**
+ * @pacer_exit: true means that the pacer thread should exit as
+ * soon as possible.
+ */
+ bool pacer_exit;
+
+ /**
+ * @max_nic_queue_ns: Limits the NIC queue length: we won't queue
+ * up a packet for transmission if link_idle_time is this many
+ * nanoseconds in the future (or more). Set externally via sysctl.
+ */
+ int max_nic_queue_ns;
+
+ /**
+ * @ns_per_mbyte: the number of ns that it takes to transmit
+ * 10**6 bytes on our uplink. This is actually a slight overestimate
+ * of the value, to ensure that we don't underestimate NIC queue
+ * length and queue too many packets.
+ */
+ __u32 ns_per_mbyte;
+
+ /**
+ * @max_gso_size: Maximum number of bytes that will be included
+ * in a single output packet that Homa passes to Linux. Can be set
+ * externally via sysctl to lower the limit already enforced by Linux.
+ */
+ int max_gso_size;
+
+ /**
+ * @gso_force_software: A non-zero value will cause Homa to perform
+ * segmentation in software using GSO; zero means ask the NIC to
+ * perform TSO. Set externally via sysctl.
+ */
+ int gso_force_software;
+
+ /**
+ * @max_gro_skbs: Maximum number of socket buffers that can be
+ * aggregated by the GRO mechanism. Set externally via sysctl.
+ */
+ int max_gro_skbs;
+
+ /**
+ * @gro_policy: An OR'ed together collection of bits that determine
+ * how Homa packets should be steered for SoftIRQ handling. A value
+ * of zero will eliminate any Homa-specific behaviors, reverting
+ * to the Linux defaults. Set externally via sysctl (but modifying
+ * it is almost certainly a bad idea; see below).
+ */
+ int gro_policy;
+
+ /* Bits that can be specified for gro_policy. These were created for
+ * testing, in order to evaluate various possible policies; you almost
+ * certainly should not use any value other than HOMA_GRO_NORMAL.
+ * HOMA_GRO_SAME_CORE If isolated packets arrive (not part of a
+ * batch) use the GRO core for SoftIRQ also.
+ * HOMA_GRO_IDLE Use old mechanism for selecting an idle
+ * core for SoftIRQ (deprecated).
+ * HOMA_GRO_NEXT Always use the next core in circular
+ * order for SoftIRQ (deprecated).
+ * HOMA_GRO_GEN2 Use the new mechanism for selecting an
+ * idle core for SoftIRQ.
+ * HOMA_GRO_SHORT_BYPASS Pass all single-packet messages directly
+ * to homa_softirq during GRO (only if the
+ * core isn't overloaded).
+ * HOMA_GRO_GEN3 Use the "Gen3" mechanisms for load
+ * balancing.
+ */
+ #define HOMA_GRO_SAME_CORE 2
+ #define HOMA_GRO_IDLE 4
+ #define HOMA_GRO_NEXT 8
+ #define HOMA_GRO_GEN2 0x10
+ #define HOMA_GRO_SHORT_BYPASS 0x40
+ #define HOMA_GRO_GEN3 0x80
+ #define HOMA_GRO_NORMAL (HOMA_GRO_SAME_CORE | HOMA_GRO_GEN2 | \
+ HOMA_GRO_SHORT_BYPASS)
+
+ /**
+ * @gro_busy_usecs: if the gap between the completion of
+ * homa_gro_receive and the next call to homa_gro_receive on the same
+ * core is less than this, then GRO on that core is considered to be
+ * "busy", and optimizations such as HOMA_GRO_SHORT_BYPASS will not be
+ * done because they risk overloading the core. Set externally via
+ * sysctl.
+ */
+ int gro_busy_usecs;
+
+ /** @gro_busy_ns: Same as @gro_busy_usecs except in sched_clock() units. */
+ int gro_busy_ns;
+
+ /**
+ * @timer_ticks: number of times that homa_timer has been invoked
+ * (may wraparound, which is safe).
+ */
+ __u32 timer_ticks;
+
+ /**
+ * @flags: a collection of bits that can be set using sysctl
+ * to trigger various behaviors.
+ */
+ int flags;
+
+ /**
+ * @bpage_lease_usecs: how long a core can own a bpage (microseconds)
+ * before its ownership can be revoked to reclaim the page.
+ */
+ int bpage_lease_usecs;
+
+ /**
+ * @next_id: Set via sysctl; causes next_outgoing_id to be set to
+ * this value; always reads as zero. Typically used while debugging to
+ * ensure that different nodes use different ranges of ids.
+ */
+ int next_id;
+
+};
+
+/**
+ * struct homa_skb_info - Additional information needed by Homa for each
+ * outbound DATA packet. Space is allocated for this at the very end of the
+ * linear part of the skb.
+ */
+struct homa_skb_info {
+ /**
+ * @next_skb: used to link together all of the skb's for a Homa
+ * message (in order of offset).
+ */
+ struct sk_buff *next_skb;
+
+ /**
+ * @wire_bytes: total number of bytes of network bandwidth that
+ * will be consumed by this packet. This includes everything,
+ * including additional headers added by GSO, IP header, Ethernet
+ * header, CRC, preamble, and inter-packet gap.
+ */
+ int wire_bytes;
+
+ /**
+ * @data_bytes: total bytes of message data across all of the
+ * segments in this packet.
+ */
+ int data_bytes;
+
+ /** @seg_length: maximum number of data bytes in each GSO segment. */
+ int seg_length;
+
+ /**
+ * @offset: offset within the message of the first byte of data in
+ * this packet.
+ */
+ int offset;
+};
+
+/**
+ * homa_get_skb_info() - Return the address of Homa's private information
+ * for an sk_buff.
+ * @skb: Socket buffer whose info is needed.
+ * Return: address of Homa's private information for @skb.
+ */
+static inline struct homa_skb_info *homa_get_skb_info(struct sk_buff *skb)
+{
+ return (struct homa_skb_info *)(skb_end_pointer(skb)) - 1;
+}
+
+/**
+ * homa_set_doff() - Fills in the doff TCP header field for a Homa packet.
+ * @h: Packet header whose doff field is to be set.
+ * @size: Size of the "header", bytes (must be a multiple of 4). This
+ * information is used only for TSO; it's the number of bytes
+ * that should be replicated in each segment. The bytes after
+ * this will be distributed among segments.
+ */
+static inline void homa_set_doff(struct homa_data_hdr *h, int size)
+{
+ /* Drop the 2 low-order bits from size and set the 4 high-order
+ * bits of doff from what's left.
+ */
+ h->common.doff = size << 2;
+}
+
+/**
+ * homa_throttle_lock() - Acquire the throttle lock. If the lock
+ * isn't immediately available, record stats on the waiting time.
+ * @homa: Overall data about the Homa protocol implementation.
+ */
+static inline void homa_throttle_lock(struct homa *homa)
+ __acquires(&homa->throttle_lock)
+{
+ if (!spin_trylock_bh(&homa->throttle_lock))
+ homa_throttle_lock_slow(homa);
+}
+
+/**
+ * homa_throttle_unlock() - Release the throttle lock.
+ * @homa: Overall data about the Homa protocol implementation.
+ */
+static inline void homa_throttle_unlock(struct homa *homa)
+ __releases(&homa->throttle_lock)
+{
+ spin_unlock_bh(&homa->throttle_lock);
+}
+
+/** skb_is_ipv6() - Return true if the packet is encapsulated with IPv6,
+ * false otherwise (presumably it's IPv4).
+ */
+static inline bool skb_is_ipv6(const struct sk_buff *skb)
+{
+ return ipv6_hdr(skb)->version == 6;
+}
+
+/**
+ * ipv6_to_ipv4() - Given an IPv6 address produced by ipv4_to_ipv6, return
+ * the original IPv4 address (in network byte order).
+ * @ip6: IPv6 address; assumed to be a mapped IPv4 address.
+ * Return: IPv4 address stored in @ip6.
+ */
+static inline __be32 ipv6_to_ipv4(const struct in6_addr ip6)
+{
+ return ip6.in6_u.u6_addr32[3];
+}
+
+/**
+ * canonical_ipv6_addr() - Convert a socket address to the "standard"
+ * form used in Homa, which is always an IPv6 address; if the original address
+ * was IPv4, convert it to an IPv4-mapped IPv6 address.
+ * @addr: Address to canonicalize (if NULL, "any" is returned).
+ * Return: IPv6 address corresponding to @addr.
+ */
+static inline struct in6_addr canonical_ipv6_addr(const union sockaddr_in_union
+ *addr)
+{
+ struct in6_addr mapped;
+
+ if (addr) {
+ if (addr->sa.sa_family == AF_INET6)
+ return addr->in6.sin6_addr;
+ ipv6_addr_set_v4mapped(addr->in4.sin_addr.s_addr, &mapped);
+ return mapped;
+ }
+ return in6addr_any;
+}
+
+/**
+ * skb_canonical_ipv6_saddr() - Given a packet buffer, return its source
+ * address in the "standard" form used in Homa, which is always an IPv6
+ * address; if the original address was IPv4, convert it to an IPv4-mapped
+ * IPv6 address.
+ * @skb: The source address will be extracted from this packet buffer.
+ * Return: IPv6 address for @skb's source machine.
+ */
+static inline struct in6_addr skb_canonical_ipv6_saddr(struct sk_buff *skb)
+{
+ struct in6_addr mapped;
+
+ if (skb_is_ipv6(skb))
+ return ipv6_hdr(skb)->saddr;
+ ipv6_addr_set_v4mapped(ip_hdr(skb)->saddr, &mapped);
+ return mapped;
+}
+
+extern struct homa *global_homa;
+
+void homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
+ int port, int error);
+void homa_abort_sock_rpcs(struct homa_sock *hsk, int error);
+void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+ struct homa_rpc *rpc);
+void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb);
+void homa_add_to_throttled(struct homa_rpc *rpc);
+int homa_backlog_rcv(struct sock *sk, struct sk_buff *skb);
+int homa_bind(struct socket *sk, struct sockaddr *addr,
+ int addr_len);
+int homa_check_nic_queue(struct homa *homa, struct sk_buff *skb,
+ bool force);
+struct homa_interest *homa_choose_interest(struct homa *homa,
+ struct list_head *head,
+ int offset);
+void homa_close(struct sock *sock, long timeout);
+int homa_copy_to_user(struct homa_rpc *rpc);
+void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
+void homa_destroy(struct homa *homa);
+int homa_disconnect(struct sock *sk, int flags);
+void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa);
+int homa_err_handler_v4(struct sk_buff *skb, u32 info);
+int homa_err_handler_v6(struct sk_buff *skb,
+ struct inet6_skb_parm *opt, u8 type, u8 code,
+ int offset, __be32 info);
+int homa_fill_data_interleaved(struct homa_rpc *rpc,
+ struct sk_buff *skb, struct iov_iter *iter);
+struct homa_gap *homa_gap_new(struct list_head *next, int start, int end);
+void homa_gap_retry(struct homa_rpc *rpc);
+int homa_get_port(struct sock *sk, unsigned short snum);
+int homa_getsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, int __user *optlen);
+int homa_hash(struct sock *sk);
+enum hrtimer_restart homa_hrtimer(struct hrtimer *timer);
+int homa_init(struct homa *homa);
+void homa_incoming_sysctl_changed(struct homa *homa);
+int homa_ioctl(struct sock *sk, int cmd, int *karg);
+int homa_load(void);
+int homa_message_in_init(struct homa_rpc *rpc, int length);
+int homa_message_out_fill(struct homa_rpc *rpc,
+ struct iov_iter *iter, int xmit);
+void homa_message_out_init(struct homa_rpc *rpc, int length);
+void homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+ struct homa_rpc *rpc);
+struct sk_buff *homa_new_data_packet(struct homa_rpc *rpc,
+ struct iov_iter *iter, int offset,
+ int length, int max_seg_data);
+void homa_outgoing_sysctl_changed(struct homa *homa);
+int homa_pacer_main(void *transport);
+void homa_pacer_stop(struct homa *homa);
+void homa_pacer_xmit(struct homa *homa);
+__poll_t homa_poll(struct file *file, struct socket *sock,
+ struct poll_table_struct *wait);
+int homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+ int flags, int *addr_len);
+int homa_register_interests(struct homa_interest *interest,
+ struct homa_sock *hsk, int flags, __u64 id);
+void homa_remove_from_throttled(struct homa_rpc *rpc);
+void homa_resend_data(struct homa_rpc *rpc, int start, int end);
+void homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
+ struct homa_sock *hsk);
+void homa_rpc_abort(struct homa_rpc *crpc, int error);
+void homa_rpc_acked(struct homa_sock *hsk,
+ const struct in6_addr *saddr, struct homa_ack *ack);
+void homa_rpc_free(struct homa_rpc *rpc);
+void homa_rpc_handoff(struct homa_rpc *rpc);
+int homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
+int homa_setsockopt(struct sock *sk, int level, int optname,
+ sockptr_t optval, unsigned int optlen);
+int homa_shutdown(struct socket *sock, int how);
+int homa_softirq(struct sk_buff *skb);
+void homa_spin(int ns);
+char *homa_symbol_for_type(uint8_t type);
+void homa_timer(struct homa *homa);
+int homa_timer_main(void *transport);
+void homa_unhash(struct sock *sk);
+void homa_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc);
+void homa_unload(void);
+struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
+ __u64 id);
+int homa_xmit_control(enum homa_packet_type type, void *contents,
+ size_t length, struct homa_rpc *rpc);
+int __homa_xmit_control(void *contents, size_t length,
+ struct homa_peer *peer, struct homa_sock *hsk);
+void homa_xmit_data(struct homa_rpc *rpc, bool force);
+void __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc);
+void homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk);
+
+/**
+ * homa_check_pacer() - This method is invoked at various places in Homa to
+ * see if the pacer needs to transmit more packets and, if so, transmit
+ * them. It's needed because the pacer thread may get descheduled by
+ * Linux, resulting in output stalls.
+ * @homa: Overall data about the Homa protocol implementation. No locks
+ * should be held when this function is invoked.
+ * @softirq: Nonzero means this code is running at softirq (bh) level;
+ * zero means it's running in process context.
+ */
+static inline void homa_check_pacer(struct homa *homa, int softirq)
+{
+ if (list_empty(&homa->throttled_rpcs))
+ return;
+
+ /* The ">> 1" in the line below gives homa_pacer_main the first chance
+ * to queue new packets; if the NIC queue becomes more than half
+ * empty, then we will help out here.
+ */
+ if ((sched_clock() + (homa->max_nic_queue_ns >> 1)) <
+ atomic64_read(&homa->link_idle_time))
+ return;
+ homa_pacer_xmit(homa);
+}
+
+extern struct completion homa_pacer_kthread_done;
+#endif /* _HOMA_IMPL_H */
diff --git a/net/homa/homa_stub.h b/net/homa/homa_stub.h
new file mode 100644
index 000000000000..19a27ab340ac
--- /dev/null
+++ b/net/homa/homa_stub.h
@@ -0,0 +1,81 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains stripped-down replacements for facilities that
+ * have been temporarily removed from Homa during the Linux upstreaming
+ * process. By the time upstreaming is complete this file will
+ * have gone away.
+ */
+
+#ifndef _HOMA_STUB_H
+#define _HOMA_STUB_H
+
+#include "homa_impl.h"
+
+static inline int homa_skb_append_from_iter(struct homa *homa,
+ struct sk_buff *skb,
+ struct iov_iter *iter, int length)
+{
+ char *dst = skb_put(skb, length);
+
+ if (copy_from_iter(dst, length, iter) != length)
+ return -EFAULT;
+ return 0;
+}
+
+static inline int homa_skb_append_to_frag(struct homa *homa,
+ struct sk_buff *skb, void *buf,
+ int length)
+{
+ char *dst = skb_put(skb, length);
+
+ memcpy(dst, buf, length);
+ return 0;
+}
+
+static inline int homa_skb_append_from_skb(struct homa *homa,
+ struct sk_buff *dst_skb,
+ struct sk_buff *src_skb,
+ int offset, int length)
+{
+ return homa_skb_append_to_frag(homa, dst_skb,
+ skb_transport_header(src_skb) + offset, length);
+}
+
+static inline void homa_skb_free_tx(struct homa *homa, struct sk_buff *skb)
+{
+ kfree_skb(skb);
+}
+
+static inline void homa_skb_free_many_tx(struct homa *homa,
+ struct sk_buff **skbs, int count)
+{
+ int i;
+
+ for (i = 0; i < count; i++)
+ kfree_skb(skbs[i]);
+}
+
+static inline void homa_skb_get(struct sk_buff *skb, void *dest, int offset,
+ int length)
+{
+ memcpy(dest, skb_transport_header(skb) + offset, length);
+}
+
+static inline struct sk_buff *homa_skb_new_tx(int length)
+{
+ struct sk_buff *skb;
+
+ skb = alloc_skb(HOMA_SKB_EXTRA + HOMA_IPV6_HEADER_LENGTH +
+ sizeof(struct homa_skb_info) + length,
+ GFP_ATOMIC);
+ if (likely(skb)) {
+ skb_reserve(skb, HOMA_SKB_EXTRA + HOMA_IPV6_HEADER_LENGTH);
+ skb_reset_transport_header(skb);
+ }
+ return skb;
+}
+
+static inline void homa_skb_stash_pages(struct homa *homa, int length)
+{}
+
+#endif /* _HOMA_STUB_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (2 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 03/12] net: homa: create shared Homa header files John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-23 12:06 ` Paolo Abeni
2025-01-15 18:59 ` [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
` (8 subsequent siblings)
12 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
These files implement Homa's mechanism for managing application-level
buffer space for incoming messages. This mechanism is needed to allow
Homa to copy data out to user space in parallel with receiving packets;
it was discussed in a talk at NetDev 0x17.
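As a rough illustration of how an application drives this mechanism (the
setsockopt level, pool size, and helper names below are assumptions, not part
of this patch; the authoritative pieces are SO_HOMA_RCVBUF, struct
homa_rcvbuf_args, and struct homa_recvmsg_args from patch 1):

#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/homa.h>

/* Illustrative only: register a page-aligned buffer region with a Homa
 * socket. homa_pool_init() only requires page alignment and at least a
 * couple of 64 KB bpages; the size used here is arbitrary.
 */
static int example_setup_rcvbuf(int fd, size_t pool_bytes)
{
        struct homa_rcvbuf_args args;
        void *region = mmap(NULL, pool_bytes, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (region == MAP_FAILED)
                return -1;
        args.start = (uintptr_t)region;
        args.length = pool_bytes;
        /* Option level assumed to be IPPROTO_HOMA. */
        return setsockopt(fd, IPPROTO_HOMA, SO_HOMA_RCVBUF,
                          &args, sizeof(args));
}

/* Illustrative only: receive one message. On entry, num_bpages and
 * bpage_offsets in *recv_args return buffer pages from previously
 * received messages (zero on the first call); on success they describe
 * where the new message's data lives within the registered region.
 */
static ssize_t example_recv(int fd, struct homa_recvmsg_args *recv_args)
{
        struct msghdr msg = {
                .msg_control = recv_args,
                .msg_controllen = sizeof(*recv_args),
        };

        recv_args->id = 0;              /* accept any RPC */
        recv_args->flags = HOMA_RECVMSG_REQUEST | HOMA_RECVMSG_RESPONSE;
        return recvmsg(fd, &msg, 0);
}

The application owns the returned bpages until it recycles them through
num_bpages/bpage_offsets on a later recvmsg call; this is what allows Homa
to copy incoming data to user space in parallel with packet arrival, as
described above.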
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_pool.c | 453 +++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_pool.h | 154 +++++++++++++++
2 files changed, 607 insertions(+)
create mode 100644 net/homa/homa_pool.c
create mode 100644 net/homa/homa_pool.h
diff --git a/net/homa/homa_pool.c b/net/homa/homa_pool.c
new file mode 100644
index 000000000000..0b2ec83b6174
--- /dev/null
+++ b/net/homa/homa_pool.c
@@ -0,0 +1,453 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+#include "homa_impl.h"
+#include "homa_pool.h"
+
+/* This file contains functions that manage user-space buffer pools. */
+
+/* Pools must always have at least this many bpages (no particular
+ * reasoning behind this value).
+ */
+#define MIN_POOL_SIZE 2
+
+/* Used when determining how many bpages to consider for allocation. */
+#define MIN_EXTRA 4
+
+/**
+ * set_bpages_needed() - Set the bpages_needed field of @pool based
+ * on the length of the first RPC that's waiting for buffer space.
+ * The caller must own the lock for @pool->hsk.
+ * @pool: Pool to update.
+ */
+static void set_bpages_needed(struct homa_pool *pool)
+{
+ struct homa_rpc *rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+ struct homa_rpc, buf_links);
+ pool->bpages_needed = (rpc->msgin.length + HOMA_BPAGE_SIZE - 1)
+ >> HOMA_BPAGE_SHIFT;
+}
+
+/**
+ * homa_pool_init() - Initialize a homa_pool; any previous contents are
+ * destroyed.
+ * @hsk: Socket containing the pool to initialize.
+ * @region: First byte of the memory region for the pool, allocated
+ * by the application; must be page-aligned.
+ * @region_size: Total number of bytes available at @region.
+ * Return: Either zero (for success) or a negative errno for failure.
+ */
+int homa_pool_init(struct homa_sock *hsk, void __user *region,
+ __u64 region_size)
+{
+ struct homa_pool *pool = hsk->buffer_pool;
+ int i, result;
+
+ homa_pool_destroy(hsk->buffer_pool);
+
+ if (((uintptr_t)region) & ~PAGE_MASK)
+ return -EINVAL;
+ pool->hsk = hsk;
+ pool->region = (char __user *)region;
+ pool->num_bpages = region_size >> HOMA_BPAGE_SHIFT;
+ pool->descriptors = NULL;
+ pool->cores = NULL;
+ if (pool->num_bpages < MIN_POOL_SIZE) {
+ result = -EINVAL;
+ goto error;
+ }
+ pool->descriptors = kmalloc_array(pool->num_bpages,
+ sizeof(struct homa_bpage),
+ GFP_ATOMIC);
+ if (!pool->descriptors) {
+ result = -ENOMEM;
+ goto error;
+ }
+ for (i = 0; i < pool->num_bpages; i++) {
+ struct homa_bpage *bp = &pool->descriptors[i];
+
+ spin_lock_init(&bp->lock);
+ atomic_set(&bp->refs, 0);
+ bp->owner = -1;
+ bp->expiration = 0;
+ }
+ atomic_set(&pool->free_bpages, pool->num_bpages);
+ pool->bpages_needed = INT_MAX;
+
+ /* Allocate and initialize core-specific data. */
+ pool->cores = kmalloc_array(nr_cpu_ids, sizeof(struct homa_pool_core),
+ GFP_ATOMIC);
+ if (!pool->cores) {
+ result = -ENOMEM;
+ goto error;
+ }
+ pool->num_cores = nr_cpu_ids;
+ for (i = 0; i < pool->num_cores; i++) {
+ pool->cores[i].page_hint = 0;
+ pool->cores[i].allocated = 0;
+ pool->cores[i].next_candidate = 0;
+ }
+ pool->check_waiting_invoked = 0;
+
+ return 0;
+
+error:
+ kfree(pool->descriptors);
+ kfree(pool->cores);
+ pool->region = NULL;
+ return result;
+}
+
+/**
+ * homa_pool_destroy() - Destructor for homa_pool. After this method
+ * returns, the object should not be used unless it has been reinitialized.
+ * @pool: Pool to destroy.
+ */
+void homa_pool_destroy(struct homa_pool *pool)
+{
+ if (!pool->region)
+ return;
+ kfree(pool->descriptors);
+ kfree(pool->cores);
+ pool->region = NULL;
+}
+
+/**
+ * homa_pool_get_rcvbuf() - Return information needed to handle getsockopt
+ * for HOMA_SO_RCVBUF.
+ * @hsk: Socket on which getsockopt request was made.
+ * @args: Store info here.
+ */
+void homa_pool_get_rcvbuf(struct homa_sock *hsk,
+ struct homa_rcvbuf_args *args)
+{
+ homa_sock_lock(hsk, "homa_pool_get_rcvbuf");
+ args->start = (uintptr_t)hsk->buffer_pool->region;
+ args->length = hsk->buffer_pool->num_bpages << HOMA_BPAGE_SHIFT;
+ homa_sock_unlock(hsk);
+}
+
+/**
+ * homa_pool_get_pages() - Allocate one or more full pages from the pool.
+ * @pool: Pool from which to allocate pages
+ * @num_pages: Number of pages needed
+ * @pages: The indices of the allocated pages are stored here; caller
+ * must ensure this array is big enough. Reference counts have
+ * been set to 1 on all of these pages (or 2 if set_owner
+ * was specified).
+ * @set_owner: If nonzero, the current core is marked as owner of all
+ * of the allocated pages (and the expiration time is also
+ * set). Otherwise the pages are left unowned.
+ * Return: 0 for success, -1 if there wasn't enough free space in the pool.
+ */
+int homa_pool_get_pages(struct homa_pool *pool, int num_pages, __u32 *pages,
+ int set_owner)
+{
+ int core_num = raw_smp_processor_id();
+ struct homa_pool_core *core;
+ __u64 now = sched_clock();
+ int alloced = 0;
+ int limit = 0;
+
+ core = &pool->cores[core_num];
+ if (atomic_sub_return(num_pages, &pool->free_bpages) < 0) {
+ atomic_add(num_pages, &pool->free_bpages);
+ return -1;
+ }
+
+ /* Once we get to this point we know we will be able to find
+ * enough free pages; now we just have to find them.
+ */
+ while (alloced != num_pages) {
+ struct homa_bpage *bpage;
+ int cur, ref_count;
+
+ /* If we don't need to use all of the bpages in the pool,
+ * then try to use only the ones with low indexes. This
+ * will reduce the cache footprint for the pool by reusing
+ * a few bpages over and over. Specifically this code will
+ * not consider any candidate page whose index is >= limit.
+ * Limit is chosen to make sure there are a reasonable
+ * number of free pages in the range, so we won't have to
+ * check a huge number of pages.
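+ *
+ * Worked example (illustrative): with 100 bpages of which 80 are
+ * free, limit starts at 100 - 80 = 20, extra is 20 >> 2 = 5 (at
+ * least MIN_EXTRA), so limit becomes 25 and only bpage indexes
+ * 0-24 are considered on this pass.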
+ */
+ if (limit == 0) {
+ int extra;
+
+ limit = pool->num_bpages
+ - atomic_read(&pool->free_bpages);
+ extra = limit >> 2;
+ limit += (extra < MIN_EXTRA) ? MIN_EXTRA : extra;
+ if (limit > pool->num_bpages)
+ limit = pool->num_bpages;
+ }
+
+ cur = core->next_candidate;
+ core->next_candidate++;
+ if (cur >= limit) {
+ core->next_candidate = 0;
+
+ /* Must recompute the limit for each new loop through
+ * the bpage array: we may need to consider a larger
+ * range of pages because of concurrent allocations.
+ */
+ limit = 0;
+ continue;
+ }
+ bpage = &pool->descriptors[cur];
+
+ /* Figure out whether this candidate is free (or can be
+ * stolen). Do a quick check without locking the page, and
+ * if the page looks promising, then lock it and check again
+ * (must check again in case someone else snuck in and
+ * grabbed the page).
+ */
+ ref_count = atomic_read(&bpage->refs);
+ if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
+ bpage->expiration > now)))
+ continue;
+ if (!spin_trylock_bh(&bpage->lock))
+ continue;
+ ref_count = atomic_read(&bpage->refs);
+ if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
+ bpage->expiration > now))) {
+ spin_unlock_bh(&bpage->lock);
+ continue;
+ }
+ if (bpage->owner >= 0)
+ atomic_inc(&pool->free_bpages);
+ if (set_owner) {
+ atomic_set(&bpage->refs, 2);
+ bpage->owner = core_num;
+ bpage->expiration = now + 1000 *
+ pool->hsk->homa->bpage_lease_usecs;
+ } else {
+ atomic_set(&bpage->refs, 1);
+ bpage->owner = -1;
+ }
+ spin_unlock_bh(&bpage->lock);
+ pages[alloced] = cur;
+ alloced++;
+ }
+ return 0;
+}
+
+/**
+ * homa_pool_allocate() - Allocate buffer space for an RPC.
+ * @rpc: RPC that needs space allocated for its incoming message (space must
+ * not already have been allocated). The fields @rpc->msgin.num_bpages
+ * and @rpc->msgin.bpage_offsets are filled in. Must be locked by caller.
+ * Return: The return value is normally 0, which means either buffer space
+ * was allocated or the @rpc was queued on @hsk->waiting_for_bufs. If a fatal error
+ * occurred, such as no buffer pool present, then a negative errno is
+ * returned.
+ */
+int homa_pool_allocate(struct homa_rpc *rpc)
+{
+ struct homa_pool *pool = rpc->hsk->buffer_pool;
+ int full_pages, partial, i, core_id;
+ __u32 pages[HOMA_MAX_BPAGES];
+ struct homa_pool_core *core;
+ struct homa_bpage *bpage;
+ struct homa_rpc *other;
+
+ if (!pool->region)
+ return -ENOMEM;
+
+ /* First allocate any full bpages that are needed. */
+ full_pages = rpc->msgin.length >> HOMA_BPAGE_SHIFT;
+ if (unlikely(full_pages)) {
+ if (homa_pool_get_pages(pool, full_pages, pages, 0) != 0)
+ goto out_of_space;
+ for (i = 0; i < full_pages; i++)
+ rpc->msgin.bpage_offsets[i] = pages[i] <<
+ HOMA_BPAGE_SHIFT;
+ }
+ rpc->msgin.num_bpages = full_pages;
+
+ /* The last chunk may be less than a full bpage; for this we use
+ * the bpage that we own (and reuse it for multiple messages).
+ */
+ partial = rpc->msgin.length & (HOMA_BPAGE_SIZE - 1);
+ if (unlikely(partial == 0))
+ goto success;
+ core_id = raw_smp_processor_id();
+ core = &pool->cores[core_id];
+ bpage = &pool->descriptors[core->page_hint];
+ spin_lock_bh(&bpage->lock);
+ if (bpage->owner != core_id) {
+ spin_unlock_bh(&bpage->lock);
+ goto new_page;
+ }
+ if ((core->allocated + partial) > HOMA_BPAGE_SIZE) {
+ if (atomic_read(&bpage->refs) == 1) {
+ /* Bpage is totally free, so we can reuse it. */
+ core->allocated = 0;
+ } else {
+ bpage->owner = -1;
+
+ /* We know the reference count can't reach zero here
+ * because of the check above, so we won't have to
+ * increment pool->free_bpages.
+ */
+ atomic_dec(&bpage->refs);
+ spin_unlock_bh(&bpage->lock);
+ goto new_page;
+ }
+ }
+ bpage->expiration = sched_clock() +
+ 1000 * pool->hsk->homa->bpage_lease_usecs;
+ atomic_inc(&bpage->refs);
+ spin_unlock_bh(&bpage->lock);
+ goto allocate_partial;
+
+ /* Can't use the current page; get another one. */
+new_page:
+ if (homa_pool_get_pages(pool, 1, pages, 1) != 0) {
+ homa_pool_release_buffers(pool, rpc->msgin.num_bpages,
+ rpc->msgin.bpage_offsets);
+ rpc->msgin.num_bpages = 0;
+ goto out_of_space;
+ }
+ core->page_hint = pages[0];
+ core->allocated = 0;
+
+allocate_partial:
+ rpc->msgin.bpage_offsets[rpc->msgin.num_bpages] = core->allocated
+ + (core->page_hint << HOMA_BPAGE_SHIFT);
+ rpc->msgin.num_bpages++;
+ core->allocated += partial;
+
+success:
+ return 0;
+
+ /* We get here if there wasn't enough buffer space for this
+ * message; add the RPC to hsk->waiting_for_bufs.
+ */
+out_of_space:
+ homa_sock_lock(pool->hsk, "homa_pool_allocate");
+ list_for_each_entry(other, &pool->hsk->waiting_for_bufs, buf_links) {
+ if (other->msgin.length > rpc->msgin.length) {
+ list_add_tail(&rpc->buf_links, &other->buf_links);
+ goto queued;
+ }
+ }
+ list_add_tail_rcu(&rpc->buf_links, &pool->hsk->waiting_for_bufs);
+
+queued:
+ set_bpages_needed(pool);
+ homa_sock_unlock(pool->hsk);
+ return 0;
+}
+
+/**
+ * homa_pool_get_buffer() - Given an RPC, figure out where to store incoming
+ * message data.
+ * @rpc: RPC for which incoming message data is being processed; its
+ * msgin must be properly initialized and buffer space must have
+ * been allocated for the message.
+ * @offset: Offset within @rpc's incoming message.
+ * @available: Will be filled in with the number of bytes of space available
+ * at the returned address (could be zero if offset is
+ * (erroneously) past the end of the message).
+ * Return: The application's virtual address for buffer space corresponding
+ * to @offset in the incoming message for @rpc.
+ */
+void __user *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
+ int *available)
+{
+ int bpage_index, bpage_offset;
+
+ bpage_index = offset >> HOMA_BPAGE_SHIFT;
+ if (offset >= rpc->msgin.length) {
+ WARN_ONCE(true, "%s got offset %d >= message length %d\n",
+ __func__, offset, rpc->msgin.length);
+ *available = 0;
+ return NULL;
+ }
+ bpage_offset = offset & (HOMA_BPAGE_SIZE - 1);
+ *available = (bpage_index < (rpc->msgin.num_bpages - 1))
+ ? HOMA_BPAGE_SIZE - bpage_offset
+ : rpc->msgin.length - offset;
+ return rpc->hsk->buffer_pool->region +
+ rpc->msgin.bpage_offsets[bpage_index] + bpage_offset;
+}
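+
+/* Usage sketch (illustrative; the receive path that does this is not
+ * part of this patch): copy an incoming chunk at msg_offset into the
+ * buffer space allocated for the message.
+ *
+ *	int avail;
+ *	void __user *dst = homa_pool_get_buffer(rpc, msg_offset, &avail);
+ *
+ *	if (dst && avail >= chunk_length &&
+ *	    copy_to_user(dst, chunk, chunk_length))
+ *		err = -EFAULT;
+ */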
+
+/**
+ * homa_pool_release_buffers() - Release buffer space so that it can be
+ * reused.
+ * @pool: Pool that the buffer space belongs to. Doesn't need to
+ * be locked.
+ * @num_buffers: How many buffers to release.
+ * @buffers: Points to @num_buffers values, each of which is an offset
+ * from the start of the pool to the buffer to be released.
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_pool_release_buffers(struct homa_pool *pool, int num_buffers,
+ __u32 *buffers)
+{
+ int result = 0;
+ int i;
+
+ if (!pool->region)
+ return result;
+ for (i = 0; i < num_buffers; i++) {
+ __u32 bpage_index = buffers[i] >> HOMA_BPAGE_SHIFT;
+ struct homa_bpage *bpage = &pool->descriptors[bpage_index];
+
+ if (bpage_index < pool->num_bpages) {
+ if (atomic_dec_return(&bpage->refs) == 0)
+ atomic_inc(&pool->free_bpages);
+ } else {
+ result = -EINVAL;
+ }
+ }
+ return result;
+}
+
+/**
+ * homa_pool_check_waiting() - Checks to see if there are enough free
+ * bpages to wake up any RPCs that were blocked. Whenever
+ * homa_pool_release_buffers is invoked, this function must be invoked later,
+ * at a point when the caller holds no locks (homa_pool_release_buffers may
+ * be invoked with locks held, so it can't safely invoke this function).
+ * This is regrettably tricky, but I can't think of a better solution.
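+ * A typical calling pattern (an illustrative sketch, not a literal
+ * excerpt from this patch) looks like this:
+ *
+ *	homa_pool_release_buffers(pool, n, offsets);   (locks may be held)
+ *	...release all locks...
+ *	homa_pool_check_waiting(pool);
+ *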
+ * @pool: Information about the buffer pool.
+ */
+void homa_pool_check_waiting(struct homa_pool *pool)
+{
+ if (!pool->region)
+ return;
+ while (atomic_read(&pool->free_bpages) >= pool->bpages_needed) {
+ struct homa_rpc *rpc;
+
+ homa_sock_lock(pool->hsk, "buffer pool");
+ if (list_empty(&pool->hsk->waiting_for_bufs)) {
+ pool->bpages_needed = INT_MAX;
+ homa_sock_unlock(pool->hsk);
+ break;
+ }
+ rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
+ struct homa_rpc, buf_links);
+ if (!homa_rpc_try_lock(rpc, "homa_pool_check_waiting")) {
+ /* Can't just spin on the RPC lock because we're
+ * holding the socket lock (see sync.txt). Instead,
+ * release the socket lock and try the entire
+ * operation again.
+ */
+ homa_sock_unlock(pool->hsk);
+ continue;
+ }
+ list_del_init(&rpc->buf_links);
+ if (list_empty(&pool->hsk->waiting_for_bufs))
+ pool->bpages_needed = INT_MAX;
+ else
+ set_bpages_needed(pool);
+ homa_sock_unlock(pool->hsk);
+ homa_pool_allocate(rpc);
+ if (rpc->msgin.num_bpages > 0)
+ /* Allocation succeeded; "wake up" the RPC. */
+ rpc->msgin.resend_all = 1;
+ homa_rpc_unlock(rpc);
+ }
+}
diff --git a/net/homa/homa_pool.h b/net/homa/homa_pool.h
new file mode 100644
index 000000000000..6dbe7d77dd07
--- /dev/null
+++ b/net/homa/homa_pool.h
@@ -0,0 +1,154 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions used to manage user-space buffer pools.
+ */
+
+#ifndef _HOMA_POOL_H
+#define _HOMA_POOL_H
+
+#include "homa_rpc.h"
+
+/**
+ * struct homa_bpage - Contains information about a single page in
+ * a buffer pool.
+ */
+struct homa_bpage {
+ union {
+ /**
+ * @cache_line: Ensures that each homa_bpage object
+ * is exactly one cache line long.
+ */
+ char cache_line[L1_CACHE_BYTES];
+ struct {
+ /** @lock: to synchronize shared access. */
+ spinlock_t lock;
+
+ /**
+ * @refs: Counts number of distinct uses of this
+ * bpage (1 tick for each message that is using
+ * this page, plus an additional tick if the @owner
+ * field is set).
+ */
+ atomic_t refs;
+
+ /**
+ * @owner: kernel core that currently owns this page
+ * (< 0 if none).
+ */
+ int owner;
+
+ /**
+ * @expiration: time (in sched_clock() units) after
+ * which it's OK to steal this page from its current
+ * owner (if @refs is 1).
+ */
+ __u64 expiration;
+ };
+ };
+};
+
+/**
+ * struct homa_pool_core - Holds core-specific data for a homa_pool (a bpage
+ * out of which that core is allocating small chunks).
+ */
+struct homa_pool_core {
+ union {
+ /**
+ * @cache_line: Ensures that each object is exactly one
+ * cache line long.
+ */
+ char cache_line[L1_CACHE_BYTES];
+ struct {
+ /**
+ * @page_hint: Index of bpage in pool->descriptors,
+ * which may be owned by this core. If so, we'll use it
+ * for allocating partial pages.
+ */
+ int page_hint;
+
+ /**
+ * @allocated: if the page given by @page_hint is
+ * owned by this core, this variable gives the number of
+ * (initial) bytes that have already been allocated
+ * from the page.
+ */
+ int allocated;
+
+ /**
+ * @next_candidate: when searching for free bpages,
+ * check this index next.
+ */
+ int next_candidate;
+ };
+ };
+};
+
+/**
+ * struct homa_pool - Describes a pool of buffer space for incoming
+ * messages for a particular socket; managed by homa_pool.c. The pool is
+ * divided up into "bpages", which are a multiple of the hardware page size.
+ * A bpage may be owned by a particular core so that it can more efficiently
+ * allocate space for small messages.
+ */
+struct homa_pool {
+ /**
+ * @hsk: the socket that this pool belongs to.
+ */
+ struct homa_sock *hsk;
+
+ /**
+ * @region: beginning of the pool's region (in the app's virtual
+ * memory). Divided into bpages. NULL means the pool hasn't yet been
+ * initialized.
+ */
+ char __user *region;
+
+ /** @num_bpages: total number of bpages in the pool. */
+ int num_bpages;
+
+ /** @descriptors: kmalloced area containing one entry for each bpage. */
+ struct homa_bpage *descriptors;
+
+ /**
+ * @free_bpages: the number of pages still available for allocation
+ * by homa_pool_get_pages. This equals the number of pages with zero
+ * reference counts, minus the number of pages that have been claimed
+ * by homa_pool_get_pages but not yet allocated.
+ */
+ atomic_t free_bpages;
+
+ /**
+ * @bpages_needed: the number of free bpages required to satisfy the
+ * needs of the first RPC on @hsk->waiting_for_bufs, or INT_MAX if
+ * that queue is empty.
+ */
+ int bpages_needed;
+
+ /** @cores: core-specific info; dynamically allocated. */
+ struct homa_pool_core *cores;
+
+ /** @num_cores: number of elements in @cores. */
+ int num_cores;
+
+ /**
+ * @check_waiting_invoked: incremented during unit tests when
+ * homa_pool_check_waiting is invoked.
+ */
+ int check_waiting_invoked;
+};
+
+int homa_pool_allocate(struct homa_rpc *rpc);
+void homa_pool_check_waiting(struct homa_pool *pool);
+void homa_pool_destroy(struct homa_pool *pool);
+void __user *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
+ int *available);
+int homa_pool_get_pages(struct homa_pool *pool, int num_pages,
+ __u32 *pages, int set_owner);
+void homa_pool_get_rcvbuf(struct homa_sock *hsk,
+ struct homa_rcvbuf_args *args);
+int homa_pool_init(struct homa_sock *hsk, void __user *region,
+ __u64 region_size);
+int homa_pool_release_buffers(struct homa_pool *pool,
+ int num_buffers, __u32 *buffers);
+
+#endif /* _HOMA_POOL_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (3 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-23 14:29 ` Paolo Abeni
2025-01-15 18:59 ` [PATCH net-next v6 06/12] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
` (7 subsequent siblings)
12 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
These files provide basic functions for managing remote procedure calls,
which are the fundamental entities managed by Homa. Each RPC consists
of a request message from a client to a server, followed by a response
message returned from the server to the client.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_rpc.c | 494 ++++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_rpc.h | 458 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 952 insertions(+)
create mode 100644 net/homa/homa_rpc.c
create mode 100644 net/homa/homa_rpc.h
diff --git a/net/homa/homa_rpc.c b/net/homa/homa_rpc.c
new file mode 100644
index 000000000000..cc8450c984f8
--- /dev/null
+++ b/net/homa/homa_rpc.c
@@ -0,0 +1,494 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions for managing homa_rpc structs. */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+#include "homa_stub.h"
+
+/**
+ * homa_rpc_new_client() - Allocate and construct a client RPC (one that is used
+ * to issue an outgoing request). Doesn't send any packets. Invoked with no
+ * locks held.
+ * @hsk: Socket to which the RPC belongs.
+ * @dest: Address of host (ip and port) to which the RPC will be sent.
+ *
+ * Return: A pointer to the newly allocated object, or a negative
+ * errno if an error occurred. The RPC will be locked; the
+ * caller must eventually unlock it.
+ */
+struct homa_rpc *homa_rpc_new_client(struct homa_sock *hsk,
+ const union sockaddr_in_union *dest)
+ __acquires(&crpc->bucket->lock)
+{
+ struct in6_addr dest_addr_as_ipv6 = canonical_ipv6_addr(dest);
+ struct homa_rpc_bucket *bucket;
+ struct homa_rpc *crpc;
+ int err;
+
+ crpc = kmalloc(sizeof(*crpc), GFP_KERNEL);
+ if (unlikely(!crpc))
+ return ERR_PTR(-ENOMEM);
+
+ /* Initialize fields that don't require the socket lock. */
+ crpc->hsk = hsk;
+ crpc->id = atomic64_fetch_add(2, &hsk->homa->next_outgoing_id);
+ bucket = homa_client_rpc_bucket(hsk, crpc->id);
+ crpc->bucket = bucket;
+ crpc->state = RPC_OUTGOING;
+ atomic_set(&crpc->flags, 0);
+ crpc->peer = homa_peer_find(hsk->homa->peers, &dest_addr_as_ipv6,
+ &hsk->inet);
+ if (IS_ERR(crpc->peer)) {
+ err = PTR_ERR(crpc->peer);
+ goto error;
+ }
+ crpc->dport = ntohs(dest->in6.sin6_port);
+ crpc->completion_cookie = 0;
+ crpc->error = 0;
+ crpc->msgin.length = -1;
+ crpc->msgin.num_bpages = 0;
+ memset(&crpc->msgout, 0, sizeof(crpc->msgout));
+ crpc->msgout.length = -1;
+ INIT_LIST_HEAD(&crpc->ready_links);
+ INIT_LIST_HEAD(&crpc->buf_links);
+ INIT_LIST_HEAD(&crpc->dead_links);
+ crpc->interest = NULL;
+ INIT_LIST_HEAD(&crpc->throttled_links);
+ crpc->silent_ticks = 0;
+ crpc->resend_timer_ticks = hsk->homa->timer_ticks;
+ crpc->done_timer_ticks = 0;
+ crpc->magic = HOMA_RPC_MAGIC;
+ crpc->start_ns = sched_clock();
+
+ /* Initialize fields that require locking. This allows the most
+ * expensive work, such as copying in the message from user space,
+ * to be performed without holding locks. Also, can't hold spin
+ * locks while doing things that could block, such as memory allocation.
+ */
+ homa_bucket_lock(bucket, crpc->id, "homa_rpc_new_client");
+ homa_sock_lock(hsk, "homa_rpc_new_client");
+ if (hsk->shutdown) {
+ homa_sock_unlock(hsk);
+ homa_rpc_unlock(crpc);
+ err = -ESHUTDOWN;
+ goto error;
+ }
+ hlist_add_head(&crpc->hash_links, &bucket->rpcs);
+ list_add_tail_rcu(&crpc->active_links, &hsk->active_rpcs);
+ homa_sock_unlock(hsk);
+
+ return crpc;
+
+error:
+ kfree(crpc);
+ return ERR_PTR(err);
+}
+
+/**
+ * homa_rpc_new_server() - Allocate and construct a server RPC (one that is
+ * used to manage an incoming request). If appropriate, the RPC will also
+ * be handed off (we do it here, while we have the socket locked, to avoid
+ * acquiring the socket lock a second time later for the handoff).
+ * @hsk: Socket that owns this RPC.
+ * @source: IP address (network byte order) of the RPC's client.
+ * @h: Header for the first data packet received for this RPC; used
+ * to initialize the RPC.
+ * @created: Will be set to 1 if a new RPC was created and 0 if an
+ * existing RPC was found.
+ *
+ * Return: A pointer to a new RPC, which is locked, or a negative errno
+ * if an error occurred. If there is already an RPC corresponding
+ * to h, then it is returned instead of creating a new RPC.
+ */
+struct homa_rpc *homa_rpc_new_server(struct homa_sock *hsk,
+ const struct in6_addr *source,
+ struct homa_data_hdr *h, int *created)
+ __acquires(&srpc->bucket->lock)
+{
+ __u64 id = homa_local_id(h->common.sender_id);
+ struct homa_rpc_bucket *bucket;
+ struct homa_rpc *srpc = NULL;
+ int err;
+
+ /* Lock the bucket, and make sure no-one else has already created
+ * the desired RPC.
+ */
+ bucket = homa_server_rpc_bucket(hsk, id);
+ homa_bucket_lock(bucket, id, "homa_rpc_new_server");
+ hlist_for_each_entry_rcu(srpc, &bucket->rpcs, hash_links) {
+ if (srpc->id == id &&
+ srpc->dport == ntohs(h->common.sport) &&
+ ipv6_addr_equal(&srpc->peer->addr, source)) {
+ /* RPC already exists; just return it instead
+ * of creating a new RPC.
+ */
+ *created = 0;
+ return srpc;
+ }
+ }
+
+ /* Initialize fields that don't require the socket lock. */
+ srpc = kmalloc(sizeof(*srpc), GFP_ATOMIC);
+ if (!srpc) {
+ err = -ENOMEM;
+ goto error;
+ }
+ srpc->hsk = hsk;
+ srpc->bucket = bucket;
+ srpc->state = RPC_INCOMING;
+ atomic_set(&srpc->flags, 0);
+ srpc->peer = homa_peer_find(hsk->homa->peers, source, &hsk->inet);
+ if (IS_ERR(srpc->peer)) {
+ err = PTR_ERR(srpc->peer);
+ goto error;
+ }
+ srpc->dport = ntohs(h->common.sport);
+ srpc->id = id;
+ srpc->completion_cookie = 0;
+ srpc->error = 0;
+ srpc->msgin.length = -1;
+ srpc->msgin.num_bpages = 0;
+ memset(&srpc->msgout, 0, sizeof(srpc->msgout));
+ srpc->msgout.length = -1;
+ INIT_LIST_HEAD(&srpc->ready_links);
+ INIT_LIST_HEAD(&srpc->buf_links);
+ INIT_LIST_HEAD(&srpc->dead_links);
+ srpc->interest = NULL;
+ INIT_LIST_HEAD(&srpc->throttled_links);
+ srpc->silent_ticks = 0;
+ srpc->resend_timer_ticks = hsk->homa->timer_ticks;
+ srpc->done_timer_ticks = 0;
+ srpc->magic = HOMA_RPC_MAGIC;
+ srpc->start_ns = sched_clock();
+ err = homa_message_in_init(srpc, ntohl(h->message_length));
+ if (err != 0)
+ goto error;
+
+ /* Initialize fields that require socket to be locked. */
+ homa_sock_lock(hsk, "homa_rpc_new_server");
+ if (hsk->shutdown) {
+ homa_sock_unlock(hsk);
+ err = -ESHUTDOWN;
+ goto error;
+ }
+ hlist_add_head(&srpc->hash_links, &bucket->rpcs);
+ list_add_tail_rcu(&srpc->active_links, &hsk->active_rpcs);
+ if (ntohl(h->seg.offset) == 0 && srpc->msgin.num_bpages > 0) {
+ atomic_or(RPC_PKTS_READY, &srpc->flags);
+ homa_rpc_handoff(srpc);
+ }
+ homa_sock_unlock(hsk);
+ *created = 1;
+ return srpc;
+
+error:
+ homa_bucket_unlock(bucket, id);
+ kfree(srpc);
+ return ERR_PTR(err);
+}
+
+/**
+ * homa_rpc_acked() - This function is invoked when an ack is received
+ * for an RPC; if the RPC still exists, it is freed.
+ * @hsk: Socket on which the ack was received. May or may not correspond
+ * to the RPC, but can sometimes be used to avoid a socket lookup.
+ * @saddr: Source address from which the ack was received (the client
+ * node for the RPC).
+ * @ack: Information about an RPC from @saddr that may now be deleted
+ * safely.
+ */
+void homa_rpc_acked(struct homa_sock *hsk, const struct in6_addr *saddr,
+ struct homa_ack *ack)
+{
+ __u16 server_port = ntohs(ack->server_port);
+ __u64 id = homa_local_id(ack->client_id);
+ struct homa_sock *hsk2 = hsk;
+ struct homa_rpc *rpc;
+
+ if (hsk2->port != server_port) {
+ /* Without RCU, sockets other than hsk can be deleted
+ * out from under us.
+ */
+ rcu_read_lock();
+ hsk2 = homa_sock_find(hsk->homa->port_map, server_port);
+ if (!hsk2)
+ goto done;
+ }
+ rpc = homa_find_server_rpc(hsk2, saddr, id);
+ if (rpc) {
+ homa_rpc_free(rpc);
+ homa_rpc_unlock(rpc); /* Locked by homa_find_server_rpc. */
+ }
+
+done:
+ if (hsk->port != server_port)
+ rcu_read_unlock();
+}
+
+/**
+ * homa_rpc_free() - Destructor for homa_rpc; will arrange for all resources
+ * associated with the RPC to be released (eventually).
+ * @rpc: Structure to clean up, or NULL. Must be locked. Its socket must
+ * not be locked.
+ */
+void homa_rpc_free(struct homa_rpc *rpc)
+ __acquires(&rpc->hsk->lock)
+ __releases(&rpc->hsk->lock)
+{
+ /* The goal for this function is to make the RPC inaccessible,
+ * so that no other code will ever access it again. However, don't
+ * actually release resources; leave that to homa_rpc_reap, which
+ * runs later. There are two reasons for this. First, releasing
+ * resources may be expensive, so we don't want to keep the caller
+ * waiting; homa_rpc_reap will run in situations where there is time
+ * to spare. Second, there may be other code that currently has
+ * pointers to this RPC but temporarily released the lock (e.g. to
+ * copy data to/from user space). It isn't safe to clean up until
+ * that code has finished its work and released any pointers to the
+ * RPC (homa_rpc_reap will ensure that this has happened). So, this
+ * function should only make changes needed to make the RPC
+ * inaccessible.
+ */
+ if (!rpc || rpc->state == RPC_DEAD)
+ return;
+ rpc->state = RPC_DEAD;
+
+ /* Unlink from all lists, so no-one will ever find this RPC again. */
+ homa_sock_lock(rpc->hsk, "homa_rpc_free");
+ __hlist_del(&rpc->hash_links);
+ list_del_rcu(&rpc->active_links);
+ list_add_tail_rcu(&rpc->dead_links, &rpc->hsk->dead_rpcs);
+ __list_del_entry(&rpc->ready_links);
+ __list_del_entry(&rpc->buf_links);
+ if (rpc->interest) {
+ rpc->interest->reg_rpc = NULL;
+ wake_up_process(rpc->interest->thread);
+ rpc->interest = NULL;
+ }
+
+ if (rpc->msgin.length >= 0) {
+ rpc->hsk->dead_skbs += skb_queue_len(&rpc->msgin.packets);
+ while (1) {
+ struct homa_gap *gap = list_first_entry_or_null(&rpc->msgin.gaps,
+ struct homa_gap,
+ links);
+ if (!gap)
+ break;
+ list_del(&gap->links);
+ kfree(gap);
+ }
+ }
+ rpc->hsk->dead_skbs += rpc->msgout.num_skbs;
+ if (rpc->hsk->dead_skbs > rpc->hsk->homa->max_dead_buffs)
+ /* This update isn't thread-safe; it's just a
+ * statistic so it's OK if updates occasionally get
+ * missed.
+ */
+ rpc->hsk->homa->max_dead_buffs = rpc->hsk->dead_skbs;
+
+ homa_sock_unlock(rpc->hsk);
+ homa_remove_from_throttled(rpc);
+}
+
+/**
+ * homa_rpc_reap() - Invoked to release resources associated with dead
+ * RPCs for a given socket. For a large RPC, it can take a long time to
+ * free all of its packet buffers, so we try to perform this work
+ * off the critical path where it won't delay applications. Each call to
+ * this function normally does a small chunk of work (unless reap_all is
+ * true). See the file reap.txt for more information.
+ * @hsk: Homa socket that may contain dead RPCs. Must not be locked by the
+ * caller; this function will lock and release.
+ * @reap_all: False means do a small chunk of work; there may still be
+ * unreaped RPCs on return. True means reap all dead rpcs for
+ * hsk. Will busy-wait if reaping has been disabled for some RPCs.
+ *
+ * Return: A return value of 0 means that we ran out of work to do; calling
+ * again will do no work (there could be unreaped RPCs, but if so,
+ * reaping has been disabled for them). A value greater than
+ * zero means there is still more reaping work to be done.
+ */
+int homa_rpc_reap(struct homa_sock *hsk, bool reap_all)
+{
+#define BATCH_MAX 20
+ struct homa_rpc *rpcs[BATCH_MAX];
+ struct sk_buff *skbs[BATCH_MAX];
+ int num_skbs, num_rpcs;
+ struct homa_rpc *rpc;
+ int i, batch_size;
+ int skbs_to_reap;
+ int rx_frees;
+ int result = 0;
+
+ /* Each iteration through the following loop will reap
+ * up to BATCH_MAX skbs.
+ */
+ skbs_to_reap = hsk->homa->reap_limit;
+ while (skbs_to_reap > 0 && !list_empty(&hsk->dead_rpcs)) {
+ batch_size = BATCH_MAX;
+ if (!reap_all) {
+ if (batch_size > skbs_to_reap)
+ batch_size = skbs_to_reap;
+ skbs_to_reap -= batch_size;
+ }
+ num_skbs = 0;
+ num_rpcs = 0;
+ rx_frees = 0;
+
+ homa_sock_lock(hsk, "homa_rpc_reap");
+ if (atomic_read(&hsk->protect_count)) {
+ homa_sock_unlock(hsk);
+ if (reap_all)
+ continue;
+ return 0;
+ }
+
+ /* Collect buffers and freeable RPCs. */
+ list_for_each_entry_rcu(rpc, &hsk->dead_rpcs, dead_links) {
+ if ((atomic_read(&rpc->flags) & RPC_CANT_REAP) ||
+ atomic_read(&rpc->msgout.active_xmits) != 0)
+ continue;
+ rpc->magic = 0;
+
+ /* For Tx sk_buffs, collect them here but defer
+ * freeing until after releasing the socket lock.
+ */
+ if (rpc->msgout.length >= 0) {
+ while (rpc->msgout.packets) {
+ skbs[num_skbs] = rpc->msgout.packets;
+ rpc->msgout.packets = homa_get_skb_info(
+ rpc->msgout.packets)->next_skb;
+ num_skbs++;
+ rpc->msgout.num_skbs--;
+ if (num_skbs >= batch_size)
+ goto release;
+ }
+ }
+
+ /* In the normal case rx sk_buffs will already have been
+ * freed before we got here. Thus it's OK to free
+ * immediately in rare situations where there are
+ * buffers left.
+ */
+ if (rpc->msgin.length >= 0) {
+ while (1) {
+ struct sk_buff *skb;
+
+ skb = skb_dequeue(&rpc->msgin.packets);
+ if (!skb)
+ break;
+ kfree_skb(skb);
+ rx_frees++;
+ }
+ }
+
+ /* If we get here, it means all packets have been
+ * removed from the RPC.
+ */
+ rpcs[num_rpcs] = rpc;
+ num_rpcs++;
+ list_del_rcu(&rpc->dead_links);
+ if (num_rpcs >= batch_size)
+ goto release;
+ }
+
+ /* Free all of the collected resources; release the socket
+ * lock while doing this.
+ */
+release:
+ hsk->dead_skbs -= num_skbs + rx_frees;
+ result = !list_empty(&hsk->dead_rpcs) &&
+ (num_skbs + num_rpcs) != 0;
+ homa_sock_unlock(hsk);
+ homa_skb_free_many_tx(hsk->homa, skbs, num_skbs);
+ for (i = 0; i < num_rpcs; i++) {
+ rpc = rpcs[i];
+ /* Lock and unlock the RPC before freeing it. This
+ * is needed to deal with races where the code
+ * that invoked homa_rpc_free hasn't unlocked the
+ * RPC yet.
+ */
+ homa_rpc_lock(rpc, "homa_rpc_reap");
+ homa_rpc_unlock(rpc);
+
+ if (unlikely(rpc->msgin.num_bpages))
+ homa_pool_release_buffers(rpc->hsk->buffer_pool,
+ rpc->msgin.num_bpages,
+ rpc->msgin.bpage_offsets);
+ if (rpc->msgin.length >= 0) {
+ while (1) {
+ struct homa_gap *gap;
+
+ gap = list_first_entry_or_null(
+ &rpc->msgin.gaps,
+ struct homa_gap,
+ links);
+ if (!gap)
+ break;
+ list_del(&gap->links);
+ kfree(gap);
+ }
+ }
+ rpc->state = 0;
+ kfree(rpc);
+ }
+ if (!result && !reap_all)
+ break;
+ }
+ homa_pool_check_waiting(hsk->buffer_pool);
+ return result;
+}
+
+/**
+ * homa_find_client_rpc() - Locate client-side information about the RPC that
+ * a packet belongs to, if there is any. Thread-safe without socket lock.
+ * @hsk: Socket via which packet was received.
+ * @id: Unique identifier for the RPC.
+ *
+ * Return: A pointer to the homa_rpc for this id, or NULL if none.
+ * The RPC will be locked; the caller must eventually unlock it
+ * by invoking homa_rpc_unlock.
+ */
+struct homa_rpc *homa_find_client_rpc(struct homa_sock *hsk, __u64 id)
+ __acquires(&crpc->bucket->lock)
+{
+ struct homa_rpc_bucket *bucket = homa_client_rpc_bucket(hsk, id);
+ struct homa_rpc *crpc;
+
+ homa_bucket_lock(bucket, id, __func__);
+ hlist_for_each_entry_rcu(crpc, &bucket->rpcs, hash_links) {
+ if (crpc->id == id)
+ return crpc;
+ }
+ homa_bucket_unlock(bucket, id);
+ return NULL;
+}
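+
+/* Usage sketch (illustrative): typical lookup while processing an
+ * incoming packet on the client side.
+ *
+ *	rpc = homa_find_client_rpc(hsk, id);
+ *	if (rpc) {
+ *		... process the packet for rpc ...
+ *		homa_rpc_unlock(rpc);
+ *	}
+ */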
+
+/**
+ * homa_find_server_rpc() - Locate server-side information about the RPC that
+ * a packet belongs to, if there is any. Thread-safe without socket lock.
+ * @hsk: Socket via which packet was received.
+ * @saddr: Address from which the packet was sent.
+ * @id: Unique identifier for the RPC (must have server bit set).
+ *
+ * Return: A pointer to the homa_rpc matching the arguments, or NULL
+ * if none. The RPC will be locked; the caller must eventually
+ * unlock it by invoking homa_rpc_unlock.
+ */
+struct homa_rpc *homa_find_server_rpc(struct homa_sock *hsk,
+ const struct in6_addr *saddr, __u64 id)
+ __acquires(&srpc->bucket->lock)
+{
+ struct homa_rpc_bucket *bucket = homa_server_rpc_bucket(hsk, id);
+ struct homa_rpc *srpc;
+
+ homa_bucket_lock(bucket, id, __func__);
+ hlist_for_each_entry_rcu(srpc, &bucket->rpcs, hash_links) {
+ if (srpc->id == id && ipv6_addr_equal(&srpc->peer->addr, saddr))
+ return srpc;
+ }
+ homa_bucket_unlock(bucket, id);
+ return NULL;
+}
diff --git a/net/homa/homa_rpc.h b/net/homa/homa_rpc.h
new file mode 100644
index 000000000000..3491c8c6600d
--- /dev/null
+++ b/net/homa/homa_rpc.h
@@ -0,0 +1,458 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines homa_rpc and related structs. */
+
+#ifndef _HOMA_RPC_H
+#define _HOMA_RPC_H
+
+#include <linux/percpu-defs.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+
+#include "homa_sock.h"
+#include "homa_wire.h"
+
+/* Forward references. */
+struct homa_ack;
+
+/**
+ * struct homa_message_out - Describes a message (either request or response)
+ * for which this machine is the sender.
+ */
+struct homa_message_out {
+ /**
+ * @length: Total bytes in message (excluding headers). A value
+ * less than 0 means this structure is uninitialized and therefore
+ * not in use (all other fields will be zero in this case).
+ */
+ int length;
+
+ /** @num_skbs: Total number of buffers currently in @packets. */
+ int num_skbs;
+
+ /**
+ * @copied_from_user: Number of bytes of the message that have
+ * been copied from user space into skbs in @packets.
+ */
+ int copied_from_user;
+
+ /**
+ * @packets: Singly-linked list of all packets in message, linked
+ * via homa_get_skb_info(skb)->next_skb. The list is in order of
+ * offset in the message (offset 0 first); each sk_buff can
+ * potentially contain multiple data_segments, which will be split
+ * into separate packets by GSO.
+ * This list grows gradually as data is copied in from user space,
+ * so it may not be complete.
+ */
+ struct sk_buff *packets;
+
+ /**
+ * @next_xmit: Pointer to pointer to next packet to transmit (will
+ * either refer to @packets or to homa_get_skb_info(skb)->next_skb for some skb
+ * in @packets).
+ */
+ struct sk_buff **next_xmit;
+
+ /**
+ * @next_xmit_offset: All bytes in the message, up to but not
+ * including this one, have been transmitted.
+ */
+ int next_xmit_offset;
+
+ /**
+ * @active_xmits: The number of threads that are currently
+ * transmitting data packets for this RPC; can't reap the RPC
+ * until this count becomes zero.
+ */
+ atomic_t active_xmits;
+
+ /**
+ * @init_ns: Time in sched_clock units when this structure was
+ * initialized. Used to find the oldest outgoing message.
+ */
+ __u64 init_ns;
+};
+
+/**
+ * struct homa_gap - Represents a range of bytes within a message that have
+ * not yet been received.
+ */
+struct homa_gap {
+ /** @start: offset of first byte in this gap. */
+ int start;
+
+ /** @end: offset of byte just after last one in this gap. */
+ int end;
+
+ /**
+ * @time: time (in sched_clock units) when the gap was first detected.
+ * As of 7/2024 this isn't used for anything.
+ */
+ __u64 time;
+
+ /** @links: for linking into list in homa_message_in. */
+ struct list_head links;
+};
+
+/**
+ * struct homa_message_in - Holds the state of a message received by
+ * this machine; used for both requests and responses.
+ */
+struct homa_message_in {
+ /**
+ * @length: Payload size in bytes. A value less than 0 means this
+ * structure is uninitialized and therefore not in use.
+ */
+ int length;
+
+ /**
+ * @packets: DATA packets for this message that have been received but
+ * not yet copied to user space (no particular order).
+ */
+ struct sk_buff_head packets;
+
+ /**
+ * @recv_end: Offset of the byte just after the highest one that
+ * has been received so far.
+ */
+ int recv_end;
+
+ /**
+ * @gaps: List of homa_gaps describing all of the bytes with
+ * offsets less than @recv_end that have not yet been received.
+ */
+ struct list_head gaps;
+
+ /**
+ * @bytes_remaining: Amount of data for this message that has
+ * not yet been received.
+ */
+ int bytes_remaining;
+
+ /** @resend_all: if nonzero, set resend_all in the next grant packet. */
+ __u8 resend_all;
+
+ /**
+ * @num_bpages: The number of entries in @bpage_offsets used for this
+ * message (0 means buffers not allocated yet).
+ */
+ __u32 num_bpages;
+
+ /**
+ * @bpage_offsets: Describes buffer space allocated for this message.
+ * Each entry is an offset from the start of the buffer region.
+ * All but the last pointer refer to areas of size HOMA_BPAGE_SIZE.
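+ * For example (illustrative), a message slightly longer than two
+ * bpages uses three entries: the first two refer to full bpages and
+ * the third refers to the area holding the final partial chunk.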
+ */
+ __u32 bpage_offsets[HOMA_MAX_BPAGES];
+};
+
+/**
+ * struct homa_rpc - One of these structures exists for each active
+ * RPC. The same structure is used to manage both outgoing RPCs on
+ * clients and incoming RPCs on servers.
+ */
+struct homa_rpc {
+ /** @hsk: Socket that owns the RPC. */
+ struct homa_sock *hsk;
+
+ /**
+ * @bucket: Pointer to the bucket in hsk->client_rpc_buckets or
+ * hsk->server_rpc_buckets where this RPC is linked. Used primarily
+ * for locking the RPC (which is done by locking its bucket).
+ */
+ struct homa_rpc_bucket *bucket;
+
+ /**
+ * @state: The current state of this RPC:
+ *
+ * @RPC_OUTGOING: The RPC is waiting for @msgout to be transmitted
+ * to the peer.
+ * @RPC_INCOMING: The RPC is waiting for data @msgin to be received
+ * from the peer; at least one packet has already
+ * been received.
+ * @RPC_IN_SERVICE: Used only for server RPCs: the request message
+ * has been read from the socket, but the response
+ * message has not yet been presented to the kernel.
+ * @RPC_DEAD: RPC has been deleted and is waiting to be
+ * reaped. In some cases, information in the RPC
+ * structure may be accessed in this state.
+ *
+ * Client RPCs pass through states in the following order:
+ * RPC_OUTGOING, RPC_INCOMING, RPC_DEAD.
+ *
+ * Server RPCs pass through states in the following order:
+ * RPC_INCOMING, RPC_IN_SERVICE, RPC_OUTGOING, RPC_DEAD.
+ */
+ enum {
+ RPC_OUTGOING = 5,
+ RPC_INCOMING = 6,
+ RPC_IN_SERVICE = 8,
+ RPC_DEAD = 9
+ } state;
+
+ /**
+ * @flags: Additional state information: an OR'ed combination of
+ * various single-bit flags. See below for definitions. Must be
+ * manipulated with atomic operations because some of the manipulations
+ * occur without holding the RPC lock.
+ */
+ atomic_t flags;
+
+ /* Valid bits for @flags:
+ * RPC_PKTS_READY - The RPC has input packets ready to be
+ * copied to user space.
+ * RPC_COPYING_FROM_USER - Data is being copied from user space into
+ * the RPC; the RPC must not be reaped.
+ * RPC_COPYING_TO_USER - Data is being copied from this RPC to
+ * user space; the RPC must not be reaped.
+ * RPC_HANDING_OFF - This RPC is in the process of being
+ * handed off to a waiting thread; it must
+ * not be reaped.
+ * APP_NEEDS_LOCK - Means that code in the application thread
+ * needs the RPC lock (e.g. so it can start
+ * copying data to user space) so others
+ * (e.g. SoftIRQ processing) should relinquish
+ * the lock ASAP. Without this, SoftIRQ can
+ * lock out the application for a long time,
+ * preventing data copies to user space from
+ * starting (and those copies limit throughput
+ * at high network speeds).
+ */
+#define RPC_PKTS_READY 1
+#define RPC_COPYING_FROM_USER 2
+#define RPC_COPYING_TO_USER 4
+#define RPC_HANDING_OFF 8
+#define APP_NEEDS_LOCK 16
+
+#define RPC_CANT_REAP (RPC_COPYING_FROM_USER | RPC_COPYING_TO_USER \
+ | RPC_HANDING_OFF)
+
+ /**
+ * @peer: Information about the other machine (the server, if
+ * this is a client RPC, or the client, if this is a server RPC).
+ */
+ struct homa_peer *peer;
+
+ /** @dport: Port number on @peer that will handle packets. */
+ __u16 dport;
+
+ /**
+ * @id: Unique identifier for the RPC among all those issued
+ * from its port. The low-order bit indicates whether we are
+ * server (1) or client (0) for this RPC.
+ */
+ __u64 id;
+
+ /**
+ * @completion_cookie: Only used on clients. Contains identifying
+ * information about the RPC provided by the application; returned to
+ * the application with the RPC's result.
+ */
+ __u64 completion_cookie;
+
+ /**
+ * @error: Only used on clients. If nonzero, then the RPC has
+ * failed and the value is a negative errno that describes the
+ * problem.
+ */
+ int error;
+
+ /**
+ * @msgin: Information about the message we receive for this RPC
+ * (for server RPCs this is the request, for client RPCs this is the
+ * response).
+ */
+ struct homa_message_in msgin;
+
+ /**
+ * @msgout: Information about the message we send for this RPC
+ * (for client RPCs this is the request, for server RPCs this is the
+ * response).
+ */
+ struct homa_message_out msgout;
+
+ /**
+ * @hash_links: Used to link this object into a hash bucket for
+ * either @hsk->client_rpc_buckets (for a client RPC), or
+ * @hsk->server_rpc_buckets (for a server RPC).
+ */
+ struct hlist_node hash_links;
+
+ /**
+ * @ready_links: Used to link this object into
+ * @hsk->ready_requests or @hsk->ready_responses.
+ */
+ struct list_head ready_links;
+
+ /**
+ * @buf_links: Used to link this RPC into @hsk->waiting_for_bufs.
+ * If the RPC isn't on @hsk->waiting_for_bufs, this is an empty
+ * list pointing to itself.
+ */
+ struct list_head buf_links;
+
+ /**
+ * @active_links: For linking this object into @hsk->active_rpcs.
+ * The next field will be LIST_POISON1 if this RPC hasn't yet been
+ * linked into @hsk->active_rpcs. Access with RCU.
+ */
+ struct list_head active_links;
+
+ /** @dead_links: For linking this object into @hsk->dead_rpcs. */
+ struct list_head dead_links;
+
+ /**
+ * @interest: Describes a thread that wants to be notified when
+ * msgin is complete, or NULL if none.
+ */
+ struct homa_interest *interest;
+
+ /**
+ * @throttled_links: Used to link this RPC into homa->throttled_rpcs.
+ * If this RPC isn't in homa->throttled_rpcs, this is an empty
+ * list pointing to itself.
+ */
+ struct list_head throttled_links;
+
+ /**
+ * @silent_ticks: Number of times homa_timer has been invoked
+ * since the last time a packet indicating progress was received
+ * for this RPC, so we don't need to send a resend for a while.
+ */
+ int silent_ticks;
+
+ /**
+ * @resend_timer_ticks: Value of homa->timer_ticks the last time
+ * we sent a RESEND for this RPC.
+ */
+ __u32 resend_timer_ticks;
+
+ /**
+ * @done_timer_ticks: The value of homa->timer_ticks the first
+ * time we noticed that this (server) RPC is done (all response
+ * packets have been transmitted), so we're ready for an ack.
+ * Zero means we haven't reached that point yet.
+ */
+ __u32 done_timer_ticks;
+
+ /**
+ * @magic: when the RPC is alive, this holds a distinct value that
+ * is unlikely to occur naturally. The value is cleared when the
+ * RPC is reaped, so we can detect accidental use of an RPC after
+ * it has been reaped.
+ */
+#define HOMA_RPC_MAGIC 0xdeadbeef
+ int magic;
+
+ /**
+ * @start_ns: time (from sched_clock()) when this RPC was created.
+ * Used (sometimes) for testing.
+ */
+ u64 start_ns;
+};
+
+void homa_check_rpc(struct homa_rpc *rpc);
+struct homa_rpc
+ *homa_find_client_rpc(struct homa_sock *hsk, __u64 id);
+struct homa_rpc
+ *homa_find_server_rpc(struct homa_sock *hsk,
+ const struct in6_addr *saddr, __u64 id);
+void homa_rpc_acked(struct homa_sock *hsk, const struct in6_addr *saddr,
+ struct homa_ack *ack);
+void homa_rpc_free(struct homa_rpc *rpc);
+struct homa_rpc
+ *homa_rpc_new_client(struct homa_sock *hsk,
+ const union sockaddr_in_union *dest);
+struct homa_rpc
+ *homa_rpc_new_server(struct homa_sock *hsk,
+ const struct in6_addr *source,
+ struct homa_data_hdr *h, int *created);
+int homa_rpc_reap(struct homa_sock *hsk, bool reap_all);
+
+/**
+ * homa_rpc_lock() - Acquire the lock for an RPC.
+ * @rpc: RPC to lock. Note: this function is only safe under
+ * limited conditions (in most cases homa_bucket_lock should be
+ * used). The caller must ensure that the RPC cannot be reaped
+ * before the lock is acquired. It cannot do that by acquiring
+ * the socket lock, since that violates lock ordering constraints.
+ * One approach is to use homa_protect_rpcs. Don't use this function
+ * unless you are very sure what you are doing! See sync.txt for
+ * more info on locking.
+ * @locker: Static string identifying the locking code. Normally ignored,
+ * but used occasionally for diagnostics and debugging.
+ */
+static inline void homa_rpc_lock(struct homa_rpc *rpc, const char *locker)
+{
+ homa_bucket_lock(rpc->bucket, rpc->id, locker);
+}
+
+/**
+ * homa_rpc_try_lock() - Acquire the lock for an RPC if it is available.
+ * @rpc: RPC to lock.
+ * @locker: Static string identifying the locking code. Normally ignored,
+ * but used when debugging deadlocks.
+ * Return: Nonzero if lock was successfully acquired, zero if it is
+ * currently owned by someone else.
+ */
+static inline int homa_rpc_try_lock(struct homa_rpc *rpc, const char *locker)
+{
+ if (!spin_trylock_bh(&rpc->bucket->lock))
+ return 0;
+ return 1;
+}
+
+/**
+ * homa_rpc_unlock() - Release the lock for an RPC.
+ * @rpc: RPC to unlock.
+ */
+static inline void homa_rpc_unlock(struct homa_rpc *rpc)
+{
+ homa_bucket_unlock(rpc->bucket, rpc->id);
+}
+
+/**
+ * homa_protect_rpcs() - Ensures that no RPCs will be reaped for a given
+ * socket until homa_unprotect_rpcs is called. Typically used by functions
+ * that want to scan the active RPCs for a socket without holding the socket
+ * lock. Multiple calls to this function may be in effect at once.
+ * @hsk: Socket whose RPCs should be protected. Must not be locked
+ * by the caller; will be locked here.
+ *
+ * Return: 1 for success, 0 if the socket has been shutdown, in which
+ * case its RPCs cannot be protected.
+ */
+static inline int homa_protect_rpcs(struct homa_sock *hsk)
+{
+ int result;
+
+ homa_sock_lock(hsk, __func__);
+ result = !hsk->shutdown;
+ if (result)
+ atomic_inc(&hsk->protect_count);
+ homa_sock_unlock(hsk);
+ return result;
+}
+
+/**
+ * homa_unprotect_rpcs() - Cancel the effect of a previous call to
+ * homa_protect_rpcs(), so that RPCs can once again be reaped.
+ * @hsk: Socket whose RPCs should be unprotected.
+ */
+static inline void homa_unprotect_rpcs(struct homa_sock *hsk)
+{
+ atomic_dec(&hsk->protect_count);
+}
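+
+/* Usage sketch (illustrative; no such scan appears in this patch):
+ * iterate over a socket's active RPCs without holding the socket lock.
+ * The protection prevents the RPCs from being reaped, which is what
+ * makes homa_rpc_lock safe to use here.
+ *
+ *	if (homa_protect_rpcs(hsk)) {
+ *		rcu_read_lock();
+ *		list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+ *			homa_rpc_lock(rpc, "scan");
+ *			... examine rpc ...
+ *			homa_rpc_unlock(rpc);
+ *		}
+ *		rcu_read_unlock();
+ *		homa_unprotect_rpcs(hsk);
+ *	}
+ */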
+
+/**
+ * homa_is_client() - Returns true if we are the client for a particular RPC,
+ * false if we are the server.
+ * @id: Id of the RPC in question.
+ * Return: true if we are the client for RPC id, false otherwise
+ */
+static inline bool homa_is_client(__u64 id)
+{
+ return (id & 1) == 0;
+}
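+
+/* Example: homa_is_client(100) returns true (low-order bit clear), while
+ * homa_is_client(101) returns false: this machine is the server for
+ * that RPC.
+ */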
+
+#endif /* _HOMA_RPC_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH net-next v6 06/12] net: homa: create homa_peer.h and homa_peer.c
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (4 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-23 17:45 ` Paolo Abeni
2025-01-15 18:59 ` [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
` (6 subsequent siblings)
12 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
Homa needs to keep a small amount of information for each peer that
it has communicated with. These files define that state and provide
functions for storing and accessing it.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_peer.c | 366 +++++++++++++++++++++++++++++++++++++++++++
net/homa/homa_peer.h | 233 +++++++++++++++++++++++++++
2 files changed, 599 insertions(+)
create mode 100644 net/homa/homa_peer.c
create mode 100644 net/homa/homa_peer.h
diff --git a/net/homa/homa_peer.c b/net/homa/homa_peer.c
new file mode 100644
index 000000000000..04056936bf71
--- /dev/null
+++ b/net/homa/homa_peer.c
@@ -0,0 +1,366 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file provides functions related to homa_peer and homa_peertab
+ * objects.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+
+/**
+ * homa_peertab_init() - Constructor for homa_peertabs.
+ * @peertab: The object to initialize; previous contents are discarded.
+ *
+ * Return: 0 in the normal case, or a negative errno if there was a problem.
+ */
+int homa_peertab_init(struct homa_peertab *peertab)
+{
+ /* Note: when we return, the object must be initialized so it's
+ * safe to call homa_peertab_destroy, even if this function returns
+ * an error.
+ */
+ int i;
+
+ spin_lock_init(&peertab->write_lock);
+ INIT_LIST_HEAD(&peertab->dead_dsts);
+ peertab->buckets = vmalloc(HOMA_PEERTAB_BUCKETS *
+ sizeof(*peertab->buckets));
+ if (!peertab->buckets)
+ return -ENOMEM;
+ for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++)
+ INIT_HLIST_HEAD(&peertab->buckets[i]);
+ return 0;
+}
+
+/**
+ * homa_peertab_destroy() - Destructor for homa_peertabs. After this
+ * function returns, it is unsafe to use any results from previous calls
+ * to homa_peer_find, since all existing homa_peer objects will have been
+ * destroyed.
+ * @peertab: The table to destroy.
+ */
+void homa_peertab_destroy(struct homa_peertab *peertab)
+{
+ struct hlist_node *next;
+ struct homa_peer *peer;
+ int i;
+
+ if (!peertab->buckets)
+ return;
+
+ for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
+ hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
+ peertab_links) {
+ dst_release(peer->dst);
+ kfree(peer);
+ }
+ }
+ vfree(peertab->buckets);
+ homa_peertab_gc_dsts(peertab, ~0);
+}
+
+/**
+ * homa_peertab_get_peers() - Return information about all of the peers
+ * currently known
+ * @peertab: The table to search for peers.
+ * @num_peers: Modified to hold the number of peers returned.
+ * Return: kmalloced array holding pointers to all known peers. The
+ * caller must free this. If there is an error, or if there
+ * are no peers, NULL is returned.
+ */
+struct homa_peer **homa_peertab_get_peers(struct homa_peertab *peertab,
+ int *num_peers)
+{
+ struct homa_peer **result;
+ struct hlist_node *next;
+ struct homa_peer *peer;
+ int i, count;
+
+ *num_peers = 0;
+ if (!peertab->buckets)
+ return NULL;
+
+ /* Figure out how many peers there are. */
+ count = 0;
+ for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
+ hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
+ peertab_links)
+ count++;
+ }
+
+ if (count == 0)
+ return NULL;
+
+ result = kmalloc_array(count, sizeof(peer), GFP_KERNEL);
+ if (!result)
+ return NULL;
+ *num_peers = count;
+ count = 0;
+ for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
+ hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
+ peertab_links) {
+ result[count] = peer;
+ count++;
+ }
+ }
+ return result;
+}
+
+/**
+ * homa_peertab_gc_dsts() - Invoked to free unused dst_entries, if it is
+ * safe to do so.
+ * @peertab: The table in which to free entries.
+ * @now: Current time, in sched_clock() units; entries with expiration
+ * dates no later than this will be freed. Specify ~0 to
+ * free all entries.
+ */
+void homa_peertab_gc_dsts(struct homa_peertab *peertab, __u64 now)
+{
+ while (!list_empty(&peertab->dead_dsts)) {
+ struct homa_dead_dst *dead =
+ list_first_entry(&peertab->dead_dsts,
+ struct homa_dead_dst, dst_links);
+ if (dead->gc_time > now)
+ break;
+ dst_release(dead->dst);
+ list_del(&dead->dst_links);
+ kfree(dead);
+ }
+}
+
+/**
+ * homa_peer_find() - Returns the peer associated with a given host; creates
+ * a new homa_peer if one doesn't already exist.
+ * @peertab: Peer table in which to perform lookup.
+ * @addr: Address of the desired host: IPv4 addresses are represented
+ * as IPv4-mapped IPv6 addresses.
+ * @inet: Socket that will be used for sending packets.
+ *
+ * Return: The peer associated with @addr, or a negative errno if an
+ * error occurred. The caller can retain this pointer
+ * indefinitely: peer entries are never deleted except in
+ * homa_peertab_destroy.
+ */
+struct homa_peer *homa_peer_find(struct homa_peertab *peertab,
+ const struct in6_addr *addr,
+ struct inet_sock *inet)
+{
+ /* Note: this function uses RCU operators to ensure safety even
+ * if a concurrent call is adding a new entry.
+ */
+ struct homa_peer *peer;
+ struct dst_entry *dst;
+
+ __u32 bucket = hash_32((__force __u32)addr->in6_u.u6_addr32[0],
+ HOMA_PEERTAB_BUCKET_BITS);
+
+ bucket ^= hash_32((__force __u32)addr->in6_u.u6_addr32[1],
+ HOMA_PEERTAB_BUCKET_BITS);
+ bucket ^= hash_32((__force __u32)addr->in6_u.u6_addr32[2],
+ HOMA_PEERTAB_BUCKET_BITS);
+ bucket ^= hash_32((__force __u32)addr->in6_u.u6_addr32[3],
+ HOMA_PEERTAB_BUCKET_BITS);
+ hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
+ peertab_links) {
+ if (ipv6_addr_equal(&peer->addr, addr))
+ return peer;
+ }
+
+ /* No existing entry; create a new one.
+ *
+ * Note: after we acquire the lock, we have to check again to
+ * make sure the entry still doesn't exist (it might have been
+ * created by a concurrent invocation of this function).
+ */
+ spin_lock_bh(&peertab->write_lock);
+ hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
+ peertab_links) {
+ if (ipv6_addr_equal(&peer->addr, addr))
+ goto done;
+ }
+ peer = kmalloc(sizeof(*peer), GFP_ATOMIC);
+ if (!peer) {
+ peer = ERR_PTR(-ENOMEM);
+ goto done;
+ }
+ peer->addr = *addr;
+ dst = homa_peer_get_dst(peer, inet);
+ if (IS_ERR(dst)) {
+ kfree(peer);
+ peer = ERR_CAST(dst);
+ goto done;
+ }
+ peer->dst = dst;
+ INIT_LIST_HEAD(&peer->grantable_rpcs);
+ INIT_LIST_HEAD(&peer->grantable_links);
+ hlist_add_head_rcu(&peer->peertab_links, &peertab->buckets[bucket]);
+ peer->outstanding_resends = 0;
+ peer->most_recent_resend = 0;
+ peer->least_recent_rpc = NULL;
+ peer->least_recent_ticks = 0;
+ peer->current_ticks = -1;
+ peer->resend_rpc = NULL;
+ peer->num_acks = 0;
+ spin_lock_init(&peer->ack_lock);
+
+done:
+ spin_unlock_bh(&peertab->write_lock);
+ return peer;
+}
+
+/**
+ * homa_dst_refresh() - This method is called when the dst for a peer is
+ * obsolete; it releases that dst and creates a new one.
+ * @peertab: Table containing the peer.
+ * @peer: Peer whose dst is obsolete.
+ * @hsk: Socket that will be used to transmit data to the peer.
+ */
+void homa_dst_refresh(struct homa_peertab *peertab, struct homa_peer *peer,
+ struct homa_sock *hsk)
+{
+ struct homa_dead_dst *save_dead;
+ struct dst_entry *dst;
+ __u64 now;
+
+ /* Need to keep around the current entry for a while in case
+ * someone is using it. If we can't do that, then don't update
+ * the entry.
+ */
+ save_dead = kmalloc(sizeof(*save_dead), GFP_ATOMIC);
+ if (unlikely(!save_dead))
+ return;
+
+ dst = homa_peer_get_dst(peer, &hsk->inet);
+ if (IS_ERR(dst)) {
+ kfree(save_dead);
+ return;
+ }
+
+ spin_lock_bh(&peertab->write_lock);
+ now = sched_clock();
+ save_dead->dst = peer->dst;
+ save_dead->gc_time = now + 100000000; /* 100 ms */
+ list_add_tail(&save_dead->dst_links, &peertab->dead_dsts);
+ homa_peertab_gc_dsts(peertab, now);
+ peer->dst = dst;
+ spin_unlock_bh(&peertab->write_lock);
+}
+
+/**
+ * homa_peer_get_dst() - Find an appropriate dst structure (either IPv4
+ * or IPv6) for a peer.
+ * @peer: The peer for which a dst is needed. Note: this peer's flow
+ * struct will be overwritten.
+ * @inet: Socket that will be used for sending packets.
+ * Return: The dst structure (or an ERR_PTR).
+ */
+struct dst_entry *homa_peer_get_dst(struct homa_peer *peer,
+ struct inet_sock *inet)
+{
+ memset(&peer->flow, 0, sizeof(peer->flow));
+ if (inet->sk.sk_family == AF_INET) {
+ struct rtable *rt;
+
+ flowi4_init_output(&peer->flow.u.ip4, inet->sk.sk_bound_dev_if,
+ inet->sk.sk_mark, inet->tos,
+ RT_SCOPE_UNIVERSE, inet->sk.sk_protocol, 0,
+ peer->addr.in6_u.u6_addr32[3],
+ inet->inet_saddr, 0, 0, inet->sk.sk_uid);
+ security_sk_classify_flow(&inet->sk, &peer->flow.u.__fl_common);
+ rt = ip_route_output_flow(sock_net(&inet->sk),
+ &peer->flow.u.ip4, &inet->sk);
+ if (IS_ERR(rt))
+ return (struct dst_entry *)(PTR_ERR(rt));
+ return &rt->dst;
+ }
+ peer->flow.u.ip6.flowi6_oif = inet->sk.sk_bound_dev_if;
+ peer->flow.u.ip6.flowi6_iif = LOOPBACK_IFINDEX;
+ peer->flow.u.ip6.flowi6_mark = inet->sk.sk_mark;
+ peer->flow.u.ip6.flowi6_scope = RT_SCOPE_UNIVERSE;
+ peer->flow.u.ip6.flowi6_proto = inet->sk.sk_protocol;
+ peer->flow.u.ip6.flowi6_flags = 0;
+ peer->flow.u.ip6.flowi6_secid = 0;
+ peer->flow.u.ip6.flowi6_tun_key.tun_id = 0;
+ peer->flow.u.ip6.flowi6_uid = inet->sk.sk_uid;
+ peer->flow.u.ip6.daddr = peer->addr;
+ peer->flow.u.ip6.saddr = inet->pinet6->saddr;
+ peer->flow.u.ip6.fl6_dport = 0;
+ peer->flow.u.ip6.fl6_sport = 0;
+ peer->flow.u.ip6.mp_hash = 0;
+ peer->flow.u.ip6.__fl_common.flowic_tos = inet->tos;
+ peer->flow.u.ip6.flowlabel = ip6_make_flowinfo(inet->tos, 0);
+ security_sk_classify_flow(&inet->sk, &peer->flow.u.__fl_common);
+ return ip6_dst_lookup_flow(sock_net(&inet->sk), &inet->sk,
+ &peer->flow.u.ip6, NULL);
+}
+
+/**
+ * homa_peer_lock_slow() - This function implements the slow path for
+ * acquiring a peer's @ack_lock. It is invoked when the lock isn't
+ * immediately available. It waits for the lock, but also records statistics
+ * about the waiting time.
+ * @peer: Peer to lock.
+ */
+void homa_peer_lock_slow(struct homa_peer *peer)
+ __acquires(&peer->ack_lock)
+{
+ spin_lock_bh(&peer->ack_lock);
+}
+
+/**
+ * homa_peer_add_ack() - Add a given RPC to the list of unacked
+ * RPCs for its server. Once this method has been invoked, it's safe
+ * to delete the RPC, since it will eventually be acked to the server.
+ * @rpc: Client RPC that has now completed.
+ */
+void homa_peer_add_ack(struct homa_rpc *rpc)
+{
+ struct homa_peer *peer = rpc->peer;
+ struct homa_ack_hdr ack;
+
+ homa_peer_lock(peer);
+ if (peer->num_acks < HOMA_MAX_ACKS_PER_PKT) {
+ peer->acks[peer->num_acks].client_id = cpu_to_be64(rpc->id);
+ peer->acks[peer->num_acks].server_port = htons(rpc->dport);
+ peer->num_acks++;
+ homa_peer_unlock(peer);
+ return;
+ }
+
+ /* The peer's ack storage is full; send an ACK message to empty it. The
+ * RPC in the message header will also be considered ACKed.
+ */
+ memcpy(ack.acks, peer->acks, sizeof(peer->acks));
+ ack.num_acks = htons(peer->num_acks);
+ peer->num_acks = 0;
+ homa_peer_unlock(peer);
+ homa_xmit_control(ACK, &ack, sizeof(ack), rpc);
+}
+
+/**
+ * homa_peer_get_acks() - Copy acks out of a peer, and remove them from the
+ * peer.
+ * @peer: Peer to check for possible unacked RPCs.
+ * @count: Maximum number of acks to return.
+ * @dst: The acks are copied to this location.
+ *
+ * Return: The number of acks extracted from the peer (<= count).
+ */
+int homa_peer_get_acks(struct homa_peer *peer, int count, struct homa_ack *dst)
+{
+ /* Don't waste time acquiring the lock if there are no ids available. */
+ if (peer->num_acks == 0)
+ return 0;
+
+ homa_peer_lock(peer);
+
+ if (count > peer->num_acks)
+ count = peer->num_acks;
+ memcpy(dst, &peer->acks[peer->num_acks - count],
+ count * sizeof(peer->acks[0]));
+ peer->num_acks -= count;
+
+ homa_peer_unlock(peer);
+ return count;
+}
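
Together, homa_peer_add_ack and homa_peer_get_acks implement a small
batching scheme: completed client RPC ids accumulate in the peer until
either the buffer fills (forcing an explicit ACK) or they can be
piggybacked on another outgoing packet. A hedged sketch of the consumer
side, in the style of homa_need_ack_pkt later in this series
(common-header setup elided):

	struct homa_ack_hdr ack;

	ack.num_acks = htons(homa_peer_get_acks(peer, HOMA_MAX_ACKS_PER_PKT,
						ack.acks));
	__homa_xmit_control(&ack, sizeof(ack), peer, hsk);
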
diff --git a/net/homa/homa_peer.h b/net/homa/homa_peer.h
new file mode 100644
index 000000000000..556aeda49656
--- /dev/null
+++ b/net/homa/homa_peer.h
@@ -0,0 +1,233 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file contains definitions related to managing peers (homa_peer
+ * and homa_peertab).
+ */
+
+#ifndef _HOMA_PEER_H
+#define _HOMA_PEER_H
+
+#include "homa_wire.h"
+#include "homa_sock.h"
+
+struct homa_rpc;
+
+/**
+ * struct homa_dead_dst - Used to retain dst_entries that are no longer
+ * needed until it is safe to delete them (RCU may not be safe for
+ * these: a dst's reference count could be incremented after it has
+ * been queued for deletion).
+ */
+struct homa_dead_dst {
+ /** @dst: Entry that is no longer used by a struct homa_peer. */
+ struct dst_entry *dst;
+
+ /**
+ * @gc_time: Time (in units of sched_clock()) when it is safe
+ * to free @dst.
+ */
+ __u64 gc_time;
+
+ /** @dst_links: Used to link together entries in peertab->dead_dsts. */
+ struct list_head dst_links;
+};
+
+/**
+ * define HOMA_PEERTAB_BUCKET_BITS - Number of bits in the bucket index for a
+ * homa_peertab. Should be large enough to hold an entry for every server
+ * in a datacenter without long hash chains.
+ */
+#define HOMA_PEERTAB_BUCKET_BITS 16
+
+/** define HOMA_PEERTAB_BUCKETS - Number of buckets in a homa_peertab. */
+#define HOMA_PEERTAB_BUCKETS BIT(HOMA_PEERTAB_BUCKET_BITS)
+
+/**
+ * struct homa_peertab - A hash table that maps from IPv6 addresses
+ * to homa_peer objects. IPv4 entries are encapsulated as IPv6 addresses.
+ * Entries are gradually added to this table, but they are never removed
+ * except when the entire table is deleted. We can't safely delete because
+ * results returned by homa_peer_find may be retained indefinitely.
+ *
+ * This table is managed exclusively by homa_peer.c, using RCU to
+ * permit efficient lookups.
+ */
+struct homa_peertab {
+ /**
+ * @write_lock: Synchronizes addition of new entries; not needed
+ * for lookups (RCU is used instead).
+ */
+ spinlock_t write_lock;
+
+ /**
+ * @dead_dsts: List of dst_entries that are waiting to be deleted.
+ * Hold @write_lock when manipulating.
+ */
+ struct list_head dead_dsts;
+
+ /**
+ * @buckets: Pointer to heads of chains of homa_peers for each bucket.
+ * Malloc-ed, and must eventually be freed. NULL means this structure
+ * has not been initialized.
+ */
+ struct hlist_head *buckets;
+};
+
+/**
+ * struct homa_peer - One of these objects exists for each machine that we
+ * have communicated with (either as client or server).
+ */
+struct homa_peer {
+ /**
+ * @addr: IPv6 address for the machine (IPv4 addresses are stored
+ * as IPv4-mapped IPv6 addresses).
+ */
+ struct in6_addr addr;
+
+ /** @flow: Addressing info needed to send packets. */
+ struct flowi flow;
+
+ /**
+ * @dst: Used to route packets to this peer; we own a reference
+ * to this, which we must eventually release.
+ */
+ struct dst_entry *dst;
+
+ /**
+ * @grantable_rpcs: Contains all homa_rpcs (both requests and
+ * responses) involving this peer whose msgins require grants (or
+ * required them in the past) and have not been fully received. The
+ * list is sorted in priority order (head has fewest bytes_remaining).
+ * Locked with homa->grantable_lock.
+ */
+ struct list_head grantable_rpcs;
+
+ /**
+ * @grantable_links: Used to link this peer into homa->grantable_peers.
+ * If this peer is not linked into homa->grantable_peers, this is an
+ * empty list pointing to itself.
+ */
+ struct list_head grantable_links;
+
+ /**
+ * @peertab_links: Links this object into a bucket of its
+ * homa_peertab.
+ */
+ struct hlist_node peertab_links;
+
+ /**
+ * @outstanding_resends: the number of resend requests we have
+ * sent to this server (spaced @homa.resend_interval apart) since
+ * we last received a packet from this peer.
+ */
+ int outstanding_resends;
+
+ /**
+ * @most_recent_resend: @homa->timer_ticks when the most recent
+ * resend was sent to this peer.
+ */
+ int most_recent_resend;
+
+ /**
+ * @least_recent_rpc: of all the RPCs for this peer scanned at
+ * @current_ticks, this is the RPC whose @resend_timer_ticks
+ * is farthest in the past.
+ */
+ struct homa_rpc *least_recent_rpc;
+
+ /**
+ * @least_recent_ticks: the @resend_timer_ticks value for
+ * @least_recent_rpc.
+ */
+ __u32 least_recent_ticks;
+
+ /**
+ * @current_ticks: the value of @homa->timer_ticks the last time
+ * that @least_recent_rpc and @least_recent_ticks were computed.
+ * Used to detect the start of a new homa_timer pass.
+ */
+ __u32 current_ticks;
+
+ /**
+ * @resend_rpc: the value of @least_recent_rpc computed in the
+ * previous homa_timer pass. This RPC will be issued a RESEND
+ * in the current pass, if it still needs one.
+ */
+ struct homa_rpc *resend_rpc;
+
+ /**
+ * @num_acks: the number of (initial) entries in @acks that
+ * currently hold valid information.
+ */
+ int num_acks;
+
+ /**
+ * @acks: info about client RPCs whose results have been completely
+ * received.
+ */
+ struct homa_ack acks[HOMA_MAX_ACKS_PER_PKT];
+
+ /**
+ * @ack_lock: used to synchronize access to @num_acks and @acks.
+ */
+ spinlock_t ack_lock;
+};
+
+void homa_dst_refresh(struct homa_peertab *peertab,
+ struct homa_peer *peer, struct homa_sock *hsk);
+void homa_peertab_destroy(struct homa_peertab *peertab);
+struct homa_peer **
+ homa_peertab_get_peers(struct homa_peertab *peertab,
+ int *num_peers);
+int homa_peertab_init(struct homa_peertab *peertab);
+void homa_peer_add_ack(struct homa_rpc *rpc);
+struct homa_peer
+ *homa_peer_find(struct homa_peertab *peertab,
+ const struct in6_addr *addr,
+ struct inet_sock *inet);
+int homa_peer_get_acks(struct homa_peer *peer, int count,
+ struct homa_ack *dst);
+struct dst_entry
+ *homa_peer_get_dst(struct homa_peer *peer,
+ struct inet_sock *inet);
+void homa_peer_lock_slow(struct homa_peer *peer);
+void homa_peertab_gc_dsts(struct homa_peertab *peertab, __u64 now);
+
+/**
+ * homa_peer_lock() - Acquire a peer's @ack_lock. If the lock
+ * isn't immediately available, record stats on the waiting time.
+ * @peer: Peer to lock.
+ */
+static inline void homa_peer_lock(struct homa_peer *peer)
+ __acquires(&peer->ack_lock)
+{
+ if (!spin_trylock_bh(&peer->ack_lock))
+ homa_peer_lock_slow(peer);
+}
+
+/**
+ * homa_peer_unlock() - Release a peer's @ack_lock.
+ * @peer: Peer to unlock.
+ */
+static inline void homa_peer_unlock(struct homa_peer *peer)
+ __releases(&peer->ack_lock)
+{
+ spin_unlock_bh(&peer->ack_lock);
+}
+
+/**
+ * homa_get_dst() - Returns destination information associated with a peer,
+ * updating it if the cached information is stale.
+ * @peer: Peer whose destination information is desired.
+ * @hsk: Homa socket; needed by lower-level code to recreate the dst.
+ * Return: Up-to-date destination for peer.
+ */
+static inline struct dst_entry *homa_get_dst(struct homa_peer *peer,
+ struct homa_sock *hsk)
+{
+ if (unlikely(peer->dst->obsolete > 0))
+ homa_dst_refresh(hsk->homa->peers, peer, hsk);
+ return peer->dst;
+}
+
+#endif /* _HOMA_PEER_H */
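
As a usage note, homa_get_dst is meant for the transmit path: fetch the
(possibly refreshed) route and attach it to an outgoing skb. A minimal
sketch, assuming the usual dst reference conventions (skb_dst_set
consumes a reference, so one is taken first); the variables are
illustrative only:

	struct dst_entry *dst;

	dst = homa_get_dst(rpc->peer, hsk);
	dst_hold(dst);
	skb_dst_set(skb, dst);
	/* ... then hand skb to ip_queue_xmit()/ip6_xmit() as appropriate. */
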
--
2.34.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (5 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 06/12] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-23 19:01 ` Paolo Abeni
2025-01-24 7:33 ` Paolo Abeni
2025-01-15 18:59 ` [PATCH net-next v6 08/12] net: homa: create homa_incoming.c John Ousterhout
` (5 subsequent siblings)
12 siblings, 2 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
These files provide functions for managing the state that Homa keeps
for each open Homa socket.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_sock.c | 388 ++++++++++++++++++++++++++++++++++++++++
net/homa/homa_sock.h | 410 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 798 insertions(+)
create mode 100644 net/homa/homa_sock.c
create mode 100644 net/homa/homa_sock.h
diff --git a/net/homa/homa_sock.c b/net/homa/homa_sock.c
new file mode 100644
index 000000000000..991219c6a096
--- /dev/null
+++ b/net/homa/homa_sock.c
@@ -0,0 +1,388 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file manages homa_sock and homa_socktab objects. */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/**
+ * homa_socktab_init() - Constructor for homa_socktabs.
+ * @socktab: The object to initialize; previous contents are discarded.
+ */
+void homa_socktab_init(struct homa_socktab *socktab)
+{
+ int i;
+
+ spin_lock_init(&socktab->write_lock);
+ for (i = 0; i < HOMA_SOCKTAB_BUCKETS; i++)
+ INIT_HLIST_HEAD(&socktab->buckets[i]);
+ INIT_LIST_HEAD(&socktab->active_scans);
+}
+
+/**
+ * homa_socktab_destroy() - Destructor for homa_socktabs.
+ * @socktab: The object to destroy.
+ */
+void homa_socktab_destroy(struct homa_socktab *socktab)
+{
+ struct homa_socktab_scan scan;
+ struct homa_sock *hsk;
+
+ for (hsk = homa_socktab_start_scan(socktab, &scan); hsk;
+ hsk = homa_socktab_next(&scan)) {
+ homa_sock_destroy(hsk);
+ }
+ homa_socktab_end_scan(&scan);
+}
+
+/**
+ * homa_socktab_start_scan() - Begin an iteration over all of the sockets
+ * in a socktab.
+ * @socktab: Socktab to scan.
+ * @scan: Will hold the current state of the scan; any existing
+ * contents are discarded.
+ *
+ * Return: The first socket in the table, or NULL if the table is
+ * empty.
+ *
+ * Each call to homa_socktab_next will return the next socket in the table.
+ * All sockets that are present in the table at the time this function is
+ * invoked will eventually be returned, as long as they are not removed
+ * from the table. It is safe to remove sockets from the table and/or
+ * delete them while the scan is in progress. If a socket is removed from
+ * the table during the scan, it may or may not be returned by
+ * homa_socktab_next. New entries added during the scan may or may not be
+ * returned. The caller must hold an RCU read lock when invoking the
+ * scan-related methods here, as well as when manipulating sockets returned
+ * during the scan. It is safe to release and reacquire the RCU read lock
+ * during a scan, as long as no socket is held when the read lock is
+ * released and homa_socktab_next isn't invoked until the RCU read lock
+ * is reacquired.
+ */
+struct homa_sock *homa_socktab_start_scan(struct homa_socktab *socktab,
+ struct homa_socktab_scan *scan)
+{
+ scan->socktab = socktab;
+ scan->current_bucket = -1;
+ scan->next = NULL;
+
+ spin_lock_bh(&socktab->write_lock);
+ list_add_tail_rcu(&scan->scan_links, &socktab->active_scans);
+ spin_unlock_bh(&socktab->write_lock);
+
+ return homa_socktab_next(scan);
+}
+
+/**
+ * homa_socktab_next() - Return the next socket in an iteration over a socktab.
+ * @scan: State of the scan.
+ *
+ * Return: The next socket in the table, or NULL if the iteration has
+ * returned all of the sockets in the table. Sockets are not
+ * returned in any particular order. It's possible that the
+ * returned socket has been destroyed.
+ */
+struct homa_sock *homa_socktab_next(struct homa_socktab_scan *scan)
+{
+ struct homa_socktab_links *links;
+ struct homa_sock *hsk;
+
+ while (1) {
+ while (!scan->next) {
+ struct hlist_head *bucket;
+
+ scan->current_bucket++;
+ if (scan->current_bucket >= HOMA_SOCKTAB_BUCKETS)
+ return NULL;
+ bucket = &scan->socktab->buckets[scan->current_bucket];
+ scan->next = (struct homa_socktab_links *)
+ rcu_dereference(hlist_first_rcu(bucket));
+ }
+ links = scan->next;
+ hsk = links->sock;
+ scan->next = (struct homa_socktab_links *)
+ rcu_dereference(hlist_next_rcu(&links->hash_links));
+ return hsk;
+ }
+}
+
+/**
+ * homa_socktab_end_scan() - Must be invoked on completion of each scan
+ * to clean up state associated with the scan.
+ * @scan: State of the scan.
+ */
+void homa_socktab_end_scan(struct homa_socktab_scan *scan)
+{
+ spin_lock_bh(&scan->socktab->write_lock);
+ list_del(&scan->scan_links);
+ spin_unlock_bh(&scan->socktab->write_lock);
+}
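
To make the RCU requirements described above concrete, the expected scan
pattern looks like this minimal sketch (the same shape appears in
homa_abort_rpcs later in this series):

	struct homa_socktab_scan scan;
	struct homa_sock *hsk;

	rcu_read_lock();
	for (hsk = homa_socktab_start_scan(socktab, &scan); hsk;
	     hsk = homa_socktab_next(&scan)) {
		/* ... examine hsk; it may already have been shut down ... */
	}
	homa_socktab_end_scan(&scan);
	rcu_read_unlock();
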
+
+/**
+ * homa_sock_init() - Constructor for homa_sock objects. This function
+ * initializes only the parts of the socket that are owned by Homa.
+ * @hsk: Object to initialize.
+ * @homa: Homa implementation that will manage the socket.
+ *
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_sock_init(struct homa_sock *hsk, struct homa *homa)
+{
+ struct homa_socktab *socktab = homa->port_map;
+ int starting_port;
+ int result = 0;
+ int i;
+
+ spin_lock_bh(&socktab->write_lock);
+ atomic_set(&hsk->protect_count, 0);
+ spin_lock_init(&hsk->lock);
+ hsk->last_locker = "none";
+ hsk->homa = homa;
+ hsk->ip_header_length = (hsk->inet.sk.sk_family == AF_INET)
+ ? HOMA_IPV4_HEADER_LENGTH : HOMA_IPV6_HEADER_LENGTH;
+ hsk->shutdown = false;
+ starting_port = homa->prev_default_port;
+ while (1) {
+ homa->prev_default_port++;
+ if (homa->prev_default_port < HOMA_MIN_DEFAULT_PORT)
+ homa->prev_default_port = HOMA_MIN_DEFAULT_PORT;
+ if (!homa_sock_find(socktab, homa->prev_default_port))
+ break;
+ if (homa->prev_default_port == starting_port) {
+ spin_unlock_bh(&socktab->write_lock);
+ hsk->shutdown = true;
+ return -EADDRNOTAVAIL;
+ }
+ }
+ hsk->port = homa->prev_default_port;
+ hsk->inet.inet_num = hsk->port;
+ hsk->inet.inet_sport = htons(hsk->port);
+ hsk->socktab_links.sock = hsk;
+ hlist_add_head_rcu(&hsk->socktab_links.hash_links,
+ &socktab->buckets[homa_port_hash(hsk->port)]);
+ INIT_LIST_HEAD(&hsk->active_rpcs);
+ INIT_LIST_HEAD(&hsk->dead_rpcs);
+ hsk->dead_skbs = 0;
+ INIT_LIST_HEAD(&hsk->waiting_for_bufs);
+ INIT_LIST_HEAD(&hsk->ready_requests);
+ INIT_LIST_HEAD(&hsk->ready_responses);
+ INIT_LIST_HEAD(&hsk->request_interests);
+ INIT_LIST_HEAD(&hsk->response_interests);
+ for (i = 0; i < HOMA_CLIENT_RPC_BUCKETS; i++) {
+ struct homa_rpc_bucket *bucket = &hsk->client_rpc_buckets[i];
+
+ spin_lock_init(&bucket->lock);
+ INIT_HLIST_HEAD(&bucket->rpcs);
+ bucket->id = i;
+ }
+ for (i = 0; i < HOMA_SERVER_RPC_BUCKETS; i++) {
+ struct homa_rpc_bucket *bucket = &hsk->server_rpc_buckets[i];
+
+ spin_lock_init(&bucket->lock);
+ INIT_HLIST_HEAD(&bucket->rpcs);
+ bucket->id = i + 1000000;
+ }
+ hsk->buffer_pool = kzalloc(sizeof(*hsk->buffer_pool), GFP_ATOMIC);
+ if (!hsk->buffer_pool)
+ result = -ENOMEM;
+ spin_unlock_bh(&socktab->write_lock);
+ return result;
+}
+
+/**
+ * homa_sock_unlink() - Unlinks a socket from its socktab and does
+ * related cleanups. Once this method returns, the socket will not be
+ * discoverable through the socktab.
+ * @hsk: Socket to unlink.
+ */
+void homa_sock_unlink(struct homa_sock *hsk)
+{
+ struct homa_socktab *socktab = hsk->homa->port_map;
+ struct homa_socktab_scan *scan;
+
+ /* If any scans refer to this socket, advance them to refer to
+ * the next socket instead.
+ */
+ spin_lock_bh(&socktab->write_lock);
+ list_for_each_entry(scan, &socktab->active_scans, scan_links) {
+ if (!scan->next || scan->next->sock != hsk)
+ continue;
+ scan->next = (struct homa_socktab_links *)
+ rcu_dereference(hlist_next_rcu(&scan->next->hash_links));
+ }
+ hlist_del_rcu(&hsk->socktab_links.hash_links);
+ spin_unlock_bh(&socktab->write_lock);
+}
+
+/**
+ * homa_sock_shutdown() - Disable a socket so that it can no longer
+ * be used for either sending or receiving messages. Any system calls
+ * currently waiting to send or receive messages will be aborted.
+ * @hsk: Socket to shut down.
+ */
+void homa_sock_shutdown(struct homa_sock *hsk)
+ __acquires(&hsk->lock)
+ __releases(&hsk->lock)
+{
+ struct homa_interest *interest;
+ struct homa_rpc *rpc;
+
+ homa_sock_lock(hsk, "homa_socket_shutdown");
+ if (hsk->shutdown) {
+ homa_sock_unlock(hsk);
+ return;
+ }
+
+ /* The order of cleanup is very important, because there could be
+ * active operations that hold RPC locks but not the socket lock.
+ * 1. Set @shutdown; this ensures that no new RPCs will be created for
+ * this socket (though some creations might already be in progress).
+ * 2. Remove the socket from its socktab: this ensures that
+ * incoming packets for the socket will be dropped.
+ * 3. Go through all of the RPCs and delete them; this will
+ * synchronize with any operations in progress.
+ * 4. Perform other socket cleanup: at this point we know that
+ * there will be no concurrent activities on individual RPCs.
+ * 5. Don't delete the buffer pool until after all of the RPCs
+ * have been reaped.
+ * See sync.txt for additional information about locking.
+ */
+ hsk->shutdown = true;
+ homa_sock_unlink(hsk);
+ homa_sock_unlock(hsk);
+
+ list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+ homa_rpc_lock(rpc, "homa_sock_shutdown");
+ homa_rpc_free(rpc);
+ homa_rpc_unlock(rpc);
+ }
+
+ homa_sock_lock(hsk, "homa_socket_shutdown #2");
+ list_for_each_entry(interest, &hsk->request_interests, request_links)
+ wake_up_process(interest->thread);
+ list_for_each_entry(interest, &hsk->response_interests, response_links)
+ wake_up_process(interest->thread);
+ homa_sock_unlock(hsk);
+
+ while (!list_empty(&hsk->dead_rpcs))
+ homa_rpc_reap(hsk, true);
+
+ if (hsk->buffer_pool) {
+ homa_pool_destroy(hsk->buffer_pool);
+ kfree(hsk->buffer_pool);
+ hsk->buffer_pool = NULL;
+ }
+}
+
+/**
+ * homa_sock_destroy() - Destructor for homa_sock objects. This function
+ * only cleans up the parts of the object that are owned by Homa.
+ * @hsk: Socket to destroy.
+ */
+void homa_sock_destroy(struct homa_sock *hsk)
+{
+ homa_sock_shutdown(hsk);
+ sock_set_flag(&hsk->inet.sk, SOCK_RCU_FREE);
+}
+
+/**
+ * homa_sock_bind() - Associates a server port with a socket; if there
+ * was a previous server port assignment for @hsk, it is abandoned.
+ * @socktab: Hash table in which the binding will be recorded.
+ * @hsk: Homa socket.
+ * @port: Desired server port for @hsk. If 0, then this call
+ * becomes a no-op: the socket will continue to use
+ * its randomly assigned client port.
+ *
+ * Return: 0 for success, otherwise a negative errno.
+ */
+int homa_sock_bind(struct homa_socktab *socktab, struct homa_sock *hsk,
+ __u16 port)
+{
+ struct homa_sock *owner;
+ int result = 0;
+
+ if (port == 0)
+ return result;
+ if (port >= HOMA_MIN_DEFAULT_PORT)
+ return -EINVAL;
+ homa_sock_lock(hsk, "homa_sock_bind");
+ spin_lock_bh(&socktab->write_lock);
+ if (hsk->shutdown) {
+ result = -ESHUTDOWN;
+ goto done;
+ }
+
+ owner = homa_sock_find(socktab, port);
+ if (owner) {
+ if (owner != hsk)
+ result = -EADDRINUSE;
+ goto done;
+ }
+ hlist_del_rcu(&hsk->socktab_links.hash_links);
+ hsk->port = port;
+ hsk->inet.inet_num = port;
+ hsk->inet.inet_sport = htons(hsk->port);
+ hlist_add_head_rcu(&hsk->socktab_links.hash_links,
+ &socktab->buckets[homa_port_hash(port)]);
+done:
+ spin_unlock_bh(&socktab->write_lock);
+ homa_sock_unlock(hsk);
+ return result;
+}
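
A hedged sketch of how a server might claim a well-known port at setup
time (server_port is a placeholder; it must be below
HOMA_MIN_DEFAULT_PORT or the call returns -EINVAL):

	int err;

	err = homa_sock_bind(hsk->homa->port_map, hsk, server_port);
	if (err)
		return err;	/* -EINVAL, -EADDRINUSE, or -ESHUTDOWN */
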
+
+/**
+ * homa_sock_find() - Returns the socket associated with a given port.
+ * @socktab: Hash table in which to perform lookup.
+ * @port: The port of interest.
+ * Return: The socket that owns @port, or NULL if none.
+ *
+ * Note: this function uses RCU list-searching facilities, but it doesn't
+ * call rcu_read_lock. The caller should do that, if the caller cares (this
+ * way, the caller's use of the socket will also be protected).
+ */
+struct homa_sock *homa_sock_find(struct homa_socktab *socktab, __u16 port)
+{
+ struct homa_socktab_links *link;
+ struct homa_sock *result = NULL;
+
+ hlist_for_each_entry_rcu(link, &socktab->buckets[homa_port_hash(port)],
+ hash_links) {
+ struct homa_sock *hsk = link->sock;
+
+ if (hsk->port == port) {
+ result = hsk;
+ break;
+ }
+ }
+ return result;
+}
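
Per the note above, the caller supplies the RCU read-side section; a
minimal sketch of the lookup pattern:

	struct homa_sock *hsk;

	rcu_read_lock();
	hsk = homa_sock_find(socktab, port);
	if (hsk) {
		/* Use hsk here; the RCU read section protects this use. */
	}
	rcu_read_unlock();
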
+
+/**
+ * homa_sock_lock_slow() - This function implements the slow path for
+ * acquiring a socket's lock. It is invoked when a socket lock isn't immediately
+ * available. It waits for the lock, but also records statistics about
+ * the waiting time.
+ * @hsk: socket to lock.
+ */
+void homa_sock_lock_slow(struct homa_sock *hsk)
+ __acquires(&hsk->lock)
+{
+ spin_lock_bh(&hsk->lock);
+}
+
+/**
+ * homa_bucket_lock_slow() - This function implements the slow path for
+ * locking a bucket in one of the hash tables of RPCs. It is invoked when a
+ * lock isn't immediately available. It waits for the lock, but also records
+ * statistics about the waiting time.
+ * @bucket: The hash table bucket to lock.
+ * @id: ID of the particular RPC being locked (multiple RPCs may
+ * share a single bucket lock).
+ */
+void homa_bucket_lock_slow(struct homa_rpc_bucket *bucket, __u64 id)
+ __acquires(&bucket->lock)
+{
+ spin_lock_bh(&bucket->lock);
+}
diff --git a/net/homa/homa_sock.h b/net/homa/homa_sock.h
new file mode 100644
index 000000000000..48ae69ec05bb
--- /dev/null
+++ b/net/homa/homa_sock.h
@@ -0,0 +1,410 @@
+/* SPDX-License-Identifier: BSD-2-Clause */
+
+/* This file defines structs and other things related to Homa sockets. */
+
+#ifndef _HOMA_SOCK_H
+#define _HOMA_SOCK_H
+
+/* Forward declarations. */
+struct homa;
+struct homa_pool;
+
+void homa_sock_lock_slow(struct homa_sock *hsk);
+
+/**
+ * define HOMA_SOCKTAB_BUCKETS - Number of hash buckets in a homa_socktab.
+ * Must be a power of 2.
+ */
+#define HOMA_SOCKTAB_BUCKETS 1024
+
+/**
+ * struct homa_socktab - A hash table that maps from port numbers (either
+ * client or server) to homa_sock objects.
+ *
+ * This table is managed exclusively by homa_sock.c, using RCU to
+ * minimize synchronization during lookups.
+ */
+struct homa_socktab {
+ /**
+ * @write_lock: Controls all modifications to this object; not needed
+ * for socket lookups (RCU is used instead). Also used to
+ * synchronize port allocation.
+ */
+ spinlock_t write_lock;
+
+ /**
+ * @buckets: Heads of chains for hash table buckets. Chains
+ * consist of homa_socktab_link objects.
+ */
+ struct hlist_head buckets[HOMA_SOCKTAB_BUCKETS];
+
+ /**
+ * @active_scans: List of homa_socktab_scan structs for all scans
+ * currently underway on this homa_socktab.
+ */
+ struct list_head active_scans;
+};
+
+/**
+ * struct homa_socktab_links - Used to link homa_socks into the hash chains
+ * of a homa_socktab.
+ */
+struct homa_socktab_links {
+ /** @hash_links: links this element into the hash chain. */
+ struct hlist_node hash_links;
+
+ /** @sock: Homa socket structure. */
+ struct homa_sock *sock;
+};
+
+/**
+ * struct homa_socktab_scan - Records the state of an iteration over all
+ * the entries in a homa_socktab, in a way that permits RCU-safe deletion
+ * of entries.
+ */
+struct homa_socktab_scan {
+ /** @socktab: The table that is being scanned. */
+ struct homa_socktab *socktab;
+
+ /**
+ * @current_bucket: the index of the bucket in socktab->buckets
+ * currently being scanned. If >= HOMA_SOCKTAB_BUCKETS, the scan
+ * is complete.
+ */
+ int current_bucket;
+
+ /**
+ * @next: the next socket to return from homa_socktab_next (this
+ * socket has not yet been returned). NULL means there are no
+ * more sockets in the current bucket.
+ */
+ struct homa_socktab_links *next;
+
+ /**
+ * @scan_links: Used to link this scan into @socktab->active_scans.
+ */
+ struct list_head scan_links;
+};
+
+/**
+ * struct homa_rpc_bucket - One bucket in a hash table of RPCs.
+ */
+struct homa_rpc_bucket {
+ /**
+ * @lock: serves as a lock both for this bucket (e.g., when
+ * adding and removing RPCs) and also for all of the RPCs in
+ * the bucket. Must be held whenever manipulating an RPC in
+ * this bucket. This dual purpose permits clean and safe
+ * deletion and garbage collection of RPCs.
+ */
+ spinlock_t lock;
+
+ /** @rpcs: list of RPCs that hash to this bucket. */
+ struct hlist_head rpcs;
+
+ /**
+ * @id: identifier for this bucket, used in error messages etc.
+ * It's the index of the bucket within its hash table bucket
+ * array, with an additional offset to separate server and
+ * client RPCs.
+ */
+ int id;
+};
+
+/**
+ * define HOMA_CLIENT_RPC_BUCKETS - Number of buckets in hash tables for
+ * client RPCs. Must be a power of 2.
+ */
+#define HOMA_CLIENT_RPC_BUCKETS 1024
+
+/**
+ * define HOMA_SERVER_RPC_BUCKETS - Number of buckets in hash tables for
+ * server RPCs. Must be a power of 2.
+ */
+#define HOMA_SERVER_RPC_BUCKETS 1024
+
+/**
+ * struct homa_sock - Information about an open socket.
+ */
+struct homa_sock {
+ /* Info for other network layers. Note: IPv6 info (struct ipv6_pinfo)
+ * comes at the very end of the struct, *after* Homa's data, if this
+ * socket uses IPv6.
+ */
+ union {
+ /** @sock: generic socket data; must be the first field. */
+ struct sock sock;
+
+ /**
+ * @inet: generic Internet socket data; must also be the
+ * first field (contains sock as its first member).
+ */
+ struct inet_sock inet;
+ };
+
+ /**
+ * @lock: Must be held when modifying fields such as interests
+ * and lists of RPCs. This lock is used in place of sk->sk_lock
+ * because it's used differently (it's always used as a simple
+ * spin lock). See sync.txt for more on Homa's synchronization
+ * strategy.
+ */
+ spinlock_t lock;
+
+ /**
+ * @last_locker: identifies the code that most recently acquired
+ * @lock successfully. Occasionally used for debugging.
+ */
+ char *last_locker;
+
+ /**
+ * @protect_count: counts the number of calls to homa_protect_rpcs
+ * for which there have not yet been calls to homa_unprotect_rpcs.
+ * See sync.txt for more info.
+ */
+ atomic_t protect_count;
+
+ /**
+ * @homa: Overall state about the Homa implementation. NULL
+ * means this socket has been deleted.
+ */
+ struct homa *homa;
+
+ /**
+ * @shutdown: True means the socket is no longer usable (either
+ * shutdown has already been invoked, or the socket was never
+ * properly initialized).
+ */
+ bool shutdown;
+
+ /**
+ * @port: Port number: identifies this socket uniquely among all
+ * those on this node.
+ */
+ __u16 port;
+
+ /**
+ * @ip_header_length: Length of IP headers for this socket (depends
+ * on IPv4 vs. IPv6).
+ */
+ int ip_header_length;
+
+ /**
+ * @socktab_links: Links this socket into the homa_socktab
+ * based on @port.
+ */
+ struct homa_socktab_links socktab_links;
+
+ /**
+ * @active_rpcs: List of all existing RPCs related to this socket,
+ * including both client and server RPCs. This list isn't strictly
+ * needed, since RPCs are already in one of the hash tables below,
+ * but it's more efficient for homa_timer to have this list
+ * (so it doesn't have to scan large numbers of hash buckets).
+ * The list is sorted, with the oldest RPC first. Manipulate with
+ * RCU so timer can access without locking.
+ */
+ struct list_head active_rpcs;
+
+ /**
+ * @dead_rpcs: Contains RPCs for which homa_rpc_free has been
+ * called, but their packet buffers haven't yet been freed.
+ */
+ struct list_head dead_rpcs;
+
+ /** @dead_skbs: Total number of socket buffers in RPCs on dead_rpcs. */
+ int dead_skbs;
+
+ /**
+ * @waiting_for_bufs: Contains RPCs that are blocked because there
+ * wasn't enough space in the buffer pool region for their incoming
+ * messages. Sorted in increasing order of message length.
+ */
+ struct list_head waiting_for_bufs;
+
+ /**
+ * @ready_requests: Contains server RPCs whose request message is
+ * in a state requiring attention from a user process. The head is
+ * oldest, i.e. next to return.
+ */
+ struct list_head ready_requests;
+
+ /**
+ * @ready_responses: Contains client RPCs whose response message is
+ * in a state requiring attention from a user process. The head is
+ * oldest, i.e. next to return.
+ */
+ struct list_head ready_responses;
+
+ /**
+ * @request_interests: List of threads that want to receive incoming
+ * request messages.
+ */
+ struct list_head request_interests;
+
+ /**
+ * @response_interests: List of threads that want to receive incoming
+ * response messages.
+ */
+ struct list_head response_interests;
+
+ /**
+ * @client_rpc_buckets: Hash table for fast lookup of client RPCs.
+ * Modifications are synchronized with bucket locks, not
+ * the socket lock.
+ */
+ struct homa_rpc_bucket client_rpc_buckets[HOMA_CLIENT_RPC_BUCKETS];
+
+ /**
+ * @server_rpc_buckets: Hash table for fast lookup of server RPCs.
+ * Modifications are synchronized with bucket locks, not
+ * the socket lock.
+ */
+ struct homa_rpc_bucket server_rpc_buckets[HOMA_SERVER_RPC_BUCKETS];
+
+ /**
+ * @buffer_pool: used to allocate buffer space for incoming messages.
+ * Storage is dynamically allocated.
+ */
+ struct homa_pool *buffer_pool;
+};
+
+/**
+ * struct homa_v6_sock - For IPv6, additional IPv6-specific information
+ * is present in the socket struct after Homa-specific information.
+ */
+struct homa_v6_sock {
+ /** @homa: All socket info except for IPv6-specific stuff. */
+ struct homa_sock homa;
+
+ /** @inet6: Socket info specific to IPv6. */
+ struct ipv6_pinfo inet6;
+};
+
+void homa_bucket_lock_slow(struct homa_rpc_bucket *bucket,
+ __u64 id);
+int homa_sock_bind(struct homa_socktab *socktab,
+ struct homa_sock *hsk, __u16 port);
+void homa_sock_destroy(struct homa_sock *hsk);
+struct homa_sock *homa_sock_find(struct homa_socktab *socktab, __u16 port);
+int homa_sock_init(struct homa_sock *hsk, struct homa *homa);
+void homa_sock_shutdown(struct homa_sock *hsk);
+void homa_sock_unlink(struct homa_sock *hsk);
+int homa_socket(struct sock *sk);
+void homa_socktab_destroy(struct homa_socktab *socktab);
+void homa_socktab_end_scan(struct homa_socktab_scan *scan);
+void homa_socktab_init(struct homa_socktab *socktab);
+struct homa_sock *homa_socktab_next(struct homa_socktab_scan *scan);
+struct homa_sock *homa_socktab_start_scan(struct homa_socktab *socktab,
+ struct homa_socktab_scan *scan);
+
+/**
+ * homa_sock_lock() - Acquire the lock for a socket. If the lock
+ * isn't immediately available, record stats on the waiting time.
+ * @hsk: Socket to lock.
+ * @locker: Static string identifying where the socket was locked;
+ * used to track down deadlocks.
+ */
+static inline void homa_sock_lock(struct homa_sock *hsk, const char *locker)
+ __acquires(&hsk->lock)
+{
+ if (!spin_trylock_bh(&hsk->lock))
+ homa_sock_lock_slow(hsk);
+}
+
+/**
+ * homa_sock_unlock() - Release the lock for a socket.
+ * @hsk: Socket to unlock.
+ */
+static inline void homa_sock_unlock(struct homa_sock *hsk)
+ __releases(&hsk->lock)
+{
+ spin_unlock_bh(&hsk->lock);
+}
+
+/**
+ * homa_port_hash() - Hash function for port numbers.
+ * @port: Port number being looked up.
+ *
+ * Return: The index of the bucket in which this port will be found (if
+ * it exists).
+ */
+static inline int homa_port_hash(__u16 port)
+{
+ /* We can use a really simple hash function here because client
+ * port numbers are allocated sequentially and server port numbers
+ * are unpredictable.
+ */
+ return port & (HOMA_SOCKTAB_BUCKETS - 1);
+}
+
+/**
+ * homa_client_rpc_bucket() - Find the bucket containing a given
+ * client RPC.
+ * @hsk: Socket associated with the RPC.
+ * @id: Id of the desired RPC.
+ *
+ * Return: The bucket in which this RPC will appear, if the RPC exists.
+ */
+static inline struct homa_rpc_bucket *homa_client_rpc_bucket(struct homa_sock *hsk,
+ __u64 id)
+{
+ /* We can use a really simple hash function here because RPC ids
+ * are allocated sequentially.
+ */
+ return &hsk->client_rpc_buckets[(id >> 1)
+ & (HOMA_CLIENT_RPC_BUCKETS - 1)];
+}
+
+/**
+ * homa_server_rpc_bucket() - Find the bucket containing a given
+ * server RPC.
+ * @hsk: Socket associated with the RPC.
+ * @id: Id of the desired RPC.
+ *
+ * Return: The bucket in which this RPC will appear, if the RPC exists.
+ */
+static inline struct homa_rpc_bucket *homa_server_rpc_bucket(struct homa_sock *hsk,
+ __u64 id)
+{
+ /* Each client allocates RPC ids sequentially, so they will
+ * naturally distribute themselves across the hash space.
+ * Thus we can use the id directly as hash.
+ */
+ return &hsk->server_rpc_buckets[(id >> 1)
+ & (HOMA_SERVER_RPC_BUCKETS - 1)];
+}
+
+/**
+ * homa_bucket_lock() - Acquire the lock for an RPC hash table bucket.
+ * @bucket: Bucket to lock
+ * @id: ID of the RPC that is requesting the lock. Normally ignored,
+ * but used occasionally for diagnostics and debugging.
+ * @locker: Static string identifying the locking code. Normally ignored,
+ * but used occasionally for diagnostics and debugging.
+ */
+static inline void homa_bucket_lock(struct homa_rpc_bucket *bucket,
+ __u64 id, const char *locker)
+{
+ if (!spin_trylock_bh(&bucket->lock))
+ homa_bucket_lock_slow(bucket, id);
+}
+
+/**
+ * homa_bucket_unlock() - Release the lock for an RPC hash table bucket.
+ * @bucket: Bucket to unlock.
+ * @id: ID of the RPC that was using the lock.
+ */
+static inline void homa_bucket_unlock(struct homa_rpc_bucket *bucket, __u64 id)
+ __releases(&bucket->lock)
+{
+ spin_unlock_bh(&bucket->lock);
+}
+
+/**
+ * homa_sk() - Convert from a generic struct sock pointer to the
+ * enclosing homa_sock.
+ * @sk: Socket to convert; must be a Homa socket.
+ * Return: The homa_sock corresponding to @sk.
+ */
+static inline struct homa_sock *homa_sk(const struct sock *sk)
+{
+ return (struct homa_sock *)sk;
+}
+
+#endif /* _HOMA_SOCK_H */
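
As a usage note for the bucket helpers: because a bucket's lock also
serves as the lock for every RPC in the bucket, looking up and locking an
RPC is a single-lock operation. A minimal sketch (the walk of
bucket->rpcs is elided, since struct homa_rpc is introduced in another
patch):

	struct homa_rpc_bucket *bucket;

	bucket = homa_client_rpc_bucket(hsk, id);
	homa_bucket_lock(bucket, id, "lookup example");
	/* ... search bucket->rpcs for the RPC with this id; holding the
	 * bucket lock also locks that RPC ...
	 */
	homa_bucket_unlock(bucket, id);
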
--
2.34.1
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (6 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-24 8:31 ` Paolo Abeni
2025-01-27 10:19 ` Paolo Abeni
2025-01-15 18:59 ` [PATCH net-next v6 09/12] net: homa: create homa_outgoing.c John Ousterhout
` (4 subsequent siblings)
12 siblings, 2 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This file contains most of the code for handling incoming packets,
including top-level dispatching code plus specific handlers for each
packet type. It also contains code for dispatching fully-received
messages to waiting application threads.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_incoming.c | 1076 ++++++++++++++++++++++++++++++++++++++
1 file changed, 1076 insertions(+)
create mode 100644 net/homa/homa_incoming.c
diff --git a/net/homa/homa_incoming.c b/net/homa/homa_incoming.c
new file mode 100644
index 000000000000..c7630bf6d1fb
--- /dev/null
+++ b/net/homa/homa_incoming.c
@@ -0,0 +1,1076 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions that handle incoming Homa messages.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/**
+ * homa_message_in_init() - Constructor for homa_message_in.
+ * @rpc: RPC whose msgin structure should be initialized.
+ * @length: Total number of bytes in message.
+ * Return: Zero for successful initialization, or a negative errno
+ * if rpc->msgin could not be initialized.
+ */
+int homa_message_in_init(struct homa_rpc *rpc, int length)
+{
+ int err;
+
+ rpc->msgin.length = length;
+ skb_queue_head_init(&rpc->msgin.packets);
+ rpc->msgin.recv_end = 0;
+ INIT_LIST_HEAD(&rpc->msgin.gaps);
+ rpc->msgin.bytes_remaining = length;
+ rpc->msgin.resend_all = 0;
+ rpc->msgin.num_bpages = 0;
+ err = homa_pool_allocate(rpc);
+ if (err != 0)
+ return err;
+ return 0;
+}
+
+/**
+ * homa_gap_new() - Create a new gap and add it to a list.
+ * @next: Add the new gap just before this list element.
+ * @start: Offset of first byte covered by the gap.
+ * @end: Offset of byte just after the last one covered by the gap.
+ * Return: Pointer to the new gap, or NULL if memory couldn't be allocated
+ * for the gap object.
+ */
+struct homa_gap *homa_gap_new(struct list_head *next, int start, int end)
+{
+ struct homa_gap *gap;
+
+ gap = kmalloc(sizeof(*gap), GFP_ATOMIC);
+ if (!gap)
+ return NULL;
+ gap->start = start;
+ gap->end = end;
+ gap->time = sched_clock();
+ list_add_tail(&gap->links, next);
+ return gap;
+}
+
+/**
+ * homa_gap_retry() - Send RESEND requests for all of the unreceived
+ * gaps in a message.
+ * @rpc: RPC to check; must be locked by caller.
+ */
+void homa_gap_retry(struct homa_rpc *rpc)
+{
+ struct homa_resend_hdr resend;
+ struct homa_gap *gap;
+
+ list_for_each_entry(gap, &rpc->msgin.gaps, links) {
+ resend.offset = htonl(gap->start);
+ resend.length = htonl(gap->end - gap->start);
+ homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
+ }
+}
+
+/**
+ * homa_add_packet() - Add an incoming packet to the contents of a
+ * partially received message.
+ * @rpc: Add the packet to the msgin for this RPC.
+ * @skb: The new packet. This function takes ownership of the packet
+ * (the packet will either be freed or added to rpc->msgin.packets).
+ */
+void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb)
+{
+ struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
+ struct homa_gap *gap, *dummy, *gap2;
+ int start = ntohl(h->seg.offset);
+ int length = homa_data_len(skb);
+ int end = start + length;
+
+ if ((start + length) > rpc->msgin.length)
+ goto discard;
+
+ if (start == rpc->msgin.recv_end) {
+ /* Common case: packet is sequential. */
+ rpc->msgin.recv_end += length;
+ goto keep;
+ }
+
+ if (start > rpc->msgin.recv_end) {
+ /* Packet creates a new gap. */
+ if (!homa_gap_new(&rpc->msgin.gaps,
+ rpc->msgin.recv_end, start)) {
+ pr_err("Homa couldn't allocate gap: insufficient memory\n");
+ goto discard;
+ }
+ rpc->msgin.recv_end = end;
+ goto keep;
+ }
+
+ /* Must now check to see if the packet fills in part or all of
+ * an existing gap.
+ */
+ list_for_each_entry_safe(gap, dummy, &rpc->msgin.gaps, links) {
+ /* Is packet at the start of this gap? */
+ if (start <= gap->start) {
+ if (end <= gap->start)
+ continue;
+ if (start < gap->start)
+ goto discard;
+ if (end > gap->end)
+ goto discard;
+ gap->start = end;
+ if (gap->start >= gap->end) {
+ list_del(&gap->links);
+ kfree(gap);
+ }
+ goto keep;
+ }
+
+ /* Is packet at the end of this gap? BTW, at this point we know
+ * the packet can't cover the entire gap.
+ */
+ if (end >= gap->end) {
+ if (start >= gap->end)
+ continue;
+ if (end > gap->end)
+ goto discard;
+ gap->end = start;
+ goto keep;
+ }
+
+ /* Packet is in the middle of the gap; must split the gap. */
+ gap2 = homa_gap_new(&gap->links, gap->start, start);
+ if (!gap2) {
+ pr_err("Homa couldn't allocate gap for split: insufficient memory\n");
+ goto discard;
+ }
+ gap2->time = gap->time;
+ gap->start = end;
+ goto keep;
+ }
+
+discard:
+ kfree_skb(skb);
+ return;
+
+keep:
+ __skb_queue_tail(&rpc->msgin.packets, skb);
+ rpc->msgin.bytes_remaining -= length;
+}
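
To illustrate the gap bookkeeping: suppose recv_end is 2000 and a packet
covering bytes 3000-3999 arrives; a gap [2000, 3000) is created and
recv_end becomes 4000. If bytes 2000-2399 arrive next, that gap shrinks
to [2400, 3000); if instead bytes 2500-2599 arrive, the gap is split into
[2000, 2500) and [2600, 3000).
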
+
+/**
+ * homa_copy_to_user() - Copy as much data as possible from incoming
+ * packet buffers to buffers in user space.
+ * @rpc: RPC for which data should be copied. Must be locked by caller.
+ * Return: Zero for success or a negative errno if there is an error.
+ * It is possible for the RPC to be freed while this function
+ * executes (it releases and reacquires the RPC lock). If that
+ * happens, -EINVAL will be returned and the state of @rpc
+ * will be RPC_DEAD.
+ */
+int homa_copy_to_user(struct homa_rpc *rpc)
+ __releases(rpc->bucket_lock)
+ __acquires(rpc->bucket_lock)
+{
+#define MAX_SKBS 20
+ struct sk_buff *skbs[MAX_SKBS];
+ int error = 0;
+ int n = 0; /* Number of filled entries in skbs. */
+ int i;
+
+ /* Tricky note: we can't hold the RPC lock while we're actually
+ * copying to user space, because (a) it's illegal to hold a spinlock
+ * while copying to user space and (b) we'd like for homa_softirq
+ * to add more packets to the RPC while we're copying these out.
+ * So, collect a bunch of packets to copy, then release the lock,
+ * copy them, and reacquire the lock.
+ */
+ while (true) {
+ struct sk_buff *skb;
+
+ if (rpc->state == RPC_DEAD) {
+ error = -EINVAL;
+ break;
+ }
+
+ skb = __skb_dequeue(&rpc->msgin.packets);
+ if (skb) {
+ skbs[n] = skb;
+ n++;
+ if (n < MAX_SKBS)
+ continue;
+ }
+ if (n == 0)
+ break;
+
+ /* At this point we've collected a batch of packets (or
+ * run out of packets); copy any available packets out to
+ * user space.
+ */
+ atomic_or(RPC_COPYING_TO_USER, &rpc->flags);
+ homa_rpc_unlock(rpc);
+
+ /* Each iteration of this loop copies out one skb. */
+ for (i = 0; i < n; i++) {
+ struct homa_data_hdr *h = (struct homa_data_hdr *)
+ skbs[i]->data;
+ int pkt_length = homa_data_len(skbs[i]);
+ int offset = ntohl(h->seg.offset);
+ int buf_bytes, chunk_size;
+ struct iov_iter iter;
+ int copied = 0;
+ char *dst;
+
+ /* Each iteration of this loop copies to one
+ * user buffer.
+ */
+ while (copied < pkt_length) {
+ chunk_size = pkt_length - copied;
+ dst = homa_pool_get_buffer(rpc, offset + copied,
+ &buf_bytes);
+ if (buf_bytes < chunk_size) {
+ if (buf_bytes == 0)
+ /* skb has data beyond message
+ * end?
+ */
+ break;
+ chunk_size = buf_bytes;
+ }
+ error = import_ubuf(READ, (void __user *)dst,
+ chunk_size, &iter);
+ if (error)
+ goto free_skbs;
+ error = skb_copy_datagram_iter(skbs[i],
+ sizeof(*h) +
+ copied, &iter,
+ chunk_size);
+ if (error)
+ goto free_skbs;
+ copied += chunk_size;
+ }
+ }
+
+free_skbs:
+ for (i = 0; i < n; i++)
+ kfree_skb(skbs[i]);
+ n = 0;
+ atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+ homa_rpc_lock(rpc, "homa_copy_to_user");
+ atomic_andnot(APP_NEEDS_LOCK | RPC_COPYING_TO_USER,
+ &rpc->flags);
+ if (error)
+ break;
+ }
+ return error;
+}
+
+/**
+ * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
+ * all related to the same RPC.
+ * @skb: First packet in the batch, linked through skb->next.
+ * @homa: Overall information about the Homa transport.
+ */
+void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa)
+{
+#define MAX_ACKS 10
+ const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+ struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
+ __u64 id = homa_local_id(h->common.sender_id);
+ int dport = ntohs(h->common.dport);
+
+ /* Used to collect acks from data packets so we can process them
+ * all at the end (can't process them inline because that may
+ * require locking conflicting RPCs). If we run out of space just
+ * ignore the extra acks; they'll be regenerated later through the
+ * explicit mechanism.
+ */
+ struct homa_ack acks[MAX_ACKS];
+ struct homa_rpc *rpc = NULL;
+ struct homa_sock *hsk;
+ struct sk_buff *next;
+ int num_acks = 0;
+
+ /* Find the appropriate socket.*/
+ hsk = homa_sock_find(homa->port_map, dport);
+ if (!hsk) {
+ if (skb_is_ipv6(skb))
+ icmp6_send(skb, ICMPV6_DEST_UNREACH,
+ ICMPV6_PORT_UNREACH, 0, NULL, IP6CB(skb));
+ else
+ icmp_send(skb, ICMP_DEST_UNREACH,
+ ICMP_PORT_UNREACH, 0);
+ while (skb) {
+ next = skb->next;
+ kfree_skb(skb);
+ skb = next;
+ }
+ return;
+ }
+
+ /* Each iteration through the following loop processes one packet. */
+ for (; skb; skb = next) {
+ h = (struct homa_data_hdr *)skb->data;
+ next = skb->next;
+
+ /* Relinquish the RPC lock temporarily if it's needed
+ * elsewhere.
+ */
+ if (rpc) {
+ int flags = atomic_read(&rpc->flags);
+
+ if (flags & APP_NEEDS_LOCK) {
+ homa_rpc_unlock(rpc);
+ homa_spin(200);
+ rpc = NULL;
+ }
+ }
+
+ /* Find and lock the RPC if we haven't already done so. */
+ if (!rpc) {
+ if (!homa_is_client(id)) {
+ /* We are the server for this RPC. */
+ if (h->common.type == DATA) {
+ int created;
+
+ /* Create a new RPC if one doesn't
+ * already exist.
+ */
+ rpc = homa_rpc_new_server(hsk, &saddr,
+ h, &created);
+ if (IS_ERR(rpc)) {
+ pr_warn("homa_pkt_dispatch couldn't create server rpc: error %lu",
+ -PTR_ERR(rpc));
+ rpc = NULL;
+ goto discard;
+ }
+ } else {
+ rpc = homa_find_server_rpc(hsk, &saddr,
+ id);
+ }
+ } else {
+ rpc = homa_find_client_rpc(hsk, id);
+ }
+ }
+ if (unlikely(!rpc)) {
+ if (h->common.type != NEED_ACK &&
+ h->common.type != ACK &&
+ h->common.type != RESEND)
+ goto discard;
+ } else {
+ if (h->common.type == DATA ||
+ h->common.type == BUSY ||
+ h->common.type == NEED_ACK)
+ rpc->silent_ticks = 0;
+ rpc->peer->outstanding_resends = 0;
+ }
+
+ switch (h->common.type) {
+ case DATA:
+ if (h->ack.client_id) {
+ /* Save the ack for processing later, when we
+ * have released the RPC lock.
+ */
+ if (num_acks < MAX_ACKS) {
+ acks[num_acks] = h->ack;
+ num_acks++;
+ }
+ }
+ homa_data_pkt(skb, rpc);
+ break;
+ case RESEND:
+ homa_resend_pkt(skb, rpc, hsk);
+ break;
+ case UNKNOWN:
+ homa_unknown_pkt(skb, rpc);
+ break;
+ case BUSY:
+ /* Nothing to do for these packets except reset
+ * silent_ticks, which happened above.
+ */
+ goto discard;
+ case NEED_ACK:
+ homa_need_ack_pkt(skb, hsk, rpc);
+ break;
+ case ACK:
+ homa_ack_pkt(skb, hsk, rpc);
+ rpc = NULL;
+
+ /* homa_ack_pkt released the RPC lock, so it isn't safe to
+ * process any more packets; there shouldn't be any, but
+ * discard them just in case.
+ */
+ while (next) {
+ WARN_ONCE(next, "%s found extra packets after ACK\n",
+ __func__);
+ skb = next;
+ next = skb->next;
+ kfree_skb(skb);
+ }
+ break;
+ default:
+ goto discard;
+ }
+ continue;
+
+discard:
+ kfree_skb(skb);
+ }
+ if (rpc)
+ homa_rpc_unlock(rpc);
+
+ while (num_acks > 0) {
+ num_acks--;
+ homa_rpc_acked(hsk, &saddr, &acks[num_acks]);
+ }
+
+ if (hsk->dead_skbs >= 2 * hsk->homa->dead_buffs_limit)
+ /* We get here if neither homa_wait_for_message
+ * nor homa_timer can keep up with reaping dead
+ * RPCs. See reap.txt for details.
+ */
+ homa_rpc_reap(hsk, false);
+}
+
+/**
+ * homa_data_pkt() - Handler for incoming DATA packets
+ * @skb: Incoming packet; size known to be large enough for the header.
+ * This function now owns the packet.
+ * @rpc: Information about the RPC corresponding to this packet.
+ * Must be locked by the caller.
+ */
+void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
+{
+ struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
+
+ if (rpc->state != RPC_INCOMING && homa_is_client(rpc->id)) {
+ if (unlikely(rpc->state != RPC_OUTGOING))
+ goto discard;
+ rpc->state = RPC_INCOMING;
+ if (homa_message_in_init(rpc, ntohl(h->message_length)) != 0)
+ goto discard;
+ } else if (rpc->state != RPC_INCOMING) {
+ /* Must be server; note that homa_rpc_new_server already
+ * initialized msgin and allocated buffers.
+ */
+ if (unlikely(rpc->msgin.length >= 0))
+ goto discard;
+ }
+
+ if (rpc->msgin.num_bpages == 0)
+ /* Drop packets that arrive when we can't allocate buffer
+ * space. If we keep them around, packet buffer usage can
+ * exceed available cache space, resulting in poor
+ * performance.
+ */
+ goto discard;
+
+ homa_add_packet(rpc, skb);
+
+ if (skb_queue_len(&rpc->msgin.packets) != 0 &&
+ !(atomic_read(&rpc->flags) & RPC_PKTS_READY)) {
+ atomic_or(RPC_PKTS_READY, &rpc->flags);
+ homa_sock_lock(rpc->hsk, "homa_data_pkt");
+ homa_rpc_handoff(rpc);
+ homa_sock_unlock(rpc->hsk);
+ }
+ return;
+
+discard:
+ kfree_skb(skb);
+}
+
+/**
+ * homa_resend_pkt() - Handler for incoming RESEND packets
+ * @skb: Incoming packet; size already verified large enough for header.
+ * This function now owns the packet.
+ * @rpc: Information about the RPC corresponding to this packet; must
+ * be locked by caller, but may be NULL if there is no RPC matching
+ * this packet
+ * @hsk: Socket on which the packet was received.
+ */
+void homa_resend_pkt(struct sk_buff *skb, struct homa_rpc *rpc,
+ struct homa_sock *hsk)
+{
+ struct homa_resend_hdr *h = (struct homa_resend_hdr *)skb->data;
+ struct homa_busy_hdr busy;
+
+ if (!rpc) {
+ homa_xmit_unknown(skb, hsk);
+ goto done;
+ }
+
+ if (!homa_is_client(rpc->id) && rpc->state != RPC_OUTGOING) {
+ /* We are the server for this RPC and don't yet have a
+ * response packet, so just send BUSY.
+ */
+ homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
+ goto done;
+ }
+ if (ntohl(h->length) == 0)
+ /* This RESEND is from a server just trying to determine
+ * whether the client still cares about the RPC; return
+ * BUSY so the server doesn't time us out.
+ */
+ homa_xmit_control(BUSY, &busy, sizeof(busy), rpc);
+ homa_resend_data(rpc, ntohl(h->offset),
+ ntohl(h->offset) + ntohl(h->length));
+
+done:
+ kfree_skb(skb);
+}
+
+/**
+ * homa_unknown_pkt() - Handler for incoming UNKNOWN packets.
+ * @skb: Incoming packet; size known to be large enough for the header.
+ * This function now owns the packet.
+ * @rpc: Information about the RPC corresponding to this packet.
+ */
+void homa_unknown_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
+{
+ if (homa_is_client(rpc->id)) {
+ if (rpc->state == RPC_OUTGOING) {
+ /* It appears that everything we've already transmitted
+ * has been lost; retransmit it.
+ */
+ homa_resend_data(rpc, 0, rpc->msgout.next_xmit_offset);
+ goto done;
+ }
+
+ } else {
+ homa_rpc_free(rpc);
+ }
+done:
+ kfree_skb(skb);
+}
+
+/**
+ * homa_need_ack_pkt() - Handler for incoming NEED_ACK packets
+ * @skb: Incoming packet; size already verified large enough for header.
+ * This function now owns the packet.
+ * @hsk: Socket on which the packet was received.
+ * @rpc: The RPC named in the packet header, or NULL if no such
+ * RPC exists. The RPC has been locked by the caller.
+ */
+void homa_need_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+ struct homa_rpc *rpc)
+{
+ struct homa_common_hdr *h = (struct homa_common_hdr *)skb->data;
+ const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+ __u64 id = homa_local_id(h->sender_id);
+ struct homa_peer *peer;
+ struct homa_ack_hdr ack;
+
+ /* Return if it's not safe for the peer to purge its state
+ * for this RPC (the RPC still exists and we haven't received
+ * the entire response), or if we can't find peer info.
+ */
+ if (rpc && (rpc->state != RPC_INCOMING ||
+ rpc->msgin.bytes_remaining)) {
+ goto done;
+ } else {
+ peer = homa_peer_find(hsk->homa->peers, &saddr, &hsk->inet);
+ if (IS_ERR(peer))
+ goto done;
+ }
+
+ /* Send an ACK for this RPC. At the same time, include all of the
+ * other acks available for the peer. Note: can't use rpc below,
+ * since it may be NULL.
+ */
+ ack.common.type = ACK;
+ ack.common.sport = h->dport;
+ ack.common.dport = h->sport;
+ ack.common.sender_id = cpu_to_be64(id);
+ ack.num_acks = htons(homa_peer_get_acks(peer,
+ HOMA_MAX_ACKS_PER_PKT,
+ ack.acks));
+ __homa_xmit_control(&ack, sizeof(ack), peer, hsk);
+
+done:
+ kfree_skb(skb);
+}
+
+/**
+ * homa_ack_pkt() - Handler for incoming ACK packets
+ * @skb: Incoming packet; size already verified large enough for header.
+ * This function now owns the packet.
+ * @hsk: Socket on which the packet was received.
+ * @rpc: The RPC named in the packet header, or NULL if no such
+ * RPC exists. The RPC has been locked by the caller but will
+ * be unlocked here.
+ */
+void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
+ struct homa_rpc *rpc)
+ __releases(rpc->bucket_lock)
+{
+ const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+ struct homa_ack_hdr *h = (struct homa_ack_hdr *)skb->data;
+ int i, count;
+
+ if (rpc) {
+ homa_rpc_free(rpc);
+ homa_rpc_unlock(rpc);
+ }
+
+ count = ntohs(h->num_acks);
+ for (i = 0; i < count; i++)
+ homa_rpc_acked(hsk, &saddr, &h->acks[i]);
+ kfree_skb(skb);
+}
+
+/**
+ * homa_rpc_abort() - Terminate an RPC.
+ * @rpc: RPC to be terminated. Must be locked by caller.
+ * @error: A negative errno value indicating the error that caused the abort.
+ * If this is a client RPC, the error will be returned to the
+ * application; if it's a server RPC, the error is ignored and
+ * we just free the RPC.
+ */
+void homa_rpc_abort(struct homa_rpc *rpc, int error)
+{
+ if (!homa_is_client(rpc->id)) {
+ homa_rpc_free(rpc);
+ return;
+ }
+ rpc->error = error;
+ homa_sock_lock(rpc->hsk, "homa_rpc_abort");
+ if (!rpc->hsk->shutdown)
+ homa_rpc_handoff(rpc);
+ homa_sock_unlock(rpc->hsk);
+}
+
+/**
+ * homa_abort_rpcs() - Abort all RPCs to/from a particular peer.
+ * @homa: Overall data about the Homa protocol implementation.
+ * @addr: Address (network order) of the destination whose RPCs are
+ * to be aborted.
+ * @port: If nonzero, then RPCs will only be aborted if they were
+ * targeted at this server port.
+ * @error: Negative errno value indicating the reason for the abort.
+ */
+void homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
+ int port, int error)
+{
+ struct homa_socktab_scan scan;
+ struct homa_rpc *rpc, *tmp;
+ struct homa_sock *hsk;
+
+ rcu_read_lock();
+ for (hsk = homa_socktab_start_scan(homa->port_map, &scan); hsk;
+ hsk = homa_socktab_next(&scan)) {
+ /* Skip the (expensive) lock acquisition if there's no
+ * work to do.
+ */
+ if (list_empty(&hsk->active_rpcs))
+ continue;
+ if (!homa_protect_rpcs(hsk))
+ continue;
+ list_for_each_entry_safe(rpc, tmp, &hsk->active_rpcs,
+ active_links) {
+ if (!ipv6_addr_equal(&rpc->peer->addr, addr))
+ continue;
+ if (port && rpc->dport != port)
+ continue;
+ homa_rpc_lock(rpc, "rpc_abort_rpcs");
+ homa_rpc_abort(rpc, error);
+ homa_rpc_unlock(rpc);
+ }
+ homa_unprotect_rpcs(hsk);
+ }
+ homa_socktab_end_scan(&scan);
+ rcu_read_unlock();
+}
+
+/**
+ * homa_abort_sock_rpcs() - Abort all outgoing (client-side) RPCs on a given
+ * socket.
+ * @hsk: Socket whose RPCs should be aborted.
+ * @error: Zero means that the aborted RPCs should be freed immediately.
+ * A nonzero value means that the RPCs should be marked
+ * complete, so that they can be returned to the application;
+ * this value (a negative errno) will be returned from
+ * recvmsg.
+ */
+void homa_abort_sock_rpcs(struct homa_sock *hsk, int error)
+{
+ struct homa_rpc *rpc, *tmp;
+
+ rcu_read_lock();
+ if (list_empty(&hsk->active_rpcs))
+ goto done;
+ if (!homa_protect_rpcs(hsk))
+ goto done;
+ list_for_each_entry_safe(rpc, tmp, &hsk->active_rpcs, active_links) {
+ if (!homa_is_client(rpc->id))
+ continue;
+ homa_rpc_lock(rpc, "homa_abort_sock_rpcs");
+ if (rpc->state == RPC_DEAD) {
+ homa_rpc_unlock(rpc);
+ continue;
+ }
+ if (error)
+ homa_rpc_abort(rpc, error);
+ else
+ homa_rpc_free(rpc);
+ homa_rpc_unlock(rpc);
+ }
+ homa_unprotect_rpcs(hsk);
+done:
+ rcu_read_unlock();
+}
+
+/**
+ * homa_register_interests() - Records information in various places so
+ * that a thread will be woken up if an RPC that it cares about becomes
+ * available.
+ * @interest: Used to record information about the messages this thread is
+ * waiting on. The initial contents of the structure are
+ * assumed to be undefined.
+ * @hsk: Socket on which relevant messages will arrive. Must not be
+ * locked.
+ * @flags: Flags field from homa_recvmsg_args; see manual entry for
+ * details.
+ * @id: If non-zero, then the caller is interested in receiving
+ * the response for this RPC (@id must be a client request).
+ * Return: Either zero or a negative errno value. If a matching RPC
+ * is already available, information about it will be stored in
+ * interest.
+ */
+int homa_register_interests(struct homa_interest *interest,
+ struct homa_sock *hsk, int flags, __u64 id)
+{
+ struct homa_rpc *rpc = NULL;
+ int locked = 1;
+
+ homa_interest_init(interest);
+ if (id != 0) {
+ if (!homa_is_client(id))
+ return -EINVAL;
+ rpc = homa_find_client_rpc(hsk, id); /* Locks rpc. */
+ if (!rpc)
+ return -EINVAL;
+ if (rpc->interest && rpc->interest != interest) {
+ homa_rpc_unlock(rpc);
+ return -EINVAL;
+ }
+ }
+
+ /* Need both the RPC lock (acquired above) and the socket lock to
+ * avoid races.
+ */
+ homa_sock_lock(hsk, "homa_register_interests");
+ if (hsk->shutdown) {
+ homa_sock_unlock(hsk);
+ if (rpc)
+ homa_rpc_unlock(rpc);
+ return -ESHUTDOWN;
+ }
+
+ if (id != 0) {
+ if ((atomic_read(&rpc->flags) & RPC_PKTS_READY) || rpc->error)
+ goto claim_rpc;
+ rpc->interest = interest;
+ interest->reg_rpc = rpc;
+ homa_rpc_unlock(rpc);
+ }
+
+ locked = 0;
+ if (flags & HOMA_RECVMSG_RESPONSE) {
+ if (!list_empty(&hsk->ready_responses)) {
+ rpc = list_first_entry(&hsk->ready_responses,
+ struct homa_rpc,
+ ready_links);
+ goto claim_rpc;
+ }
+ /* Insert this thread at the *front* of the list;
+ * we'll get better cache locality if we reuse
+ * the same thread over and over, rather than
+ * round-robining between threads. Same below.
+ */
+ list_add(&interest->response_links,
+ &hsk->response_interests);
+ }
+ if (flags & HOMA_RECVMSG_REQUEST) {
+ if (!list_empty(&hsk->ready_requests)) {
+ rpc = list_first_entry(&hsk->ready_requests,
+ struct homa_rpc, ready_links);
+ /* Make sure the interest isn't on the response list;
+ * otherwise it might receive a second RPC.
+ */
+ if (!list_empty(&interest->response_links))
+ list_del_init(&interest->response_links);
+ goto claim_rpc;
+ }
+ list_add(&interest->request_links, &hsk->request_interests);
+ }
+ homa_sock_unlock(hsk);
+ return 0;
+
+claim_rpc:
+ list_del_init(&rpc->ready_links);
+ if (!list_empty(&hsk->ready_requests) ||
+ !list_empty(&hsk->ready_responses)) {
+ hsk->sock.sk_data_ready(&hsk->sock);
+ }
+
+ /* This flag is needed to keep the RPC from being reaped during the
+	 * gap between when we release the socket lock and when we acquire
+	 * the RPC lock.
+ */
+ atomic_or(RPC_HANDING_OFF, &rpc->flags);
+ homa_sock_unlock(hsk);
+ if (!locked) {
+ atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+ homa_rpc_lock(rpc, "homa_register_interests");
+ atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
+ locked = 1;
+ }
+ atomic_andnot(RPC_HANDING_OFF, &rpc->flags);
+ homa_interest_set_rpc(interest, rpc, locked);
+ return 0;
+}
+
+/**
+ * homa_wait_for_message() - Wait for receipt of an incoming message
+ * that matches the parameters. Various other activities can occur while
+ * waiting, such as reaping dead RPCs and copying data to user space.
+ * @hsk: Socket where messages will arrive.
+ * @flags: Flags field from homa_recvmsg_args; see manual entry for
+ * details.
+ * @id: If non-zero, then a response message matching this id may
+ * be returned (@id must refer to a client request).
+ *
+ * Return: Pointer to an RPC that matches @flags and @id, or a negative
+ * errno value. The RPC will be locked; the caller must unlock.
+ */
+struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
+ __u64 id)
+ __acquires(&rpc->bucket_lock)
+{
+ struct homa_rpc *result = NULL;
+ struct homa_interest interest;
+ struct homa_rpc *rpc = NULL;
+ int error;
+
+ /* Each iteration of this loop finds an RPC, but it might not be
+ * in a state where we can return it (e.g., there might be packets
+ * ready to transfer to user space, but the incoming message isn't yet
+ * complete). Thus it could take many iterations of this loop
+ * before we have an RPC with a complete message.
+ */
+ while (1) {
+ error = homa_register_interests(&interest, hsk, flags, id);
+ rpc = homa_interest_get_rpc(&interest);
+ if (rpc)
+ goto found_rpc;
+ if (error < 0) {
+ result = ERR_PTR(error);
+ goto found_rpc;
+ }
+
+ /* There is no ready RPC so far. Clean up dead RPCs before
+ * going to sleep (or returning, if in nonblocking mode).
+ */
+ while (1) {
+ int reaper_result;
+
+ rpc = homa_interest_get_rpc(&interest);
+ if (rpc)
+ goto found_rpc;
+ reaper_result = homa_rpc_reap(hsk, false);
+ if (reaper_result == 0)
+ break;
+
+ /* Give NAPI and SoftIRQ tasks a chance to run. */
+ schedule();
+ }
+ if (flags & HOMA_RECVMSG_NONBLOCKING) {
+ result = ERR_PTR(-EAGAIN);
+ goto found_rpc;
+ }
+
+ /* Now it's time to sleep. */
+ set_current_state(TASK_INTERRUPTIBLE);
+ rpc = homa_interest_get_rpc(&interest);
+ if (!rpc && !hsk->shutdown)
+ schedule();
+ __set_current_state(TASK_RUNNING);
+
+found_rpc:
+ /* If we get here, it means either an RPC is ready for our
+ * attention or an error occurred.
+ *
+ * First, clean up all of the interests. Must do this before
+ * making any other decisions, because until we do, an incoming
+ * message could still be passed to us. Note: if we went to
+ * sleep, then this info was already cleaned up by whoever
+ * woke us up. Also, values in the interest may change between
+ * when we test them below and when we acquire the socket lock,
+ * so they have to be checked again after locking the socket.
+ */
+ if (interest.reg_rpc ||
+ !list_empty(&interest.request_links) ||
+ !list_empty(&interest.response_links)) {
+ homa_sock_lock(hsk, "homa_wait_for_message");
+ if (interest.reg_rpc)
+ interest.reg_rpc->interest = NULL;
+ if (!list_empty(&interest.request_links))
+ list_del_init(&interest.request_links);
+ if (!list_empty(&interest.response_links))
+ list_del_init(&interest.response_links);
+ homa_sock_unlock(hsk);
+ }
+
+ /* Now check to see if we received an RPC handoff (note that
+ * this could have happened anytime up until we reset the
+ * interests above).
+ */
+ rpc = homa_interest_get_rpc(&interest);
+ if (rpc) {
+ if (!interest.locked) {
+ atomic_or(APP_NEEDS_LOCK, &rpc->flags);
+ homa_rpc_lock(rpc, "homa_wait_for_message");
+ atomic_andnot(APP_NEEDS_LOCK | RPC_HANDING_OFF,
+ &rpc->flags);
+ } else {
+ atomic_andnot(RPC_HANDING_OFF, &rpc->flags);
+ }
+ if (!rpc->error)
+ rpc->error = homa_copy_to_user(rpc);
+ if (rpc->state == RPC_DEAD) {
+ homa_rpc_unlock(rpc);
+ continue;
+ }
+ if (rpc->error)
+ goto done;
+ atomic_andnot(RPC_PKTS_READY, &rpc->flags);
+ if (rpc->msgin.bytes_remaining == 0 &&
+ !skb_queue_len(&rpc->msgin.packets))
+ goto done;
+ homa_rpc_unlock(rpc);
+ }
+
+ /* A complete message isn't available: check for errors. */
+ if (IS_ERR(result))
+ return result;
+ if (signal_pending(current))
+ return ERR_PTR(-EINTR);
+
+ /* No message and no error; try again. */
+ }
+
+done:
+ return rpc;
+}
+
+/**
+ * homa_choose_interest() - Given a list of interests for an incoming
+ * message, choose the best one to handle it (if any).
+ * @homa: Overall information about the Homa transport.
+ * @head: Head pointers for the list of interest: either
+ * hsk->request_interests or hsk->response_interests.
+ * @offset: Offset of "next" pointers in the list elements (either
+ *		offsetof(request_links) or offsetof(response_links)).
+ * Return: An interest to use for the incoming message, or NULL if none
+ * is available. If possible, this function tries to pick an
+ * interest whose thread is running on a core that isn't
+ * currently busy doing Homa transport work.
+ */
+struct homa_interest *homa_choose_interest(struct homa *homa,
+ struct list_head *head, int offset)
+{
+ struct homa_interest *backup = NULL;
+ struct homa_interest *interest;
+ struct list_head *pos;
+
+ list_for_each(pos, head) {
+ interest = (struct homa_interest *)(((char *)pos) - offset);
+ if (!backup)
+ backup = interest;
+ }
+
+ /* All interested threads are on busy cores; return the first. */
+ return backup;
+}
+
+/**
+ * homa_rpc_handoff() - This function is called when the input message for
+ * an RPC is ready for attention from a user thread. It either notifies
+ * a waiting reader or queues the RPC.
+ * @rpc: RPC to handoff; must be locked. The caller must
+ * also have locked the socket for this RPC.
+ */
+void homa_rpc_handoff(struct homa_rpc *rpc)
+{
+ struct homa_sock *hsk = rpc->hsk;
+ struct homa_interest *interest;
+
+ if ((atomic_read(&rpc->flags) & RPC_HANDING_OFF) ||
+ !list_empty(&rpc->ready_links))
+ return;
+
+ /* First, see if someone is interested in this RPC specifically.
+ */
+ if (rpc->interest) {
+ interest = rpc->interest;
+ goto thread_waiting;
+ }
+
+ /* Second, check the interest list for this type of RPC. */
+ if (homa_is_client(rpc->id)) {
+ interest = homa_choose_interest(hsk->homa,
+ &hsk->response_interests,
+ offsetof(struct homa_interest,
+ response_links));
+ if (interest)
+ goto thread_waiting;
+ list_add_tail(&rpc->ready_links, &hsk->ready_responses);
+ } else {
+ interest = homa_choose_interest(hsk->homa,
+ &hsk->request_interests,
+ offsetof(struct homa_interest,
+ request_links));
+ if (interest)
+ goto thread_waiting;
+ list_add_tail(&rpc->ready_links, &hsk->ready_requests);
+ }
+
+ /* If we get here, no-one is waiting for the RPC, so it has been
+ * queued.
+ */
+
+ /* Notify the poll mechanism. */
+ hsk->sock.sk_data_ready(&hsk->sock);
+ return;
+
+thread_waiting:
+ /* We found a waiting thread. The following 3 lines must be here,
+ * before clearing the interest, in order to avoid a race with
+ * homa_wait_for_message (which won't acquire the socket lock if
+ * the interest is clear).
+ */
+ atomic_or(RPC_HANDING_OFF, &rpc->flags);
+ homa_interest_set_rpc(interest, rpc, 0);
+
+ /* Clear the interest. This serves two purposes. First, it saves
+	 * the waking thread from acquiring the socket lock again (which
+	 * reduces contention on that lock). Second, it ensures that
+ * no-one else attempts to give this interest a different RPC.
+ */
+ if (interest->reg_rpc) {
+ interest->reg_rpc->interest = NULL;
+ interest->reg_rpc = NULL;
+ }
+ if (!list_empty(&interest->request_links))
+ list_del_init(&interest->request_links);
+ if (!list_empty(&interest->response_links))
+ list_del_init(&interest->response_links);
+ wake_up_process(interest->thread);
+}
+
+/**
+ * homa_incoming_sysctl_changed() - Invoked whenever a sysctl value is changed;
+ * updates any input-related parameters that depend on sysctl-settable values.
+ * @homa: Overall data about the Homa protocol implementation.
+ */
+void homa_incoming_sysctl_changed(struct homa *homa)
+{
+}
--
2.34.1
* [PATCH net-next v6 09/12] net: homa: create homa_outgoing.c
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (7 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 08/12] net: homa: create homa_incoming.c John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 10/12] net: homa: create homa_timer.c John Ousterhout
` (3 subsequent siblings)
12 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This file does most of the work of transmitting outgoing messages.
It is responsible for copying data from user space into skbs and
it also implements the "pacer", which throttles output if necessary
to prevent queue buildup in the NIC. Note: the pacer eventually
needs to be replaced with a Homa-specific qdisc, which can better
manage simultaneous transmissions by Homa and TCP. The current
implementation can coexist with TCP and doesn't harm TCP, but
Homa's latency suffers when TCP runs concurrently.
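To make the pacer's throttling decision concrete, here is a stand-alone
sketch (illustration only, not part of the patch) of the NIC-queue
estimate that homa_check_nic_queue and homa_outgoing_sysctl_changed
maintain. The arithmetic mirrors the patch; the single-threaded update
replaces the cmpxchg loop used in the real code, and the values in
main() are made up for the example:

/* Sketch of Homa's NIC queue estimate (user-space illustration). */
#include <stdint.h>
#include <stdio.h>

static uint64_t link_idle_time;  /* time (ns) at which NIC queue drains */
static uint64_t ns_per_mbyte;    /* time (ns) to transmit one Mbyte */

/* Mirrors homa_outgoing_sysctl_changed: 8e9/link_mbps ns per Mbyte,
 * padded by 1% to underestimate the link bandwidth.
 */
static void set_link_speed(uint64_t link_mbps)
{
        ns_per_mbyte = 8ULL * 1000 * 1000 * 1000 * 101 / 100 / link_mbps;
}

/* Mirrors homa_check_nic_queue: returns 1 if the packet may be handed
 * to the NIC now (and charges it to the queue estimate), or 0 if the
 * caller should defer it to the pacer. now is the current time in ns.
 */
static int check_nic_queue(uint64_t now, int wire_bytes,
                           uint64_t max_queue_ns)
{
        uint64_t ns_for_packet = ns_per_mbyte * wire_bytes / 1000000;

        if (now + max_queue_ns < link_idle_time)
                return 0;                     /* NIC queue is too long */
        if (link_idle_time < now)
                link_idle_time = now + ns_for_packet;  /* queue was empty */
        else
                link_idle_time += ns_for_packet;
        return 1;
}

int main(void)
{
        int i;

        set_link_speed(25000);                /* 25 Gbps link */
        for (i = 0; i < 5; i++)
                printf("packet %d: %s (queue drains at %llu ns)\n", i,
                       check_nic_queue(0, 9000, 5000) ? "sent" : "deferred",
                       (unsigned long long)link_idle_time);
        return 0;
}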
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_outgoing.c | 855 +++++++++++++++++++++++++++++++++++++++
1 file changed, 855 insertions(+)
create mode 100644 net/homa/homa_outgoing.c
diff --git a/net/homa/homa_outgoing.c b/net/homa/homa_outgoing.c
new file mode 100644
index 000000000000..cb671063709b
--- /dev/null
+++ b/net/homa/homa_outgoing.c
@@ -0,0 +1,855 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains functions related to the sender side of message
+ * transmission. It also contains utility functions for sending packets.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_stub.h"
+#include "homa_wire.h"
+
+/**
+ * homa_message_out_init() - Initialize rpc->msgout.
+ * @rpc: RPC whose output message should be initialized.
+ * @length: Number of bytes that will eventually be in rpc->msgout.
+ */
+void homa_message_out_init(struct homa_rpc *rpc, int length)
+{
+ rpc->msgout.length = length;
+ rpc->msgout.num_skbs = 0;
+ rpc->msgout.copied_from_user = 0;
+ rpc->msgout.packets = NULL;
+ rpc->msgout.next_xmit = &rpc->msgout.packets;
+ rpc->msgout.next_xmit_offset = 0;
+ atomic_set(&rpc->msgout.active_xmits, 0);
+ rpc->msgout.init_ns = sched_clock();
+}
+
+/**
+ * homa_fill_data_interleaved() - This function is invoked to fill in the
+ * part of a data packet after the initial header, when GSO is being used.
+ * As a result, homa_seg_hdrs must be interleaved with the data to provide
+ * the correct offset for each segment.
+ * @rpc: RPC whose output message is being created.
+ * @skb: The packet being filled. The initial homa_data_hdr was
+ * created and initialized by the caller and the
+ * homa_skb_info has been filled in with the packet geometry.
+ * @iter: Describes location(s) of (remaining) message data in user
+ * space.
+ * Return: Either a negative errno or 0 (for success).
+ */
+int homa_fill_data_interleaved(struct homa_rpc *rpc, struct sk_buff *skb,
+ struct iov_iter *iter)
+{
+ struct homa_skb_info *homa_info = homa_get_skb_info(skb);
+ int seg_length = homa_info->seg_length;
+ int bytes_left = homa_info->data_bytes;
+ int offset = homa_info->offset;
+ int err;
+
+ /* Each iteration of the following loop adds info for one packet,
+ * which includes a homa_seg_hdr followed by the data for that
+ * segment. The first homa_seg_hdr was already added by the caller.
+ */
+ while (1) {
+ struct homa_seg_hdr seg;
+
+ if (bytes_left < seg_length)
+ seg_length = bytes_left;
+ err = homa_skb_append_from_iter(rpc->hsk->homa, skb, iter,
+ seg_length);
+ if (err != 0)
+ return err;
+ bytes_left -= seg_length;
+ offset += seg_length;
+
+ if (bytes_left == 0)
+ break;
+
+ seg.offset = htonl(offset);
+ err = homa_skb_append_to_frag(rpc->hsk->homa, skb, &seg,
+ sizeof(seg));
+ if (err != 0)
+ return err;
+ }
+ return 0;
+}
+
+/**
+ * homa_new_data_packet() - Allocate a new sk_buff and fill it with a Homa
+ * data packet. The resulting packet will be a GSO packet that will eventually
+ * be segmented by the NIC.
+ * @rpc: RPC that packet will belong to (msgout must have been
+ * initialized).
+ * @iter: Describes location(s) of (remaining) message data in user
+ * space.
+ * @offset: Offset in the message of the first byte of data in this
+ * packet.
+ * @length: How many bytes of data to include in the skb. Caller must
+ * ensure that this amount of data isn't too much for a
+ * well-formed GSO packet, and that iter has at least this
+ * much data.
+ * @max_seg_data: Maximum number of bytes of message data that can go in
+ * a single segment of the GSO packet.
+ * Return: A pointer to the new packet, or a negative errno.
+ */
+struct sk_buff *homa_new_data_packet(struct homa_rpc *rpc,
+ struct iov_iter *iter, int offset,
+ int length, int max_seg_data)
+{
+ struct homa_skb_info *homa_info;
+ struct homa_data_hdr *h;
+ struct sk_buff *skb;
+ int err, gso_size;
+ __u64 segs;
+
+ segs = length + max_seg_data - 1;
+ do_div(segs, max_seg_data);
+
+ /* Initialize the overall skb. */
+ skb = homa_skb_new_tx(sizeof32(struct homa_data_hdr) + length +
+ segs * sizeof32(struct homa_seg_hdr));
+ if (!skb)
+ return ERR_PTR(-ENOMEM);
+
+ /* Fill in the Homa header (which will be replicated in every
+ * network packet by GSO).
+ */
+ h = (struct homa_data_hdr *)skb_put(skb, sizeof(struct homa_data_hdr));
+ h->common.sport = htons(rpc->hsk->port);
+ h->common.dport = htons(rpc->dport);
+ h->common.sequence = htonl(offset);
+ h->common.type = DATA;
+ homa_set_doff(h, sizeof(struct homa_data_hdr));
+ h->common.checksum = 0;
+ h->common.sender_id = cpu_to_be64(rpc->id);
+ h->message_length = htonl(rpc->msgout.length);
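+	/* Opportunistically piggyback an ack for one earlier RPC to
+	 * this peer, if any acks are pending.
+	 */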
+ h->ack.client_id = 0;
+ homa_peer_get_acks(rpc->peer, 1, &h->ack);
+ h->retransmit = 0;
+ h->seg.offset = htonl(offset);
+
+ homa_info = homa_get_skb_info(skb);
+ homa_info->next_skb = NULL;
+ homa_info->wire_bytes = length + segs * (sizeof(struct homa_data_hdr)
+ + rpc->hsk->ip_header_length + HOMA_ETH_OVERHEAD);
+ homa_info->data_bytes = length;
+ homa_info->seg_length = max_seg_data;
+ homa_info->offset = offset;
+
+ if (segs > 1) {
+ homa_set_doff(h, sizeof(struct homa_data_hdr) -
+ sizeof32(struct homa_seg_hdr));
+ gso_size = max_seg_data + sizeof(struct homa_seg_hdr);
+ err = homa_fill_data_interleaved(rpc, skb, iter);
+ } else {
+ gso_size = max_seg_data;
+ err = homa_skb_append_from_iter(rpc->hsk->homa, skb, iter,
+ length);
+ }
+ if (err)
+ goto error;
+
+ if (segs > 1) {
+ skb_shinfo(skb)->gso_segs = segs;
+ skb_shinfo(skb)->gso_size = gso_size;
+
+ /* It's unclear what gso_type should be used to force software
+ * GSO; the value below seems to work...
+ */
+ skb_shinfo(skb)->gso_type =
+ rpc->hsk->homa->gso_force_software ? 0xd : SKB_GSO_TCPV6;
+ }
+ return skb;
+
+error:
+ homa_skb_free_tx(rpc->hsk->homa, skb);
+ return ERR_PTR(err);
+}
+
+/**
+ * homa_message_out_fill() - Initializes information for sending a message
+ * for an RPC (either request or response); copies the message data from
+ * user space and (possibly) begins transmitting the message.
+ * @rpc: RPC for which to send message; this function must not
+ * previously have been called for the RPC. Must be locked. The RPC
+ * will be unlocked while copying data, but will be locked again
+ * before returning.
+ * @iter: Describes location(s) of message data in user space.
+ * @xmit: Nonzero means this method should start transmitting packets;
+ * transmission will be overlapped with copying from user space.
+ * Zero means the caller will initiate transmission after this
+ * function returns.
+ *
+ * Return: 0 for success, or a negative errno for failure. It is possible
+ * for the RPC to be freed while this function is active. If that
+ * happens, copying will cease, -EINVAL will be returned, and
+ * rpc->state will be RPC_DEAD.
+ */
+int homa_message_out_fill(struct homa_rpc *rpc, struct iov_iter *iter, int xmit)
+ __releases(rpc->bucket_lock)
+ __acquires(rpc->bucket_lock)
+{
+ /* Geometry information for packets:
+ * mtu: largest size for an on-the-wire packet (including
+ * all headers through IP header, but not Ethernet
+ * header).
+ * max_seg_data: largest amount of Homa message data that fits
+ * in an on-the-wire packet (after segmentation).
+ * max_gso_data: largest amount of Homa message data that fits
+ * in a GSO packet (before segmentation).
+ */
+ int mtu, max_seg_data, max_gso_data;
+
+ struct sk_buff **last_link;
+ struct dst_entry *dst;
+ __u64 segs_per_gso;
+ int overlap_xmit;
+
+ /* Bytes of the message that haven't yet been copied into skbs. */
+ int bytes_left;
+
+ int gso_size;
+ int err;
+
+ homa_message_out_init(rpc, iter->count);
+ if (unlikely(rpc->msgout.length > HOMA_MAX_MESSAGE_LENGTH ||
+ rpc->msgout.length == 0)) {
+ err = -EINVAL;
+ goto error;
+ }
+
+ /* Compute the geometry of packets. */
+ dst = homa_get_dst(rpc->peer, rpc->hsk);
+ mtu = dst_mtu(dst);
+ max_seg_data = mtu - rpc->hsk->ip_header_length
+ - sizeof(struct homa_data_hdr);
+ gso_size = dst->dev->gso_max_size;
+ if (gso_size > rpc->hsk->homa->max_gso_size)
+ gso_size = rpc->hsk->homa->max_gso_size;
+
+	/* Round the GSO data size down to a whole number of full segments. */
+ segs_per_gso = gso_size - rpc->hsk->ip_header_length
+ - sizeof(struct homa_data_hdr);
+ do_div(segs_per_gso, max_seg_data);
+ if (segs_per_gso == 0)
+ segs_per_gso = 1;
+ max_gso_data = segs_per_gso * max_seg_data;
+
+ overlap_xmit = rpc->msgout.length > 2 * max_gso_data;
+ atomic_or(RPC_COPYING_FROM_USER, &rpc->flags);
+ homa_skb_stash_pages(rpc->hsk->homa, rpc->msgout.length);
+
+ /* Each iteration of the loop below creates one GSO packet. */
+ last_link = &rpc->msgout.packets;
+ for (bytes_left = rpc->msgout.length; bytes_left > 0; ) {
+ int skb_data_bytes, offset;
+ struct sk_buff *skb;
+
+ homa_rpc_unlock(rpc);
+ skb_data_bytes = max_gso_data;
+ offset = rpc->msgout.length - bytes_left;
+ if (skb_data_bytes > bytes_left)
+ skb_data_bytes = bytes_left;
+ skb = homa_new_data_packet(rpc, iter, offset, skb_data_bytes,
+ max_seg_data);
+ if (unlikely(!skb)) {
+ err = PTR_ERR(skb);
+ homa_rpc_lock(rpc, "homa_message_out_fill");
+ goto error;
+ }
+ bytes_left -= skb_data_bytes;
+
+ homa_rpc_lock(rpc, "homa_message_out_fill2");
+ if (rpc->state == RPC_DEAD) {
+ /* RPC was freed while we were copying. */
+ err = -EINVAL;
+ homa_skb_free_tx(rpc->hsk->homa, skb);
+ goto error;
+ }
+ *last_link = skb;
+ last_link = &(homa_get_skb_info(skb)->next_skb);
+ *last_link = NULL;
+ rpc->msgout.num_skbs++;
+ rpc->msgout.copied_from_user = rpc->msgout.length - bytes_left;
+ if (overlap_xmit && list_empty(&rpc->throttled_links) && xmit)
+ homa_add_to_throttled(rpc);
+ }
+ atomic_andnot(RPC_COPYING_FROM_USER, &rpc->flags);
+ if (!overlap_xmit && xmit)
+ homa_xmit_data(rpc, false);
+ return 0;
+
+error:
+ atomic_andnot(RPC_COPYING_FROM_USER, &rpc->flags);
+ return err;
+}
+
+/**
+ * homa_xmit_control() - Send a control packet to the other end of an RPC.
+ * @type: Packet type, such as DATA.
+ * @contents: Address of buffer containing the contents of the packet.
+ * Only information after the common header must be valid;
+ * the common header will be filled in by this function.
+ * @length: Length of @contents (including the common header).
+ * @rpc: The packet will go to the socket that handles the other end
+ * of this RPC. Addressing info for the packet, including all of
+ * the fields of homa_common_hdr except type, will be set from this.
+ *
+ * Return: Either zero (for success), or a negative errno value if there
+ * was a problem.
+ */
+int homa_xmit_control(enum homa_packet_type type, void *contents,
+ size_t length, struct homa_rpc *rpc)
+{
+ struct homa_common_hdr *h = contents;
+
+ h->type = type;
+ h->sport = htons(rpc->hsk->port);
+ h->dport = htons(rpc->dport);
+ h->sender_id = cpu_to_be64(rpc->id);
+ return __homa_xmit_control(contents, length, rpc->peer, rpc->hsk);
+}
+
+/**
+ * __homa_xmit_control() - Lower-level version of homa_xmit_control: sends
+ * a control packet.
+ * @contents: Address of buffer containing the contents of the packet.
+ * The caller must have filled in all of the information,
+ * including the common header.
+ * @length: Length of @contents.
+ * @peer: Destination to which the packet will be sent.
+ * @hsk: Socket via which the packet will be sent.
+ *
+ * Return: Either zero (for success), or a negative errno value if there
+ * was a problem.
+ */
+int __homa_xmit_control(void *contents, size_t length, struct homa_peer *peer,
+ struct homa_sock *hsk)
+{
+ struct homa_common_hdr *h;
+ struct dst_entry *dst;
+ struct sk_buff *skb;
+ int extra_bytes;
+ int result;
+
+ dst = homa_get_dst(peer, hsk);
+ skb = homa_skb_new_tx(HOMA_MAX_HEADER);
+ if (unlikely(!skb))
+ return -ENOBUFS;
+ dst_hold(dst);
+ skb_dst_set(skb, dst);
+
+ h = skb_put(skb, length);
+ memcpy(h, contents, length);
+ extra_bytes = HOMA_MIN_PKT_LENGTH - length;
+ if (extra_bytes > 0)
+ memset(skb_put(skb, extra_bytes), 0, extra_bytes);
+ skb->ooo_okay = 1;
+ skb_get(skb);
+ if (hsk->inet.sk.sk_family == AF_INET6)
+ result = ip6_xmit(&hsk->inet.sk, skb, &peer->flow.u.ip6, 0,
+ NULL, 0, 0);
+ else
+ result = ip_queue_xmit(&hsk->inet.sk, skb, &peer->flow);
+ if (unlikely(result != 0)) {
+ /* It appears that ip*_xmit frees skbuffs after
+ * errors; the following code is to raise an alert if
+ * this isn't actually the case. The extra skb_get above
+ * and kfree_skb call below are needed to do the check
+ * accurately (otherwise the buffer could be freed and
+ * its memory used for some other purpose, resulting in
+ * a bogus "reference count").
+ */
+ if (refcount_read(&skb->users) > 1) {
+ if (hsk->inet.sk.sk_family == AF_INET6)
+ pr_notice("ip6_xmit didn't free Homa control packet (type %d) after error %d\n",
+ h->type, result);
+ else
+ pr_notice("ip_queue_xmit didn't free Homa control packet (type %d) after error %d\n",
+ h->type, result);
+ }
+ }
+ kfree_skb(skb);
+ return result;
+}
+
+/**
+ * homa_xmit_unknown() - Send an UNKNOWN packet to a peer.
+ * @skb: Buffer containing an incoming packet; identifies the peer to
+ * which the UNKNOWN packet should be sent.
+ * @hsk: Socket that should be used to send the UNKNOWN packet.
+ */
+void homa_xmit_unknown(struct sk_buff *skb, struct homa_sock *hsk)
+{
+ struct homa_common_hdr *h = (struct homa_common_hdr *)skb->data;
+ struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
+ struct homa_unknown_hdr unknown;
+ struct homa_peer *peer;
+
+ unknown.common.sport = h->dport;
+ unknown.common.dport = h->sport;
+ unknown.common.type = UNKNOWN;
+ unknown.common.sender_id = cpu_to_be64(homa_local_id(h->sender_id));
+ peer = homa_peer_find(hsk->homa->peers, &saddr, &hsk->inet);
+ if (!IS_ERR(peer))
+ __homa_xmit_control(&unknown, sizeof(unknown), peer, hsk);
+}
+
+/**
+ * homa_xmit_data() - If an RPC has outbound data packets that are permitted
+ * to be transmitted according to the scheduling mechanism, arrange for
+ * them to be sent (some may be sent immediately; others may be sent
+ * later by the pacer thread).
+ * @rpc: RPC to check for transmittable packets. Must be locked by
+ * caller. Note: this function will release the RPC lock while
+ * passing packets through the RPC stack, then reacquire it
+ * before returning. It is possible that the RPC gets freed
+ * when the lock isn't held, in which case the state will
+ * be RPC_DEAD on return.
+ * @force: True means send at least one packet, even if the NIC queue
+ * is too long. False means that zero packets may be sent, if
+ * the NIC queue is sufficiently long.
+ */
+void homa_xmit_data(struct homa_rpc *rpc, bool force)
+ __releases(rpc->bucket_lock)
+ __acquires(rpc->bucket_lock)
+{
+ struct homa *homa = rpc->hsk->homa;
+
+ atomic_inc(&rpc->msgout.active_xmits);
+ while (*rpc->msgout.next_xmit) {
+ struct sk_buff *skb = *rpc->msgout.next_xmit;
+
+ if ((rpc->msgout.length - rpc->msgout.next_xmit_offset)
+ >= homa->throttle_min_bytes) {
+ if (!homa_check_nic_queue(homa, skb, force)) {
+ homa_add_to_throttled(rpc);
+ break;
+ }
+ }
+
+ rpc->msgout.next_xmit = &(homa_get_skb_info(skb)->next_skb);
+ rpc->msgout.next_xmit_offset +=
+ homa_get_skb_info(skb)->data_bytes;
+
+ homa_rpc_unlock(rpc);
+ skb_get(skb);
+ __homa_xmit_data(skb, rpc);
+ force = false;
+ homa_rpc_lock(rpc, "homa_xmit_data");
+ if (rpc->state == RPC_DEAD)
+ break;
+ }
+ atomic_dec(&rpc->msgout.active_xmits);
+}
+
+/**
+ * __homa_xmit_data() - Handles the packet transmission work that is common
+ * to homa_xmit_data and homa_resend_data.
+ * @skb: Packet to be sent. The packet will be freed after transmission
+ * (and also if errors prevented transmission).
+ * @rpc: Information about the RPC that the packet belongs to.
+ */
+void __homa_xmit_data(struct sk_buff *skb, struct homa_rpc *rpc)
+{
+ struct dst_entry *dst;
+
+ dst = homa_get_dst(rpc->peer, rpc->hsk);
+ dst_hold(dst);
+ skb_dst_set(skb, dst);
+
+ skb->ooo_okay = 1;
+ skb->ip_summed = CHECKSUM_PARTIAL;
+ skb->csum_start = skb_transport_header(skb) - skb->head;
+ skb->csum_offset = offsetof(struct homa_common_hdr, checksum);
+ if (rpc->hsk->inet.sk.sk_family == AF_INET6)
+ ip6_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow.u.ip6,
+ 0, NULL, 0, 0);
+ else
+ ip_queue_xmit(&rpc->hsk->inet.sk, skb, &rpc->peer->flow);
+}
+
+/**
+ * homa_resend_data() - This function is invoked as part of handling RESEND
+ * requests. It retransmits the packet(s) containing a given range of bytes
+ * from a message.
+ * @rpc: RPC for which data should be resent.
+ * @start: Offset within @rpc->msgout of the first byte to retransmit.
+ * @end: Offset within @rpc->msgout of the byte just after the last one
+ * to retransmit.
+ */
+void homa_resend_data(struct homa_rpc *rpc, int start, int end)
+{
+ struct homa_skb_info *homa_info;
+ struct sk_buff *skb;
+
+ if (end <= start)
+ return;
+
+ /* Each iteration of this loop checks one packet in the message
+ * to see if it contains segments that need to be retransmitted.
+ */
+ for (skb = rpc->msgout.packets; skb; skb = homa_info->next_skb) {
+ int seg_offset, offset, seg_length, data_left;
+ struct homa_data_hdr *h;
+
+ homa_info = homa_get_skb_info(skb);
+ offset = homa_info->offset;
+ if (offset >= end)
+ break;
+ if (start >= (offset + homa_info->data_bytes))
+ continue;
+
+ offset = homa_info->offset;
+ seg_offset = sizeof32(struct homa_data_hdr);
+ data_left = homa_info->data_bytes;
+ if (skb_shinfo(skb)->gso_segs <= 1) {
+ seg_length = data_left;
+ } else {
+ seg_length = homa_info->seg_length;
+ h = (struct homa_data_hdr *)skb_transport_header(skb);
+ }
+ for ( ; data_left > 0; data_left -= seg_length,
+ offset += seg_length,
+ seg_offset += skb_shinfo(skb)->gso_size) {
+ struct homa_skb_info *new_homa_info;
+ struct sk_buff *new_skb;
+ int err;
+
+ if (seg_length > data_left)
+ seg_length = data_left;
+
+ if (end <= offset)
+ goto resend_done;
+ if ((offset + seg_length) <= start)
+ continue;
+
+ /* This segment must be retransmitted. */
+ new_skb = homa_skb_new_tx(sizeof(struct homa_data_hdr) +
+ seg_length);
+ if (unlikely(!new_skb))
+ goto resend_done;
+ h = __skb_put_data(new_skb, skb_transport_header(skb),
+ sizeof32(struct homa_data_hdr));
+ h->common.sequence = htonl(offset);
+ h->seg.offset = htonl(offset);
+ h->retransmit = 1;
+ err = homa_skb_append_from_skb(rpc->hsk->homa, new_skb,
+ skb, seg_offset,
+ seg_length);
+ if (err != 0) {
+ pr_err("%s got error %d from homa_skb_append_from_skb\n",
+ __func__, err);
+ kfree_skb(new_skb);
+ goto resend_done;
+ }
+
+ new_homa_info = homa_get_skb_info(new_skb);
+ new_homa_info->wire_bytes = rpc->hsk->ip_header_length
+ + sizeof(struct homa_data_hdr)
+ + seg_length + HOMA_ETH_OVERHEAD;
+ new_homa_info->data_bytes = seg_length;
+ new_homa_info->seg_length = seg_length;
+ new_homa_info->offset = offset;
+ homa_check_nic_queue(rpc->hsk->homa, new_skb, true);
+ __homa_xmit_data(new_skb, rpc);
+ }
+ }
+
+resend_done:
+ return;
+}
+
+/**
+ * homa_outgoing_sysctl_changed() - Invoked whenever a sysctl value is changed;
+ * updates any output-related parameters that depend on sysctl-settable values.
+ * @homa: Overall data about the Homa protocol implementation.
+ */
+void homa_outgoing_sysctl_changed(struct homa *homa)
+{
+ __u64 tmp;
+
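+	/* One Mbyte is 8e6 bits; at link_mbps Mbit/sec it takes
+	 * 8e6/(link_mbps * 1e6) seconds to transmit, i.e.
+	 * 8e9/link_mbps ns.
+	 */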
+ tmp = 8 * 1000ULL * 1000ULL * 1000ULL;
+
+ /* Underestimate link bandwidth (overestimate time) by 1%. */
+ tmp = tmp * 101 / 100;
+ do_div(tmp, homa->link_mbps);
+ homa->ns_per_mbyte = tmp;
+}
+
+/**
+ * homa_check_nic_queue() - This function is invoked before passing a packet
+ * to the NIC for transmission. It serves two purposes. First, it maintains
+ * an estimate of the NIC queue length. Second, it indicates to the caller
+ * whether the NIC queue is so full that no new packets should be queued
+ * (Homa's SRPT depends on keeping the NIC queue short).
+ * @homa: Overall data about the Homa protocol implementation.
+ * @skb: Packet that is about to be transmitted.
+ * @force: True means this packet is going to be transmitted
+ * regardless of the queue length.
+ * Return: Nonzero is returned if either the NIC queue length is
+ * acceptably short or @force was specified. 0 means that the
+ * NIC queue is at capacity or beyond, so the caller should delay
+ * the transmission of @skb. If nonzero is returned, then the
+ * queue estimate is updated to reflect the transmission of @skb.
+ */
+int homa_check_nic_queue(struct homa *homa, struct sk_buff *skb, bool force)
+{
+ __u64 idle, new_idle, clock, ns_for_packet;
+ int bytes;
+
+ bytes = homa_get_skb_info(skb)->wire_bytes;
+ ns_for_packet = homa->ns_per_mbyte;
+ ns_for_packet *= bytes;
+ do_div(ns_for_packet, 1000000);
+ while (1) {
+ clock = sched_clock();
+ idle = atomic64_read(&homa->link_idle_time);
+ if ((clock + homa->max_nic_queue_ns) < idle && !force &&
+ !(homa->flags & HOMA_FLAG_DONT_THROTTLE))
+ return 0;
+ if (idle < clock)
+ new_idle = clock + ns_for_packet;
+ else
+ new_idle = idle + ns_for_packet;
+
+ /* This method must be thread-safe. */
+ if (atomic64_cmpxchg_relaxed(&homa->link_idle_time, idle,
+ new_idle) == idle)
+ break;
+ }
+ return 1;
+}
+
+/**
+ * homa_pacer_main() - Top-level function for the pacer thread.
+ * @transport: Pointer to struct homa.
+ *
+ * Return: Always 0.
+ */
+int homa_pacer_main(void *transport)
+{
+ struct homa *homa = (struct homa *)transport;
+
+ homa->pacer_wake_time = sched_clock();
+ while (1) {
+ if (homa->pacer_exit) {
+ homa->pacer_wake_time = 0;
+ break;
+ }
+ homa_pacer_xmit(homa);
+
+ /* Sleep this thread if the throttled list is empty. Even
+ * if the throttled list isn't empty, call the scheduler
+ * to give other processes a chance to run (if we don't,
+ * softirq handlers can get locked out, which prevents
+ * incoming packets from being handled).
+ */
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (list_first_or_null_rcu(&homa->throttled_rpcs,
+ struct homa_rpc,
+ throttled_links) != NULL)
+ __set_current_state(TASK_RUNNING);
+ homa->pacer_wake_time = 0;
+ schedule();
+ homa->pacer_wake_time = sched_clock();
+ __set_current_state(TASK_RUNNING);
+ }
+ kthread_complete_and_exit(&homa_pacer_kthread_done, 0);
+ return 0;
+}
+
+/**
+ * homa_pacer_xmit() - Transmit packets from the throttled list. Note:
+ * this function may be invoked from either process context or softirq (BH)
+ * level. This function is invoked from multiple places, not just in the
+ * pacer thread. The reason for this is that (as of 10/2019) Linux's scheduling
+ * of the pacer thread is unpredictable: the thread may block for long periods
+ * of time (e.g., because it is assigned to the same CPU as a busy interrupt
+ * handler). This can result in poor utilization of the network link. So,
+ * this method gets invoked from other places as well, to increase the
+ * likelihood that we keep the link busy. Those other invocations are not
+ * guaranteed to happen, so the pacer thread provides a backstop.
+ * @homa: Overall data about the Homa protocol implementation.
+ */
+void homa_pacer_xmit(struct homa *homa)
+{
+ struct homa_rpc *rpc;
+ int i;
+
+ /* Make sure only one instance of this function executes at a
+ * time.
+ */
+ if (!spin_trylock_bh(&homa->pacer_mutex))
+ return;
+
+ /* Each iteration through the following loop sends one packet. We
+ * limit the number of passes through this loop in order to cap the
+ * time spent in one call to this function (see note in
+ * homa_pacer_main about interfering with softirq handlers).
+ */
+ for (i = 0; i < 5; i++) {
+ __u64 idle_time, now;
+
+ /* If the NIC queue is too long, wait until it gets shorter. */
+ now = sched_clock();
+ idle_time = atomic64_read(&homa->link_idle_time);
+ while ((now + homa->max_nic_queue_ns) < idle_time) {
+ /* If we've xmitted at least one packet then
+ * return (this helps with testing and also
+ * allows homa_pacer_main to yield the core).
+ */
+ if (i != 0)
+ goto done;
+ now = sched_clock();
+ }
+ /* Note: when we get here, it's possible that the NIC queue is
+ * still too long because other threads have queued packets,
+ * but we transmit anyway so we don't starve (see perf.text
+ * for more info).
+ */
+
+ /* Lock the first throttled RPC. This may not be possible
+ * because we have to hold throttle_lock while locking
+ * the RPC; that means we can't wait for the RPC lock because
+ * of lock ordering constraints (see sync.txt). Thus, if
+ * the RPC lock isn't available, do nothing. Holding the
+ * throttle lock while locking the RPC is important because
+ * it keeps the RPC from being deleted before it can be locked.
+ */
+ homa_throttle_lock(homa);
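+		/* Most passes transmit from the first RPC on the throttled
+		 * list, which has the fewest remaining bytes (the list is
+		 * kept sorted by homa_add_to_throttled). Once in a while
+		 * pick the oldest RPC instead, so that long messages
+		 * cannot be starved indefinitely.
+		 */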
+ homa->pacer_fifo_count -= homa->pacer_fifo_fraction;
+ if (homa->pacer_fifo_count <= 0) {
+ struct homa_rpc *cur;
+ __u64 oldest = ~0;
+
+ homa->pacer_fifo_count += 1000;
+ rpc = NULL;
+ list_for_each_entry_rcu(cur, &homa->throttled_rpcs,
+ throttled_links) {
+ if (cur->msgout.init_ns < oldest) {
+ rpc = cur;
+ oldest = cur->msgout.init_ns;
+ }
+ }
+ } else {
+ rpc = list_first_or_null_rcu(&homa->throttled_rpcs,
+ struct homa_rpc,
+ throttled_links);
+ }
+ if (!rpc) {
+ homa_throttle_unlock(homa);
+ break;
+ }
+ if (!homa_rpc_try_lock(rpc, "homa_pacer_xmit")) {
+ homa_throttle_unlock(homa);
+ break;
+ }
+ homa_throttle_unlock(homa);
+
+ homa_xmit_data(rpc, true);
+
+ /* Note: rpc->state could be RPC_DEAD here, but the code
+ * below should work anyway.
+ */
+ if (!*rpc->msgout.next_xmit) {
+ /* Nothing more to transmit from this message (right
+ * now), so remove it from the throttled list.
+ */
+ homa_throttle_lock(homa);
+ if (!list_empty(&rpc->throttled_links)) {
+ list_del_rcu(&rpc->throttled_links);
+
+ /* Note: this reinitialization is only safe
+ * because the pacer only looks at the first
+ * element of the list, rather than traversing
+ * it (and besides, we know the pacer isn't
+ * active concurrently, since this code *is*
+ * the pacer). It would not be safe under more
+ * general usage patterns.
+ */
+ INIT_LIST_HEAD_RCU(&rpc->throttled_links);
+ }
+ homa_throttle_unlock(homa);
+ }
+ homa_rpc_unlock(rpc);
+ }
+done:
+ spin_unlock_bh(&homa->pacer_mutex);
+}
+
+/**
+ * homa_pacer_stop() - Will cause the pacer thread to exit (waking it up
+ * if necessary); doesn't return until after the pacer thread has exited.
+ * @homa: Overall data about the Homa protocol implementation.
+ */
+void homa_pacer_stop(struct homa *homa)
+{
+ homa->pacer_exit = true;
+ wake_up_process(homa->pacer_kthread);
+ kthread_stop(homa->pacer_kthread);
+ homa->pacer_kthread = NULL;
+}
+
+/**
+ * homa_add_to_throttled() - Make sure that an RPC is on the throttled list
+ * and wake up the pacer thread if necessary.
+ * @rpc: RPC with outbound packets that have been granted but can't be
+ * sent because of NIC queue restrictions. Must be locked by caller.
+ */
+void homa_add_to_throttled(struct homa_rpc *rpc)
+ __must_hold(&rpc->bucket->lock)
+{
+ struct homa *homa = rpc->hsk->homa;
+ struct homa_rpc *candidate;
+ int bytes_left;
+ int checks = 0;
+ __u64 now;
+
+ if (!list_empty(&rpc->throttled_links))
+ return;
+ now = sched_clock();
+ homa->throttle_add = now;
+ bytes_left = rpc->msgout.length - rpc->msgout.next_xmit_offset;
+ homa_throttle_lock(homa);
+ list_for_each_entry_rcu(candidate, &homa->throttled_rpcs,
+ throttled_links) {
+ int bytes_left_cand;
+
+ checks++;
+
+ /* Watch out: the pacer might have just transmitted the last
+ * packet from candidate.
+ */
+ bytes_left_cand = candidate->msgout.length -
+ candidate->msgout.next_xmit_offset;
+ if (bytes_left_cand > bytes_left) {
+ list_add_tail_rcu(&rpc->throttled_links,
+ &candidate->throttled_links);
+ goto done;
+ }
+ }
+ list_add_tail_rcu(&rpc->throttled_links, &homa->throttled_rpcs);
+done:
+ homa_throttle_unlock(homa);
+ wake_up_process(homa->pacer_kthread);
+}
+
+/**
+ * homa_remove_from_throttled() - Make sure that an RPC is not on the
+ * throttled list.
+ * @rpc: RPC of interest.
+ */
+void homa_remove_from_throttled(struct homa_rpc *rpc)
+{
+ if (unlikely(!list_empty(&rpc->throttled_links))) {
+ homa_throttle_lock(rpc->hsk->homa);
+ list_del(&rpc->throttled_links);
+ homa_throttle_unlock(rpc->hsk->homa);
+ INIT_LIST_HEAD(&rpc->throttled_links);
+ }
+}
--
2.34.1
* [PATCH net-next v6 10/12] net: homa: create homa_timer.c
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (8 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 09/12] net: homa: create homa_outgoing.c John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 11/12] net: homa: create homa_plumbing.c and homa_utils.c John Ousterhout
` (2 subsequent siblings)
12 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
This file contains code that wakes up periodically to check for
missing data, initiate retransmissions, and declare peer nodes
"dead".
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_timer.c | 157 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 157 insertions(+)
create mode 100644 net/homa/homa_timer.c
diff --git a/net/homa/homa_timer.c b/net/homa/homa_timer.c
new file mode 100644
index 000000000000..272a6ac71ee9
--- /dev/null
+++ b/net/homa/homa_timer.c
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file handles timing-related functions for Homa, such as retries
+ * and timeouts.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_stub.h"
+
+/**
+ * homa_check_rpc() - Invoked for each RPC during each timer pass; does
+ * most of the work of checking for time-related actions such as sending
+ * resends, aborting RPCs for which there is no response, and sending
+ * requests for acks. It is separate from homa_timer because homa_timer
+ * got too long and deeply indented.
+ * @rpc: RPC to check; must be locked by the caller.
+ */
+void homa_check_rpc(struct homa_rpc *rpc)
+{
+ struct homa *homa = rpc->hsk->homa;
+ struct homa_resend_hdr resend;
+
+ /* See if we need to request an ack for this RPC. */
+ if (!homa_is_client(rpc->id) && rpc->state == RPC_OUTGOING &&
+ rpc->msgout.next_xmit_offset >= rpc->msgout.length) {
+ if (rpc->done_timer_ticks == 0) {
+ rpc->done_timer_ticks = homa->timer_ticks;
+ } else {
+ /* >= comparison that handles tick wrap-around. */
+ if ((rpc->done_timer_ticks + homa->request_ack_ticks
+ - 1 - homa->timer_ticks) & 1 << 31) {
+ struct homa_need_ack_hdr h;
+
+ homa_xmit_control(NEED_ACK, &h, sizeof(h), rpc);
+ }
+ }
+ }
+
+ if (rpc->state == RPC_INCOMING) {
+ if (rpc->msgin.num_bpages == 0) {
+ /* Waiting for buffer space, so no problem. */
+ rpc->silent_ticks = 0;
+ return;
+ }
+ } else if (!homa_is_client(rpc->id)) {
+ /* We're the server and we've received the input message;
+ * no need to worry about retries.
+ */
+ rpc->silent_ticks = 0;
+ return;
+ }
+
+ if (rpc->state == RPC_OUTGOING) {
+ if (rpc->msgout.next_xmit_offset < rpc->msgout.length) {
+ /* There are bytes that we haven't transmitted,
+ * so no need to be concerned; the ball is in our court.
+ */
+ rpc->silent_ticks = 0;
+ return;
+ }
+ }
+
+ if (rpc->silent_ticks < homa->resend_ticks)
+ return;
+ if (rpc->silent_ticks >= homa->timeout_ticks) {
+ homa_rpc_abort(rpc, -ETIMEDOUT);
+ return;
+ }
+ if (((rpc->silent_ticks - homa->resend_ticks) % homa->resend_interval)
+ != 0)
+ return;
+
+ /* Issue a resend for the bytes just after the last ones received
+	 * (gaps in the middle are handled by the homa_gap_retry call below).
+ */
+ if (rpc->msgin.length < 0) {
+ /* Haven't received any data for this message; request
+ * retransmission of just the first packet (the sender
+ * will send at least one full packet, regardless of
+ * the length below).
+ */
+ resend.offset = htonl(0);
+ resend.length = htonl(100);
+ } else {
+ homa_gap_retry(rpc);
+ resend.offset = htonl(rpc->msgin.recv_end);
+ resend.length = htonl(rpc->msgin.length - rpc->msgin.recv_end);
+ if (resend.length == 0)
+ return;
+ }
+ homa_xmit_control(RESEND, &resend, sizeof(resend), rpc);
+}
+
+/**
+ * homa_timer() - This function is invoked at regular intervals ("ticks")
+ * to implement retries and aborts for Homa.
+ * @homa: Overall data about the Homa protocol implementation.
+ */
+void homa_timer(struct homa *homa)
+{
+ struct homa_socktab_scan scan;
+ struct homa_sock *hsk;
+ struct homa_rpc *rpc;
+ int total_rpcs = 0;
+ int rpc_count = 0;
+
+ homa->timer_ticks++;
+
+ /* Scan all existing RPCs in all sockets. The rcu_read_lock
+ * below prevents sockets from being deleted during the scan.
+ */
+ rcu_read_lock();
+ for (hsk = homa_socktab_start_scan(homa->port_map, &scan);
+ hsk; hsk = homa_socktab_next(&scan)) {
+ while (hsk->dead_skbs >= homa->dead_buffs_limit)
+ /* If we get here, it means that homa_wait_for_message
+ * isn't keeping up with RPC reaping, so we'll help
+ * out. See reap.txt for more info.
+ */
+ if (homa_rpc_reap(hsk, false) == 0)
+ break;
+
+ if (list_empty(&hsk->active_rpcs) || hsk->shutdown)
+ continue;
+
+ if (!homa_protect_rpcs(hsk))
+ continue;
+ list_for_each_entry_rcu(rpc, &hsk->active_rpcs, active_links) {
+ total_rpcs++;
+ homa_rpc_lock(rpc, "homa_timer");
+ if (rpc->state == RPC_IN_SERVICE) {
+ rpc->silent_ticks = 0;
+ homa_rpc_unlock(rpc);
+ continue;
+ }
+ rpc->silent_ticks++;
+ homa_check_rpc(rpc);
+ homa_rpc_unlock(rpc);
+ rpc_count++;
+ if (rpc_count >= 10) {
+ /* Give other kernel threads a chance to run
+ * on this core. Must release the RCU read lock
+ * while doing this.
+ */
+ rcu_read_unlock();
+ schedule();
+ rcu_read_lock();
+ rpc_count = 0;
+ }
+ }
+ homa_unprotect_rpcs(hsk);
+ }
+ homa_socktab_end_scan(&scan);
+ rcu_read_unlock();
+}
--
2.34.1
* [PATCH net-next v6 11/12] net: homa: create homa_plumbing.c and homa_utils.c
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (9 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 10/12] net: homa: create homa_timer.c John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 12/12] net: homa: create Makefile and Kconfig John Ousterhout
2025-01-24 8:55 ` [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol Paolo Abeni
12 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
homa_plumbing.c contains functions that connect Homa to the rest of
the Linux kernel, such as dispatch tables used by Linux and the
top-level functions that Linux invokes from those dispatch tables.
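For context, the dispatch tables in this file are what an application
reaches through the ordinary socket system calls. Below is a minimal,
hypothetical user-space sketch; it assumes IPPROTO_HOMA is exported to
user space by Homa's API header (the numeric value used here is a
placeholder for illustration only):

/* Sketch: creating and binding a Homa socket from user space. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IPPROTO_HOMA
#define IPPROTO_HOMA 0xFD  /* placeholder; real value comes from Homa's API header */
#endif

int main(void)
{
        struct sockaddr_in6 addr;
        int fd;

        /* SOCK_DGRAM + IPPROTO_HOMA selects the protosw registered by
         * homa_load(); homa_socket() then initializes the new socket.
         */
        fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_HOMA);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        /* Servers bind to a well-known port (handled by homa_bind());
         * client sockets can skip this step.
         */
        memset(&addr, 0, sizeof(addr));
        addr.sin6_family = AF_INET6;
        addr.sin6_addr = in6addr_any;
        addr.sin6_port = htons(500);    /* arbitrary example port */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("bind");
                close(fd);
                return 1;
        }

        close(fd);
        return 0;
}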
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/homa/homa_plumbing.c | 1004 ++++++++++++++++++++++++++++++++++++++
net/homa/homa_utils.c | 166 +++++++
2 files changed, 1170 insertions(+)
create mode 100644 net/homa/homa_plumbing.c
create mode 100644 net/homa/homa_utils.c
diff --git a/net/homa/homa_plumbing.c b/net/homa/homa_plumbing.c
new file mode 100644
index 000000000000..6c654444241b
--- /dev/null
+++ b/net/homa/homa_plumbing.c
@@ -0,0 +1,1004 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file consists mostly of "glue" that hooks Homa into the rest of
+ * the Linux kernel. The guts of the protocol are in other files.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_pool.h"
+
+/* Not yet sure what these variables are for */
+static long sysctl_homa_mem[3] __read_mostly;
+static int sysctl_homa_rmem_min __read_mostly;
+static int sysctl_homa_wmem_min __read_mostly;
+
+/* Global data for Homa. Never reference homa_data directly. Always use
+ * the global_homa variable instead; this allows overriding during unit tests.
+ */
+static struct homa homa_data;
+
+/* This variable contains the address of the statically-allocated struct homa
+ * used throughout Homa. This variable should almost never be used directly:
+ * it should be passed as a parameter to functions that need it. This
+ * variable is used only by functions called from Linux (so they can't pass
+ * in a pointer).
+ */
+struct homa *global_homa = &homa_data;
+
+/* True means that the Homa module is in the process of unloading itself,
+ * so everyone should clean up.
+ */
+static bool exiting;
+
+/* Thread that runs timer code to detect lost packets and crashed peers. */
+static struct task_struct *timer_kthread;
+
+/* This structure defines functions that handle various operations on
+ * Homa sockets. These functions are relatively generic: they are called
+ * to implement top-level system calls. Many of these operations can
+ * be implemented by PF_INET6 functions that are independent of the
+ * Homa protocol.
+ */
+static const struct proto_ops homa_proto_ops = {
+ .family = PF_INET,
+ .owner = THIS_MODULE,
+ .release = inet_release,
+ .bind = homa_bind,
+ .connect = inet_dgram_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = sock_no_accept,
+ .getname = inet_getname,
+ .poll = homa_poll,
+ .ioctl = inet_ioctl,
+ .listen = sock_no_listen,
+ .shutdown = homa_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = inet_sendmsg,
+ .recvmsg = inet_recvmsg,
+ .mmap = sock_no_mmap,
+ .set_peek_off = sk_set_peek_off,
+};
+
+static const struct proto_ops homav6_proto_ops = {
+ .family = PF_INET6,
+ .owner = THIS_MODULE,
+ .release = inet6_release,
+ .bind = homa_bind,
+ .connect = inet_dgram_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = sock_no_accept,
+ .getname = inet6_getname,
+ .poll = homa_poll,
+ .ioctl = inet6_ioctl,
+ .listen = sock_no_listen,
+ .shutdown = homa_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = inet_sendmsg,
+ .recvmsg = inet_recvmsg,
+ .mmap = sock_no_mmap,
+ .set_peek_off = sk_set_peek_off,
+};
+
+/* This structure also defines functions that handle various operations
+ * on Homa sockets. However, these functions are lower-level than those
+ * in homa_proto_ops: they are specific to the PF_INET or PF_INET6
+ * protocol family, and in many cases they are invoked by functions in
+ * homa_proto_ops. Most of these functions have Homa-specific implementations.
+ */
+static struct proto homa_prot = {
+ .name = "HOMA",
+ .owner = THIS_MODULE,
+ .close = homa_close,
+ .connect = ip4_datagram_connect,
+ .disconnect = homa_disconnect,
+ .ioctl = homa_ioctl,
+ .init = homa_socket,
+ .destroy = NULL,
+ .setsockopt = homa_setsockopt,
+ .getsockopt = homa_getsockopt,
+ .sendmsg = homa_sendmsg,
+ .recvmsg = homa_recvmsg,
+ .backlog_rcv = homa_backlog_rcv,
+ .hash = homa_hash,
+ .unhash = homa_unhash,
+ .get_port = homa_get_port,
+ .sysctl_mem = sysctl_homa_mem,
+ .sysctl_wmem = &sysctl_homa_wmem_min,
+ .sysctl_rmem = &sysctl_homa_rmem_min,
+ .obj_size = sizeof(struct homa_sock),
+ .no_autobind = 1,
+};
+
+static struct proto homav6_prot = {
+ .name = "HOMAv6",
+ .owner = THIS_MODULE,
+ .close = homa_close,
+ .connect = ip6_datagram_connect,
+ .disconnect = homa_disconnect,
+ .ioctl = homa_ioctl,
+ .init = homa_socket,
+ .destroy = NULL,
+ .setsockopt = homa_setsockopt,
+ .getsockopt = homa_getsockopt,
+ .sendmsg = homa_sendmsg,
+ .recvmsg = homa_recvmsg,
+ .backlog_rcv = homa_backlog_rcv,
+ .hash = homa_hash,
+ .unhash = homa_unhash,
+ .get_port = homa_get_port,
+ .sysctl_mem = sysctl_homa_mem,
+ .sysctl_wmem = &sysctl_homa_wmem_min,
+ .sysctl_rmem = &sysctl_homa_rmem_min,
+
+ .obj_size = sizeof(struct homa_v6_sock),
+ .ipv6_pinfo_offset = offsetof(struct homa_v6_sock, inet6),
+
+ .no_autobind = 1,
+};
+
+/* Top-level structure describing the Homa protocol. */
+static struct inet_protosw homa_protosw = {
+ .type = SOCK_DGRAM,
+ .protocol = IPPROTO_HOMA,
+ .prot = &homa_prot,
+ .ops = &homa_proto_ops,
+ .flags = INET_PROTOSW_REUSE,
+};
+
+static struct inet_protosw homav6_protosw = {
+ .type = SOCK_DGRAM,
+ .protocol = IPPROTO_HOMA,
+ .prot = &homav6_prot,
+ .ops = &homav6_proto_ops,
+ .flags = INET_PROTOSW_REUSE,
+};
+
+/* This structure is used by IP to deliver incoming Homa packets to us. */
+static struct net_protocol homa_protocol = {
+ .handler = homa_softirq,
+ .err_handler = homa_err_handler_v4,
+ .no_policy = 1,
+};
+
+static struct inet6_protocol homav6_protocol = {
+ .handler = homa_softirq,
+ .err_handler = homa_err_handler_v6,
+ .flags = INET6_PROTO_NOPOLICY | INET6_PROTO_FINAL,
+};
+
+/* Sizes of the headers for each Homa packet type, in bytes. */
+static __u16 header_lengths[] = {
+ sizeof32(struct homa_data_hdr),
+ 0,
+ sizeof32(struct homa_resend_hdr),
+ sizeof32(struct homa_unknown_hdr),
+ sizeof32(struct homa_busy_hdr),
+ 0,
+ 0,
+ sizeof32(struct homa_need_ack_hdr),
+ sizeof32(struct homa_ack_hdr)
+};
+
+static DECLARE_COMPLETION(timer_thread_done);
+
+/**
+ * homa_load() - invoked when this module is loaded into the Linux kernel
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int __init homa_load(void)
+{
+ struct homa *homa = global_homa;
+ int status;
+
+ pr_notice("Homa module loading\n");
+ status = proto_register(&homa_prot, 1);
+ if (status != 0) {
+ pr_err("proto_register failed for homa_prot: %d\n", status);
+ goto proto_register_err;
+ }
+ status = proto_register(&homav6_prot, 1);
+ if (status != 0) {
+ pr_err("proto_register failed for homav6_prot: %d\n", status);
+ goto proto_register_v6_err;
+ }
+ inet_register_protosw(&homa_protosw);
+ status = inet6_register_protosw(&homav6_protosw);
+ if (status != 0) {
+ pr_err("inet6_register_protosw failed in %s: %d\n", __func__,
+ status);
+ goto register_protosw_v6_err;
+ }
+ status = inet_add_protocol(&homa_protocol, IPPROTO_HOMA);
+ if (status != 0) {
+ pr_err("inet_add_protocol failed in %s: %d\n", __func__,
+ status);
+ goto add_protocol_err;
+ }
+ status = inet6_add_protocol(&homav6_protocol, IPPROTO_HOMA);
+ if (status != 0) {
+ pr_err("inet6_add_protocol failed in %s: %d\n", __func__,
+ status);
+ goto add_protocol_v6_err;
+ }
+
+ status = homa_init(homa);
+ if (status)
+ goto homa_init_err;
+
+ timer_kthread = kthread_run(homa_timer_main, homa, "homa_timer");
+ if (IS_ERR(timer_kthread)) {
+ status = PTR_ERR(timer_kthread);
+		pr_err("couldn't create homa timer thread: error %d\n",
+ status);
+ timer_kthread = NULL;
+ goto timer_err;
+ }
+
+ return 0;
+
+timer_err:
+ homa_destroy(homa);
+homa_init_err:
+ inet6_del_protocol(&homav6_protocol, IPPROTO_HOMA);
+add_protocol_v6_err:
+ inet_del_protocol(&homa_protocol, IPPROTO_HOMA);
+add_protocol_err:
+ inet6_unregister_protosw(&homav6_protosw);
+register_protosw_v6_err:
+ inet_unregister_protosw(&homa_protosw);
+ proto_unregister(&homav6_prot);
+proto_register_v6_err:
+ proto_unregister(&homa_prot);
+proto_register_err:
+ return status;
+}
+
+/**
+ * homa_unload() - invoked when this module is unloaded from the Linux kernel.
+ */
+void __exit homa_unload(void)
+{
+ struct homa *homa = global_homa;
+
+ pr_notice("Homa module unloading\n");
+ exiting = true;
+
+ if (timer_kthread)
+ wake_up_process(timer_kthread);
+ wait_for_completion(&timer_thread_done);
+ homa_destroy(homa);
+ inet_del_protocol(&homa_protocol, IPPROTO_HOMA);
+ inet_unregister_protosw(&homa_protosw);
+ inet6_del_protocol(&homav6_protocol, IPPROTO_HOMA);
+ inet6_unregister_protosw(&homav6_protosw);
+ proto_unregister(&homa_prot);
+ proto_unregister(&homav6_prot);
+}
+
+module_init(homa_load);
+module_exit(homa_unload);
+
+/**
+ * homa_bind() - Implements the bind system call for Homa sockets: associates
+ * a well-known service port with a socket. Unlike other AF_INET6 protocols,
+ * there is no need to invoke this system call for sockets that are only
+ * used as clients.
+ * @sock: Socket on which the system call was invoked.
+ * @addr: Contains the desired port number.
+ * @addr_len: Number of bytes in uaddr.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+ union sockaddr_in_union *addr_in = (union sockaddr_in_union *)addr;
+ struct homa_sock *hsk = homa_sk(sock->sk);
+ int port = 0;
+
+ if (unlikely(addr->sa_family != sock->sk->sk_family))
+ return -EAFNOSUPPORT;
+ if (addr_in->in6.sin6_family == AF_INET6) {
+ if (addr_len < sizeof(struct sockaddr_in6))
+ return -EINVAL;
+ port = ntohs(addr_in->in6.sin6_port);
+ } else if (addr_in->in4.sin_family == AF_INET) {
+ if (addr_len < sizeof(struct sockaddr_in))
+ return -EINVAL;
+ port = ntohs(addr_in->in4.sin_port);
+ }
+ return homa_sock_bind(hsk->homa->port_map, hsk, port);
+}
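+
+/* Illustrative user-space usage sketch (not kernel code; MY_SERVER_PORT
+ * and the error handling are placeholders):
+ *
+ *	int fd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_HOMA);
+ *	struct sockaddr_in6 addr = {};
+ *
+ *	addr.sin6_family = AF_INET6;
+ *	addr.sin6_port = htons(MY_SERVER_PORT);
+ *	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
+ *		handle_error();
+ */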
+
+/**
+ * homa_close() - Invoked when close system call is invoked on a Homa socket.
+ * @sk: Socket being closed
+ * @timeout: How long to linger before closing (not used by Homa).
+ */
+void homa_close(struct sock *sk, long timeout)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+
+ homa_sock_destroy(hsk);
+ sk_common_release(sk);
+}
+
+/**
+ * homa_shutdown() - Implements the shutdown system call for Homa sockets.
+ * @sock: Socket to shut down.
+ * @how: Ignored: for other sockets, can independently shut down
+ * sending and receiving, but for Homa any shutdown will
+ * shut down everything.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_shutdown(struct socket *sock, int how)
+{
+ homa_sock_shutdown(homa_sk(sock->sk));
+ return 0;
+}
+
+/**
+ * homa_disconnect() - Invoked when disconnect system call is invoked on a
+ * Homa socket.
+ * @sk: Socket to disconnect
+ * @flags: Flags from the system call (not used by Homa).
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_disconnect(struct sock *sk, int flags)
+{
+ pr_warn("unimplemented disconnect invoked on Homa socket\n");
+ return -EINVAL;
+}
+
+/**
+ * homa_ioctl() - Implements the ioctl system call for Homa sockets.
+ * @sk: Socket on which the system call was invoked.
+ * @cmd: Identifier for a particular ioctl operation.
+ * @karg: Operation-specific argument; typically the address of a block
+ * of data in user address space.
+ *
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_ioctl(struct sock *sk, int cmd, int *karg)
+{
+ return -EINVAL;
+}
+
+/**
+ * homa_socket() - Implements the socket(2) system call for Homa sockets.
+ * @sk: Socket on which the system call was invoked. The non-Homa
+ * parts have already been initialized.
+ *
+ * Return: always 0 (success).
+ */
+int homa_socket(struct sock *sk)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ struct homa *homa = global_homa;
+ int result;
+
+ result = homa_sock_init(hsk, homa);
+ if (result != 0)
+ homa_sock_destroy(hsk);
+ return result;
+}
+
+/**
+ * homa_setsockopt() - Implements the setsockopt system call for Homa sockets.
+ * @sk: Socket on which the system call was invoked.
+ * @level: Level at which the operation should be handled; will always
+ * be IPPROTO_HOMA.
+ * @optname: Identifies a particular setsockopt operation.
+ * @optval: Address in user space of information about the option.
+ * @optlen: Number of bytes of data at @optval.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_setsockopt(struct sock *sk, int level, int optname,
+ sockptr_t optval, unsigned int optlen)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ struct homa_rcvbuf_args args;
+ int ret;
+
+ if (level != IPPROTO_HOMA || optname != SO_HOMA_RCVBUF)
+ return -ENOPROTOOPT;
+ if (optlen != sizeof(struct homa_rcvbuf_args))
+ return -EINVAL;
+
+ if (copy_from_sockptr(&args, optval, optlen))
+ return -EFAULT;
+
+ /* Do a trivial test to make sure we can at least write the first
+ * page of the region.
+ */
+ if (copy_to_user(u64_to_user_ptr(args.start), &args,
+ sizeof(args)))
+ return -EFAULT;
+
+ homa_sock_lock(hsk, "homa_setsockopt SO_HOMA_RCVBUF");
+ ret = homa_pool_init(hsk, u64_to_user_ptr(args.start), args.length);
+ homa_sock_unlock(hsk);
+ return ret;
+}
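+
+/* Illustrative user-space usage sketch (placeholder names; assumes only
+ * the start and length fields of struct homa_rcvbuf_args, as used above):
+ *
+ *	struct homa_rcvbuf_args args;
+ *	void *region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
+ *			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ *
+ *	args.start = (uintptr_t)region;		// must be page-aligned
+ *	args.length = region_size;
+ *	if (setsockopt(fd, IPPROTO_HOMA, SO_HOMA_RCVBUF, &args,
+ *		       sizeof(args)) != 0)
+ *		handle_error();
+ */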
+
+/**
+ * homa_getsockopt() - Implements the getsockopt system call for Homa sockets.
+ * @sk: Socket on which the system call was invoked.
+ * @level: Selects level in the network stack to handle the request;
+ * must be IPPROTO_HOMA.
+ * @optname: Identifies a particular getsockopt operation.
+ * @optval: Address in user space where the option's value should be stored.
+ * @optlen: Number of bytes available at optval; will be overwritten with
+ * actual number of bytes stored.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_getsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, int __user *optlen)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ struct homa_rcvbuf_args val;
+ int len;
+
+ if (copy_from_sockptr(&len, USER_SOCKPTR(optlen), sizeof(int)))
+ return -EFAULT;
+
+ if (level != IPPROTO_HOMA || optname != SO_HOMA_RCVBUF)
+ return -ENOPROTOOPT;
+ if (len < sizeof(val))
+ return -EINVAL;
+
+ homa_pool_get_rcvbuf(hsk, &val);
+ len = sizeof(val);
+
+ if (copy_to_sockptr(USER_SOCKPTR(optlen), &len, sizeof(int)))
+ return -EFAULT;
+
+ if (copy_to_sockptr(USER_SOCKPTR(optval), &val, len))
+ return -EFAULT;
+ return 0;
+}
+
+/**
+ * homa_sendmsg() - Send a request or response message on a Homa socket.
+ * @sk: Socket on which the system call was invoked.
+ * @msg: Structure describing the message to send; the msg_control
+ * field points to additional information.
+ * @length: Number of bytes of the message.
+ * Return: 0 on success, otherwise a negative errno.
+ */
+int homa_sendmsg(struct sock *sk, struct msghdr *msg, size_t length)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ struct homa_sendmsg_args args;
+ union sockaddr_in_union *addr;
+ struct homa_rpc *rpc = NULL;
+ int result = 0;
+
+ addr = (union sockaddr_in_union *)msg->msg_name;
+ if (!addr) {
+ result = -EINVAL;
+ goto error;
+ }
+
+ if (unlikely(!msg->msg_control_is_user)) {
+ result = -EINVAL;
+ goto error;
+ }
+ if (unlikely(copy_from_user(&args, (void __user *)msg->msg_control,
+ sizeof(args)))) {
+ result = -EFAULT;
+ goto error;
+ }
+ if (addr->sa.sa_family != sk->sk_family) {
+ result = -EAFNOSUPPORT;
+ goto error;
+ }
+ if (msg->msg_namelen < sizeof(struct sockaddr_in) ||
+ (msg->msg_namelen < sizeof(struct sockaddr_in6) &&
+ addr->in6.sin6_family == AF_INET6)) {
+ result = -EINVAL;
+ goto error;
+ }
+
+ if (!args.id) {
+ /* This is a request message. */
+ rpc = homa_rpc_new_client(hsk, addr);
+ if (IS_ERR(rpc)) {
+ result = PTR_ERR(rpc);
+ rpc = NULL;
+ goto error;
+ }
+ rpc->completion_cookie = args.completion_cookie;
+ result = homa_message_out_fill(rpc, &msg->msg_iter, 1);
+ if (result)
+ goto error;
+ args.id = rpc->id;
+ homa_rpc_unlock(rpc); /* Locked by homa_rpc_new_client. */
+ rpc = NULL;
+
+ if (unlikely(copy_to_user((void __user *)msg->msg_control,
+ &args, sizeof(args)))) {
+ rpc = homa_find_client_rpc(hsk, args.id);
+ result = -EFAULT;
+ goto error;
+ }
+ } else {
+ /* This is a response message. */
+ struct in6_addr canonical_dest;
+
+ if (args.completion_cookie != 0) {
+ result = -EINVAL;
+ goto error;
+ }
+ canonical_dest = canonical_ipv6_addr(addr);
+
+ rpc = homa_find_server_rpc(hsk, &canonical_dest, args.id);
+ if (!rpc)
+ /* Return without an error if the RPC doesn't exist;
+ * this could be totally valid (e.g. client is
+ * no longer interested in it).
+ */
+ return 0;
+ if (rpc->error) {
+ result = rpc->error;
+ goto error;
+ }
+ if (rpc->state != RPC_IN_SERVICE) {
+ /* Locked by homa_find_server_rpc. */
+ homa_rpc_unlock(rpc);
+ rpc = NULL;
+ result = -EINVAL;
+ goto error;
+ }
+ rpc->state = RPC_OUTGOING;
+
+ result = homa_message_out_fill(rpc, &msg->msg_iter, 1);
+ if (result && rpc->state != RPC_DEAD)
+ goto error;
+ homa_rpc_unlock(rpc); /* Locked by homa_find_server_rpc. */
+ }
+ return 0;
+
+error:
+ if (rpc) {
+ homa_rpc_free(rpc);
+ homa_rpc_unlock(rpc); /* Locked by homa_rpc_new_client,
+ * homa_find_client_rpc, or homa_find_server_rpc.
+ */
+ }
+ return result;
+}
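+
+/* Illustrative user-space sketch for issuing a request (placeholder names;
+ * the homa_sendmsg_args fields match the usage above):
+ *
+ *	struct homa_sendmsg_args args = {};
+ *	struct msghdr msg = {};
+ *
+ *	args.id = 0;			// 0 selects the "request" path
+ *	args.completion_cookie = my_cookie;
+ *	msg.msg_name = &server_addr;
+ *	msg.msg_namelen = sizeof(server_addr);
+ *	msg.msg_iov = &iov;
+ *	msg.msg_iovlen = 1;
+ *	msg.msg_control = &args;
+ *	msg.msg_controllen = sizeof(args);
+ *	if (sendmsg(fd, &msg, 0) >= 0)
+ *		rpc_id = args.id;	// id of the new RPC, filled in above
+ *
+ * For a response, args.id holds the id of the RPC being answered and
+ * args.completion_cookie must be 0.
+ */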
+
+/**
+ * homa_recvmsg() - Receive a message from a Homa socket.
+ * @sk: Socket on which the system call was invoked.
+ * @msg: Controlling information for the receive.
+ * @len: Total bytes of space available in msg->msg_iov; not used.
+ * @flags: Flags from system call; only MSG_DONTWAIT is used.
+ * @addr_len: Store the length of the sender address here
+ * Return: The length of the message on success, otherwise a negative
+ * errno.
+ */
+int homa_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
+ int *addr_len)
+{
+ struct homa_sock *hsk = homa_sk(sk);
+ struct homa_recvmsg_args control;
+ struct homa_rpc *rpc;
+ int result;
+
+ if (unlikely(!msg->msg_control)) {
+ /* This test isn't strictly necessary, but it provides a
+ * hook for testing kernel call times.
+ */
+ return -EINVAL;
+ }
+ if (msg->msg_controllen != sizeof(control))
+ return -EINVAL;
+ if (unlikely(copy_from_user(&control, (void __user *)msg->msg_control,
+ sizeof(control))))
+ return -EFAULT;
+ control.completion_cookie = 0;
+
+ if (control.num_bpages > HOMA_MAX_BPAGES ||
+ (control.flags & ~HOMA_RECVMSG_VALID_FLAGS)) {
+ result = -EINVAL;
+ goto done;
+ }
+ result = homa_pool_release_buffers(hsk->buffer_pool, control.num_bpages,
+ control.bpage_offsets);
+ control.num_bpages = 0;
+ if (result != 0)
+ goto done;
+
+ rpc = homa_wait_for_message(hsk, (flags & MSG_DONTWAIT)
+ ? (control.flags | HOMA_RECVMSG_NONBLOCKING)
+ : control.flags, control.id);
+ if (IS_ERR(rpc)) {
+ /* If we get here, it means there was an error that prevented
+ * us from finding an RPC to return. If there's an error in
+ * the RPC itself we won't get here.
+ */
+ result = PTR_ERR(rpc);
+ goto done;
+ }
+ result = rpc->error ? rpc->error : rpc->msgin.length;
+
+ /* Collect result information. */
+ control.id = rpc->id;
+ control.completion_cookie = rpc->completion_cookie;
+ if (likely(rpc->msgin.length >= 0)) {
+ control.num_bpages = rpc->msgin.num_bpages;
+ memcpy(control.bpage_offsets, rpc->msgin.bpage_offsets,
+ sizeof(rpc->msgin.bpage_offsets));
+ }
+ if (sk->sk_family == AF_INET6) {
+ struct sockaddr_in6 *in6 = msg->msg_name;
+
+ in6->sin6_family = AF_INET6;
+ in6->sin6_port = htons(rpc->dport);
+ in6->sin6_addr = rpc->peer->addr;
+ *addr_len = sizeof(*in6);
+ } else {
+ struct sockaddr_in *in4 = msg->msg_name;
+
+ in4->sin_family = AF_INET;
+ in4->sin_port = htons(rpc->dport);
+ in4->sin_addr.s_addr = ipv6_to_ipv4(rpc->peer->addr);
+ *addr_len = sizeof(*in4);
+ }
+
+ /* This indicates that the application now owns the buffers, so
+ * we won't free them in homa_rpc_free.
+ */
+ rpc->msgin.num_bpages = 0;
+
+ /* Must release the RPC lock (and potentially free the RPC) before
+ * copying the results back to user space.
+ */
+ if (homa_is_client(rpc->id)) {
+ homa_peer_add_ack(rpc);
+ homa_rpc_free(rpc);
+ } else {
+ if (result < 0)
+ homa_rpc_free(rpc);
+ else
+ rpc->state = RPC_IN_SERVICE;
+ }
+ homa_rpc_unlock(rpc); /* Locked by homa_wait_for_message. */
+
+done:
+ if (unlikely(copy_to_user((__force void __user *)msg->msg_control,
+ &control, sizeof(control)))) {
+ /* Note: in this case the message's buffers will be leaked. */
+ pr_notice("%s couldn't copy back args\n", __func__);
+ result = -EFAULT;
+ }
+
+ return result;
+}
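+
+/* Illustrative user-space sketch for receiving a message (placeholder
+ * names; the homa_recvmsg_args fields match the usage above, and id == 0
+ * is assumed to mean "any RPC", as interpreted by homa_wait_for_message):
+ *
+ *	struct homa_recvmsg_args control = {};
+ *	struct msghdr msg = {};
+ *	ssize_t length;
+ *
+ *	control.id = 0;
+ *	control.flags = 0;
+ *	control.num_bpages = 0;		// no buffers to return yet
+ *	msg.msg_name = &peer_addr;
+ *	msg.msg_namelen = sizeof(peer_addr);
+ *	msg.msg_control = &control;
+ *	msg.msg_controllen = sizeof(control);
+ *	length = recvmsg(fd, &msg, 0);
+ *	// On success the data lives in the SO_HOMA_RCVBUF region at the
+ *	// offsets in control.bpage_offsets[0..control.num_bpages-1]; pass
+ *	// those offsets back in a later call to release the buffers.
+ */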
+
+/**
+ * homa_hash() - Not needed for Homa.
+ * @sk: Socket for the operation
+ * Return: Always 0.
+ */
+int homa_hash(struct sock *sk)
+{
+ return 0;
+}
+
+/**
+ * homa_unhash() - Not needed for Homa.
+ * @sk: Socket for the operation
+ */
+void homa_unhash(struct sock *sk)
+{
+}
+
+/**
+ * homa_get_port() - Invoked by the socket layer to assign a default
+ * port for a socket.
+ * @sk: Socket for the operation
+ * @snum: Requested port number (not used: Homa assigns ports when the
+ * socket is created).
+ * Return: Zero for success, or a negative errno for an error.
+ */
+int homa_get_port(struct sock *sk, unsigned short snum)
+{
+ /* Homa always assigns ports immediately when a socket is created,
+ * so there is nothing to do here.
+ */
+ return 0;
+}
+
+/**
+ * homa_softirq() - This function is invoked at SoftIRQ level to handle
+ * incoming packets.
+ * @skb: The incoming packet.
+ * Return: Always 0
+ */
+int homa_softirq(struct sk_buff *skb)
+{
+ struct sk_buff *packets, *other_pkts, *next;
+ struct sk_buff **prev_link, **other_link;
+ struct homa *homa = global_homa;
+ struct homa_common_hdr *h;
+ int header_offset;
+ int pull_length;
+
+ /* skb may actually contain many distinct packets, linked through
+ * skb_shinfo(skb)->frag_list by the Homa GRO mechanism. Make a
+ * pass through the list to process all of the short packets,
+ * leaving the longer packets in the list. Also, perform various
+ * prep/cleanup/error checking functions.
+ */
+ skb->next = skb_shinfo(skb)->frag_list;
+ skb_shinfo(skb)->frag_list = NULL;
+ packets = skb;
+ prev_link = &packets;
+ for (skb = packets; skb; skb = next) {
+ next = skb->next;
+
+ /* Make the header available at skb->data, even if the packet
+ * is fragmented. One complication: it's possible that the IP
+ * header hasn't yet been removed (this happens for GRO packets
+ * on the frag_list, since they aren't handled explicitly by IP.
+ */
+ header_offset = skb_transport_header(skb) - skb->data;
+ pull_length = HOMA_MAX_HEADER + header_offset;
+ if (pull_length > skb->len)
+ pull_length = skb->len;
+ if (!pskb_may_pull(skb, pull_length))
+ goto discard;
+ if (header_offset)
+ __skb_pull(skb, header_offset);
+
+ /* Reject packets that are too short or have bogus types. */
+ h = (struct homa_common_hdr *)skb->data;
+ if (unlikely(skb->len < sizeof(struct homa_common_hdr) ||
+ h->type < DATA || h->type >= BOGUS ||
+ skb->len < header_lengths[h->type - DATA]))
+ goto discard;
+
+ /* Process the packet now if it is a control packet or
+ * if it contains an entire short message.
+ */
+ if (h->type != DATA || ntohl(((struct homa_data_hdr *)h)
+ ->message_length) < 1400) {
+ *prev_link = skb->next;
+ skb->next = NULL;
+ homa_dispatch_pkts(skb, homa);
+ } else {
+ prev_link = &skb->next;
+ }
+ continue;
+
+discard:
+ *prev_link = skb->next;
+ kfree_skb(skb);
+ }
+
+ /* Now process the longer packets. Each iteration of this loop
+ * collects all of the packets for a particular RPC and dispatches
+ * them (batching the packets for an RPC allows more efficient
+ * generation of grants).
+ */
+ while (packets) {
+ struct in6_addr saddr, saddr2;
+ struct homa_common_hdr *h2;
+ struct sk_buff *skb2;
+
+ skb = packets;
+ prev_link = &skb->next;
+ saddr = skb_canonical_ipv6_saddr(skb);
+ other_pkts = NULL;
+ other_link = &other_pkts;
+ h = (struct homa_common_hdr *)skb->data;
+ for (skb2 = skb->next; skb2; skb2 = next) {
+ next = skb2->next;
+ h2 = (struct homa_common_hdr *)skb2->data;
+ if (h2->sender_id == h->sender_id) {
+ saddr2 = skb_canonical_ipv6_saddr(skb2);
+ if (ipv6_addr_equal(&saddr, &saddr2)) {
+ *prev_link = skb2;
+ prev_link = &skb2->next;
+ continue;
+ }
+ }
+ *other_link = skb2;
+ other_link = &skb2->next;
+ }
+ *prev_link = NULL;
+ *other_link = NULL;
+ homa_dispatch_pkts(packets, homa);
+ packets = other_pkts;
+ }
+
+ return 0;
+}
+
+/**
+ * homa_backlog_rcv() - Invoked to handle packets saved on a socket's
+ * backlog because it was locked when the packets first arrived.
+ * @sk: Homa socket that owns the packet's destination port.
+ * @skb: The incoming packet. This function takes ownership of the packet
+ * (we'll delete it).
+ *
+ * Return: Always returns 0.
+ */
+int homa_backlog_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ pr_warn_once("unimplemented backlog_rcv invoked on Homa socket\n");
+ kfree_skb(skb);
+ return 0;
+}
+
+/**
+ * homa_err_handler_v4() - Invoked by IP to handle an incoming error
+ * packet, such as ICMP UNREACHABLE.
+ * @skb: The incoming packet.
+ * @info: Additional information about the error, supplied by the IP layer.
+ *
+ * Return: zero, or a negative errno if the error couldn't be handled here.
+ */
+int homa_err_handler_v4(struct sk_buff *skb, u32 info)
+{
+ const struct icmphdr *icmp = icmp_hdr(skb);
+ struct homa *homa = global_homa;
+ struct in6_addr daddr;
+ int type = icmp->type;
+ int code = icmp->code;
+ struct iphdr *iph;
+ int error = 0;
+ int port = 0;
+
+ iph = (struct iphdr *)(skb->data);
+ ipv6_addr_set_v4mapped(iph->daddr, &daddr);
+ if (type == ICMP_DEST_UNREACH && code == ICMP_PORT_UNREACH) {
+ struct homa_common_hdr *h = (struct homa_common_hdr *)(skb->data
+ + iph->ihl * 4);
+
+ port = ntohs(h->dport);
+ error = -ENOTCONN;
+ } else if (type == ICMP_DEST_UNREACH) {
+ if (code == ICMP_PROT_UNREACH)
+ error = -EPROTONOSUPPORT;
+ else
+ error = -EHOSTUNREACH;
+ } else {
+ pr_notice("%s invoked with info %x, ICMP type %d, ICMP code %d\n",
+ __func__, info, type, code);
+ }
+ if (error != 0)
+ homa_abort_rpcs(homa, &daddr, port, error);
+ return 0;
+}
+
+/**
+ * homa_err_handler_v6() - Invoked by IP to handle an incoming error
+ * packet, such as ICMP UNREACHABLE.
+ * @skb: The incoming packet.
+ * @opt: Not used.
+ * @type: Type of ICMP packet.
+ * @code: Additional information about the error.
+ * @offset: Not used.
+ * @info: Additional information about the error, supplied by the IP layer.
+ *
+ * Return: zero, or a negative errno if the error couldn't be handled here.
+ */
+int homa_err_handler_v6(struct sk_buff *skb, struct inet6_skb_parm *opt,
+ u8 type, u8 code, int offset, __be32 info)
+{
+ const struct ipv6hdr *iph = (const struct ipv6hdr *)skb->data;
+ struct homa *homa = global_homa;
+ int error = 0;
+ int port = 0;
+
+ if (type == ICMPV6_DEST_UNREACH && code == ICMPV6_PORT_UNREACH) {
+ const struct homa_common_hdr *h;
+
+ h = (struct homa_common_hdr *)(skb->data + sizeof(*iph));
+ port = ntohs(h->dport);
+ error = -ENOTCONN;
+ } else if (type == ICMPV6_DEST_UNREACH && code == ICMPV6_ADDR_UNREACH) {
+ error = -EHOSTUNREACH;
+ } else if (type == ICMPV6_PARAMPROB && code == ICMPV6_UNK_NEXTHDR) {
+ error = -EPROTONOSUPPORT;
+ }
+ if (error != 0)
+ homa_abort_rpcs(homa, &iph->daddr, port, error);
+ return 0;
+}
+
+/**
+ * homa_poll() - Invoked by Linux as part of implementing select, poll,
+ * epoll, etc.
+ * @file: Open file that is participating in a poll, select, etc.
+ * @sock: A Homa socket, associated with @file.
+ * @wait: This table will be registered with the socket, so that it
+ * is notified when the socket's ready state changes.
+ *
+ * Return: A mask of bits such as EPOLLIN, which indicate the current
+ * state of the socket.
+ */
+__poll_t homa_poll(struct file *file, struct socket *sock,
+ struct poll_table_struct *wait)
+{
+ struct sock *sk = sock->sk;
+ __u32 mask;
+
+ sock_poll_wait(file, sock, wait);
+ mask = POLLOUT | POLLWRNORM;
+
+ if (homa_sk(sk)->shutdown)
+ mask |= POLLIN;
+
+ if (!list_empty(&homa_sk(sk)->ready_requests) ||
+ !list_empty(&homa_sk(sk)->ready_responses))
+ mask |= POLLIN | POLLRDNORM;
+ return (__poll_t)mask;
+}
+
+/**
+ * homa_hrtimer() - This function is invoked by the hrtimer mechanism to
+ * wake up the timer thread. Runs at IRQ level.
+ * @timer: The timer that triggered; not used.
+ *
+ * Return: Always HRTIMER_NORESTART.
+ */
+enum hrtimer_restart homa_hrtimer(struct hrtimer *timer)
+{
+ wake_up_process(timer_kthread);
+ return HRTIMER_NORESTART;
+}
+
+/**
+ * homa_timer_main() - Top-level function for the timer thread.
+ * @transport: Pointer to struct homa.
+ *
+ * Return: Always 0.
+ */
+int homa_timer_main(void *transport)
+{
+ struct homa *homa = (struct homa *)transport;
+
+ /* The following variable is static because hrtimer_init will
+ * complain about a stack-allocated hrtimer if in debug mode.
+ */
+ static struct hrtimer hrtimer;
+ ktime_t tick_interval;
+ u64 nsec;
+
+ hrtimer_init(&hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrtimer.function = &homa_hrtimer;
+ nsec = 1000000; /* 1 ms */
+ tick_interval = ns_to_ktime(nsec);
+ while (1) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (!exiting) {
+ hrtimer_start(&hrtimer, tick_interval,
+ HRTIMER_MODE_REL);
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+ if (exiting)
+ break;
+ homa_timer(homa);
+ }
+ hrtimer_cancel(&hrtimer);
+ kthread_complete_and_exit(&timer_thread_done, 0);
+ return 0;
+}
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_AUTHOR("John Ousterhout <ouster@cs.stanford.edu>");
+MODULE_DESCRIPTION("Homa transport protocol");
+MODULE_VERSION("1.0");
+
+/* Arrange for this module to be loaded automatically when a Homa socket is
+ * opened. Apparently symbols don't work in the macros below, so must use
+ * numeric values for IPPROTO_HOMA (146) and SOCK_DGRAM(2).
+ */
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_INET, 146, 2);
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_INET6, 146, 2);
diff --git a/net/homa/homa_utils.c b/net/homa/homa_utils.c
new file mode 100644
index 000000000000..ac851eaff8b6
--- /dev/null
+++ b/net/homa/homa_utils.c
@@ -0,0 +1,166 @@
+// SPDX-License-Identifier: BSD-2-Clause
+
+/* This file contains miscellaneous utility functions for Homa, such
+ * as initializing and destroying homa structs.
+ */
+
+#include "homa_impl.h"
+#include "homa_peer.h"
+#include "homa_rpc.h"
+#include "homa_stub.h"
+
+struct completion homa_pacer_kthread_done;
+
+/**
+ * homa_init() - Constructor for homa objects.
+ * @homa: Object to initialize.
+ *
+ * Return: 0 on success, or a negative errno if there was an error. Even
+ * if an error occurs, it is safe (and necessary) to call
+ * homa_destroy at some point.
+ */
+int homa_init(struct homa *homa)
+{
+ int err;
+
+ homa->pacer_kthread = NULL;
+ init_completion(&homa_pacer_kthread_done);
+ atomic64_set(&homa->next_outgoing_id, 2);
+ atomic64_set(&homa->link_idle_time, sched_clock());
+ spin_lock_init(&homa->pacer_mutex);
+ homa->pacer_fifo_fraction = 50;
+ homa->pacer_fifo_count = 1;
+ homa->pacer_wake_time = 0;
+ spin_lock_init(&homa->throttle_lock);
+ INIT_LIST_HEAD_RCU(&homa->throttled_rpcs);
+ homa->throttle_add = 0;
+ homa->throttle_min_bytes = 200;
+ homa->prev_default_port = HOMA_MIN_DEFAULT_PORT - 1;
+ homa->port_map = kmalloc(sizeof(*homa->port_map), GFP_KERNEL);
+ if (!homa->port_map) {
+ pr_err("%s couldn't create port_map: kmalloc failure",
+ __func__);
+ return -ENOMEM;
+ }
+ homa_socktab_init(homa->port_map);
+ homa->peers = kmalloc(sizeof(*homa->peers), GFP_KERNEL);
+ if (!homa->peers) {
+ pr_err("%s couldn't create peers: kmalloc failure", __func__);
+ return -ENOMEM;
+ }
+ err = homa_peertab_init(homa->peers);
+ if (err) {
+ pr_err("%s couldn't initialize peer table (errno %d)\n",
+ __func__, -err);
+ return err;
+ }
+
+ /* Wild guesses to initialize configuration values... */
+ homa->link_mbps = 25000;
+ homa->resend_ticks = 5;
+ homa->resend_interval = 5;
+ homa->timeout_ticks = 100;
+ homa->timeout_resends = 5;
+ homa->request_ack_ticks = 2;
+ homa->reap_limit = 10;
+ homa->dead_buffs_limit = 5000;
+ homa->max_dead_buffs = 0;
+ homa->pacer_kthread = kthread_run(homa_pacer_main, homa,
+ "homa_pacer");
+ if (IS_ERR(homa->pacer_kthread)) {
+ err = PTR_ERR(homa->pacer_kthread);
+ homa->pacer_kthread = NULL;
+ pr_err("couldn't create homa pacer thread: error %d\n", err);
+ return err;
+ }
+ homa->pacer_exit = false;
+ homa->max_nic_queue_ns = 5000;
+ homa->ns_per_mbyte = 0;
+ homa->max_gso_size = 10000;
+ homa->gso_force_software = 0;
+ homa->max_gro_skbs = 20;
+ homa->gro_policy = HOMA_GRO_NORMAL;
+ homa->timer_ticks = 0;
+ homa->flags = 0;
+ homa->bpage_lease_usecs = 10000;
+ homa->next_id = 0;
+ homa_outgoing_sysctl_changed(homa);
+ homa_incoming_sysctl_changed(homa);
+ return 0;
+}
+
+/**
+ * homa_destroy() - Destructor for homa objects.
+ * @homa: Object to destroy.
+ */
+void homa_destroy(struct homa *homa)
+{
+ if (homa->pacer_kthread) {
+ homa_pacer_stop(homa);
+ wait_for_completion(&homa_pacer_kthread_done);
+ }
+
+ /* The order of the following statements matters! */
+ if (homa->port_map) {
+ homa_socktab_destroy(homa->port_map);
+ kfree(homa->port_map);
+ homa->port_map = NULL;
+ }
+ if (homa->peers) {
+ homa_peertab_destroy(homa->peers);
+ kfree(homa->peers);
+ homa->peers = NULL;
+ }
+}
+
+/**
+ * homa_symbol_for_type() - Returns a printable string describing a packet type.
+ * @type: A value from those defined by &homa_packet_type.
+ *
+ * Return: A static string holding the packet type corresponding to @type.
+ */
+char *homa_symbol_for_type(uint8_t type)
+{
+ switch (type) {
+ case DATA:
+ return "DATA";
+ case RESEND:
+ return "RESEND";
+ case UNKNOWN:
+ return "UNKNOWN";
+ case BUSY:
+ return "BUSY";
+ case NEED_ACK:
+ return "NEED_ACK";
+ case ACK:
+ return "ACK";
+ }
+ return "??";
+}
+
+/**
+ * homa_spin() - Delay (without sleeping) for a given time interval.
+ * @ns: How long to delay (in nanoseconds)
+ */
+void homa_spin(int ns)
+{
+ __u64 end;
+
+ end = sched_clock() + ns;
+ while (sched_clock() < end)
+ /* Empty loop body. */
+ ;
+}
+
+/**
+ * homa_throttle_lock_slow() - This function implements the slow path for
+ * acquiring the throttle lock. It is invoked when the lock isn't immediately
+ * available. It waits for the lock, but also records statistics about
+ * the waiting time.
+ * @homa: Overall data about the Homa protocol implementation.
+ */
+void homa_throttle_lock_slow(struct homa *homa)
+ __acquires(&homa->throttle_lock)
+{
+ spin_lock_bh(&homa->throttle_lock);
+}
--
2.34.1
* [PATCH net-next v6 12/12] net: homa: create Makefile and Kconfig
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (10 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 11/12] net: homa: create homa_plumbing.c and homa_utils.c John Ousterhout
@ 2025-01-15 18:59 ` John Ousterhout
2025-01-24 8:55 ` [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol Paolo Abeni
12 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-15 18:59 UTC (permalink / raw)
To: netdev; +Cc: pabeni, edumazet, horms, kuba, John Ousterhout
Before this commit the Homa code is "inert": it won't be compiled
in kernel builds. This commit adds Homa's Makefile and Kconfig, and
also links Homa into net/Makefile and net/Kconfig, so that Homa
will be built during kernel builds if enabled (it is disabled by
default).
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
net/Kconfig | 1 +
net/Makefile | 1 +
net/homa/Kconfig | 19 +++++++++++++++++++
net/homa/Makefile | 14 ++++++++++++++
4 files changed, 35 insertions(+)
create mode 100644 net/homa/Kconfig
create mode 100644 net/homa/Makefile
diff --git a/net/Kconfig b/net/Kconfig
index c3fca69a7c83..d6df0595d1d5 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -247,6 +247,7 @@ endif
source "net/dccp/Kconfig"
source "net/sctp/Kconfig"
+source "net/homa/Kconfig"
source "net/rds/Kconfig"
source "net/tipc/Kconfig"
source "net/atm/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 60ed5190eda8..516b17d0bc6f 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -44,6 +44,7 @@ obj-y += 8021q/
endif
obj-$(CONFIG_IP_DCCP) += dccp/
obj-$(CONFIG_IP_SCTP) += sctp/
+obj-$(CONFIG_HOMA) += homa/
obj-$(CONFIG_RDS) += rds/
obj-$(CONFIG_WIRELESS) += wireless/
obj-$(CONFIG_MAC80211) += mac80211/
diff --git a/net/homa/Kconfig b/net/homa/Kconfig
new file mode 100644
index 000000000000..3e623906612f
--- /dev/null
+++ b/net/homa/Kconfig
@@ -0,0 +1,19 @@
+# SPDX-License-Identifier: BSD-2-Clause
+#
+# Homa transport protocol
+#
+
+menuconfig HOMA
+ tristate "The Homa transport protocol"
+ depends on INET
+ depends on IPV6
+
+ help
+ Homa is a network transport protocol for communication within
+ a datacenter. It provides significantly lower latency than TCP,
+ particularly for workloads containing a mixture of large and small
+ messages operating at high network utilization. For more information
+ see the homa(7) man page or check out the Homa Wiki at
+ https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview.
+
+ If unsure, say N.
diff --git a/net/homa/Makefile b/net/homa/Makefile
new file mode 100644
index 000000000000..3eb192a6ffa6
--- /dev/null
+++ b/net/homa/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: BSD-2-Clause
+#
+# Makefile for the Linux implementation of the Homa transport protocol.
+
+obj-$(CONFIG_HOMA) := homa.o
+homa-y:= homa_incoming.o \
+ homa_outgoing.o \
+ homa_peer.o \
+ homa_pool.o \
+ homa_plumbing.o \
+ homa_rpc.o \
+ homa_sock.o \
+ homa_timer.o \
+ homa_utils.o
--
2.34.1
* Re: [PATCH net-next v6 03/12] net: homa: create shared Homa header files
2025-01-15 18:59 ` [PATCH net-next v6 03/12] net: homa: create shared Homa header files John Ousterhout
@ 2025-01-23 11:01 ` Paolo Abeni
2025-01-24 21:21 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-23 11:01 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 1/15/25 7:59 PM, John Ousterhout wrote:
[...]
> +/**
> + * union sockaddr_in_union - Holds either an IPv4 or IPv6 address (smaller
> + * and easier to use than sockaddr_storage).
> + */
> +union sockaddr_in_union {
> + /** @sa: Used to access as a generic sockaddr. */
> + struct sockaddr sa;
> +
> + /** @in4: Used to access as IPv4 socket. */
> + struct sockaddr_in in4;
> +
> + /** @in6: Used to access as IPv6 socket. */
> + struct sockaddr_in6 in6;
> +};
There are other protocols using the same struct with a different name
(sctp) or a very similar struct (mptcp). It would be nice to move this
into a shared header and allow re-use.
[...]
> + /**
> + * @core: Core on which @thread was executing when it registered
> + * its interest. Used for load balancing (see balance.txt).
> + */
> + int core;
I don't see a 'balance.txt' file in this submission; possibly a stray
reference?
[...]
> + /**
> + * @pacer_wake_time: time (in sched_clock units) when the pacer last
> + * woke up (if the pacer is running) or 0 if the pacer is sleeping.
> + */
> + __u64 pacer_wake_time;
Why do you use the '__' variant here? This is not uapi; you should use
the plain u64/u32 (more occurrences below).
[...]
> + /**
> + * @prev_default_port: The most recent port number assigned from
> + * the range of default ports.
> + */
> + __u16 prev_default_port __aligned(L1_CACHE_BYTES);
I think the idiomatic way to express the above is to use:
u16 prev_default_port ____cacheline_aligned;
or
u16 prev_default_port ____cacheline_aligned_in_smp;
More similar occurrences below.
/P
* Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-15 18:59 ` [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
@ 2025-01-23 12:06 ` Paolo Abeni
2025-01-24 23:53 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-23 12:06 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 1/15/25 7:59 PM, John Ousterhout wrote:
> These files implement Homa's mechanism for managing application-level
> buffer space for incoming messages. This mechanism is needed to allow
> Homa to copy data out to user space in parallel with receiving packets;
> it was discussed in a talk at NetDev 0x17.
>
> Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
> ---
> net/homa/homa_pool.c | 453 +++++++++++++++++++++++++++++++++++++++++++
> net/homa/homa_pool.h | 154 +++++++++++++++
> 2 files changed, 607 insertions(+)
> create mode 100644 net/homa/homa_pool.c
> create mode 100644 net/homa/homa_pool.h
>
> diff --git a/net/homa/homa_pool.c b/net/homa/homa_pool.c
> new file mode 100644
> index 000000000000..0b2ec83b6174
> --- /dev/null
> +++ b/net/homa/homa_pool.c
> @@ -0,0 +1,453 @@
> +// SPDX-License-Identifier: BSD-2-Clause
> +
> +#include "homa_impl.h"
> +#include "homa_pool.h"
> +
> +/* This file contains functions that manage user-space buffer pools. */
> +
> +/* Pools must always have at least this many bpages (no particular
> + * reasoning behind this value).
> + */
> +#define MIN_POOL_SIZE 2
> +
> +/* Used when determining how many bpages to consider for allocation. */
> +#define MIN_EXTRA 4
> +
> +/**
> + * set_bpages_needed() - Set the bpages_needed field of @pool based
> + * on the length of the first RPC that's waiting for buffer space.
> + * The caller must own the lock for @pool->hsk.
> + * @pool: Pool to update.
> + */
> +static void set_bpages_needed(struct homa_pool *pool)
> +{
> + struct homa_rpc *rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
> + struct homa_rpc, buf_links);
> + pool->bpages_needed = (rpc->msgin.length + HOMA_BPAGE_SIZE - 1)
> + >> HOMA_BPAGE_SHIFT;
> +}
> +
> +/**
> + * homa_pool_init() - Initialize a homa_pool; any previous contents are
> + * destroyed.
> + * @hsk: Socket containing the pool to initialize.
> + * @region: First byte of the memory region for the pool, allocated
> + * by the application; must be page-aligned.
> + * @region_size: Total number of bytes available at @buf_region.
> + * Return: Either zero (for success) or a negative errno for failure.
> + */
> +int homa_pool_init(struct homa_sock *hsk, void __user *region,
> + __u64 region_size)
> +{
> + struct homa_pool *pool = hsk->buffer_pool;
> + int i, result;
> +
> + homa_pool_destroy(hsk->buffer_pool);
> +
> + if (((uintptr_t)region) & ~PAGE_MASK)
> + return -EINVAL;
> + pool->hsk = hsk;
> + pool->region = (char __user *)region;
> + pool->num_bpages = region_size >> HOMA_BPAGE_SHIFT;
> + pool->descriptors = NULL;
> + pool->cores = NULL;
> + if (pool->num_bpages < MIN_POOL_SIZE) {
> + result = -EINVAL;
> + goto error;
> + }
> + pool->descriptors = kmalloc_array(pool->num_bpages,
> + sizeof(struct homa_bpage),
> + GFP_ATOMIC);
Possibly worth adding '| __GFP_ZERO' to avoid zeroing some fields later.
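Something like (untested):

	pool->descriptors = kmalloc_array(pool->num_bpages,
					  sizeof(struct homa_bpage),
					  GFP_ATOMIC | __GFP_ZERO);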
> + if (!pool->descriptors) {
> + result = -ENOMEM;
> + goto error;
> + }
> + for (i = 0; i < pool->num_bpages; i++) {
> + struct homa_bpage *bp = &pool->descriptors[i];
> +
> + spin_lock_init(&bp->lock);
> + atomic_set(&bp->refs, 0);
> + bp->owner = -1;
> + bp->expiration = 0;
> + }
> + atomic_set(&pool->free_bpages, pool->num_bpages);
> + pool->bpages_needed = INT_MAX;
> +
> + /* Allocate and initialize core-specific data. */
> + pool->cores = kmalloc_array(nr_cpu_ids, sizeof(struct homa_pool_core),
> + GFP_ATOMIC);
Uhm... on a large system this could be an order-3 allocation, which in
turn could fail quite easily under memory pressure, and it looks
contradictory WRT the cover letter's statement about reducing the amount
of per-socket state.
Why don't you use alloc_percpu_gfp() here?
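For illustration only (untested, and 'cores' would become a __percpu
pointer):

	pool->cores = alloc_percpu_gfp(struct homa_pool_core, GFP_ATOMIC);
	if (!pool->cores) {
		result = -ENOMEM;
		goto error;
	}

with accesses going through this_cpu_ptr()/per_cpu_ptr() and the memory
released via free_percpu().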
> +/**
> + * homa_pool_get_pages() - Allocate one or more full pages from the pool.
> + * @pool: Pool from which to allocate pages
> + * @num_pages: Number of pages needed
> + * @pages: The indices of the allocated pages are stored here; caller
> + * must ensure this array is big enough. Reference counts have
> + * been set to 1 on all of these pages (or 2 if set_owner
> + * was specified).
> + * @set_owner: If nonzero, the current core is marked as owner of all
> + * of the allocated pages (and the expiration time is also
> + * set). Otherwise the pages are left unowned.
> + * Return: 0 for success, -1 if there wasn't enough free space in the pool.
> + */
> +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, __u32 *pages,
> + int set_owner)
> +{
> + int core_num = raw_smp_processor_id();
Why the 'raw' variant? If this code is pre-emptible it means another
process could be scheduled on the same core...
> + struct homa_pool_core *core;
> + __u64 now = sched_clock();
> + int alloced = 0;
> + int limit = 0;
> +
> + core = &pool->cores[core_num];
> + if (atomic_sub_return(num_pages, &pool->free_bpages) < 0) {
> + atomic_add(num_pages, &pool->free_bpages);
> + return -1;
> + }
> +
> + /* Once we get to this point we know we will be able to find
> + * enough free pages; now we just have to find them.
> + */
> + while (alloced != num_pages) {
> + struct homa_bpage *bpage;
> + int cur, ref_count;
> +
> + /* If we don't need to use all of the bpages in the pool,
> + * then try to use only the ones with low indexes. This
> + * will reduce the cache footprint for the pool by reusing
> + * a few bpages over and over. Specifically this code will
> + * not consider any candidate page whose index is >= limit.
> + * Limit is chosen to make sure there are a reasonable
> + * number of free pages in the range, so we won't have to
> + * check a huge number of pages.
> + */
> + if (limit == 0) {
> + int extra;
> +
> + limit = pool->num_bpages
> + - atomic_read(&pool->free_bpages);
> + extra = limit >> 2;
> + limit += (extra < MIN_EXTRA) ? MIN_EXTRA : extra;
> + if (limit > pool->num_bpages)
> + limit = pool->num_bpages;
> + }
> +
> + cur = core->next_candidate;
> + core->next_candidate++;
... here, making this increment racy.
> + if (cur >= limit) {
> + core->next_candidate = 0;
> +
> + /* Must recompute the limit for each new loop through
> + * the bpage array: we may need to consider a larger
> + * range of pages because of concurrent allocations.
> + */
> + limit = 0;
> + continue;
> + }
> + bpage = &pool->descriptors[cur];
> +
> + /* Figure out whether this candidate is free (or can be
> + * stolen). Do a quick check without locking the page, and
> + * if the page looks promising, then lock it and check again
> + * (must check again in case someone else snuck in and
> + * grabbed the page).
> + */
> + ref_count = atomic_read(&bpage->refs);
> + if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
> + bpage->expiration > now)))
The above conditions could be placed in a separate helper, making the
code easier to follow and avoiding some duplication.
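E.g. a small helper along these lines (the name is just a suggestion):

static bool homa_bpage_busy(struct homa_bpage *bpage, __u64 now)
{
	int refs = atomic_read(&bpage->refs);

	return refs >= 2 || (refs == 1 && (bpage->owner < 0 ||
					   bpage->expiration > now));
}

so that both the unlocked and the locked check become
'if (homa_bpage_busy(bpage, now))'.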
> + continue;
> + if (!spin_trylock_bh(&bpage->lock))
> + continue;
> + ref_count = atomic_read(&bpage->refs);
> + if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
> + bpage->expiration > now))) {
> + spin_unlock_bh(&bpage->lock);
> + continue;
> + }
> + if (bpage->owner >= 0)
> + atomic_inc(&pool->free_bpages);
> + if (set_owner) {
> + atomic_set(&bpage->refs, 2);
> + bpage->owner = core_num;
> + bpage->expiration = now + 1000 *
> + pool->hsk->homa->bpage_lease_usecs;
> + } else {
> + atomic_set(&bpage->refs, 1);
> + bpage->owner = -1;
> + }
> + spin_unlock_bh(&bpage->lock);
> + pages[alloced] = cur;
> + alloced++;
> + }
> + return 0;
> +}
> +
> +/**
> + * homa_pool_allocate() - Allocate buffer space for an RPC.
> + * @rpc: RPC that needs space allocated for its incoming message (space must
> + * not already have been allocated). The fields @msgin->num_buffers
> + * and @msgin->buffers are filled in. Must be locked by caller.
> + * Return: The return value is normally 0, which means either buffer space
> + * was allocated or the @rpc was queued on @hsk->waiting. If a fatal error
> + * occurred, such as no buffer pool present, then a negative errno is
> + * returned.
> + */
> +int homa_pool_allocate(struct homa_rpc *rpc)
> +{
> + struct homa_pool *pool = rpc->hsk->buffer_pool;
> + int full_pages, partial, i, core_id;
> + __u32 pages[HOMA_MAX_BPAGES];
> + struct homa_pool_core *core;
> + struct homa_bpage *bpage;
> + struct homa_rpc *other;
> +
> + if (!pool->region)
> + return -ENOMEM;
> +
> + /* First allocate any full bpages that are needed. */
> + full_pages = rpc->msgin.length >> HOMA_BPAGE_SHIFT;
> + if (unlikely(full_pages)) {
> + if (homa_pool_get_pages(pool, full_pages, pages, 0) != 0)
full_pages must be less than HOMA_MAX_BPAGES, but I don't see any check
that limits the length of incoming messages?!
> + goto out_of_space;
> + for (i = 0; i < full_pages; i++)
> + rpc->msgin.bpage_offsets[i] = pages[i] <<
> + HOMA_BPAGE_SHIFT;
> + }
> + rpc->msgin.num_bpages = full_pages;
> +
> + /* The last chunk may be less than a full bpage; for this we use
> + * the bpage that we own (and reuse it for multiple messages).
> + */
> + partial = rpc->msgin.length & (HOMA_BPAGE_SIZE - 1);
> + if (unlikely(partial == 0))
> + goto success;
> + core_id = raw_smp_processor_id();
> + core = &pool->cores[core_id];
> + bpage = &pool->descriptors[core->page_hint];
> + if (!spin_trylock_bh(&bpage->lock))
> + spin_lock_bh(&bpage->lock);
> + if (bpage->owner != core_id) {
> + spin_unlock_bh(&bpage->lock);
> + goto new_page;
> + }
> + if ((core->allocated + partial) > HOMA_BPAGE_SIZE) {
> + if (atomic_read(&bpage->refs) == 1) {
> + /* Bpage is totally free, so we can reuse it. */
> + core->allocated = 0;
> + } else {
> + bpage->owner = -1;
> +
> + /* We know the reference count can't reach zero here
> + * because of check above, so we won't have to decrement
> + * pool->free_bpages.
> + */
> + atomic_dec_return(&bpage->refs);
> + spin_unlock_bh(&bpage->lock);
> + goto new_page;
> + }
> + }
> + bpage->expiration = sched_clock() +
> + 1000 * pool->hsk->homa->bpage_lease_usecs;
> + atomic_inc(&bpage->refs);
> + spin_unlock_bh(&bpage->lock);
> + goto allocate_partial;
> +
> + /* Can't use the current page; get another one. */
> +new_page:
> + if (homa_pool_get_pages(pool, 1, pages, 1) != 0) {
> + homa_pool_release_buffers(pool, rpc->msgin.num_bpages,
> + rpc->msgin.bpage_offsets);
> + rpc->msgin.num_bpages = 0;
> + goto out_of_space;
> + }
> + core->page_hint = pages[0];
> + core->allocated = 0;
> +
> +allocate_partial:
> + rpc->msgin.bpage_offsets[rpc->msgin.num_bpages] = core->allocated
> + + (core->page_hint << HOMA_BPAGE_SHIFT);
> + rpc->msgin.num_bpages++;
> + core->allocated += partial;
> +
> +success:
> + return 0;
> +
> + /* We get here if there wasn't enough buffer space for this
> + * message; add the RPC to hsk->waiting_for_bufs.
> + */
> +out_of_space:
> + homa_sock_lock(pool->hsk, "homa_pool_allocate");
There is a bit of a chicken-and-egg issue, with homa_sock_lock() being
defined only later in the series, but it looks like the string argument
is never used.
> + list_for_each_entry(other, &pool->hsk->waiting_for_bufs, buf_links) {
> + if (other->msgin.length > rpc->msgin.length) {
> + list_add_tail(&rpc->buf_links, &other->buf_links);
> + goto queued;
> + }
> + }
> + list_add_tail_rcu(&rpc->buf_links, &pool->hsk->waiting_for_bufs);
> +
> +queued:
> + set_bpages_needed(pool);
> + homa_sock_unlock(pool->hsk);
> + return 0;
> +}
> +
> +/**
> + * homa_pool_get_buffer() - Given an RPC, figure out where to store incoming
> + * message data.
> + * @rpc: RPC for which incoming message data is being processed; its
> + * msgin must be properly initialized and buffer space must have
> + * been allocated for the message.
> + * @offset: Offset within @rpc's incoming message.
> + * @available: Will be filled in with the number of bytes of space available
> + * at the returned address (could be zero if offset is
> + * (erroneously) past the end of the message).
> + * Return: The application's virtual address for buffer space corresponding
> + * to @offset in the incoming message for @rpc.
> + */
> +void __user *homa_pool_get_buffer(struct homa_rpc *rpc, int offset,
> + int *available)
> +{
> + int bpage_index, bpage_offset;
> +
> + bpage_index = offset >> HOMA_BPAGE_SHIFT;
> + if (offset >= rpc->msgin.length) {
> + WARN_ONCE(true, "%s got offset %d >= message length %d\n",
> + __func__, offset, rpc->msgin.length);
> + *available = 0;
> + return NULL;
> + }
> + bpage_offset = offset & (HOMA_BPAGE_SIZE - 1);
> + *available = (bpage_index < (rpc->msgin.num_bpages - 1))
> + ? HOMA_BPAGE_SIZE - bpage_offset
> + : rpc->msgin.length - offset;
> + return rpc->hsk->buffer_pool->region +
> + rpc->msgin.bpage_offsets[bpage_index] + bpage_offset;
> +}
> +
> +/**
> + * homa_pool_release_buffers() - Release buffer space so that it can be
> + * reused.
> + * @pool: Pool that the buffer space belongs to. Doesn't need to
> + * be locked.
> + * @num_buffers: How many buffers to release.
> + * @buffers: Points to @num_buffers values, each of which is an offset
> + * from the start of the pool to the buffer to be released.
> + * Return: 0 for success, otherwise a negative errno.
> + */
> +int homa_pool_release_buffers(struct homa_pool *pool, int num_buffers,
> + __u32 *buffers)
> +{
> + int result = 0;
> + int i;
> +
> + if (!pool->region)
> + return result;
> + for (i = 0; i < num_buffers; i++) {
> + __u32 bpage_index = buffers[i] >> HOMA_BPAGE_SHIFT;
> + struct homa_bpage *bpage = &pool->descriptors[bpage_index];
> +
> + if (bpage_index < pool->num_bpages) {
> + if (atomic_dec_return(&bpage->refs) == 0)
> + atomic_inc(&pool->free_bpages);
> + } else {
> + result = -EINVAL;
> + }
> + }
> + return result;
> +}
> +
> +/**
> + * homa_pool_check_waiting() - Checks to see if there are enough free
> + * bpages to wake up any RPCs that were blocked. Whenever
> + * homa_pool_release_buffers is invoked, this function must be invoked later,
> + * at a point when the caller holds no locks (homa_pool_release_buffers may
> + * be invoked with locks held, so it can't safely invoke this function).
> + * This is regrettably tricky, but I can't think of a better solution.
> + * @pool: Information about the buffer pool.
> + */
> +void homa_pool_check_waiting(struct homa_pool *pool)
> +{
> + if (!pool->region)
> + return;
> + while (atomic_read(&pool->free_bpages) >= pool->bpages_needed) {
> + struct homa_rpc *rpc;
> +
> + homa_sock_lock(pool->hsk, "buffer pool");
> + if (list_empty(&pool->hsk->waiting_for_bufs)) {
> + pool->bpages_needed = INT_MAX;
> + homa_sock_unlock(pool->hsk);
> + break;
> + }
> + rpc = list_first_entry(&pool->hsk->waiting_for_bufs,
> + struct homa_rpc, buf_links);
> + if (!homa_rpc_try_lock(rpc, "homa_pool_check_waiting")) {
> + /* Can't just spin on the RPC lock because we're
> + * holding the socket lock (see sync.txt). Instead,
Stray reference to sync.txt. It would be nice to have the locking
scheme described start to finish somewhere in this series.
> + * release the socket lock and try the entire
> + * operation again.
> + */
> + homa_sock_unlock(pool->hsk);
> + continue;
> + }
> + list_del_init(&rpc->buf_links);
> + if (list_empty(&pool->hsk->waiting_for_bufs))
> + pool->bpages_needed = INT_MAX;
> + else
> + set_bpages_needed(pool);
> + homa_sock_unlock(pool->hsk);
> + homa_pool_allocate(rpc);
> + if (rpc->msgin.num_bpages > 0)
> + /* Allocation succeeded; "wake up" the RPC. */
> + rpc->msgin.resend_all = 1;
> + homa_rpc_unlock(rpc);
> + }
> +}
> diff --git a/net/homa/homa_pool.h b/net/homa/homa_pool.h
> new file mode 100644
> index 000000000000..6dbe7d77dd07
> --- /dev/null
> +++ b/net/homa/homa_pool.h
> @@ -0,0 +1,154 @@
> +/* SPDX-License-Identifier: BSD-2-Clause */
> +
> +/* This file contains definitions used to manage user-space buffer pools.
> + */
> +
> +#ifndef _HOMA_POOL_H
> +#define _HOMA_POOL_H
> +
> +#include "homa_rpc.h"
> +
> +/**
> + * struct homa_bpage - Contains information about a single page in
> + * a buffer pool.
> + */
> +struct homa_bpage {
> + union {
> + /**
> + * @cache_line: Ensures that each homa_bpage object
> + * is exactly one cache line long.
> + */
> + char cache_line[L1_CACHE_BYTES];
> + struct {
> + /** @lock: to synchronize shared access. */
> + spinlock_t lock;
> +
> + /**
> + * @refs: Counts number of distinct uses of this
> + * bpage (1 tick for each message that is using
> + * this page, plus an additional tick if the @owner
> + * field is set).
> + */
> + atomic_t refs;
> +
> + /**
> + * @owner: kernel core that currently owns this page
> + * (< 0 if none).
> + */
> + int owner;
> +
> + /**
> + * @expiration: time (in sched_clock() units) after
> + * which it's OK to steal this page from its current
> + * owner (if @refs is 1).
> + */
> + __u64 expiration;
> + };
____cacheline_aligned instead of inserting the struct into a union
should suffice.
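i.e. roughly:

struct homa_bpage {
	spinlock_t lock;
	atomic_t refs;
	int owner;
	__u64 expiration;
} ____cacheline_aligned;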
/P
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-15 18:59 ` [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
@ 2025-01-23 14:29 ` Paolo Abeni
2025-01-27 5:22 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-23 14:29 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 1/15/25 7:59 PM, John Ousterhout wrote:
> These files provide basic functions for managing remote procedure calls,
> which are the fundamental entities managed by Homa. Each RPC consists
> of a request message from a client to a server, followed by a response
> message returned from the server to the client.
>
> Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
> ---
> net/homa/homa_rpc.c | 494 ++++++++++++++++++++++++++++++++++++++++++++
> net/homa/homa_rpc.h | 458 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 952 insertions(+)
> create mode 100644 net/homa/homa_rpc.c
> create mode 100644 net/homa/homa_rpc.h
>
> diff --git a/net/homa/homa_rpc.c b/net/homa/homa_rpc.c
> new file mode 100644
> index 000000000000..cc8450c984f8
> --- /dev/null
> +++ b/net/homa/homa_rpc.c
> @@ -0,0 +1,494 @@
> +// SPDX-License-Identifier: BSD-2-Clause
> +
> +/* This file contains functions for managing homa_rpc structs. */
> +
> +#include "homa_impl.h"
> +#include "homa_peer.h"
> +#include "homa_pool.h"
> +#include "homa_stub.h"
> +
> +/**
> + * homa_rpc_new_client() - Allocate and construct a client RPC (one that is used
> + * to issue an outgoing request). Doesn't send any packets. Invoked with no
> + * locks held.
> + * @hsk: Socket to which the RPC belongs.
> + * @dest: Address of host (ip and port) to which the RPC will be sent.
> + *
> + * Return: A pointer to the newly allocated object, or a negative
> + * errno if an error occurred. The RPC will be locked; the
> + * caller must eventually unlock it.
> + */
> +struct homa_rpc *homa_rpc_new_client(struct homa_sock *hsk,
> + const union sockaddr_in_union *dest)
> + __acquires(&crpc->bucket->lock)
> +{
> + struct in6_addr dest_addr_as_ipv6 = canonical_ipv6_addr(dest);
> + struct homa_rpc_bucket *bucket;
> + struct homa_rpc *crpc;
> + int err;
> +
> + crpc = kmalloc(sizeof(*crpc), GFP_KERNEL);
> + if (unlikely(!crpc))
> + return ERR_PTR(-ENOMEM);
> +
> + /* Initialize fields that don't require the socket lock. */
> + crpc->hsk = hsk;
> + crpc->id = atomic64_fetch_add(2, &hsk->homa->next_outgoing_id);
> + bucket = homa_client_rpc_bucket(hsk, crpc->id);
> + crpc->bucket = bucket;
> + crpc->state = RPC_OUTGOING;
> + atomic_set(&crpc->flags, 0);
> + crpc->peer = homa_peer_find(hsk->homa->peers, &dest_addr_as_ipv6,
> + &hsk->inet);
> + if (IS_ERR(crpc->peer)) {
> + err = PTR_ERR(crpc->peer);
> + goto error;
> + }
> + crpc->dport = ntohs(dest->in6.sin6_port);
> + crpc->completion_cookie = 0;
> + crpc->error = 0;
> + crpc->msgin.length = -1;
> + crpc->msgin.num_bpages = 0;
> + memset(&crpc->msgout, 0, sizeof(crpc->msgout));
> + crpc->msgout.length = -1;
> + INIT_LIST_HEAD(&crpc->ready_links);
> + INIT_LIST_HEAD(&crpc->buf_links);
> + INIT_LIST_HEAD(&crpc->dead_links);
> + crpc->interest = NULL;
> + INIT_LIST_HEAD(&crpc->throttled_links);
> + crpc->silent_ticks = 0;
> + crpc->resend_timer_ticks = hsk->homa->timer_ticks;
> + crpc->done_timer_ticks = 0;
> + crpc->magic = HOMA_RPC_MAGIC;
> + crpc->start_ns = sched_clock();
> +
> + /* Initialize fields that require locking. This allows the most
> + * expensive work, such as copying in the message from user space,
> + * to be performed without holding locks. Also, can't hold spin
> + * locks while doing things that could block, such as memory allocation.
> + */
> + homa_bucket_lock(bucket, crpc->id, "homa_rpc_new_client");
> + homa_sock_lock(hsk, "homa_rpc_new_client");
> + if (hsk->shutdown) {
> + homa_sock_unlock(hsk);
> + homa_rpc_unlock(crpc);
> + err = -ESHUTDOWN;
> + goto error;
> + }
> + hlist_add_head(&crpc->hash_links, &bucket->rpcs);
> + list_add_tail_rcu(&crpc->active_links, &hsk->active_rpcs);
> + homa_sock_unlock(hsk);
> +
> + return crpc;
> +
> +error:
> + kfree(crpc);
> + return ERR_PTR(err);
> +}
> +
> +/**
> + * homa_rpc_new_server() - Allocate and construct a server RPC (one that is
> + * used to manage an incoming request). If appropriate, the RPC will also
> + * be handed off (we do it here, while we have the socket locked, to avoid
> + * acquiring the socket lock a second time later for the handoff).
> + * @hsk: Socket that owns this RPC.
> + * @source: IP address (network byte order) of the RPC's client.
> + * @h: Header for the first data packet received for this RPC; used
> + * to initialize the RPC.
> + * @created: Will be set to 1 if a new RPC was created and 0 if an
> + * existing RPC was found.
> + *
> + * Return: A pointer to a new RPC, which is locked, or a negative errno
> + * if an error occurred. If there is already an RPC corresponding
> + * to h, then it is returned instead of creating a new RPC.
> + */
> +struct homa_rpc *homa_rpc_new_server(struct homa_sock *hsk,
> + const struct in6_addr *source,
> + struct homa_data_hdr *h, int *created)
> + __acquires(&srpc->bucket->lock)
> +{
> + __u64 id = homa_local_id(h->common.sender_id);
> + struct homa_rpc_bucket *bucket;
> + struct homa_rpc *srpc = NULL;
> + int err;
> +
> + /* Lock the bucket, and make sure no-one else has already created
> + * the desired RPC.
> + */
> + bucket = homa_server_rpc_bucket(hsk, id);
> + homa_bucket_lock(bucket, id, "homa_rpc_new_server");
> + hlist_for_each_entry_rcu(srpc, &bucket->rpcs, hash_links) {
> + if (srpc->id == id &&
> + srpc->dport == ntohs(h->common.sport) &&
> + ipv6_addr_equal(&srpc->peer->addr, source)) {
> + /* RPC already exists; just return it instead
> + * of creating a new RPC.
> + */
> + *created = 0;
> + return srpc;
> + }
> + }
How many RPCs are expected to exist concurrently on a real server? With
1024 buckets there could be a lot of them on each/some list, and a linear
search could be very expensive. And this happens with BH disabled.
> +
> + /* Initialize fields that don't require the socket lock. */
> + srpc = kmalloc(sizeof(*srpc), GFP_ATOMIC);
You could do the allocation outside the bucket lock, too and avoid the
ATOMIC flag.
> + if (!srpc) {
> + err = -ENOMEM;
> + goto error;
> + }
> + srpc->hsk = hsk;
> + srpc->bucket = bucket;
> + srpc->state = RPC_INCOMING;
> + atomic_set(&srpc->flags, 0);
> + srpc->peer = homa_peer_find(hsk->homa->peers, source, &hsk->inet);
> + if (IS_ERR(srpc->peer)) {
> + err = PTR_ERR(srpc->peer);
> + goto error;
> + }
> + srpc->dport = ntohs(h->common.sport);
> + srpc->id = id;
> + srpc->completion_cookie = 0;
> + srpc->error = 0;
> + srpc->msgin.length = -1;
> + srpc->msgin.num_bpages = 0;
> + memset(&srpc->msgout, 0, sizeof(srpc->msgout));
> + srpc->msgout.length = -1;
> + INIT_LIST_HEAD(&srpc->ready_links);
> + INIT_LIST_HEAD(&srpc->buf_links);
> + INIT_LIST_HEAD(&srpc->dead_links);
> + srpc->interest = NULL;
> + INIT_LIST_HEAD(&srpc->throttled_links);
> + srpc->silent_ticks = 0;
> + srpc->resend_timer_ticks = hsk->homa->timer_ticks;
> + srpc->done_timer_ticks = 0;
> + srpc->magic = HOMA_RPC_MAGIC;
> + srpc->start_ns = sched_clock();
> + err = homa_message_in_init(srpc, ntohl(h->message_length));
> + if (err != 0)
> + goto error;
> +
> + /* Initialize fields that require socket to be locked. */
> + homa_sock_lock(hsk, "homa_rpc_new_server");
> + if (hsk->shutdown) {
> + homa_sock_unlock(hsk);
> + err = -ESHUTDOWN;
> + goto error;
> + }
> + hlist_add_head(&srpc->hash_links, &bucket->rpcs);
> + list_add_tail_rcu(&srpc->active_links, &hsk->active_rpcs);
> + if (ntohl(h->seg.offset) == 0 && srpc->msgin.num_bpages > 0) {
> + atomic_or(RPC_PKTS_READY, &srpc->flags);
> + homa_rpc_handoff(srpc);
> + }
> + homa_sock_unlock(hsk);
> + *created = 1;
> + return srpc;
> +
> +error:
> + homa_bucket_unlock(bucket, id);
> + kfree(srpc);
> + return ERR_PTR(err);
> +}
> +
> +/**
> + * homa_rpc_acked() - This function is invoked when an ack is received
> + * for an RPC; if the RPC still exists, is freed.
> + * @hsk: Socket on which the ack was received. May or may not correspond
> + * to the RPC, but can sometimes be used to avoid a socket lookup.
> + * @saddr: Source address from which the ack was received (the client
> + * node for the RPC)
> + * @ack: Information about an RPC from @saddr that may now be deleted
> + * safely.
> + */
> +void homa_rpc_acked(struct homa_sock *hsk, const struct in6_addr *saddr,
> + struct homa_ack *ack)
> +{
> + __u16 server_port = ntohs(ack->server_port);
> + __u64 id = homa_local_id(ack->client_id);
> + struct homa_sock *hsk2 = hsk;
> + struct homa_rpc *rpc;
> +
> + if (hsk2->port != server_port) {
> + /* Without RCU, sockets other than hsk can be deleted
> + * out from under us.
> + */
> + rcu_read_lock();
> + hsk2 = homa_sock_find(hsk->homa->port_map, server_port);
> + if (!hsk2)
> + goto done;
> + }
> + rpc = homa_find_server_rpc(hsk2, saddr, id);
> + if (rpc) {
> + homa_rpc_free(rpc);
> + homa_rpc_unlock(rpc); /* Locked by homa_find_server_rpc. */
> + }
> +
> +done:
> + if (hsk->port != server_port)
> + rcu_read_unlock();
> +}
> +
> +/**
> + * homa_rpc_free() - Destructor for homa_rpc; will arrange for all resources
> + * associated with the RPC to be released (eventually).
> + * @rpc: Structure to clean up, or NULL. Must be locked. Its socket must
> + * not be locked.
> + */
> +void homa_rpc_free(struct homa_rpc *rpc)
> + __acquires(&rpc->hsk->lock)
> + __releases(&rpc->hsk->lock)
The function name is IMHO misleading. I expect homa_rpc_free() to
actually free the memory allocated for the rpc argument, including the
rpc struct itself.
> +{
> + /* The goal for this function is to make the RPC inaccessible,
> + * so that no other code will ever access it again. However, don't
> + * actually release resources; leave that to homa_rpc_reap, which
> + * runs later. There are two reasons for this. First, releasing
> + * resources may be expensive, so we don't want to keep the caller
> + * waiting; homa_rpc_reap will run in situations where there is time
> + * to spare. Second, there may be other code that currently has
> + * pointers to this RPC but temporarily released the lock (e.g. to
> + * copy data to/from user space). It isn't safe to clean up until
> + * that code has finished its work and released any pointers to the
> + * RPC (homa_rpc_reap will ensure that this has happened). So, this
> + * function should only make changes needed to make the RPC
> + * inaccessible.
> + */
> + if (!rpc || rpc->state == RPC_DEAD)
> + return;
> + rpc->state = RPC_DEAD;
> +
> + /* Unlink from all lists, so no-one will ever find this RPC again. */
> + homa_sock_lock(rpc->hsk, "homa_rpc_free");
> + __hlist_del(&rpc->hash_links);
> + list_del_rcu(&rpc->active_links);
> + list_add_tail_rcu(&rpc->dead_links, &rpc->hsk->dead_rpcs);
> + __list_del_entry(&rpc->ready_links);
> + __list_del_entry(&rpc->buf_links);
> + if (rpc->interest) {
> + rpc->interest->reg_rpc = NULL;
> + wake_up_process(rpc->interest->thread);
> + rpc->interest = NULL;
> + }
> +
> + if (rpc->msgin.length >= 0) {
> + rpc->hsk->dead_skbs += skb_queue_len(&rpc->msgin.packets);
> + while (1) {
> + struct homa_gap *gap = list_first_entry_or_null(&rpc->msgin.gaps,
> + struct homa_gap,
> + links);
> + if (!gap)
> + break;
> + list_del(&gap->links);
> + kfree(gap);
> + }
> + }
> + rpc->hsk->dead_skbs += rpc->msgout.num_skbs;
> + if (rpc->hsk->dead_skbs > rpc->hsk->homa->max_dead_buffs)
> + /* This update isn't thread-safe; it's just a
> + * statistic so it's OK if updates occasionally get
> + * missed.
> + */
> + rpc->hsk->homa->max_dead_buffs = rpc->hsk->dead_skbs;
> +
> + homa_sock_unlock(rpc->hsk);
> + homa_remove_from_throttled(rpc);
> +}
> +
> +/**
> + * homa_rpc_reap() - Invoked to release resources associated with dead
> + * RPCs for a given socket. For a large RPC, it can take a long time to
> + * free all of its packet buffers, so we try to perform this work
> + * off the critical path where it won't delay applications. Each call to
> + * this function normally does a small chunk of work (unless reap_all is
> + * true). See the file reap.txt for more information.
> + * @hsk: Homa socket that may contain dead RPCs. Must not be locked by the
> + * caller; this function will lock and release.
> + * @reap_all: False means do a small chunk of work; there may still be
> + * unreaped RPCs on return. True means reap all dead rpcs for
> + * hsk. Will busy-wait if reaping has been disabled for some RPCs.
> + *
> + * Return: A return value of 0 means that we ran out of work to do; calling
> + * again will do no work (there could be unreaped RPCs, but if so,
> + * reaping has been disabled for them). A value greater than
> + * zero means there is still more reaping work to be done.
> + */
> +int homa_rpc_reap(struct homa_sock *hsk, bool reap_all)
> +{
> +#define BATCH_MAX 20
> + struct homa_rpc *rpcs[BATCH_MAX];
> + struct sk_buff *skbs[BATCH_MAX];
> + int num_skbs, num_rpcs;
> + struct homa_rpc *rpc;
> + int i, batch_size;
> + int skbs_to_reap;
> + int rx_frees;
> + int result = 0;
> +
> + /* Each iteration through the following loop will reap
> + * BATCH_MAX skbs.
> + */
> + skbs_to_reap = hsk->homa->reap_limit;
> + while (skbs_to_reap > 0 && !list_empty(&hsk->dead_rpcs)) {
> + batch_size = BATCH_MAX;
> + if (!reap_all) {
> + if (batch_size > skbs_to_reap)
> + batch_size = skbs_to_reap;
> + skbs_to_reap -= batch_size;
> + }
> + num_skbs = 0;
> + num_rpcs = 0;
> + rx_frees = 0;
> +
> + homa_sock_lock(hsk, "homa_rpc_reap");
> + if (atomic_read(&hsk->protect_count)) {
> + homa_sock_unlock(hsk);
> + if (reap_all)
> + continue;
> + return 0;
> + }
> +
> + /* Collect buffers and freeable RPCs. */
> + list_for_each_entry_rcu(rpc, &hsk->dead_rpcs, dead_links) {
> + if ((atomic_read(&rpc->flags) & RPC_CANT_REAP) ||
> + atomic_read(&rpc->msgout.active_xmits) != 0)
> + continue;
> + rpc->magic = 0;
> +
> + /* For Tx sk_buffs, collect them here but defer
> + * freeing until after releasing the socket lock.
> + */
> + if (rpc->msgout.length >= 0) {
> + while (rpc->msgout.packets) {
> + skbs[num_skbs] = rpc->msgout.packets;
> + rpc->msgout.packets = homa_get_skb_info(
> + rpc->msgout.packets)->next_skb;
> + num_skbs++;
> + rpc->msgout.num_skbs--;
> + if (num_skbs >= batch_size)
> + goto release;
> + }
> + }
> +
> + /* In the normal case rx sk_buffs will already have been
> + * freed before we got here. Thus it's OK to free
> + * immediately in rare situations where there are
> + * buffers left.
> + */
> + if (rpc->msgin.length >= 0) {
> + while (1) {
> + struct sk_buff *skb;
> +
> + skb = skb_dequeue(&rpc->msgin.packets);
> + if (!skb)
> + break;
> + kfree_skb(skb);
You can use:
rx_frees += skb_queue_len(&rpc->msgin.packets);
skb_queue_purge(&rpc->msgin.packets);
> + rx_frees++;
> + }
> + }
> +
> + /* If we get here, it means all packets have been
> + * removed from the RPC.
> + */
> + rpcs[num_rpcs] = rpc;
> + num_rpcs++;
> + list_del_rcu(&rpc->dead_links);
> + if (num_rpcs >= batch_size)
> + goto release;
> + }
> +
> + /* Free all of the collected resources; release the socket
> + * lock while doing this.
> + */
> +release:
> + hsk->dead_skbs -= num_skbs + rx_frees;
> + result = !list_empty(&hsk->dead_rpcs) &&
> + (num_skbs + num_rpcs) != 0;
> + homa_sock_unlock(hsk);
> + homa_skb_free_many_tx(hsk->homa, skbs, num_skbs);
> + for (i = 0; i < num_rpcs; i++) {
> + rpc = rpcs[i];
> + /* Lock and unlock the RPC before freeing it. This
> + * is needed to deal with races where the code
> + * that invoked homa_rpc_free hasn't unlocked the
> + * RPC yet.
> + */
> + homa_rpc_lock(rpc, "homa_rpc_reap");
> + homa_rpc_unlock(rpc);
> +
> + if (unlikely(rpc->msgin.num_bpages))
> + homa_pool_release_buffers(rpc->hsk->buffer_pool,
> + rpc->msgin.num_bpages,
> + rpc->msgin.bpage_offsets);
> + if (rpc->msgin.length >= 0) {
> + while (1) {
> + struct homa_gap *gap;
> +
> + gap = list_first_entry_or_null(
> + &rpc->msgin.gaps,
> + struct homa_gap,
> + links);
> + if (!gap)
> + break;
> + list_del(&gap->links);
> + kfree(gap);
> + }
> + }
> + rpc->state = 0;
> + kfree(rpc);
> + }
> + if (!result && !reap_all)
> + break;
> + }
> + homa_pool_check_waiting(hsk->buffer_pool);
> + return result;
> +}
> +
> +/**
> + * homa_find_client_rpc() - Locate client-side information about the RPC that
> + * a packet belongs to, if there is any. Thread-safe without socket lock.
> + * @hsk: Socket via which packet was received.
> + * @id: Unique identifier for the RPC.
> + *
> + * Return: A pointer to the homa_rpc for this id, or NULL if none.
> + * The RPC will be locked; the caller must eventually unlock it
> + * by invoking homa_rpc_unlock.
Why are you using this locking scheme? It looks like it adds quite a bit of
complexity. The usual way of handling this kind of hash lookup is to do the
lookup locklessly, under RCU, and add a refcount to the looked-up
entity - homa_rpc - to ensure it will not change under the
hood after the lookup.
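Something like this, as a rough sketch (it assumes homa_rpc grows a
'refcount_t refs' field plus a homa_rpc_put() that releases the entry once
the count drops to zero, with the actual free deferred via RCU):

	struct homa_rpc *homa_find_client_rpc(struct homa_sock *hsk, __u64 id)
	{
		struct homa_rpc_bucket *bucket = homa_client_rpc_bucket(hsk, id);
		struct homa_rpc *crpc;

		rcu_read_lock();
		hlist_for_each_entry_rcu(crpc, &bucket->rpcs, hash_links) {
			if (crpc->id == id &&
			    refcount_inc_not_zero(&crpc->refs)) {
				rcu_read_unlock();
				/* Caller drops the ref with homa_rpc_put(). */
				return crpc;
			}
		}
		rcu_read_unlock();
		return NULL;
	}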
> + */
> +struct homa_rpc *homa_find_client_rpc(struct homa_sock *hsk, __u64 id)
> + __acquires(&crpc->bucket->lock)
> +{
> + struct homa_rpc_bucket *bucket = homa_client_rpc_bucket(hsk, id);
> + struct homa_rpc *crpc;
> +
> + homa_bucket_lock(bucket, id, __func__);
> + hlist_for_each_entry_rcu(crpc, &bucket->rpcs, hash_links) {
Why are you using the RCU variant? I don't see RCU access for rpcs.
/P
* Re: [PATCH net-next v6 06/12] net: homa: create homa_peer.h and homa_peer.c
2025-01-15 18:59 ` [PATCH net-next v6 06/12] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
@ 2025-01-23 17:45 ` Paolo Abeni
2025-01-28 0:06 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-23 17:45 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 1/15/25 7:59 PM, John Ousterhout wrote:
> +/**
> + * homa_peertab_get_peers() - Return information about all of the peers
> + * currently known
> + * @peertab: The table to search for peers.
> + * @num_peers: Modified to hold the number of peers returned.
> + * Return: kmalloced array holding pointers to all known peers. The
> + * caller must free this. If there is an error, or if there
> + * are no peers, NULL is returned.
> + */
> +struct homa_peer **homa_peertab_get_peers(struct homa_peertab *peertab,
> + int *num_peers)
Looks like this function is unused in the current series. Please don't
introduce unused code.
> +{
> + struct homa_peer **result;
> + struct hlist_node *next;
> + struct homa_peer *peer;
> + int i, count;
> +
> + *num_peers = 0;
> + if (!peertab->buckets)
> + return NULL;
> +
> + /* Figure out how many peers there are. */
> + count = 0;
> + for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
> + hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
> + peertab_links)
No lock is acquired here, so other processes could concurrently modify the
list; hlist_for_each_entry_safe() is not the correct helper to use. You
should probably use hlist_for_each_entry_rcu(), adding RCU protection.
That assumes the thing is actually under an RCU scheme, which is not
entirely clear.
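e.g. for the counting pass:

	rcu_read_lock();
	for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++)
		hlist_for_each_entry_rcu(peer, &peertab->buckets[i],
					 peertab_links)
			count++;
	rcu_read_unlock();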
> + count++;
> + }
> +
> + if (count == 0)
> + return NULL;
> +
> + result = kmalloc_array(count, sizeof(peer), GFP_KERNEL);
> + if (!result)
> + return NULL;
> + *num_peers = count;
> + count = 0;
> + for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
> + hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
> + peertab_links) {
> + result[count] = peer;
> + count++;
> + }
> + }
> + return result;
> +}
> +
> +/**
> + * homa_peertab_gc_dsts() - Invoked to free unused dst_entries, if it is
> + * safe to do so.
> + * @peertab: The table in which to free entries.
> + * @now: Current time, in sched_clock() units; entries with expiration
> + * dates no later than this will be freed. Specify ~0 to
> + * free all entries.
> + */
> +void homa_peertab_gc_dsts(struct homa_peertab *peertab, __u64 now)
> +{
Apparently this is called under (and needs) the peertab lock; an annotation
or a comment would be helpful.
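For example (assuming @write_lock really is the lock that protects
@dead_dsts), something like:

	lockdep_assert_held(&peertab->write_lock);

as the first statement of the function, plus a "caller must hold
peertab->write_lock" note in the kernel-doc, would both document and
enforce the requirement.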
> + while (!list_empty(&peertab->dead_dsts)) {
> + struct homa_dead_dst *dead =
> + list_first_entry(&peertab->dead_dsts,
> + struct homa_dead_dst, dst_links);
> + if (dead->gc_time > now)
> + break;
> + dst_release(dead->dst);
> + list_del(&dead->dst_links);
> + kfree(dead);
> + }
> +}
> +
> +/**
> + * homa_peer_find() - Returns the peer associated with a given host; creates
> + * a new homa_peer if one doesn't already exist.
> + * @peertab: Peer table in which to perform lookup.
> + * @addr: Address of the desired host: IPv4 addresses are represented
> + * as IPv4-mapped IPv6 addresses.
> + * @inet: Socket that will be used for sending packets.
> + *
> + * Return: The peer associated with @addr, or a negative errno if an
> + * error occurred. The caller can retain this pointer
> + * indefinitely: peer entries are never deleted except in
> + * homa_peertab_destroy.
> + */
> +struct homa_peer *homa_peer_find(struct homa_peertab *peertab,
> + const struct in6_addr *addr,
> + struct inet_sock *inet)
> +{
> + /* Note: this function uses RCU operators to ensure safety even
> + * if a concurrent call is adding a new entry.
> + */
> + struct homa_peer *peer;
> + struct dst_entry *dst;
> +
> + __u32 bucket = hash_32((__force __u32)addr->in6_u.u6_addr32[0],
> + HOMA_PEERTAB_BUCKET_BITS);
> +
> + bucket ^= hash_32((__force __u32)addr->in6_u.u6_addr32[1],
> + HOMA_PEERTAB_BUCKET_BITS);
> + bucket ^= hash_32((__force __u32)addr->in6_u.u6_addr32[2],
> + HOMA_PEERTAB_BUCKET_BITS);
> + bucket ^= hash_32((__force __u32)addr->in6_u.u6_addr32[3],
> + HOMA_PEERTAB_BUCKET_BITS);
> + hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
> + peertab_links) {
> + if (ipv6_addr_equal(&peer->addr, addr))
The caller does not acquire the RCU read lock, so this looks buggy.
AFAICS UaF is not possible because peers are removed only by
homa_peertab_destroy(), at unload time. That in turn looks
dangerous/wrong. What about memory utilization for peers over time?
Apparently the bucket lists could grow in an unbounded way.
[...]
> +/**
> + * homa_peer_lock_slow() - This function implements the slow path for
> + * acquiring a peer's @unacked_lock. It is invoked when the lock isn't
> + * immediately available. It waits for the lock, but also records statistics
> + * about the waiting time.
> + * @peer: Peer to lock.
> + */
> +void homa_peer_lock_slow(struct homa_peer *peer)
> + __acquires(&peer->ack_lock)
> +{
> + spin_lock_bh(&peer->ack_lock);
Is this just a placeholder for future changes?!? I don't see any stats
update here, and currently homa_peer_lock() is really:
if (!spin_trylock_bh(&peer->ack_lock))
spin_lock_bh(&peer->ack_lock);
which does not make much sense to me. Either document that this is going to
change very soon (possibly even how and why) or use a plain spin_lock_bh().
> +}
> +
> +/**
> + * homa_peer_add_ack() - Add a given RPC to the list of unacked
> + * RPCs for its server. Once this method has been invoked, it's safe
> + * to delete the RPC, since it will eventually be acked to the server.
> + * @rpc: Client RPC that has now completed.
> + */
> +void homa_peer_add_ack(struct homa_rpc *rpc)
> +{
> + struct homa_peer *peer = rpc->peer;
> + struct homa_ack_hdr ack;
> +
> + homa_peer_lock(peer);
> + if (peer->num_acks < HOMA_MAX_ACKS_PER_PKT) {
> + peer->acks[peer->num_acks].client_id = cpu_to_be64(rpc->id);
> + peer->acks[peer->num_acks].server_port = htons(rpc->dport);
> + peer->num_acks++;
> + homa_peer_unlock(peer);
> + return;
> + }
> +
> + /* The peer has filled up; send an ACK message to empty it. The
> + * RPC in the message header will also be considered ACKed.
> + */
> + memcpy(ack.acks, peer->acks, sizeof(peer->acks));
> + ack.num_acks = htons(peer->num_acks);
> + peer->num_acks = 0;
> + homa_peer_unlock(peer);
> + homa_xmit_control(ACK, &ack, sizeof(ack), rpc);
> +}
> +
> +/**
> + * homa_peer_get_acks() - Copy acks out of a peer, and remove them from the
> + * peer.
> + * @peer: Peer to check for possible unacked RPCs.
> + * @count: Maximum number of acks to return.
> + * @dst: The acks are copied to this location.
> + *
> + * Return: The number of acks extracted from the peer (<= count).
> + */
> +int homa_peer_get_acks(struct homa_peer *peer, int count, struct homa_ack *dst)
> +{
> + /* Don't waste time acquiring the lock if there are no ids available. */
> + if (peer->num_acks == 0)
> + return 0;
> +
> + homa_peer_lock(peer);
> +
> + if (count > peer->num_acks)
> + count = peer->num_acks;
> + memcpy(dst, &peer->acks[peer->num_acks - count],
> + count * sizeof(peer->acks[0]));
> + peer->num_acks -= count;
> +
> + homa_peer_unlock(peer);
> + return count;
> +}
> diff --git a/net/homa/homa_peer.h b/net/homa/homa_peer.h
> new file mode 100644
> index 000000000000..556aeda49656
> --- /dev/null
> +++ b/net/homa/homa_peer.h
> @@ -0,0 +1,233 @@
> +/* SPDX-License-Identifier: BSD-2-Clause */
> +
> +/* This file contains definitions related to managing peers (homa_peer
> + * and homa_peertab).
> + */
> +
> +#ifndef _HOMA_PEER_H
> +#define _HOMA_PEER_H
> +
> +#include "homa_wire.h"
> +#include "homa_sock.h"
> +
> +struct homa_rpc;
> +
> +/**
> + * struct homa_dead_dst - Used to retain dst_entries that are no longer
> + * needed, until it is safe to delete them (I'm not confident that the RCU
> + * mechanism will be safe for these: the reference count could get incremented
> + * after it's on the RCU list?).
> + */
> +struct homa_dead_dst {
> + /** @dst: Entry that is no longer used by a struct homa_peer. */
> + struct dst_entry *dst;
> +
> + /**
> + * @gc_time: Time (in units of sched_clock()) when it is safe
> + * to free @dst.
> + */
> + __u64 gc_time;
> +
> + /** @dst_links: Used to link together entries in peertab->dead_dsts. */
> + struct list_head dst_links;
> +};
> +
> +/**
> + * define HOMA_PEERTAB_BUCKET_BITS - Number of bits in the bucket index for a
> + * homa_peertab. Should be large enough to hold an entry for every server
> + * in a datacenter without long hash chains.
> + */
> +#define HOMA_PEERTAB_BUCKET_BITS 16
> +
> +/** define HOME_PEERTAB_BUCKETS - Number of buckets in a homa_peertab. */
> +#define HOMA_PEERTAB_BUCKETS BIT(HOMA_PEERTAB_BUCKET_BITS)
> +
> +/**
> + * struct homa_peertab - A hash table that maps from IPv6 addresses
> + * to homa_peer objects. IPv4 entries are encapsulated as IPv6 addresses.
> + * Entries are gradually added to this table, but they are never removed
> + * except when the entire table is deleted. We can't safely delete because
> + * results returned by homa_peer_find may be retained indefinitely.
> + *
> + * This table is managed exclusively by homa_peertab.c, using RCU to
> + * permit efficient lookups.
> + */
> +struct homa_peertab {
> + /**
> + * @write_lock: Synchronizes addition of new entries; not needed
> + * for lookups (RCU is used instead).
> + */
> + spinlock_t write_lock;
This lock looks potentially heavily contended on add; why don't you use a
per-bucket lock?
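Rough sketch of what that could look like (names are made up):

	struct homa_peertab_bucket {
		spinlock_t lock;           /* protects additions to @peers */
		struct hlist_head peers;
	};

	struct homa_peertab {
		spinlock_t dead_dsts_lock; /* @dead_dsts keeps its own lock */
		struct list_head dead_dsts;
		struct homa_peertab_bucket *buckets; /* HOMA_PEERTAB_BUCKETS */
	};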
> +
> + /**
> + * @dead_dsts: List of dst_entries that are waiting to be deleted.
> + * Hold @write_lock when manipulating.
> + */
> + struct list_head dead_dsts;
> +
> + /**
> + * @buckets: Pointer to heads of chains of homa_peers for each bucket.
> + * Malloc-ed, and must eventually be freed. NULL means this structure
> + * has not been initialized.
> + */
> + struct hlist_head *buckets;
> +};
> +
> +/**
> + * struct homa_peer - One of these objects exists for each machine that we
> + * have communicated with (either as client or server).
> + */
> +struct homa_peer {
> + /**
> + * @addr: IPv6 address for the machine (IPv4 addresses are stored
> + * as IPv4-mapped IPv6 addresses).
> + */
> + struct in6_addr addr;
> +
> + /** @flow: Addressing info needed to send packets. */
> + struct flowi flow;
> +
> + /**
> + * @dst: Used to route packets to this peer; we own a reference
> + * to this, which we must eventually release.
> + */
> + struct dst_entry *dst;
> +
> + /**
> + * @grantable_rpcs: Contains all homa_rpcs (both requests and
> + * responses) involving this peer whose msgins require (or required
> + * them in the past) and have not been fully received. The list is
> + * sorted in priority order (head has fewest bytes_remaining).
> + * Locked with homa->grantable_lock.
> + */
> + struct list_head grantable_rpcs;
Apparently not used in this patch series. More fields below have a similar
problem. Please introduce such fields in the same series that will
actually use them.
/P
* Re: [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c
2025-01-15 18:59 ` [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
@ 2025-01-23 19:01 ` Paolo Abeni
2025-01-28 0:40 ` John Ousterhout
2025-01-24 7:33 ` Paolo Abeni
1 sibling, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-23 19:01 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 1/15/25 7:59 PM, John Ousterhout wrote:
> + spin_unlock_bh(&socktab->write_lock);
> +
> + return homa_socktab_next(scan);
> +}
> +
> +/**
> + * homa_socktab_next() - Return the next socket in an iteration over a socktab.
> + * @scan: State of the scan.
> + *
> + * Return: The next socket in the table, or NULL if the iteration has
> + * returned all of the sockets in the table. Sockets are not
> + * returned in any particular order. It's possible that the
> + * returned socket has been destroyed.
> + */
> +struct homa_sock *homa_socktab_next(struct homa_socktab_scan *scan)
> +{
> + struct homa_socktab_links *links;
> + struct homa_sock *hsk;
> +
> + while (1) {
> + while (!scan->next) {
> + struct hlist_head *bucket;
> +
> + scan->current_bucket++;
> + if (scan->current_bucket >= HOMA_SOCKTAB_BUCKETS)
> + return NULL;
> + bucket = &scan->socktab->buckets[scan->current_bucket];
> + scan->next = (struct homa_socktab_links *)
> + rcu_dereference(hlist_first_rcu(bucket));
The only caller for this function so far is not under RCU lock: you
should see a splat here if you build and run this code with:
CONFIG_LOCKDEP=y
(which in turn is highly encouraged)
> + }
> + links = scan->next;
> + hsk = links->sock;
> + scan->next = (struct homa_socktab_links *)
> + rcu_dereference(hlist_next_rcu(&links->hash_links));
homa_socktab_links is embedded into the homa sock; if the RCU protection
is released and re-acquired after a homa_socktab_next() call, there is
no guarantee that links/hsk are still around, and the above statement could
cause a UaF.
This homa_socktab thing looks quite complex. A simpler implementation
could use a plain RCU list _and_ acquire a reference to the hsk before
releasing the RCU lock.
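Rough sketch of that idea (field names are made up; it assumes the table
keeps all sockets on one plain RCU list): scan by port so that no iterator
state has to survive outside the RCU critical section, and pin the socket
before returning it:

	struct homa_sock *homa_socktab_next_port(struct homa_socktab *socktab,
						 __u16 prev_port)
	{
		struct homa_sock *hsk, *best = NULL;

		rcu_read_lock();
		list_for_each_entry_rcu(hsk, &socktab->socks, socktab_node)
			if (hsk->port > prev_port &&
			    (!best || hsk->port < best->port))
				best = hsk;
		if (best)
			sock_hold(&best->sock); /* pin before leaving RCU */
		rcu_read_unlock();
		return best;
	}

The caller drops the reference with sock_put() and passes the returned
socket's port back in to continue the scan.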
> + return hsk;
> + }
> +}
> +
> +/**
> + * homa_socktab_end_scan() - Must be invoked on completion of each scan
> + * to clean up state associated with the scan.
> + * @scan: State of the scan.
> + */
> +void homa_socktab_end_scan(struct homa_socktab_scan *scan)
> +{
> + spin_lock_bh(&scan->socktab->write_lock);
> + list_del(&scan->scan_links);
> + spin_unlock_bh(&scan->socktab->write_lock);
> +}
> +
> +/**
> + * homa_sock_init() - Constructor for homa_sock objects. This function
> + * initializes only the parts of the socket that are owned by Homa.
> + * @hsk: Object to initialize.
> + * @homa: Homa implementation that will manage the socket.
> + *
> + * Return: 0 for success, otherwise a negative errno.
> + */
> +int homa_sock_init(struct homa_sock *hsk, struct homa *homa)
> +{
> + struct homa_socktab *socktab = homa->port_map;
> + int starting_port;
> + int result = 0;
> + int i;
> +
> + spin_lock_bh(&socktab->write_lock);
A single contended lock for the whole homa sock table? Why don't you use
per-bucket locks?
[...]
> +struct homa_rpc_bucket {
> + /**
> + * @lock: serves as a lock both for this bucket (e.g., when
> + * adding and removing RPCs) and also for all of the RPCs in
> + * the bucket. Must be held whenever manipulating an RPC in
> + * this bucket. This dual purpose permits clean and safe
> + * deletion and garbage collection of RPCs.
> + */
> + spinlock_t lock;
> +
> + /** @rpcs: list of RPCs that hash to this bucket. */
> + struct hlist_head rpcs;
> +
> + /**
> + * @id: identifier for this bucket, used in error messages etc.
> + * It's the index of the bucket within its hash table bucket
> + * array, with an additional offset to separate server and
> + * client RPCs.
> + */
> + int id;
On 64-bit arches this struct will have two 4-byte holes. If you reorder
the fields:
spinlock_t lock;
int id;
struct hlist_head rpcs;
the struct size will decrease by 8 bytes.
> +};
> +
> +/**
> + * define HOMA_CLIENT_RPC_BUCKETS - Number of buckets in hash tables for
> + * client RPCs. Must be a power of 2.
> + */
> +#define HOMA_CLIENT_RPC_BUCKETS 1024
> +
> +/**
> + * define HOMA_SERVER_RPC_BUCKETS - Number of buckets in hash tables for
> + * server RPCs. Must be a power of 2.
> + */
> +#define HOMA_SERVER_RPC_BUCKETS 1024
> +
> +/**
> + * struct homa_sock - Information about an open socket.
> + */
> +struct homa_sock {
> + /* Info for other network layers. Note: IPv6 info (struct ipv6_pinfo
> + * comes at the very end of the struct, *after* Homa's data, if this
> + * socket uses IPv6).
> + */
> + union {
> + /** @sock: generic socket data; must be the first field. */
> + struct sock sock;
> +
> + /**
> + * @inet: generic Internet socket data; must also be the
> + first field (contains sock as its first member).
> + */
> + struct inet_sock inet;
> + };
Why add this union? Just
struct inet_sock inet;
would do.
/P
* Re: [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c
2025-01-15 18:59 ` [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
2025-01-23 19:01 ` Paolo Abeni
@ 2025-01-24 7:33 ` Paolo Abeni
1 sibling, 0 replies; 68+ messages in thread
From: Paolo Abeni @ 2025-01-24 7:33 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 1/15/25 7:59 PM, John Ousterhout wrote:
> +/**
> + * homa_sock_init() - Constructor for homa_sock objects. This function
> + * initializes only the parts of the socket that are owned by Homa.
> + * @hsk: Object to initialize.
> + * @homa: Homa implementation that will manage the socket.
> + *
> + * Return: 0 for success, otherwise a negative errno.
> + */
> +int homa_sock_init(struct homa_sock *hsk, struct homa *homa)
> +{
> + struct homa_socktab *socktab = homa->port_map;
> + int starting_port;
> + int result = 0;
> + int i;
> +
> + spin_lock_bh(&socktab->write_lock);
> + atomic_set(&hsk->protect_count, 0);
> + spin_lock_init(&hsk->lock);
> + hsk->last_locker = "none";
> + atomic_set(&hsk->protect_count, 0);
> + hsk->homa = homa;
> + hsk->ip_header_length = (hsk->inet.sk.sk_family == AF_INET)
> + ? HOMA_IPV4_HEADER_LENGTH : HOMA_IPV6_HEADER_LENGTH;
> + hsk->shutdown = false;
> + starting_port = homa->prev_default_port;
> + while (1) {
> + homa->prev_default_port++;
> + if (homa->prev_default_port < HOMA_MIN_DEFAULT_PORT)
> + homa->prev_default_port = HOMA_MIN_DEFAULT_PORT;
> + if (!homa_sock_find(socktab, homa->prev_default_port))
> + break;
> + if (homa->prev_default_port == starting_port) {
> + spin_unlock_bh(&socktab->write_lock);
> + hsk->shutdown = true;
> + return -EADDRNOTAVAIL;
> + }
> + }
> + hsk->port = homa->prev_default_port;
> + hsk->inet.inet_num = hsk->port;
> + hsk->inet.inet_sport = htons(hsk->port);
> + hsk->socktab_links.sock = hsk;
> + hlist_add_head_rcu(&hsk->socktab_links.hash_links,
> + &socktab->buckets[homa_port_hash(hsk->port)]);
At this point the socket is apparently exposed to lookup from incoming
packets, but it's only partially initialized: bad things could happen.
> +/**
> + * homa_sock_find() - Returns the socket associated with a given port.
> + * @socktab: Hash table in which to perform lookup.
> + * @port: The port of interest.
> + * Return: The socket that owns @port, or NULL if none.
> + *
> + * Note: this function uses RCU list-searching facilities, but it doesn't
> + * call rcu_read_lock. The caller should do that, if the caller cares (this
> + * way, the caller's use of the socket will also be protected).
> + */
> +struct homa_sock *homa_sock_find(struct homa_socktab *socktab, __u16 port)
It would help the review if you reordered the code, defining basic helpers
like this one first and the functions that use them afterwards.
> +{
> + struct homa_socktab_links *link;
> + struct homa_sock *result = NULL;
> +
> + hlist_for_each_entry_rcu(link, &socktab->buckets[homa_port_hash(port)],
> + hash_links) {
This requires the caller to own the RCU read lock, which is not always the
case in this patchset.
> + struct homa_sock *hsk = link->sock;
> +
> + if (hsk->port == port) {
> + result = hsk;
The local port is the full key for the socket lookup? Not even the
address? This simplifies the code a bit, but it is quite against user
expectations.
/P
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-01-15 18:59 ` [PATCH net-next v6 08/12] net: homa: create homa_incoming.c John Ousterhout
@ 2025-01-24 8:31 ` Paolo Abeni
2025-01-30 0:41 ` John Ousterhout
2025-01-27 10:19 ` Paolo Abeni
1 sibling, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-24 8:31 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 1/15/25 7:59 PM, John Ousterhout wrote:
> +/**
> + * homa_add_packet() - Add an incoming packet to the contents of a
> + * partially received message.
> + * @rpc: Add the packet to the msgin for this RPC.
> + * @skb: The new packet. This function takes ownership of the packet
> + * (the packet will either be freed or added to rpc->msgin.packets).
> + */
> +void homa_add_packet(struct homa_rpc *rpc, struct sk_buff *skb)
> +{
> + struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
> + struct homa_gap *gap, *dummy, *gap2;
> + int start = ntohl(h->seg.offset);
> + int length = homa_data_len(skb);
> + int end = start + length;
> +
> + if ((start + length) > rpc->msgin.length)
> + goto discard;
> +
> + if (start == rpc->msgin.recv_end) {
> + /* Common case: packet is sequential. */
> + rpc->msgin.recv_end += length;
> + goto keep;
> + }
> +
> + if (start > rpc->msgin.recv_end) {
> + /* Packet creates a new gap. */
> + if (!homa_gap_new(&rpc->msgin.gaps,
> + rpc->msgin.recv_end, start)) {
> + pr_err("Homa couldn't allocate gap: insufficient memory\n");
> + goto discard;
OoO packets will cause additional allocations? This feels DoS-prone.
> + }
> + rpc->msgin.recv_end = end;
> + goto keep;
> + }
> +
> + /* Must now check to see if the packet fills in part or all of
> + * an existing gap.
> + */
> + list_for_each_entry_safe(gap, dummy, &rpc->msgin.gaps, links) {
Linear search for OoO segments has proven to be subject to serious DoS
issues. You should instead use an (rb-)tree to handle OoO packets.
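Something along these lines, as a rough sketch (it assumes msgin gains a
'struct rb_root gap_tree' and homa_gap gains a 'struct rb_node node'):

	/* Insert a gap keyed by gap->start; finding the gap a packet
	 * falls into then costs O(log n) instead of a linear scan.
	 */
	static void homa_gap_insert(struct rb_root *root, struct homa_gap *gap)
	{
		struct rb_node **p = &root->rb_node, *parent = NULL;

		while (*p) {
			struct homa_gap *cur = rb_entry(*p, struct homa_gap,
							node);

			parent = *p;
			if (gap->start < cur->start)
				p = &(*p)->rb_left;
			else
				p = &(*p)->rb_right;
		}
		rb_link_node(&gap->node, parent, p);
		rb_insert_color(&gap->node, root);
	}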
> + /* Is packet at the start of this gap? */
> + if (start <= gap->start) {
> + if (end <= gap->start)
> + continue;
> + if (start < gap->start)
> + goto discard;
> + if (end > gap->end)
> + goto discard;
> + gap->start = end;
> + if (gap->start >= gap->end) {
> + list_del(&gap->links);
> + kfree(gap);
> + }
> + goto keep;
> + }
> +
> + /* Is packet at the end of this gap? BTW, at this point we know
> + * the packet can't cover the entire gap.
> + */
> + if (end >= gap->end) {
> + if (start >= gap->end)
> + continue;
> + if (end > gap->end)
> + goto discard;
> + gap->end = start;
> + goto keep;
> + }
> +
> + /* Packet is in the middle of the gap; must split the gap. */
> + gap2 = homa_gap_new(&gap->links, gap->start, start);
> + if (!gap2) {
> + pr_err("Homa couldn't allocate gap for split: insufficient memory\n");
> + goto discard;
> + }
> + gap2->time = gap->time;
> + gap->start = end;
> + goto keep;
> + }
> +
> +discard:
> + kfree_skb(skb);
> + return;
> +
> +keep:
> + __skb_queue_tail(&rpc->msgin.packets, skb);
Here 'msgin.packets' is apparently under RPC lock protection, but
elsewhere - in homa_rpc_reap() - the list is apparently protected by
its own lock.
Also it looks like there is no memory accounting at all, and the
SO_RCVBUF setting is just ignored.
> +/**
> + * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
> + * all related to the same RPC.
> + * @skb: First packet in the batch, linked through skb->next.
> + * @homa: Overall information about the Homa transport.
> + */
> +void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa)
I see I haven't mentioned the following so far, but you should move the
struct homa to a pernet subsystem.
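i.e. something like this rough sketch (homa_init()/homa_destroy() stand in
for whatever per-netns setup/teardown Homa needs):

	static unsigned int homa_net_id __read_mostly;

	static int __net_init homa_net_init(struct net *net)
	{
		return homa_init(net_generic(net, homa_net_id));
	}

	static void __net_exit homa_net_exit(struct net *net)
	{
		homa_destroy(net_generic(net, homa_net_id));
	}

	static struct pernet_operations homa_net_ops = {
		.init = homa_net_init,
		.exit = homa_net_exit,
		.id   = &homa_net_id,
		.size = sizeof(struct homa),
	};

	/* module init: register_pernet_subsys(&homa_net_ops);
	 * module exit: unregister_pernet_subsys(&homa_net_ops);
	 */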
> +{
> +#define MAX_ACKS 10
> + const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> + struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
> + __u64 id = homa_local_id(h->common.sender_id);
> + int dport = ntohs(h->common.dport);
> +
> + /* Used to collect acks from data packets so we can process them
> + * all at the end (can't process them inline because that may
> + * require locking conflicting RPCs). If we run out of space just
> + * ignore the extra acks; they'll be regenerated later through the
> + * explicit mechanism.
> + */
> + struct homa_ack acks[MAX_ACKS];
> + struct homa_rpc *rpc = NULL;
> + struct homa_sock *hsk;
> + struct sk_buff *next;
> + int num_acks = 0;
> +
> + /* Find the appropriate socket.*/
> + hsk = homa_sock_find(homa->port_map, dport);
This needs RCU protection
> + if (!hsk) {
> + if (skb_is_ipv6(skb))
> + icmp6_send(skb, ICMPV6_DEST_UNREACH,
> + ICMPV6_PORT_UNREACH, 0, NULL, IP6CB(skb));
> + else
> + icmp_send(skb, ICMP_DEST_UNREACH,
> + ICMP_PORT_UNREACH, 0);
> + while (skb) {
> + next = skb->next;
> + kfree_skb(skb);
> + skb = next;
> + }
> + return;
> + }
> +
> + /* Each iteration through the following loop processes one packet. */
> + for (; skb; skb = next) {
> + h = (struct homa_data_hdr *)skb->data;
> + next = skb->next;
> +
> + /* Relinquish the RPC lock temporarily if it's needed
> + * elsewhere.
> + */
> + if (rpc) {
> + int flags = atomic_read(&rpc->flags);
> +
> + if (flags & APP_NEEDS_LOCK) {
> + homa_rpc_unlock(rpc);
> + homa_spin(200);
> + rpc = NULL;
> + }
> + }
> +
> + /* Find and lock the RPC if we haven't already done so. */
> + if (!rpc) {
> + if (!homa_is_client(id)) {
> + /* We are the server for this RPC. */
> + if (h->common.type == DATA) {
> + int created;
> +
> + /* Create a new RPC if one doesn't
> + * already exist.
> + */
> + rpc = homa_rpc_new_server(hsk, &saddr,
> + h, &created);
It looks like a buggy or malicious client could force server RPC
allocation to any _client_ ?!?
> + if (IS_ERR(rpc)) {
> + pr_warn("homa_pkt_dispatch couldn't create server rpc: error %lu",
> + -PTR_ERR(rpc));
> + rpc = NULL;
> + goto discard;
> + }
> + } else {
> + rpc = homa_find_server_rpc(hsk, &saddr,
> + id);
> + }
> + } else {
> + rpc = homa_find_client_rpc(hsk, id);
Both the client and the server lookup require a contended lock; the
lookup could/should be lockless, and the lock could/should be taken only
on the relevant RPC.
> + }
> + }
> + if (unlikely(!rpc)) {
> + if (h->common.type != NEED_ACK &&
> + h->common.type != ACK &&
> + h->common.type != RESEND)
> + goto discard;
> + } else {
> + if (h->common.type == DATA ||
> + h->common.type == BUSY ||
> + h->common.type == NEED_ACK)
> + rpc->silent_ticks = 0;
> + rpc->peer->outstanding_resends = 0;
> + }
> +
> + switch (h->common.type) {
> + case DATA:
> + if (h->ack.client_id) {
> + /* Save the ack for processing later, when we
> + * have released the RPC lock.
> + */
> + if (num_acks < MAX_ACKS) {
> + acks[num_acks] = h->ack;
> + num_acks++;
> + }
> + }
> + homa_data_pkt(skb, rpc);
> + break;
> + case RESEND:
> + homa_resend_pkt(skb, rpc, hsk);
> + break;
> + case UNKNOWN:
> + homa_unknown_pkt(skb, rpc);
It's sort of unexpected that the protocol explicitly defines the UNKNOWN
packet type and handles it differently from undefined types.
> + break;
> + case BUSY:
> + /* Nothing to do for these packets except reset
> + * silent_ticks, which happened above.
> + */
> + goto discard;
> + case NEED_ACK:
> + homa_need_ack_pkt(skb, hsk, rpc);
> + break;
> + case ACK:
> + homa_ack_pkt(skb, hsk, rpc);
> + rpc = NULL;
> +
> + /* It isn't safe to process more packets once we've
> + * released the RPC lock (this should never happen).
> + */
> + while (next) {
> + WARN_ONCE(next, "%s found extra packets after AC<\n",
> + __func__);
It looks like the above WARN could be triggered by an unexpected traffic
pattern generated by the client. If so, you should avoid the WARN() and
instead use e.g. a MIB counter.
> + skb = next;
> + next = skb->next;
> + kfree_skb(skb);
> + }
> + break;
> + default:
> + goto discard;
> + }
> + continue;
> +
> +discard:
> + kfree_skb(skb);
> + }
> + if (rpc)
> + homa_rpc_unlock(rpc);
> +
> + while (num_acks > 0) {
> + num_acks--;
> + homa_rpc_acked(hsk, &saddr, &acks[num_acks]);
> + }
> +
> + if (hsk->dead_skbs >= 2 * hsk->homa->dead_buffs_limit)
> + /* We get here if neither homa_wait_for_message
> + * nor homa_timer can keep up with reaping dead
> + * RPCs. See reap.txt for details.
> + */
> + homa_rpc_reap(hsk, false);
> +}
> +
> +/**
> + * homa_data_pkt() - Handler for incoming DATA packets
> + * @skb: Incoming packet; size known to be large enough for the header.
> + * This function now owns the packet.
> + * @rpc: Information about the RPC corresponding to this packet.
> + * Must be locked by the caller.
> + */
> +void homa_data_pkt(struct sk_buff *skb, struct homa_rpc *rpc)
> +{
> + struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
> +
> + if (rpc->state != RPC_INCOMING && homa_is_client(rpc->id)) {
> + if (unlikely(rpc->state != RPC_OUTGOING))
> + goto discard;
> + rpc->state = RPC_INCOMING;
> + if (homa_message_in_init(rpc, ntohl(h->message_length)) != 0)
> + goto discard;
> + } else if (rpc->state != RPC_INCOMING) {
> + /* Must be server; note that homa_rpc_new_server already
> + * initialized msgin and allocated buffers.
> + */
> + if (unlikely(rpc->msgin.length >= 0))
> + goto discard;
> + }
> +
> + if (rpc->msgin.num_bpages == 0)
> + /* Drop packets that arrive when we can't allocate buffer
> + * space. If we keep them around, packet buffer usage can
> + * exceed available cache space, resulting in poor
> + * performance.
> + */
> + goto discard;
> +
> + homa_add_packet(rpc, skb);
> +
> + if (skb_queue_len(&rpc->msgin.packets) != 0 &&
> + !(atomic_read(&rpc->flags) & RPC_PKTS_READY)) {
> + atomic_or(RPC_PKTS_READY, &rpc->flags);
> + homa_sock_lock(rpc->hsk, "homa_data_pkt");
> + homa_rpc_handoff(rpc);
> + homa_sock_unlock(rpc->hsk);
It looks like you tried to enforce the following lock acquisition order:
rpc lock
socket lock
which is IMHO quite unnatural, as the socket has a wider scope than the
RPC. In practice the locking scheme is quite complex and hard to follow.
I think (wild guess) that inverting the lock order would simplify the
locking scheme significantly.
[...]
> +/**
> + * homa_ack_pkt() - Handler for incoming ACK packets
> + * @skb: Incoming packet; size already verified large enough for header.
> + * This function now owns the packet.
> + * @hsk: Socket on which the packet was received.
> + * @rpc: The RPC named in the packet header, or NULL if no such
> + * RPC exists. The RPC has been locked by the caller but will
> + * be unlocked here.
> + */
> +void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> + struct homa_rpc *rpc)
> + __releases(rpc->bucket_lock)
> +{
> + const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> + struct homa_ack_hdr *h = (struct homa_ack_hdr *)skb->data;
> + int i, count;
> +
> + if (rpc) {
> + homa_rpc_free(rpc);
> + homa_rpc_unlock(rpc);
Another point that makes the locking scheme hard to follow, IMHO, is the
fact that many non-locking-related functions acquire or release some
lock internally. The code would be much easier to follow if you could
pair the lock and unlock inside the same code block as much as possible.
> + }
> +
> + count = ntohs(h->num_acks);
> + for (i = 0; i < count; i++)
> + homa_rpc_acked(hsk, &saddr, &h->acks[i]);
> + kfree_skb(skb);
> +}
> +
> +/**
> + * homa_rpc_abort() - Terminate an RPC.
> + * @rpc: RPC to be terminated. Must be locked by caller.
> + * @error: A negative errno value indicating the error that caused the abort.
> + * If this is a client RPC, the error will be returned to the
> + * application; if it's a server RPC, the error is ignored and
> + * we just free the RPC.
> + */
> +void homa_rpc_abort(struct homa_rpc *rpc, int error)
> +{
> + if (!homa_is_client(rpc->id)) {
> + homa_rpc_free(rpc);
> + return;
> + }
> + rpc->error = error;
> + homa_sock_lock(rpc->hsk, "homa_rpc_abort");
> + if (!rpc->hsk->shutdown)
> + homa_rpc_handoff(rpc);
> + homa_sock_unlock(rpc->hsk);
> +}
> +
> +/**
> + * homa_abort_rpcs() - Abort all RPCs to/from a particular peer.
> + * @homa: Overall data about the Homa protocol implementation.
> + * @addr: Address (network order) of the destination whose RPCs are
> + * to be aborted.
> + * @port: If nonzero, then RPCs will only be aborted if they were
> + * targeted at this server port.
> + * @error: Negative errno value indicating the reason for the abort.
> + */
> +void homa_abort_rpcs(struct homa *homa, const struct in6_addr *addr,
> + int port, int error)
> +{
> + struct homa_socktab_scan scan;
> + struct homa_rpc *rpc, *tmp;
> + struct homa_sock *hsk;
> +
> + rcu_read_lock();
> + for (hsk = homa_socktab_start_scan(homa->port_map, &scan); hsk;
> + hsk = homa_socktab_next(&scan)) {
> + /* Skip the (expensive) lock acquisition if there's no
> + * work to do.
> + */
> + if (list_empty(&hsk->active_rpcs))
> + continue;
> + if (!homa_protect_rpcs(hsk))
> + continue;
> + list_for_each_entry_safe(rpc, tmp, &hsk->active_rpcs,
> + active_links) {
> + if (!ipv6_addr_equal(&rpc->peer->addr, addr))
> + continue;
> + if (port && rpc->dport != port)
> + continue;
> + homa_rpc_lock(rpc, "rpc_abort_rpcs");
> + homa_rpc_abort(rpc, error);
> + homa_rpc_unlock(rpc);
> + }
> + homa_unprotect_rpcs(hsk);
> + }
> + homa_socktab_end_scan(&scan);
> + rcu_read_unlock();
> +}
> +
> +/**
> + * homa_abort_sock_rpcs() - Abort all outgoing (client-side) RPCs on a given
> + * socket.
> + * @hsk: Socket whose RPCs should be aborted.
> + * @error: Zero means that the aborted RPCs should be freed immediately.
> + * A nonzero value means that the RPCs should be marked
> + * complete, so that they can be returned to the application;
> + * this value (a negative errno) will be returned from
> + * recvmsg.
> + */
> +void homa_abort_sock_rpcs(struct homa_sock *hsk, int error)
> +{
> + struct homa_rpc *rpc, *tmp;
> +
> + rcu_read_lock();
> + if (list_empty(&hsk->active_rpcs))
> + goto done;
> + if (!homa_protect_rpcs(hsk))
> + goto done;
> + list_for_each_entry_safe(rpc, tmp, &hsk->active_rpcs, active_links) {
> + if (!homa_is_client(rpc->id))
> + continue;
> + homa_rpc_lock(rpc, "homa_abort_sock_rpcs");
> + if (rpc->state == RPC_DEAD) {
> + homa_rpc_unlock(rpc);
> + continue;
> + }
> + if (error)
> + homa_rpc_abort(rpc, error);
> + else
> + homa_rpc_free(rpc);
> + homa_rpc_unlock(rpc);
> + }
> + homa_unprotect_rpcs(hsk);
> +done:
> + rcu_read_unlock();
> +}
> +
> +/**
> + * homa_register_interests() - Records information in various places so
> + * that a thread will be woken up if an RPC that it cares about becomes
> + * available.
> + * @interest: Used to record information about the messages this thread is
> + * waiting on. The initial contents of the structure are
> + * assumed to be undefined.
> + * @hsk: Socket on which relevant messages will arrive. Must not be
> + * locked.
> + * @flags: Flags field from homa_recvmsg_args; see manual entry for
> + * details.
> + * @id: If non-zero, then the caller is interested in receiving
> + * the response for this RPC (@id must be a client request).
> + * Return: Either zero or a negative errno value. If a matching RPC
> + * is already available, information about it will be stored in
> + * interest.
> + */
> +int homa_register_interests(struct homa_interest *interest,
> + struct homa_sock *hsk, int flags, __u64 id)
> +{
> + struct homa_rpc *rpc = NULL;
> + int locked = 1;
> +
> + homa_interest_init(interest);
> + if (id != 0) {
> + if (!homa_is_client(id))
> + return -EINVAL;
> + rpc = homa_find_client_rpc(hsk, id); /* Locks rpc. */
> + if (!rpc)
> + return -EINVAL;
> + if (rpc->interest && rpc->interest != interest) {
> + homa_rpc_unlock(rpc);
> + return -EINVAL;
> + }
> + }
> +
> + /* Need both the RPC lock (acquired above) and the socket lock to
> + * avoid races.
> + */
> + homa_sock_lock(hsk, "homa_register_interests");
> + if (hsk->shutdown) {
> + homa_sock_unlock(hsk);
> + if (rpc)
> + homa_rpc_unlock(rpc);
> + return -ESHUTDOWN;
> + }
> +
> + if (id != 0) {
> + if ((atomic_read(&rpc->flags) & RPC_PKTS_READY) || rpc->error)
> + goto claim_rpc;
> + rpc->interest = interest;
> + interest->reg_rpc = rpc;
> + homa_rpc_unlock(rpc);
With the current scheme you should release the hsk socket lock before
releasing the RPC one.
> + }
> +
> + locked = 0;
> + if (flags & HOMA_RECVMSG_RESPONSE) {
> + if (!list_empty(&hsk->ready_responses)) {
> + rpc = list_first_entry(&hsk->ready_responses,
> + struct homa_rpc,
> + ready_links);
> + goto claim_rpc;
> + }
> + /* Insert this thread at the *front* of the list;
> + * we'll get better cache locality if we reuse
> + * the same thread over and over, rather than
> + * round-robining between threads. Same below.
> + */
> + list_add(&interest->response_links,
> + &hsk->response_interests);
> + }
> + if (flags & HOMA_RECVMSG_REQUEST) {
> + if (!list_empty(&hsk->ready_requests)) {
> + rpc = list_first_entry(&hsk->ready_requests,
> + struct homa_rpc, ready_links);
> + /* Make sure the interest isn't on the response list;
> + * otherwise it might receive a second RPC.
> + */
> + if (!list_empty(&interest->response_links))
> + list_del_init(&interest->response_links);
> + goto claim_rpc;
> + }
> + list_add(&interest->request_links, &hsk->request_interests);
> + }
> + homa_sock_unlock(hsk);
> + return 0;
> +
> +claim_rpc:
> + list_del_init(&rpc->ready_links);
> + if (!list_empty(&hsk->ready_requests) ||
> + !list_empty(&hsk->ready_responses)) {
> + hsk->sock.sk_data_ready(&hsk->sock);
> + }
> +
> + /* This flag is needed to keep the RPC from being reaped during the
> + * gap between when we release the socket lock and we acquire the
> + * RPC lock.
> + */
> + atomic_or(RPC_HANDING_OFF, &rpc->flags);
> + homa_sock_unlock(hsk);
> + if (!locked) {
> + atomic_or(APP_NEEDS_LOCK, &rpc->flags);
> + homa_rpc_lock(rpc, "homa_register_interests");
> + atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
> + locked = 1;
> + }
> + atomic_andnot(RPC_HANDING_OFF, &rpc->flags);
> + homa_interest_set_rpc(interest, rpc, locked);
> + return 0;
> +}
> +
> +/**
> + * homa_wait_for_message() - Wait for receipt of an incoming message
> + * that matches the parameters. Various other activities can occur while
> + * waiting, such as reaping dead RPCs and copying data to user space.
> + * @hsk: Socket where messages will arrive.
> + * @flags: Flags field from homa_recvmsg_args; see manual entry for
> + * details.
> + * @id: If non-zero, then a response message matching this id may
> + * be returned (@id must refer to a client request).
> + *
> + * Return: Pointer to an RPC that matches @flags and @id, or a negative
> + * errno value. The RPC will be locked; the caller must unlock.
> + */
> +struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
> + __u64 id)
> + __acquires(&rpc->bucket_lock)
> +{
> + struct homa_rpc *result = NULL;
> + struct homa_interest interest;
> + struct homa_rpc *rpc = NULL;
> + int error;
> +
> + /* Each iteration of this loop finds an RPC, but it might not be
> + * in a state where we can return it (e.g., there might be packets
> + * ready to transfer to user space, but the incoming message isn't yet
> + * complete). Thus it could take many iterations of this loop
> + * before we have an RPC with a complete message.
> + */
> + while (1) {
> + error = homa_register_interests(&interest, hsk, flags, id);
> + rpc = homa_interest_get_rpc(&interest);
> + if (rpc)
> + goto found_rpc;
> + if (error < 0) {
> + result = ERR_PTR(error);
> + goto found_rpc;
> + }
> +
> + /* There is no ready RPC so far. Clean up dead RPCs before
> + * going to sleep (or returning, if in nonblocking mode).
> + */
> + while (1) {
> + int reaper_result;
> +
> + rpc = homa_interest_get_rpc(&interest);
> + if (rpc)
> + goto found_rpc;
> + reaper_result = homa_rpc_reap(hsk, false);
> + if (reaper_result == 0)
> + break;
> +
> + /* Give NAPI and SoftIRQ tasks a chance to run. */
> + schedule();
> + }
> + if (flags & HOMA_RECVMSG_NONBLOCKING) {
> + result = ERR_PTR(-EAGAIN);
> + goto found_rpc;
> + }
> +
> + /* Now it's time to sleep. */
> + set_current_state(TASK_INTERRUPTIBLE);
> + rpc = homa_interest_get_rpc(&interest);
> + if (!rpc && !hsk->shutdown)
> + schedule();
> + __set_current_state(TASK_RUNNING);
> +
> +found_rpc:
> + /* If we get here, it means either an RPC is ready for our
> + * attention or an error occurred.
> + *
> + * First, clean up all of the interests. Must do this before
> + * making any other decisions, because until we do, an incoming
> + * message could still be passed to us. Note: if we went to
> + * sleep, then this info was already cleaned up by whoever
> + * woke us up. Also, values in the interest may change between
> + * when we test them below and when we acquire the socket lock,
> + * so they have to be checked again after locking the socket.
> + */
> + if (interest.reg_rpc ||
> + !list_empty(&interest.request_links) ||
> + !list_empty(&interest.response_links)) {
> + homa_sock_lock(hsk, "homa_wait_for_message");
> + if (interest.reg_rpc)
> + interest.reg_rpc->interest = NULL;
> + if (!list_empty(&interest.request_links))
> + list_del_init(&interest.request_links);
> + if (!list_empty(&interest.response_links))
> + list_del_init(&interest.response_links);
> + homa_sock_unlock(hsk);
> + }
> +
> + /* Now check to see if we received an RPC handoff (note that
> + * this could have happened anytime up until we reset the
> + * interests above).
> + */
> + rpc = homa_interest_get_rpc(&interest);
> + if (rpc) {
> + if (!interest.locked) {
> + atomic_or(APP_NEEDS_LOCK, &rpc->flags);
> + homa_rpc_lock(rpc, "homa_wait_for_message");
> + atomic_andnot(APP_NEEDS_LOCK | RPC_HANDING_OFF,
> + &rpc->flags);
> + } else {
> + atomic_andnot(RPC_HANDING_OFF, &rpc->flags);
> + }
> + if (!rpc->error)
> + rpc->error = homa_copy_to_user(rpc);
> + if (rpc->state == RPC_DEAD) {
> + homa_rpc_unlock(rpc);
> + continue;
> + }
> + if (rpc->error)
> + goto done;
> + atomic_andnot(RPC_PKTS_READY, &rpc->flags);
> + if (rpc->msgin.bytes_remaining == 0 &&
> + !skb_queue_len(&rpc->msgin.packets))
> + goto done;
> + homa_rpc_unlock(rpc);
> + }
> +
> + /* A complete message isn't available: check for errors. */
> + if (IS_ERR(result))
> + return result;
> + if (signal_pending(current))
> + return ERR_PTR(-EINTR);
> +
> + /* No message and no error; try again. */
> + }
> +
> +done:
> + return rpc;
The amount of custom waiting code here is concerning. Why can't you build
this around sk_wait_event()?
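For reference, the usual pattern looks roughly like this (sketch only: it
uses hsk->ready_responses as the wake-up condition just for illustration,
ignores the request/response split, and assumes the caller holds the socket
lock via lock_sock(), since sk_wait_event() drops and re-takes it):

	DEFINE_WAIT_FUNC(wait, woken_wake_function);
	long timeo = sock_rcvtimeo(&hsk->sock,
				   flags & HOMA_RECVMSG_NONBLOCKING);

	add_wait_queue(sk_sleep(&hsk->sock), &wait);
	while (list_empty(&hsk->ready_responses)) {
		if (!timeo || signal_pending(current))
			break;
		sk_wait_event(&hsk->sock, &timeo,
			      !list_empty(&hsk->ready_responses), &wait);
	}
	remove_wait_queue(sk_sleep(&hsk->sock), &wait);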
/P
* Re: [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
` (11 preceding siblings ...)
2025-01-15 18:59 ` [PATCH net-next v6 12/12] net: homa: create Makefile and Kconfig John Ousterhout
@ 2025-01-24 8:55 ` Paolo Abeni
2025-02-10 19:19 ` John Ousterhout
12 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-24 8:55 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 1/15/25 7:59 PM, John Ousterhout wrote:
> This patch series begins the process of upstreaming the Homa transport
> protocol. Homa is an alternative to TCP for use in datacenter
> environments. It provides 10-100x reductions in tail latency for short
> messages relative to TCP. Its benefits are greatest for mixed workloads
> containing both short and long messages running under high network loads.
> Homa is not API-compatible with TCP: it is connectionless and message-
> oriented (but still reliable and flow-controlled). Homa's new API not
> only contributes to its performance gains, but it also eliminates the
> massive amount of connection state required by TCP for highly connected
> datacenter workloads.
>
> For more details on Homa, please consult the Homa Wiki:
> https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
> The Wiki has pointers to two papers on Homa (one of which describes
> this implementation) as well as man pages describing the application
> API and other information.
>
> There is also a GitHub repo for Homa:
> https://github.com/PlatformLab/HomaModule
> The GitHub repo contains a superset of this patch set, including:
> * Additional source code that will eventually be upstreamed
> * Extensive unit tests (which will also be upstreamed eventually)
> * Application-level library functions (which need to go in glibc?)
> * Man pages (which need to be upstreamed as well)
> * Benchmarking and instrumentation code
>
> For this patch series, Homa has been stripped down to the bare minimum
> functionality capable of actually executing remote procedure calls. (about
> 8000 lines of source code, compared to 15000 in the complete Homa). The
> remaining code will be upstreamed in smaller batches once this patch
> series has been accepted. Note: the code in this patch series is
> functional but its performance is not very interesting (about the same
> as TCP).
>
> The patch series is arranged to introduce the major functional components
> of Homa. Until the last patch has been applied, the code is inert (it
> will not be compiled).
>
> Note: this implementation of Homa supports both IPv4 and IPv6.
I haven't completed reviewing the current iteration yet, but with the
amount of code inspected at this point, the series looks quite far from
a mergeable status.
Before the next iteration, I strongly advise completely reviewing (and
possibly rethinking) the locking scheme, especially the RCU usage;
implementing rcvbuf and sendbuf accounting (and possibly even memory
accounting); reorganizing the code for better reviewability (the code
in each patch should refer to/use only the code in the current and
previous patches); making more use of existing kernel APIs and
constructs; and testing the code with all the kernel/configs/debug.config
knobs enabled.
Unless a patch is new or completely rewritten from scratch, it would be
helpful to add a per-patch changelog, after the SoB tag and a '---'
separator.
Thanks,
Paolo
* Re: [PATCH net-next v6 03/12] net: homa: create shared Homa header files
2025-01-23 11:01 ` Paolo Abeni
@ 2025-01-24 21:21 ` John Ousterhout
2025-01-27 9:05 ` Paolo Abeni
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-24 21:21 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Thu, Jan 23, 2025 at 3:01 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/15/25 7:59 PM, John Ousterhout wrote:
> [...]
> > +/**
> > + * union sockaddr_in_union - Holds either an IPv4 or IPv6 address (smaller
> > + * and easier to use than sockaddr_storage).
> > + */
> > +union sockaddr_in_union {
> > + /** @sa: Used to access as a generic sockaddr. */
> > + struct sockaddr sa;
> > +
> > + /** @in4: Used to access as IPv4 socket. */
> > + struct sockaddr_in in4;
> > +
> > + /** @in6: Used to access as IPv6 socket. */
> > + struct sockaddr_in6 in6;
> > +};
>
> > There are other protocols using the same struct with a different name
> > (sctp) or a very similar struct (mptcp). It would be nice to move this
> > into a shared header and allow re-use.
I would be happy to do this, but I suspect it should be done
separately from this patch series. It's not obvious to me where such a
definition should go; can you suggest an appropriate place for it?
> [...]
> > + /**
> > + * @core: Core on which @thread was executing when it registered
> > + * its interest. Used for load balancing (see balance.txt).
> > + */
> > + int core;
>
> I don't see a 'balance.txt' file in this submission, possibly stray
> reference?
This is a file in the GitHub repo that I hadn't (yet) been including
with the code being upstreamed. I've now added this file (and a couple
of other explanatory .txt files) to the manifest for upstreaming.
> [...]
> > + /**
> > + * @pacer_wake_time: time (in sched_clock units) when the pacer last
> > + * woke up (if the pacer is running) or 0 if the pacer is sleeping.
> > + */
> > + __u64 pacer_wake_time;
>
> why do you use the '__' variant here? this is not uapi, you should use
> the plain u64/u32 (more occurrences below).
Sorry, newbie mistake (I wasn't aware of the difference). I will fix everywhere.
> [...]
> > + /**
> > + * @prev_default_port: The most recent port number assigned from
> > + * the range of default ports.
> > + */
> > + __u16 prev_default_port __aligned(L1_CACHE_BYTES);
>
> I think the idiomatic way to express the above is to use:
>
> u16 prev_default_port ____cacheline_aligned;
>
> or
>
> u16 prev_default_port ____cacheline_aligned_in_smp;
>
> more similar occurrences below.
I will fix everywhere.
Thanks for the comments.
-John-
* Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-23 12:06 ` Paolo Abeni
@ 2025-01-24 23:53 ` John Ousterhout
2025-01-25 0:46 ` Andrew Lunn
2025-01-27 9:41 ` Paolo Abeni
0 siblings, 2 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-24 23:53 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Thu, Jan 23, 2025 at 4:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
...
> > + pool->descriptors = kmalloc_array(pool->num_bpages,
> > + sizeof(struct homa_bpage),
> > + GFP_ATOMIC);
>
> Possibly worth adding '| __GFP_ZERO' to avoid zeroing some fields later.
I prefer to do all the initialization explicitly (this makes it
totally clear that a zero value is intended, as opposed to accidental
omission of an initializer). If you still think I should use
__GFP_ZERO, let me know and I'll add it.
> > +
> > + /* Allocate and initialize core-specific data. */
> > + pool->cores = kmalloc_array(nr_cpu_ids, sizeof(struct homa_pool_core),
> > + GFP_ATOMIC);
>
> Uhm... on large systems this could be an order-3 allocation, which in
> turn could fail quite easily under memory pressure, and it looks
> contradictory WRT the cover letter statement about reducing the
> amount of per-socket state.
>
> Why don't you use alloc_percpu_gfp() here?
I have now switched to alloc_percpu_gfp. On the issue of per-socket
memory requirements, Homa doesn't significantly reduce the amount of
memory allocated for any given socket. Its memory savings come about
because a single Homa socket can be used to communicate with any
number of peers simultaneously, whereas TCP requires a separate socket
for each peer-to-peer connection. I have added a bit more to the cover
letter to clarify this.
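(For reference, a minimal sketch of the per-CPU version; homa_pool_core is
the type from the quoted code, but the surrounding error handling is an
assumption on my part:)
        /* pool->cores becomes a "struct homa_pool_core __percpu *". */
        pool->cores = alloc_percpu_gfp(struct homa_pool_core, GFP_ATOMIC);
        if (!pool->cores)
                return -ENOMEM;

        /* On the hot path (preemption already disabled by the caller's
         * spinlock):
         */
        struct homa_pool_core *core = this_cpu_ptr(pool->cores);

        /* And when tearing the pool down: */
        free_percpu(pool->cores);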
> > +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, __u32 *pages,
> > + int set_owner)
> > +{
> > + int core_num = raw_smp_processor_id();
>
> Why the 'raw' variant? If this code is pre-emptible it means another
> process could be scheduled on the same core...
My understanding is that raw_smp_processor_id is faster.
homa_pool_get_pages is invoked with a spinlock held, so there is no
risk of a core switch while it is executing. Is there some other
problem I have missed?
> > +
> > + cur = core->next_candidate;
> > + core->next_candidate++;
>
> ... here, making this increment racy.
Because this code always runs in atomic mode, I don't believe there is
any danger of racing: no other thread can run on the same core
concurrently.
> > + if (cur >= limit) {
> > + core->next_candidate = 0;
> > +
> > + /* Must recompute the limit for each new loop through
> > + * the bpage array: we may need to consider a larger
> > + * range of pages because of concurrent allocations.
> > + */
> > + limit = 0;
> > + continue;
> > + }
> > + bpage = &pool->descriptors[cur];
> > +
> > + /* Figure out whether this candidate is free (or can be
> > + * stolen). Do a quick check without locking the page, and
> > + * if the page looks promising, then lock it and check again
> > + * (must check again in case someone else snuck in and
> > + * grabbed the page).
> > + */
> > + ref_count = atomic_read(&bpage->refs);
> > + if (ref_count >= 2 || (ref_count == 1 && (bpage->owner < 0 ||
> > + bpage->expiration > now)))
>
> The above conditions could be placed in a separate helper, making the code
> easier to follow and avoiding some duplication.
Done; I've created a new function homa_bpage_available.
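Roughly the following (a sketch based only on the condition quoted above;
the exact signature in the code may differ):
static bool homa_bpage_available(struct homa_bpage *bpage, u64 now)
{
        int ref_count = atomic_read(&bpage->refs);

        /* Available if completely free, or if its only reference is an
         * ownership whose expiration time has passed (so it can be stolen).
         */
        return ref_count == 0 || (ref_count == 1 && bpage->owner >= 0 &&
                                  bpage->expiration <= now);
}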
> > + /* First allocate any full bpages that are needed. */
> > + full_pages = rpc->msgin.length >> HOMA_BPAGE_SHIFT;
> > + if (unlikely(full_pages)) {
> > + if (homa_pool_get_pages(pool, full_pages, pages, 0) != 0)
>
> full_pages must be less than HOMA_MAX_BPAGES, but I don't see any check
> limiting the incoming message length?!?
Oops, good catch. There was a check in the outbound path, but not in
the inbound path. I have added one now (in homa_message_in_init in
homa_incoming.c).
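The added check is along these lines (a sketch; the variable name and error
value are assumptions, while HOMA_BPAGE_SHIFT and HOMA_MAX_BPAGES are the
constants already mentioned in this thread):
        /* Reject messages too large to ever be buffered: they would need
         * more than HOMA_MAX_BPAGES full bpages.
         */
        if (length > (HOMA_MAX_BPAGES << HOMA_BPAGE_SHIFT))
                return -EINVAL;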
> > +
> > + /* We get here if there wasn't enough buffer space for this
> > + * message; add the RPC to hsk->waiting_for_bufs.
> > + */
> > +out_of_space:
> > + homa_sock_lock(pool->hsk, "homa_pool_allocate");
>
> There is some chicken-egg issue, with homa_sock_lock() being defined
> only later in the series, but it looks like the string argument is never
> used.
Right: in normal usage this argument is ignored. It exists because
there are occasionally deadlocks involving socket locks; when that
happens I temporarily add code to homa_sock_lock that uses this
argument to help track them down. I'd prefer to keep it, even though
it isn't normally used, because otherwise when a new deadlock arises
I'd have to modify every call to homa_sock_lock in order to add the
information back in again. I added a few more words to the comment for
homa_sock_lock to make this more clear.
> > + if (!homa_rpc_try_lock(rpc, "homa_pool_check_waiting")) {
> > + /* Can't just spin on the RPC lock because we're
> > + * holding the socket lock (see sync.txt). Instead,
>
> Stray reference to sync.txt. It would be nice to have the locking scheme
> described start to finish somewhere in this series.
sync.txt will be part of the next revision of this series.
> > +struct homa_bpage {
> > + union {
> > + /**
> > + * @cache_line: Ensures that each homa_bpage object
> > + * is exactly one cache line long.
> > + */
> > + char cache_line[L1_CACHE_BYTES];
> > + struct {
> > + /** @lock: to synchronize shared access. */
> > + spinlock_t lock;
> > +
> > + /**
> > + * @refs: Counts number of distinct uses of this
> > + * bpage (1 tick for each message that is using
> > + * this page, plus an additional tick if the @owner
> > + * field is set).
> > + */
> > + atomic_t refs;
> > +
> > + /**
> > + * @owner: kernel core that currently owns this page
> > + * (< 0 if none).
> > + */
> > + int owner;
> > +
> > + /**
> > + * @expiration: time (in sched_clock() units) after
> > + * which it's OK to steal this page from its current
> > + * owner (if @refs is 1).
> > + */
> > + __u64 expiration;
> > + };
>
> ____cacheline_aligned instead of inserting the struct into an union
> should suffice.
Done (but now that alloc_percpu_gfp is being used I'm not sure this is
needed to ensure alignment?).
-John-
* Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-24 23:53 ` John Ousterhout
@ 2025-01-25 0:46 ` Andrew Lunn
2025-01-26 5:33 ` John Ousterhout
2025-01-27 9:41 ` Paolo Abeni
1 sibling, 1 reply; 68+ messages in thread
From: Andrew Lunn @ 2025-01-25 0:46 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
> > > + homa_sock_lock(pool->hsk, "homa_pool_allocate");
> >
> > There is some chicken-egg issue, with homa_sock_lock() being defined
> > only later in the series, but it looks like the string argument is never
> > used.
>
> Right: in normal usage this argument is ignored. It exists because
> there are occasionally deadlocks involving socket locks; when that
> happens I temporarily add code to homa_sock_lock that uses this
> argument to help track them down. I'd prefer to keep it, even though
> it isn't normally used, because otherwise when a new deadlock arises
> I'd have to modify every call to homa_sock_lock in order to add the
> information back in again. I added a few more words to the comment for
> homa_sock_lock to make this more clear.
CONFIG_PROVE_LOCKING is pretty good at finding deadlocks, before they
happen. With practice you can turn the stack traces back to lines of
code, to know where each lock was taken. This is why no other part of
Linux has this sort of annotation with a string indicating where a lock
was taken.
You really should have CONFIG_PROVE_LOCKING enabled when doing
development and functional testing. Then turn it off for performance
testing.
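For reference, an illustrative fragment of the kind of debug knobs meant
here (not a complete debug.config):
        CONFIG_PROVE_LOCKING=y       # lockdep: reports deadlocks before they bite
        CONFIG_DEBUG_SPINLOCK=y
        CONFIG_DEBUG_ATOMIC_SLEEP=y
        CONFIG_DEBUG_PREEMPT=y       # also catches smp_processor_id() misuse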
Andrew
* Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-25 0:46 ` Andrew Lunn
@ 2025-01-26 5:33 ` John Ousterhout
0 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-26 5:33 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
On Fri, Jan 24, 2025 at 4:46 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > > + homa_sock_lock(pool->hsk, "homa_pool_allocate");
> > >
> > > There is some chicken-egg issue, with homa_sock_lock() being defined
> > > only later in the series, but it looks like the string argument is never
> > > used.
> >
> > Right: in normal usage this argument is ignored. It exists because
> > there are occasionally deadlocks involving socket locks; when that
> > happens I temporarily add code to homa_sock_lock that uses this
> > argument to help track them down. I'd prefer to keep it, even though
> > it isn't normally used, because otherwise when a new deadlock arises
> > I'd have to modify every call to homa_sock_lock in order to add the
> > information back in again. I added a few more words to the comment for
> > homa_sock_lock to make this more clear.
>
> CONFIG_PROVE_LOCKING is pretty good at finding deadlocks, before they
> happen. With practice you can turn the stack traces back to lines of
> code, to know where each lock was taken. This is why no other part of
> Linux has this sort of annotate with a string indicating where a lock
> was taken.
>
> You really should have CONFIG_PROVE_LOCKING enabled when doing
> development and functional testing. Then turn it off for performance
> testing.
This makes sense. I wasn't aware of CONFIG_LOCKDEP or
CONFIG_PROVE_LOCKING until Eric Dumazet mentioned them in a comment on
an earlier version of this patch. I've had them set in my development
environment ever since, and I agree that the extra annotations
shouldn't be necessary anymore. I'll take them out.
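With the string argument gone, the helpers reduce to something like this (a
sketch; the field name hsk->lock is an assumption):
static inline void homa_sock_lock(struct homa_sock *hsk)
        __acquires(&hsk->lock)
{
        spin_lock_bh(&hsk->lock);
}

static inline void homa_sock_unlock(struct homa_sock *hsk)
        __releases(&hsk->lock)
{
        spin_unlock_bh(&hsk->lock);
}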
-John-
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-23 14:29 ` Paolo Abeni
@ 2025-01-27 5:22 ` John Ousterhout
2025-01-27 10:01 ` Paolo Abeni
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-27 5:22 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Thu, Jan 23, 2025 at 6:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> ...
> How many RPCs should concurrently exist in a real server? With 1024
> buckets there could be a lot of them on some lists, and a linear search
> could be very expensive. And this happens with BH disabled.
Server RPCs tend to be short-lived, so my best guess is that the
number of concurrent server RPCs will be relatively small (maybe a few
hundred?). But this is just a guess: I won't know for sure until I can
measure Homa in production use. If the number of concurrent RPCs turns
out to be huge then we'll have to find a different solution.
> > +
> > + /* Initialize fields that don't require the socket lock. */
> > + srpc = kmalloc(sizeof(*srpc), GFP_ATOMIC);
>
> You could do the allocation outside the bucket lock, too and avoid the
> ATOMIC flag.
In many cases this function will return an existing RPC so there won't
be any need to allocate; I wouldn't want to pay the allocation
overhead in that case. I could conceivably check the offset in the
packet and pre-allocate if the offset is zero (in this case it's
highly unlikely that there will be an existing RPC). But this is
starting to feel complicated so I'm not sure it's worth doing (and
there are many other places where GFP_ATOMIC is unavoidable, so fixing
just one place may not make much difference). homa_rpc objects are
about 500 bytes, so not super huge. I'm inclined to leave this as is
and consider a more complex approach only if problems arise in
practice.
> > + * homa_rpc_free() - Destructor for homa_rpc; will arrange for all resources
> > + * associated with the RPC to be released (eventually).
> > + * @rpc: Structure to clean up, or NULL. Must be locked. Its socket must
> > + * not be locked.
> > + */
> > +void homa_rpc_free(struct homa_rpc *rpc)
> > + __acquires(&rpc->hsk->lock)
> > + __releases(&rpc->hsk->lock)
>
> The function name is IMHO misleading. I expect homa_rpc_free() to
> actually free the memory allocated for the rpc argument, including the
> rpc struct itself.
That's a fair point. I have bitten the bullet and renamed it to homa_rpc_end.
> > + if (rpc->msgin.length >= 0) {
> > + while (1) {
> > + struct sk_buff *skb;
> > +
> > + skb = skb_dequeue(&rpc->msgin.packets);
> > + if (!skb)
> > + break;
> > + kfree_skb(skb);
>
> You can use:
> rx_free += skb_queue_len(&rpc->msgin.packets);
> skb_queue_purge(&rpc->msgin.packets);
Done.
> > +/**
> > + * homa_find_client_rpc() - Locate client-side information about the RPC that
> > + * a packet belongs to, if there is any. Thread-safe without socket lock.
> > + * @hsk: Socket via which packet was received.
> > + * @id: Unique identifier for the RPC.
> > + *
> > + * Return: A pointer to the homa_rpc for this id, or NULL if none.
> > + * The RPC will be locked; the caller must eventually unlock it
> > + * by invoking homa_rpc_unlock.
>
> Why are you using this locking scheme? It looks like it adds quite a bit of
> complexity. The usual way of handling this kind of hash lookup is to do the
> lookup locklessly, under RCU, and eventually add a refcnt to the
> looked-up entity - homa_rpc - to ensure it will not change under the
> hood after the lookup.
I considered using RCU for this, but the time period for RCU
reclamation is too long (10's - 100's of ms, if I recall correctly).
Homa needs to handle a very high rate of RPCs, so this would result in
too much accumulated memory (in particular, skbs don't get reclaimed
until the RPC is reclaimed).
The caller must have a lock on the homa_rpc anyway, so RCU wouldn't
save the overhead of acquiring a lock. The reason for putting the lock
in the hash table instead of the homa_rpc is that this makes RPC
creation/deletion atomic with respect to lookups. The lock was
initially in the homa_rpc, but that led to complex races with hash
table insertion/deletion. This is explained in sync.txt, but of course
you don't have that (yet).
This approach is unusual, but it has worked out really well. Before
implementing this approach I had what seemed like a never-ending
stream of synchronization problems over the socket hash tables; each
"fix" introduced new problems. Once I implemented this, all the
problems went away and the code has been very stable ever since
(several years now).
> > + */
> > +struct homa_rpc *homa_find_client_rpc(struct homa_sock *hsk, __u64 id)
> > + __acquires(&crpc->bucket->lock)
> > +{
> > + struct homa_rpc_bucket *bucket = homa_client_rpc_bucket(hsk, id);
> > + struct homa_rpc *crpc;
> > +
> > + homa_bucket_lock(bucket, id, __func__);
> > + hlist_for_each_entry_rcu(crpc, &bucket->rpcs, hash_links) {
>
> Why are you using the RCU variant? I don't see RCU access for rpcs.
I have no idea why that uses RCU (maybe leftover from a long-ago
version that did actually use RCU?). I'll take it out. After seeing
this, I decided to review all of the RCU usage in Homa and I have
found and fixed several other problems and/or unnecessary uses of RCU.
-John-
* Re: [PATCH net-next v6 03/12] net: homa: create shared Homa header files
2025-01-24 21:21 ` John Ousterhout
@ 2025-01-27 9:05 ` Paolo Abeni
2025-01-27 17:04 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-27 9:05 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 1/24/25 10:21 PM, John Ousterhout wrote:
> On Thu, Jan 23, 2025 at 3:01 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> On 1/15/25 7:59 PM, John Ousterhout wrote:
>> [...]
>>> +/**
>>> + * union sockaddr_in_union - Holds either an IPv4 or IPv6 address (smaller
>>> + * and easier to use than sockaddr_storage).
>>> + */
>>> +union sockaddr_in_union {
>>> + /** @sa: Used to access as a generic sockaddr. */
>>> + struct sockaddr sa;
>>> +
>>> + /** @in4: Used to access as IPv4 socket. */
>>> + struct sockaddr_in in4;
>>> +
>>> + /** @in6: Used to access as IPv6 socket. */
>>> + struct sockaddr_in6 in6;
>>> +};
>>
>> There are other protocol using the same struct with a different name
>> (sctp) or a very similar struct (mptcp). It would be nice to move this
>> in a shared header and allow re-use.
>
> I would be happy to do this, but I suspect it should be done
> separately from this patch series. It's not obvious to me where such a
> definition should go; can you suggest an appropriate place for it?
Probably a new header file under include/net/. My choices for file names
are usually not that good, but I would go for 'sockaddr_generic.h' or
'sockaddr_common.h'.
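Something along these lines would probably be enough (a sketch only; the
union name sockaddr_inet and the include list are just placeholders):
/* include/net/sockaddr_common.h (hypothetical) */
#ifndef _NET_SOCKADDR_COMMON_H
#define _NET_SOCKADDR_COMMON_H

#include <linux/in.h>
#include <linux/in6.h>
#include <linux/socket.h>

/* Holds either an IPv4 or an IPv6 address; smaller and easier to use
 * than sockaddr_storage.
 */
union sockaddr_inet {
        struct sockaddr         sa;
        struct sockaddr_in      in4;
        struct sockaddr_in6     in6;
};

#endif /* _NET_SOCKADDR_COMMON_H */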
/P
* Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-24 23:53 ` John Ousterhout
2025-01-25 0:46 ` Andrew Lunn
@ 2025-01-27 9:41 ` Paolo Abeni
2025-01-27 17:34 ` John Ousterhout
1 sibling, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-27 9:41 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 1/25/25 12:53 AM, John Ousterhout wrote:
> On Thu, Jan 23, 2025 at 4:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
> ...
>>> + pool->descriptors = kmalloc_array(pool->num_bpages,
>>> + sizeof(struct homa_bpage),
>>> + GFP_ATOMIC);
>>
>> Possibly wort adding '| __GFP_ZERO' and avoid zeroing some fields later.
>
> I prefer to do all the initialization explicitly (this makes it
> totally clear that a zero value is intended, as opposed to accidental
> omission of an initializer). If you still think I should use
> __GFP_ZERO, let me know and I'll add it.
Indeed the __GFP_ZERO flag is preferred for such allocations, as it
at the very least reduces the generated code size.
>>> +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, __u32 *pages,
>>> + int set_owner)
>>> +{
>>> + int core_num = raw_smp_processor_id();
>>
>> Why the 'raw' variant? If this code is pre-emptible it means another
>> process could be scheduled on the same core...
>
> My understanding is that raw_smp_processor_id is faster.
> homa_pool_get_pages is invoked with a spinlock held, so there is no
> risk of a core switch while it is executing. Is there some other
> problem I have missed?
raw_* variants, like __* ones, fall under the 'use at your own risk'
category.
In this specific case raw_smp_processor_id() is supposed to be used when
you don't care about the process being moved to another core while using
the 'id' value.
Using raw_smp_processor_id() and building with the CONFIG_DEBUG_PREEMPT
knob, the generated code will miss the run-time check for preemption being
actually disabled at invocation time. Such a check is added when using
smp_processor_id(), with no performance cost for non-debug builds.
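In other words, the call site can simply be (a sketch):
        /* Caller holds a spin lock, so preemption is already disabled;
         * smp_processor_id() is safe here and, with CONFIG_DEBUG_PREEMPT,
         * will verify that assumption at run time.
         */
        int core_num = smp_processor_id();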
>>> +struct homa_bpage {
>>> + union {
>>> + /**
>>> + * @cache_line: Ensures that each homa_bpage object
>>> + * is exactly one cache line long.
>>> + */
>>> + char cache_line[L1_CACHE_BYTES];
>>> + struct {
>>> + /** @lock: to synchronize shared access. */
>>> + spinlock_t lock;
>>> +
>>> + /**
>>> + * @refs: Counts number of distinct uses of this
>>> + * bpage (1 tick for each message that is using
>>> + * this page, plus an additional tick if the @owner
>>> + * field is set).
>>> + */
>>> + atomic_t refs;
>>> +
>>> + /**
>>> + * @owner: kernel core that currently owns this page
>>> + * (< 0 if none).
>>> + */
>>> + int owner;
>>> +
>>> + /**
>>> + * @expiration: time (in sched_clock() units) after
>>> + * which it's OK to steal this page from its current
>>> + * owner (if @refs is 1).
>>> + */
>>> + __u64 expiration;
>>> + };
>>
>> ____cacheline_aligned instead of inserting the struct into an union
>> should suffice.
>
> Done (but now that alloc_percpu_gfp is being used I'm not sure this is
> needed to ensure alignment?).
Yep, cacheline alignment should not be needed for percpu data.
/P
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-27 5:22 ` John Ousterhout
@ 2025-01-27 10:01 ` Paolo Abeni
2025-01-27 18:03 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-27 10:01 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 1/27/25 6:22 AM, John Ousterhout wrote:
> On Thu, Jan 23, 2025 at 6:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> ...
>> How many RPCs should concurrently exist in a real server? with 1024
>> buckets there could be a lot of them on each/some list and linear search
>> could be very expansive. And this happens with BH disabled.
>
> Server RPCs tend to be short-lived, so my best guess is that the
> number of concurrent server RPCs will be relatively small (maybe a few
> hundred?). But this is just a guess: I won't know for sure until I can
> measure Homa in production use. If the number of concurrent RPCs turns
> out to be huge then we'll have to find a different solution.
>
>>> +
>>> + /* Initialize fields that don't require the socket lock. */
>>> + srpc = kmalloc(sizeof(*srpc), GFP_ATOMIC);
>>
>> You could do the allocation outside the bucket lock, too and avoid the
>> ATOMIC flag.
>
> In many cases this function will return an existing RPC so there won't
> be any need to allocate; I wouldn't want to pay the allocation
> overhead in that case. I could conceivably check the offset in the
> packet and pre-allocate if the offset is zero (in this case it's
> highly unlikely that there will be an existing RPC).
If you use RCU properly here, you could do a lockless lookup. If such a
lookup fails, you could still do the allocation outside the lock,
avoiding it in most cases.
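In outline, the pattern would look like this (a sketch; homa_rpc_lookup_rcu
and homa_rpc_lookup_locked are hypothetical helpers, and the bucket fields
are the ones quoted earlier in this series):
        /* 1. Lockless lookup under RCU; succeeds for most packets.
         * (The returned RPC would still need to be pinned or locked
         * before use, as discussed above.)
         */
        rcu_read_lock();
        srpc = homa_rpc_lookup_rcu(bucket, id);
        rcu_read_unlock();
        if (srpc)
                return srpc;

        /* 2. Allocate outside the bucket lock (still GFP_ATOMIC if this
         * runs in softirq context, but the allocation is now skipped
         * whenever the lookup succeeds).
         */
        srpc = kmalloc(sizeof(*srpc), GFP_ATOMIC);
        if (!srpc)
                return ERR_PTR(-ENOMEM);

        /* 3. Re-check under the bucket lock; another CPU may have raced. */
        spin_lock_bh(&bucket->lock);
        existing = homa_rpc_lookup_locked(bucket, id);
        if (existing) {
                kfree(srpc);
                srpc = existing;
        } else {
                hlist_add_head_rcu(&srpc->hash_links, &bucket->rpcs);
        }
        spin_unlock_bh(&bucket->lock);
        return srpc;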
>>> +/**
>>> + * homa_find_client_rpc() - Locate client-side information about the RPC that
>>> + * a packet belongs to, if there is any. Thread-safe without socket lock.
>>> + * @hsk: Socket via which packet was received.
>>> + * @id: Unique identifier for the RPC.
>>> + *
>>> + * Return: A pointer to the homa_rpc for this id, or NULL if none.
>>> + * The RPC will be locked; the caller must eventually unlock it
>>> + * by invoking homa_rpc_unlock.
>>
>> Why are using this lock schema? It looks like it adds quite a bit of
>> complexity. The usual way of handling this kind of hash lookup is do the
>> lookup locklessly, under RCU, and eventually add a refcnt to the
>> looked-up entity - homa_rpc - to ensure it will not change under the
>> hood after the lookup.
>
> I considered using RCU for this, but the time period for RCU
> reclamation is too long (10's - 100's of ms, if I recall correctly).
The RCU grace period usually extends over a kernel jiffy (1-10 ms depending
on your kernel build options).
> Homa needs to handle a very high rate of RPCs, so this would result in
> too much accumulated memory (in particular, skbs don't get reclaimed
> until the RPC is reclaimed).
For the RPC struct, the above is a fair point, but why do skbs need to be
freed together with the RPC struct? If you have skbs sitting in, e.g., an
RX queue, you can flush that queue when the RPC goes out of scope,
without any additional delay.
> The caller must have a lock on the homa_rpc anyway, so RCU wouldn't
> save the overhead of acquiring a lock. The reason for putting the lock
> in the hash table instead of the homa_rpc is that this makes RPC
> creation/deletion atomic with respect to lookups. The lock was
> initially in the homa_rpc, but that led to complex races with hash
> table insertion/deletion. This is explained in sync.txt, but of course
> you don't have that (yet).
The per-bucket RPC lock is prone to contention; a per-RPC lock would
avoid that problem.
> This approach is unusual, but it has worked out really well. Before
> implementing this approach I had what seemed like a never-ending
> stream of synchronization problems over the socket hash tables; each
> "fix" introduced new problems. Once I implemented this, all the
> problems went away and the code has been very stable ever since
> (several years now).
Have you tried running a fuzzer on this code? I bet syzkaller will give
a lot of interesting results, if you teach it about the homa APIs.
/P
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-01-15 18:59 ` [PATCH net-next v6 08/12] net: homa: create homa_incoming.c John Ousterhout
2025-01-24 8:31 ` Paolo Abeni
@ 2025-01-27 10:19 ` Paolo Abeni
2025-01-30 0:48 ` John Ousterhout
1 sibling, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-27 10:19 UTC (permalink / raw)
To: John Ousterhout, netdev; +Cc: edumazet, horms, kuba
On 1/15/25 7:59 PM, John Ousterhout wrote:
> + /* Each iteration through the following loop processes one packet. */
> + for (; skb; skb = next) {
> + h = (struct homa_data_hdr *)skb->data;
> + next = skb->next;
> +
> + /* Relinquish the RPC lock temporarily if it's needed
> + * elsewhere.
> + */
> + if (rpc) {
> + int flags = atomic_read(&rpc->flags);
> +
> + if (flags & APP_NEEDS_LOCK) {
> + homa_rpc_unlock(rpc);
> + homa_spin(200);
Why spin on the current CPU here? This is completely unexpected, and
usually tolerated only to deal with H/W-imposed delays while programming
some device registers.
/P
* Re: [PATCH net-next v6 03/12] net: homa: create shared Homa header files
2025-01-27 9:05 ` Paolo Abeni
@ 2025-01-27 17:04 ` John Ousterhout
0 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-27 17:04 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, Jan 27, 2025 at 1:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/24/25 10:21 PM, John Ousterhout wrote:
> > On Thu, Jan 23, 2025 at 3:01 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>
> >> On 1/15/25 7:59 PM, John Ousterhout wrote:
> >> [...]
> >>> +/**
> >>> + * union sockaddr_in_union - Holds either an IPv4 or IPv6 address (smaller
> >>> + * and easier to use than sockaddr_storage).
> >>> + */
> >>> +union sockaddr_in_union {
> >>> + /** @sa: Used to access as a generic sockaddr. */
> >>> + struct sockaddr sa;
> >>> +
> >>> + /** @in4: Used to access as IPv4 socket. */
> >>> + struct sockaddr_in in4;
> >>> +
> >>> + /** @in6: Used to access as IPv6 socket. */
> >>> + struct sockaddr_in6 in6;
> >>> +};
> >>
> >> There are other protocol using the same struct with a different name
> >> (sctp) or a very similar struct (mptcp). It would be nice to move this
> >> in a shared header and allow re-use.
> >
> > I would be happy to do this, but I suspect it should be done
> > separately from this patch series. It's not obvious to me where such a
> > definition should go; can you suggest an appropriate place for it?
>
> Probably a new header file under include/net/. My choice for files name
> are usually not quite good, but I would go for 'sockaddr_generic.h' or
> 'sockaddr_common.h'
Is it OK to have a new header file with a single ~10-line definition in it?
-John-
* Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-27 9:41 ` Paolo Abeni
@ 2025-01-27 17:34 ` John Ousterhout
2025-01-27 18:28 ` Paolo Abeni
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-27 17:34 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, Jan 27, 2025 at 1:41 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/25/25 12:53 AM, John Ousterhout wrote:
> > On Thu, Jan 23, 2025 at 4:06 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > ...
> >>> + pool->descriptors = kmalloc_array(pool->num_bpages,
> >>> + sizeof(struct homa_bpage),
> >>> + GFP_ATOMIC);
> >>
> >> Possibly wort adding '| __GFP_ZERO' and avoid zeroing some fields later.
> >
> > I prefer to do all the initialization explicitly (this makes it
> > totally clear that a zero value is intended, as opposed to accidental
> > omission of an initializer). If you still think I should use
> > __GFP_ZERO, let me know and I'll add it.
>
> Indeed the __GFP_ZERO flag is the preferred for such allocation, as it
> at very least reduce the generated code size.
OK, I have added __GFP_ZERO and removed explicit zero initializers,
both here and in similar situations elsewhere in the code.
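i.e. roughly (a sketch; only the flags change matters, and the -ENOMEM path
is assumed):
        pool->descriptors = kmalloc_array(pool->num_bpages,
                                          sizeof(struct homa_bpage),
                                          GFP_ATOMIC | __GFP_ZERO);
        if (!pool->descriptors)
                return -ENOMEM;
        /* Fields that used to be zeroed explicitly no longer need it. */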
> >>> +int homa_pool_get_pages(struct homa_pool *pool, int num_pages, __u32 *pages,
> >>> + int set_owner)
> >>> +{
> >>> + int core_num = raw_smp_processor_id();
> >>
> >> Why the 'raw' variant? If this code is pre-emptible it means another
> >> process could be scheduled on the same core...
> >
> > My understanding is that raw_smp_processor_id is faster.
> > homa_pool_get_pages is invoked with a spinlock held, so there is no
> > risk of a core switch while it is executing. Is there some other
> > problem I have missed?
>
> raw_* variants, alike __* ones, fall under the 'use at your own risk'
> category.
>
> In this specific case raw_smp_processor_id() is supposed to be used if
> you don't care the process being move on other cores while using the
> 'id' value.
>
> Using raw_smp_processor_id() and building with the CONFIG_DEBUG_PREEMPT
> knob, the generated code will miss run-time check for preemption being
> actually disabled at invocation time. Such check will be added while
> using smp_processor_id(), with no performance cost for non debug build.
I'm pretty confident that the raw variant is safe. However, are you
saying that there is no performance advantage of the raw version in
production builds? If so, then I might as well switch to the non-raw
version.
> >> ____cacheline_aligned instead of inserting the struct into an union
> >> should suffice.
> >
> > Done (but now that alloc_percpu_gfp is being used I'm not sure this is
> > needed to ensure alignment?).
>
> Yep, cacheline alignment should not be needed for percpu data.
OK, I've removed the alignment directive for the percpu data.
-John-
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-27 10:01 ` Paolo Abeni
@ 2025-01-27 18:03 ` John Ousterhout
2025-01-28 8:19 ` Paolo Abeni
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-27 18:03 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, Jan 27, 2025 at 2:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/27/25 6:22 AM, John Ousterhout wrote:
> > On Thu, Jan 23, 2025 at 6:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >> ...
> >> How many RPCs should concurrently exist in a real server? with 1024
> >> buckets there could be a lot of them on each/some list and linear search
> >> could be very expansive. And this happens with BH disabled.
> >
> > Server RPCs tend to be short-lived, so my best guess is that the
> > number of concurrent server RPCs will be relatively small (maybe a few
> > hundred?). But this is just a guess: I won't know for sure until I can
> > measure Homa in production use. If the number of concurrent RPCs turns
> > out to be huge then we'll have to find a different solution.
> >
> >>> +
> >>> + /* Initialize fields that don't require the socket lock. */
> >>> + srpc = kmalloc(sizeof(*srpc), GFP_ATOMIC);
> >>
> >> You could do the allocation outside the bucket lock, too and avoid the
> >> ATOMIC flag.
> >
> > In many cases this function will return an existing RPC so there won't
> > be any need to allocate; I wouldn't want to pay the allocation
> > overhead in that case. I could conceivably check the offset in the
> > packet and pre-allocate if the offset is zero (in this case it's
> > highly unlikely that there will be an existing RPC).
>
> If you use RCU properly here, you could do a lockless lookup. If such
> lookup fail, you could do the allocation still outside the lock and
> avoiding it in most of cases.
I think that might work, but it would suffer from the slow reclamation
problem I mentioned with RCU. It would also create more complexity in
the code (e.g. the allocation might still turn out to be redundant, so
there would need to be additional code to check for that: the lookup
would essentially have to be done twice in the case of creating a new
RPC). I'd rather not incur this complexity until there's evidence that
GFP_ATOMIC is causing problems.
> > Homa needs to handle a very high rate of RPCs, so this would result in
> > too much accumulated memory (in particular, skbs don't get reclaimed
> > until the RPC is reclaimed).
>
> For the RPC struct, that above is a fair point, but why skbs need to be
> freed together with the RCP struct? if you have skbs i.e. sitting in a
> RX queue, you can flush such queue when the RPC goes out of scope,
> without any additional delay.
Reclaiming the skbs inline would be too expensive; this is the reason
for the reaping mechanism. It's conceivable that the reaping could be
done in two stages: reap skb's ASAP, but wait to reap homa_rpc structs
until RCU gives the OK. However, once again, this would add more
complexity: it's simpler to have a single reaper that handles
everything.
> > The caller must have a lock on the homa_rpc anyway, so RCU wouldn't
> > save the overhead of acquiring a lock. The reason for putting the lock
> > in the hash table instead of the homa_rpc is that this makes RPC
> > creation/deletion atomic with respect to lookups. The lock was
> > initially in the homa_rpc, but that led to complex races with hash
> > table insertion/deletion. This is explained in sync.txt, but of course
> > you don't have that (yet).
>
> The per bucket RPC lock is prone to contention, a per RPC lock will
> avoid such problem.
There are a lot of buckets (1024); this was done intentionally to
reduce the likelihood of contention between different RPCs trying to
acquire the same bucket lock. I was concerned about the potential for
contention, but I have measured it under heavy (admittedly synthetic)
workloads and found that contention for the bucket locks is not a
significant problem.
Note that the bucket locks would be needed even with RCU usage, in
order to permit concurrent RPC creation in different buckets. Thus
Homa's locking scheme doesn't introduce additional locks; it
eliminates locks that would otherwise be needed on individual RPCs and
uses the bucket locks for 2 purposes.
> > This approach is unusual, but it has worked out really well. Before
> > implementing this approach I had what seemed like a never-ending
> > stream of synchronization problems over the socket hash tables; each
> > "fix" introduced new problems. Once I implemented this, all the
> > problems went away and the code has been very stable ever since
> > (several years now).
>
> Have you tried running a fuzzer on this code? I bet syzkaller will give
> a lot of interesting results, if you teach it about the homa APIs.
I haven't done that yet; I'll put it on my "to do" list. I do have
synthetic workloads for Homa that are randomly driven, and so far they
seem to have been pretty effective at finding races.
-John-
* Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-27 17:34 ` John Ousterhout
@ 2025-01-27 18:28 ` Paolo Abeni
2025-01-27 19:12 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-27 18:28 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 1/27/25 6:34 PM, John Ousterhout wrote:
> On Mon, Jan 27, 2025 at 1:41 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> raw_* variants, alike __* ones, fall under the 'use at your own risk'
>> category.
>>
>> In this specific case raw_smp_processor_id() is supposed to be used if
>> you don't care the process being move on other cores while using the
>> 'id' value.
>>
>> Using raw_smp_processor_id() and building with the CONFIG_DEBUG_PREEMPT
>> knob, the generated code will miss run-time check for preemption being
>> actually disabled at invocation time. Such check will be added while
>> using smp_processor_id(), with no performance cost for non debug build.
>
> I'm pretty confident that the raw variant is safe. However, are you
> saying that there is no performance advantage of the raw version in
> production builds?
Yes.
> If so, then I might as well switch to the non-raw version.
Please do. In fact, using the raw variant when not needed brings only
shortcomings.
/P
* Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-27 18:28 ` Paolo Abeni
@ 2025-01-27 19:12 ` John Ousterhout
2025-01-28 8:27 ` Paolo Abeni
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-27 19:12 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, Jan 27, 2025 at 10:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/27/25 6:34 PM, John Ousterhout wrote:
> > On Mon, Jan 27, 2025 at 1:41 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >> raw_* variants, alike __* ones, fall under the 'use at your own risk'
> >> category.
> >>
> >> In this specific case raw_smp_processor_id() is supposed to be used if
> >> you don't care the process being move on other cores while using the
> >> 'id' value.
> >>
> >> Using raw_smp_processor_id() and building with the CONFIG_DEBUG_PREEMPT
> >> knob, the generated code will miss run-time check for preemption being
> >> actually disabled at invocation time. Such check will be added while
> >> using smp_processor_id(), with no performance cost for non debug build.
> >
> > I'm pretty confident that the raw variant is safe. However, are you
> > saying that there is no performance advantage of the raw version in
> > production builds?
>
> Yes.
>
> > If so, then I might as well switch to the non-raw version.
>
> Please do. In fact using the raw variant when not needed will bring only
> shortcoming.
Will do. Just for my information, when is the raw variant "needed"?
-John-
* Re: [PATCH net-next v6 06/12] net: homa: create homa_peer.h and homa_peer.c
2025-01-23 17:45 ` Paolo Abeni
@ 2025-01-28 0:06 ` John Ousterhout
2025-01-28 0:32 ` Jason Xing
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-28 0:06 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Thu, Jan 23, 2025 at 9:45 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/15/25 7:59 PM, John Ousterhout wrote:
> > +/**
> > + * homa_peertab_get_peers() - Return information about all of the peers
> > + * currently known
> > + * @peertab: The table to search for peers.
> > + * @num_peers: Modified to hold the number of peers returned.
> > + * Return: kmalloced array holding pointers to all known peers. The
> > + * caller must free this. If there is an error, or if there
> > + * are no peers, NULL is returned.
> > + */
> > +struct homa_peer **homa_peertab_get_peers(struct homa_peertab *peertab,
> > + int *num_peers)
>
> Looks like this function is unused in the current series. Please don't
> introduce unused code.
Sorry about that. This patch series only contains about half of Homa's
full functionality. I didn't notice that this function had become
orphaned during the trimming process. I've removed it now (but not
before fixing the issues below).
> > +{
> > + struct homa_peer **result;
> > + struct hlist_node *next;
> > + struct homa_peer *peer;
> > + int i, count;
> > +
> > + *num_peers = 0;
> > + if (!peertab->buckets)
> > + return NULL;
> > +
> > + /* Figure out how many peers there are. */
> > + count = 0;
> > + for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
> > + hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
> > + peertab_links)
>
> No lock is acquired, so other processes could concurrently modify the list;
> hlist_for_each_entry_safe() is not the correct helper to use. You should
> probably use hlist_for_each_entry_rcu(), adding RCU protection. Assuming
> the thing is actually under an RCU scheme, which is not entirely clear.
Looks like I misunderstood what "safe" means when I wrote this code.
As I understand it now, hlist_for_each_entry_safe is only "safe"
against deletion of the current entry by the thread that is iterating:
it is not safe against insertions or deletions by other threads, or
even deleting elements other than the current one. Is that correct?
I have switched to use hlist_for_each_entry_rcu instead, but this
raises questions. If I use hlist_for_each_entry_rcu, will I need to
use rcu_read_lock/unlock also in order to avoid complaints from the
RCU validator? Technically, I don't think rcu_read_lock and unlock
are necessary, because this code only needs protection against
concurrent modifications to the list structure, and I think that the
rcu iterators provide that. If I understand correctly, rcu_read_lock
and unlock are only needed to prevent an object from being deleted
while it is being used, but that can't happen here because peers are
not deleted. For now I have added calls to rcu_read_lock and unlock;
is there a way to annotate this usage so that I can skip the calls to
rcu_read_lock/unlock without complaints from the RCU validator?
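Concretely, the counting loop now looks roughly like this (a sketch using
the identifiers from the quoted code):
        rcu_read_lock();
        count = 0;
        for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
                hlist_for_each_entry_rcu(peer, &peertab->buckets[i],
                                         peertab_links)
                        count++;
        }
        rcu_read_unlock();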
> > +/**
> > + * homa_peertab_gc_dsts() - Invoked to free unused dst_entries, if it is
> > + * safe to do so.
> > + * @peertab: The table in which to free entries.
> > + * @now: Current time, in sched_clock() units; entries with expiration
> > + * dates no later than this will be freed. Specify ~0 to
> > + * free all entries.
> > + */
> > +void homa_peertab_gc_dsts(struct homa_peertab *peertab, __u64 now)
> > +{
>
> Apparently this is called under (and needs) the peertab lock; an annotation
> or a comment would be helpful.
I have now added a __must_hold(&peer_tab->write_lock) annotation to
this function.
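i.e. (a sketch showing just the annotation):
void homa_peertab_gc_dsts(struct homa_peertab *peertab, __u64 now)
        __must_hold(&peertab->write_lock)
{
        /* ... body unchanged ... */
}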
> > + hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
> > + peertab_links) {
> > + if (ipv6_addr_equal(&peer->addr, addr))
>
> The caller does not acquire the RCU read lock, so this looks buggy.
I have added rcu_read_lock/unlock calls, but I don't think they are
technically necessary, for the same reason discussed above.
> AFAICS UaF is not possible because peers are removed only by
> homa_peertab_destroy(), at unload time. That in turn looks
> dangerous/wrong. What about memory utilization for peers over time?
> Apparently the bucket lists could grow in an unlimited way.
Correct: peers are only freed at unload time. I have deferred trying
to reclaim peer data earlier because it's unclear to me that that is
either necessary or good. Homa is intended for use only within a
particular datacenter so the number of peers is limited to the number
of hosts in the datacenter (100K?). The amount of information for each
peer is relatively small (about 300 bytes) so even in the worst case I
don't think it would be completely intolerable to have them all loaded
in memory at once. I would expect the actual number to be less than
that, due to locality of host-host access patterns. Anyhow, if
possible I'd prefer to defer implementing peer data reclamation until
there are measurements to indicate that it is necessary. BTW, the
bucket array currently has 64K entries, so the bucket lists shouldn't
become very long even with 100K peers.
> [...]
> > +/**
> > + * homa_peer_lock_slow() - This function implements the slow path for
> > + * acquiring a peer's @unacked_lock. It is invoked when the lock isn't
> > + * immediately available. It waits for the lock, but also records statistics
> > + * about the waiting time.
> > + * @peer: Peer to lock.
> > + */
> > +void homa_peer_lock_slow(struct homa_peer *peer)
> > + __acquires(&peer->ack_lock)
> > +{
> > + spin_lock_bh(&peer->ack_lock);
>
> Is this just a placeholder for future changes?!? I don't see any stats
> update here, and currently homa_peer_lock() is really:
>
> if (!spin_trylock_bh(&peer->ack_lock))
> spin_lock_bh(&peer->ack_lock);
>
> which does not make much sense to me. Either document this is going to
> change very soon (possibly even how and why) or use a plain spin_lock_bh()
The "full" Homa uses the "lock_slow" functions to report statistics on
lock conflicts; all of Homa's metrics were removed for this patch
series, leaving the "lock_slow" functions as hollow shells. You aren't
the first reviewer to have been confused by this, so I will remove the
"lock_slow" functions for now.
> > +struct homa_peertab {
> > + /**
> > + * @write_lock: Synchronizes addition of new entries; not needed
> > + * for lookups (RCU is used instead).
> > + */
> > + spinlock_t write_lock;
>
> This lock looks potentially heavily contended on add; why don't you use a
> per-bucket lock?
Peers aren't added very often so I don't expect contention for this
lock; if it turns out to be contended then I'll switch to per-bucket
locks.
> > + /**
> > + * @grantable_rpcs: Contains all homa_rpcs (both requests and
> > + * responses) involving this peer whose msgins require (or required
> > + * them in the past) and have not been fully received. The list is
> > + * sorted in priority order (head has fewest bytes_remaining).
> > + * Locked with homa->grantable_lock.
> > + */
> > + struct list_head grantable_rpcs;
>
> Apparently not used in this patch series. More fields below with similar
> problems. Please introduce such fields in the same series that will
> actually use them.
Oops, fixed. Several of the following fields are not even used in the
full Homa.... somehow they didn't get deleted once all of the usages
were removed.
-John-
* Re: [PATCH net-next v6 06/12] net: homa: create homa_peer.h and homa_peer.c
2025-01-28 0:06 ` John Ousterhout
@ 2025-01-28 0:32 ` Jason Xing
0 siblings, 0 replies; 68+ messages in thread
From: Jason Xing @ 2025-01-28 0:32 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
On Tue, Jan 28, 2025 at 8:16 AM John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> On Thu, Jan 23, 2025 at 9:45 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 1/15/25 7:59 PM, John Ousterhout wrote:
> > > +/**
> > > + * homa_peertab_get_peers() - Return information about all of the peers
> > > + * currently known
> > > + * @peertab: The table to search for peers.
> > > + * @num_peers: Modified to hold the number of peers returned.
> > > + * Return: kmalloced array holding pointers to all known peers. The
> > > + * caller must free this. If there is an error, or if there
> > > + * are no peers, NULL is returned.
> > > + */
> > > +struct homa_peer **homa_peertab_get_peers(struct homa_peertab *peertab,
> > > + int *num_peers)
> >
> > Look like this function is unsed in the current series. Please don't
> > introduce unused code.
>
> Sorry about that. This patch series only contains about half of Homa's
> full functionality. I didn't notice that this function had become
> orphaned during the trimming process. I've removed it now (but not
> before fixing the issues below).
>
> > > +{
> > > + struct homa_peer **result;
> > > + struct hlist_node *next;
> > > + struct homa_peer *peer;
> > > + int i, count;
> > > +
> > > + *num_peers = 0;
> > > + if (!peertab->buckets)
> > > + return NULL;
> > > +
> > > + /* Figure out how many peers there are. */
> > > + count = 0;
> > > + for (i = 0; i < HOMA_PEERTAB_BUCKETS; i++) {
> > > + hlist_for_each_entry_safe(peer, next, &peertab->buckets[i],
> > > + peertab_links)
> >
> > No lock acquired, so others process could concurrently modify the list;
> > hlist_for_each_entry_safe() is not the correct helper to use. You should
> > probably use hlist_for_each_entry_rcu(), adding rcu protection. Assuming
> > the thing is actually under an RCU schema, which is not entirely clear.
>
> Looks like I misunderstood what "safe" means when I wrote this code.
> As I understand it now, hlist_for_each_entry_safe is only "safe"
> against deletion of the current entry by the thread that is iterating:
> it is not safe against insertions or deletions by other threads, or
> even deleting elements other than the current one. Is that correct?
I'm not Paolo. From what I know, your understanding is correct.
The RCU mechanism guarantees that the deletion process is safe.
>
> I have switched to use hlist_for_each_entry_rcu instead, but this
> raises questions. If I use hlist_for_each_entry_rcu, will I need to
> use rcu_read_lock/unlock also in order to avoid complaints from the
rcu_read_lock/unlock() should be used correspondingly. Without them, the
deletion would not be safe.
> RCU validator? Technically, I don't think rcu_read_lock and unlock
> are necessary, because this code only needs protection against
> concurrent modifications to the list structure, and I think that the
> rcu iterators provide that. If I understand correctly, rcu_read_lock
RCU would not be helpful if multiple threads are trying to write to
the same buckets at the same time. spin_lock() is a common approach that I
would recommend. Please see __inet_lookup_established() as a good
example.
Thanks,
Jason
> and unlock are only needed to prevent an object from being deleted
> while it is being used, but that can't happen here because peers are
> not deleted. For now I have added calls to rcu_read_lock and unlock;
> is there a way to annotate this usage so that I can skip the calls to
> rcu_read_lock/unlock without complaints from the RCU validator?
>
> > > +/**
> > > + * homa_peertab_gc_dsts() - Invoked to free unused dst_entries, if it is
> > > + * safe to do so.
> > > + * @peertab: The table in which to free entries.
> > > + * @now: Current time, in sched_clock() units; entries with expiration
> > > + * dates no later than this will be freed. Specify ~0 to
> > > + * free all entries.
> > > + */
> > > +void homa_peertab_gc_dsts(struct homa_peertab *peertab, __u64 now)
> > > +{
> >
> > Apparently this is called under (and need) peertab lock, an annotation
> > or a comment would be helpful.
>
> I have now added a __must_hold(&peer_tab->write_lock) annotation to
> this function.
>
> > > + hlist_for_each_entry_rcu(peer, &peertab->buckets[bucket],
> > > + peertab_links) {
> > > + if (ipv6_addr_equal(&peer->addr, addr))
> >
> > The caller does not acquire the RCU read lock, so this looks buggy.
>
> I have added rcu_read_lock/unlock calls, but I don't think they are
> technically necessary, for the same reason discussed above.
>
> > AFAICS UaF is not possible because peers are removed only by
> > homa_peertab_destroy(), at unload time. That in turn looks
> > dangerous/wrong. What about memory utilization for peers over time?
> > apparently bucket list could grow in an unlimited way.
>
> Correct: peers are only freed at unload time. I have deferred trying
> to reclaim peer data earlier because it's unclear to me that that is
> either necessary or good. Homa is intended for use only within a
> particular datacenter so the number of peers is limited to the number
> of hosts in the datacenter (100K?). The amount of information for each
> peer is relatively small (about 300 bytes) so even in the worst case I
> don't think it would be completely intolerable to have them all loaded
> in memory at once. I would expect the actual number to be less than
> that, due to locality of host-host access patterns. Anyhow, if
> possible I'd prefer to defer the implementation of peer data until
> there are measurements to indicate that it is necessary. BTW, the
> bucket array currently has 64K entries, so the bucket lists shouldn't
> become very long even with 100K peers.
>
> > [...]
> > > +/**
> > > + * homa_peer_lock_slow() - This function implements the slow path for
> > > + * acquiring a peer's @unacked_lock. It is invoked when the lock isn't
> > > + * immediately available. It waits for the lock, but also records statistics
> > > + * about the waiting time.
> > > + * @peer: Peer to lock.
> > > + */
> > > +void homa_peer_lock_slow(struct homa_peer *peer)
> > > + __acquires(&peer->ack_lock)
> > > +{
> > > + spin_lock_bh(&peer->ack_lock);
> >
> > Is this just a placeholder for future changes?!? I don't see any stats
> > update here, and currently homa_peer_lock() is really:
> >
> > if (!spin_trylock_bh(&peer->ack_lock))
> > spin_lock_bh(&peer->ack_lock);
> >
> > which does not make much sense to me. Either document this is going to
> > change very soon (possibly even how and why) or use a plain spin_lock_bh()
>
> The "full" Homa uses the "lock_slow" functions to report statistics on
> lock conflicts all of Homa's metrics were removed for this patch
> series, leaving the "lock_slow" functions as hollow shells. You aren't
> the first reviewer to have been confused by this, so I will remove the
> "lock_slow" functions for now.
>
> > > +struct homa_peertab {
> > > + /**
> > > + * @write_lock: Synchronizes addition of new entries; not needed
> > > + * for lookups (RCU is used instead).
> > > + */
> > > + spinlock_t write_lock;
> >
> > This look looks potentially heavily contended on add, why don't you use a
> > per bucket lock?
>
> Peers aren't added very often so I don't expect contention for this
> lock; if it turns out to be contended then I'll switch to per-bucket
> locks.
>
> > > + /**
> > > + * @grantable_rpcs: Contains all homa_rpcs (both requests and
> > > + * responses) involving this peer whose msgins require (or required
> > > + * them in the past) and have not been fully received. The list is
> > > + * sorted in priority order (head has fewest bytes_remaining).
> > > + * Locked with homa->grantable_lock.
> > > + */
> > > + struct list_head grantable_rpcs;
> >
> > Apparently not used in this patch series. More field below with similar
> > problem. Please introduce such fields in the same series that will
> > actualy use them.
>
> Oops, fixed. Several of the following fields are not even used in the
> full Homa.... somehow they didn't get deleted once all of the usages
> were removed.
>
> -John-
>
* Re: [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c
2025-01-23 19:01 ` Paolo Abeni
@ 2025-01-28 0:40 ` John Ousterhout
2025-01-28 4:26 ` John Ousterhout
2025-01-28 15:10 ` Eric Dumazet
0 siblings, 2 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-28 0:40 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Thu, Jan 23, 2025 at 11:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > +struct homa_sock *homa_socktab_next(struct homa_socktab_scan *scan)
> > +{
> > + struct homa_socktab_links *links;
> > + struct homa_sock *hsk;
> > +
> > + while (1) {
> > + while (!scan->next) {
> > + struct hlist_head *bucket;
> > +
> > + scan->current_bucket++;
> > + if (scan->current_bucket >= HOMA_SOCKTAB_BUCKETS)
> > + return NULL;
> > + bucket = &scan->socktab->buckets[scan->current_bucket];
> > + scan->next = (struct homa_socktab_links *)
> > + rcu_dereference(hlist_first_rcu(bucket));
>
> The only caller for this function so far is not under RCU lock: you
> should see a splat here if you build and run this code with:
>
> CONFIG_LOCKDEP=y
>
> (which in turn is highly encouraged)
Strange... I have had CONFIG_LOCKDEP enabled for a while now, but for
some reason I didn't see a flag for that. In any case, all of the
callers to homa_socktab_next now hold the RCU lock (I fixed this
during my scan of RCU usage in response to one of your earlier
messages for this patch series).
> > + }
> > + links = scan->next;
> > + hsk = links->sock;
> > + scan->next = (struct homa_socktab_links *)
> > + rcu_dereference(hlist_next_rcu(&links->hash_links));
>
> homa_socktab_links is embedded into the homa sock; if the RCU protection
> is released and re-acquired after a homa_socktab_next() call, there is
> no guarantee links/hsk are still around and the above statement could
> cause UaF.
There is code in homa_sock_unlink to deal with this: it makes a pass
over all of the active scans, and if the "next" field in any
homa_socktab_scan refers to the socket being deleted, it updates the
"next" field to refer to the next socket after the one being deleted.
Thus the "next" fields are always valid, even in the face of socket
deletion.
> This homa_socktab thing looks quite complex. A simpler implementation
> could use a simple RCU list _and_ acquire a reference to the hsk before
> releasing the RCU lock.
I agree that this is complicated. But I can't see a simpler solution.
The problem is that we need to iterate through all of the sockets and
release the RCU lock at some points during the iteration. The problem
isn't preserving the current hsk; it's preserving the validity of the
pointer to the next one also. I don't fully understand what you're
proposing above; if you can make it a bit more precise I'll see if it
solves all the problems I'm aware of and does it in a simpler way.
> > +int homa_sock_init(struct homa_sock *hsk, struct homa *homa)
> > +{
> > + struct homa_socktab *socktab = homa->port_map;
> > + int starting_port;
> > + int result = 0;
> > + int i;
> > +
> > + spin_lock_bh(&socktab->write_lock);
>
> A single contended lock for the whole homa sock table? Why don't you use
> per bucket locks?
Creating a socket is a very rare operation: it happens roughly once in
the lifetime of each application. Thus per bucket locks aren't
necessary. Homa is very different from TCP in this regard.
> [...]
> > +struct homa_rpc_bucket {
> > + /**
> > + * @lock: serves as a lock both for this bucket (e.g., when
> > + * adding and removing RPCs) and also for all of the RPCs in
> > + * the bucket. Must be held whenever manipulating an RPC in
> > + * this bucket. This dual purpose permits clean and safe
> > + * deletion and garbage collection of RPCs.
> > + */
> > + spinlock_t lock;
> > +
> > + /** @rpcs: list of RPCs that hash to this bucket. */
> > + struct hlist_head rpcs;
> > +
> > + /**
> > + * @id: identifier for this bucket, used in error messages etc.
> > + * It's the index of the bucket within its hash table bucket
> > + * array, with an additional offset to separate server and
> > + * client RPCs.
> > + */
> > + int id;
>
> On 64 bit arches this struct will have 2 4-bytes holes. If you reorder
> the field:
> spinlock_t lock;
> int id;
> struct hlist_head rpcs;
>
> the struct size will decrease by 8 bytes.
Done. I wasn't aware that spinlock_t is so tiny.
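For reference, the reordered layout (kerneldoc comments elided; the 4-byte
spinlock_t size assumes a kernel without lock debugging):

struct homa_rpc_bucket {
        spinlock_t lock;        /* 4 bytes without lock debugging */
        int id;                 /* 4 bytes: fills the former hole */
        struct hlist_head rpcs; /* 8 bytes, pointer-aligned */
};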
> > +struct homa_sock {
> > + /* Info for other network layers. Note: IPv6 info (struct ipv6_pinfo
> > + * comes at the very end of the struct, *after* Homa's data, if this
> > + * socket uses IPv6).
> > + */
> > + union {
> > + /** @sock: generic socket data; must be the first field. */
> > + struct sock sock;
> > +
> > + /**
> > + * @inet: generic Internet socket data; must also be the
> > + * first field (contains sock as its first member).
> > + */
> > + struct inet_sock inet;
> > + };
>
> Why adding this union? Just
> struct inet_sock inet;
> would do.
It's not technically necessary, but it allows code to refer to the
struct sock as hsk->sock rather than hsk->inet.sk (saves me having to
visit struct inet to remember the field name for the struct sock). Not
a big deal, I'll admit...
-John-
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c
2025-01-28 0:40 ` John Ousterhout
@ 2025-01-28 4:26 ` John Ousterhout
2025-01-28 15:10 ` Eric Dumazet
1 sibling, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-28 4:26 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, Jan 27, 2025 at 4:40 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> > This homa_socktab thing looks quite complex. A simpler implementation
> > could use a simple RCU list _and_ acquire a reference to the hsk before
> > releasing the RCU lock.
>
> I agree that this is complicated. But I can't see a simpler solution.
> The problem is that we need to iterate through all of the sockets and
> release the RCU lock at some points during the iteration. The problem
> isn't preserving the current hsk; it's preserving the validity of the
> pointer to the next one also. I don't fully understand what you're
> proposing above; if you can make it a bit more precise I'll see if it
> solves all the problems I'm aware of and does it in a simpler way.
Responding to my own email: about 15 minutes after sending this email
I realized what you were getting at with your suggestion. I agree
that's a much better approach than the one currently implemented, so
I'm going to switch to that. Among other things, I think it may allow
all of the RCU stuff to be encapsulated entirely within the socket
iteration mechanism (no need for callers to invoke rcu_read_lock).
-John-
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-27 18:03 ` John Ousterhout
@ 2025-01-28 8:19 ` Paolo Abeni
2025-01-29 1:23 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-01-28 8:19 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 1/27/25 7:03 PM, John Ousterhout wrote:
> On Mon, Jan 27, 2025 at 2:02 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 1/27/25 6:22 AM, John Ousterhout wrote:
>>> On Thu, Jan 23, 2025 at 6:30 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>> ...
>>>> How many RPCs should concurrently exist in a real server? with 1024
>>>> buckets there could be a lot of them on each/some list and linear search
>>>> could be very expensive. And this happens with BH disabled.
>>>
>>> Server RPCs tend to be short-lived, so my best guess is that the
>>> number of concurrent server RPCs will be relatively small (maybe a few
>>> hundred?). But this is just a guess: I won't know for sure until I can
>>> measure Homa in production use. If the number of concurrent RPCs turns
>>> out to be huge then we'll have to find a different solution.
>>>
>>>>> +
>>>>> + /* Initialize fields that don't require the socket lock. */
>>>>> + srpc = kmalloc(sizeof(*srpc), GFP_ATOMIC);
>>>>
>>>> You could do the allocation outside the bucket lock, too and avoid the
>>>> ATOMIC flag.
>>>
>>> In many cases this function will return an existing RPC so there won't
>>> be any need to allocate; I wouldn't want to pay the allocation
>>> overhead in that case. I could conceivably check the offset in the
>>> packet and pre-allocate if the offset is zero (in this case it's
>>> highly unlikely that there will be an existing RPC).
>>
>> If you use RCU properly here, you could do a lockless lookup. If such
>> lookup fail, you could do the allocation still outside the lock and
>> avoiding it in most of cases.
>
> I think that might work, but it would suffer from the slow reclamation
> problem I mentioned with RCU. It would also create more complexity in
> the code (e.g. the allocation might still turn out to be redundant, so
> there would need to be additional code to check for that: the lookup
> would essentially have to be done twice in the case of creating a new
> RPC). I'd rather not incur this complexity until there's evidence that
> GFP_ATOMIC is causing problems.
Have a look at tcp established socket lookup and the
SLAB_TYPESAFE_BY_RCU flag usage for slab-based allocations. A combo of
such flag for RPC allocation (using a dedicated kmem_cache) and RCU
lookup should consistently improve performance, with a consolidated
code layout and no unmanageable problems with large numbers of objects
waiting for the grace period.
>>> Homa needs to handle a very high rate of RPCs, so this would result in
>>> too much accumulated memory (in particular, skbs don't get reclaimed
>>> until the RPC is reclaimed).
>>
>> For the RPC struct, that above is a fair point, but why do skbs need to be
>> freed together with the RPC struct? If you have skbs, e.g. sitting in an
>> RX queue, you can flush such queue when the RPC goes out of scope,
>> without any additional delay.
>
> Reclaiming the skbs inline would be too expensive;
Why? For other protocols the main skb free cost is due to memory
accounting, which homa is currently not implementing, so I don't see why
it should be critically expensive at this point (note that homa should
perform at least rmem/wmem accounting, but let's put this aside for a
moment). Could you please elaborate on this topic?
>>> The caller must have a lock on the homa_rpc anyway, so RCU wouldn't
>>> save the overhead of acquiring a lock. The reason for putting the lock
>>> in the hash table instead of the homa_rpc is that this makes RPC
>>> creation/deletion atomic with respect to lookups. The lock was
>>> initially in the homa_rpc, but that led to complex races with hash
>>> table insertion/deletion. This is explained in sync.txt, but of course
>>> you don't have that (yet).
>>
>> The per bucket RPC lock is prone to contention, a per RPC lock will
>> avoid such problem.
>
> There are a lot of buckets (1024); this was done intentionally to
> reduce the likelihood of contention between different RPCs trying to
> acquire the same bucket lock.
1024 does not look too big by internet standards, but I must admit the
usage pattern is not 110% clear to me.
[...]
> Note that the bucket locks would be needed even with RCU usage, in
> order to permit concurrent RPC creation in different buckets. Thus
> Homa's locking scheme doesn't introduce additional locks; it
> eliminates locks that would otherwise be needed on individual RPCs and
> uses the bucket locks for 2 purposes.
It depends on the relative frequency of RPC lookup vs RPC
insertion/deletion. i.e. for TCP connections the lookup frequency is
expected to be significantly higher than the socket creation and
destruction.
I understand the expected pattern is quite different with homa RPCs? If so
you should at least consider a dedicated kmem_cache for such structs.
/P
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c
2025-01-27 19:12 ` John Ousterhout
@ 2025-01-28 8:27 ` Paolo Abeni
0 siblings, 0 replies; 68+ messages in thread
From: Paolo Abeni @ 2025-01-28 8:27 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 1/27/25 8:12 PM, John Ousterhout wrote:
> On Mon, Jan 27, 2025 at 10:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> Please do. In fact using the raw variant when not needed will bring only
>> shortcoming.
>
> Will do. Just for my information, when is the raw variant "needed"?
Ah, I just noticed the related documentation has a typo (in the name), so
you probably missed it:
* raw_processor_id() - get the current (unstable) CPU id
*
* For then you know what you are doing and need an unstable
* CPU id.
'unstable' means it can actually change under the hood, so the caller is
looking just for 'a hint' of the current processor id. A sample
use-case is for socket lookup scoring, e.g. as used by the TCP and UDP
protocols.
/P
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c
2025-01-28 0:40 ` John Ousterhout
2025-01-28 4:26 ` John Ousterhout
@ 2025-01-28 15:10 ` Eric Dumazet
2025-01-28 17:04 ` John Ousterhout
1 sibling, 1 reply; 68+ messages in thread
From: Eric Dumazet @ 2025-01-28 15:10 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, netdev, horms, kuba
On Tue, Jan 28, 2025 at 1:41 AM John Ousterhout <ouster@cs.stanford.edu> wrote:
> > The only caller for this function so far is not under RCU lock: you
> > should see a splat here if you build and run this code with:
> >
> > CONFIG_LOCKDEP=y
> >
> > (which in turn is highly encouraged)
>
> Strange... I have had CONFIG_LOCKDEP enabled for a while now, but for
> some reason I didn't see a flag for that. In any case, all of the
> callers to homa_socktab_next now hold the RCU lock (I fixed this
> during my scan of RCU usage in response to one of your earlier
> messages for this patch series).
The proper config name is CONFIG_PROVE_LOCKING
CONFIG_PROVE_LOCKING=y
While at it, also add
CONFIG_PROVE_RCU_LIST=y
CONFIG_RCU_EXPERT=y
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c
2025-01-28 15:10 ` Eric Dumazet
@ 2025-01-28 17:04 ` John Ousterhout
0 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-28 17:04 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Paolo Abeni, netdev, horms, kuba
On Tue, Jan 28, 2025 at 7:10 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Jan 28, 2025 at 1:41 AM John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> > > The only caller for this function so far is not under RCU lock: you
> > > should see a splat here if you build and run this code with:
> > >
> > > CONFIG_LOCKDEP=y
> > >
> > > (which in turn is highly encouraged)
> >
> > Strange... I have had CONFIG_LOCKDEP enabled for a while now, but for
> > some reason I didn't see a flag for that. In any case, all of the
> > callers to homa_socktab_next now hold the RCU lock (I fixed this
> > during my scan of RCU usage in response to one of your earlier
> > messages for this patch series).
>
> The proper config name is CONFIG_PROVE_LOCKING
>
> CONFIG_PROVE_LOCKING=y
>
> While at it, also add
>
> CONFIG_PROVE_RCU_LIST=y
> CONFIG_RCU_EXPERT=y
I had CONFIG_PROVE_LOCKING already, will add the others now.
-John-
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-28 8:19 ` Paolo Abeni
@ 2025-01-29 1:23 ` John Ousterhout
[not found] ` <13345e2a-849d-4bd8-a95e-9cd7f287c7df@redhat.com>
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-29 1:23 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Tue, Jan 28, 2025 at 12:20 AM Paolo Abeni <pabeni@redhat.com> wrote:
> ...
> > I think that might work, but it would suffer from the slow reclamation
> > problem I mentioned with RCU. It would also create more complexity in
> > the code (e.g. the allocation might still turn out to be redundant, so
> > there would need to be additional code to check for that: the lookup
> > would essentially have to be done twice in the case of creating a new
> > RPC). I'd rather not incur this complexity until there's evidence that
> > GFP_ATOMIC is causing problems.
>
> Have a look at tcp established socket lookup and the
> SLAB_TYPESAFE_BY_RCU flag usage for slab-based allocations. A combo of
> such flag for RPC allocation (using a dedicated kmem_cache) and RCU
> lookup should consistently improve performance, with a consolidated
> code layout and no unmanageable problems with large numbers of objects
> waiting for the grace period.
I will check that out.
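For the record, here is roughly what I understand the suggestion to be
(just a sketch; the cache name and validation details are mine, not code
from the patch):

static struct kmem_cache *homa_rpc_cache;

static int homa_rpc_cache_init(void)
{
        /* With SLAB_TYPESAFE_BY_RCU a freed object can be reused
         * immediately (only the slab page is RCU-protected), so a
         * lockless lookup must re-validate the RPC (e.g. its id and
         * state) under the bucket lock after finding it.
         */
        homa_rpc_cache = kmem_cache_create("homa_rpc",
                                           sizeof(struct homa_rpc), 0,
                                           SLAB_TYPESAFE_BY_RCU, NULL);
        return homa_rpc_cache ? 0 : -ENOMEM;
}

Allocation and freeing would then use kmem_cache_alloc(homa_rpc_cache,
GFP_ATOMIC) and kmem_cache_free(homa_rpc_cache, rpc) in place of
kmalloc()/kfree().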
> >>> Homa needs to handle a very high rate of RPCs, so this would result in
> >>> too much accumulated memory (in particular, skbs don't get reclaimed
> >>> until the RPC is reclaimed).
> >>
> >> For the RPC struct, that above is a fair point, but why do skbs need to be
> >> freed together with the RPC struct? If you have skbs, e.g. sitting in an
> >> RX queue, you can flush such queue when the RPC goes out of scope,
> >> without any additional delay.
> >
> > Reclaiming the skbs inline would be too expensive;
>
> Why? For other protocols the main skb free cost is due to memory
> accounting, which homa is currently not implementing, so I don't see why
> it should be critically expensive at this point (note that homa should
> perform at least rmem/wmem accounting, but let's put this aside for a
> moment). Could you please elaborate on this topic?
In my measurements, skb freeing is by far the largest cost in RPC
reaping. I'm not currently in a good position to remeasure this, but
my recollection is that it takes a few hundred ns to free an skb. A
large RPC (1 MByte is Homa's current limit) will have at least 100
skbs (with jumbo frames) and more than 600 skbs with 1500B frames:
that's 20-100 usec. The problem is that this can occur in a place
where it delays the processing of a packet for an unrelated short
message. Under good conditions, Homa can handle a short RPC in 15 usec
(end-to-end round-trip) so delaying a packet by 100 usec is a severe
penalty.
> [...]
> > Note that the bucket locks would be needed even with RCU usage, in
> > order to permit concurrent RPC creation in different buckets. Thus
> > Homa's locking scheme doesn't introduce additional locks; it
> > eliminates locks that would otherwise be needed on individual RPCs and
> > uses the bucket locks for 2 purposes.
>
> It depends on the relative frequency of RPC lookup vs RPC
> insertion/deletion. i.e. for TCP connections the lookup frequency is
> expected to be significantly higher than the socket creation and
> destruction.
>
> I understand the expected pattern is quite different with homa RPCs? If so
> you should at least consider a dedicated kmem_cache for such structs.
Right: Homa creates sockets even less often than TCP, but it creates
new RPCs all the time. For an RPC with short messages (very common)
the client will do one insertion and one lookup; the server will do an
insertion but never a lookup. Thus the relative frequency of lookup
vs. insertion is quite different in Homa from TCP. It might well be
worth looking at a kmem_cache for the RPC structs. I don't yet know
much about kmem_caches, but I'll put it on my "to do" list.
-John-
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
[not found] ` <13345e2a-849d-4bd8-a95e-9cd7f287c7df@redhat.com>
@ 2025-01-29 16:43 ` John Ousterhout
2025-01-29 16:49 ` Eric Dumazet
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-29 16:43 UTC (permalink / raw)
To: Paolo Abeni; +Cc: Netdev, Eric Dumazet, Simon Horman, Jakub Kicinski
On Wed, Jan 29, 2025 at 2:24 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/29/25 2:23 AM, John Ousterhout wrote:
> > In my measurements, skb freeing is by far the largest cost in RPC
> > reaping. I'm not currently in a good position to remeasure this, but
> > my recollection is that it takes a few hundred ns to free an skb. A
> > large RPC (1 MByte is Homa's current limit) will have at least 100
> > skbs (with jumbo frames) and more than 600 skbs with 1500B frames:
> > that's 20-100 usec.
>
> I guess a couple of things could improve skb free performances:
>
> - packet aggregation for homa protocol - either at the GRO stage[*] or
> skb coalescing while enqueuing in `msgin.packets`, see
> skb_try_coalesce()/tcp_try_coalesce().
>
> - deferred skb freeing, see skb_attempt_defer_free() in net/core/skbuff.c.
>
> [*] I see a bunch of parameters for it but no actual code, I guess it's
> planned for later?
GRO is implemented in the "full" Homa (and essential for decent
performance); I left it out of this initial patch series to reduce the
size of the patch. But that doesn't affect the cost of freeing skbs.
GRO aggregates skb's into batches for more efficient processing, but
the same number of skb's ends up being freed in the end.
-John-
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-29 16:43 ` John Ousterhout
@ 2025-01-29 16:49 ` Eric Dumazet
2025-01-29 16:54 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Eric Dumazet @ 2025-01-29 16:49 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, Netdev, Simon Horman, Jakub Kicinski
On Wed, Jan 29, 2025 at 5:44 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> GRO is implemented in the "full" Homa (and essential for decent
> performance); I left it out of this initial patch series to reduce the
> size of the patch. But that doesn't affect the cost of freeing skbs.
> GRO aggregates skb's into batches for more efficient processing, but
> the same number of skb's ends up being freed in the end.
Not at all, unless GRO is forced to use shinfo->frag_list.
GRO fast path cooks a single skb for a large payload, usually adding
as many page fragments as possible.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-29 16:49 ` Eric Dumazet
@ 2025-01-29 16:54 ` John Ousterhout
2025-01-29 17:04 ` Eric Dumazet
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-29 16:54 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Paolo Abeni, Netdev, Simon Horman, Jakub Kicinski
On Wed, Jan 29, 2025 at 8:50 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jan 29, 2025 at 5:44 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> >
> > GRO is implemented in the "full" Homa (and essential for decent
> > performance); I left it out of this initial patch series to reduce the
> > size of the patch. But that doesn't affect the cost of freeing skbs.
> > GRO aggregates skb's into batches for more efficient processing, but
> > the same number of skb's ends up being freed in the end.
>
> Not at all, unless GRO is forced to use shinfo->frag_list.
>
> GRO fast path cooks a single skb for a large payload, usually adding
> as many page fragments as possible.
Are you referring to hardware GRO or software GRO? I was referring to
software GRO, which is what Homa currently implements. With software
GRO there is a stream of skb's coming up from the driver; regardless
of how GRO re-arranges them, each skb eventually has to be freed, no?
-John-
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-29 16:54 ` John Ousterhout
@ 2025-01-29 17:04 ` Eric Dumazet
2025-01-29 20:27 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Eric Dumazet @ 2025-01-29 17:04 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, Netdev, Simon Horman, Jakub Kicinski
On Wed, Jan 29, 2025 at 5:55 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> On Wed, Jan 29, 2025 at 8:50 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Jan 29, 2025 at 5:44 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> > >
> > > GRO is implemented in the "full" Homa (and essential for decent
> > > performance); I left it out of this initial patch series to reduce the
> > > size of the patch. But that doesn't affect the cost of freeing skbs.
> > > GRO aggregates skb's into batches for more efficient processing, but
> > > the same number of skb's ends up being freed in the end.
> >
> > Not at all, unless GRO is forced to use shinfo->frag_list.
> >
> > GRO fast path cooks a single skb for a large payload, usually adding
> > as many page fragments as possible.
>
> Are you referring to hardware GRO or software GRO? I was referring to
> software GRO, which is what Homa currently implements. With software
> GRO there is a stream of skb's coming up from the driver; regardless
> of how GRO re-arranges them, each skb eventually has to be freed, no?
I am referring to software GRO.
We do not allocate/free skbs for each aggregated segment.
napi_get_frags() & napi_reuse_skb() for details.
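(For reference, the driver-side pattern looks roughly like this; a
simplified sketch, not Homa code or any particular driver:)

static void rx_deliver_frag(struct napi_struct *napi, struct page *page,
                            unsigned int offset, unsigned int len)
{
        /* Borrow the skb "shell" cached on the NAPI instance. */
        struct sk_buff *skb = napi_get_frags(napi);

        if (!skb)
                return;         /* drop: no skb shell available */
        skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page, offset,
                        len, PAGE_SIZE);
        /* GRO may merge these frags into an earlier skb and recycle
         * this shell via napi_reuse_skb() instead of freeing it.
         */
        napi_gro_frags(napi);
}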
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-29 17:04 ` Eric Dumazet
@ 2025-01-29 20:27 ` John Ousterhout
2025-01-29 20:40 ` Eric Dumazet
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-29 20:27 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Paolo Abeni, Netdev, Simon Horman, Jakub Kicinski
On Wed, Jan 29, 2025 at 9:04 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jan 29, 2025 at 5:55 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> >
> > On Wed, Jan 29, 2025 at 8:50 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Jan 29, 2025 at 5:44 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > >
> > > > GRO is implemented in the "full" Homa (and essential for decent
> > > > performance); I left it out of this initial patch series to reduce the
> > > > size of the patch. But that doesn't affect the cost of freeing skbs.
> > > > GRO aggregates skb's into batches for more efficient processing, but
> > > > the same number of skb's ends up being freed in the end.
> > >
> > > Not at all, unless GRO is forced to use shinfo->frag_list.
> > >
> > > GRO fast path cooks a single skb for a large payload, usually adding
> > > as many page fragments as possible.
> >
> > Are you referring to hardware GRO or software GRO? I was referring to
> > software GRO, which is what Homa currently implements. With software
> > GRO there is a stream of skb's coming up from the driver; regardless
> > of how GRO re-arranges them, each skb eventually has to be freed, no?
>
> I am referring to software GRO.
> We do not allocate/free skbs for each aggregated segment.
> napi_get_frags() & napi_reuse_skb() for details.
YATIDNK (Yet Another Thing I Did Not Know); thanks for the information.
So it sounds like GRO moves the page frags into another skb and
returns the skb shell to napi for reuse, eliminating an
alloc_skb/kfree_skb pair? Nice.
The skb that receives all of the page frags: does that eventually get
kfree_skb'ed, or is there an optimization for that that I'm also not
aware of?
-John-
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-29 20:27 ` John Ousterhout
@ 2025-01-29 20:40 ` Eric Dumazet
2025-01-29 21:08 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Eric Dumazet @ 2025-01-29 20:40 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, Netdev, Simon Horman, Jakub Kicinski
On Wed, Jan 29, 2025 at 9:27 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
>
> On Wed, Jan 29, 2025 at 9:04 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Jan 29, 2025 at 5:55 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> > >
> > > On Wed, Jan 29, 2025 at 8:50 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Wed, Jan 29, 2025 at 5:44 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > > >
> > > > > GRO is implemented in the "full" Homa (and essential for decent
> > > > > performance); I left it out of this initial patch series to reduce the
> > > > > size of the patch. But that doesn't affect the cost of freeing skbs.
> > > > > GRO aggregates skb's into batches for more efficient processing, but
> > > > > the same number of skb's ends up being freed in the end.
> > > >
> > > > Not at all, unless GRO is forced to use shinfo->frag_list.
> > > >
> > > > GRO fast path cooks a single skb for a large payload, usually adding
> > > > as many page fragments as possible.
> > >
> > > Are you referring to hardware GRO or software GRO? I was referring to
> > > software GRO, which is what Homa currently implements. With software
> > > GRO there is a stream of skb's coming up from the driver; regardless
> > > of how GRO re-arranges them, each skb eventually has to be freed, no?
> >
> > I am referring to software GRO.
> > We do not allocate/free skbs for each aggregated segment.
> > napi_get_frags() & napi_reuse_skb() for details.
>
> YATIDNK (Yet Another Thing I Did Not Know); thanks for the information.
>
> So it sounds like GRO moves the page frags into another skb and
> returns the skb shell to napi for reuse, eliminating an
> alloc_skb/kfree_skb pair? Nice.
>
> The skb that receives all of the page frags: does that eventually get
> kfree_skb'ed, or is there an optimization for that that I'm also not
> aware of?
This fat skb is going to be stored into a socket receive queue,
so that its content can be copied or given to the user application.
TCP then gives back the fat skb to the cpu which allocated the pages,
> so that kfree_skb() is very cheap. Fast NICs have page pools.
tcp_eat_recv_skb()
With BIG TCP, we typically store 180 KB of payload per sk_buff
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c
2025-01-29 20:40 ` Eric Dumazet
@ 2025-01-29 21:08 ` John Ousterhout
0 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-29 21:08 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Paolo Abeni, Netdev, Simon Horman, Jakub Kicinski
Thanks for the additional information; very helpful.
-John-
On Wed, Jan 29, 2025 at 12:41 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jan 29, 2025 at 9:27 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> >
> > On Wed, Jan 29, 2025 at 9:04 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Jan 29, 2025 at 5:55 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > >
> > > > On Wed, Jan 29, 2025 at 8:50 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Wed, Jan 29, 2025 at 5:44 PM John Ousterhout <ouster@cs.stanford.edu> wrote:
> > > > > >
> > > > > > GRO is implemented in the "full" Homa (and essential for decent
> > > > > > performance); I left it out of this initial patch series to reduce the
> > > > > > size of the patch. But that doesn't affect the cost of freeing skbs.
> > > > > > GRO aggregates skb's into batches for more efficient processing, but
> > > > > > the same number of skb's ends up being freed in the end.
> > > > >
> > > > > Not at all, unless GRO is forced to use shinfo->frag_list.
> > > > >
> > > > > GRO fast path cooks a single skb for a large payload, usually adding
> > > > > as many page fragments as possible.
> > > >
> > > > Are you referring to hardware GRO or software GRO? I was referring to
> > > > software GRO, which is what Homa currently implements. With software
> > > > GRO there is a stream of skb's coming up from the driver; regardless
> > > > of how GRO re-arranges them, each skb eventually has to be freed, no?
> > >
> > > I am referring to software GRO.
> > > We do not allocate/free skbs for each aggregated segment.
> > > napi_get_frags() & napi_reuse_skb() for details.
> >
> > YATIDNK (Yet Another Thing I Did Not Know); thanks for the information.
> >
> > So it sounds like GRO moves the page frags into another skb and
> > returns the skb shell to napi for reuse, eliminating an
> > alloc_skb/kfree_skb pair? Nice.
> >
> > The skb that receives all of the page frags: does that eventually get
> > kfree_skb'ed, or is there an optimization for that that I'm also not
> > aware of?
>
> This fat skb is going to be stored into a socket receive queue,
> so that its content can be copied or given to the user application.
>
> TCP then gives back the fat skb to the cpu which allocated the pages,
> so that kfree_skb() is very cheap. Fast NICs have page pools.
>
> tcp_eat_recv_skb()
>
> With BIG TCP, we typically store 180 KB of payload per sk_buff
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-01-24 8:31 ` Paolo Abeni
@ 2025-01-30 0:41 ` John Ousterhout
[not found] ` <991b5ad9-57cf-4e1d-8e01-9d0639fa4e49@redhat.com>
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-30 0:41 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Fri, Jan 24, 2025 at 12:31 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> OoO will cause additional allocation? this feels like DoS prone.
>
> > + }
> > + rpc->msgin.recv_end = end;
> > + goto keep;
> > + }
> > +
> > + /* Must now check to see if the packet fills in part or all of
> > + * an existing gap.
> > + */
> > + list_for_each_entry_safe(gap, dummy, &rpc->msgin.gaps, links) {
>
> Linear search for OoO has proven to be subject to serious dos issue. You
> should instead use a (rb-)tree to handle OoO packets.
I have been assuming that DoS won't be a major issue for Homa because
it's intended for use only in datacenters (if there are antagonistic
parties, they will be isolated from each other by networking
hardware). Is this a bad assumption?
> > +
> > + /* Packet is in the middle of the gap; must split the gap. */
> > + gap2 = homa_gap_new(&gap->links, gap->start, start);
> > + if (!gap2) {
> > + pr_err("Homa couldn't allocate gap for split: insufficient memory\n");
> > + goto discard;
> > + }
> > + gap2->time = gap->time;
> > + gap->start = end;
> > + goto keep;
> > + }
> > +
> > +discard:
> > + kfree_skb(skb);
> > + return;
> > +
> > +keep:
> > + __skb_queue_tail(&rpc->msgin.packets, skb);
>
> Here 'msgin.packets' is apparently under RPC lock protection, but
> elsewhere - in homa_rpc_reap() - the list is apparently protected by
> its own lock.
What are you referring to by "its own lock?" As far as I know there is
no lock specific to msgin.packets. Normally everything in a homa_rpc
is protected by the RPC lock, and that's the case for homa_add_packet
above. By the time homa_rpc_reap sees an RPC
it has been marked dead and removed from all lists, so no-one else
will try to mutate it and there's no need for synchronization over its
internals. The only remaining problem is that there could still be
outstanding references to the RPC, whose owners haven't yet discovered
that it's dead and dropped their references. The protect_count on the
socket is used to detect these situations.
> Also it looks like there is no memory accounting at all, and SO_RCVBUF
> setting are just ignored.
Homa doesn't yet have comprehensive memory accounting, but there is a
limit on buffer space for incoming messages. Instead of SO_RCVBUF,
applications control the amount of receive buffer space by controlling
the size of the buffer pool they provide to Homa with the
SO_HOMA_RCVBUF socket option.
> > +/**
> > + * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
> > + * all related to the same RPC.
> > + * @skb: First packet in the batch, linked through skb->next.
> > + * @homa: Overall information about the Homa transport.
> > + */
> > +void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa)
>
> I see I haven't mentioned the following so far, but you should move the
> struct homa to a pernet subsystem.
Sorry for my ignorance, but I'm not familiar with the concept of "a
pernet subsystem". What's the best way for me to learn more about
this?
> > +{
> > +#define MAX_ACKS 10
> > + const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> > + struct homa_data_hdr *h = (struct homa_data_hdr *)skb->data;
> > + __u64 id = homa_local_id(h->common.sender_id);
> > + int dport = ntohs(h->common.dport);
> > +
> > + /* Used to collect acks from data packets so we can process them
> > + * all at the end (can't process them inline because that may
> > + * require locking conflicting RPCs). If we run out of space just
> > + * ignore the extra acks; they'll be regenerated later through the
> > + * explicit mechanism.
> > + */
> > + struct homa_ack acks[MAX_ACKS];
> > + struct homa_rpc *rpc = NULL;
> > + struct homa_sock *hsk;
> > + struct sk_buff *next;
> > + int num_acks = 0;
> > +
> > + /* Find the appropriate socket.*/
> > + hsk = homa_sock_find(homa->port_map, dport);
>
> This needs RCU protection
Yep. I have reworked homa_sock_find so that it uses RCU protection
internally and then takes a reference on the socket before returning.
Callers have to eventually release the reference, but they shouldn't
need to deal with RCU anymore.
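The shape of the new lookup is roughly as follows (a sketch; the hash
computation and the port field name are approximations, not the exact
code):

struct homa_sock *homa_sock_find(struct homa_socktab *socktab, int port)
{
        struct homa_socktab_links *links;
        struct homa_sock *result = NULL;
        struct hlist_head *bucket;

        rcu_read_lock();
        bucket = &socktab->buckets[port & (HOMA_SOCKTAB_BUCKETS - 1)];
        hlist_for_each_entry_rcu(links, bucket, hash_links) {
                if (links->sock->port == port) {
                        result = links->sock;
                        /* Pin the socket before leaving the RCU read-side
                         * section; the caller must release this reference.
                         */
                        sock_hold(&result->sock);
                        break;
                }
        }
        rcu_read_unlock();
        return result;
}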
> > +
> > + /* Find and lock the RPC if we haven't already done so. */
> > + if (!rpc) {
> > + if (!homa_is_client(id)) {
> > + /* We are the server for this RPC. */
> > + if (h->common.type == DATA) {
> > + int created;
> > +
> > + /* Create a new RPC if one doesn't
> > + * already exist.
> > + */
> > + rpc = homa_rpc_new_server(hsk, &saddr,
> > + h, &created);
>
> It looks like a buggy or malicious client could force server RPC
> allocation to any _client_ ?!?
I'm not sure what you mean by "force server RPC allocation to any
_client_"; can you give a bit more detail?
> > + if (IS_ERR(rpc)) {
> > + pr_warn("homa_pkt_dispatch couldn't create server rpc: error %lu",
> > + -PTR_ERR(rpc));
> > + rpc = NULL;
> > + goto discard;
> > + }
> > + } else {
> > + rpc = homa_find_server_rpc(hsk, &saddr,
> > + id);
> > + }
> > + } else {
> > + rpc = homa_find_client_rpc(hsk, id);
>
> Both the client and the server lookup require a contended lock; The
> lookup could/should be lockless, and the the lock could/should be
> asserted only on the relevant RPC.
I think we've discussed this issue in response to earlier comments:
the only lock acquired during lookup is the hash table bucket lock for
the desired RPC, and the hash table has enough buckets to avoid
serious contention.
>
> > + case UNKNOWN:
> > + homa_unknown_pkt(skb, rpc);
>
> It's sort of unexpected that the protocol explicitly defines the unknown
> packet type, and handles it differently from undefined types.
Maybe the name UNKNOWN is causing confusion? An UNKNOWN packet is sent
when an endpoint receives a RESEND packet for an RPC that is unknown
to it. The term UNKNOWN refers to an unknown RPC, as opposed to an
unrecognized packet type.
> > + break;
> > + case BUSY:
> > + /* Nothing to do for these packets except reset
> > + * silent_ticks, which happened above.
> > + */
> > + goto discard;
> > + case NEED_ACK:
> > + homa_need_ack_pkt(skb, hsk, rpc);
> > + break;
> > + case ACK:
> > + homa_ack_pkt(skb, hsk, rpc);
> > + rpc = NULL;
> > +
> > + /* It isn't safe to process more packets once we've
> > + * released the RPC lock (this should never happen).
> > + */
> > + while (next) {
> > + WARN_ONCE(next, "%s found extra packets after ACK\n",
> > + __func__);
>
> It looks like the above WARN could be triggered by an unexpected traffic
> pattern generate from the client. If so, you should avoid the WARN() and
> instead use e.g. some mib counter.
The real problem here is with homa_ack_pkt returning with the RPC lock
released. I've fixed that now, so the check is no longer necessary
(I've deleted it).
> > +
> > + if (skb_queue_len(&rpc->msgin.packets) != 0 &&
> > + !(atomic_read(&rpc->flags) & RPC_PKTS_READY)) {
> > + atomic_or(RPC_PKTS_READY, &rpc->flags);
> > + homa_sock_lock(rpc->hsk, "homa_data_pkt");
> > + homa_rpc_handoff(rpc);
> > + homa_sock_unlock(rpc->hsk);
>
> It looks like you tried to enforce the following lock acquiring order:
> rpc lock
> socket lock
> which is IMHO quite unnatural, as the socket has a wider scope than the
> RPC. In practice the locking schema is quite complex and hard to follow.
> I think (wild guess) that inverting the lock order would simplify the
> locking schema significantly.
This locking order is necessary for Homa. Because a single socket can
be used for many concurrent RPCs to many peers, it doesn't work to
acquire the socket lock for every operation: it would suffer terrible
contention (I tried this in the earliest versions of Homa and it was a
bottleneck under high load). Thus the RPC lock is the primary lock in
Homa, not the socket lock. Many operations can be completed without
ever holding the socket lock, which reduces contention for the socket
lock.
In TCP a busy app will spread itself over a lot of sockets, so the
socket locks are less likely to be contended.
> > +void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> > + struct homa_rpc *rpc)
> > + __releases(rpc->bucket_lock)
> > +{
> > + const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> > + struct homa_ack_hdr *h = (struct homa_ack_hdr *)skb->data;
> > + int i, count;
> > +
> > + if (rpc) {
> > + homa_rpc_free(rpc);
> > + homa_rpc_unlock(rpc);
>
> Another point that makes IMHO the locking schema hard to follow is the
> fact that many non-locking-related functions acquires or release some
> lock internally. The code would be much more easy to follow if you could
> pair the lock and unlock as much as possible inside the same code block.
I agree. There are places where a function has to release a lock
internally for various reasons, but it should reacquire the lock
before returning to preserve symmetry. There are places where
functions release a lock without reacquiring it, but that's a bad idea
I'd like to fix (homa_ack_pkt is one example). One of the reasons for
this was that once an RPC lock is released the RPC could go away, so
it wasn't safe to attempt to relock it. I have added new methods
homa_rpc_hold() and homa_rpc_put() so that it's possible to take a
reference count on an RPC to keep it around while the lock is
released, so the lock can safely be reacquired later. This is how I
fixed the homa_ack_pkt problem you pointed out above. If you see any
other places with this asymmetry, let me know and I'll fix them also.
The new methods also provide a consistent and simple solution to
several other problems that had been solved in an ad hoc way.
It would be even better if a function never had to release a lock
internally, but so far I haven't figured out how to do that. If you
have ideas I'd like to hear them.
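Roughly, the new helpers look like this (sketch only; the refcount field
name and the final-free hook are placeholders, not the actual code):

static inline void homa_rpc_hold(struct homa_rpc *rpc)
{
        refcount_inc(&rpc->refs);       /* 'refs' is a placeholder name */
}

static inline void homa_rpc_put(struct homa_rpc *rpc)
{
        /* Once the last reference is dropped the reaper may free the RPC. */
        if (refcount_dec_and_test(&rpc->refs))
                homa_rpc_finish_reap(rpc);      /* placeholder for real hook */
}

so a function that needs to drop the RPC lock temporarily can take a
reference first, release the lock, and safely reacquire it later before
dropping the reference.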
> > +
> > + if (id != 0) {
> > + if ((atomic_read(&rpc->flags) & RPC_PKTS_READY) || rpc->error)
> > + goto claim_rpc;
> > + rpc->interest = interest;
> > + interest->reg_rpc = rpc;
> > + homa_rpc_unlock(rpc);
>
> With the current schema you should release the hsh socket lock before
> releasing the rpc one.
Normally I would agree, but that won't work here: the RPC is no longer
of interest, so it needs to be unlocked, but we need to keep the
socket lock through the code that follows. This is safe (out-of-order
lock acquisition can cause deadlocks, but the order of lock releasing
doesn't matter except aesthetically).
> > +struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
> > + __u64 id)
> > + __acquires(&rpc->bucket_lock)
> > +{
> > + ...
> > + }
> The amount of custom code to wait is concerning. Why can't you build
> around sk_wait_event()?
I agree that it's complicated. sk_wait_event can't be used because
Homa allows different threads to wait on partially-overlapping sets of
RPCs. For example, one thread can wait for a specific RPC to complete,
while another thread waits for *any* RPC to complete. Thus a given RPC
completion may not apply to all of the waiting threads. Here's a link
to a man page that describes the recvmsg API for Homa:
https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
That said, I have never been very happy with this API (and its
consequences for the waiting code). I've occasionally thought there
must be a better alternative but never came up with anything I liked.
However, your comment forced me to think about this some more, and I
now think I have a better idea for how to do waiting, which will
eliminate overlapping waits and allow sk_wait_event to be used instead
of the "interest" mechanism that's currently implemented. This will
be an API change, but if I'm going to do it I think I should do it
now, before upstreaming. So I will do that.
-John-
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-01-27 10:19 ` Paolo Abeni
@ 2025-01-30 0:48 ` John Ousterhout
2025-01-30 9:57 ` Paolo Abeni
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-30 0:48 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, Jan 27, 2025 at 2:19 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/15/25 7:59 PM, John Ousterhout wrote:
> > + /* Each iteration through the following loop processes one packet. */
> > + for (; skb; skb = next) {
> > + h = (struct homa_data_hdr *)skb->data;
> > + next = skb->next;
> > +
> > + /* Relinquish the RPC lock temporarily if it's needed
> > + * elsewhere.
> > + */
> > + if (rpc) {
> > + int flags = atomic_read(&rpc->flags);
> > +
> > + if (flags & APP_NEEDS_LOCK) {
> > + homa_rpc_unlock(rpc);
> > + homa_spin(200);
>
> Why spinning on the current CPU here? This is completely unexpected, and
> usually tolerated only to deal with H/W imposed delay while programming
> some device registers.
This is done to pass the RPC lock off to another thread (the
application); the spin is there to allow the other thread to acquire
the lock before this thread tries to acquire it again (almost
immediately). There's no performance impact from the spin because this
thread is going to turn around and try to acquire the RPC lock again
(at which point it will spin until the other thread releases the
lock). Thus it's either spin here or spin there. I've added a comment
to explain this.
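For context, the handoff on the application side looks roughly like this
(sketch only; the lock helper and the way the flag is cleared are
assumptions):

static void homa_grab_rpc_lock(struct homa_rpc *rpc)
{
        /* Ask SoftIRQ processing to yield the lock at the next packet
         * boundary, then contend for it.
         */
        atomic_or(APP_NEEDS_LOCK, &rpc->flags);
        homa_rpc_lock(rpc);
        atomic_andnot(APP_NEEDS_LOCK, &rpc->flags);
}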
-John-
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-01-30 0:48 ` John Ousterhout
@ 2025-01-30 9:57 ` Paolo Abeni
2025-01-31 22:51 ` John Ousterhout
[not found] ` <CAGXJAmxLqnjnWr8sjooJRRyQ2-5BqPCQL8gnn0gzYoZ0MMoBSw@mail.gmail.com>
0 siblings, 2 replies; 68+ messages in thread
From: Paolo Abeni @ 2025-01-30 9:57 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 1/30/25 1:48 AM, John Ousterhout wrote:
> On Mon, Jan 27, 2025 at 2:19 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>> On 1/15/25 7:59 PM, John Ousterhout wrote:
>>> + /* Each iteration through the following loop processes one packet. */
>>> + for (; skb; skb = next) {
>>> + h = (struct homa_data_hdr *)skb->data;
>>> + next = skb->next;
>>> +
>>> + /* Relinquish the RPC lock temporarily if it's needed
>>> + * elsewhere.
>>> + */
>>> + if (rpc) {
>>> + int flags = atomic_read(&rpc->flags);
>>> +
>>> + if (flags & APP_NEEDS_LOCK) {
>>> + homa_rpc_unlock(rpc);
>>> + homa_spin(200);
>>
>> Why spinning on the current CPU here? This is completely unexpected, and
>> usually tolerated only to deal with H/W imposed delay while programming
>> some device registers.
>
> This is done to pass the RPC lock off to another thread (the
> application); the spin is there to allow the other thread to acquire
> the lock before this thread tries to acquire it again (almost
> immediately). There's no performance impact from the spin because this
> thread is going to turn around and try to acquire the RPC lock again
> (at which point it will spin until the other thread releases the
> lock). Thus it's either spin here or spin there. I've added a comment
> to explain this.
What if another process is spinning on the RPC lock without setting
APP_NEEDS_LOCK? AFAICS incoming packets targeting the same RPC could
land on different RX queues.
If the spin is not functionally needed, just drop it. If it's needed, it
would be better to find some functional replacement, possibly explicit
notification via waitqueue or completion.
/P
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
[not found] ` <991b5ad9-57cf-4e1d-8e01-9d0639fa4e49@redhat.com>
@ 2025-01-31 22:48 ` John Ousterhout
2025-02-03 9:12 ` Paolo Abeni
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-01-31 22:48 UTC (permalink / raw)
To: Paolo Abeni; +Cc: Netdev, Eric Dumazet, Simon Horman, Jakub Kicinski
Resending because I accidentally left HTML enabled in the original; sorry...
On Thu, Jan 30, 2025 at 1:39 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/30/25 1:41 AM, John Ousterhout wrote:
> > On Fri, Jan 24, 2025 at 12:31 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>
> >> OoO will cause additional allocation? this feels like DoS prone.
> >>
> >>> + }
> >>> + rpc->msgin.recv_end = end;
> >>> + goto keep;
> >>> + }
> >>> +
> >>> + /* Must now check to see if the packet fills in part or all of
> >>> + * an existing gap.
> >>> + */
> >>> + list_for_each_entry_safe(gap, dummy, &rpc->msgin.gaps, links) {
> >>
> >> Linear search for OoO has proven to be subject to serious dos issue. You
> >> should instead use a (rb-)tree to handle OoO packets.
> >
> > I have been assuming that DoS won't be a major issue for Homa because
> > it's intended for use only in datacenters (if there are antagonistic
> > parties, they will be isolated from each other by networking
> > hardware). Is this a bad assumption?
>
> I think assuming that the peer will always behave is dangerous. The peer
> could be buggy or compromised, transient network condition may arise.
> Even un-malicious users tend to do the most crazy and unexpected things
> given enough time.
>
> Also the disclaimer "please don't use this on an internet facing
> host" sounds quite bad for a networking protocol ;)
I don't see why this disclaimer would be needed: as long as the Homa
hosts are inside the firewall, the firewall will prevent any external
Homa packets from reaching them. And, if your host is compromised, DoS
is the least of your worries.
It seems to me that this is mostly about helping people debug. I agree
that that could be useful. However, it's hard for me to imagine this
particular situation (lots of gaps in the packet stream) happening by
accident. Applications won't be generating Homa packets themselves,
they will be using Homa, and Homa won't generate lots of gaps. There
are other kinds of bad application behavior that are much more likely
to occur, such as an app that gets into an infinite loop sending
requests without ever receiving responses.
And, an rb-tree will add complexity and slow down the common case
(trees have better O(...) behavior than lists but worse constant
factors).
Unless you see this as a show-stopper issue, I'd prefer not to use an
rb-tree for packet gaps.
> >>> +
> >>> + /* Packet is in the middle of the gap; must split the gap. */
> >>> + gap2 = homa_gap_new(&gap->links, gap->start, start);
> >>> + if (!gap2) {
> >>> + pr_err("Homa couldn't allocate gap for split: insufficient memory\n");
> >>> + goto discard;
> >>> + }
> >>> + gap2->time = gap->time;
> >>> + gap->start = end;
> >>> + goto keep;
> >>> + }
> >>> +
> >>> +discard:
> >>> + kfree_skb(skb);
> >>> + return;
> >>> +
> >>> +keep:
> >>> + __skb_queue_tail(&rpc->msgin.packets, skb);
> >>
> >> Here 'msgin.packets' is apparently under RPC lock protection, but
> >> elsewhere - in homa_rpc_reap() - the list is apparently protected by
> >> its own lock.
> >
> > What are you referring to by "its own lock?"
>
> msgin.packets.lock
>
> i.e. skb_dequeue() in homa_rpc_reap() uses such lock, while all the '__'
> variants of the sk_buff_head helper don't use it.
That's a bug. I wasn't aware of the internal lock when I wrote that
code, or had forgotten. I'll switch to the '__' variants.
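In other words, something like this in the reap path (illustrative only;
the function name is a placeholder):

static void homa_reap_msgin_packets(struct homa_rpc *rpc)
{
        struct sk_buff *skb;

        /* The RPC is dead and private to the reaper at this point, so
         * no extra lock is needed; __skb_dequeue() skips the internal
         * sk_buff_head spinlock that skb_dequeue() would take.
         */
        while ((skb = __skb_dequeue(&rpc->msgin.packets)) != NULL)
                kfree_skb(skb);
}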
> >> Also it looks like there is no memory accounting at all, and SO_RCVBUF
> >> setting are just ignored.
> >
> > Homa doesn't yet have comprehensive memory accounting, but there is a
> > limit on buffer space for incoming messages. Instead of SO_RCVBUF,
> > applications control the amount of receive buffer space by controlling
> > the size of the buffer pool they provide to Homa with the
> > SO_HOMA_RCVBUF socket option.
>
> Ignoring SO_RCVBUF (and net.core.rmem_* sysctls) is both unexpected and
> dangerous (a single application may consume unbounded amount of system
> memory). Also what about the TX side? I don't see any limit at all there.
An application cannot consume unbounded system memory on the RX side
(in fact it consumes almost none). When packets arrive, their data is
immediately transferred to a buffer region in user memory provided by
the application (using the facilities in homa_pool.c). Skb's are
occupied only long enough to make this transfer, and it happens even
if there is no pending recv* kernel call. The size of the buffer
region is limited by the application, and the application must provide
a region via SO_HOMA_RCVBUF. Given this, there's no need for SO_RCVBUF
(and I don't see why a different limit would be specified via
SO_RCVBUF than the one already provided via SO_HOMA_RCVBUF). I agree
that this is different from TCP, but Homa is different from TCP in
lots of ways.
There is currently no accounting or control on the TX side. I agree
that this needs to be implemented at some point, but if possible I'd
prefer to defer this until more of Homa has been upstreamed. For
example, this current patch doesn't include any sysctl support, which
would be needed as part of accounting/control (the support is part of
the GitHub repo, it's just not in this patch series).
> >>> +/**
> >>> + * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
> >>> + * all related to the same RPC.
> >>> + * @skb: First packet in the batch, linked through skb->next.
> >>> + * @homa: Overall information about the Homa transport.
> >>> + */
> >>> +void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa)
> >>
> >> I see I haven't mentioned the following so far, but you should move the
> >> struct homa to a pernet subsystem.
> >
> > Sorry for my ignorance, but I'm not familiar with the concept of "a
> > pernet subsystem". What's the best way for me to learn more about
> > this?
>
> Have a look at register_pernet_subsys(), struct pernet_operations and
> some basic usage example, i.e. in net/8021q/vlan.c.
>
> register_pernet_subsys() allows registering/allocating a per network
> namespace structure of specified size (pernet_operations.size) that the
> subsystem can use according to its own needs, fetching it from the netns
> via the `id` obtained at registration time.
I will take a look.
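From a first read, the registration pattern looks roughly like this
(a sketch; struct homa_net, homa_net_id, and the init/destroy helpers are
placeholders, not code from the patch):

static unsigned int homa_net_id __read_mostly;

struct homa_net {
        struct homa homa;       /* per-namespace transport state */
};

static int __net_init homa_net_init(struct net *net)
{
        struct homa_net *hnet = net_generic(net, homa_net_id);

        return homa_init(&hnet->homa);          /* assumed init helper */
}

static void __net_exit homa_net_exit(struct net *net)
{
        struct homa_net *hnet = net_generic(net, homa_net_id);

        homa_destroy(&hnet->homa);              /* assumed cleanup helper */
}

static struct pernet_operations homa_net_ops = {
        .init = homa_net_init,
        .exit = homa_net_exit,
        .id   = &homa_net_id,
        .size = sizeof(struct homa_net),
};

/* in module init: register_pernet_subsys(&homa_net_ops); */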
> >>> + /* Find and lock the RPC if we haven't already done so. */
> >>> + if (!rpc) {
> >>> + if (!homa_is_client(id)) {
> >>> + /* We are the server for this RPC. */
> >>> + if (h->common.type == DATA) {
> >>> + int created;
> >>> +
> >>> + /* Create a new RPC if one doesn't
> >>> + * already exist.
> >>> + */
> >>> + rpc = homa_rpc_new_server(hsk, &saddr,
> >>> + h, &created);
> >>
> >> It looks like a buggy or malicious client could force server RPC
> >> allocation to any _client_ ?!?
> >
> > I'm not sure what you mean by "force server RPC allocation to any
> > _client_"; can you give a bit more detail?
>
> AFAICS the homa protocol uses only the `id` provided by the sender to
> classify the incoming packet as a client request and thus
> allocates resources (a new server RPC) on the receiver.
>
> Suppose an host creates a client socket, and a port is assigned to it.
>
> A malicious or buggy peer starts sending an (unlimited amount of)
> uncompleted homa RPC requests to it.
>
> AFAICS the host A will allocate server RPCs in response to such incoming
> packets, which is unexpected to me.
Now I see what you're getting at. Homa sockets are symmetric: any
socket can be used for both the client and server sides of RPCs. Thus
it's possible to send requests even to sockets that haven't been
"bound". I think of this as a feature, not a bug (it can potentially
reduce the need to allocate "known" port numbers). At the same time, I
see your point that some applications might not expect to receive
requests. Would you like a mechanism to disable this? For example,
sockets could be configured by default to reject incoming requests;
invoking the "bind" system call would enable incoming requests (I
would also add a setsockopt mechanism for enabling requests even on
"unbound" sockets).
> Additionally AFAICS each RPC is identified only by dport/id, and since both
> port and id allocation is sequential it looks like it's quite easy to
> spoof/inject data in a different RPC - even "by mistake". I guess this
> is a protocol limit.
On the server side an RPC is identified by <client address, dport,
id>, but on the client side only by <dport, id> (the sender address
isn't needed to lookup the correct RPC). However, it would be easy to
check incoming packets to make sure that the sender address matches
the sender in the RPC. I will do that.
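The check itself would be a one-liner of this form (sketch; 'peer_addr'
is a placeholder for however the RPC records the peer's address):

static bool homa_peer_matches(struct homa_rpc *rpc,
                              const struct in6_addr *saddr)
{
        return ipv6_addr_equal(&rpc->peer_addr, saddr);
}

called from the dispatch path, discarding the packet when it returns
false.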
> >>> + case UNKNOWN:
> >>> + homa_unknown_pkt(skb, rpc);
> >>
> >> It's sort of unexpected that the protocol explicitly defines the unknown
> >>> packet type, and handles it differently from undefined types.
> >
> > Maybe the name UNKNOWN is causing confusion? An UNKNOWN packet is sent
> > when an endpoint receives a RESEND packet for an RPC that is unknown
> > to it. The term UNKNOWN refers to an unknown RPC, as opposed to an
> > unrecognized packet type.
>
> Yep, possibly a more extended/verbose type name would help.
OK, I'll rename it.
> >>> +
> >>> + if (skb_queue_len(&rpc->msgin.packets) != 0 &&
> >>> + !(atomic_read(&rpc->flags) & RPC_PKTS_READY)) {
> >>> + atomic_or(RPC_PKTS_READY, &rpc->flags);
> >>> + homa_sock_lock(rpc->hsk, "homa_data_pkt");
> >>> + homa_rpc_handoff(rpc);
> >>> + homa_sock_unlock(rpc->hsk);
> >>
> >> It looks like you tried to enforce the following lock acquiring order:
> >> rpc lock
> >> socket lock
> >> which is IMHO quite unnatural, as the socket has a wider scope than the
> >> RPC. In practice the locking schema is quite complex and hard to follow.
> >> I think (wild guess) that inverting the lock order would simplify the
> >> locking schema significantly.
> >
> > This locking order is necessary for Homa. Because a single socket can
> > be used for many concurrent RPCs to many peers, it doesn't work to
> > acquire the socket lock for every operation: it would suffer terrible
> > contention (I tried this in the earliest versions of Homa and it was a
> > bottleneck under high load).
>
> Would a separate lock for the homa_pool help?
I don't think so. The main problem with reversing the order of lock
acquisition is that the socket lock would have to be acquired for
every packet lookup.
> >>> +void homa_ack_pkt(struct sk_buff *skb, struct homa_sock *hsk,
> >>> + struct homa_rpc *rpc)
> >>> + __releases(rpc->bucket_lock)
> >>> +{
> >>> + const struct in6_addr saddr = skb_canonical_ipv6_saddr(skb);
> >>> + struct homa_ack_hdr *h = (struct homa_ack_hdr *)skb->data;
> >>> + int i, count;
> >>> +
> >>> + if (rpc) {
> >>> + homa_rpc_free(rpc);
> >>> + homa_rpc_unlock(rpc);
> >>
> >> Another point that makes IMHO the locking schema hard to follow is the
> >> fact that many non-locking-related functions acquires or release some
> >> lock internally. The code would be much more easy to follow if you could
> >> pair the lock and unlock as much as possible inside the same code block.
> >
> > I agree. There are places where a function has to release a lock
> > internally for various reasons, but it should reacquire the lock
> > before returning to preserve symmetry. There are places where
> > functions release a lock without reacquiring it, but that's a bad idea
> > I'd like to fix (homa_ack_pkt is one example). One of the reasons for
> > this was that once an RPC lock is released the RPC could go away, so
> > it wasn't safe to attempt to relock it. I have added new methods
> > homa_rpc_hold() and homa_rpc_put() so that it's possible to take a
> > reference count on an RPC to keep it around while the lock is
> > released, so the lock can safely be reacquired later. This is how I
> > fixed the homa_ack_pkt problem you pointed out above. If you see any
> > other places with this asymmetry, let me know and I'll fix them also.
>
> I need to see the new code :) (plus a lot of time, I guess)
I have now gone through the code myself; I think I have now eliminated
all the places where an RPC is locked on entry to a function but
unlocked on return.
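A minimal sketch of the hold/unlock/relock pattern described here;
homa_rpc_hold()/homa_rpc_put() are assumed to wrap a simple reference count,
and homa_rpc_lock() is assumed as the counterpart of the homa_rpc_unlock()
seen in the quoted code:

	homa_rpc_hold(rpc);		/* pin the RPC so it cannot be reaped */
	homa_rpc_unlock(rpc);		/* drop the lock for the unsafe region */

	/* ... work that must run without the RPC lock held ... */

	homa_rpc_lock(rpc);		/* safe: the hold keeps rpc alive */
	homa_rpc_put(rpc);		/* release the reference, still locked */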
> > The new methods also provide a consistent and simple solution to
> > several other problems that had been solved in an ad hoc way.
> >
> > It would be even better if a function never had to release a lock
> > internally, but so far I haven't figured out how to do that. If you
> > have ideas I'd like to hear them.
>
> In some cases it could be possible to move the unlock in the caller,
> eventually breaking the relevant function in smaller helpers.
>
> >>> +
> >>> + if (id != 0) {
> >>> + if ((atomic_read(&rpc->flags) & RPC_PKTS_READY) || rpc->error)
> >>> + goto claim_rpc;
> >>> + rpc->interest = interest;
> >>> + interest->reg_rpc = rpc;
> >>> + homa_rpc_unlock(rpc);
> >>
> >> With the current schema you should release the hsk socket lock before
> >> releasing the rpc one.
> >
> > Normally I would agree, but that won't work here: the RPC is no longer
> > of interest, so it needs to be unlocked,
>
> Is that unlock strictly necessary (would cause a deadlock if omitted) or
> just an optimization?
That RPC definitely needs to be unlocked (another RPC gets locked
later in the function). It would be possible to defer its unlocking
until after the socket is unlocked, but that would be awkward: that
RPC is never used again, and there would need to be an extra variable
to remember the RPC for later unlocking; the later unlocking would
drop in "out of the blue" (readers would wonder "why is this RPC being
unlocked here?"). And it would keep an RPC locked unnecessarily.
> > but we need to keep the socket lock through the code that follows.
>
> Why? Do you need the hsk to be alive, or some specific state to be
> consistent? The first could possibly be achieved with a refcnt.
The socket lock must be held when examining the queues of ready RPCs
and waiting threads.
> > This is safe (out-of-order
> > lock acquisition can cause deadlocks, but the order of lock releasing
> > doesn't matter except aesthetically).
>
> I think the existing code would trigger some lockdep splat.
So far it hasn't, and I think I've got the lockdep checks enabled.
> >>> +struct homa_rpc *homa_wait_for_message(struct homa_sock *hsk, int flags,
> >>> + __u64 id)
> >>> + __acquires(&rpc->bucket_lock)
> >>> +{
> >>> + ...
> >>> + }
> >
> >> The amount of custom code to wait is concerning. Why can't you build
> >> around sk_wait_event()?
> >
> > I agree that it's complicated. sk_wait_event can't be used because
> > Homa allows different threads to wait on partially-overlapping sets of
> > RPCs. For example, one thread can wait for a specific RPC to complete,
> > while another thread waits for *any* RPC to complete. Thus a given RPC
> > completion may not apply to all of the waiting threads. Here's a link
> > to a man page that describes the recvmsg API for Homa:
> >
> > https://homa-transport.atlassian.net/wiki/spaces/HOMA/overview
>
> sk_wait_event() could deal with arbitrary complex wake-up conditions -
> code them in a function and pass such function as the __condition argument.
>
> A problem could be WRT locking, since sk_wait_event() expects the caller
> to hold the sk socket lock.
>
> Have you considered using the sk lock to protect the hsk status, and
> finer grained spinlocks for specific hsk fields/attributes?
That's the idea behind the RPC locks, and they reduced hsk lock
contention to a tolerable level. The main place where socket lock
contention still happens now is in the handoff mechanism for incoming
messages: for every message, both the SoftIRQ code that declares the
message complete (homa_rpc_handoff) and the thread that eventually
receives the message (homa_wait_for_message) must acquire the socket
lock. This limits the message throughput for a single socket, which
could impact server apps that want to balance incoming load across a
large number of threads (especially if the workload consists of short
messages with short service times). I have a few vague ideas for how
to get around this but haven't yet had time to try any of them out.
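For readers who have not used it, the sk_wait_event() pattern Paolo refers to
usually has roughly the shape below (not Homa code; rpc_ready() is a
hypothetical condition helper). The lock_sock()/release_sock() bracketing it
requires is the locking mismatch mentioned above:

static int wait_for_ready(struct sock *sk, long timeo)
{
	DEFINE_WAIT_FUNC(wait, woken_wake_function);
	int rc = 0;

	lock_sock(sk);
	add_wait_queue(sk_sleep(sk), &wait);
	while (!rpc_ready(sk)) {
		/* Releases and reacquires the socket lock internally. */
		rc = sk_wait_event(sk, &timeo, rpc_ready(sk), &wait);
		if (rc < 0 || !timeo)
			break;
	}
	remove_wait_queue(sk_sleep(sk), &wait);
	release_sock(sk);
	return rc;
}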
-John-
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-01-30 9:57 ` Paolo Abeni
@ 2025-01-31 22:51 ` John Ousterhout
[not found] ` <CAGXJAmxLqnjnWr8sjooJRRyQ2-5BqPCQL8gnn0gzYoZ0MMoBSw@mail.gmail.com>
1 sibling, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-01-31 22:51 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
Also resending this message to get rid of HTML in the original...
On Thu, Jan 30, 2025 at 1:57 AM Paolo Abeni <pabeni@redhat.com> wrote:
> On 1/30/25 1:48 AM, John Ousterhout wrote:
> > On Mon, Jan 27, 2025 at 2:19 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>
> >> On 1/15/25 7:59 PM, John Ousterhout wrote:
> >>> + /* Each iteration through the following loop processes one packet. */
> >>> + for (; skb; skb = next) {
> >>> + h = (struct homa_data_hdr *)skb->data;
> >>> + next = skb->next;
> >>> +
> >>> + /* Relinquish the RPC lock temporarily if it's needed
> >>> + * elsewhere.
> >>> + */
> >>> + if (rpc) {
> >>> + int flags = atomic_read(&rpc->flags);
> >>> +
> >>> + if (flags & APP_NEEDS_LOCK) {
> >>> + homa_rpc_unlock(rpc);
> >>> + homa_spin(200);
> >>
> >> Why spinning on the current CPU here? This is completely unexpected, and
> >> usually tolerated only to deal with H/W imposed delay while programming
> >> some device registers.
> >
> > This is done to pass the RPC lock off to another thread (the
> > application); the spin is there to allow the other thread to acquire
> > the lock before this thread tries to acquire it again (almost
> > immediately). There's no performance impact from the spin because this
> > thread is going to turn around and try to acquire the RPC lock again
> > (at which point it will spin until the other thread releases the
> > lock). Thus it's either spin here or spin there. I've added a comment
> > to explain this.
>
> What if another process is spinning on the RPC lock without setting
> APP_NEEDS_LOCK? AFAICS incoming packets targeting the same RPC could
> land on different RX queues.
If that happens then it could grab the lock instead of the desired
application, which would defeat the performance optimization and delay
the application a bit. This would be no worse than if the
APP_NEEDS_LOCK mechanism were not present.
> If the spin is not functionally needed, just drop it. If it's needed, it
> would be better to find some functional replacement, possibly explicit
> notification via waitqueue or completion.
The goal is to have a very lightweight mechanism for an application to
preempt the RPC lock. I'd be happy to use an existing mechanism if
something appropriate exists, but waitqueues and completions sound
more heavyweight to me; aren't they both based on blocking rather than
spinning?
One of the reasons Homa has rolled its own mechanisms is that it's
trying to operate at a timescale that's different from the rest of the
kernel.
-John-
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-01-31 22:48 ` John Ousterhout
@ 2025-02-03 9:12 ` Paolo Abeni
2025-02-03 23:33 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-02-03 9:12 UTC (permalink / raw)
To: John Ousterhout; +Cc: Netdev, Eric Dumazet, Simon Horman, Jakub Kicinski
On 1/31/25 11:48 PM, John Ousterhout wrote:
> On Thu, Jan 30, 2025 at 1:39 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 1/30/25 1:41 AM, John Ousterhout wrote:
>>> On Fri, Jan 24, 2025 at 12:31 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>>
>>>> OoO will cause additional allocation? This feels DoS-prone.
>>>>
>>>>> + }
>>>>> + rpc->msgin.recv_end = end;
>>>>> + goto keep;
>>>>> + }
>>>>> +
>>>>> + /* Must now check to see if the packet fills in part or all of
>>>>> + * an existing gap.
>>>>> + */
>>>>> + list_for_each_entry_safe(gap, dummy, &rpc->msgin.gaps, links) {
>>>>
>>>> Linear search for OoO has proven to be subject to serious DoS issues. You
>>>> should instead use a (rb-)tree to handle OoO packets.
>>>
>>> I have been assuming that DoS won't be a major issue for Homa because
>>> it's intended for use only in datacenters (if there are antagonistic
>>> parties, they will be isolated from each other by networking
>>> hardware). Is this a bad assumption?
>>
>> I think assuming that the peer will always behave is dangerous. The peer
>> could be buggy or compromised, and transient network conditions may arise.
>> Even un-malicious users tend to do the most crazy and unexpected things
>> given enough time.
>>
>> Also the disclaimer "please don't use this on an internet-facing
>> host" sounds quite bad for a networking protocol ;)
>
> I don't see why this disclaimer would be needed: as long as the Homa
> hosts are inside the firewall, the firewall will prevent any external
> Homa packets from reaching them. And, if your host is compromised, DoS
> is the least of your worries.
>
> It seems to me that this is mostly about helping people debug. I agree
> that that could be useful. However, it's hard for me to imagine this
> particular situation (lots of gaps in the packet stream) happening by
> accident. Applications won't be generating Homa packets themselves,
> they will be using Homa, and Homa won't generate lots of gaps. There
> are other kinds of bad application behavior that are much more likely
> to occur, such as an app that gets into an infinite loop sending
> requests without ever receiving responses.
>
> And, an rb-tree will add complexity and slow down the common case
> (trees have better O(...) behavior than lists but worse constant
> factors).
My point is that a service (and the host running it) is supposed to
survive any kind of unexpected condition, as such conditions will happen for
sure on any internet-facing host, and will likely happen even in the most
controlled environment due to bugs or unexpected events somewhere (else).
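For reference, one way the gap list could be kept in an rb-tree keyed by
starting offset, as Paolo suggests; the struct homa_gap layout shown here
(start/end/node) is an assumption, not Homa's actual definition:

#include <linux/rbtree.h>

struct homa_gap {
	struct rb_node node;
	int start;			/* first missing byte */
	int end;			/* first byte after the gap */
};

/* Return the gap containing @offset, or NULL; O(log n) per packet. */
static struct homa_gap *homa_gap_find(struct rb_root *root, int offset)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct homa_gap *gap = rb_entry(n, struct homa_gap, node);

		if (offset < gap->start)
			n = n->rb_left;
		else if (offset >= gap->end)
			n = n->rb_right;
		else
			return gap;
	}
	return NULL;
}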
>>>> Also it looks like there is no memory accounting at all, and SO_RCVBUF
>>>> setting are just ignored.
>>>
>>> Homa doesn't yet have comprehensive memory accounting, but there is a
>>> limit on buffer space for incoming messages. Instead of SO_RCVBUF,
>>> applications control the amount of receive buffer space by controlling
>>> the size of the buffer pool they provide to Homa with the
>>> SO_HOMA_RCVBUF socket option.
>>
>> Ignoring SO_RCVBUF (and net.core.rmem_* sysctls) is both unexpected and
>> dangerous (a single application may consume unbounded amount of system
>> memory). Also what about the TX side? I don't see any limit at all there.
>
> An application cannot consume unbounded system memory on the RX side
> (in fact it consumes almost none). When packets arrive, their data is
> immediately transferred to a buffer region in user memory provided by
> the application (using the facilities in homa_pool.c). Skb's are
> occupied only long enough to make this transfer, and it happens even
> if there is no pending recv* kernel call. The size of the buffer
> region is limited by the application, and the application must provide
> a region via SO_HOMA_RCVBUF.
I don't see where/how the SO_HOMA_RCVBUF max value is somehow bounded?!?
It looks like user space could pick an arbitrarily large value for it.
> Given this, there's no need for SO_RCVBUF
> (and I don't see why a different limit would be specified via
> SO_RCVBUF than the one already provided via SO_HOMA_RCVBUF).
> I agree that this is different from TCP, but Homa is different from TCP in
> lots of ways.
>
> There is currently no accounting or control on the TX side. I agree
> that this needs to be implemented at some point, but if possible I'd
> prefer to defer this until more of Homa has been upstreamed. For
> example, this current patch doesn't include any sysctl support, which
> would be needed as part of accounting/control (the support is part of
> the GitHub repo, it's just not in this patch series).
SO_RCVBUF and SO_SNDBUF are expected to apply to any kind of socket,
see man 7 sockets. Exceptions should be at least documented, but we need
some way to limit memory usage in both directions.
Fine tuning controls and sysctls could land later, but the basic
constraints should IMHO be there from the beginning.
>>>>> +/**
>>>>> + * homa_dispatch_pkts() - Top-level function that processes a batch of packets,
>>>>> + * all related to the same RPC.
>>>>> + * @skb: First packet in the batch, linked through skb->next.
>>>>> + * @homa: Overall information about the Homa transport.
>>>>> + */
>>>>> +void homa_dispatch_pkts(struct sk_buff *skb, struct homa *homa)
>>>>
>>>> I see I haven't mentioned the following so far, but you should move the
>>>> struct homa to a pernet subsystem.
>>>
>>> Sorry for my ignorance, but I'm not familiar with the concept of "a
>>> pernet subsystem". What's the best way for me to learn more about
>>> this?
>>
>> Have a look at register_pernet_subsys(), struct pernet_operations and
>> some basic usage example, i.e. in net/8021q/vlan.c.
>>
>> register_pernet_subsys() allows registering/allocating a per-network-namespace
>> structure of a specified size (pernet_operations.size) that the
>> subsystem can use according to its own needs, fetching it from the netns
>> via the `id` obtained at registration time.
>
> I will take a look.
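A minimal sketch of the registration Paolo describes, with a placeholder
struct homa_net that simply embeds the existing struct homa per namespace;
the names here are assumptions, not the eventual Homa layout:

#include <linux/string.h>
#include <net/net_namespace.h>
#include <net/netns/generic.h>

struct homa_net {
	struct homa homa;	/* per-namespace transport state */
};

static unsigned int homa_net_id __read_mostly;

static int __net_init homa_net_init(struct net *net)
{
	struct homa_net *hnet = net_generic(net, homa_net_id);

	memset(&hnet->homa, 0, sizeof(hnet->homa));	/* real init goes here */
	return 0;
}

static void __net_exit homa_net_exit(struct net *net)
{
	/* Tear down the per-namespace state. */
}

static struct pernet_operations homa_net_ops = {
	.init = homa_net_init,
	.exit = homa_net_exit,
	.id   = &homa_net_id,
	.size = sizeof(struct homa_net),
};

/* At module load: register_pernet_subsys(&homa_net_ops); */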
>
>>>>> + /* Find and lock the RPC if we haven't already done so. */
>>>>> + if (!rpc) {
>>>>> + if (!homa_is_client(id)) {
>>>>> + /* We are the server for this RPC. */
>>>>> + if (h->common.type == DATA) {
>>>>> + int created;
>>>>> +
>>>>> + /* Create a new RPC if one doesn't
>>>>> + * already exist.
>>>>> + */
>>>>> + rpc = homa_rpc_new_server(hsk, &saddr,
>>>>> + h, &created);
>>>>
>>>> It looks like a buggy or malicious client could force server RPC
>>>> allocation to any _client_ ?!?
>>>
>>> I'm not sure what you mean by "force server RPC allocation to any
>>> _client_"; can you give a bit more detail?
>>
>> AFAICS the Homa protocol uses only the `id` provided by the sender to
>> classify the incoming packet as a client request and thus
>> allocate resources (a new server RPC) on the receiver.
>>
>> Suppose a host A creates a client socket, and a port is assigned to it.
>>
>> A malicious or buggy peer starts sending an (unlimited number of)
>> uncompleted Homa RPC requests to it.
>>
>> AFAICS host A will allocate server RPCs in response to such incoming
>> packets, which is unexpected to me.
>
> Now I see what you're getting at. Homa sockets are symmetric: any
> socket can be used for both the client and server sides of RPCs. Thus
> it's possible to send requests even to sockets that haven't been
> "bound". I think of this as a feature, not a bug (it can potentially
> reduce the need to allocate "known" port numbers). At the same time, I
> see your point that some applications might not expect to receive
> requests. Would you like a mechanism to disable this? For example,
> sockets could be configured by default to reject incoming requests;
> invoking the "bind" system call would enable incoming requests (I
> would also add a setsockopt mechanism for enabling requests even on
> "unbound" sockets).
I think that an explicit setsockopt() to enable incoming requests should
be fine.
>> Additionally AFAICS each RPC is identified only by dport/id and both
>> port and id allocation is sequential it looks like it's quite easy to
>> spoof/inject data in a different RPC - even "by mistake". I guess this
>> is a protocol limit.
>
> On the server side an RPC is identified by <client address, dport,
> id>, but on the client side only by <dport, id> (the sender address
> isn't needed to lookup the correct RPC). However, it would be easy to
> check incoming packets to make sure that the sender address matches
> the sender in the RPC. I will do that.
I somewhat missed the src address matching for the server side. I think
it would be good if the lookup could be symmetric for both the client
and the server.
>>> The new methods also provide a consistent and simple solution to
>>> several other problems that had been solved in an ad hoc way.
>>>
>>> It would be even better if a function never had to release a lock
>>> internally, but so far I haven't figured out how to do that. If you
>>> have ideas I'd like to hear them.
>>
>> In some cases it could be possible to move the unlock in the caller,
>> eventually breaking the relevant function in smaller helpers.
>>
>>>>> +
>>>>> + if (id != 0) {
>>>>> + if ((atomic_read(&rpc->flags) & RPC_PKTS_READY) || rpc->error)
>>>>> + goto claim_rpc;
>>>>> + rpc->interest = interest;
>>>>> + interest->reg_rpc = rpc;
>>>>> + homa_rpc_unlock(rpc);
>>>>
> >>>> With the current schema you should release the hsk socket lock before
>>>> releasing the rpc one.
>>>
>>> Normally I would agree, but that won't work here: the RPC is no longer
>>> of interest, so it needs to be unlocked,
>>
>> Is that unlock strictly necessary (would cause a deadlock if omitted) or
>> just an optimization?
>
> That RPC definitely needs to be unlocked (another RPC gets locked
> later in the function).
Side note: if you use per-RPC locks, and you know that the later one is a
_different_ RPC, there will be no need for unlocking (and LOCKDEP will
be happy with a "_nested" annotation).
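A sketch of the "_nested" annotation mentioned here; the lock member name
follows the __releases()/__acquires() annotations in the quoted code, and
this is only safe if every code path takes the two RPC locks in one agreed
order (for example by RPC id), which is the deadlock concern raised later in
the thread:

	spin_lock(&rpc1->bucket_lock);
	spin_lock_nested(&rpc2->bucket_lock, SINGLE_DEPTH_NESTING);

	/* ... work that needs both RPCs locked ... */

	spin_unlock(&rpc2->bucket_lock);
	spin_unlock(&rpc1->bucket_lock);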
> It would be possible to defer its unlocking
> until after the socket is unlocked, but that would be awkward: that
> RPC is never used again, and there would need to be an extra variable
> to remember the RPC for later unlocking; the later unlocking would
> drop in "out of the blue" (readers would wonder "why is this RPC being
> unlocked here?"). And it would keep an RPC locked unnecessarily.
I guess it's a matter of taste and personal preference. IMHO an
inconsistent unlock chain is harder to follow. I also think a comment is
due in either case.
Cheers,
Paolo
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
[not found] ` <CAGXJAmxLqnjnWr8sjooJRRyQ2-5BqPCQL8gnn0gzYoZ0MMoBSw@mail.gmail.com>
@ 2025-02-03 9:17 ` Paolo Abeni
2025-02-03 17:33 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-02-03 9:17 UTC (permalink / raw)
To: John Ousterhout; +Cc: netdev, edumazet, horms, kuba
On 1/31/25 11:35 PM, John Ousterhout wrote:
> On Thu, Jan 30, 2025 at 1:57 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 1/30/25 1:48 AM, John Ousterhout wrote:
>>> On Mon, Jan 27, 2025 at 2:19 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>>>
>>>> On 1/15/25 7:59 PM, John Ousterhout wrote:
>>>>> + /* Each iteration through the following loop processes one
>> packet. */
>>>>> + for (; skb; skb = next) {
>>>>> + h = (struct homa_data_hdr *)skb->data;
>>>>> + next = skb->next;
>>>>> +
>>>>> + /* Relinquish the RPC lock temporarily if it's needed
>>>>> + * elsewhere.
>>>>> + */
>>>>> + if (rpc) {
>>>>> + int flags = atomic_read(&rpc->flags);
>>>>> +
>>>>> + if (flags & APP_NEEDS_LOCK) {
>>>>> + homa_rpc_unlock(rpc);
>>>>> + homa_spin(200);
>>>>
>>>> Why spinning on the current CPU here? This is completely unexpected, and
>>>> usually tolerated only to deal with H/W imposed delay while programming
>>>> some device registers.
>>>
>>> This is done to pass the RPC lock off to another thread (the
>>> application); the spin is there to allow the other thread to acquire
>>> the lock before this thread tries to acquire it again (almost
>>> immediately). There's no performance impact from the spin because this
>>> thread is going to turn around and try to acquire the RPC lock again
>>> (at which point it will spin until the other thread releases the
>>> lock). Thus it's either spin here or spin there. I've added a comment
>>> to explain this.
>>
>> What if another process is spinning on the RPC lock without setting
>> APP_NEEDS_LOCK? AFAICS incoming packets targeting the same RPC could
>> land on different RX queues.
>>
>
> If that happens then it could grab the lock instead of the desired
> application, which would defeat the performance optimization and delay the
> application a bit. This would be no worse than if the APP_NEEDS_LOCK
> mechanism were not present.
Then I suggest using plain unlock/lock() with no additional spinning in
between.
/P
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-02-03 9:17 ` Paolo Abeni
@ 2025-02-03 17:33 ` John Ousterhout
2025-02-03 17:58 ` Andrew Lunn
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-02-03 17:33 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Mon, Feb 3, 2025 at 1:17 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 1/31/25 11:35 PM, John Ousterhout wrote:
> > On Thu, Jan 30, 2025 at 1:57 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >> On 1/30/25 1:48 AM, John Ousterhout wrote:
> >>> On Mon, Jan 27, 2025 at 2:19 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>>>
> >>>> On 1/15/25 7:59 PM, John Ousterhout wrote:
> >>>>> + /* Each iteration through the following loop processes one
> >> packet. */
> >>>>> + for (; skb; skb = next) {
> >>>>> + h = (struct homa_data_hdr *)skb->data;
> >>>>> + next = skb->next;
> >>>>> +
> >>>>> + /* Relinquish the RPC lock temporarily if it's needed
> >>>>> + * elsewhere.
> >>>>> + */
> >>>>> + if (rpc) {
> >>>>> + int flags = atomic_read(&rpc->flags);
> >>>>> +
> >>>>> + if (flags & APP_NEEDS_LOCK) {
> >>>>> + homa_rpc_unlock(rpc);
> >>>>> + homa_spin(200);
> >>>>
> >>>> Why spinning on the current CPU here? This is completely unexpected, and
> >>>> usually tolerated only to deal with H/W imposed delay while programming
> >>>> some device registers.
> >>>
> >>> This is done to pass the RPC lock off to another thread (the
> >>> application); the spin is there to allow the other thread to acquire
> >>> the lock before this thread tries to acquire it again (almost
> >>> immediately). There's no performance impact from the spin because this
> >>> thread is going to turn around and try to acquire the RPC lock again
> >>> (at which point it will spin until the other thread releases the
> >>> lock). Thus it's either spin here or spin there. I've added a comment
> >>> to explain this.
> >>
> >> What if another process is spinning on the RPC lock without setting
> >> APP_NEEDS_LOCK? AFAICS incoming packets targeting the same RPC could
> >> land on different RX queues.
> >>
> >
> > If that happens then it could grab the lock instead of the desired
> > application, which would defeat the performance optimization and delay the
> > application a bit. This would be no worse than if the APP_NEEDS_LOCK
> > mechanism were not present.
>
> Then I suggest using plain unlock/lock() with no additional spinning in
> between.
My concern here is that the unlock/lock sequence will happen so fast
that the other thread never actually has a chance to get the lock. I
will do some measurements to see what actually happens; if lock
ownership is successfully transferred in the common case without a
spin, then I'll remove it.
-John-
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-02-03 17:33 ` John Ousterhout
@ 2025-02-03 17:58 ` Andrew Lunn
2025-02-05 23:56 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Andrew Lunn @ 2025-02-03 17:58 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
> > > If that happens then it could grab the lock instead of the desired
> > > application, which would defeat the performance optimization and delay the
> > > application a bit. This would be no worse than if the APP_NEEDS_LOCK
> > > mechanism were not present.
> >
> > Then I suggest using plain unlock/lock() with no additional spinning in
> > between.
>
> My concern here is that the unlock/lock sequence will happen so fast
> that the other thread never actually has a chance to get the lock. I
> will do some measurements to see what actually happens; if lock
> ownership is successfully transferred in the common case without a
> spin, then I'll remove it.
https://docs.kernel.org/locking/mutex-design.html
If there is a thread waiting for the lock, it will spin for a while
trying to acquire it. The document also mentions that when there are
multiple waiters, the algorithm tries to be fair. So if there is a
fast unlock/lock, it should act fairly with the other waiter.
Andrew
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-02-03 9:12 ` Paolo Abeni
@ 2025-02-03 23:33 ` John Ousterhout
2025-02-04 8:50 ` Paolo Abeni
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-02-03 23:33 UTC (permalink / raw)
To: Paolo Abeni; +Cc: Netdev, Eric Dumazet, Simon Horman, Jakub Kicinski
On Mon, Feb 3, 2025 at 1:12 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> >>>> Also it looks like there is no memory accounting at all, and SO_RCVBUF
> >>>> setting are just ignored.
> >>>
> >>> Homa doesn't yet have comprehensive memory accounting, but there is a
> >>> limit on buffer space for incoming messages. Instead of SO_RCVBUF,
> >>> applications control the amount of receive buffer space by controlling
> >>> the size of the buffer pool they provide to Homa with the
> >>> SO_HOMA_RCVBUF socket option.
> >>
> >> Ignoring SO_RCVBUF (and net.core.rmem_* sysctls) is both unexpected and
> >> dangerous (a single application may consume unbounded amount of system
> >> memory). Also what about the TX side? I don't see any limit at all there.
> >
> > An application cannot consume unbounded system memory on the RX side
> > (in fact it consumes almost none). When packets arrive, their data is
> > immediately transferred to a buffer region in user memory provided by
> > the application (using the facilities in homa_pool.c). Skb's are
> > occupied only long enough to make this transfer, and it happens even
> > if there is no pending recv* kernel call. The size of the buffer
> > region is limited by the application, and the application must provide
> > a region via SO_HOMA_RCVBUF.
>
> I don't see where/how the SO_HOMA_RCVBUF max value is somehow bounded?!?
> It looks like user space could pick an arbitrarily large value for it.
That's right; is there anything to be gained by limiting it? This is
simply mmapped memory in the user address space. Aren't applications
allowed to allocate as much memory as they like? If so, why shouldn't
they be able to use that memory for incoming buffers if they choose?
> > Given this, there's no need for SO_RCVBUF
> > (and I don't see why a different limit would be specified via
> > SO_RCVBUF than the one already provided via SO_HOMA_RCVBUF).
> > I agree that this is different from TCP, but Homa is different from TCP in
> > lots of ways.
> >
> > There is currently no accounting or control on the TX side. I agree
> > that this needs to be implemented at some point, but if possible I'd
> > prefer to defer this until more of Homa has been upstreamed. For
> > example, this current patch doesn't include any sysctl support, which
> > would be needed as part of accounting/control (the support is part of
> > the GitHub repo, it's just not in this patch series).
>
> SO_RCVBUF and SO_SNDBUF are expected to apply to any kind of socket,
> see man 7 sockets. Exceptions should be at least documented, but we need
> some way to limit memory usage in both directions.
The expectations around these limits are based on an unstated (and
probably unconscious) assumption of a TCP-like streaming protocol.
RPCs are different. For example, there is no one value of rmem_default
or rmem_max that will work for both TCP and Homa. On my system, these
values are both around 200 KB, which seems fine for TCP, but that's
not even enough for a single full-size RPC in Homa, and Homa apps need
to have several active RPCs at a time. Thus it doesn't make sense to
use SO_RCVBUF and SO_SNDBUF for both Homa and TCP; their needs are too
different.
> Fine tuning controls and sysctls could land later, but the basic
> constraints should IMHO be there from the beginning.
OK. I think that SO_HOMA_RCVBUF takes care of RX buffer space. For TX,
what's the simplest scheme that you would be comfortable with? For
example, if I cap the number of outstanding RPCs per socket, will that
be enough for now?
> Side note: if you use per-RPC locks, and you know that the later one is a
> _different_ RPC, there will be no need for unlocking (and LOCKDEP will
> be happy with a "_nested" annotation).
This risks deadlock if some other thread decides to do things in the
other order.
-John-
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-02-03 23:33 ` John Ousterhout
@ 2025-02-04 8:50 ` Paolo Abeni
2025-02-04 16:30 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Paolo Abeni @ 2025-02-04 8:50 UTC (permalink / raw)
To: John Ousterhout; +Cc: Netdev, Eric Dumazet, Simon Horman, Jakub Kicinski
On 2/4/25 12:33 AM, John Ousterhout wrote:
> On Mon, Feb 3, 2025 at 1:12 AM Paolo Abeni <pabeni@redhat.com> wrote:
>> I don't see where/how the SO_HOMA_RCVBUF max value is somehow bounded?!?
>> It looks like user space could pick an arbitrarily large value for it.
>
> That's right; is there anything to be gained by limiting it? This is
> simply mmapped memory in the user address space. Aren't applications
> allowed to allocate as much memory as they like? If so, why shouldn't
> they be able to use that memory for incoming buffers if they choose?
If unprivileged applications could use an unlimited amount of kernel
memory, they could hurt whole-system stability, possibly causing
functional issues in the core kernel due to ENOMEM.
Thus we always try to bound/put limits on the amount of kernel memory
a user-space application can use.
>> SO_RCVBUF and SO_SNDBUF are expected to apply to any kind of socket,
>> see man 7 sockets. Exceptions should be at least documented, but we need
>> some way to limit memory usage in both directions.
>
> The expectations around these limits are based on an unstated (and
> probably unconscious) assumption of a TCP-like streaming protocol.
Actually TCP can use its own, separate limits, see:
net.ipv4.tcp_rmem, net.ipv4.tcp_wmem:
https://elixir.bootlin.com/linux/v6.13.1/source/Documentation/networking/ip-sysctl.rst#L719
> RPCs are different. For example, there is no one value of rmem_default
> or rmem_max that will work for both TCP and Homa. On my system, these
> values are both around 200 KB, which seems fine for TCP, but that's
> not even enough for a single full-size RPC in Homa, and Homa apps need
> to have several active RPCs at a time. Thus it doesn't make sense to
> use SO_RCVBUF and SO_SNDBUF for both Homa and TCP; their needs are too
> different.
Specific, per-protocol limits are allowed, but they should be there and
documented.
>> Fine tuning controls and sysctls could land later, but the basic
>> constraints should IMHO be there from the beginning.
>
> OK. I think that SO_HOMA_RCVBUF takes care of RX buffer space.
We need some way to allow the admin to bound the SO_HOMA_RCVBUF max value.
> For TX, what's the simplest scheme that you would be comfortable with? For
> example, if I cap the number of outstanding RPCs per socket, will that
> be enough for now?
Usually the bounds are expressed in bytes. How complex would it be to add
wmem accounting?
/P
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-02-04 8:50 ` Paolo Abeni
@ 2025-02-04 16:30 ` John Ousterhout
2025-02-04 19:41 ` Andrew Lunn
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-02-04 16:30 UTC (permalink / raw)
To: Paolo Abeni; +Cc: Netdev, Eric Dumazet, Simon Horman, Jakub Kicinski
On Tue, Feb 4, 2025 at 12:50 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 2/4/25 12:33 AM, John Ousterhout wrote:
> > On Mon, Feb 3, 2025 at 1:12 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >> I don't see where/how the SO_HOMA_RCVBUF max value is somehow bounded?!?
> >> It looks like user space could pick an arbitrarily large value for it.
> >
> > That's right; is there anything to be gained by limiting it? This is
> > simply mmapped memory in the user address space. Aren't applications
> > allowed to allocate as much memory as they like? If so, why shouldn't
> > they be able to use that memory for incoming buffers if they choose?
>
> If unprivileged applications could use an unlimited amount of kernel
> memory, they could hurt whole-system stability, possibly causing
> functional issues in the core kernel due to ENOMEM.
>
> Thus we always try to bound/put limits on the amount of kernel memory
> a user-space application can use.
Homa's receive buffer space is *not kernel memory*; it's just a large
mmapped region created by the application, no different from an
application allocating a large region of memory for its internal
computation.
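A sketch of the application-side setup being described: the region is
ordinary user memory from mmap(), handed to Homa with setsockopt(). Only
SO_HOMA_RCVBUF itself is taken from this discussion; the struct name, its
field names, and the IPPROTO_HOMA option level are assumptions standing in
for Homa's user-visible header:

#include <stdint.h>
#include <sys/mman.h>
#include <sys/socket.h>

struct homa_rcvbuf_args {		/* assumed layout, for illustration */
	uint64_t start;
	uint64_t length;
};

static int give_homa_buffers(int fd)
{
	size_t len = 64UL << 20;	/* 64 MB, chosen by the application */
	void *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct homa_rcvbuf_args args;

	if (region == MAP_FAILED)
		return -1;
	args.start = (uint64_t)(uintptr_t)region;
	args.length = len;
	return setsockopt(fd, IPPROTO_HOMA, SO_HOMA_RCVBUF, &args, sizeof(args));
}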
> >> Fine tuning controls and sysctls could land later, but the basic
> >> constraints should IMHO be there from the beginning.
> >
> > OK. I think that SO_HOMA_RCVBUF takes care of RX buffer space.
>
> We need some way to allow the admin to bound the SO_HOMA_RCVBUF max value.
Even if this memory is entirely user memory (we seem to be
miscommunicating over this)?
> > For TX, what's the simplest scheme that you would be comfortable with? For
> > example, if I cap the number of outstanding RPCs per socket, will that
> > be enough for now?
>
> Usually the bounds are expressed in bytes. How complex would it be to add
> wmem accounting?
I'll see what I can do.
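One minimal form the accounting could take, sketched here with the standard
sk_wmem_alloc mechanism rather than anything Homa-specific; homa_tx_charge()
is a hypothetical helper name:

#include <linux/errno.h>
#include <net/sock.h>

static int homa_tx_charge(struct sock *sk, struct sk_buff *skb)
{
	if (sk_wmem_alloc_get(sk) + skb->truesize > READ_ONCE(sk->sk_sndbuf))
		return -ENOBUFS;	/* or block until space frees up */

	/* Charges skb->truesize to sk->sk_wmem_alloc; the matching
	 * sock_wfree() runs when the skb is freed after transmission.
	 */
	skb_set_owner_w(skb, sk);
	return 0;
}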
-John-
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-02-04 16:30 ` John Ousterhout
@ 2025-02-04 19:41 ` Andrew Lunn
2025-02-04 21:20 ` John Ousterhout
0 siblings, 1 reply; 68+ messages in thread
From: Andrew Lunn @ 2025-02-04 19:41 UTC (permalink / raw)
To: John Ousterhout
Cc: Paolo Abeni, Netdev, Eric Dumazet, Simon Horman, Jakub Kicinski
> > If unprivileged applications could use an unlimited amount of kernel
> > memory, they could hurt whole-system stability, possibly causing
> > functional issues in the core kernel due to ENOMEM.
> >
> > Thus we always try to bound/put limits on the amount of kernel memory
> > a user-space application can use.
>
> Homa's receive buffer space is *not kernel memory*; it's just a large
> mmapped region created by the application, no different from an
> application allocating a large region of memory for its internal
> computation.
ulimit -v should be able to limit this, if user space is doing the
mmap(). It should be easy to test. Set a low enough limit and the mmap()
should fail, and I guess you get MAP_FAILED and errno = ENOMEM?
Andrew
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-02-04 19:41 ` Andrew Lunn
@ 2025-02-04 21:20 ` John Ousterhout
0 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-02-04 21:20 UTC (permalink / raw)
To: Andrew Lunn
Cc: Paolo Abeni, Netdev, Eric Dumazet, Simon Horman, Jakub Kicinski
On Tue, Feb 4, 2025 at 11:41 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > If unprivileged applications could use an unlimited amount of kernel
> > > memory, they could hurt whole-system stability, possibly causing
> > > functional issues in the core kernel due to ENOMEM.
> > >
> > > Thus we always try to bound/put limits on the amount of kernel memory
> > > a user-space application can use.
> >
> > Homa's receive buffer space is *not kernel memory*; it's just a large
> > mmapped region created by the application, no different from an
> > application allocating a large region of memory for its internal
> > computation.
>
> ulimit -v should be able to limit this, if user space is doing the
> mmap(). It should be easy to test. Set a low enough limit and the mmap()
> should fail, and I guess you get MAP_FAILED and errno = ENOMEM?
I just tried this, and yes, if ulimit -v is set low enough, user apps
can't mmap buffer space to pass to Homa.
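A small user-space sketch of the test described: with RLIMIT_AS (what
"ulimit -v" sets) low enough, the application's own mmap() for the buffer
region fails with ENOMEM before Homa is ever involved:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl = { .rlim_cur = 64UL << 20, .rlim_max = 64UL << 20 };
	void *p;

	setrlimit(RLIMIT_AS, &rl);		/* ~64 MB of address space */
	p = mmap(NULL, 256UL << 20, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		perror("mmap");			/* expected: Cannot allocate memory */
	return 0;
}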
-John-
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-02-03 17:58 ` Andrew Lunn
@ 2025-02-05 23:56 ` John Ousterhout
2025-02-06 1:49 ` Andrew Lunn
0 siblings, 1 reply; 68+ messages in thread
From: John Ousterhout @ 2025-02-05 23:56 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
On Mon, Feb 3, 2025 at 9:58 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > > If that happens then it could grab the lock instead of the desired
> > > > application, which would defeat the performance optimization and delay the
> > > > application a bit. This would be no worse than if the APP_NEEDS_LOCK
> > > > mechanism were not present.
> > >
> > > Then I suggest using plain unlock/lock() with no additional spinning in
> > > between.
> >
> > My concern here is that the unlock/lock sequence will happen so fast
> > that the other thread never actually has a chance to get the lock. I
> > will do some measurements to see what actually happens; if lock
> > ownership is successfully transferred in the common case without a
> > spin, then I'll remove it.
>
> https://docs.kernel.org/locking/mutex-design.html
>
> If there is a thread waiting for the lock, it will spin for a while
> trying to acquire it. The document also mentions that when there are
> multiple waiters, the algorithm tries to be fair. So if there is a
> fast unlock/lock, it should act fairly with the other waiter.
The link above refers to mutexes, whereas the code in question uses spinlocks.
I spent some time today doing measurements, and here's what I found.
* Without the call to homa_spin the handoff fails 20-25% of the time
(i.e., the releasing thread reacquires the lock before the "needy"
thread can get it).
* With the call to homa_spin the handoff fails 0.3-1% of the time.
This happens because of delays in the needy thread, typically an
interrupt that keeps it from retrying the lock quickly. This surprised
me as I thought that interrupts were disabled by spinlocks, but I
definitely see the interrupts happening; maybe only *some* interrupts
(softirqs?) are disabled by spinlocks?
* I tried varying the length of the spin to see how that affects the
handoff failure rate. In case you're curious:
    spin length   handoff failure rate
    200 ns        0.3-1.0%
    100 ns        0.4-1.0%
     50 ns        0.4-1.6%
     20 ns        1.3-3.9%
     10 ns        3.3-6.4%
* Note: the call to homa_spin is "free" in cases where the lock is
successfully handed off, since the thread that calls homa_spin will
attempt to reacquire the spinlock, and the lock won't become free
again until well after homa_spin has returned (without the call to
homa_spin the thread just spends more time spinning for the lock). It
only adds overhead in the (rare) case of a handoff failure.
* Interestingly, the lock transfer seems to happen a bit faster with
the homa_spin call than without it. I measured transfer times (time
from when one thread releases the lock until the other thread acquires
it) of 205-225 ns with the call to homa_spin, and 220-250 ns without
the call to homa_spin. This improvement in the common case where the
transfer succeeds more than compensates for the 100ns of wasted time
when the transfer fails.
Based on all of this, I'm going to keep the call to homa_spin but
reduce the spin time to 100ns (I want to leave some leeway in case
there is variation between architectures in how long it takes the
needy thread to grab the lock). I have fleshed out the comment next to
the code to provide more information about the benefits and to make it
clear that the benefits have been measured, not just hypothesized.
-John-
* Re: [PATCH net-next v6 08/12] net: homa: create homa_incoming.c
2025-02-05 23:56 ` John Ousterhout
@ 2025-02-06 1:49 ` Andrew Lunn
0 siblings, 0 replies; 68+ messages in thread
From: Andrew Lunn @ 2025-02-06 1:49 UTC (permalink / raw)
To: John Ousterhout; +Cc: Paolo Abeni, netdev, edumazet, horms, kuba
On Wed, Feb 05, 2025 at 03:56:36PM -0800, John Ousterhout wrote:
> On Mon, Feb 3, 2025 at 9:58 AM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > > > > If that happens then it could grab the lock instead of the desired
> > > > > application, which would defeat the performance optimization and delay the
> > > > > application a bit. This would be no worse than if the APP_NEEDS_LOCK
> > > > > mechanism were not present.
> > > >
> > > > Then I suggest using plain unlock/lock() with no additional spinning in
> > > > between.
> > >
> > > My concern here is that the unlock/lock sequence will happen so fast
> > > that the other thread never actually has a chance to get the lock. I
> > > will do some measurements to see what actually happens; if lock
> > > ownership is successfully transferred in the common case without a
> > > spin, then I'll remove it.
> >
> > https://docs.kernel.org/locking/mutex-design.html
> >
> > If there is a thread waiting for the lock, it will spin for a while
> > trying to acquire it. The document also mentions that when there are
> > multiple waiters, the algorithm tries to be fair. So if there is a
> > fast unlock/lock, it should act fairly with the other waiter.
>
> The link above refers to mutexes, whereas the code in question uses spinlocks.
Ah, sorry, could not see that from the context in the email.
> * With the call to homa_spin the handoff fails 0.3-1% of the time.
> This happens because of delays in the needy thread, typically an
> interrupt that keeps it from retrying the lock quickly. This surprised
> me as I thought that interrupts were disabled by spinlocks, but I
> definitely see the interrupts happening; maybe only *some* interrupts
> (softirqs?) are disabled by spinlocks?
By default, spinlocks don't disable interrupts, which is why you
cannot use them in interrupt handlers. There is, however,
spin_lock_irqsave(lock, flags);
spin_unlock_irqrestore(lock, flags);
which saves the current interrupt state into flags, and disables
interrupts if needed. The state is then restored when you unlock.
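To the softirq question above: plain spin_lock() masks neither hard irqs nor
softirqs on the local CPU, which is why interrupts can still land while the
lock is held; a lock that is also taken from softirq context needs the _bh
flavour in process context. Sketch only (example_lock is a placeholder):

	spin_lock_bh(&example_lock);		/* softirqs masked on this CPU */
	/* ... state shared with softirq handlers ... */
	spin_unlock_bh(&example_lock);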
Also, when PREEMPT_RT is enabled for real time support, spinlocks get
turned into mutexes.
Andrew
* Re: [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol
2025-01-24 8:55 ` [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol Paolo Abeni
@ 2025-02-10 19:19 ` John Ousterhout
0 siblings, 0 replies; 68+ messages in thread
From: John Ousterhout @ 2025-02-10 19:19 UTC (permalink / raw)
To: Paolo Abeni; +Cc: netdev, edumazet, horms, kuba
On Fri, Jan 24, 2025 at 12:55 AM Paolo Abeni <pabeni@redhat.com> wrote:
> I haven't completed reviewing the current iteration yet, but with the
> amount of code inspected at this point, the series looks quite far from
> a mergeable status.
>
> Before the next iteration, I strongly advise completely reviewing (and
> possibly rethinking) the locking schema, especially the RCU usage,
> implementing rcvbuf and sendbuf accounting (and possibly even memory
> accounting), reorganizing the code for better reviewability (the code
> in each patch should refer to/use only the code in the current and previous
> patches), making more use of the existing kernel API and constructs, and
> testing the code with all the kernel/configs/debug.config knobs enabled.
>
> Unless a patch is new or completely rewritten from scratch, it would be
> helpful to add a per-patch changelog after the SoB tag and a '---' separator.
I should be able to take care of all these things in the next revision.
-John-
Thread overview: 68+ messages
2025-01-15 18:59 [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 01/12] net: homa: define user-visible API for Homa John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 02/12] net: homa: create homa_wire.h John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 03/12] net: homa: create shared Homa header files John Ousterhout
2025-01-23 11:01 ` Paolo Abeni
2025-01-24 21:21 ` John Ousterhout
2025-01-27 9:05 ` Paolo Abeni
2025-01-27 17:04 ` John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 04/12] net: homa: create homa_pool.h and homa_pool.c John Ousterhout
2025-01-23 12:06 ` Paolo Abeni
2025-01-24 23:53 ` John Ousterhout
2025-01-25 0:46 ` Andrew Lunn
2025-01-26 5:33 ` John Ousterhout
2025-01-27 9:41 ` Paolo Abeni
2025-01-27 17:34 ` John Ousterhout
2025-01-27 18:28 ` Paolo Abeni
2025-01-27 19:12 ` John Ousterhout
2025-01-28 8:27 ` Paolo Abeni
2025-01-15 18:59 ` [PATCH net-next v6 05/12] net: homa: create homa_rpc.h and homa_rpc.c John Ousterhout
2025-01-23 14:29 ` Paolo Abeni
2025-01-27 5:22 ` John Ousterhout
2025-01-27 10:01 ` Paolo Abeni
2025-01-27 18:03 ` John Ousterhout
2025-01-28 8:19 ` Paolo Abeni
2025-01-29 1:23 ` John Ousterhout
[not found] ` <13345e2a-849d-4bd8-a95e-9cd7f287c7df@redhat.com>
2025-01-29 16:43 ` John Ousterhout
2025-01-29 16:49 ` Eric Dumazet
2025-01-29 16:54 ` John Ousterhout
2025-01-29 17:04 ` Eric Dumazet
2025-01-29 20:27 ` John Ousterhout
2025-01-29 20:40 ` Eric Dumazet
2025-01-29 21:08 ` John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 06/12] net: homa: create homa_peer.h and homa_peer.c John Ousterhout
2025-01-23 17:45 ` Paolo Abeni
2025-01-28 0:06 ` John Ousterhout
2025-01-28 0:32 ` Jason Xing
2025-01-15 18:59 ` [PATCH net-next v6 07/12] net: homa: create homa_sock.h and homa_sock.c John Ousterhout
2025-01-23 19:01 ` Paolo Abeni
2025-01-28 0:40 ` John Ousterhout
2025-01-28 4:26 ` John Ousterhout
2025-01-28 15:10 ` Eric Dumazet
2025-01-28 17:04 ` John Ousterhout
2025-01-24 7:33 ` Paolo Abeni
2025-01-15 18:59 ` [PATCH net-next v6 08/12] net: homa: create homa_incoming.c John Ousterhout
2025-01-24 8:31 ` Paolo Abeni
2025-01-30 0:41 ` John Ousterhout
[not found] ` <991b5ad9-57cf-4e1d-8e01-9d0639fa4e49@redhat.com>
2025-01-31 22:48 ` John Ousterhout
2025-02-03 9:12 ` Paolo Abeni
2025-02-03 23:33 ` John Ousterhout
2025-02-04 8:50 ` Paolo Abeni
2025-02-04 16:30 ` John Ousterhout
2025-02-04 19:41 ` Andrew Lunn
2025-02-04 21:20 ` John Ousterhout
2025-01-27 10:19 ` Paolo Abeni
2025-01-30 0:48 ` John Ousterhout
2025-01-30 9:57 ` Paolo Abeni
2025-01-31 22:51 ` John Ousterhout
[not found] ` <CAGXJAmxLqnjnWr8sjooJRRyQ2-5BqPCQL8gnn0gzYoZ0MMoBSw@mail.gmail.com>
2025-02-03 9:17 ` Paolo Abeni
2025-02-03 17:33 ` John Ousterhout
2025-02-03 17:58 ` Andrew Lunn
2025-02-05 23:56 ` John Ousterhout
2025-02-06 1:49 ` Andrew Lunn
2025-01-15 18:59 ` [PATCH net-next v6 09/12] net: homa: create homa_outgoing.c John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 10/12] net: homa: create homa_timer.c John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 11/12] net: homa: create homa_plumbing.c and homa_utils.c John Ousterhout
2025-01-15 18:59 ` [PATCH net-next v6 12/12] net: homa: create Makefile and Kconfig John Ousterhout
2025-01-24 8:55 ` [PATCH net-next v6 00/12] Begin upstreaming Homa transport protocol Paolo Abeni
2025-02-10 19:19 ` John Ousterhout