Message-ID: <840ddcb4-acaa-4ce4-ad56-e2d14b447907@linux.dev>
Date: Thu, 25 Apr 2024 21:58:50 -0700
X-Mailing-List: netdev@vger.kernel.org
Subject: Re: [RFC PATCH net-next 0/5] net: In-kernel QUIC implementation with Userspace handshake
To: Xin Long, Stefan Metzmacher
Cc: network dev, davem@davemloft.net, kuba@kernel.org, Eric Dumazet,
 Paolo Abeni, Steve French, Namjae Jeon, Chuck Lever III, Jeff Layton,
 Sabrina Dubroca, Tyler Fanelli, Pengtao He, "linux-cifs@vger.kernel.org",
 Samba Technical, bpf
References: <74d5db09-6b5c-4054-b9d3-542f34769083@samba.org>
 <438496a6-7f90-403d-9558-4a813e842540@samba.org>
 <1456b69c-4ffd-4a08-b120-6a00abf1eb05@samba.org>
 <95922a2f-07a1-4555-acd2-c745e59bcb8e@samba.org>
From: Martin KaFai Lau

On 4/22/24 1:58 PM, Xin Long wrote:
> On Sun, Apr 21, 2024 at 3:27 PM Stefan Metzmacher wrote:
>>
>> On 20.04.24 at 21:32, Xin Long wrote:
>>> On Fri, Apr 19, 2024 at 3:19 PM Xin Long wrote:
>>>>
>>>> On Fri, Apr 19, 2024 at 2:51 PM Stefan Metzmacher wrote:
>>>>>
>>>>> Hi Xin Long,
>>>>>
>>>>>>> But I think it's unavoidable for the ALPN and SNI fields on
>>>>>>> the server side. As every service tries to use udp port 443
>>>>>>> and somehow that needs to be shared if multiple services want to
>>>>>>> use it.
>>>>>>>
>>>>>>> I guess on the acceptor side we would need to somehow detach low level
>>>>>>> udp struct sock from the logical listen struct sock.
>>>>>>>
>>>>>>> And quic_do_listen_rcv() would need to find the correct logical listening
>>>>>>> socket and call quic_request_sock_enqueue() on the logical socket,
>>>>>>> not the low-level udp socket. The same for all stuff happening after
>>>>>>> quic_request_sock_enqueue() at the end of quic_do_listen_rcv().
>>>>>>>
>>>>>> The implementation allows one low level UDP sock to serve multiple
>>>>>> QUIC socks.
>>>>>>
>>>>>> Currently, if your 3 quic applications listen to the same address:port
>>>>>> with SO_REUSEPORT socket option set, the incoming connection will choose
>>>>>> one of your applications randomly with hash(client_addr+port) via
>>>>>> reuseport_select_sock() in quic_sock_lookup().
>>>>>>
>>>>>> It should be easy to do a further match with ALPN between these 3 quic
>>>>>> socks that listen to the same address:port to get the right quic sock,
>>>>>> instead of choosing randomly.
>>>>>
>>>>> Ah, that sounds good.
>>>>>
>>>>>> The problem is to parse the TLS Client_Hello message to get the ALPN in
>>>>>> quic_sock_lookup(), which is not a proper thing to do in kernel, and
>>>>>> might be rejected by networking maintainers. I need to check with them.
>>>>>
>>>>> Is the reassembling of CRYPTO frames done in the kernel or
>>>>> userspace? Can you point me to the place in the code?
>>>> In quic_inq_handshake_tail() in kernel. The Client Initial packet
>>>> is processed when calling accept(); this is the path:
>>>>
>>>> quic_accept() -> quic_accept_sock_init() -> quic_packet_process() ->
>>>> quic_packet_handshake_process() -> quic_frame_process() ->
>>>> quic_frame_crypto_process() -> quic_inq_handshake_tail().
>>>>
>>>> Note that it's with the accept sock, not the listen sock.
>>>>
>>>>>
>>>>> If it's really impossible to do in C code maybe
>>>>> registering a bpf function in order to allow a listener
>>>>> to check the initial quic packet and decide if it wants to serve
>>>>> that connection would be possible as last resort?
>>>> That's a smart idea, man.
>>>> I think the bpf hook in reuseport_select_sock() is meant to do such
>>>> selection.
>>>>
>>>> For the Client initial packet (the only packet you need to handle),
>>>> I doubt you will need to do the reassembling, as the Client Hello TLS message
>>>> is always less than 400 bytes in my env.
>>>>
>>>> But I think you need to do the decryption for the Client initial packet
>>>> before decoding it then parsing the TLS message from its crypto frame.
>>> I created this patch:
>>>
>>> https://github.com/lxin/quic/commit/aee0b7c77df3f39941f98bb901c73fdc560befb8
>>>
>>> to do this decryption in quic_sock_lookup() before calling
>>> reuseport_select_sock(), so that it provides the bpf selector with
>>> a plain-text QUIC initial packet:
>>>
>>> https://datatracker.ietf.org/doc/html/rfc9000#section-17.2.2
>>>
>>> If it's complex for you to do the decryption for the initial packet in
>>> the bpf selector, I will apply this patch. Please let me know.
>>
>> I guess in addition to quic_server_handshake(), which is called
>> after accept(), there should be quic_server_prepare_listen()
>> (and something similar for in kernel servers) that sets up the reuseport
>> magic for the socket, so that it's not needed in every application.
> It's done when calling listen(), see quic_inet_listen()->quic_hash()
> where only listening sockets with sk_reuseport set will be
> added into the reuseport group.
>
> It means SO_REUSEPORT sockopt must be set for every socket
> before calling listen().
>
>>
>> It seems there is only a single ebpf program possible per
>> reuseport group, so there has to be just a single one.
> Yes, a single ebpf program per reuseport group should work.
> see prepare_sk_fds() in kernel selftests for select_reuseport bpf.
>
>>
>> But is it possible for in kernel servers to also register an ebpf program?
> Good question. TBH, I don't really know much about ebpf programming.
> I guess the real problem is how you pass the .o file to kernel space?
>
> Another question is, in the selftests:
> tools/testing/selftests/bpf/prog_tests/s
> tools/testing/selftests/bpf/progs/test_select_reuseport_kern.c
>
> it created a global reuseport_array, and then added these sockets
> into this array for the later lookup, but these sockets are all created
> in the same process.
>
> But your case is that the sockets are created in different processes.
> I'm not sure if it's possible to add sockets from different processes
> into the same reuseport_array?
>
> Added Martin who introduced BPF_PROG_TYPE_SK_REUSEPORT,
> I guess he may know the answers.

I didn't read the patchset, so I don't know what is meant to be done.
From capturing the questions in this and the next email:

The reuseport_array is a bpf map. Like any bpf map, it can be shared across
different processes, meaning different processes can add sk to the map.

The bpf prog that selects a sk from the reuseport_array is set by
userspace through setsockopt(SO_ATTACH_REUSEPORT_EBPF). It is the only way
right now, iirc.

If you can summarize what you want to be done, it could help to see if there
are ways that work for the use case.

>
> Thanks.
>
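
To make the selection flow discussed in this thread concrete, here is a toy
userspace model in Python. It is only a sketch and not kernel or libbpf code:
ReuseportGroup, add_listener(), attach_selector() and select_by_alpn() are
hypothetical names, the dict stands in for a shared reuseport_array, crc32
stands in for whatever hash the kernel uses, and the selector assumes the
payload is already the plain ALPN string (no QUIC decryption or TLS parsing).
Falling back to the hash when the selector picks nothing is also a modelling
assumption.

```python
import zlib
from typing import Callable, Dict, List, Optional, Tuple

Addr = Tuple[str, int]
Selector = Callable[[Dict[str, str], bytes], Optional[str]]


class ReuseportGroup:
    """Toy model of listeners sharing one address:port (not a kernel API)."""

    def __init__(self) -> None:
        self.listeners: List[str] = []      # stand-ins for listening sockets
        self.sock_map: Dict[str, str] = {}  # stand-in for a shared reuseport_array
        self.selector: Optional[Selector] = None

    def add_listener(self, name: str, key: str) -> None:
        # Models any process adding its socket to the shared map under a key.
        self.listeners.append(name)
        self.sock_map[key] = name

    def attach_selector(self, prog: Selector) -> None:
        # Models a single program per group: attaching replaces the old one.
        self.selector = prog

    def select(self, client: Addr, payload: bytes) -> str:
        if self.selector is not None:
            chosen = self.selector(self.sock_map, payload)
            if chosen is not None:
                return chosen
        # No selector, or it picked nothing: hash the client address and port.
        h = zlib.crc32(f"{client[0]}:{client[1]}".encode())
        return self.listeners[h % len(self.listeners)]


def select_by_alpn(sock_map: Dict[str, str], payload: bytes) -> Optional[str]:
    # Hypothetical selector: look the ALPN string up in the shared map.
    return sock_map.get(payload.decode("ascii", errors="replace"))
```

In this model, two services would each call add_listener() with their own
ALPN key, one of them would call attach_selector(select_by_alpn), and
select() would then route by ALPN instead of by client hash.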