* [PATCH bpf 0/3] bpf, sockmap complete fixes for avail bytes @ 2023-09-20 23:27 John Fastabend 2023-09-20 23:27 ` [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq John Fastabend ` (2 more replies) 0 siblings, 3 replies; 9+ messages in thread From: John Fastabend @ 2023-09-20 23:27 UTC (permalink / raw) To: daniel, ast, andrii, jakub; +Cc: john.fastabend, bpf, netdev With e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq") we started fixing the available bytes accounting by moving copied_seq to where the user actually reads the bytes. However we missed handling MSG_PEEK correctly and we need to ensure that we don't kfree_skb() a skb off the receive_queue when the copied_seq number is not incremented by user reads for some time. John Fastabend (3): bpf: tcp_read_skb needs to pop skb regardless of seq bpf: sockmap, do not inc copied_seq when PEEK flag set bpf: sockmap, add tests for MSG_F_PEEK net/ipv4/tcp.c | 3 +- net/ipv4/tcp_bpf.c | 4 +- .../selftests/bpf/prog_tests/sockmap_basic.c | 52 +++++++++++++++++++ 3 files changed, 56 insertions(+), 3 deletions(-) -- 2.33.0 ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq 2023-09-20 23:27 [PATCH bpf 0/3] bpf, sockmap complete fixes for avail bytes John Fastabend @ 2023-09-20 23:27 ` John Fastabend 2023-09-21 21:08 ` Simon Horman 2023-09-23 14:37 ` kernel test robot 2023-09-20 23:27 ` [PATCH bpf 2/3] bpf: sockmap, do not inc copied_seq when PEEK flag set John Fastabend 2023-09-20 23:27 ` [PATCH bpf 3/3] bpf: sockmap, add tests for MSG_F_PEEK John Fastabend 2 siblings, 2 replies; 9+ messages in thread From: John Fastabend @ 2023-09-20 23:27 UTC (permalink / raw) To: daniel, ast, andrii, jakub; +Cc: john.fastabend, bpf, netdev Before fix e5c6de5fa0258 tcp_read_skb() would increment the tp->copied-seq value. This (as described in the commit) would cause an error for apps because once that is incremented the application might believe there is no data to be read. Then some apps would stall or abort believing no data is available. However, the fix is incomplete because it introduces another issue in the skb dequeue. The loop does tcp_recv_skb() in a while loop to consume as many skbs as possible. The problem is the call is, tcp_recv_skb(sk, seq, &offset) Where 'seq' is u32 seq = tp->copied_seq; Now we can hit a case where we've yet incremented copied_seq from BPF side, but then tcp_recv_skb() fails this test, if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) so that instead of returning the skb we call tcp_eat_recv_skb() which frees the skb. This is because the routine believes the SKB has been collapsed per comment, /* This looks weird, but this can happen if TCP collapsing * splitted a fat GRO packet, while we released socket lock * in skb_splice_bits() */ This can't happen here we've unlinked the full SKB and orphaned it. Anyways it would confuse any BPF programs if the data were suddenly moved underneath it. To fix this situation do simpler operation and just skb_peek() the data of the queue followed by the unlink. It shouldn't need to check this condition and tcp_read_skb() reads entire skbs so there is no need to handle the 'offset!=0' case as we would see in tcp_read_sock(). Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq") Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> --- net/ipv4/tcp.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 0c3040a63ebd..45e7f39e67bc 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1625,12 +1625,11 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) u32 seq = tp->copied_seq; struct sk_buff *skb; int copied = 0; - u32 offset; if (sk->sk_state == TCP_LISTEN) return -ENOTCONN; - while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) { + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { u8 tcp_flags; int used; -- 2.33.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq 2023-09-20 23:27 ` [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq John Fastabend @ 2023-09-21 21:08 ` Simon Horman 2023-09-21 21:23 ` John Fastabend 2023-09-23 14:37 ` kernel test robot 1 sibling, 1 reply; 9+ messages in thread From: Simon Horman @ 2023-09-21 21:08 UTC (permalink / raw) To: John Fastabend; +Cc: daniel, ast, andrii, jakub, bpf, netdev On Wed, Sep 20, 2023 at 04:27:04PM -0700, John Fastabend wrote: > Before fix e5c6de5fa0258 tcp_read_skb() would increment the tp->copied-seq > value. This (as described in the commit) would cause an error for apps > because once that is incremented the application might believe there is no > data to be read. Then some apps would stall or abort believing no data is > available. > > However, the fix is incomplete because it introduces another issue in > the skb dequeue. The loop does tcp_recv_skb() in a while loop to consume > as many skbs as possible. The problem is the call is, > > tcp_recv_skb(sk, seq, &offset) > > Where 'seq' is > > u32 seq = tp->copied_seq; > > Now we can hit a case where we've yet incremented copied_seq from BPF side, > but then tcp_recv_skb() fails this test, > > if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) > > so that instead of returning the skb we call tcp_eat_recv_skb() which frees > the skb. This is because the routine believes the SKB has been collapsed > per comment, > > /* This looks weird, but this can happen if TCP collapsing > * splitted a fat GRO packet, while we released socket lock > * in skb_splice_bits() > */ > > This can't happen here we've unlinked the full SKB and orphaned it. Anyways > it would confuse any BPF programs if the data were suddenly moved underneath > it. > > To fix this situation do simpler operation and just skb_peek() the data > of the queue followed by the unlink. It shouldn't need to check this > condition and tcp_read_skb() reads entire skbs so there is no need to > handle the 'offset!=0' case as we would see in tcp_read_sock(). > > Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq") > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") > Signed-off-by: John Fastabend <john.fastabend@gmail.com> > --- > net/ipv4/tcp.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 0c3040a63ebd..45e7f39e67bc 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -1625,12 +1625,11 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) > u32 seq = tp->copied_seq; Hi John, according to clang-16, with this change seq is now set but unused. I guess seq can simply be removed as part of this change. > struct sk_buff *skb; > int copied = 0; > - u32 offset; > > if (sk->sk_state == TCP_LISTEN) > return -ENOTCONN; > > - while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) { > + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { > u8 tcp_flags; > int used; > > -- > 2.33.0 > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq 2023-09-21 21:08 ` Simon Horman @ 2023-09-21 21:23 ` John Fastabend 0 siblings, 0 replies; 9+ messages in thread From: John Fastabend @ 2023-09-21 21:23 UTC (permalink / raw) To: Simon Horman, John Fastabend; +Cc: daniel, ast, andrii, jakub, bpf, netdev Simon Horman wrote: > On Wed, Sep 20, 2023 at 04:27:04PM -0700, John Fastabend wrote: > > Before fix e5c6de5fa0258 tcp_read_skb() would increment the tp->copied-seq > > value. This (as described in the commit) would cause an error for apps > > because once that is incremented the application might believe there is no > > data to be read. Then some apps would stall or abort believing no data is > > available. > > > > However, the fix is incomplete because it introduces another issue in > > the skb dequeue. The loop does tcp_recv_skb() in a while loop to consume > > as many skbs as possible. The problem is the call is, > > > > tcp_recv_skb(sk, seq, &offset) > > > > Where 'seq' is > > > > u32 seq = tp->copied_seq; > > > > Now we can hit a case where we've yet incremented copied_seq from BPF side, > > but then tcp_recv_skb() fails this test, > > > > if (offset < skb->len || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) > > > > so that instead of returning the skb we call tcp_eat_recv_skb() which frees > > the skb. This is because the routine believes the SKB has been collapsed > > per comment, > > > > /* This looks weird, but this can happen if TCP collapsing > > * splitted a fat GRO packet, while we released socket lock > > * in skb_splice_bits() > > */ > > > > This can't happen here we've unlinked the full SKB and orphaned it. Anyways > > it would confuse any BPF programs if the data were suddenly moved underneath > > it. > > > > To fix this situation do simpler operation and just skb_peek() the data > > of the queue followed by the unlink. It shouldn't need to check this > > condition and tcp_read_skb() reads entire skbs so there is no need to > > handle the 'offset!=0' case as we would see in tcp_read_sock(). > > > > Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq") > > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") > > Signed-off-by: John Fastabend <john.fastabend@gmail.com> > > --- > > net/ipv4/tcp.c | 3 +-- > > 1 file changed, 1 insertion(+), 2 deletions(-) > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index 0c3040a63ebd..45e7f39e67bc 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -1625,12 +1625,11 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) > > u32 seq = tp->copied_seq; > > Hi John, > > according to clang-16, with this change seq is now set but unused. > I guess seq can simply be removed as part of this change. Yeah I'll send a v2. Thanks. > > > struct sk_buff *skb; > > int copied = 0; > > - u32 offset; > > > > if (sk->sk_state == TCP_LISTEN) > > return -ENOTCONN; > > > > - while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) { > > + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { > > u8 tcp_flags; > > int used; > > > > -- > > 2.33.0 > > > > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq 2023-09-20 23:27 ` [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq John Fastabend 2023-09-21 21:08 ` Simon Horman @ 2023-09-23 14:37 ` kernel test robot 1 sibling, 0 replies; 9+ messages in thread From: kernel test robot @ 2023-09-23 14:37 UTC (permalink / raw) To: John Fastabend, daniel, ast, andrii, jakub Cc: llvm, oe-kbuild-all, john.fastabend, bpf, netdev Hi John, kernel test robot noticed the following build warnings: [auto build test WARNING on bpf/master] url: https://github.com/intel-lab-lkp/linux/commits/John-Fastabend/bpf-tcp_read_skb-needs-to-pop-skb-regardless-of-seq/20230921-073209 base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master patch link: https://lore.kernel.org/r/20230920232706.498747-2-john.fastabend%40gmail.com patch subject: [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq config: um-allnoconfig (https://download.01.org/0day-ci/archive/20230923/202309232236.36lvZlKR-lkp@intel.com/config) compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230923/202309232236.36lvZlKR-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202309232236.36lvZlKR-lkp@intel.com/ All warnings (new ones prefixed by >>): In file included from net/ipv4/tcp.c:252: In file included from include/linux/inet_diag.h:5: In file included from include/net/netlink.h:6: In file included from include/linux/netlink.h:7: In file included from include/linux/skbuff.h:17: In file included from include/linux/bvec.h:10: In file included from include/linux/highmem.h:12: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 547 | val = __raw_readb(PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 560 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr)); | ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu' 37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x)) | ^ In file included from net/ipv4/tcp.c:252: In file included from include/linux/inet_diag.h:5: In file included from include/net/netlink.h:6: In file included from include/linux/netlink.h:7: In file included from include/linux/skbuff.h:17: In file included from include/linux/bvec.h:10: In file included from include/linux/highmem.h:12: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 573 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr)); | ~~~~~~~~~~ ^ include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu' 35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x)) | ^ In file included from net/ipv4/tcp.c:252: In file included from include/linux/inet_diag.h:5: In file included from include/net/netlink.h:6: In file included from include/linux/netlink.h:7: In file included from include/linux/skbuff.h:17: In file included from include/linux/bvec.h:10: In file included from include/linux/highmem.h:12: In file included from include/linux/hardirq.h:11: In file included from arch/um/include/asm/hardirq.h:5: In file included from include/asm-generic/hardirq.h:17: In file included from include/linux/irq.h:20: In file included from include/linux/io.h:13: In file included from arch/um/include/asm/io.h:24: include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 584 | __raw_writeb(value, PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 594 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 604 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr); | ~~~~~~~~~~ ^ include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 692 | readsb(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:700:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 700 | readsw(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:708:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 708 | readsl(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:717:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 717 | writesb(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:726:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 726 | writesw(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ include/asm-generic/io.h:735:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic] 735 | writesl(PCI_IOBASE + addr, buffer, count); | ~~~~~~~~~~ ^ >> net/ipv4/tcp.c:1625:6: warning: variable 'seq' set but not used [-Wunused-but-set-variable] 1625 | u32 seq = tp->copied_seq; | ^ 13 warnings generated. vim +/seq +1625 net/ipv4/tcp.c ^1da177e4c3f41 Linus Torvalds 2005-04-16 1621 965b57b469a589 Cong Wang 2022-06-15 1622 int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor) 04919bed948dc2 Cong Wang 2022-06-15 1623 { 04919bed948dc2 Cong Wang 2022-06-15 1624 struct tcp_sock *tp = tcp_sk(sk); 04919bed948dc2 Cong Wang 2022-06-15 @1625 u32 seq = tp->copied_seq; 04919bed948dc2 Cong Wang 2022-06-15 1626 struct sk_buff *skb; 04919bed948dc2 Cong Wang 2022-06-15 1627 int copied = 0; 04919bed948dc2 Cong Wang 2022-06-15 1628 04919bed948dc2 Cong Wang 2022-06-15 1629 if (sk->sk_state == TCP_LISTEN) 04919bed948dc2 Cong Wang 2022-06-15 1630 return -ENOTCONN; 04919bed948dc2 Cong Wang 2022-06-15 1631 98edede7d6d1b6 John Fastabend 2023-09-20 1632 while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { db4192a754ebd5 Cong Wang 2022-09-12 1633 u8 tcp_flags; db4192a754ebd5 Cong Wang 2022-09-12 1634 int used; 04919bed948dc2 Cong Wang 2022-06-15 1635 04919bed948dc2 Cong Wang 2022-06-15 1636 __skb_unlink(skb, &sk->sk_receive_queue); 96628951869c0d Peilin Ye 2022-09-08 1637 WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk)); db4192a754ebd5 Cong Wang 2022-09-12 1638 tcp_flags = TCP_SKB_CB(skb)->tcp_flags; db4192a754ebd5 Cong Wang 2022-09-12 1639 used = recv_actor(sk, skb); db4192a754ebd5 Cong Wang 2022-09-12 1640 if (used < 0) { db4192a754ebd5 Cong Wang 2022-09-12 1641 if (!copied) db4192a754ebd5 Cong Wang 2022-09-12 1642 copied = used; db4192a754ebd5 Cong Wang 2022-09-12 1643 break; db4192a754ebd5 Cong Wang 2022-09-12 1644 } db4192a754ebd5 Cong Wang 2022-09-12 1645 seq += used; db4192a754ebd5 Cong Wang 2022-09-12 1646 copied += used; db4192a754ebd5 Cong Wang 2022-09-12 1647 db4192a754ebd5 Cong Wang 2022-09-12 1648 if (tcp_flags & TCPHDR_FIN) { 04919bed948dc2 Cong Wang 2022-06-15 1649 ++seq; db4192a754ebd5 Cong Wang 2022-09-12 1650 break; db4192a754ebd5 Cong Wang 2022-09-12 1651 } 04919bed948dc2 Cong Wang 2022-06-15 1652 } 04919bed948dc2 Cong Wang 2022-06-15 1653 return copied; 04919bed948dc2 Cong Wang 2022-06-15 1654 } 04919bed948dc2 Cong Wang 2022-06-15 1655 EXPORT_SYMBOL(tcp_read_skb); 04919bed948dc2 Cong Wang 2022-06-15 1656 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH bpf 2/3] bpf: sockmap, do not inc copied_seq when PEEK flag set 2023-09-20 23:27 [PATCH bpf 0/3] bpf, sockmap complete fixes for avail bytes John Fastabend 2023-09-20 23:27 ` [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq John Fastabend @ 2023-09-20 23:27 ` John Fastabend 2023-09-22 10:23 ` Jakub Sitnicki 2023-09-20 23:27 ` [PATCH bpf 3/3] bpf: sockmap, add tests for MSG_F_PEEK John Fastabend 2 siblings, 1 reply; 9+ messages in thread From: John Fastabend @ 2023-09-20 23:27 UTC (permalink / raw) To: daniel, ast, andrii, jakub; +Cc: john.fastabend, bpf, netdev When data is peek'd off the receive queue we shouldn't considered it copied from tcp_sock side. When we increment copied_seq this will confuse tcp_data_ready() because copied_seq can be arbitrarily increased. From] application side it results in poll() operations not waking up when expected. Notice tcp stack without BPF recvmsg programs also does not increment copied_seq. We broke this when we moved copied_seq into recvmsg to only update when actual copy was happening. But, it wasn't working correctly either before because the tcp_data_ready() tried to use the copied_seq value to see if data was read by user yet. See fixes tags. Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq") Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> --- net/ipv4/tcp_bpf.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 81f0dff69e0b..327268203001 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -222,6 +222,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, int *addr_len) { struct tcp_sock *tcp = tcp_sk(sk); + int peek = flags & MSG_PEEK; u32 seq = tcp->copied_seq; struct sk_psock *psock; int copied = 0; @@ -311,7 +312,8 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, copied = -EAGAIN; } out: - WRITE_ONCE(tcp->copied_seq, seq); + if (!peek) + WRITE_ONCE(tcp->copied_seq, seq); tcp_rcv_space_adjust(sk); if (copied > 0) __tcp_cleanup_rbuf(sk, copied); -- 2.33.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH bpf 2/3] bpf: sockmap, do not inc copied_seq when PEEK flag set 2023-09-20 23:27 ` [PATCH bpf 2/3] bpf: sockmap, do not inc copied_seq when PEEK flag set John Fastabend @ 2023-09-22 10:23 ` Jakub Sitnicki 0 siblings, 0 replies; 9+ messages in thread From: Jakub Sitnicki @ 2023-09-22 10:23 UTC (permalink / raw) To: John Fastabend; +Cc: daniel, ast, andrii, bpf, netdev On Wed, Sep 20, 2023 at 04:27 PM -07, John Fastabend wrote: > When data is peek'd off the receive queue we shouldn't considered it > copied from tcp_sock side. When we increment copied_seq this will confuse > tcp_data_ready() because copied_seq can be arbitrarily increased. From] > application side it results in poll() operations not waking up when > expected. > > Notice tcp stack without BPF recvmsg programs also does not increment > copied_seq. > > We broke this when we moved copied_seq into recvmsg to only update when > actual copy was happening. But, it wasn't working correctly either before > because the tcp_data_ready() tried to use the copied_seq value to see > if data was read by user yet. See fixes tags. > > Fixes: e5c6de5fa0258 ("bpf, sockmap: Incorrectly handling copied_seq") > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") > Signed-off-by: John Fastabend <john.fastabend@gmail.com> > --- > net/ipv4/tcp_bpf.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c > index 81f0dff69e0b..327268203001 100644 > --- a/net/ipv4/tcp_bpf.c > +++ b/net/ipv4/tcp_bpf.c > @@ -222,6 +222,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, > int *addr_len) > { > struct tcp_sock *tcp = tcp_sk(sk); > + int peek = flags & MSG_PEEK; > u32 seq = tcp->copied_seq; > struct sk_psock *psock; > int copied = 0; > @@ -311,7 +312,8 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk, > copied = -EAGAIN; > } > out: > - WRITE_ONCE(tcp->copied_seq, seq); > + if (!peek) > + WRITE_ONCE(tcp->copied_seq, seq); > tcp_rcv_space_adjust(sk); > if (copied > 0) > __tcp_cleanup_rbuf(sk, copied); I was surprised to see that we recalculate TCP buffer space and ACK frames when peeking at the receive queue. But tcp_recvmsg seems to do the same. Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH bpf 3/3] bpf: sockmap, add tests for MSG_F_PEEK 2023-09-20 23:27 [PATCH bpf 0/3] bpf, sockmap complete fixes for avail bytes John Fastabend 2023-09-20 23:27 ` [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq John Fastabend 2023-09-20 23:27 ` [PATCH bpf 2/3] bpf: sockmap, do not inc copied_seq when PEEK flag set John Fastabend @ 2023-09-20 23:27 ` John Fastabend 2023-09-22 11:06 ` Jakub Sitnicki 2 siblings, 1 reply; 9+ messages in thread From: John Fastabend @ 2023-09-20 23:27 UTC (permalink / raw) To: daniel, ast, andrii, jakub; +Cc: john.fastabend, bpf, netdev Test that we can read with MSG_F_PEEK and then still get correct number of available bytes through FIONREAD. The recv() (without PEEK) then returns the bytes as expected. The recv() always worked though because it was just the available byte reporting that was broke before latest fixes. Signed-off-by: John Fastabend <john.fastabend@gmail.com> --- .../selftests/bpf/prog_tests/sockmap_basic.c | 52 +++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c index 064cc5e8d9ad..e8eee2b8901e 100644 --- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c @@ -475,6 +475,56 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog) test_sockmap_drop_prog__destroy(drop); } + +static void test_sockmap_skb_verdict_peek(void) +{ + int err, map, verdict, s, c1, p1, zero = 0, sent, recvd, avail; + struct test_sockmap_pass_prog *pass; + char snd[256] = "0123456789"; + char rcv[256] = "0"; + + pass = test_sockmap_pass_prog__open_and_load(); + if (!ASSERT_OK_PTR(pass, "open_and_load")) + return; + verdict = bpf_program__fd(pass->progs.prog_skb_verdict); + map = bpf_map__fd(pass->maps.sock_map_rx); + + err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0); + if (!ASSERT_OK(err, "bpf_prog_attach")) + goto out; + + s = socket_loopback(AF_INET, SOCK_STREAM); + if (!ASSERT_GT(s, -1, "socket_loopback(s)")) + goto out; + + err = create_pair(s, AF_INET, SOCK_STREAM, &c1, &p1); + if (!ASSERT_OK(err, "create_pairs(s)")) + goto out; + + err = bpf_map_update_elem(map, &zero, &c1, BPF_NOEXIST); + if (!ASSERT_OK(err, "bpf_map_update_elem(c1)")) + goto out_close; + + sent = xsend(p1, snd, sizeof(snd), 0); + ASSERT_EQ(sent, sizeof(snd), "xsend(p1)"); + recvd = recv(c1, rcv, sizeof(rcv), MSG_PEEK); + ASSERT_EQ(recvd, sizeof(rcv), "recv(c1)"); + err = ioctl(c1, FIONREAD, &avail); + ASSERT_OK(err, "ioctl(FIONREAD) error"); + ASSERT_EQ(avail, sizeof(snd), "after peek ioctl(FIONREAD)"); + recvd = recv(c1, rcv, sizeof(rcv), 0); + ASSERT_EQ(recvd, sizeof(rcv), "recv(p0)"); + err = ioctl(c1, FIONREAD, &avail); + ASSERT_OK(err, "ioctl(FIONREAD) error"); + ASSERT_EQ(avail, 0, "after read ioctl(FIONREAD)"); + +out_close: + close(c1); + close(p1); +out: + test_sockmap_pass_prog__destroy(pass); +} + void test_sockmap_basic(void) { if (test__start_subtest("sockmap create_update_free")) @@ -515,4 +565,6 @@ void test_sockmap_basic(void) test_sockmap_skb_verdict_fionread(true); if (test__start_subtest("sockmap skb_verdict fionread on drop")) test_sockmap_skb_verdict_fionread(false); + if (test__start_subtest("sockmap skb_verdict msg_f_peek")) + test_sockmap_skb_verdict_peek(); } -- 2.33.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH bpf 3/3] bpf: sockmap, add tests for MSG_F_PEEK 2023-09-20 23:27 ` [PATCH bpf 3/3] bpf: sockmap, add tests for MSG_F_PEEK John Fastabend @ 2023-09-22 11:06 ` Jakub Sitnicki 0 siblings, 0 replies; 9+ messages in thread From: Jakub Sitnicki @ 2023-09-22 11:06 UTC (permalink / raw) To: John Fastabend; +Cc: daniel, ast, andrii, bpf, netdev On Wed, Sep 20, 2023 at 04:27 PM -07, John Fastabend wrote: > Test that we can read with MSG_F_PEEK and then still get correct number > of available bytes through FIONREAD. The recv() (without PEEK) then > returns the bytes as expected. The recv() always worked though because > it was just the available byte reporting that was broke before latest > fixes. > > Signed-off-by: John Fastabend <john.fastabend@gmail.com> > --- Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-09-23 14:38 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-09-20 23:27 [PATCH bpf 0/3] bpf, sockmap complete fixes for avail bytes John Fastabend 2023-09-20 23:27 ` [PATCH bpf 1/3] bpf: tcp_read_skb needs to pop skb regardless of seq John Fastabend 2023-09-21 21:08 ` Simon Horman 2023-09-21 21:23 ` John Fastabend 2023-09-23 14:37 ` kernel test robot 2023-09-20 23:27 ` [PATCH bpf 2/3] bpf: sockmap, do not inc copied_seq when PEEK flag set John Fastabend 2023-09-22 10:23 ` Jakub Sitnicki 2023-09-20 23:27 ` [PATCH bpf 3/3] bpf: sockmap, add tests for MSG_F_PEEK John Fastabend 2023-09-22 11:06 ` Jakub Sitnicki
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).