Date: Wed, 18 Jun 2025 11:44:46 -0700
From: Stanislav Fomichev
To: Jordan Rife
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, Daniel Borkmann,
	Martin KaFai Lau, Willem de Bruijn, Kuniyuki Iwashima,
	Alexei Starovoitov
Subject: Re: [RESEND PATCH v2 bpf-next 02/12] bpf: tcp: Make sure iter->batch always contains a full bucket snapshot
References: <20250618162545.15633-1-jordan@jrife.io>
	<20250618162545.15633-3-jordan@jrife.io>
In-Reply-To: <20250618162545.15633-3-jordan@jrife.io>

On 06/18, Jordan Rife wrote:
> Require that iter->batch always contains a full bucket snapshot. This
> invariant is important to avoid skipping or repeating sockets during
> iteration when combined with the next few patches. Before, there were
> two cases where a call to bpf_iter_tcp_batch may only capture part of a
> bucket:
>
> 1. When bpf_iter_tcp_realloc_batch() returns -ENOMEM.
> 2. When more sockets are added to the bucket while calling
>    bpf_iter_tcp_realloc_batch(), making the updated batch size
>    insufficient.
>
> In cases where the batch size only covers part of a bucket, it is
> possible to forget which sockets were already visited, especially if we
> have to process a bucket in more than two batches. This forces us to
> choose between repeating or skipping sockets, so don't allow this:
>
> 1. Stop iteration and propagate -ENOMEM up to userspace if reallocation
>    fails instead of continuing with a partial batch.
> 2. Try bpf_iter_tcp_realloc_batch() with GFP_USER just as before, but if
>    we still aren't able to capture the full bucket, call
>    bpf_iter_tcp_realloc_batch() again while holding the bucket lock to
>    guarantee the bucket does not change. On the second attempt use
>    GFP_NOWAIT since we hold onto the spin lock.
>
> I did some manual testing to exercise the code paths where GFP_NOWAIT is
> used and where ERR_PTR(err) is returned. I used the realloc test cases
> included later in this series to trigger a scenario where a realloc
> happens inside bpf_iter_tcp_batch and made a small code tweak to force
> the first realloc attempt to allocate a too-small batch, thus requiring
> another attempt with GFP_NOWAIT. Some printks showed both reallocs with
> the tests passing:
>
> May 09 18:18:55 crow kernel: resize batch TCP_SEQ_STATE_LISTENING
> May 09 18:18:55 crow kernel: again GFP_USER
> May 09 18:18:55 crow kernel: resize batch TCP_SEQ_STATE_LISTENING
> May 09 18:18:55 crow kernel: again GFP_NOWAIT
> May 09 18:18:57 crow kernel: resize batch TCP_SEQ_STATE_ESTABLISHED
> May 09 18:18:57 crow kernel: again GFP_USER
> May 09 18:18:57 crow kernel: resize batch TCP_SEQ_STATE_ESTABLISHED
> May 09 18:18:57 crow kernel: again GFP_NOWAIT
>
> With this setup, I also forced each of the bpf_iter_tcp_realloc_batch
> calls to return -ENOMEM to ensure that iteration ends and that the
> read() in userspace fails.
>
> Signed-off-by: Jordan Rife
> Reviewed-by: Kuniyuki Iwashima
> ---
>  net/ipv4/tcp_ipv4.c | 96 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 68 insertions(+), 28 deletions(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 2e40af6aff37..69c976a07434 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3057,7 +3057,10 @@ static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter,
>  	if (!new_batch)
>  		return -ENOMEM;
>
> -	bpf_iter_tcp_put_batch(iter);
> +	if (flags != GFP_NOWAIT)
> +		bpf_iter_tcp_put_batch(iter);
> +
> +	memcpy(new_batch, iter->batch, sizeof(*iter->batch) * iter->end_sk);
>  	kvfree(iter->batch);
>  	iter->batch = new_batch;
>  	iter->max_sk = new_batch_sz;
> @@ -3066,69 +3069,85 @@
>  }
>
>  static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq,
> -						 struct sock *start_sk)
> +						 struct sock **start_sk)
>  {
> -	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row.hashinfo;
>  	struct bpf_tcp_iter_state *iter = seq->private;
> -	struct tcp_iter_state *st = &iter->state;
>  	struct hlist_nulls_node *node;
>  	unsigned int expected = 1;
>  	struct sock *sk;
>
> -	sock_hold(start_sk);
> -	iter->batch[iter->end_sk++] = start_sk;
> +	sock_hold(*start_sk);
> +	iter->batch[iter->end_sk++] = *start_sk;
>
> -	sk = sk_nulls_next(start_sk);
> +	sk = sk_nulls_next(*start_sk);
> +	*start_sk = NULL;
>  	sk_nulls_for_each_from(sk, node) {
>  		if (seq_sk_match(seq, sk)) {
>  			if (iter->end_sk < iter->max_sk) {
>  				sock_hold(sk);
>  				iter->batch[iter->end_sk++] = sk;
> +			} else if (!*start_sk) {
> +				/* Remember where we left off. */
> +				*start_sk = sk;
>  			}
>  			expected++;
>  		}
>  	}
> -	spin_unlock(&hinfo->lhash2[st->bucket].lock);
>
>  	return expected;
>  }
>
>  static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq,
> -						   struct sock *start_sk)
> +						   struct sock **start_sk)
>  {
> -	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row.hashinfo;
>  	struct bpf_tcp_iter_state *iter = seq->private;
> -	struct tcp_iter_state *st = &iter->state;
>  	struct hlist_nulls_node *node;
>  	unsigned int expected = 1;
>  	struct sock *sk;
>
> -	sock_hold(start_sk);
> -	iter->batch[iter->end_sk++] = start_sk;
> +	sock_hold(*start_sk);
> +	iter->batch[iter->end_sk++] = *start_sk;
>
> -	sk = sk_nulls_next(start_sk);
> +	sk = sk_nulls_next(*start_sk);
> +	*start_sk = NULL;
>  	sk_nulls_for_each_from(sk, node) {
>  		if (seq_sk_match(seq, sk)) {
>  			if (iter->end_sk < iter->max_sk) {
>  				sock_hold(sk);
>  				iter->batch[iter->end_sk++] = sk;
> +			} else if (!*start_sk) {
> +				/* Remember where we left off. */
> +				*start_sk = sk;
>  			}
>  			expected++;
>  		}
>  	}
> -	spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket));
>
>  	return expected;
>  }
>
> +static void bpf_iter_tcp_unlock_bucket(struct seq_file *seq)
> +{
> +	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row.hashinfo;
> +	struct bpf_tcp_iter_state *iter = seq->private;
> +	struct tcp_iter_state *st = &iter->state;
> +
> +	if (st->state == TCP_SEQ_STATE_LISTENING)
> +		spin_unlock(&hinfo->lhash2[st->bucket].lock);
> +	else
> +		spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket));
> +}
> +
>  static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
>  {
>  	struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row.hashinfo;
>  	struct bpf_tcp_iter_state *iter = seq->private;
>  	struct tcp_iter_state *st = &iter->state;
> +	int prev_bucket, prev_state;
>  	unsigned int expected;
> -	bool resized = false;
> +	int resizes = 0;
>  	struct sock *sk;
> +	int err;
>
>  	/* The st->bucket is done. Directly advance to the next
>  	 * bucket instead of having the tcp_seek_last_pos() to skip
> @@ -3149,29 +3168,50 @@ static struct sock *bpf_iter_tcp_batch(struct seq_file *seq)
>  	/* Get a new batch */
>  	iter->cur_sk = 0;
>  	iter->end_sk = 0;
> -	iter->st_bucket_done = false;
> +	iter->st_bucket_done = true;
>
> +	prev_bucket = st->bucket;
> +	prev_state = st->state;
>  	sk = tcp_seek_last_pos(seq);
>  	if (!sk)
>  		return NULL; /* Done */
> +	if (st->bucket != prev_bucket || st->state != prev_state)
> +		resizes = 0;
> +	expected = 0;
>
> +fill_batch:
>  	if (st->state == TCP_SEQ_STATE_LISTENING)
> -		expected = bpf_iter_tcp_listening_batch(seq, sk);
> +		expected += bpf_iter_tcp_listening_batch(seq, &sk);
>  	else
> -		expected = bpf_iter_tcp_established_batch(seq, sk);
> +		expected += bpf_iter_tcp_established_batch(seq, &sk);
>
> -	if (iter->end_sk == expected) {
> -		iter->st_bucket_done = true;
> -		return sk;
> -	}

[..]

> +	if (unlikely(resizes <= 1 && iter->end_sk != expected)) {
> +		resizes++;
> +
> +		if (resizes == 1) {
> +			bpf_iter_tcp_unlock_bucket(seq);
>
> -	if (!resized && !bpf_iter_tcp_realloc_batch(iter, expected * 3 / 2,
> -						    GFP_USER)) {
> -		resized = true;
> -		goto again;
> +			err = bpf_iter_tcp_realloc_batch(iter, expected * 3 / 2,
> +							 GFP_USER);
> +			if (err)
> +				return ERR_PTR(err);
> +			goto again;
> +		}
> +
> +		err = bpf_iter_tcp_realloc_batch(iter, expected, GFP_NOWAIT);
> +		if (err) {
> +			bpf_iter_tcp_unlock_bucket(seq);
> +			return ERR_PTR(err);
> +		}
> +
> +		expected = iter->end_sk;
> +		goto fill_batch;

Can we try to unroll this? Add new helpers to hide the repeating parts
and store extra state in iter if needed. AFAIU, we want the following:

1. find sk, try to fill the batch, if it fits -> bail out
2. try to allocate a new batch with GFP_USER, try to fill again -> bail out
3. otherwise, attempt GFP_NOWAIT and do that dance where you copy over
   the previous partial batch

The conditional bpf_iter_tcp_put_batch() call does not look nice :-(
Same for the unconditional memcpy (which, if I understand correctly, is
only needed for the GFP_NOWAIT case).

I'm 99% sure your current version works, but it's a bit hard to
follow :-( Untested code to illustrate the idea below. Any reason it
won't work?
/* fast path */
sk = tcp_seek_last_pos(seq);
if (!sk)
	return NULL;

fits = bpf_iter_tcp_fill_batch(...);
bpf_iter_tcp_unlock_bucket(iter);
if (fits)
	return sk;

/* not enough space to store full batch, try to reallocate with GFP_USER */
bpf_iter_tcp_free_batch(iter);
if (bpf_iter_tcp_alloc_batch(iter, GFP_USER)) {
	/* allocated 'expected' size, try to fill again */
	sk = tcp_seek_last_pos(seq);
	if (!sk)
		return NULL;

	fits = bpf_iter_tcp_fill_batch(...);
	if (fits) {
		bpf_iter_tcp_unlock_bucket(iter);
		return sk;
	}
}

/* the bucket is still locked here, sk points to the correct one,
 * we have a partial result in iter->batch
 */
old_batch = iter->batch;
if (!bpf_iter_tcp_alloc_batch(iter, GFP_NOWAIT)) {
	/* no luck, bail out */
	bpf_iter_tcp_unlock_bucket(iter);
	bpf_iter_tcp_free_batch(iter); /* or put? */
	return ERR_PTR(-ENOMEM);
}

if (old_batch) {
	/* copy partial result from the previous run if needed? */
	memcpy(iter->batch, old_batch, ...);
	kvfree(old_batch);
}

/* TODO: somehow fill the remainder */
bpf_iter_tcp_unlock_bucket(iter);
return ..;

....

bool bpf_iter_tcp_fill_batch(...)
{
	if (st->state == TCP_SEQ_STATE_LISTENING)
		expected = bpf_iter_tcp_listening_batch(seq, sk);
	else
		expected = bpf_iter_tcp_established_batch(seq, sk);

	/* TODO: store expected into the iter for future resizing */
	/* TODO: make bpf_iter_tcp_xxx_batch store start_sk in iter */

	if (iter->end_sk == expected) {
		iter->st_bucket_done = true;
		return true;
	}

	return false;
}

void bpf_iter_tcp_free_batch(...)
{
	bpf_iter_tcp_put_batch(iter);
	kvfree(iter->batch);
	iter->batch = NULL;
}
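
For reference, a minimal standalone sketch (plain userspace C with
simplified, made-up types and a hypothetical grow_batch() name, not the
kernel helper) of the one property both versions depend on: when the
batch is grown under the bucket lock, the entries already collected
must be carried over into the new array so no held reference is lost.

/* Standalone illustration only; the real code is
 * bpf_iter_tcp_realloc_batch() with kvmalloc()/kvfree() and GFP flags.
 */
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct batch {
	void **sks;		/* pointers to held sockets */
	unsigned int end_sk;	/* entries collected so far */
	unsigned int max_sk;	/* current capacity */
};

static int grow_batch(struct batch *b, unsigned int new_max)
{
	void **new_sks;

	if (new_max < b->end_sk)
		return -EINVAL;

	new_sks = calloc(new_max, sizeof(*new_sks));
	if (!new_sks)
		return -ENOMEM;	/* caller bails out, old batch still intact */

	/* Preserve the partial result instead of dropping it, mirroring
	 * the memcpy added for the GFP_NOWAIT retry under the lock.
	 */
	memcpy(new_sks, b->sks, sizeof(*b->sks) * b->end_sk);
	free(b->sks);
	b->sks = new_sks;
	b->max_sk = new_max;
	return 0;
}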