From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Subject: Re: [TEST] txtimestamp.sh pains after netdev foundation migration
Date: Sun, 11 Jan 2026 22:28:39 -0500
Message-ID: <willemdebruijn.kernel.311e0b9ad88f0@gmail.com>
In-Reply-To: <willemdebruijn.kernel.555dd45f2e96@gmail.com>
Willem de Bruijn wrote:
> Willem de Bruijn wrote:
> > Jakub Kicinski wrote:
> > > On Thu, 08 Jan 2026 14:02:15 -0500 Willem de Bruijn wrote:
> > > > Increasing tolerance should work.
> > > >
> > > > The current values are pragmatic choices to be so low as to minimize
> > > > total test runtime, but high enough to avoid flakes. Well..
> > > >
> > > > If increasing tolerance, we also need to increase the time the test
> > > > waits for all notifications to arrive, cfg_sleep_usec.
> > >
> > > To be clear the theory is that we got scheduled out between taking the
> > > USR timestamp and sending the packet. But once the packet is in the
> > > kernel it seems to flow, so AFAIU cfg_sleep_usec can remain untouched.
> > >
> > > Thinking about it more - maybe what blocks us is the print? Maybe under
> > > vng there's a non-trivial chance that a print to stderr ends up
> > > blocking on serial and schedules us out? I mean maybe we should:
> > >
> > > diff --git a/tools/testing/selftests/net/txtimestamp.c b/tools/testing/selftests/net/txtimestamp.c
> > > index abcec47ec2e6..e2273fdff495 100644
> > > --- a/tools/testing/selftests/net/txtimestamp.c
> > > +++ b/tools/testing/selftests/net/txtimestamp.c
> > > @@ -207,12 +207,10 @@ static void __print_timestamp(const char *name, struct timespec *cur,
> > > fprintf(stderr, "\n");
> > > }
> > >
> > > -static void print_timestamp_usr(void)
> > > +static void record_timestamp_usr(void)
> > > {
> > > if (clock_gettime(CLOCK_REALTIME, &ts_usr))
> > > error(1, errno, "clock_gettime");
> > > -
> > > - __print_timestamp(" USR", &ts_usr, 0, 0);
> > > }
> > >
> > > static void check_timestamp_usr(void)
> > > @@ -636,8 +634,6 @@ static void do_test(int family, unsigned int report_opt)
> > > fill_header_udp(buf + off, family == PF_INET);
> > > }
> > >
> > > - print_timestamp_usr();
> > > -
> > > iov.iov_base = buf;
> > > iov.iov_len = total_len;
> > >
> > > @@ -692,10 +688,14 @@ static void do_test(int family, unsigned int report_opt)
> > >
> > > }
> > >
> > > + record_timestamp_usr();
> > > val = sendmsg(fd, &msg, 0);
> > > if (val != total_len)
> > > error(1, errno, "send");
> > >
> > > + /* Avoid I/O between taking ts_usr and sendmsg() */
> > > + __print_timestamp(" USR", &ts_usr, 0, 0);
> > > +
> > > check_timestamp_usr();
> > >
> > > /* wait for all errors to be queued, else ACKs arrive OOO */
> >
> > Definitely worth including.
> >
> > Could it be helpful to schedule at RR or FIFO prio? That depends on the
> > reason for descheduling, and it only affects priority within the VM.
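
A rough sketch of what switching to FIFO scheduling could look like, assuming the test process is allowed to do so (this normally needs CAP_SYS_NICE). The helper name try_set_fifo() is hypothetical and not part of txtimestamp.c; it uses only the standard sched_get_priority_min() and sched_setscheduler() calls.

```c
#include <errno.h>
#include <sched.h>

/*
 * Try to move the calling process to SCHED_FIFO at the lowest real-time
 * priority. Returns 0 on success or a negative errno value (typically
 * -EPERM when unprivileged), so the caller can keep the default policy.
 */
static int try_set_fifo(void)
{
	struct sched_param param = { 0 };
	int prio = sched_get_priority_min(SCHED_FIFO);

	if (prio == -1)
		return -errno;

	param.sched_priority = prio;
	if (sched_setscheduler(0, SCHED_FIFO, &param))
		return -errno;

	return 0;
}
```

As noted above, this would only change priority relative to other tasks inside the VM; it cannot prevent the host from descheduling the vCPU.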
> >
> > I'm having trouble reproducing it in vng both locally and on
> > netdev-virt.
> >
> > At this point, sending an initial, obviously correct patch and observing
> > how much it mitigates the issue is likely the fastest way forward.
>
> Instead of increasing tolerance, how about optionally allowing one
> moderate timing error:
>
> @@ -166,8 +167,15 @@ static void validate_timestamp(struct timespec *cur, int min_delay)
> if (cur64 < start64 + min_delay || cur64 > start64 + max_delay) {
> fprintf(stderr, "ERROR: %" PRId64 " us expected between %d and %d\n",
> cur64 - start64, min_delay, max_delay);
> - if (!getenv("KSFT_MACHINE_SLOW"))
> - test_failed = true;
> + if (!getenv("KSFT_MACHINE_SLOW")) {
> + if (cfg_num_max_timing_failures &&
> + (cur64 <= start64 + (max_delay * 2))) {
> + cfg_num_max_timing_failures--;
> + fprintf(stderr, "CONTINUE: ignore 1 timing failure\n");
> + } else {
> + test_failed = true;
> + }
> + }
> }
> }
>
> @@ -746,6 +755,10 @@ static void parse_opt(int argc, char **argv)
> case 'E':
> cfg_use_epoll = true;
> cfg_epollet = true;
> + break;
> + case 'f':
> + cfg_num_max_timing_failures = strtoul(optarg, NULL, 10);
> + break;
>
> +++ b/tools/testing/selftests/net/txtimestamp.sh
> @@ -30,8 +30,8 @@ run_test_v4v6() {
> # wait for ACK to be queued
> local -r args="$@ -v 10000 -V 60000 -t 8000 -S 80000"
>
> - ./txtimestamp ${args} -4 -L 127.0.0.1
> - ./txtimestamp ${args} -6 -L ::1
> + ./txtimestamp ${args} -f 1 -4 -L 127.0.0.1
> + ./txtimestamp ${args} -f 1 -6 -L ::1
> }
>
> and some boilerplate.
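
To make the budget logic in the hunk above easier to follow, here is a standalone sketch. It is simplified and not the patch itself: it takes the measured delay directly instead of cur64/start64, omits the KSFT_MACHINE_SLOW check, and validate_delay() is a hypothetical name. The budget variable is initialized as if the test ran with -f 1.

```c
#include <stdbool.h>
#include <stdint.h>

static unsigned int cfg_num_max_timing_failures = 1;	/* as with -f 1 */
static bool test_failed;

/*
 * Accept delays in [min_delay, max_delay]. A violation whose delay is
 * still at most 2 * max_delay (this includes arriving too early) consumes
 * one unit of the failure budget. Any other violation, or any violation
 * once the budget is spent, fails the test.
 */
static void validate_delay(int64_t delay, int min_delay, int max_delay)
{
	if (delay >= min_delay && delay <= max_delay)
		return;

	if (cfg_num_max_timing_failures && delay <= (int64_t)max_delay * 2)
		cfg_num_max_timing_failures--;
	else
		test_failed = true;
}
```

With a budget of 1 and max_delay of 8000 us, a first delay of 12000 us is tolerated and a second one fails the test; a delay above 16000 us fails immediately.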
>
> Can fold in the record_timestamp_usr() change too.
>
> I can send this, your alternative with Suggested-by, or let me know if
> you prefer to send that.
>
> It's tricky to reproduce, but evidently on some platforms this occurs,
> so not unreasonable to give some leeway. A single UDP test runs 12
> timing validations: 4 packets * {SND, ENQ, ENQ + SND} setups. A single
> TCP test runs additional {ACK, SND + ACK, ENQ + SND + ACK} cases. If
> we consider 1/12 skips too high, we could increase packet count.

Correction: that should say 16 validations, not 12. The ENQ + SND setup
validates both timestamps, so it counts twice per packet:
4 * (1 + 1 + 2) = 16.

> txtimestamp.sh runs 3 * 7 * 2 test variants. Alternatively we suppress
> 1 failure here, rather than in the individual tests.
>
> Any of these approaches should significantly reduce the flake rate
> reported on netdev.bots.
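
The UDP validation count of 16 can be checked with a small sketch. It assumes 4 packets per run and one validation per requested timestamp type, as described above; the names are illustrative only and do not exist in the selftest.

```c
/* Timestamps validated per packet for each UDP setup. */
static const int udp_setup_validations[] = {
	1,	/* SND */
	1,	/* ENQ */
	2,	/* ENQ + SND: both timestamps are validated */
};

/* Total timing validations for one UDP test with num_pkts packets. */
static int udp_validations(int num_pkts)
{
	int n = sizeof(udp_setup_validations) /
		sizeof(udp_setup_validations[0]);
	int per_pkt = 0;
	int i;

	for (i = 0; i < n; i++)
		per_pkt += udp_setup_validations[i];

	return num_pkts * per_pkt;
}
```

udp_validations(4) gives 16, so tolerating one timing failure per UDP test skips at most 1 in 16 validations.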