netdev.vger.kernel.org archive mirror
* selftests/bpf test_sockmap failure
From: Yonghong Song @ 2018-07-24 15:45 UTC
  To: John Fastabend, netdev; +Cc: Yonghong Song, Martin Lau

On one of our production machines, tools/testing/selftests/bpf
test_sockmap fails randomly, as shown below:

...
[TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with err 1. FAILED
...

...
[TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited with err 1. FAILED
...

...
[TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with err 1. FAILED
...

...
[TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with err 1. FAILED
...

The command line is just `test_sockmap`. The machine has 80 CPUs and
256G of memory. The kernel is based on 4.16, with the latest bpf-next
changes backported.

The failing test number (78, 79, 80, or 83) is random, but the failing
tests all share similar characteristics:
    . the option rate is greater than one, i.e., more than one
      sendmsg/sendpage call in the forked sender process.
    . txmsg_apply is not 0

I debugged a little bit. The failure happens in the msg_loop() function,
in the "unexpected timeout" path below.

...
                        slct = select(max_fd + 1, &w, NULL, NULL, &timeout);
                        if (slct == -1) {
                                perror("select()");
                                clock_gettime(CLOCK_MONOTONIC, &s->end);
                                goto out_errno;
                        } else if (!slct) {
                                if (opt->verbose)
                                        fprintf(stderr, "unexpected timeout\n");
                                errno = -EIO;
                                clock_gettime(CLOCK_MONOTONIC, &s->end);
                                goto out_errno;
                        }
...

It appears that when the error happens, the receive process does not 
receive all bytes sent from the send process and eventually times out.

Has anybody seen this issue as well?
John, any comments on this failure?

Thanks,

Yonghong


* Re: selftests/bpf test_sockmap failure
From: John Fastabend @ 2018-07-24 22:40 UTC
  To: Yonghong Song, netdev; +Cc: Martin Lau

On 07/24/2018 08:45 AM, Yonghong Song wrote:
> In one of our production machines, tools/testing/selftests/bpf
> test_sockmap failed randomly like below:
> 
> ...
> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with
> err 1. FAILED
> ...
> 
> ...
> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited with
> err 1. FAILED
> ...
> 
> ...
> [TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with
> err 1. FAILED
> ...
> 
> ...
> [TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with
> err 1. FAILED
> ...
> 
> The command line is just `test_sockmap`. The machine has 80 cpus, 256G
> memory. The kernel is based on 4.16 but backported with latest bpf-next
> bpf changes.
> 
> The failed test number (78, 79, 80, or 83) is random. But they all share
> similar characteristics:
>    . the option rate is greater than one, i.e., more than one
>      sendmsg/sendpage in the sender forked process.
>    . The txmsg_apply is not 0
> 
> I debugged a little bit. It happens in msg_loop() function below
> "unexpected timeout" path.
> 
> ...
>                         slct = select(max_fd + 1, &w, NULL, NULL,
> &timeout);
>                         if (slct == -1) {
>                                 perror("select()");
>                                 clock_gettime(CLOCK_MONOTONIC, &s->end);
>                                 goto out_errno;
>                         } else if (!slct) {
>                                 if (opt->verbose)
>                                         fprintf(stderr, "unexpected
> timeout\n");
>                                 errno = -EIO;
>                                 clock_gettime(CLOCK_MONOTONIC, &s->end);
>                                 goto out_errno;
>                         }
> ...
> 
> It appears that when the error happens, the receive process does not
> receive all bytes sent from the send process and eventually times out.
> 
> Has anybody seen this issue as well?
> John, any comments on this failure?

Can you run the test with verbose enabled so we can determine if the tx
side is even sending the message? Sample patch below. This will allow
us to see the tx bytes and rx bytes, although it will be a bit noisy.

I notice that the test program is not smart enough to (re)send bytes if
the sendmsg call doesn't consume all bytes. This is a valid error if
we get an ENOMEM or another normal error on the tx side. With apply this
is more likely because every byte (in the apply = 1 case) is being sent
through the BPF prog.

If this is not the case, I can do some more digging, but I've not seen
this before.
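
Roughly the kind of tx-side retry loop I have in mind, as a sketch only
(it uses plain send() instead of the test's sendmsg/sendpage, and the
names are illustrative, not actual test_sockmap code):

#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Keep calling send() until every byte has been pushed to the socket,
 * retrying transient errors and resending whatever a partial send
 * left behind. Illustrative sketch only.
 */
static int send_all(int fd, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		ssize_t n = send(fd, buf + off, len - off, 0);

		if (n < 0) {
			if (errno == EINTR || errno == EAGAIN)
				continue;	/* transient, try again */
			return -1;		/* hard error */
		}
		off += n;			/* partial send: resend the rest */
	}
	return 0;
}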

Thanks!
John

--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -1031,6 +1031,7 @@ static int test_exec(int cgrp, struct sockmap_options *opt)
        if (err)
                goto out;

+       opt->verbose = true;
        err = __test_exec(cgrp, SENDPAGE, opt);
 out:
        return err;


> 
> Thanks,
> 
> Yonghong


* Re: selftests/bpf test_sockmap failure
From: Yonghong Song @ 2018-07-24 23:02 UTC
  To: John Fastabend, netdev; +Cc: Martin Lau



On 7/24/18 3:40 PM, John Fastabend wrote:
> On 07/24/2018 08:45 AM, Yonghong Song wrote:
>> In one of our production machines, tools/testing/selftests/bpf
>> test_sockmap failed randomly like below:
>>
>> ...
>> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with
>> err 1. FAILED
>> ...
>>
>> ...
>> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited with
>> err 1. FAILED
>> ...
>>
>> ...
>> [TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with
>> err 1. FAILED
>> ...
>>
>> ...
>> [TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with
>> err 1. FAILED
>> ...
>>
>> The command line is just `test_sockmap`. The machine has 80 cpus, 256G
>> memory. The kernel is based on 4.16 but backported with latest bpf-next
>> bpf changes.
>>
>> The failed test number (78, 79, 80, or 83) is random. But they all share
>> similar characteristics:
>>     . the option rate is greater than one, i.e., more than one
>>       sendmsg/sendpage in the sender forked process.
>>     . The txmsg_apply is not 0
>>
>> I debugged a little bit. It happens in msg_loop() function below
>> "unexpected timeout" path.
>>
>> ...
>>                          slct = select(max_fd + 1, &w, NULL, NULL,
>> &timeout);
>>                          if (slct == -1) {
>>                                  perror("select()");
>>                                  clock_gettime(CLOCK_MONOTONIC, &s->end);
>>                                  goto out_errno;
>>                          } else if (!slct) {
>>                                  if (opt->verbose)
>>                                          fprintf(stderr, "unexpected
>> timeout\n");
>>                                  errno = -EIO;
>>                                  clock_gettime(CLOCK_MONOTONIC, &s->end);
>>                                  goto out_errno;
>>                          }
>> ...
>>
>> It appears that when the error happens, the receive process does not
>> receive all bytes sent from the send process and eventually times out.
>>
>> Has anybody seen this issue as well?
>> John, any comments on this failure?
> 
> Can you run the test with verbose enabled so we can determine if the tx
> side is even sending the message? Sample patch below. This will allow
> us to see the tx bytes and rx bytes, although it will be a bit noisy.
> 
> I notice that the test program is not smart enough to (re)send bytes if
> the sendmsg call doesn't consume all bytes. This is a valid error if
> we get a enomem or other normal error on the tx side. With apply this
> is more likely because every byte (in apply = 1 case) is being sent
> through BPF prog.

The following are some of the logs from the failed tests:

...
[TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 <-> p1, c2 <-> p2
cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 0.000000GB/s
rx thread exited with err 1. FAILED
...

[TEST 82]: (100, 1, 5, sendmsg, pass,apply 1,): connected sockets: c1 <-> p1, c2 <-> p2
cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
tx_sendmsg: TX: 500B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1 iov_buf 5 cnt 100 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 366B 366.000000B/s 0.000000GB/s
rx thread exited with err 1. FAILED

[TEST 402]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 <-> p1, c2 <-> p2
cgroups binding: c1(42) <-> s1(40) - - - c2(43) <-> s2(41)
tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 0.000000GB/s
rx thread exited with err 1. FAILED

[TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): connected sockets: c1 <-> p1, c2 <-> p2
cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
tx_sendmsg: TX: 524288B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 0.000000GB/s
unexpected timeout
msg_loop_rx: iov_count 1024 iov_buf 256 cnt 2 err -5
rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 458805B 458805.000000B/s 0.000459GB/s
rx thread exited with err 1. FAILED
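
A quick sanity check on the numbers above, assuming the expected total
is simply cnt * iov_count * iov_buf (which matches the tx_sendmsg lines):

    TEST 78:  512 * 1 * 1    =    512 bytes expected, rx stopped at 147
    TEST 80:  2 * 1024 * 256 = 524288 bytes expected, rx stopped at 458805

So the tx side reports sending the full amount in every failed case, but
the rx side always comes up short before the timeout hits.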


> 
> If this is not the case I can do some more digging but I've not seen
> this before.

I cannot reproduce it in my FC27 VM (with the latest bpf-next) either.
The bug only shows up on our production machine, which is a lot busier,
has more CPUs/memory, and has some sysctl configurations that differ
from my VM setup.

> 
> Thanks!
> John
> 
> --- a/tools/testing/selftests/bpf/test_sockmap.c
> +++ b/tools/testing/selftests/bpf/test_sockmap.c
> @@ -1031,6 +1031,7 @@ static int test_exec(int cgrp, struct
> sockmap_options *opt)
>          if (err)
>                  goto out;
> 
> +       opt->verbose = true;
>          err = __test_exec(cgrp, SENDPAGE, opt);
>   out:
>          return err;
> 
> 
>>
>> Thanks,
>>
>> Yonghong
> 


* Re: selftests/bpf test_sockmap failure
From: Prashant Bhole @ 2018-07-25  0:49 UTC
  To: Yonghong Song, John Fastabend, netdev; +Cc: Martin Lau



On 7/25/2018 8:02 AM, Yonghong Song wrote:
> 
> 
> On 7/24/18 3:40 PM, John Fastabend wrote:
>> On 07/24/2018 08:45 AM, Yonghong Song wrote:
>>> In one of our production machines, tools/testing/selftests/bpf
>>> test_sockmap failed randomly like below:
>>>
>>> ...
>>> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> ...
>>> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> ...
>>> [TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> ...
>>> [TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with
>>> err 1. FAILED
>>> ...
>>>
>>> The command line is just `test_sockmap`. The machine has 80 cpus, 256G
>>> memory. The kernel is based on 4.16 but backported with latest bpf-next
>>> bpf changes.
>>>
>>> The failed test number (78, 79, 80, or 83) is random. But they all share
>>> similar characteristics:
>>>     . the option rate is greater than one, i.e., more than one
>>>       sendmsg/sendpage in the sender forked process.
>>>     . The txmsg_apply is not 0
>>>
>>> I debugged a little bit. It happens in msg_loop() function below
>>> "unexpected timeout" path.
>>>
>>> ...
>>>                          slct = select(max_fd + 1, &w, NULL, NULL,
>>> &timeout);
>>>                          if (slct == -1) {
>>>                                  perror("select()");
>>>                                  clock_gettime(CLOCK_MONOTONIC, 
>>> &s->end);
>>>                                  goto out_errno;
>>>                          } else if (!slct) {
>>>                                  if (opt->verbose)
>>>                                          fprintf(stderr, "unexpected
>>> timeout\n");
>>>                                  errno = -EIO;
>>>                                  clock_gettime(CLOCK_MONOTONIC, 
>>> &s->end);
>>>                                  goto out_errno;
>>>                          }
>>> ...
>>>
>>> It appears that when the error happens, the receive process does not
>>> receive all bytes sent from the send process and eventually times out.
>>>
>>> Has anybody seen this issue as well?
>>> John, any comments on this failure?
>>
>> Can you run the test with verbose enabled so we can determine if the tx
>> side is even sending the message? Sample patch below. This will allow
>> us to see the tx bytes and rx bytes, although it will be a bit noisy.
>>
>> I notice that the test program is not smart enough to (re)send bytes if
>> the sendmsg call doesn't consume all bytes. This is a valid error if
>> we get a enomem or other normal error on the tx side. With apply this
>> is more likely because every byte (in apply = 1 case) is being sent
>> through BPF prog.
> 
> The following are some of logs for the failed tests:
> 
> ...
> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 
> <-> p1, c2 <-> p2
> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
> tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 
> 0.000000GB/s
> rx thread exited with err 1. FAILED
> ...
> 
> [TEST 82]: (100, 1, 5, sendmsg, pass,apply 1,): connected sockets: c1 
> <-> p1, c2 <-> p2
> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
> tx_sendmsg: TX: 500B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1 iov_buf 5 cnt 100 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 366B 366.000000B/s 
> 0.000000GB/s
> rx thread exited with err 1. FAILED
> 
> [TEST 402]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 
> <-> p1, c2 <-> p2
> cgroups binding: c1(42) <-> s1(40) - - - c2(43) <-> s2(41)
> tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 
> 0.000000GB/s
> rx thread exited with err 1. FAILED
> 
> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): connected sockets: c1 
> <-> p1, c2 <-> p2
> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
> tx_sendmsg: TX: 524288B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
> 0.000000GB/s
> unexpected timeout
> msg_loop_rx: iov_count 1024 iov_buf 256 cnt 2 err -5
> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 458805B 458805.000000B/s 
> 0.000459GB/s
> rx thread exited with err 1. FAILED
> 
> 
>>
>> If this is not the case I can do some more digging but I've not seen
>> this before.
> 
> I cannot reproduce it in my FC27 VM (with latest bpf-next) either.
> The bug only shows up on our production service which is a lot busier 
> and has more cpus/memorys and some sysctl configurations are different 
> from my VM setup.

In the past I have noticed that on busy machines data is sometimes
delayed so much that the select() timeout is triggered. It isn't a
solution, but you can try increasing the timeout value to check whether
that is the case.
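
Something along these lines, purely as an illustration (the actual
variable names and defaults in test_sockmap.c may differ):

#include <sys/select.h>

/* Illustrative sketch: wait for fd to become readable with a longer
 * timeout, e.g. 10s instead of 1s. Note that Linux select() updates
 * 'timeout', so it must be reinitialized before every call when used
 * in a loop.
 */
static int wait_readable(int fd)
{
	fd_set w;
	struct timeval timeout = {
		.tv_sec  = 10,
		.tv_usec = 0,
	};

	FD_ZERO(&w);
	FD_SET(fd, &w);

	return select(fd + 1, &w, NULL, NULL, &timeout);
}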

-Prashant


* Re: selftests/bpf test_sockmap failure
From: Yonghong Song @ 2018-07-25  5:09 UTC
  To: Prashant Bhole, John Fastabend, netdev; +Cc: Martin Lau



On 7/24/18 5:49 PM, Prashant Bhole wrote:
> 
> 
> On 7/25/2018 8:02 AM, Yonghong Song wrote:
>>
>>
>> On 7/24/18 3:40 PM, John Fastabend wrote:
>>> On 07/24/2018 08:45 AM, Yonghong Song wrote:
>>>> In one of our production machines, tools/testing/selftests/bpf
>>>> test_sockmap failed randomly like below:
>>>>
>>>> ...
>>>> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): rx thread exited with
>>>> err 1. FAILED
>>>> ...
>>>>
>>>> ...
>>>> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): rx thread exited 
>>>> with
>>>> err 1. FAILED
>>>> ...
>>>>
>>>> ...
>>>> [TEST 83]: (100, 1, 5, sendpage, pass,apply 1,): rx thread exited with
>>>> err 1. FAILED
>>>> ...
>>>>
>>>> ...
>>>> [TEST 79]: (512, 1, 1, sendpage, pass,apply 1,): rx thread exited with
>>>> err 1. FAILED
>>>> ...
>>>>
>>>> The command line is just `test_sockmap`. The machine has 80 cpus, 256G
>>>> memory. The kernel is based on 4.16 but backported with latest bpf-next
>>>> bpf changes.
>>>>
>>>> The failed test number (78, 79, 80, or 83) is random. But they all 
>>>> share
>>>> similar characteristics:
>>>>     . the option rate is greater than one, i.e., more than one
>>>>       sendmsg/sendpage in the sender forked process.
>>>>     . The txmsg_apply is not 0
>>>>
>>>> I debugged a little bit. It happens in msg_loop() function below
>>>> "unexpected timeout" path.
>>>>
>>>> ...
>>>>                          slct = select(max_fd + 1, &w, NULL, NULL,
>>>> &timeout);
>>>>                          if (slct == -1) {
>>>>                                  perror("select()");
>>>>                                  clock_gettime(CLOCK_MONOTONIC, 
>>>> &s->end);
>>>>                                  goto out_errno;
>>>>                          } else if (!slct) {
>>>>                                  if (opt->verbose)
>>>>                                          fprintf(stderr, "unexpected
>>>> timeout\n");
>>>>                                  errno = -EIO;
>>>>                                  clock_gettime(CLOCK_MONOTONIC, 
>>>> &s->end);
>>>>                                  goto out_errno;
>>>>                          }
>>>> ...
>>>>
>>>> It appears that when the error happens, the receive process does not
>>>> receive all bytes sent from the send process and eventually times out.
>>>>
>>>> Has anybody seen this issue as well?
>>>> John, any comments on this failure?
>>>
>>> Can you run the test with verbose enabled so we can determine if the tx
>>> side is even sending the message? Sample patch below. This will allow
>>> us to see the tx bytes and rx bytes, although it will be a bit noisy.
>>>
>>> I notice that the test program is not smart enough to (re)send bytes if
>>> the sendmsg call doesn't consume all bytes. This is a valid error if
>>> we get a enomem or other normal error on the tx side. With apply this
>>> is more likely because every byte (in apply = 1 case) is being sent
>>> through BPF prog.
>>
>> The following are some of logs for the failed tests:
>>
>> ...
>> [TEST 78]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 
>> <-> p1, c2 <-> p2
>> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
>> tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
>> 0.000000GB/s
>> unexpected timeout
>> msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
>> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 
>> 0.000000GB/s
>> rx thread exited with err 1. FAILED
>> ...
>>
>> [TEST 82]: (100, 1, 5, sendmsg, pass,apply 1,): connected sockets: c1 
>> <-> p1, c2 <-> p2
>> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
>> tx_sendmsg: TX: 500B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
>> 0.000000GB/s
>> unexpected timeout
>> msg_loop_rx: iov_count 1 iov_buf 5 cnt 100 err -5
>> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 366B 366.000000B/s 
>> 0.000000GB/s
>> rx thread exited with err 1. FAILED
>>
>> [TEST 402]: (512, 1, 1, sendmsg, pass,apply 1,): connected sockets: c1 
>> <-> p1, c2 <-> p2
>> cgroups binding: c1(42) <-> s1(40) - - - c2(43) <-> s2(41)
>> tx_sendmsg: TX: 512B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
>> 0.000000GB/s
>> unexpected timeout
>> msg_loop_rx: iov_count 1 iov_buf 1 cnt 512 err -5
>> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 147B 147.000000B/s 
>> 0.000000GB/s
>> rx thread exited with err 1. FAILED
>>
>> [TEST 80]: (2, 1024, 256, sendmsg, pass,apply 1,): connected sockets: 
>> c1 <-> p1, c2 <-> p2
>> cgroups binding: c1(24) <-> s1(22) - - - c2(25) <-> s2(23)
>> tx_sendmsg: TX: 524288B 0.000000B/s 0.000000 GB/s RX: 0B 0.000000B/s 
>> 0.000000GB/s
>> unexpected timeout
>> msg_loop_rx: iov_count 1024 iov_buf 256 cnt 2 err -5
>> rx_sendmsg: TX: 0B 0.000000B/s 0.000000GB/s RX: 458805B 
>> 458805.000000B/s 0.000459GB/s
>> rx thread exited with err 1. FAILED
>>
>>
>>>
>>> If this is not the case I can do some more digging but I've not seen
>>> this before.
>>
>> I cannot reproduce it in my FC27 VM (with latest bpf-next) either.
>> The bug only shows up on our production service which is a lot busier 
>> and has more cpus/memorys and some sysctl configurations are different 
>> from my VM setup.
> 
> In the past I have noticed that on busy machines sometimes data is 
> delayed so much that select() timeout is triggered. Although it isn't a 
> solution but you can try to increase the timeout value to check if that 
> is the case.

Thanks for the tip. My test machine is busier, but most jobs are waiting
for work and the actual CPU utilization is not that high.

The default recv timeout is 1 second. I increased it to 10 seconds, and 
the failure still exists.

