BPF List
 help / color / mirror / Atom feed
From: Yonghong Song <yhs@meta.com>
To: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: bpf <bpf@vger.kernel.org>, Kui-Feng Lee <kuifeng@meta.com>,
	Manu Bretelle <chantr4@gmail.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Jakub Sitnicki <jakub@cloudflare.com>
Subject: Re: selftest sock_fields failed on s390x with latest llvm17
Date: Sun, 14 May 2023 09:58:26 -0700	[thread overview]
Message-ID: <d275bd5e-e468-c590-9a10-8230a9ad7daa@meta.com> (raw)
In-Reply-To: <75f39027fe1889cd69d01d439d558418cbd020a1.camel@linux.ibm.com>



On 5/13/23 1:24 AM, Ilya Leoshkevich wrote:
> On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote:
>>
>>
>> On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
>>> On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
>>>> On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
>>>>> Hi, Ilya,
>>>>>
>>>>> BPF CI ([1]) detected a s390x failure when bpf program is
>>>>> compiled
>>>>> with
>>>>> latest llvm17 on bpf-next branch. To reproduce the issue, just
>>>>> run
>>>>> normal './test_progs -j'. The failure log looks like below:
>>>>>
>>>>> Notice: Success: 341/3015, Skipped: 29, Failed: 1
>>>>> Error: #191 sock_fields
>>>>>      Error: #191 sock_fields
>>>>>      create_netns:PASS:create netns 0 nsec
>>>>>      create_netns:PASS:bring up lo 0 nsec
>>>>>     
>>>>> serial_test_sock_fields:PASS:test_sock_fields__open_and_load 0
>>>>> nsec
>>>>>     
>>>>> serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock_fie
>>>>> lds)
>>>>> 0
>>>>> nsec
>>>>>     
>>>>> serial_test_sock_fields:PASS:attach_cgroup(ingress_read_sock_fi
>>>>> elds
>>>>> )
>>>>> 0 nsec
>>>>>      serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port
>>>>> 0
>>>>> nsec
>>>>>      test:PASS:getsockname(listen_fd) 0 nsec
>>>>>      test:PASS:getsockname(cli_fd) 0 nsec
>>>>>      test:PASS:accept(listen_fd) 0 nsec
>>>>>      init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd)
>>>>> 0
>>>>> nsec
>>>>>     
>>>>> init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_fd) 0
>>>>> nsec
>>>>>      test:PASS:send(accept_fd) 0 nsec
>>>>>      test:PASS:recv(cli_fd) 0 nsec
>>>>>      test:PASS:send(accept_fd) 0 nsec
>>>>>      test:PASS:recv(cli_fd) 0 nsec
>>>>>      test:PASS:recv(accept_fd) for fin 0 nsec
>>>>>      test:PASS:recv(cli_fd) for fin 0 nsec
>>>>>     
>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
>>>>> &accept_fd) 0 nsec
>>>>>     
>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
>>>>> &cli_fd) 0 nsec
>>>>>      check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>>>>>      check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>>>>>      check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
>>>>> READ_SK_DST_PORT_IDX) 0 nsec
>>>>>      check_result:FAIL:failure in read_sk_dst_port on line
>>>>> unexpected
>>>>> failure in read_sk_dst_port on line: actual 297 != expected 0
>>>>>      listen_sk: state:10 bound_dev_if:0 family:10 type:1
>>>>> protocol:6
>>>>> mark:0
>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>> src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::)
>>>>> dst_port:0
>>>>>      srv_sk: state:9 bound_dev_if:0 family:10 type:1 protocol:6
>>>>> mark:0
>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>> src_port:51966 dst_ip4:7f000006(127.0.0.6) dst_ip6:0:0:0:1(::1)
>>>>> dst_port:38030
>>>>>      cli_sk: state:5 bound_dev_if:0 family:10 type:1 protocol:6
>>>>> mark:0
>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>> src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
>>>>> dst_port:51966
>>>>>      listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295
>>>>> snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0
>>>>> mss_cache:536
>>>>> ecn_flags:0 rate_delivered:0 rate_interval_us:0 packets_out:0
>>>>> retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0
>>>>> segs_out:0
>>>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
>>>>> bytes_acked:0
>>>>>      srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
>>>>> snd_ssthresh:2147483647
>>>>> rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
>>>>> mss_cache:32768
>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:272 packets_out:0
>>>>> retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0
>>>>> segs_out:3
>>>>> data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
>>>>> bytes_acked:22
>>>>>      cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
>>>>> snd_ssthresh:2147483647
>>>>> rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
>>>>> mss_cache:32768
>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:925 packets_out:0
>>>>> retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2
>>>>> segs_out:6
>>>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
>>>>> bytes_acked:2
>>>>>      check_result:PASS:listen_sk 0 nsec
>>>>>      check_result:PASS:srv_sk 0 nsec
>>>>>      check_result:PASS:srv_tp 0 nsec
>>>>>
>>>>> If bpf program is compiled with llvm16, the test passed
>>>>> according
>>>>> to
>>>>> a CI run.
>>>>>
>>>>> I don't have s390x environment to debug this. Could you help
>>>>> debug
>>>>> it?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>      [1]
>>>>> https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
>>>>
>>>>
>>>> Hi,
>>>>
>>>> thank for letting me know.
>>>> I will look into this.
>>>>
>>>> Best regards,
>>>> Ilya
>>>
>>> In the meantime the issue was fixed by:
>>>
>>> commit 141be5c062ecf22bd287afffd310e8ac4711444a
>>> Author: Shoaib Meenai <smeenai@fb.com>
>>> Date:   Fri May 5 14:18:12 2023 -0700
>>>
>>>       Revert "Reland [Pipeline] Don't limit ArgumentPromotion to -
>>> O3"
>>>       
>>>       This reverts commit 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
>>>       
>>>       https://reviews.llvm.org/D149768   is causing size regressions
>>> for -
>>> Oz
>>>       with FullLTO, and I'm reverting that one while investigating.
>>> This
>>>       commit depends on that one, so it needs to be reverted as
>>> well.
>>
>> The transformtion "Don't limit ArgumentPromotion to -O3" is
>> temporarily
>> reverted. But it could be reverted again once the issue is resolved.
>> So it is a good idea to resolve the issue in the kernel.
>>
>>>
>>> But looking at the codegen differences:
>>>
>>> $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g
>>> fail.s)
>>>
>>> -pass.o:        file format elf64-bpf
>>> +fail.o:        file format elf64-bpf
>>>
>>> -00000000000002c8 <sk_dst_port__load_half>
>>> -       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
>>> +00000000000002c0 <sk_dst_port__load_half>
>>> +       54 10 00 00 00 00 ff ff w1 &= 65535
>>>           b4 00 00 00 00 00 00 01 w0 = 1
>>>           16 10 00 01 00 00 ca fe if w1 == 51966 goto +1 <LBB6_2>
>>>           b4 00 00 00 00 00 00 00 w0 = 0
>>>
>>> This is what ArgumentPromotion is supposed to do, so that's okay so
>>> far. However, further down below we have:
>>>
>>>    Disassembly of section cgroup_skb/egress:
>>>
>>> -       bf 16 00 00 00 00 00 00 r1 = r6
>>> +       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
>>> +       bc 17 00 00 00 00 00 00 w1 = w7
>>>           85 01 00 00 00 00 00 53 call sk_dst_port__load_word
>>>
>>> ...
>>>
>>> -       bf 16 00 00 00 00 00 00 r1 = r6
>>> +       74 70 00 00 00 00 00 10 w7 >>= 16
>>> +       bc 17 00 00 00 00 00 00 w1 = w7
>>>           85 01 00 00 00 00 00 57 call sk_dst_port__load_half
>>>
>>> so there is no 16-bit load anymore, instead, the result from the
>>> earlier 32-bit load is reused. However, on BE these kinds of loads
>>> for this particular field are not consistent at the moment - see
>>> [1]
>>> and the previous discussions.
>>>
>>> De-facto we have the following results:
>>>
>>> - int load: 0x0000cafe
>>> - short load: 0xcafe
>>
>> So 'De-facto' means the above is the expected result.
>>
>>>
>>> On a consistent BE we should have rather had:
>>>
>>> - int load: 0x0000cafe
>>> - short load: 0
>>
>> What does 'consistent BE' mean here? Does it mean the expected
>> result from the source code itself?
> 
> I should not have called the de-facto example "BE" at all: it's rather
> "mixed endianness" or "weird endianness" or something along these
> lines.
> 
> On "consistent BE" or simply "BE" properties like
> 
> *(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2);
> 
> hold. This is currently not the case for bpf_sock.dst_port.
> 
> We compile with -mbig-endian, so we promise to the compiler that the
> machine is big-endian, and the compiler expects the above to hold for
> any p. Unfortunately when p points to bpf_sock.dst_port, this is not
> the case.

If I understand correctly, *(uint32_t *)p to get the bpf_sock.dst_port
is for backward compatibility. But the real u32 read by compiler will
do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the
same as expected *(uint32_t *)p so we have problem here.

> 
> The property above is important for the correctness of the load/store
> tearing transformations. What we have here is not exactly tearing, but
> is quite close.
> 
>>> Clang, of course, expects a consistent BE and optimizes around
>>> that.
>>>
>>> This was a conscious tradeoff Jakub and I have agreed on in order
>>> to
>>> keep the quirky behavior from the past. Given what's happening with
>>> Clang now, I wonder if it would be worth revisiting it in the name
>>> of
>>> consistency?
>>
>> If I understand correctly, I think the issue is
>>       r7 = *(u32 *)(r6 + 48)
>>       w7 >= 16
>>       w1 = w7
>>
>> after verifier, it is changed to
>>      r7 = *(u16 *)(r6 + <kernel offset>)
>>      w7 >= 16
>>      w1 = w7
>>
>> and the result after verifier rewrite is completely wrong.
>> Is it right?
> 
> No, the verifier rewrite is correct.
> The sk_dst_port__load_word() part of the test succeeds.
> 
> All these rewrites already work fine, they are correct and consistent.
> It's really just bpf_sock.dst_port that is special.
> 
>> If this is the case, during verifier rewrite, if it is
>> big endian, say the user intends to load 4 bytes (from uapi header)
>> while the kernel field is 2 bytes, in such cases, kernel
>> can still pretend to generate 4-byte load. For example,
>> for the above example, the code after verification could be:
>>      r7 = *(u16 *)(r6 + <kernel offset>)
>>      r7 <= 16
>>      w7 >= 16
>>      w1 = w7
>>
>> Hopefully, we won't have many such cases. Does this work?
> 
> This would break the sk_dst_port__load_word() part of the test.

This is a hack. This may work for this specific u16 case, but
yes, it won't work for u32 load case.

> 
> 
> 
> Above I asked whether we can resolve the inconsistency, but I thought
> about it and I don't see a way of doing it without breaking the ABI,
> which is at worst unacceptable, and at best a last resort measure.
> 
> What do you think about marking bpf_sock.dst_port volatile? volatile
> should prevent tearing and similar optimizations, with which we have a
> problem here.
> 
> We could also add a comment warning users not to cast away this
> volatile due to the quirk we have. And then we should adjust the test
> (making all casts volatile) to comply with this new warning.

I did a little study on this. The main problem here for
static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
{
         __u16 *half = (__u16 *)&sk->dst_port;
         return half[0] == bpf_htons(0xcafe);
}

Through some cross-function optimization by ArgumentPromotion
optimization, the compiler does:
    /* the below shared by sk_dst_port__load_word
     * and sk_dst_port__load_half
     */
    __u32 *word = (__u32 *)&sk->dst_port;
    __u32 word_val = word[0];

    /* the below is for sk_dst_port__load_half only */
    __u16 half_val = word_val >> 16;

    ... half_val passed into sk_dst_port__load_half ...
    return half_val == bpf_htons(0xcafe);

Here, 'word_val = word[0]' is replaced by 2-byte load
by the verifier and this caused the trouble for later
sk_dst_port__load_half().

I don't have a good solution here. The issue is exposed
as we have both u16 and u32 load for &sk->dst_port and
the compiler did some optimization with this.

I would say this is an extreme corner case and we can just
fix in the source code like below:

diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c 
b/tools/testing/selftests/bpf/progs/test_sock_fields.c
index bbad3c2d9aa5..39c975786720 100644
--- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
+++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
@@ -265,7 +265,10 @@ static __noinline bool 
sk_dst_port__load_word(struct bpf_sock *sk)

  static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
  {
-       __u16 *half = (__u16 *)&sk->dst_port;
+       __u16 *half;
+
+       asm volatile ("");
+       half  = (__u16 *)&sk->dst_port;
         return half[0] == bpf_htons(0xcafe);
  }

Could you try whether the above workaround works or not?
If we want the code to be future proof for potential
cross-func optimization for these noinline functions, we
can add similar asm codes to all of
bool sk_dst_port__load_{word, half, byte}.


> 
>>> [1]
>>> https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com

  reply	other threads:[~2023-05-14 16:58 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-03 19:35 selftest sock_fields failed on s390x with latest llvm17 Yonghong Song
2023-05-03 19:46 ` Ilya Leoshkevich
2023-05-12 14:40   ` Ilya Leoshkevich
2023-05-13  4:13     ` Yonghong Song
2023-05-13  8:24       ` Ilya Leoshkevich
2023-05-14 16:58         ` Yonghong Song [this message]
2023-05-15  7:55           ` Ilya Leoshkevich
2023-05-15 15:27             ` Yonghong Song
2023-05-16  9:16               ` Ilya Leoshkevich
2023-05-16 15:20                 ` Yonghong Song

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d275bd5e-e468-c590-9a10-8230a9ad7daa@meta.com \
    --to=yhs@meta.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=chantr4@gmail.com \
    --cc=iii@linux.ibm.com \
    --cc=jakub@cloudflare.com \
    --cc=kuifeng@meta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox