From: Yonghong Song <yhs@meta.com>
To: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: bpf <bpf@vger.kernel.org>, Kui-Feng Lee <kuifeng@meta.com>,
Manu Bretelle <chantr4@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Jakub Sitnicki <jakub@cloudflare.com>
Subject: Re: selftest sock_fields failed on s390x with latest llvm17
Date: Sun, 14 May 2023 09:58:26 -0700 [thread overview]
Message-ID: <d275bd5e-e468-c590-9a10-8230a9ad7daa@meta.com> (raw)
In-Reply-To: <75f39027fe1889cd69d01d439d558418cbd020a1.camel@linux.ibm.com>
On 5/13/23 1:24 AM, Ilya Leoshkevich wrote:
> On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote:
>>
>>
>> On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
>>> On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
>>>> On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
>>>>> Hi, Ilya,
>>>>>
>>>>> BPF CI ([1]) detected a s390x failure when bpf program is
>>>>> compiled
>>>>> with
>>>>> latest llvm17 on bpf-next branch. To reproduce the issue, just
>>>>> run
>>>>> normal './test_progs -j'. The failure log looks like below:
>>>>>
>>>>> Notice: Success: 341/3015, Skipped: 29, Failed: 1
>>>>> Error: #191 sock_fields
>>>>> Error: #191 sock_fields
>>>>> create_netns:PASS:create netns 0 nsec
>>>>> create_netns:PASS:bring up lo 0 nsec
>>>>>
>>>>> serial_test_sock_fields:PASS:test_sock_fields__open_and_load 0
>>>>> nsec
>>>>>
>>>>> serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock_fie
>>>>> lds)
>>>>> 0
>>>>> nsec
>>>>>
>>>>> serial_test_sock_fields:PASS:attach_cgroup(ingress_read_sock_fi
>>>>> elds
>>>>> )
>>>>> 0 nsec
>>>>> serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port
>>>>> 0
>>>>> nsec
>>>>> test:PASS:getsockname(listen_fd) 0 nsec
>>>>> test:PASS:getsockname(cli_fd) 0 nsec
>>>>> test:PASS:accept(listen_fd) 0 nsec
>>>>> init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd)
>>>>> 0
>>>>> nsec
>>>>>
>>>>> init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_fd) 0
>>>>> nsec
>>>>> test:PASS:send(accept_fd) 0 nsec
>>>>> test:PASS:recv(cli_fd) 0 nsec
>>>>> test:PASS:send(accept_fd) 0 nsec
>>>>> test:PASS:recv(cli_fd) 0 nsec
>>>>> test:PASS:recv(accept_fd) for fin 0 nsec
>>>>> test:PASS:recv(cli_fd) for fin 0 nsec
>>>>>
>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
>>>>> &accept_fd) 0 nsec
>>>>>
>>>>> check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt,
>>>>> &cli_fd) 0 nsec
>>>>> check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>>>>> check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
>>>>> check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
>>>>> READ_SK_DST_PORT_IDX) 0 nsec
>>>>> check_result:FAIL:failure in read_sk_dst_port on line
>>>>> unexpected
>>>>> failure in read_sk_dst_port on line: actual 297 != expected 0
>>>>> listen_sk: state:10 bound_dev_if:0 family:10 type:1
>>>>> protocol:6
>>>>> mark:0
>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>> src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::)
>>>>> dst_port:0
>>>>> srv_sk: state:9 bound_dev_if:0 family:10 type:1 protocol:6
>>>>> mark:0
>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>> src_port:51966 dst_ip4:7f000006(127.0.0.6) dst_ip6:0:0:0:1(::1)
>>>>> dst_port:38030
>>>>> cli_sk: state:5 bound_dev_if:0 family:10 type:1 protocol:6
>>>>> mark:0
>>>>> priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
>>>>> src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
>>>>> dst_port:51966
>>>>> listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295
>>>>> snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0
>>>>> mss_cache:536
>>>>> ecn_flags:0 rate_delivered:0 rate_interval_us:0 packets_out:0
>>>>> retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0
>>>>> segs_out:0
>>>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
>>>>> bytes_acked:0
>>>>> srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
>>>>> snd_ssthresh:2147483647
>>>>> rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
>>>>> mss_cache:32768
>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:272 packets_out:0
>>>>> retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0
>>>>> segs_out:3
>>>>> data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
>>>>> bytes_acked:22
>>>>> cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
>>>>> snd_ssthresh:2147483647
>>>>> rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
>>>>> mss_cache:32768
>>>>> ecn_flags:0 rate_delivered:1 rate_interval_us:925 packets_out:0
>>>>> retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2
>>>>> segs_out:6
>>>>> data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
>>>>> bytes_acked:2
>>>>> check_result:PASS:listen_sk 0 nsec
>>>>> check_result:PASS:srv_sk 0 nsec
>>>>> check_result:PASS:srv_tp 0 nsec
>>>>>
>>>>> If bpf program is compiled with llvm16, the test passed
>>>>> according
>>>>> to
>>>>> a CI run.
>>>>>
>>>>> I don't have s390x environment to debug this. Could you help
>>>>> debug
>>>>> it?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> [1]
>>>>> https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
>>>>
>>>>
>>>> Hi,
>>>>
>>>> thank for letting me know.
>>>> I will look into this.
>>>>
>>>> Best regards,
>>>> Ilya
>>>
>>> In the meantime the issue was fixed by:
>>>
>>> commit 141be5c062ecf22bd287afffd310e8ac4711444a
>>> Author: Shoaib Meenai <smeenai@fb.com>
>>> Date: Fri May 5 14:18:12 2023 -0700
>>>
>>> Revert "Reland [Pipeline] Don't limit ArgumentPromotion to -
>>> O3"
>>>
>>> This reverts commit 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
>>>
>>> https://reviews.llvm.org/D149768 is causing size regressions
>>> for -
>>> Oz
>>> with FullLTO, and I'm reverting that one while investigating.
>>> This
>>> commit depends on that one, so it needs to be reverted as
>>> well.
>>
>> The transformtion "Don't limit ArgumentPromotion to -O3" is
>> temporarily
>> reverted. But it could be reverted again once the issue is resolved.
>> So it is a good idea to resolve the issue in the kernel.
>>
>>>
>>> But looking at the codegen differences:
>>>
>>> $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g
>>> fail.s)
>>>
>>> -pass.o: file format elf64-bpf
>>> +fail.o: file format elf64-bpf
>>>
>>> -00000000000002c8 <sk_dst_port__load_half>
>>> - 69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
>>> +00000000000002c0 <sk_dst_port__load_half>
>>> + 54 10 00 00 00 00 ff ff w1 &= 65535
>>> b4 00 00 00 00 00 00 01 w0 = 1
>>> 16 10 00 01 00 00 ca fe if w1 == 51966 goto +1 <LBB6_2>
>>> b4 00 00 00 00 00 00 00 w0 = 0
>>>
>>> This is what ArgumentPromotion is supposed to do, so that's okay so
>>> far. However, further down below we have:
>>>
>>> Disassembly of section cgroup_skb/egress:
>>>
>>> - bf 16 00 00 00 00 00 00 r1 = r6
>>> + 61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
>>> + bc 17 00 00 00 00 00 00 w1 = w7
>>> 85 01 00 00 00 00 00 53 call sk_dst_port__load_word
>>>
>>> ...
>>>
>>> - bf 16 00 00 00 00 00 00 r1 = r6
>>> + 74 70 00 00 00 00 00 10 w7 >>= 16
>>> + bc 17 00 00 00 00 00 00 w1 = w7
>>> 85 01 00 00 00 00 00 57 call sk_dst_port__load_half
>>>
>>> so there is no 16-bit load anymore, instead, the result from the
>>> earlier 32-bit load is reused. However, on BE these kinds of loads
>>> for this particular field are not consistent at the moment - see
>>> [1]
>>> and the previous discussions.
>>>
>>> De-facto we have the following results:
>>>
>>> - int load: 0x0000cafe
>>> - short load: 0xcafe
>>
>> So 'De-facto' means the above is the expected result.
>>
>>>
>>> On a consistent BE we should have rather had:
>>>
>>> - int load: 0x0000cafe
>>> - short load: 0
>>
>> What does 'consistent BE' mean here? Does it mean the expected
>> result from the source code itself?
>
> I should not have called the de-facto example "BE" at all: it's rather
> "mixed endianness" or "weird endianness" or something along these
> lines.
>
> On "consistent BE" or simply "BE" properties like
>
> *(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2);
>
> hold. This is currently not the case for bpf_sock.dst_port.
>
> We compile with -mbig-endian, so we promise to the compiler that the
> machine is big-endian, and the compiler expects the above to hold for
> any p. Unfortunately when p points to bpf_sock.dst_port, this is not
> the case.
If I understand correctly, *(uint32_t *)p to get the bpf_sock.dst_port
is for backward compatibility. But the real u32 read by compiler will
do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the
same as expected *(uint32_t *)p so we have problem here.
>
> The property above is important for the correctness of the load/store
> tearing transformations. What we have here is not exactly tearing, but
> is quite close.
>
>>> Clang, of course, expects a consistent BE and optimizes around
>>> that.
>>>
>>> This was a conscious tradeoff Jakub and I have agreed on in order
>>> to
>>> keep the quirky behavior from the past. Given what's happening with
>>> Clang now, I wonder if it would be worth revisiting it in the name
>>> of
>>> consistency?
>>
>> If I understand correctly, I think the issue is
>> r7 = *(u32 *)(r6 + 48)
>> w7 >= 16
>> w1 = w7
>>
>> after verifier, it is changed to
>> r7 = *(u16 *)(r6 + <kernel offset>)
>> w7 >= 16
>> w1 = w7
>>
>> and the result after verifier rewrite is completely wrong.
>> Is it right?
>
> No, the verifier rewrite is correct.
> The sk_dst_port__load_word() part of the test succeeds.
>
> All these rewrites already work fine, they are correct and consistent.
> It's really just bpf_sock.dst_port that is special.
>
>> If this is the case, during verifier rewrite, if it is
>> big endian, say the user intends to load 4 bytes (from uapi header)
>> while the kernel field is 2 bytes, in such cases, kernel
>> can still pretend to generate 4-byte load. For example,
>> for the above example, the code after verification could be:
>> r7 = *(u16 *)(r6 + <kernel offset>)
>> r7 <= 16
>> w7 >= 16
>> w1 = w7
>>
>> Hopefully, we won't have many such cases. Does this work?
>
> This would break the sk_dst_port__load_word() part of the test.
This is a hack. This may work for this specific u16 case, but
yes, it won't work for u32 load case.
>
>
>
> Above I asked whether we can resolve the inconsistency, but I thought
> about it and I don't see a way of doing it without breaking the ABI,
> which is at worst unacceptable, and at best a last resort measure.
>
> What do you think about marking bpf_sock.dst_port volatile? volatile
> should prevent tearing and similar optimizations, with which we have a
> problem here.
>
> We could also add a comment warning users not to cast away this
> volatile due to the quirk we have. And then we should adjust the test
> (making all casts volatile) to comply with this new warning.
I did a little study on this. The main problem here for
static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
{
__u16 *half = (__u16 *)&sk->dst_port;
return half[0] == bpf_htons(0xcafe);
}
Through some cross-function optimization by ArgumentPromotion
optimization, the compiler does:
/* the below shared by sk_dst_port__load_word
* and sk_dst_port__load_half
*/
__u32 *word = (__u32 *)&sk->dst_port;
__u32 word_val = word[0];
/* the below is for sk_dst_port__load_half only */
__u16 half_val = word_val >> 16;
... half_val passed into sk_dst_port__load_half ...
return half_val == bpf_htons(0xcafe);
Here, 'word_val = word[0]' is replaced by 2-byte load
by the verifier and this caused the trouble for later
sk_dst_port__load_half().
I don't have a good solution here. The issue is exposed
as we have both u16 and u32 load for &sk->dst_port and
the compiler did some optimization with this.
I would say this is an extreme corner case and we can just
fix in the source code like below:
diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
b/tools/testing/selftests/bpf/progs/test_sock_fields.c
index bbad3c2d9aa5..39c975786720 100644
--- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
+++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
@@ -265,7 +265,10 @@ static __noinline bool
sk_dst_port__load_word(struct bpf_sock *sk)
static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
{
- __u16 *half = (__u16 *)&sk->dst_port;
+ __u16 *half;
+
+ asm volatile ("");
+ half = (__u16 *)&sk->dst_port;
return half[0] == bpf_htons(0xcafe);
}
Could you try whether the above workaround works or not?
If we want the code to be future proof for potential
cross-func optimization for these noinline functions, we
can add similar asm codes to all of
bool sk_dst_port__load_{word, half, byte}.
>
>>> [1]
>>> https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@cloudflare.com
next prev parent reply other threads:[~2023-05-14 16:58 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-03 19:35 selftest sock_fields failed on s390x with latest llvm17 Yonghong Song
2023-05-03 19:46 ` Ilya Leoshkevich
2023-05-12 14:40 ` Ilya Leoshkevich
2023-05-13 4:13 ` Yonghong Song
2023-05-13 8:24 ` Ilya Leoshkevich
2023-05-14 16:58 ` Yonghong Song [this message]
2023-05-15 7:55 ` Ilya Leoshkevich
2023-05-15 15:27 ` Yonghong Song
2023-05-16 9:16 ` Ilya Leoshkevich
2023-05-16 15:20 ` Yonghong Song
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d275bd5e-e468-c590-9a10-8230a9ad7daa@meta.com \
--to=yhs@meta.com \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=chantr4@gmail.com \
--cc=iii@linux.ibm.com \
--cc=jakub@cloudflare.com \
--cc=kuifeng@meta.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox