From: Yonghong Song <yonghong.song@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Zac Ecob <zacecob@protonmail.com>,
Daniel Borkmann <daniel@iogearbox.net>,
"bpf@vger.kernel.org" <bpf@vger.kernel.org>
Subject: Re: Kernel oops caused by signed divide
Date: Tue, 10 Sep 2024 15:00:25 -0700 [thread overview]
Message-ID: <269dc5c4-d867-45ff-974c-3cc8a5a30e77@linux.dev> (raw)
In-Reply-To: <CAADnVQLwZNQkViS-c=BA-2Y7+0pqEpo78Otx9SPjaScF0CGvEw@mail.gmail.com>
On 9/10/24 2:53 PM, Alexei Starovoitov wrote:
> On Tue, Sep 10, 2024 at 12:32 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 9/10/24 11:25 AM, Alexei Starovoitov wrote:
>>> On Tue, Sep 10, 2024 at 11:02 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>> On 9/10/24 8:21 AM, Alexei Starovoitov wrote:
>>>>> On Tue, Sep 10, 2024 at 7:21 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>>>>> On 9/9/24 10:29 AM, Alexei Starovoitov wrote:
>>>>>>> On Mon, Sep 9, 2024 at 10:21 AM Zac Ecob <zacecob@protonmail.com> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I recently received a kernel 'oops' about a divide error.
>>>>>>>> After some research, it seems that the 'div64_s64' function used for the 'MOD'/'REM' instructions boils down to an 'idiv'.
>>>>>>>>
>>>>>>>> The 'dividend' is set to INT64_MIN, and the 'divisor' to -1, then because of two's complement, there is no corresponding positive value, causing the error (at least to my understanding).
>>>>>>>>
>>>>>>>>
>>>>>>>> Apologies if this is already known / not a relevant concern.
>>>>>>> Thanks for the report. This is a new issue.
>>>>>>>
>>>>>>> Yonghong,
>>>>>>>
>>>>>>> it's related to the new signed div insn.
>>>>>>> It sounds like we need to update chk_and_div[] part of
>>>>>>> the verifier to account for signed div differently.
>>>>>> In verifier, we have
>>>>>> /* [R,W]x div 0 -> 0 */
>>>>>> /* [R,W]x mod 0 -> [R,W]x */
>>>>> the verifier is doing what hw does. In this case this is arm64 behavior.
>>>> Okay, I see. I tried on a arm64 machine it indeed hehaves like the above.
>>>>
>>>> # uname -a
>>>> Linux ... #1 SMP PREEMPT_DYNAMIC Thu Aug 1 06:58:32 PDT 2024 aarch64 aarch64 aarch64 GNU/Linux
>>>> # cat t2.c
>>>> #include <stdio.h>
>>>> #include <limits.h>
>>>> int main(void) {
>>>> volatile long long a = 5;
>>>> volatile long long b = 0;
>>>> printf("a/b = %lld\n", a/b);
>>>> return 0;
>>>> }
>>>> # cat t3.c
>>>> #include <stdio.h>
>>>> #include <limits.h>
>>>> int main(void) {
>>>> volatile long long a = 5;
>>>> volatile long long b = 0;
>>>> printf("a%%b = %lld\n", a%b);
>>>> return 0;
>>>> }
>>>> # gcc -O2 t2.c && ./a.out
>>>> a/b = 0
>>>> # gcc -O2 t3.c && ./a.out
>>>> a%b = 5
>>>>
>>>> on arm64, clang18 compiled binary has the same result
>>>>
>>>> # clang -O2 t2.c && ./a.out
>>>> a/b = 0
>>>> # clang -O2 t3.c && ./a.out
>>>> a%b = 5
>>>>
>>>> The same source code, compiled on x86_64 with -O2 as well,
>>>> it generates:
>>>> Floating point exception (core dumped)
>>>>
>>>>>> What the value for
>>>>>> Rx_a sdiv Rx_b -> ?
>>>>>> where Rx_a = INT64_MIN and Rx_b = -1?
>>>>> Why does it matter what Rx_a contains ?
>>>> It does matter. See below:
>>>>
>>>> on arm64:
>>>>
>>>> # cat t1.c
>>>> #include <stdio.h>
>>>> #include <limits.h>
>>>> int main(void) {
>>>> volatile long long a = LLONG_MIN;
>>>> volatile long long b = -1;
>>>> printf("a/b = %lld\n", a/b);
>>>> return 0;
>>>> }
>>>> # clang -O2 t1.c && ./a.out
>>>> a/b = -9223372036854775808
>>>> # gcc -O2 t1.c && ./a.out
>>>> a/b = -9223372036854775808
>>>>
>>>> So the result of a/b is LLONG_MIN
>>>>
>>>> The same code will cause exception on x86_64:
>>>>
>>>> $ uname -a
>>>> Linux ... #1 SMP Wed Jun 5 06:21:21 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux
>>>> [yhs@devvm1513.prn0 ~]$ gcc -O2 t1.c && ./a.out
>>>> Floating point exception (core dumped)
>>>> [yhs@devvm1513.prn0 ~]$ clang -O2 t1.c && ./a.out
>>>> Floating point exception (core dumped)
>>>>
>>>> So this is what we care about.
>>>>
>>>> So I guess we can follow arm64 result too.
>>>>
>>>>> What cpus do in this case?
>>>> See above. arm64 produces *some* result while x64 cause exception.
>>>> We do need to special handle for LLONG_MIN/(-1) case.
>>> My point about Rx_a that idiv will cause out-of-range exception
>>> for many other values than Rx_a == INT64_MIN.
>>> I'm not sure that divisor -1 is the only such case either.
>>> Probably is, since intuitively -2 and all other divisors should fit fine.
>>> So the check likely needs Rx_b == -1 and a check for high bit in Rx_a ?
>> Looks like only Rx_a == INT64_MIN may cause the problem.
>> All other Rx_a numbers (from INT64_MIN+1 to INT64_MAX)
>> should be okay. Some selective testing below on x64 host:
>>
>> $ cat t5.c
>> #include <stdio.h>
>> #include <limits.h>
>>
>> unsigned long long res;
>> int main(void) {
>> volatile long long a;
>> long long i;
>> for (i = LLONG_MIN + 1; i <= LLONG_MIN + 100; i++) {
>> volatile long long b = -1;
>> a = i;
>> res += (unsigned long long)(a/b);
>> }
>> for (i = LLONG_MAX - 100; i <= LLONG_MAX - 1; i++) {
> Changing this test to i <= LLONG_MAX
> and compiling with gcc -O0 or clang -O2 or clang -O0
> is causing an exception,
> because 'a' becomes LLONG_MIN.
This is my theory.
If change to i <= LLONG_MAX, then after
i = LLONG_MAX, it will do 'i++' and then
it will become tricky since then undefined
behavior will pick in as 'i++' will become
out of range.
> Compilers are doing some odd code gen.
> I don't understand how 'i' can wrap this way.
>
>> volatile long long b = -1;
>> a = i;
>> res += (unsigned long long)(a/b);
>> }
>> printf("res = %llx\n", res);
>> return 0;
>> }
>> $ gcc -O2 t5.c && ./a.out
>> res = 64
>>
>> So I think it should be okay if the range is from LLONG_MIN + 1
>> to LLONG_MAX - 1.
>>
>> Now for LLONG_MAX/(-1)
>>
>> $ cat t6.c
>> #include <stdio.h>
>> #include <limits.h>
>> int main(void) {
>> volatile long long a = LLONG_MAX;
>> volatile long long b = -1;
>> printf("a/b = %lld\n", a/b);
>> return 0;
>> }
>> $ gcc -O2 t6.c && ./a.out
>> a/b = -9223372036854775807
>>
>> It is okay too. So I think LLONG_MIN/(-1) is the only case
>> we should take care of.
> The test shows that that's the case, but I still can wrap
> my head around that only LLONG_MIN/(-1) is a problem.
>
> Any math experts can explain this?
next prev parent reply other threads:[~2024-09-10 22:00 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-09 17:21 Kernel oops caused by signed divide Zac Ecob
2024-09-09 17:27 ` Yonghong Song
2024-09-09 17:29 ` Alexei Starovoitov
2024-09-09 23:47 ` Yonghong Song
2024-09-10 14:21 ` Yonghong Song
2024-09-10 14:44 ` Dave Thaler
2024-09-10 15:18 ` Yonghong Song
2024-09-10 15:21 ` Alexei Starovoitov
2024-09-10 18:12 ` Yonghong Song
2024-09-10 15:21 ` Alexei Starovoitov
2024-09-10 18:02 ` Yonghong Song
2024-09-10 18:25 ` Alexei Starovoitov
2024-09-10 19:32 ` Yonghong Song
2024-09-10 21:53 ` Alexei Starovoitov
2024-09-10 22:00 ` Yonghong Song [this message]
2024-09-10 22:43 ` Andrii Nakryiko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=269dc5c4-d867-45ff-974c-3cc8a5a30e77@linux.dev \
--to=yonghong.song@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=zacecob@protonmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.