Page faults in tracepoint caused by aliased pointer

* Page faults in tracepoint caused by aliased pointer
@ 2024-02-12 22:14 Yan Zhai
  2024-02-12 22:55 ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Yan Zhai @ 2024-02-12 22:14 UTC
  To: bpf; +Cc: kernel-team, jakub, ignat

Hello!

We are seeing page faults inside a BPF tracepoint program that accessed
not-present pages. This caused a kernel panic:

[717542.963064][T897981] BUG: unable to handle page fault for address: ffffffffff600c7d
[717542.975692][T897981] #PF: supervisor read access in kernel mode
[717542.986496][T897981] #PF: error_code(0x0000) - not-present page
[717542.997237][T897981] PGD 1965012067 P4D 1965012067 PUD 1965014067 PMD 1965016067 PTE 0
[717543.009965][T897981] Oops: 0000 [#1] PREEMPT SMP NOPTI
[717543.019835][T897981] CPU: 34 PID: 897981 Comm: warp-service Kdump: loaded Tainted: G           O       6.1.74-cloudflare-2024.1.14 #1
[717543.041140][T897981] Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
[717543.059260][T897981] RIP: 0010:bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
[717543.071449][T897981] Code: ff eb 07 48 8b bf f8 04 00 00 49 bb 80 0e 00 00 00 80 00 00 4c 39 df 72 0c 49 89 fb 49 81 c3 80 0e 00 00 73 05 45 31 ed eb 07 <4c> 8b af 80 0e 00 00 48 89 ee 48 83 c6 f0 48 bf 00 04 7a 0a 3d 9e
[717543.104780][T897981] RSP: 0018:ffffaece810efab8 EFLAGS: 00010286
[717543.115372][T897981] RAX: 0000000000000000 RBX: ffffcea96b4ae350 RCX: 0000000000000010
[717543.127887][T897981] RDX: 0000000000000030 RSI: ffffffffac168443 RDI: ffffffffff5ffdfd
[717543.140325][T897981] RBP: ffffaece810efb28 R08: ffff9e61e3b27c80 R09: 000000000000e000
[717543.152712][T897981] R10: 0000000000000041 R11: ffffffffff600c7d R12: 00028c9a1e371991
[717543.165011][T897981] R13: 0000000000000000 R14: ffff9e6339dce8c0 R15: ffff9e61e3b27c00
[717543.177253][T897981] FS:  00007f769a1fd6c0(0000) GS:ffff9e6bdfa80000(0000) knlGS:0000000000000000
[717543.194511][T897981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[717543.205261][T897981] CR2: ffffffffff600c7d CR3: 0000003d21706005 CR4: 0000000000770ee0
[717543.217411][T897981] PKRU: 55555554
[717543.224999][T897981] Call Trace:
[717543.232224][T897981]  <TASK>
[717543.239016][T897981]  ? __die+0x20/0x70
[717543.246661][T897981]  ? page_fault_oops+0x150/0x490
[717543.255270][T897981]  ? __sk_dst_check+0x39/0xa0
[717543.263548][T897981]  ? inet6_csk_route_socket+0x123/0x200
[717543.272622][T897981]  ? exc_page_fault+0x67/0x140
[717543.280831][T897981]  ? asm_exc_page_fault+0x22/0x30
[717543.289230][T897981]  ? tcp_data_queue+0xc03/0xe20
[717543.297374][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
[717543.307555][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x281/0xada
[717543.317638][T897981]  ? tcp_data_queue+0xc03/0xe20
[717543.325540][T897981]  bpf_trace_run3+0x92/0xc0
[717543.333026][T897981]  ? tcp_data_queue+0xc03/0xe20
[717543.340823][T897981]  kfree_skb_reason+0x7b/0xd0
[717543.348427][T897981]  tcp_data_queue+0xc03/0xe20
[717543.355985][T897981]  tcp_rcv_established+0x218/0x740
[717543.363944][T897981]  tcp_v4_do_rcv+0x157/0x290
[717543.371315][T897981]  tcp_v4_rcv+0xddd/0xf00
[717543.378330][T897981]  ? raw_local_deliver+0xc0/0x230
[717543.385973][T897981]  ip_protocol_deliver_rcu+0x32/0x200
[717543.393880][T897981]  ip_local_deliver_finish+0x73/0xa0
[717543.401616][T897981]  __netif_receive_skb_one_core+0x8b/0xa0
[717543.409751][T897981]  netif_receive_skb+0x38/0x160
[717543.416920][T897981]  tun_get_user+0xbe6/0x1080 [tun]
[717543.424292][T897981]  ? mlx5e_handle_rx_dim+0x6b/0x80 [mlx5_core]
[717543.432754][T897981]  ? mlx5e_napi_poll+0x710/0x720 [mlx5_core]
[717543.441007][T897981]  ? tun_chr_write_iter+0x69/0xb0 [tun]
[717543.448753][T897981]  tun_chr_write_iter+0x69/0xb0 [tun]
[717543.456312][T897981]  vfs_write+0x2a3/0x3b0
[717543.462722][T897981]  ksys_write+0x5f/0xe0
[717543.469018][T897981]  do_syscall_64+0x3b/0x90
[717543.475522][T897981]  entry_SYSCALL_64_after_hwframe+0x4c/0xb6
[717543.483443][T897981] RIP: 0033:0x7f76b3b3027f
[717543.489848][T897981] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 8c d5 f8 ff 48
[717543.515551][T897981] RSP: 002b:00007f769a1f9870 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[717543.526219][T897981] RAX: ffffffffffffffda RBX: 0000000000000500 RCX: 00007f76b3b3027f
[717543.536507][T897981] RDX: 0000000000000500 RSI: 00007f761a694a00 RDI: 00000000000015a4
[717543.546815][T897981] RBP: 00007f75f53cf600 R08: 0000000000000000 R09: 00000000000272c8
[717543.557136][T897981] R10: 00000000000075dc R11: 0000000000000293 R12: 00007f76b37b0198
[717543.567447][T897981] R13: 0000000000000000 R14: 00007f76b37a4000 R15: 0000000000000004
[717543.577777][T897981]  </TASK>
[717543.583106][T897981] Modules linked in: mptcp_diag raw_diag unix_diag xt_LOG nf_log_syslog overlay nft_compat xt_hashlimit ip_set_hash_netport xt_length esp4 nf_conntrack_netlink nft_fwd_netdev nf_dup_netdev xfrm_interface xfrm6_tunnel nft_numgen nft_log nft_limit dummy xfrm_user xfrm_algo fou6 ip6_tunnel tunnel6 ipip mpls_gso mpls_iptunnel mpls_router sit tunnel4 fou nft_ct nf_tables cls_bpf ip_gre gre ip_tunnel geneve ip6_udp_tunnel udp_tunnel zstd zstd_compress zram zsmalloc sch_ingress tcp_diag veth tun udp_diag inet_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6table_security ip6table_nat ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpmss iptable_filter xt_TCPMSS xt_bpf xt_limit xt_multiport xt_NFLOG nfnetlink_log xt_connbytes xt_connlabel xt_statistic xt_mark xt_connmark xt_conntrack iptable_mangle xt_nat iptable_nat nf_nat xt_owner xt_set xt_comment xt_tcpudp xt_CT iptable_raw
[717543.583186][T897981]  ip_set_hash_ip ip_set_hash_net ip_set nfnetlink tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 algif_skcipher af_alg raid0 md_mod essiv dm_crypt trusted asn1_encoder tee 8021q garp mrp stp llc nvme_fabrics ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 mlx5_core aesni_intel acpi_ipmi rapl ipmi_si mlxfw xhci_pci nvme tls ipmi_devintf tiny_power_button xhci_hcd nvme_core psample ccp i2c_piix4 ipmi_msghandler button fuse dm_mod dax efivarfs ip_tables x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
[717543.774881][T897981] CR2: ffffffffff600c7d

The panic happens when we inspect dropped out-of-order TCP packets in the
kfree_skb tracepoint with a tp_btf program and try to read the network
namespace cookie via:

skb->dev->nd_net.net->net_cookie

Code generation looks fine on x86_64 with 4-level page tables, but the
verifier-placed boundary check is not sufficient to catch the issue:
skb->dev is aliased with skb->rbnode in the same union once packets enter
the TCP state machine, and the out-of-order queue is one such rbnode user:

; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
 2bd:   movabs $0x800000000010,%r11
 2c7:   cmp    %r11,%r15
 2ca:   jb     0x000002d8
 2cc:   mov    %r15,%r11
 2cf:   add    $0x10,%r11
 2d6:   jae    0x000002dc
 2d8:   xor    %edi,%edi
 2da:   jmp    0x000002e0
 2dc:   mov    0x10(%r15),%rdi
; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
 2e0:   movabs $0x8000000004f8,%r11
 2ea:   cmp    %r11,%rdi   <--- (1) rdi is a valid rbnode*, not net_device*
 2ed:   jb     0x000002fb
 2ef:   mov    %rdi,%r11
 2f2:   add    $0x4f8,%r11
 2f9:   jae    0x000002ff
 2fb:   xor    %edi,%edi
 2fd:   jmp    0x00000306
 2ff:   mov    0x4f8(%rdi),%rdi
; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
 306:   movabs $0x800000000e80,%r11
 310:   cmp    %r11,%rdi  <--- (2) rdi is a wild ptr now
 313:   jb     0x00000321
 315:   mov    %rdi,%r11
 318:   add    $0xe80,%r11
 31f:   jae    0x00000326
 321:   xor    %r13d,%r13d
 324:   jmp    0x0000032d
 326:   mov    0xe80(%rdi),%r13 <--- (3) fault
 32d:   mov    %rbp,%rsi
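
The aliasing described above can be illustrated with a small user-space
sketch. The toy_* types are hypothetical, heavily simplified stand-ins
for sk_buff, rb_node and net_device; the field order is an assumption
chosen so that the dev slot overlays the rbnode's left-child link, and
the real layouts should be checked against the kernel in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the kernel definitions. */
struct toy_net_device { int ifindex; };

struct toy_rb_node {
	unsigned long parent_color;
	struct toy_rb_node *right;
	struct toy_rb_node *left;
};

struct toy_skb {
	union {
		struct {
			struct toy_skb *next;
			struct toy_skb *prev;
			struct toy_net_device *dev;
		};
		struct toy_rb_node rbnode;	/* same bytes, tree view */
	};
};

/* In this toy layout the dev slot and rbnode.left share one offset. */
static_assert(offsetof(struct toy_skb, dev) ==
	      offsetof(struct toy_skb, rbnode.left),
	      "dev and rbnode.left are expected to overlap");

/*
 * Link child under parent through the rbnode view, then return what a
 * reader of parent->dev would see: the address of another toy_skb, not
 * a toy_net_device.
 */
static uintptr_t dev_slot_after_rb_link(struct toy_skb *parent,
					struct toy_skb *child)
{
	parent->rbnode.left = &child->rbnode;
	return (uintptr_t)parent->dev;
}
```

Once an skb is linked into a tree, its dev slot holds a perfectly valid
kernel address, so a range check on the pointer value alone cannot tell
it apart from a real net_device pointer.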

Out-of-order (OOO) delivery happens a lot on our servers, but this is the
first time we have noticed such a panic, even though the program has been
deployed for a while. For the bpf list, I think the question is mainly
what to do in this scenario: the pointer at step (1) above is apparently
a valid kernel pointer, just not of the type we assumed, which leads to a
wild pointer at (2) and causes the fault at (3). I am not aware of a
general way to determine whether such an aliased pointer is good or not.
Is it possible to make the page fault safer in this case, for example by
returning from the PF handler to the end of the tracing program?

thanks
Yan
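
The check emitted before each load in the listing can be modelled in
user space. This is only a sketch of what the disassembly shows:
GUARD_BASE is read off the movabs immediates (0x800000000000 plus the
field offset), and the function names are hypothetical.

```c
#include <assert.h>	/* for checks in a caller */
#include <stdbool.h>
#include <stdint.h>

/* Base of the movabs immediates in the listing, before adding the offset. */
#define GUARD_BASE 0x800000000000ULL

/*
 * Model of the guard in front of "mov off(%reg),%dst":
 *   cmp %r11,%reg ; jb  -> destination is zeroed, load skipped
 *   add $off,%r11 ; jae -> load runs only if the add did not carry
 */
static bool guard_allows_load(uint64_t ptr, uint64_t off)
{
	if (ptr < GUARD_BASE + off)
		return false;		/* below the limit: zero the result */
	if (ptr + off < ptr)
		return false;		/* ptr + off wrapped around */
	return true;			/* the load is executed */
}

/* Address the load dereferences when the guard lets it through. */
static uint64_t effective_address(uint64_t ptr, uint64_t off)
{
	return ptr + off;
}
```

With the values from the oops, guard_allows_load(0xffffffffff5ffdfd,
0xe80) is true and effective_address() gives 0xffffffffff600c7d, which
matches CR2. The guard only rejects low or wrapping addresses, so a wild
but high kernel address reaches the load at step (3).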

* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-12 22:14 Page faults in tracepoint caused by aliased pointer Yan Zhai
@ 2024-02-12 22:55 ` Kumar Kartikeya Dwivedi
  2024-02-12 23:15   ` Ignat Korchagin
  0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-12 22:55 UTC
  To: Yan Zhai; +Cc: bpf, kernel-team, jakub, ignat

On Mon, 12 Feb 2024 at 23:14, Yan Zhai <yan@cloudflare.com> wrote:
>
> Hello!
>
> We are seeing page faults inside a BPF tracepoint program that accessed
> not-present pages. This caused a kernel panic:
>
> [717542.963064][T897981] BUG: unable to handle page fault for address: ffffffffff600c7d
> [717542.975692][T897981] #PF: supervisor read access in kernel mode
> [717542.986496][T897981] #PF: error_code(0x0000) - not-present page
> [717542.997237][T897981] PGD 1965012067 P4D 1965012067 PUD 1965014067 PMD 1965016067 PTE 0
> [717543.009965][T897981] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [717543.019835][T897981] CPU: 34 PID: 897981 Comm: warp-service Kdump: loaded Tainted: G           O       6.1.74-cloudflare-2024.1.14 #1
> [717543.041140][T897981] Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
> [717543.059260][T897981] RIP: 0010:bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> [717543.071449][T897981] Code: ff eb 07 48 8b bf f8 04 00 00 49 bb 80 0e 00 00 00 80 00 00 4c 39 df 72 0c 49 89 fb 49 81 c3 80 0e 00 00 73 05 45 31 ed eb 07 <4c> 8b af 80 0e 00 00 48 89 ee 48 83 c6 f0 48 bf 00 04 7a 0a 3d 9e
> [717543.104780][T897981] RSP: 0018:ffffaece810efab8 EFLAGS: 00010286
> [717543.115372][T897981] RAX: 0000000000000000 RBX: ffffcea96b4ae350 RCX: 0000000000000010
> [717543.127887][T897981] RDX: 0000000000000030 RSI: ffffffffac168443 RDI: ffffffffff5ffdfd
> [717543.140325][T897981] RBP: ffffaece810efb28 R08: ffff9e61e3b27c80 R09: 000000000000e000
> [717543.152712][T897981] R10: 0000000000000041 R11: ffffffffff600c7d R12: 00028c9a1e371991
> [717543.165011][T897981] R13: 0000000000000000 R14: ffff9e6339dce8c0 R15: ffff9e61e3b27c00
> [717543.177253][T897981] FS:  00007f769a1fd6c0(0000) GS:ffff9e6bdfa80000(0000) knlGS:0000000000000000
> [717543.194511][T897981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [717543.205261][T897981] CR2: ffffffffff600c7d CR3: 0000003d21706005 CR4: 0000000000770ee0
> [717543.217411][T897981] PKRU: 55555554
> [717543.224999][T897981] Call Trace:
> [717543.232224][T897981]  <TASK>
> [717543.239016][T897981]  ? __die+0x20/0x70
> [717543.246661][T897981]  ? page_fault_oops+0x150/0x490
> [717543.255270][T897981]  ? __sk_dst_check+0x39/0xa0
> [717543.263548][T897981]  ? inet6_csk_route_socket+0x123/0x200
> [717543.272622][T897981]  ? exc_page_fault+0x67/0x140
> [717543.280831][T897981]  ? asm_exc_page_fault+0x22/0x30
> [717543.289230][T897981]  ? tcp_data_queue+0xc03/0xe20
> [717543.297374][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> [717543.307555][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x281/0xada
> [717543.317638][T897981]  ? tcp_data_queue+0xc03/0xe20
> [717543.325540][T897981]  bpf_trace_run3+0x92/0xc0
> [717543.333026][T897981]  ? tcp_data_queue+0xc03/0xe20
> [717543.340823][T897981]  kfree_skb_reason+0x7b/0xd0
> [717543.348427][T897981]  tcp_data_queue+0xc03/0xe20
> [717543.355985][T897981]  tcp_rcv_established+0x218/0x740
> [717543.363944][T897981]  tcp_v4_do_rcv+0x157/0x290
> [717543.371315][T897981]  tcp_v4_rcv+0xddd/0xf00
> [717543.378330][T897981]  ? raw_local_deliver+0xc0/0x230
> [717543.385973][T897981]  ip_protocol_deliver_rcu+0x32/0x200
> [717543.393880][T897981]  ip_local_deliver_finish+0x73/0xa0
> [717543.401616][T897981]  __netif_receive_skb_one_core+0x8b/0xa0
> [717543.409751][T897981]  netif_receive_skb+0x38/0x160
> [717543.416920][T897981]  tun_get_user+0xbe6/0x1080 [tun]
> [717543.424292][T897981]  ? mlx5e_handle_rx_dim+0x6b/0x80 [mlx5_core]
> [717543.432754][T897981]  ? mlx5e_napi_poll+0x710/0x720 [mlx5_core]
> [717543.441007][T897981]  ? tun_chr_write_iter+0x69/0xb0 [tun]
> [717543.448753][T897981]  tun_chr_write_iter+0x69/0xb0 [tun]
> [717543.456312][T897981]  vfs_write+0x2a3/0x3b0
> [717543.462722][T897981]  ksys_write+0x5f/0xe0
> [717543.469018][T897981]  do_syscall_64+0x3b/0x90
> [717543.475522][T897981]  entry_SYSCALL_64_after_hwframe+0x4c/0xb6
> [717543.483443][T897981] RIP: 0033:0x7f76b3b3027f
> [717543.489848][T897981] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 8c d5 f8 ff 48
> [717543.515551][T897981] RSP: 002b:00007f769a1f9870 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [717543.526219][T897981] RAX: ffffffffffffffda RBX: 0000000000000500 RCX: 00007f76b3b3027f
> [717543.536507][T897981] RDX: 0000000000000500 RSI: 00007f761a694a00 RDI: 00000000000015a4
> [717543.546815][T897981] RBP: 00007f75f53cf600 R08: 0000000000000000 R09: 00000000000272c8
> [717543.557136][T897981] R10: 00000000000075dc R11: 0000000000000293 R12: 00007f76b37b0198
> [717543.567447][T897981] R13: 0000000000000000 R14: 00007f76b37a4000 R15: 0000000000000004
> [717543.577777][T897981]  </TASK>
> [717543.583106][T897981] Modules linked in: mptcp_diag raw_diag unix_diag xt_LOG nf_log_syslog overlay nft_compat xt_hashlimit ip_set_hash_netport xt_length esp4 nf_conntrack_netlink nft_fwd_netdev nf_dup_netdev xfrm_interface xfrm6_tunnel nft_numgen nft_log nft_limit dummy xfrm_user xfrm_algo fou6 ip6_tunnel tunnel6 ipip mpls_gso mpls_iptunnel mpls_router sit tunnel4 fou nft_ct nf_tables cls_bpf ip_gre gre ip_tunnel geneve ip6_udp_tunnel udp_tunnel zstd zstd_compress zram zsmalloc sch_ingress tcp_diag veth tun udp_diag inet_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6table_security ip6table_nat ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpmss iptable_filter xt_TCPMSS xt_bpf xt_limit xt_multiport xt_NFLOG nfnetlink_log xt_connbytes xt_connlabel xt_statistic xt_mark xt_connmark xt_conntrack iptable_mangle xt_nat iptable_nat nf_nat xt_owner xt_set xt_comment xt_tcpudp xt_CT iptable_raw
> [717543.583186][T897981]  ip_set_hash_ip ip_set_hash_net ip_set nfnetlink tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 algif_skcipher af_alg raid0 md_mod essiv dm_crypt trusted asn1_encoder tee 8021q garp mrp stp llc nvme_fabrics ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 mlx5_core aesni_intel acpi_ipmi rapl ipmi_si mlxfw xhci_pci nvme tls ipmi_devintf tiny_power_button xhci_hcd nvme_core psample ccp i2c_piix4 ipmi_msghandler button fuse dm_mod dax efivarfs ip_tables x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
> [717543.774881][T897981] CR2: ffffffffff600c7d
>
> The panic happens when we inspect dropped out-of-order TCP packets in the
> kfree_skb tracepoint with a tp_btf program and try to read the network
> namespace cookie via:
>
> skb->dev->nd_net.net->net_cookie
>
> Code generation looks fine on x86_64 with 4-level page tables, but the
> verifier-placed boundary check is not sufficient to catch the issue:
> skb->dev is aliased with skb->rbnode in the same union once packets enter
> the TCP state machine, and the out-of-order queue is one such rbnode user:
>
> ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
>  2bd:   movabs $0x800000000010,%r11
>  2c7:   cmp    %r11,%r15
>  2ca:   jb     0x000002d8
>  2cc:   mov    %r15,%r11
>  2cf:   add    $0x10,%r11
>  2d6:   jae    0x000002dc
>  2d8:   xor    %edi,%edi
>  2da:   jmp    0x000002e0
>  2dc:   mov    0x10(%r15),%rdi
> ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
>  2e0:   movabs $0x8000000004f8,%r11
>  2ea:   cmp    %r11,%rdi   <--- (1) rdi is a valid rbnode*, not net_device*
>  2ed:   jb     0x000002fb
>  2ef:   mov    %rdi,%r11
>  2f2:   add    $0x4f8,%r11
>  2f9:   jae    0x000002ff
>  2fb:   xor    %edi,%edi
>  2fd:   jmp    0x00000306
>  2ff:   mov    0x4f8(%rdi),%rdi
> ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
>  306:   movabs $0x800000000e80,%r11
>  310:   cmp    %r11,%rdi  <--- (2) rdi is a wild ptr now
>  313:   jb     0x00000321
>  315:   mov    %rdi,%r11
>  318:   add    $0xe80,%r11
>  31f:   jae    0x00000326
>  321:   xor    %r13d,%r13d
>  324:   jmp    0x0000032d
>  326:   mov    0xe80(%rdi),%r13 <--- (3) fault
>  32d:   mov    %rbp,%rsi
>
> Out-of-order (OOO) delivery happens a lot on our servers, but this is the
> first time we have noticed such a panic, even though the program has been
> deployed for a while. For the bpf list, I think the question is mainly
> what to do in this scenario: the pointer at step (1) above is apparently
> a valid kernel pointer, just not of the type we assumed, which leads to a
> wild pointer at (2) and causes the fault at (3). I am not aware of a
> general way to determine whether such an aliased pointer is good or not.
> Is it possible to make the page fault safer in this case, for example by
> returning from the PF handler to the end of the tracing program?
>

I think it is not supposed to panic, since exception handling for such
PROBE_MEM loads should handle such a case and mark the destination as
zero.
Something must be broken with that.

Which kernel do you observe this problem with? And do you have a
reference version where you do not see it?
Do you have a reduced reproducer for this that I could play with?
Just the part of the tp_btf program necessary to trigger this?

There were some changes made to the JIT code around the bounds
checking to reduce the instruction count, in 90156f4bfa21 ("bpf, x86:
Improve PROBE_MEM runtime load check"). That commit changed, in
particular, the src_reg == dst_reg case, which is what happens in the
splat at 0x2ff.
Nothing else comes immediately to mind in terms of changes that could
affect this exception handling.
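
The expected behaviour described above (a faulting PROBE_MEM load gets
its destination zeroed and execution resumes after it) can be sketched
as a toy user-space model. Everything here is hypothetical: the toy_*
structs, the table layout and the register numbering are illustrative
assumptions, and the kernel's real exception table encoding differs.

```c
#include <assert.h>	/* for checks in a caller */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file; index 13 stands in for %r13. */
struct toy_regs {
	uint64_t ip;
	uint64_t gpr[16];
};

/* One entry per load instruction that is allowed to fault. */
struct toy_extable_entry {
	uint64_t insn_ip;	/* address of the load */
	uint8_t insn_len;	/* bytes to skip to reach the next insn */
	uint8_t dst_reg;	/* register the load would have written */
};

/*
 * Returns 1 if the fault at regs->ip is covered by the table: the
 * destination register is zeroed and ip moves past the load.
 * Returns 0 otherwise, which in the kernel would end in an oops.
 */
static int toy_fixup_fault(const struct toy_extable_entry *tbl, size_t n,
			   struct toy_regs *regs)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].insn_ip != regs->ip)
			continue;
		regs->gpr[tbl[i].dst_reg] = 0;
		regs->ip += tbl[i].insn_len;
		return 1;
	}
	return 0;
}
```

In this model, an entry for the 7-byte load at 0x326 with destination
r13 would turn the reported fault into r13 = 0 and resume at 0x32d, the
next instruction in the listing. The panic suggests that, for this
address, no such fixup was applied.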

> thanks
> Yan
>

* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-12 22:55 ` Kumar Kartikeya Dwivedi
@ 2024-02-12 23:15   ` Ignat Korchagin
  2024-02-12 23:27     ` Yan Zhai
  2024-02-12 23:33     ` Alexei Starovoitov
  0 siblings, 2 replies; 11+ messages in thread
From: Ignat Korchagin @ 2024-02-12 23:15 UTC
  To: Kumar Kartikeya Dwivedi; +Cc: Yan Zhai, bpf, kernel-team, jakub

On Mon, Feb 12, 2024 at 10:55 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Mon, 12 Feb 2024 at 23:14, Yan Zhai <yan@cloudflare.com> wrote:
> >
> > Hello!
> >
> > We are seeing page faults inside a BPF tracepoint program that accessed
> > not-present pages. This caused a kernel panic:
> >
> > [717542.963064][T897981] BUG: unable to handle page fault for address: ffffffffff600c7d
> > [717542.975692][T897981] #PF: supervisor read access in kernel mode
> > [717542.986496][T897981] #PF: error_code(0x0000) - not-present page
> > [717542.997237][T897981] PGD 1965012067 P4D 1965012067 PUD 1965014067 PMD 1965016067 PTE 0
> > [717543.009965][T897981] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > [717543.019835][T897981] CPU: 34 PID: 897981 Comm: warp-service Kdump: loaded Tainted: G           O       6.1.74-cloudflare-2024.1.14 #1
> > [717543.041140][T897981] Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
> > [717543.059260][T897981] RIP: 0010:bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> > [717543.071449][T897981] Code: ff eb 07 48 8b bf f8 04 00 00 49 bb 80 0e 00 00 00 80 00 00 4c 39 df 72 0c 49 89 fb 49 81 c3 80 0e 00 00 73 05 45 31 ed eb 07 <4c> 8b af 80 0e 00 00 48 89 ee 48 83 c6 f0 48 bf 00 04 7a 0a 3d 9e
> > [717543.104780][T897981] RSP: 0018:ffffaece810efab8 EFLAGS: 00010286
> > [717543.115372][T897981] RAX: 0000000000000000 RBX: ffffcea96b4ae350 RCX: 0000000000000010
> > [717543.127887][T897981] RDX: 0000000000000030 RSI: ffffffffac168443 RDI: ffffffffff5ffdfd
> > [717543.140325][T897981] RBP: ffffaece810efb28 R08: ffff9e61e3b27c80 R09: 000000000000e000
> > [717543.152712][T897981] R10: 0000000000000041 R11: ffffffffff600c7d R12: 00028c9a1e371991
> > [717543.165011][T897981] R13: 0000000000000000 R14: ffff9e6339dce8c0 R15: ffff9e61e3b27c00
> > [717543.177253][T897981] FS:  00007f769a1fd6c0(0000) GS:ffff9e6bdfa80000(0000) knlGS:0000000000000000
> > [717543.194511][T897981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [717543.205261][T897981] CR2: ffffffffff600c7d CR3: 0000003d21706005 CR4: 0000000000770ee0
> > [717543.217411][T897981] PKRU: 55555554
> > [717543.224999][T897981] Call Trace:
> > [717543.232224][T897981]  <TASK>
> > [717543.239016][T897981]  ? __die+0x20/0x70
> > [717543.246661][T897981]  ? page_fault_oops+0x150/0x490
> > [717543.255270][T897981]  ? __sk_dst_check+0x39/0xa0
> > [717543.263548][T897981]  ? inet6_csk_route_socket+0x123/0x200
> > [717543.272622][T897981]  ? exc_page_fault+0x67/0x140
> > [717543.280831][T897981]  ? asm_exc_page_fault+0x22/0x30
> > [717543.289230][T897981]  ? tcp_data_queue+0xc03/0xe20
> > [717543.297374][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> > [717543.307555][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x281/0xada
> > [717543.317638][T897981]  ? tcp_data_queue+0xc03/0xe20
> > [717543.325540][T897981]  bpf_trace_run3+0x92/0xc0
> > [717543.333026][T897981]  ? tcp_data_queue+0xc03/0xe20
> > [717543.340823][T897981]  kfree_skb_reason+0x7b/0xd0
> > [717543.348427][T897981]  tcp_data_queue+0xc03/0xe20
> > [717543.355985][T897981]  tcp_rcv_established+0x218/0x740
> > [717543.363944][T897981]  tcp_v4_do_rcv+0x157/0x290
> > [717543.371315][T897981]  tcp_v4_rcv+0xddd/0xf00
> > [717543.378330][T897981]  ? raw_local_deliver+0xc0/0x230
> > [717543.385973][T897981]  ip_protocol_deliver_rcu+0x32/0x200
> > [717543.393880][T897981]  ip_local_deliver_finish+0x73/0xa0
> > [717543.401616][T897981]  __netif_receive_skb_one_core+0x8b/0xa0
> > [717543.409751][T897981]  netif_receive_skb+0x38/0x160
> > [717543.416920][T897981]  tun_get_user+0xbe6/0x1080 [tun]
> > [717543.424292][T897981]  ? mlx5e_handle_rx_dim+0x6b/0x80 [mlx5_core]
> > [717543.432754][T897981]  ? mlx5e_napi_poll+0x710/0x720 [mlx5_core]
> > [717543.441007][T897981]  ? tun_chr_write_iter+0x69/0xb0 [tun]
> > [717543.448753][T897981]  tun_chr_write_iter+0x69/0xb0 [tun]
> > [717543.456312][T897981]  vfs_write+0x2a3/0x3b0
> > [717543.462722][T897981]  ksys_write+0x5f/0xe0
> > [717543.469018][T897981]  do_syscall_64+0x3b/0x90
> > [717543.475522][T897981]  entry_SYSCALL_64_after_hwframe+0x4c/0xb6
> > [717543.483443][T897981] RIP: 0033:0x7f76b3b3027f
> > [717543.489848][T897981] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 8c d5 f8 ff 48
> > [717543.515551][T897981] RSP: 002b:00007f769a1f9870 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> > [717543.526219][T897981] RAX: ffffffffffffffda RBX: 0000000000000500 RCX: 00007f76b3b3027f
> > [717543.536507][T897981] RDX: 0000000000000500 RSI: 00007f761a694a00 RDI: 00000000000015a4
> > [717543.546815][T897981] RBP: 00007f75f53cf600 R08: 0000000000000000 R09: 00000000000272c8
> > [717543.557136][T897981] R10: 00000000000075dc R11: 0000000000000293 R12: 00007f76b37b0198
> > [717543.567447][T897981] R13: 0000000000000000 R14: 00007f76b37a4000 R15: 0000000000000004
> > [717543.577777][T897981]  </TASK>
> > [717543.583106][T897981] Modules linked in: mptcp_diag raw_diag unix_diag xt_LOG nf_log_syslog overlay nft_compat xt_hashlimit ip_set_hash_netport xt_length esp4 nf_conntrack_netlink nft_fwd_netdev nf_dup_netdev xfrm_interface xfrm6_tunnel nft_numgen nft_log nft_limit dummy xfrm_user xfrm_algo fou6 ip6_tunnel tunnel6 ipip mpls_gso mpls_iptunnel mpls_router sit tunnel4 fou nft_ct nf_tables cls_bpf ip_gre gre ip_tunnel geneve ip6_udp_tunnel udp_tunnel zstd zstd_compress zram zsmalloc sch_ingress tcp_diag veth tun udp_diag inet_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6table_security ip6table_nat ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpmss iptable_filter xt_TCPMSS xt_bpf xt_limit xt_multiport xt_NFLOG nfnetlink_log xt_connbytes xt_connlabel xt_statistic xt_mark xt_connmark xt_conntrack iptable_mangle xt_nat iptable_nat nf_nat xt_owner xt_set xt_comment xt_tcpudp xt_CT iptable_raw
> > [717543.583186][T897981]  ip_set_hash_ip ip_set_hash_net ip_set nfnetlink tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 algif_skcipher af_alg raid0 md_mod essiv dm_crypt trusted asn1_encoder tee 8021q garp mrp stp llc nvme_fabrics ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 mlx5_core aesni_intel acpi_ipmi rapl ipmi_si mlxfw xhci_pci nvme tls ipmi_devintf tiny_power_button xhci_hcd nvme_core psample ccp i2c_piix4 ipmi_msghandler button fuse dm_mod dax efivarfs ip_tables x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
> > [717543.774881][T897981] CR2: ffffffffff600c7d
> >
> > The panic happens when we inspect dropped out-of-order TCP packets in the
> > kfree_skb tracepoint with a tp_btf program and try to read the network
> > namespace cookie via:
> >
> > skb->dev->nd_net.net->net_cookie
> >
> > Code generation looks fine on x86_64 with 4-level page tables, but the
> > verifier-placed boundary check is not sufficient to catch the issue:
> > skb->dev is aliased with skb->rbnode in the same union once packets enter
> > the TCP state machine, and the out-of-order queue is one such rbnode user:
> >
> > ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
> >  2bd:   movabs $0x800000000010,%r11
> >  2c7:   cmp    %r11,%r15
> >  2ca:   jb     0x000002d8
> >  2cc:   mov    %r15,%r11
> >  2cf:   add    $0x10,%r11
> >  2d6:   jae    0x000002dc
> >  2d8:   xor    %edi,%edi
> >  2da:   jmp    0x000002e0
> >  2dc:   mov    0x10(%r15),%rdi
> > ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
> >  2e0:   movabs $0x8000000004f8,%r11
> >  2ea:   cmp    %r11,%rdi   <--- (1) rdi is a valid rbnode*, not net_device*
> >  2ed:   jb     0x000002fb
> >  2ef:   mov    %rdi,%r11
> >  2f2:   add    $0x4f8,%r11
> >  2f9:   jae    0x000002ff
> >  2fb:   xor    %edi,%edi
> >  2fd:   jmp    0x00000306
> >  2ff:   mov    0x4f8(%rdi),%rdi
> > ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
> >  306:   movabs $0x800000000e80,%r11
> >  310:   cmp    %r11,%rdi  <--- (2) rdi is a wild ptr now
> >  313:   jb     0x00000321
> >  315:   mov    %rdi,%r11
> >  318:   add    $0xe80,%r11
> >  31f:   jae    0x00000326
> >  321:   xor    %r13d,%r13d
> >  324:   jmp    0x0000032d
> >  326:   mov    0xe80(%rdi),%r13 <--- (3) fault
> >  32d:   mov    %rbp,%rsi
> >
> > Out-of-order (OOO) delivery happens a lot on our servers, but this is the
> > first time we have noticed such a panic, even though the program has been
> > deployed for a while. For the bpf list, I think the question is mainly
> > what to do in this scenario: the pointer at step (1) above is apparently
> > a valid kernel pointer, just not of the type we assumed, which leads to a
> > wild pointer at (2) and causes the fault at (3). I am not aware of a
> > general way to determine whether such an aliased pointer is good or not.
> > Is it possible to make the page fault safer in this case, for example by
> > returning from the PF handler to the end of the tracing program?
> >
>
> I think it is not supposed to panic, since exception handling for such
> PROBE_MEM loads should handle such a case and mark the destination as
> zero.
> Something must be broken with that.
>
> Which kernel do you observe this problem with? And do you have a
> reference version where you do not see it?
> Do you have a reduced reproducer for this that I could play with?
> Just the part of the tp_btf program necessary to trigger this?

We were able to reproduce this with a simple bpftrace one-liner
(bpftrace version 0.17.1):

$ sudo bpftrace -kk -e 'BEGIN { print(*(uint8 *)0xffffffffff600c7d); exit(); }'
WARNING: Addrspace is not set
Attaching 1 probe...
Killed
[288931.216699][T109754] BUG: unable to handle page fault for address: ffffffffff600c7d
[288931.217143][T109754] #PF: supervisor read access in kernel mode
[288931.217143][T109754] #PF: error_code(0x0000) - not-present page
[288931.217143][T109754] PGD fa5a1e067 P4D fa5a1e067 PUD fa5a20067 PMD fa5a22067 PTE 0
[288931.217143][T109754] Oops: 0000 [#1] PREEMPT SMP NOPTI
[288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted 6.6.16+ #10
[288931.217143][T109754] Hardware name: KubeVirt None/RHEL, BIOS edk2-20221207gitfff6d81270b5-9.el9 12/07/2022
[288931.217143][T109754] RIP: 0010:copy_from_kernel_nofault+0x89/0xe0
[288931.217143][T109754] Code: 48 83 c3 04 48 83 c5 04 49 83 ec 04 49 83 fc 01 76 13 66 8b 03 66 89 45 00 48 83 c3 02 48 83 c5 02 49 83 ec 02 4d 85 e4 74 05 <8a> 03 88 45 00 65 48 8b 04 25 80 11 03 00 83 a8 54 1b 00 00 01 31
[288931.217143][T109754] RSP: 0018:ffff93d787777d38 EFLAGS: 00010202
[288931.217143][T109754] RAX: ffff910d4830a0c0 RBX: ffffffffff600c7d RCX: 0000000000000010
[288931.217143][T109754] RDX: 0000000000000030 RSI: 0000000000000001 RDI: ffffffffff600c7d
[288931.217143][T109754] RBP: ffff93d787777d87 R08: 0101010101010101 R09: 00000000fffffdf4
[288931.217143][T109754] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[288931.217143][T109754] R13: ffff93d7807eb000 R14: 000000000000000a R15: 0000000000000000
[288931.217143][T109754] FS:  00007f2818fd79c0(0000) GS:ffff911c7f800000(0000) knlGS:0000000000000000
[288931.217143][T109754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[288931.217143][T109754] CR2: ffffffffff600c7d CR3: 00000001174f4000 CR4: 0000000000350ee0
[288931.217143][T109754] Call Trace:
[288931.217143][T109754]  <TASK>
[288931.217143][T109754]  ? __die+0x1f/0x70
[288931.217143][T109754]  ? page_fault_oops+0x151/0x480
[288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
[288931.217143][T109754]  ? alloc_empty_file+0x7a/0x120
[288931.217143][T109754]  ? __d_instantiate+0x34/0xf0
[288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
[288931.217143][T109754]  ? alloc_file+0x9b/0x170
[288931.217143][T109754]  ? exc_page_fault+0x68/0x140
[288931.217143][T109754]  ? asm_exc_page_fault+0x22/0x30
[288931.217143][T109754]  ? copy_from_kernel_nofault+0x89/0xe0
[288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
[288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
[288931.217143][T109754]  bpf_prog_27620a19791a7c9c_BEGIN+0x2e/0xe7
[288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
[288931.217143][T109754]  __bpf_prog_test_run_raw_tp+0x2e/0x90
[288931.217143][T109754]  bpf_prog_test_run_raw_tp+0xe6/0x1c0
[288931.217143][T109754]  __sys_bpf+0x93a/0x26a0
[288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
[288931.217143][T109754]  ? __check_object_size+0x16a/0x2c0
[288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
[288931.217143][T109754]  __x64_sys_bpf+0x1a/0x30
[288931.217143][T109754]  do_syscall_64+0x3a/0x90
[288931.217143][T109754]  entry_SYSCALL_64_after_hwframe+0x56/0xc0
[288931.217143][T109754] RIP: 0033:0x7f281b320719
[288931.217143][T109754] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
[288931.217143][T109754] RSP: 002b:00007ffdab7f0118 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
[288931.217143][T109754] RAX: ffffffffffffffda RBX: 00007ffdab7f01e8 RCX: 00007f281b320719
[288931.217143][T109754] RDX: 0000000000000050 RSI: 00007ffdab7f0120 RDI: 000000000000000a
[288931.217143][T109754] RBP: 00000000024e27a0 R08: 00007f28100cf880 R09: 0000000000000016
[288931.217143][T109754] R10: 0000000000000007 R11: 0000000000000246 R12: 00000000ffffffff
[288931.217143][T109754] R13: 00007ffdab7f10c8 R14: 000000000000000d R15: 00000000024e27a0
[288931.217143][T109754]  </TASK>
[288931.217143][T109754] Modules linked in: xt_bpf xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables br_netfilter bridge stp llc overlay kvm_amd ccp kvm irqbypass crc32_pclmul sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd virtio_balloon virtio_console tiny_power_button button fuse dm_mod dax configfs nfnetlink efivarfs ip_tables x_tables virtio_net net_failover virtio_blk virtio_scsi failover crc32c_intel i2c_i801 virtio_pci i2c_smbus virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
[288931.217143][T109754] CR2: ffffffffff600c7d
[288931.217143][T109754] ---[ end trace 0000000000000000 ]---
[288931.509063][T109754] RIP: 0010:copy_from_kernel_nofault+0x89/0xe0
[288931.509063][T109754] Code: 48 83 c3 04 48 83 c5 04 49 83 ec 04 49
83 fc 01 76 13 66 8b 03 66 89 45 00 48 83 c3 02 48 83 c5 02 49 83 ec
02 4d 85 e4 74 05 <8a> 03 88 45 00 65 48 8b 04 25 80 11 03 00 83 a8 54
1b 00 00 01 31
[288931.509063][T109754] RSP: 0018:ffff93d787777d38 EFLAGS: 00010202
[288931.509063][T109754] RAX: ffff910d4830a0c0 RBX: ffffffffff600c7d
RCX: 0000000000000010
[288931.509063][T109754] RDX: 0000000000000030 RSI: 0000000000000001
RDI: ffffffffff600c7d
[288931.509063][T109754] RBP: ffff93d787777d87 R08: 0101010101010101
R09: 00000000fffffdf4
[288931.509063][T109754] R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000001
[288931.509063][T109754] R13: ffff93d7807eb000 R14: 000000000000000a
R15: 0000000000000000
[288931.509063][T109754] FS:  00007f2818fd79c0(0000)
GS:ffff911c7f800000(0000) knlGS:0000000000000000
[288931.509063][T109754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[288931.509063][T109754] CR2: ffffffffff600c7d CR3: 00000001174f4000
CR4: 0000000000350ee0
[288931.509063][T109754] note: bpftrace[109754] exited with irqs disabled
[288932.319062][T109754] note: bpftrace[109754] exited with preempt_count 1

And Jakub, CCed here, reproduced it on 6.8.0-rc2+

> There were some changes made to the JIT code around the bounds
> checking to reduce the instruction count.
> That was in 90156f4bfa21 ("bpf, x86: Improve PROBE_MEM runtime load check").
> Especially when src_reg == dst_reg, the case which happens in the
> splat at 0x2ff.
> Nothing else comes immediately to mind in terms of changes that could
> affect this exception handling stuff.
>
> > thanks
> > Yan
> >

Ignat

* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-12 23:15   ` Ignat Korchagin
@ 2024-02-12 23:27     ` Yan Zhai
  2024-02-12 23:33     ` Alexei Starovoitov
  1 sibling, 0 replies; 11+ messages in thread
From: Yan Zhai @ 2024-02-12 23:27 UTC (permalink / raw)
  To: Ignat Korchagin; +Cc: Kumar Kartikeya Dwivedi, bpf, kernel-team, jakub

On Mon, Feb 12, 2024 at 5:16 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> On Mon, Feb 12, 2024 at 10:55 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Mon, 12 Feb 2024 at 23:14, Yan Zhai <yan@cloudflare.com> wrote:
> > >
> > > Hello!
> > >
> > > We are getting page fault errors inside BPF tracepoint that accessed
> > > not-present pages. This caused kernel panic:
> > >
> > > [717542.963064][T897981] BUG: unable to handle page fault for address: ffffffffff600c7d
> > > [717542.975692][T897981] #PF: supervisor read access in kernel mode
> > > [717542.986496][T897981] #PF: error_code(0x0000) - not-present page
> > > [717542.997237][T897981] PGD 1965012067 P4D 1965012067 PUD 1965014067 PMD 1965016067 PTE 0
> > > [717543.009965][T897981] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > [717543.019835][T897981] CPU: 34 PID: 897981 Comm: warp-service Kdump: loaded Tainted: G           O       6.1.74-cloudflare-2024.1.14 #1
> > > [717543.041140][T897981] Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
> > > [717543.059260][T897981] RIP: 0010:bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> > > [717543.071449][T897981] Code: ff eb 07 48 8b bf f8 04 00 00 49 bb 80 0e 00 00 00 80 00 00 4c 39 df 72 0c 49 89 fb 49 81 c3 80 0e 00 00 73 05 45 31 ed eb 07 <4c> 8b af 80 0e 00 00 48 89 ee 48 83 c6 f0 48 bf 00 04 7a 0a 3d 9e
> > > [717543.104780][T897981] RSP: 0018:ffffaece810efab8 EFLAGS: 00010286
> > > [717543.115372][T897981] RAX: 0000000000000000 RBX: ffffcea96b4ae350 RCX: 0000000000000010
> > > [717543.127887][T897981] RDX: 0000000000000030 RSI: ffffffffac168443 RDI: ffffffffff5ffdfd
> > > [717543.140325][T897981] RBP: ffffaece810efb28 R08: ffff9e61e3b27c80 R09: 000000000000e000
> > > [717543.152712][T897981] R10: 0000000000000041 R11: ffffffffff600c7d R12: 00028c9a1e371991
> > > [717543.165011][T897981] R13: 0000000000000000 R14: ffff9e6339dce8c0 R15: ffff9e61e3b27c00
> > > [717543.177253][T897981] FS:  00007f769a1fd6c0(0000) GS:ffff9e6bdfa80000(0000) knlGS:0000000000000000
> > > [717543.194511][T897981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [717543.205261][T897981] CR2: ffffffffff600c7d CR3: 0000003d21706005 CR4: 0000000000770ee0
> > > [717543.217411][T897981] PKRU: 55555554
> > > [717543.224999][T897981] Call Trace:
> > > [717543.232224][T897981]  <TASK>
> > > [717543.239016][T897981]  ? __die+0x20/0x70
> > > [717543.246661][T897981]  ? page_fault_oops+0x150/0x490
> > > [717543.255270][T897981]  ? __sk_dst_check+0x39/0xa0
> > > [717543.263548][T897981]  ? inet6_csk_route_socket+0x123/0x200
> > > [717543.272622][T897981]  ? exc_page_fault+0x67/0x140
> > > [717543.280831][T897981]  ? asm_exc_page_fault+0x22/0x30
> > > [717543.289230][T897981]  ? tcp_data_queue+0xc03/0xe20
> > > [717543.297374][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> > > [717543.307555][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x281/0xada
> > > [717543.317638][T897981]  ? tcp_data_queue+0xc03/0xe20
> > > [717543.325540][T897981]  bpf_trace_run3+0x92/0xc0
> > > [717543.333026][T897981]  ? tcp_data_queue+0xc03/0xe20
> > > [717543.340823][T897981]  kfree_skb_reason+0x7b/0xd0
> > > [717543.348427][T897981]  tcp_data_queue+0xc03/0xe20
> > > [717543.355985][T897981]  tcp_rcv_established+0x218/0x740
> > > [717543.363944][T897981]  tcp_v4_do_rcv+0x157/0x290
> > > [717543.371315][T897981]  tcp_v4_rcv+0xddd/0xf00
> > > [717543.378330][T897981]  ? raw_local_deliver+0xc0/0x230
> > > [717543.385973][T897981]  ip_protocol_deliver_rcu+0x32/0x200
> > > [717543.393880][T897981]  ip_local_deliver_finish+0x73/0xa0
> > > [717543.401616][T897981]  __netif_receive_skb_one_core+0x8b/0xa0
> > > [717543.409751][T897981]  netif_receive_skb+0x38/0x160
> > > [717543.416920][T897981]  tun_get_user+0xbe6/0x1080 [tun]
> > > [717543.424292][T897981]  ? mlx5e_handle_rx_dim+0x6b/0x80 [mlx5_core]
> > > [717543.432754][T897981]  ? mlx5e_napi_poll+0x710/0x720 [mlx5_core]
> > > [717543.441007][T897981]  ? tun_chr_write_iter+0x69/0xb0 [tun]
> > > [717543.448753][T897981]  tun_chr_write_iter+0x69/0xb0 [tun]
> > > [717543.456312][T897981]  vfs_write+0x2a3/0x3b0
> > > [717543.462722][T897981]  ksys_write+0x5f/0xe0
> > > [717543.469018][T897981]  do_syscall_64+0x3b/0x90
> > > [717543.475522][T897981]  entry_SYSCALL_64_after_hwframe+0x4c/0xb6
> > > [717543.483443][T897981] RIP: 0033:0x7f76b3b3027f
> > > [717543.489848][T897981] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 8c d5 f8 ff 48
> > > [717543.515551][T897981] RSP: 002b:00007f769a1f9870 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> > > [717543.526219][T897981] RAX: ffffffffffffffda RBX: 0000000000000500 RCX: 00007f76b3b3027f
> > > [717543.536507][T897981] RDX: 0000000000000500 RSI: 00007f761a694a00 RDI: 00000000000015a4
> > > [717543.546815][T897981] RBP: 00007f75f53cf600 R08: 0000000000000000 R09: 00000000000272c8
> > > [717543.557136][T897981] R10: 00000000000075dc R11: 0000000000000293 R12: 00007f76b37b0198
> > > [717543.567447][T897981] R13: 0000000000000000 R14: 00007f76b37a4000 R15: 0000000000000004
> > > [717543.577777][T897981]  </TASK>
> > > [717543.583106][T897981] Modules linked in: mptcp_diag raw_diag unix_diag xt_LOG nf_log_syslog overlay nft_compat xt_hashlimit ip_set_hash_netport xt_length esp4 nf_conntrack_netlink nft_fwd_netdev nf_dup_netdev xfrm_interface xfrm6_tunnel nft_numgen nft_log nft_limit dummy xfrm_user xfrm_algo fou6 ip6_tunnel tunnel6 ipip mpls_gso mpls_iptunnel mpls_router sit tunnel4 fou nft_ct nf_tables cls_bpf ip_gre gre ip_tunnel geneve ip6_udp_tunnel udp_tunnel zstd zstd_compress zram zsmalloc sch_ingress tcp_diag veth tun udp_diag inet_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6table_security ip6table_nat ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpmss iptable_filter xt_TCPMSS xt_bpf xt_limit xt_multiport xt_NFLOG nfnetlink_log xt_connbytes xt_connlabel xt_statistic xt_mark xt_connmark xt_conntrack iptable_mangle xt_nat iptable_nat nf_nat xt_owner xt_set xt_comment xt_tcpudp xt_CT iptable_raw
> > > [717543.583186][T897981]  ip_set_hash_ip ip_set_hash_net ip_set nfnetlink tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 algif_skcipher af_alg raid0 md_mod essiv dm_crypt trusted asn1_encoder tee 8021q garp mrp stp llc nvme_fabrics ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 mlx5_core aesni_intel acpi_ipmi rapl ipmi_si mlxfw xhci_pci nvme tls ipmi_devintf tiny_power_button xhci_hcd nvme_core psample ccp i2c_piix4 ipmi_msghandler button fuse dm_mod dax efivarfs ip_tables x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
> > > [717543.774881][T897981] CR2: ffffffffff600c7d
> > >
> > > The panic happens as we inspect dropped out of order TCP packets in kfree_skb
> > > tracepoint with a tp_btf program, and try to read out the network namespace
> > > cookie via:
> > >
> > > skb->dev->nd_net.net->net_cookie
> > >
> > > Code generation looks fine on x86_64 with 4 layer pagetable, but the verifier
> > > placed boundary check is not sufficient to catch the issue: skb->dev is alised
> > > as skb->rbnode in the same union after packets entered TCP state machine, and
> > > the out of order queue is one of such rbnode users:
> > >
> > > ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
> > >  2bd:   movabs $0x800000000010,%r11
> > >  2c7:   cmp    %r11,%r15
> > >  2ca:   jb     0x000002d8
> > >  2cc:   mov    %r15,%r11
> > >  2cf:   add    $0x10,%r11
> > >  2d6:   jae    0x000002dc
> > >  2d8:   xor    %edi,%edi
> > >  2da:   jmp    0x000002e0
> > >  2dc:   mov    0x10(%r15),%rdi
> > > ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
> > >  2e0:   movabs $0x8000000004f8,%r11
> > >  2ea:   cmp    %r11,%rdi   <--- (1) rdi is a valid rbnode*, not net_device*
> > >  2ed:   jb     0x000002fb
> > >  2ef:   mov    %rdi,%r11
> > >  2f2:   add    $0x4f8,%r11
> > >  2f9:   jae    0x000002ff
> > >  2fb:   xor    %edi,%edi
> > >  2fd:   jmp    0x00000306
> > >  2ff:   mov    0x4f8(%rdi),%rdi
> > > ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
> > >  306:   movabs $0x800000000e80,%r11
> > >  310:   cmp    %r11,%rdi  <--- (2) rdi is a wild ptr now
> > >  313:   jb     0x00000321
> > >  315:   mov    %rdi,%r11
> > >  318:   add    $0xe80,%r11
> > >  31f:   jae    0x00000326
> > >  321:   xor    %r13d,%r13d
> > >  324:   jmp    0x0000032d
> > >  326:   mov    0xe80(%rdi),%r13 <--- (3) fault
> > >  32d:   mov    %rbp,%rsi
> > >
> > > OOO happens a lot on our servers but this is the first time we noticed
> > > such panic since we had deployed the program for a while. For bpf list
> > > I think the question is mainly about what to do in this scenario:
> > > apparently it is a valid kernel pointer at step (1) above, but it's
> > > just not the type we assumed, which leads to a wild pointer at (2) and
> > > caused fault at (3). I am not aware of a way to determine such aliased
> > > pointer is good or not in general. Is it possible to PF safer in this
> > > case, like returning from PF handler to the end of tracing program?
> > >
> >
> > I think it is not supposed to panic, since exception handling for such
> > PROBE_MEM loads should handle such a case and mark the destination as
> > zero.
> > Something must be broken with that.

Just to clarify, by exception handling, do you mean there is some
special treatment in the page fault handler? Or do you mean the page
fault should not happen in the first place because the BPF verifier
should catch it?

> >
> > Which kernel do you observe this problem with? And do you have a
> > reference version where you do not see it?
We saw this once today on a 6.1.74 kernel with some internal
vendor patches, but none of those seem related to this situation. Other
6.1.74 kernels are doing fine. To be clear, I have seen netns cookies
read as 0 when skb->dev is NULL, but I am not sure whether a valid but
aliased pointer is handled the same way.

> > Do you have a reduced reproducer for this that I could play with?
> > Just the part of the tp_btf program necessary to trigger this?
>
> We were able to reproduce this with a simple bpftrace (version 0.17.1):
>
> $ sudo bpftrace -kk -e 'BEGIN { print(*(uint8 *)0xffffffffff600c7d); exit(); }'
> WARNING: Addrspace is not set
> Attaching 1 probe...
> Killed
> [288931.216699][T109754] BUG: unable to handle page fault for address:
> ffffffffff600c7d
> [288931.217143][T109754] #PF: supervisor read access in kernel mode
> [288931.217143][T109754] #PF: error_code(0x0000) - not-present page
> [288931.217143][T109754] PGD fa5a1e067 P4D fa5a1e067 PUD fa5a20067 PMD
> fa5a22067 PTE 0
> [288931.217143][T109754] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted
> 6.6.16+ #10
> [288931.217143][T109754] Hardware name: KubeVirt None/RHEL, BIOS
> edk2-20221207gitfff6d81270b5-9.el9 12/07/2022
> [288931.217143][T109754] RIP: 0010:copy_from_kernel_nofault+0x89/0xe0
> [288931.217143][T109754] Code: 48 83 c3 04 48 83 c5 04 49 83 ec 04 49
> 83 fc 01 76 13 66 8b 03 66 89 45 00 48 83 c3 02 48 83 c5 02 49 83 ec
> 02 4d 85 e4 74 05 <8a> 03 88 45 00 65 48 8b 04 25 80 11 03 00 83 a8 54
> 1b 00 00 01 31
> [288931.217143][T109754] RSP: 0018:ffff93d787777d38 EFLAGS: 00010202
> [288931.217143][T109754] RAX: ffff910d4830a0c0 RBX: ffffffffff600c7d
> RCX: 0000000000000010
> [288931.217143][T109754] RDX: 0000000000000030 RSI: 0000000000000001
> RDI: ffffffffff600c7d
> [288931.217143][T109754] RBP: ffff93d787777d87 R08: 0101010101010101
> R09: 00000000fffffdf4
> [288931.217143][T109754] R10: 0000000000000000 R11: 0000000000000000
> R12: 0000000000000001
> [288931.217143][T109754] R13: ffff93d7807eb000 R14: 000000000000000a
> R15: 0000000000000000
> [288931.217143][T109754] FS:  00007f2818fd79c0(0000)
> GS:ffff911c7f800000(0000) knlGS:0000000000000000
> [288931.217143][T109754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [288931.217143][T109754] CR2: ffffffffff600c7d CR3: 00000001174f4000
> CR4: 0000000000350ee0
> [288931.217143][T109754] Call Trace:
> [288931.217143][T109754]  <TASK>
> [288931.217143][T109754]  ? __die+0x1f/0x70
> [288931.217143][T109754]  ? page_fault_oops+0x151/0x480
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  ? alloc_empty_file+0x7a/0x120
> [288931.217143][T109754]  ? __d_instantiate+0x34/0xf0
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  ? alloc_file+0x9b/0x170
> [288931.217143][T109754]  ? exc_page_fault+0x68/0x140
> [288931.217143][T109754]  ? asm_exc_page_fault+0x22/0x30
> [288931.217143][T109754]  ? copy_from_kernel_nofault+0x89/0xe0
> [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
> [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
> [288931.217143][T109754]  bpf_prog_27620a19791a7c9c_BEGIN+0x2e/0xe7
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  __bpf_prog_test_run_raw_tp+0x2e/0x90
> [288931.217143][T109754]  bpf_prog_test_run_raw_tp+0xe6/0x1c0
> [288931.217143][T109754]  __sys_bpf+0x93a/0x26a0
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  ? __check_object_size+0x16a/0x2c0
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  __x64_sys_bpf+0x1a/0x30
> [288931.217143][T109754]  do_syscall_64+0x3a/0x90
> [288931.217143][T109754]  entry_SYSCALL_64_after_hwframe+0x56/0xc0
> [288931.217143][T109754] RIP: 0033:0x7f281b320719
> [288931.217143][T109754] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00
> 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c
> 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7
> d8 64 89 01 48
> [288931.217143][T109754] RSP: 002b:00007ffdab7f0118 EFLAGS: 00000246
> ORIG_RAX: 0000000000000141
> [288931.217143][T109754] RAX: ffffffffffffffda RBX: 00007ffdab7f01e8
> RCX: 00007f281b320719
> [288931.217143][T109754] RDX: 0000000000000050 RSI: 00007ffdab7f0120
> RDI: 000000000000000a
> [288931.217143][T109754] RBP: 00000000024e27a0 R08: 00007f28100cf880
> R09: 0000000000000016
> [288931.217143][T109754] R10: 0000000000000007 R11: 0000000000000246
> R12: 00000000ffffffff
> [288931.217143][T109754] R13: 00007ffdab7f10c8 R14: 000000000000000d
> R15: 00000000024e27a0
> [288931.217143][T109754]  </TASK>
> [288931.217143][T109754] Modules linked in: xt_bpf xt_conntrack
> nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype
> nft_compat nf_tables br_netfilter bridge stp llc overlay kvm_amd ccp
> kvm irqbypass crc32_pclmul sha512_ssse3 sha256_ssse3 sha1_ssse3
> aesni_intel crypto_simd cryptd virtio_balloon virtio_console
> tiny_power_button button fuse dm_mod dax configfs nfnetlink efivarfs
> ip_tables x_tables virtio_net net_failover virtio_blk virtio_scsi
> failover crc32c_intel i2c_i801 virtio_pci i2c_smbus
> virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
> [288931.217143][T109754] CR2: ffffffffff600c7d
> [288931.217143][T109754] ---[ end trace 0000000000000000 ]---
> [288931.509063][T109754] RIP: 0010:copy_from_kernel_nofault+0x89/0xe0
> [288931.509063][T109754] Code: 48 83 c3 04 48 83 c5 04 49 83 ec 04 49
> 83 fc 01 76 13 66 8b 03 66 89 45 00 48 83 c3 02 48 83 c5 02 49 83 ec
> 02 4d 85 e4 74 05 <8a> 03 88 45 00 65 48 8b 04 25 80 11 03 00 83 a8 54
> 1b 00 00 01 31
> [288931.509063][T109754] RSP: 0018:ffff93d787777d38 EFLAGS: 00010202
> [288931.509063][T109754] RAX: ffff910d4830a0c0 RBX: ffffffffff600c7d
> RCX: 0000000000000010
> [288931.509063][T109754] RDX: 0000000000000030 RSI: 0000000000000001
> RDI: ffffffffff600c7d
> [288931.509063][T109754] RBP: ffff93d787777d87 R08: 0101010101010101
> R09: 00000000fffffdf4
> [288931.509063][T109754] R10: 0000000000000000 R11: 0000000000000000
> R12: 0000000000000001
> [288931.509063][T109754] R13: ffff93d7807eb000 R14: 000000000000000a
> R15: 0000000000000000
> [288931.509063][T109754] FS:  00007f2818fd79c0(0000)
> GS:ffff911c7f800000(0000) knlGS:0000000000000000
> [288931.509063][T109754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [288931.509063][T109754] CR2: ffffffffff600c7d CR3: 00000001174f4000
> CR4: 0000000000350ee0
> [288931.509063][T109754] note: bpftrace[109754] exited with irqs disabled
> [288932.319062][T109754] note: bpftrace[109754] exited with preempt_count 1
>
> And Jakub CCed here did it for 6.8.0-rc2+
>
> > There were some changes made to the JIT code around the bounds
> > checking to reduce the instruction count.
> > That was in 90156f4bfa21 ("bpf, x86: Improve PROBE_MEM runtime load check").
> > Especially when src_reg == dst_reg, the case which happens in the
> > splat at 0x2ff.
> > Nothing else comes immediately to mind in terms of changes that could
> > affect this exception handling stuff.

Thanks, let me check if this is included or helps.

> >
> > > thanks
> > > Yan
> > >
>
> Ignat

* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-12 23:15   ` Ignat Korchagin
  2024-02-12 23:27     ` Yan Zhai
@ 2024-02-12 23:33     ` Alexei Starovoitov
  2024-02-12 23:41       ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2024-02-12 23:33 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: Kumar Kartikeya Dwivedi, Yan Zhai, bpf, kernel-team,
	Jakub Sitnicki

On Mon, Feb 12, 2024 at 3:16 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted
> 6.6.16+ #10

...
> [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
> [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
>
> And Jakub CCed here did it for 6.8.0-rc2+

I suspect something is broken in your kernels.
Above is doing generic copy_from_kernel_nofault(),
so one should be able to crash the kernel without any bpf.

We have this in selftests/bpf:
__weak noinline struct file *bpf_testmod_return_ptr(int arg)
{
        static struct file f = {};

        switch (arg) {
        case 1: return (void *)EINVAL;          /* user addr */
        case 2: return (void *)0xcafe4a11;      /* user addr */
        case 3: return (void *)-EINVAL;         /* canonical, but invalid */
        case 4: return (void *)(1ull << 60);    /* non-canonical and invalid */
        case 5: return (void *)~(1ull << 30);   /* trigger extable */
        case 6: return &f;                      /* valid addr */
        case 7: return (void *)((long)&f | 1);  /* kernel tricks */
        default: return NULL;
        }
}
where we check that the extables set up by the JIT for bpf progs are working
correctly. You should see the kernel crashing when you just run bpf selftests.

* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-12 23:33     ` Alexei Starovoitov
@ 2024-02-12 23:41       ` Kumar Kartikeya Dwivedi
  2024-02-12 23:52         ` Alexei Starovoitov
  0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-12 23:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ignat Korchagin, Yan Zhai, bpf, kernel-team, Jakub Sitnicki

On Tue, 13 Feb 2024 at 00:34, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Feb 12, 2024 at 3:16 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
> >
> > [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted
> > 6.6.16+ #10
>
> ...
> > [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
> > [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
> >
> > And Jakub CCed here did it for 6.8.0-rc2+
>
> I suspect something is broken in your kernels.
> Above is doing generic copy_from_kernel_nofault(),
> so one should be able to crash the kernel without any bpf.
>
> We have this in selftests/bpf:
> __weak noinline struct file *bpf_testmod_return_ptr(int arg)
> {
>         static struct file f = {};
>
>         switch (arg) {
>         case 1: return (void *)EINVAL;          /* user addr */
>         case 2: return (void *)0xcafe4a11;      /* user addr */
>         case 3: return (void *)-EINVAL;         /* canonical, but invalid */
>         case 4: return (void *)(1ull << 60);    /* non-canonical and invalid */
>         case 5: return (void *)~(1ull << 30);   /* trigger extable */
>         case 6: return &f;                      /* valid addr */
>         case 7: return (void *)((long)&f | 1);  /* kernel tricks */
>         default: return NULL;
>         }
> }
> where we check that extables setup by JIT for bpf progs are working correctly.
> You should see the kernel crashing when you just run bpf selftests.

I agree, this appears unrelated to BPF since it is happening when
using copy_from_kernel_nofault (which should be jumping to the Efault
label instead of the oops), but I think it's not specific to some
custom kernel. I can reproduce it on my dev machine on top of bpf-next
as well, and another machine with Ubuntu's generic 6.5 kernel for
24.04. And I think Ignat tried it on the mainline 6.8-rc2 as well.

* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-12 23:41       ` Kumar Kartikeya Dwivedi
@ 2024-02-12 23:52         ` Alexei Starovoitov
  2024-02-13  0:21           ` Yan Zhai
  0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2024-02-12 23:52 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Ignat Korchagin, Yan Zhai, bpf, kernel-team, Jakub Sitnicki

On Mon, Feb 12, 2024 at 3:42 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 13 Feb 2024 at 00:34, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Feb 12, 2024 at 3:16 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
> > >
> > > [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted
> > > 6.6.16+ #10
> >
> > ...
> > > [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
> > > [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
> > >
> > > And Jakub CCed here did it for 6.8.0-rc2+
> >
> > I suspect something is broken in your kernels.
> > Above is doing generic copy_from_kernel_nofault(),
> > so one should be able to crash the kernel without any bpf.
> >
> > We have this in selftests/bpf:
> > __weak noinline struct file *bpf_testmod_return_ptr(int arg)
> > {
> >         static struct file f = {};
> >
> >         switch (arg) {
> >         case 1: return (void *)EINVAL;          /* user addr */
> >         case 2: return (void *)0xcafe4a11;      /* user addr */
> >         case 3: return (void *)-EINVAL;         /* canonical, but invalid */
> >         case 4: return (void *)(1ull << 60);    /* non-canonical and invalid */
> >         case 5: return (void *)~(1ull << 30);   /* trigger extable */
> >         case 6: return &f;                      /* valid addr */
> >         case 7: return (void *)((long)&f | 1);  /* kernel tricks */
> >         default: return NULL;
> >         }
> > }
> > where we check that extables setup by JIT for bpf progs are working correctly.
> > You should see the kernel crashing when you just run bpf selftests.
>
> I agree, this appears unrelated to BPF since it is happening when
> using copy_from_kernel_nofault (which should be jumping to the Efault
> label instead of the oops), but I think it's not specific to some
> custom kernel. I can reproduce it on my dev machine on top of bpf-next
> as well, and another machine with Ubuntu's generic 6.5 kernel for
> 24.04. And I think Ignat tried it on the mainline 6.8-rc2 as well.

Then it must be the vsyscall address issue that this series is fixing:
https://patchwork.kernel.org/project/netdevbpf/patch/20240202103935.3154011-3-houtao@huaweicloud.com/

We're still waiting on x86 maintainers to ack them.
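
For reference, the faulting address in both splats, ffffffffff600c7d,
falls inside the legacy x86-64 vsyscall page, which starts at
VSYSCALL_ADDR (0xffffffffff600000) and is one 4 KiB page long. A minimal
C sketch of that range check follows. is_vsyscall_addr() is a
hypothetical helper written for illustration only; it is not the code
from the series linked above.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86-64 vsyscall page base; the kernel headers define it as (-10UL << 20). */
#define VSYSCALL_ADDR 0xffffffffff600000ULL
/* Assumes the usual 4 KiB page size on x86-64. */
#define PAGE_SIZE     4096ULL

/* Hypothetical helper: true if addr lies inside the vsyscall page. */
bool is_vsyscall_addr(uint64_t addr)
{
	return addr >= VSYSCALL_ADDR && addr - VSYSCALL_ADDR < PAGE_SIZE;
}
```

Calling is_vsyscall_addr(0xffffffffff600c7dULL) returns true, while the
pre-offset pointer from the production splat (RDI, ffffffffff5ffdfd)
sits just below the page and returns false.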

* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-12 23:52         ` Alexei Starovoitov
@ 2024-02-13  0:21           ` Yan Zhai
  2024-02-13  0:33             ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 11+ messages in thread
From: Yan Zhai @ 2024-02-13  0:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kumar Kartikeya Dwivedi, Ignat Korchagin, bpf, kernel-team,
	Jakub Sitnicki

On Mon, Feb 12, 2024 at 5:52 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Feb 12, 2024 at 3:42 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Tue, 13 Feb 2024 at 00:34, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Feb 12, 2024 at 3:16 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
> > > >
> > > > [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted
> > > > 6.6.16+ #10
> > >
> > > ...
> > > > [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
> > > > [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
> > > >
> > > > And Jakub CCed here did it for 6.8.0-rc2+
> > >
> > > I suspect something is broken in your kernels.
> > > Above is doing generic copy_from_kernel_nofault(),
> > > so one should be able to crash the kernel without any bpf.
> > >
> > > We have this in selftests/bpf:
> > > __weak noinline struct file *bpf_testmod_return_ptr(int arg)
> > > {
> > >         static struct file f = {};
> > >
> > >         switch (arg) {
> > >         case 1: return (void *)EINVAL;          /* user addr */
> > >         case 2: return (void *)0xcafe4a11;      /* user addr */
> > >         case 3: return (void *)-EINVAL;         /* canonical, but invalid */
> > >         case 4: return (void *)(1ull << 60);    /* non-canonical and invalid */
> > >         case 5: return (void *)~(1ull << 30);   /* trigger extable */
> > >         case 6: return &f;                      /* valid addr */
> > >         case 7: return (void *)((long)&f | 1);  /* kernel tricks */
> > >         default: return NULL;
> > >         }
> > > }
> > > where we check that extables setup by JIT for bpf progs are working correctly.
> > > You should see the kernel crashing when you just run bpf selftests.
> >
> > I agree, this appears unrelated to BPF since it is happening when
> > using copy_from_kernel_nofault (which should be jumping to the Efault
> > label instead of the oops), but I think it's not specific to some
> > custom kernel. I can reproduce it on my dev machine on top of bpf-next
> > as well, and another machine with Ubuntu's generic 6.5 kernel for
> > 24.04. And I think Ignat tried it on the mainline 6.8-rc2 as well.
>
copy_from_kernel_nofault is called in Jakub's reproducer, but the
panic in our production case seems to come from a direct memory access,
according to the jited code dumped by bpftool. Will faults from such
instructions also be caught correctly?

Yan

> Then it must be vsyscall address that this series are fixing:
> https://patchwork.kernel.org/project/netdevbpf/patch/20240202103935.3154011-3-houtao@huaweicloud.com/
>
> We're still waiting on x86 maintainers to ack them.

* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-13  0:21           ` Yan Zhai
@ 2024-02-13  0:33             ` Kumar Kartikeya Dwivedi
  2024-02-21 17:47               ` Ignat Korchagin
  2024-02-22  9:27               ` Hou Tao
  0 siblings, 2 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2024-02-13  0:33 UTC (permalink / raw)
  To: Yan Zhai
  Cc: Alexei Starovoitov, Ignat Korchagin, bpf, kernel-team,
	Jakub Sitnicki

On Tue, 13 Feb 2024 at 01:21, Yan Zhai <yan@cloudflare.com> wrote:
>
> On Mon, Feb 12, 2024 at 5:52 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Feb 12, 2024 at 3:42 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Tue, 13 Feb 2024 at 00:34, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Mon, Feb 12, 2024 at 3:16 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
> > > > >
> > > > > [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted
> > > > > 6.6.16+ #10
> > > >
> > > > ...
> > > > > [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
> > > > > [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
> > > > >
> > > > > And Jakub CCed here did it for 6.8.0-rc2+
> > > >
> > > > I suspect something is broken in your kernels.
> > > > Above is doing generic copy_from_kernel_nofault(),
> > > > so one should be able to crash the kernel without any bpf.
> > > >
> > > > We have this in selftests/bpf:
> > > > __weak noinline struct file *bpf_testmod_return_ptr(int arg)
> > > > {
> > > >         static struct file f = {};
> > > >
> > > >         switch (arg) {
> > > >         case 1: return (void *)EINVAL;          /* user addr */
> > > >         case 2: return (void *)0xcafe4a11;      /* user addr */
> > > >         case 3: return (void *)-EINVAL;         /* canonical, but invalid */
> > > >         case 4: return (void *)(1ull << 60);    /* non-canonical and invalid */
> > > >         case 5: return (void *)~(1ull << 30);   /* trigger extable */
> > > >         case 6: return &f;                      /* valid addr */
> > > >         case 7: return (void *)((long)&f | 1);  /* kernel tricks */
> > > >         default: return NULL;
> > > >         }
> > > > }
> > > > where we check that extables setup by JIT for bpf progs are working correctly.
> > > > You should see the kernel crashing when you just run bpf selftests.
> > >
> > > I agree, this appears unrelated to BPF since it is happening when
> > > using copy_from_kernel_nofault (which should be jumping to the Efault
> > > label instead of the oops), but I think it's not specific to some
> > > custom kernel. I can reproduce it on my dev machine on top of bpf-next
> > > as well, and another machine with Ubuntu's generic 6.5 kernel for
> > > 24.04. And I think Ignat tried it on the mainline 6.8-rc2 as well.
> >
> copy_from_kernel_nofault is called in Jakub's reproducer, but the
> panic case in our production seems to be direct memory accessing
> according to bpftool dumped jited code. Will faults from such
> instructions also be caught correctly?
>

Yep, since faults in both cases end up in the page fault handler.
Once the fix pointed out by Alexei is applied, it should address both scenarios.

> Yan
>
> > Then it must be vsyscall address that this series are fixing:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20240202103935.3154011-3-houtao@huaweicloud.com/
> >
> > We're still waiting on x86 maintainers to ack them.


* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-13  0:33             ` Kumar Kartikeya Dwivedi
@ 2024-02-21 17:47               ` Ignat Korchagin
  2024-02-22  9:27               ` Hou Tao
  1 sibling, 0 replies; 11+ messages in thread
From: Ignat Korchagin @ 2024-02-21 17:47 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Alexei Starovoitov
  Cc: Yan Zhai, bpf, kernel-team, Jakub Sitnicki

On Tue, Feb 13, 2024 at 12:34 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 13 Feb 2024 at 01:21, Yan Zhai <yan@cloudflare.com> wrote:
> >
> > On Mon, Feb 12, 2024 at 5:52 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Feb 12, 2024 at 3:42 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > On Tue, 13 Feb 2024 at 00:34, Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Mon, Feb 12, 2024 at 3:16 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
> > > > > >
> > > > > > [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted
> > > > > > 6.6.16+ #10
> > > > >
> > > > > ...
> > > > > > [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
> > > > > > [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
> > > > > >
> > > > > > And Jakub CCed here did it for 6.8.0-rc2+
> > > > >
> > > > > I suspect something is broken in your kernels.
> > > > > Above is doing generic copy_from_kernel_nofault(),
> > > > > so one should be able to crash the kernel without any bpf.
> > > > >
> > > > > We have this in selftests/bpf:
> > > > > __weak noinline struct file *bpf_testmod_return_ptr(int arg)
> > > > > {
> > > > >         static struct file f = {};
> > > > >
> > > > >         switch (arg) {
> > > > >         case 1: return (void *)EINVAL;          /* user addr */
> > > > >         case 2: return (void *)0xcafe4a11;      /* user addr */
> > > > >         case 3: return (void *)-EINVAL;         /* canonical, but invalid */
> > > > >         case 4: return (void *)(1ull << 60);    /* non-canonical and invalid */
> > > > >         case 5: return (void *)~(1ull << 30);   /* trigger extable */
> > > > >         case 6: return &f;                      /* valid addr */
> > > > >         case 7: return (void *)((long)&f | 1);  /* kernel tricks */
> > > > >         default: return NULL;
> > > > >         }
> > > > > }
> > > > > where we check that extables setup by JIT for bpf progs are working correctly.
> > > > > You should see the kernel crashing when you just run bpf selftests.
> > > >
> > > > I agree, this appears unrelated to BPF since it is happening when
> > > > using copy_from_kernel_nofault (which should be jumping to the Efault
> > > > label instead of the oops), but I think it's not specific to some
> > > > custom kernel. I can reproduce it on my dev machine on top of bpf-next
> > > > as well, and another machine with Ubuntu's generic 6.5 kernel for
> > > > 24.04. And I think Ignat tried it on the mainline 6.8-rc2 as well.
> > >
> > copy_from_kernel_nofault is called in Jakub's reproducer, but the
> > panic case in our production seems to be direct memory accessing
> > according to bpftool dumped jited code. Will faults from such
> > instructions also be caught correctly?
> >
>
> Yep, since faults in both cases end up in the page fault handler.
> Once the fix pointed out by Alexei is applied, it should address both scenarios.

Just as a follow-up: the patches do seem to help on x86, but we've
recently encountered a similar problem on arm64 (6.1.74 kernel):

[Wed Feb 21 12:06:33 2024] Unable to handle kernel access to user
memory outside uaccess routines at virtual address 00007fff9959b150
[Wed Feb 21 12:06:33 2024] Mem abort info:
[Wed Feb 21 12:06:33 2024]   ESR = 0x000000009600000f
[Wed Feb 21 12:06:33 2024]   EC = 0x25: DABT (current EL), IL = 32 bits
[Wed Feb 21 12:06:33 2024]   SET = 0, FnV = 0
[Wed Feb 21 12:06:33 2024]   EA = 0, S1PTW = 0
[Wed Feb 21 12:06:33 2024]   FSC = 0x0f: level 3 permission fault
[Wed Feb 21 12:06:33 2024] Data abort info:
[Wed Feb 21 12:06:33 2024]   ISV = 0, ISS = 0x0000000f
[Wed Feb 21 12:06:33 2024]   CM = 0, WnR = 0
[Wed Feb 21 12:06:33 2024] user pgtable: 4k pages, 48-bit VAs,
pgdp=00000812b1f69000
[Wed Feb 21 12:06:33 2024] [00007fff9959b150] pgd=08000812b1f72003,
p4d=08000812b1f72003, pud=08000812b1ff2003, pmd=08000855b2eb4003,
pte=0068087760598fc3
[Wed Feb 21 12:06:33 2024] Internal error: Oops: 000000009600000f [#1] SMP
[Wed Feb 21 12:06:33 2024] Modules linked in: nft_compat xt_hashlimit
ip_set_hash_netport xt_length esp4 nf_conntrack_netlink zstd
zstd_compress zram zsmalloc xgene_edac dm_thin_pool dm_persistent_data
dm_bio_prison dm_bufio nft_fwd_netdev nf_dup_netdev xfrm_interface
xfrm6_tunnel mpls_gso mpls_iptunnel mpls_router sit nft_numgen nft_log
nft_limit dummy ipip tunnel4 xfrm_user xfrm_algo nft_ct iptable_raw
iptable_nat iptable_mangle ipt_REJECT nf_reject_ipv4 ip6table_security
xt_CT ip6table_raw xt_nat ip6table_nat nf_nat xt_TCPMSS xt_owner
xt_NFLOG xt_connbytes xt_connlabel xt_statistic xt_connmark
ip6table_mangle xt_limit xt_LOG nf_log_syslog xt_mark xt_tcpudp
xt_conntrack ip6t_REJECT nf_reject_ipv6 xt_multiport xt_set xt_tcpmss
xt_comment ip6table_filter ip6_tables iptable_filter nfnetlink_log
tcp_diag cls_bpf sch_ingress ip_gre gre geneve tun xt_bpf nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 fou6 fou ip_tunnel ip6_udp_tunnel
udp_tunnel ip6_tunnel tunnel6 veth nf_tables tcp_bbr sch_fq
[Wed Feb 21 12:06:33 2024]  ip_set_hash_ip ip_set_hash_net ip_set
nfnetlink udp_diag inet_diag raid0 md_mod dm_crypt trusted
asn1_encoder tee algif_skcipher af_alg 8021q garp mrp stp llc
nvme_fabrics crct10dif_ce ghash_ce acpi_ipmi mlx5_core sha2_ce
ipmi_ssif sha256_arm64 sha1_ce mlxfw ipmi_devintf arm_spe_pmu
tiny_power_button tls igb xhci_pci nvme psample nvme_core xhci_hcd
ipmi_msghandler i2c_algo_bit button i2c_designware_platform
i2c_designware_core cppc_cpufreq arm_dsu_pmu tpm_tis tpm_tis_core fuse
dm_mod dax efivarfs ip_tables x_tables bcmcrypt(O) aes_neon_bs
aes_neon_blk aes_ce_blk aes_ce_cipher [last unloaded: kheaders]
[Wed Feb 21 12:06:33 2024] CPU: 15 PID: 547138 Comm: nginx-ssl
Tainted: G           O       6.1.74-cloudflare-2024.1.14 #1
[Wed Feb 21 12:06:33 2024] Hardware name: GIGABYTE
[Wed Feb 21 12:06:33 2024] pstate: 20400009 (nzCv daif +PAN -UAO -TCO
-DIT -SSBS BTYPE=--)
[Wed Feb 21 12:06:33 2024] pc : 0xffff8000288c0674
[Wed Feb 21 12:06:33 2024] lr : 0xffff8000288c064c
[Wed Feb 21 12:06:33 2024] sp : ffff8000afdd3940
[Wed Feb 21 12:06:33 2024] x29: ffff8000afdd39d0 x28: ffff081142f99f80
x27: ffff8000afdd3940
[Wed Feb 21 12:06:33 2024] x26: 0000000000000000 x25: ffff8000afdd3990
x24: 0000000000000001
[Wed Feb 21 12:06:33 2024] x23: 000000002e4773f7 x22: ffff0800e7078300
x21: ffff08378b4c5180
[Wed Feb 21 12:06:33 2024] x20: 0000000000000000 x19: fffffbff5dc7d548
x18: 0000000000000000
[Wed Feb 21 12:06:33 2024] x17: 0000000000000000 x16: 0000000000000000
x15: ffff081b6e9e8196
[Wed Feb 21 12:06:33 2024] x14: 0000000000000000 x13: 0000000000000000
x12: 0000000000000000
[Wed Feb 21 12:06:33 2024] x11: 0000000000000000 x10: ffffda25e4cc90f0
x9 : ffffda25e4d71074
[Wed Feb 21 12:06:33 2024] x8 : ffff8000afdd3af8 x7 : 0000000000000000
x6 : 0000008124f0e5a3
[Wed Feb 21 12:06:33 2024] x5 : ffff80023c9cd000 x4 : 0000000000001000
x3 : 0000000000000008
[Wed Feb 21 12:06:33 2024] x2 : ffff081142f99f80 x1 : ffffda25e55e76a0
x0 : 00007fff9959a2d0
[Wed Feb 21 12:06:33 2024] Call trace:
[Wed Feb 21 12:06:33 2024]  0xffff8000288c0674
[Wed Feb 21 12:06:33 2024]  bpf_trace_run3+0xcc/0x148
[Wed Feb 21 12:06:34 2024]  __bpf_trace_kfree_skb+0x14/0x20
[Wed Feb 21 12:06:34 2024]  __traceiter_kfree_skb+0x50/0x78
[Wed Feb 21 12:06:34 2024]  kfree_skb_reason+0xa8/0x118
[Wed Feb 21 12:06:34 2024]  tcp_data_queue+0x9f8/0xe20
[Wed Feb 21 12:06:34 2024]  tcp_rcv_established+0x2b4/0x738
[Wed Feb 21 12:06:34 2024]  tcp_v4_do_rcv+0x194/0x2d8
[Wed Feb 21 12:06:34 2024]  __release_sock+0x90/0x138
[Wed Feb 21 12:06:34 2024]  release_sock+0x64/0x120
[Wed Feb 21 12:06:34 2024]  tcp_recvmsg+0x80/0x1c8
[Wed Feb 21 12:06:34 2024]  inet_recvmsg+0x50/0xf8
[Wed Feb 21 12:06:34 2024]  sock_read_iter+0xf4/0x128
[Wed Feb 21 12:06:34 2024]  vfs_read+0x27c/0x2b0
[Wed Feb 21 12:06:34 2024]  ksys_read+0xe4/0x108
[Wed Feb 21 12:06:34 2024]  __arm64_sys_read+0x24/0x38
[Wed Feb 21 12:06:34 2024]  invoke_syscall.constprop.0+0x58/0xf8
[Wed Feb 21 12:06:34 2024]  do_el0_svc+0x174/0x1a0
[Wed Feb 21 12:06:34 2024]  el0_svc+0x38/0xf0
[Wed Feb 21 12:06:34 2024]  el0t_64_sync_handler+0xbc/0x138
[Wed Feb 21 12:06:34 2024]  el0t_64_sync+0x18c/0x190
[Wed Feb 21 12:06:34 2024] Code: b94096c0 f9001360 f9400ac0 f9427c00 (f9474014)
[Wed Feb 21 12:06:34 2024] ---[ end trace 0000000000000000 ]---
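
The abort fields printed in the log above can be cross-checked by decoding
the ESR value by hand. The following is a minimal sketch, assuming the
standard ESR_ELx layout for a data abort (EC in bits 31:26, IL in bit 25,
ISS in bits 24:0, WnR in ISS bit 6, FSC in ISS bits 5:0). The helper names
are hypothetical and are not kernel functions.

```c
#include <stdint.h>

/* Hypothetical helpers: extract ESR_ELx fields for a data abort. */

/* Exception class, bits 31:26. */
static uint32_t esr_ec(uint64_t esr)  { return (uint32_t)((esr >> 26) & 0x3f); }

/* Instruction length bit, bit 25 (1 means a 32-bit instruction). */
static uint32_t esr_il(uint64_t esr)  { return (uint32_t)((esr >> 25) & 0x1); }

/* Instruction specific syndrome, bits 24:0. */
static uint32_t esr_iss(uint64_t esr) { return (uint32_t)(esr & 0x1ffffff); }

/* Write-not-read, ISS bit 6 (0 means the abort came from a read). */
static uint32_t esr_wnr(uint64_t esr) { return (uint32_t)((esr >> 6) & 0x1); }

/* Fault status code, ISS bits 5:0. */
static uint32_t esr_fsc(uint64_t esr) { return (uint32_t)(esr & 0x3f); }
```

Applied to ESR = 0x9600000f, these give EC = 0x25, IL = 1, ISS = 0x0f,
WnR = 0 and FSC = 0x0f, which agrees with the log: a read from the current
EL that hit a level 3 permission fault. As far as I can tell from the
register dump, the faulting address 00007fff9959b150 is x0 + 0xe80, the
same offset that appears in the x86 report at the top of the thread.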

Not sure whether a similar fix for arm64 is pending, or whether this is
more of a cross-platform problem.

Ignat

> > Yan
> >
> > > Then it must be vsyscall address that this series are fixing:
> > > https://patchwork.kernel.org/project/netdevbpf/patch/20240202103935.3154011-3-houtao@huaweicloud.com/
> > >
> > > We're still waiting on x86 maintainers to ack them.


* Re: Page faults in tracepoint caused by aliased pointer
  2024-02-13  0:33             ` Kumar Kartikeya Dwivedi
  2024-02-21 17:47               ` Ignat Korchagin
@ 2024-02-22  9:27               ` Hou Tao
  1 sibling, 0 replies; 11+ messages in thread
From: Hou Tao @ 2024-02-22  9:27 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi, Yan Zhai
  Cc: Alexei Starovoitov, Ignat Korchagin, bpf, kernel-team,
	Jakub Sitnicki

Hi,

On 2/13/2024 8:33 AM, Kumar Kartikeya Dwivedi wrote:
> On Tue, 13 Feb 2024 at 01:21, Yan Zhai <yan@cloudflare.com> wrote:
>> On Mon, Feb 12, 2024 at 5:52 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>> On Mon, Feb 12, 2024 at 3:42 PM Kumar Kartikeya Dwivedi
>>> <memxor@gmail.com> wrote:
>>>> On Tue, 13 Feb 2024 at 00:34, Alexei Starovoitov
>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>> On Mon, Feb 12, 2024 at 3:16 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>>>>>> [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted
>>>>>> 6.6.16+ #10
>>>>> ...
>>>>>> [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
>>>>>> [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
>>>>>>
>>>>>> And Jakub CCed here did it for 6.8.0-rc2+
>>>>> I suspect something is broken in your kernels.
>>>>> Above is doing generic copy_from_kernel_nofault(),
>>>>> so one should be able to crash the kernel without any bpf.
>>>>>
>>>>> We have this in selftests/bpf:
>>>>> __weak noinline struct file *bpf_testmod_return_ptr(int arg)
>>>>> {
>>>>>         static struct file f = {};
>>>>>
>>>>>         switch (arg) {
>>>>>         case 1: return (void *)EINVAL;          /* user addr */
>>>>>         case 2: return (void *)0xcafe4a11;      /* user addr */
>>>>>         case 3: return (void *)-EINVAL;         /* canonical, but invalid */
>>>>>         case 4: return (void *)(1ull << 60);    /* non-canonical and invalid */
>>>>>         case 5: return (void *)~(1ull << 30);   /* trigger extable */
>>>>>         case 6: return &f;                      /* valid addr */
>>>>>         case 7: return (void *)((long)&f | 1);  /* kernel tricks */
>>>>>         default: return NULL;
>>>>>         }
>>>>> }
>>>>> where we check that extables setup by JIT for bpf progs are working correctly.
>>>>> You should see the kernel crashing when you just run bpf selftests.
>>>> I agree, this appears unrelated to BPF since it is happening when
>>>> using copy_from_kernel_nofault (which should be jumping to the Efault
>>>> label instead of the oops), but I think it's not specific to some
>>>> custom kernel. I can reproduce it on my dev machine on top of bpf-next
>>>> as well, and another machine with Ubuntu's generic 6.5 kernel for
>>>> 24.04. And I think Ignat tried it on the mainline 6.8-rc2 as well.
>> copy_from_kernel_nofault is called in Jakub's reproducer, but the
>> panic case in our production seems to be direct memory accessing
>> according to bpftool dumped jited code. Will faults from such
>> instructions also be caught correctly?
>>
> Yep, since faults in both cases end up in the page fault handler.
> Once the fix pointed out by Alexei is applied, it should address both scenarios.

I don't see how the vsyscall patch [1] will fix the unhandled page
fault caused by a BTF pointer dereference. As I understand it, for a
BTF pointer dereference the x86 JIT checks whether the address is a
kernel-space address. If it is, the JIT sets up an exception fix-up
entry for the dereference and performs the load directly. A vsyscall
address is treated by the x86 JIT as a kernel-space address, so it is
dereferenced directly as well. Dereferencing the vsyscall page from the
kernel triggers a page fault: handle_page_fault() is invoked, and it in
turn calls do_user_addr_fault() and then page_fault_oops().
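
To make the address check concrete, here is a minimal user-space sketch of
the guard as I read it from the "Code:" bytes of the x86 oops at the top
of this thread (movabs of a limit into r11, cmp/jb, then add of the offset
with a carry check). The constants are taken from that oops. The function
and macro names are hypothetical and only model the emitted instructions;
this is not the JIT source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values read off the x86 oops; names are illustrative only. */
#define GUARD_LIMIT   0x0000800000000e80ULL /* imm64 of "49 bb ..." (movabs r11) */
#define LOAD_OFFSET   0xe80ULL              /* displacement of the faulting load */
#define VSYSCALL_PAGE 0xffffffffff600000ULL /* x86-64 vsyscall page address */
#define PAGE_SIZE_4K  0x1000ULL

/*
 * Model of the guard emitted before the load: pointers below the limit
 * are rejected, and so are pointers whose offset addition wraps around.
 * A rejected load zeroes the destination register instead of faulting.
 */
static bool guard_allows_load(uint64_t src, uint64_t off)
{
	if (src < GUARD_LIMIT)  /* cmp rdi, r11; jb -> skip the load */
		return false;
	if (src + off < src)    /* add r11, off; carry set -> skip the load */
		return false;
	return true;            /* fall through to the direct load */
}

/* True if addr lies inside the 4 KiB vsyscall page. */
static bool in_vsyscall_page(uint64_t addr)
{
	return addr >= VSYSCALL_PAGE && addr < VSYSCALL_PAGE + PAGE_SIZE_4K;
}
```

With RDI = ffffffffff5ffdfd from the oops, guard_allows_load() returns
true, and RDI + 0xe80 = ffffffffff600c7d lands inside the vsyscall page,
which is the faulting address reported in the first line of the oops. In
other words, the guard only filters out user-space addresses and does not
treat the vsyscall page specially.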

[1]:
https://patchwork.kernel.org/project/netdevbpf/patch/20240202103935.3154011-3-houtao@huaweicloud.com/

>
>> Yan
>>
>>> Then it must be vsyscall address that this series are fixing:
>>> https://patchwork.kernel.org/project/netdevbpf/patch/20240202103935.3154011-3-houtao@huaweicloud.com/
>>>
>>> We're still waiting on x86 maintainers to ack them.



Thread overview: 11+ messages
2024-02-12 22:14 Page faults in tracepoint caused by aliased pointer Yan Zhai
2024-02-12 22:55 ` Kumar Kartikeya Dwivedi
2024-02-12 23:15   ` Ignat Korchagin
2024-02-12 23:27     ` Yan Zhai
2024-02-12 23:33     ` Alexei Starovoitov
2024-02-12 23:41       ` Kumar Kartikeya Dwivedi
2024-02-12 23:52         ` Alexei Starovoitov
2024-02-13  0:21           ` Yan Zhai
2024-02-13  0:33             ` Kumar Kartikeya Dwivedi
2024-02-21 17:47               ` Ignat Korchagin
2024-02-22  9:27               ` Hou Tao
