Netdev List
 help / color / mirror / Atom feed
* Re: [syzbot] KASAN: slab-out-of-bounds Write in copy_array
From: syzbot @ 2022-12-20 19:53 UTC (permalink / raw)
  To: sdf
  Cc: andrii, ast, bpf, daniel, davem, haoluo, hawk, john.fastabend,
	jolsa, kpsingh, kuba, linux-kernel, llvm, martin.lau, nathan,
	ndesaulniers, netdev, sdf, song, syzkaller-bugs, trix, yhs
In-Reply-To: <Y6ISmUxKqXP6VpLS@google.com>

> On 11/28, syzbot wrote:
>> Hello,
>> 
>> syzbot found the following issue on:
>> 
>> HEAD commit:    c35bd4e42885 Add linux-next specific files for 20221124
>> git tree:       linux-next
>> console+strace: https://syzkaller.appspot.com/x/log.txt?x=13369dc5880000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=11e19c740a0b2926
>> dashboard link: https://syzkaller.appspot.com/bug?extid=b1e1f7feb407b56d0355
>> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1345a205880000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=124c644b880000
>> 
>> Downloadable assets:
>> disk image: https://storage.googleapis.com/syzbot-assets/968fee464d14/disk-c35bd4e4.raw.xz
>> vmlinux: https://storage.googleapis.com/syzbot-assets/4f46fe801b5b/vmlinux-c35bd4e4.xz
>> kernel image: https://storage.googleapis.com/syzbot-assets/c2cdf8fb264e/bzImage-c35bd4e4.xz
>> 
>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>> Reported-by: syzbot+b1e1f7feb407b56d0355@syzkaller.appspotmail.com
>> 
>> ==================================================================
>> BUG: KASAN: slab-out-of-bounds in copy_array+0x96/0x100 kernel/bpf/verifier.c:1032
>> Write of size 232 at addr ffff88801ed62600 by task syz-executor990/5290
>> 
>> CPU: 0 PID: 5290 Comm: syz-executor990 Not tainted 6.1.0-rc6-next-20221124-syzkaller #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
>> Call Trace:
>>  <TASK>
>>  __dump_stack lib/dump_stack.c:88 [inline]
>>  dump_stack_lvl+0xd1/0x138 lib/dump_stack.c:106
>>  print_address_description mm/kasan/report.c:253 [inline]
>>  print_report+0x15e/0x45d mm/kasan/report.c:364
>>  kasan_report+0xbf/0x1f0 mm/kasan/report.c:464
>>  check_region_inline mm/kasan/generic.c:183 [inline]
>>  kasan_check_range+0x141/0x190 mm/kasan/generic.c:189
>>  memcpy+0x3d/0x60 mm/kasan/shadow.c:66
>>  copy_array+0x96/0x100 kernel/bpf/verifier.c:1032
>>  copy_verifier_state+0xa9/0xbe0 kernel/bpf/verifier.c:1210
>>  pop_stack+0x8c/0x2f0 kernel/bpf/verifier.c:1273
>>  do_check kernel/bpf/verifier.c:13733 [inline]
>>  do_check_common+0x372b/0xc5e0 kernel/bpf/verifier.c:15991
>>  do_check_main kernel/bpf/verifier.c:16054 [inline]
>>  bpf_check+0x7371/0xad00 kernel/bpf/verifier.c:16624
>>  bpf_prog_load+0x1543/0x2230 kernel/bpf/syscall.c:2619
>>  __sys_bpf+0x1436/0x4ff0 kernel/bpf/syscall.c:4979
>>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>>  __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5081
>>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>>  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> RIP: 0033:0x7fc18e7bbc29
>> Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
>> RSP: 002b:00007ffd8f27a968 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
>> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc18e7bbc29
>> RDX: 0000000000000048 RSI: 0000000020000200 RDI: 0000000000000005
>> RBP: 00007fc18e77fdd0 R08: 0000000000000000 R09: 0000000000000000
>> R10: 00000000ffffffff R11: 0000000000000246 R12: 00007fc18e77fe60
>> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>  </TASK>
>> 
>> Allocated by task 5290:
>>  kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
>>  kasan_set_track+0x25/0x30 mm/kasan/common.c:52
>>  ____kasan_kmalloc mm/kasan/common.c:376 [inline]
>>  ____kasan_kmalloc mm/kasan/common.c:335 [inline]
>>  __kasan_krealloc+0x145/0x180 mm/kasan/common.c:444
>>  kasan_krealloc include/linux/kasan.h:232 [inline]
>>  __do_krealloc mm/slab_common.c:1348 [inline]
>>  krealloc+0xa8/0x100 mm/slab_common.c:1385
>>  push_jmp_history+0x89/0x260 kernel/bpf/verifier.c:2528
>>  is_state_visited kernel/bpf/verifier.c:13269 [inline]
>>  do_check kernel/bpf/verifier.c:13466 [inline]
>>  do_check_common+0x4b47/0xc5e0 kernel/bpf/verifier.c:15991
>>  do_check_main kernel/bpf/verifier.c:16054 [inline]
>>  bpf_check+0x7371/0xad00 kernel/bpf/verifier.c:16624
>>  bpf_prog_load+0x1543/0x2230 kernel/bpf/syscall.c:2619
>>  __sys_bpf+0x1436/0x4ff0 kernel/bpf/syscall.c:4979
>>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>>  __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5081
>>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>>  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> 
>> The buggy address belongs to the object at ffff88801ed62600
>>  which belongs to the cache kmalloc-256 of size 256
>> The buggy address is located 0 bytes inside of
>>  256-byte region [ffff88801ed62600, ffff88801ed62700)
>> 
>> The buggy address belongs to the physical page:
>> page:ffffea00007b5880 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1ed62
>> head:ffffea00007b5880 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
>> flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
>> raw: 00fff00000010200 ffff888012441b40 ffffea0000809f80 dead000000000002
>> raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
>> page dumped because: kasan: bad access detected
>> page_owner tracks the page as allocated
>> page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 56, tgid 56 (kworker/u4:4), ts 7761288109, free_ts 0
>>  prep_new_page mm/page_alloc.c:2541 [inline]
>>  get_page_from_freelist+0x119c/0x2cd0 mm/page_alloc.c:4293
>>  __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5551
>>  alloc_pages+0x1aa/0x270 mm/mempolicy.c:2285
>>  alloc_slab_page mm/slub.c:1833 [inline]
>>  allocate_slab+0x25e/0x350 mm/slub.c:1980
>>  new_slab mm/slub.c:2033 [inline]
>>  ___slab_alloc+0xa91/0x1400 mm/slub.c:3211
>>  __slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3310
>>  slab_alloc_node mm/slub.c:3395 [inline]
>>  __kmem_cache_alloc_node+0x1a9/0x430 mm/slub.c:3472
>>  kmalloc_trace+0x26/0x60 mm/slab_common.c:1049
>>  kmalloc include/linux/slab.h:571 [inline]
>>  scsi_probe_and_add_lun+0x3ae/0x34d0 drivers/scsi/scsi_scan.c:1186
>>  __scsi_scan_target+0x21f/0xda0 drivers/scsi/scsi_scan.c:1664
>>  scsi_scan_channel drivers/scsi/scsi_scan.c:1752 [inline]
>>  scsi_scan_channel+0x148/0x1e0 drivers/scsi/scsi_scan.c:1728
>>  scsi_scan_host_selected+0x2e3/0x3b0 drivers/scsi/scsi_scan.c:1781
>>  do_scsi_scan_host+0x1e8/0x260 drivers/scsi/scsi_scan.c:1920
>>  do_scan_async+0x42/0x500 drivers/scsi/scsi_scan.c:1930
>>  async_run_entry_fn+0x9c/0x530 kernel/async.c:127
>>  process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
>> page_owner free stack trace missing
>> 
>> Memory state around the buggy address:
>>  ffff88801ed62500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>>  ffff88801ed62580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>> >ffff88801ed62600: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
>>                                ^
>>  ffff88801ed62680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>>  ffff88801ed62700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>> ==================================================================
>> 
>> 
>> ---
>> This report is generated by a bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for more information about syzbot.
>> syzbot engineers can be reached at syzkaller@googlegroups.com.
>> 
>> syzbot will keep track of this issue. See:
>> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>> syzbot can test patches for this issue, for details see:
>> https://goo.gl/tpsmEJ#testing-patches
>
> #syz test

want 2 args (repo, branch), got 1


^ permalink raw reply

* Re: [syzbot] KASAN: slab-out-of-bounds Write in copy_verifier_state
From: sdf @ 2022-12-20 19:53 UTC (permalink / raw)
  To: syzbot
  Cc: andrii, ast, bpf, daniel, davem, haoluo, hawk, john.fastabend,
	jolsa, kpsingh, kuba, linux-kernel, llvm, martin.lau, nathan,
	ndesaulniers, netdev, song, syzkaller-bugs, trix, yhs
In-Reply-To: <000000000000269f9a05f02be9d8@google.com>

[-- Attachment #1: Type: text/plain, Size: 8729 bytes --]

On 12/19, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    041fae9c105a Merge tag 'f2fs-for-6.2-rc1' of git://git.ker..
> git tree:       upstream
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=136be5d0480000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=e2f3d9d232a3cac5
> dashboard link: https://syzkaller.appspot.com/bug?extid=59af7bf76d795311da8c
> compiler:       Debian clang version 13.0.1-++20220126092033+75e33f71c2da-1~exp1~20220126212112.63, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1650d477880000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1305f993880000
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/e228bf558f91/disk-041fae9c.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/11eeb90801b7/vmlinux-041fae9c.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/1f9651f3b5bd/bzImage-041fae9c.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+59af7bf76d795311da8c@syzkaller.appspotmail.com
> 
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in copy_array kernel/bpf/verifier.c:1072 [inline]
> BUG: KASAN: slab-out-of-bounds in copy_verifier_state+0x130/0xbe0 kernel/bpf/verifier.c:1250
> Write of size 80 at addr ffff888022c71000 by task syz-executor186/5067
> 
> CPU: 0 PID: 5067 Comm: syz-executor186 Not tainted 6.1.0-syzkaller-10971-g041fae9c105a #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
> Call Trace:
>  <TASK>
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0x1e3/0x2d0 lib/dump_stack.c:106
>  print_address_description+0x74/0x340 mm/kasan/report.c:306
>  print_report+0x107/0x220 mm/kasan/report.c:417
>  kasan_report+0x139/0x170 mm/kasan/report.c:517
>  kasan_check_range+0x2a7/0x2e0 mm/kasan/generic.c:189
>  memcpy+0x3c/0x60 mm/kasan/shadow.c:66
>  copy_array kernel/bpf/verifier.c:1072 [inline]
>  copy_verifier_state+0x130/0xbe0 kernel/bpf/verifier.c:1250
>  pop_stack kernel/bpf/verifier.c:1314 [inline]
>  do_check+0x8e51/0x107b0 kernel/bpf/verifier.c:14031
>  do_check_common+0x909/0x1800 kernel/bpf/verifier.c:16289
>  do_check_main kernel/bpf/verifier.c:16352 [inline]
>  bpf_check+0x107e2/0x16170 kernel/bpf/verifier.c:16936
>  bpf_prog_load+0x1306/0x1be0 kernel/bpf/syscall.c:2619
>  __sys_bpf+0x396/0x6d0 kernel/bpf/syscall.c:4979
>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>  __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:5081
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7ff1fb190c29
> Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007ffeaae55678 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff1fb190c29
> RDX: 0000000000000048 RSI: 0000000020000200 RDI: 0000000000000005
> RBP: 00007ff1fb154dd0 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ff1fb154e60
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>  </TASK>
> 
> Allocated by task 5067:
>  kasan_save_stack mm/kasan/common.c:45 [inline]
>  kasan_set_track+0x4c/0x70 mm/kasan/common.c:52
>  ____kasan_kmalloc mm/kasan/common.c:371 [inline]
>  __kasan_krealloc+0xbf/0xf0 mm/kasan/common.c:439
>  kasan_krealloc include/linux/kasan.h:231 [inline]
>  __do_krealloc mm/slab_common.c:1361 [inline]
>  krealloc+0xb2/0x110 mm/slab_common.c:1398
>  push_jmp_history kernel/bpf/verifier.c:2592 [inline]
>  is_state_visited kernel/bpf/verifier.c:13552 [inline]
>  do_check+0x9433/0x107b0 kernel/bpf/verifier.c:13752
>  do_check_common+0x909/0x1800 kernel/bpf/verifier.c:16289
>  do_check_main kernel/bpf/verifier.c:16352 [inline]
>  bpf_check+0x107e2/0x16170 kernel/bpf/verifier.c:16936
>  bpf_prog_load+0x1306/0x1be0 kernel/bpf/syscall.c:2619
>  __sys_bpf+0x396/0x6d0 kernel/bpf/syscall.c:4979
>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>  __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:5081
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> The buggy address belongs to the object at ffff888022c71000
>  which belongs to the cache kmalloc-96 of size 96
> The buggy address is located 0 bytes inside of
>  96-byte region [ffff888022c71000, ffff888022c71060)
> 
> The buggy address belongs to the physical page:
> page:ffffea00008b1c40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22c71
> ksm flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
> raw: 00fff00000000200 ffff888012841780 ffffea0000a6d880 0000000000000003
> raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
> page_owner tracks the page as allocated
> page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12c40(GFP_NOFS|__GFP_NOWARN|__GFP_NORETRY), pid 4437, tgid 4437 (udevd), ts 26581643327, free_ts 26581082061
>  prep_new_page mm/page_alloc.c:2531 [inline]
>  get_page_from_freelist+0x72b/0x7a0 mm/page_alloc.c:4283
>  __alloc_pages+0x259/0x560 mm/page_alloc.c:5549
>  alloc_slab_page+0xbd/0x190 mm/slub.c:1851
>  allocate_slab+0x5e/0x3c0 mm/slub.c:1998
>  new_slab mm/slub.c:2051 [inline]
>  ___slab_alloc+0x7f4/0xeb0 mm/slub.c:3193
>  __slab_alloc mm/slub.c:3292 [inline]
>  __slab_alloc_node mm/slub.c:3345 [inline]
>  slab_alloc_node mm/slub.c:3442 [inline]
>  __kmem_cache_alloc_node+0x25b/0x340 mm/slub.c:3491
>  __do_kmalloc_node mm/slab_common.c:967 [inline]
>  __kmalloc+0x9e/0x190 mm/slab_common.c:981
>  kmalloc include/linux/slab.h:584 [inline]
>  kzalloc include/linux/slab.h:720 [inline]
>  tomoyo_encode2 security/tomoyo/realpath.c:45 [inline]
>  tomoyo_encode+0x26f/0x540 security/tomoyo/realpath.c:80
>  tomoyo_realpath_from_path+0x5ae/0x5f0 security/tomoyo/realpath.c:283
>  tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
>  tomoyo_path_perm+0x280/0x680 security/tomoyo/file.c:822
>  security_inode_getattr+0xc0/0x140 security/security.c:1375
>  vfs_getattr fs/stat.c:161 [inline]
>  vfs_statx+0x198/0x4b0 fs/stat.c:236
>  vfs_fstatat fs/stat.c:270 [inline]
>  __do_sys_newfstatat fs/stat.c:440 [inline]
>  __se_sys_newfstatat+0x104/0x7b0 fs/stat.c:434
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> page last free stack trace:
>  reset_page_owner include/linux/page_owner.h:24 [inline]
>  free_pages_prepare mm/page_alloc.c:1446 [inline]
>  free_pcp_prepare+0x751/0x780 mm/page_alloc.c:1496
>  free_unref_page_prepare mm/page_alloc.c:3369 [inline]
>  free_unref_page+0x19/0x4c0 mm/page_alloc.c:3464
>  free_pipe_info+0x302/0x380 fs/pipe.c:851
>  put_pipe_info fs/pipe.c:711 [inline]
>  pipe_release+0x232/0x310 fs/pipe.c:734
>  __fput+0x3ba/0x880 fs/file_table.c:320
>  task_work_run+0x243/0x300 kernel/task_work.c:179
>  resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
>  exit_to_user_mode_loop+0x134/0x160 kernel/entry/common.c:171
>  exit_to_user_mode_prepare+0xad/0x110 kernel/entry/common.c:203
>  __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
>  syscall_exit_to_user_mode+0x2e/0x60 kernel/entry/common.c:296
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> Memory state around the buggy address:
>  ffff888022c70f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  ffff888022c70f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >ffff888022c71000: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>                          ^
>  ffff888022c71080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>  ffff888022c71100: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ==================================================================
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches

#syz test

[-- Attachment #2: syz.patch --]
[-- Type: text/x-diff, Size: 3676 bytes --]

commit 589aa8077dcf413b06c78c6d8095496c98720c25
Author:     Stanislav Fomichev <sdf@google.com>
AuthorDate: Tue Dec 20 11:51:17 2022 -0800
Commit:     Stanislav Fomichev <sdf@google.com>
CommitDate: Tue Dec 20 11:51:17 2022 -0800

    Revert "mm: Make ksize() a reporting-only function"
    
    This reverts commit 38931d8989b5760b0bd17c9ec99e81986258e4cb.

diff --git a/mm/kasan/kasan_test.c b/mm/kasan/kasan_test.c
index 73684642c42d..0d59098f0876 100644
--- a/mm/kasan/kasan_test.c
+++ b/mm/kasan/kasan_test.c
@@ -783,30 +783,23 @@ static void kasan_global_oob_left(struct kunit *test)
 	KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p);
 }
 
-/* Check that ksize() does NOT unpoison whole object. */
+/* Check that ksize() makes the whole object accessible. */
 static void ksize_unpoisons_memory(struct kunit *test)
 {
 	char *ptr;
-	size_t size = 128 - KASAN_GRANULE_SIZE - 5;
-	size_t real_size;
+	size_t size = 123, real_size;
 
 	ptr = kmalloc(size, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
-
 	real_size = ksize(ptr);
-	KUNIT_EXPECT_GT(test, real_size, size);
 
 	OPTIMIZER_HIDE_VAR(ptr);
 
-	/* These accesses shouldn't trigger a KASAN report. */
-	ptr[0] = 'x';
-	ptr[size - 1] = 'x';
+	/* This access shouldn't trigger a KASAN report. */
+	ptr[size] = 'x';
 
-	/* These must trigger a KASAN report. */
-	if (IS_ENABLED(CONFIG_KASAN_GENERIC))
-		KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size]);
-	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size + 5]);
-	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[real_size - 1]);
+	/* This one must. */
+	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[real_size]);
 
 	kfree(ptr);
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7e96abf1bd7d..33b1886b06eb 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1333,11 +1333,11 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
 	void *ret;
 	size_t ks;
 
-	/* Check for double-free before calling ksize. */
+	/* Don't use instrumented ksize to allow precise KASAN poisoning. */
 	if (likely(!ZERO_OR_NULL_PTR(p))) {
 		if (!kasan_check_byte(p))
 			return NULL;
-		ks = ksize(p);
+		ks = kfence_ksize(p) ?: __ksize(p);
 	} else
 		ks = 0;
 
@@ -1405,10 +1405,8 @@ void kfree_sensitive(const void *p)
 	void *mem = (void *)p;
 
 	ks = ksize(mem);
-	if (ks) {
-		kasan_unpoison_range(mem, ks);
+	if (ks)
 		memzero_explicit(mem, ks);
-	}
 	kfree(mem);
 }
 EXPORT_SYMBOL(kfree_sensitive);
@@ -1429,11 +1427,13 @@ EXPORT_SYMBOL(kfree_sensitive);
  */
 size_t ksize(const void *objp)
 {
+	size_t size;
+
 	/*
-	 * We need to first check that the pointer to the object is valid.
-	 * The KASAN report printed from ksize() is more useful, then when
-	 * it's printed later when the behaviour could be undefined due to
-	 * a potential use-after-free or double-free.
+	 * We need to first check that the pointer to the object is valid, and
+	 * only then unpoison the memory. The report printed from ksize() is
+	 * more useful, then when it's printed later when the behaviour could
+	 * be undefined due to a potential use-after-free or double-free.
 	 *
 	 * We use kasan_check_byte(), which is supported for the hardware
 	 * tag-based KASAN mode, unlike kasan_check_read/write().
@@ -1447,7 +1447,13 @@ size_t ksize(const void *objp)
 	if (unlikely(ZERO_OR_NULL_PTR(objp)) || !kasan_check_byte(objp))
 		return 0;
 
-	return kfence_ksize(objp) ?: __ksize(objp);
+	size = kfence_ksize(objp) ?: __ksize(objp);
+	/*
+	 * We assume that ksize callers could use whole allocated area,
+	 * so we need to unpoison this area.
+	 */
+	kasan_unpoison_range(objp, size);
+	return size;
 }
 EXPORT_SYMBOL(ksize);
 

^ permalink raw reply related

* Re: [syzbot] KASAN: slab-out-of-bounds Write in copy_verifier_state
From: syzbot @ 2022-12-20 19:53 UTC (permalink / raw)
  To: sdf
  Cc: andrii, ast, bpf, daniel, davem, haoluo, hawk, john.fastabend,
	jolsa, kpsingh, kuba, linux-kernel, llvm, martin.lau, nathan,
	ndesaulniers, netdev, sdf, song, syzkaller-bugs, trix, yhs
In-Reply-To: <Y6ISrPSqopYajjfP@google.com>

> On 12/19, syzbot wrote:
>> Hello,
>> 
>> syzbot found the following issue on:
>> 
>> HEAD commit:    041fae9c105a Merge tag 'f2fs-for-6.2-rc1' of git://git.ker..
>> git tree:       upstream
>> console+strace: https://syzkaller.appspot.com/x/log.txt?x=136be5d0480000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=e2f3d9d232a3cac5
>> dashboard link: https://syzkaller.appspot.com/bug?extid=59af7bf76d795311da8c
>> compiler:       Debian clang version 13.0.1-++20220126092033+75e33f71c2da-1~exp1~20220126212112.63, GNU ld (GNU Binutils for Debian) 2.35.2
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1650d477880000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1305f993880000
>> 
>> Downloadable assets:
>> disk image: https://storage.googleapis.com/syzbot-assets/e228bf558f91/disk-041fae9c.raw.xz
>> vmlinux: https://storage.googleapis.com/syzbot-assets/11eeb90801b7/vmlinux-041fae9c.xz
>> kernel image: https://storage.googleapis.com/syzbot-assets/1f9651f3b5bd/bzImage-041fae9c.xz
>> 
>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>> Reported-by: syzbot+59af7bf76d795311da8c@syzkaller.appspotmail.com
>> 
>> ==================================================================
>> BUG: KASAN: slab-out-of-bounds in copy_array kernel/bpf/verifier.c:1072 [inline]
>> BUG: KASAN: slab-out-of-bounds in copy_verifier_state+0x130/0xbe0 kernel/bpf/verifier.c:1250
>> Write of size 80 at addr ffff888022c71000 by task syz-executor186/5067
>> 
>> CPU: 0 PID: 5067 Comm: syz-executor186 Not tainted 6.1.0-syzkaller-10971-g041fae9c105a #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
>> Call Trace:
>>  <TASK>
>>  __dump_stack lib/dump_stack.c:88 [inline]
>>  dump_stack_lvl+0x1e3/0x2d0 lib/dump_stack.c:106
>>  print_address_description+0x74/0x340 mm/kasan/report.c:306
>>  print_report+0x107/0x220 mm/kasan/report.c:417
>>  kasan_report+0x139/0x170 mm/kasan/report.c:517
>>  kasan_check_range+0x2a7/0x2e0 mm/kasan/generic.c:189
>>  memcpy+0x3c/0x60 mm/kasan/shadow.c:66
>>  copy_array kernel/bpf/verifier.c:1072 [inline]
>>  copy_verifier_state+0x130/0xbe0 kernel/bpf/verifier.c:1250
>>  pop_stack kernel/bpf/verifier.c:1314 [inline]
>>  do_check+0x8e51/0x107b0 kernel/bpf/verifier.c:14031
>>  do_check_common+0x909/0x1800 kernel/bpf/verifier.c:16289
>>  do_check_main kernel/bpf/verifier.c:16352 [inline]
>>  bpf_check+0x107e2/0x16170 kernel/bpf/verifier.c:16936
>>  bpf_prog_load+0x1306/0x1be0 kernel/bpf/syscall.c:2619
>>  __sys_bpf+0x396/0x6d0 kernel/bpf/syscall.c:4979
>>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>>  __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:5081
>>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>>  do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
>>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> RIP: 0033:0x7ff1fb190c29
>> Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
>> RSP: 002b:00007ffeaae55678 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
>> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff1fb190c29
>> RDX: 0000000000000048 RSI: 0000000020000200 RDI: 0000000000000005
>> RBP: 00007ff1fb154dd0 R08: 0000000000000000 R09: 0000000000000000
>> R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ff1fb154e60
>> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>>  </TASK>
>> 
>> Allocated by task 5067:
>>  kasan_save_stack mm/kasan/common.c:45 [inline]
>>  kasan_set_track+0x4c/0x70 mm/kasan/common.c:52
>>  ____kasan_kmalloc mm/kasan/common.c:371 [inline]
>>  __kasan_krealloc+0xbf/0xf0 mm/kasan/common.c:439
>>  kasan_krealloc include/linux/kasan.h:231 [inline]
>>  __do_krealloc mm/slab_common.c:1361 [inline]
>>  krealloc+0xb2/0x110 mm/slab_common.c:1398
>>  push_jmp_history kernel/bpf/verifier.c:2592 [inline]
>>  is_state_visited kernel/bpf/verifier.c:13552 [inline]
>>  do_check+0x9433/0x107b0 kernel/bpf/verifier.c:13752
>>  do_check_common+0x909/0x1800 kernel/bpf/verifier.c:16289
>>  do_check_main kernel/bpf/verifier.c:16352 [inline]
>>  bpf_check+0x107e2/0x16170 kernel/bpf/verifier.c:16936
>>  bpf_prog_load+0x1306/0x1be0 kernel/bpf/syscall.c:2619
>>  __sys_bpf+0x396/0x6d0 kernel/bpf/syscall.c:4979
>>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>>  __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:5081
>>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>>  do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
>>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> 
>> The buggy address belongs to the object at ffff888022c71000
>>  which belongs to the cache kmalloc-96 of size 96
>> The buggy address is located 0 bytes inside of
>>  96-byte region [ffff888022c71000, ffff888022c71060)
>> 
>> The buggy address belongs to the physical page:
>> page:ffffea00008b1c40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22c71
>> ksm flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
>> raw: 00fff00000000200 ffff888012841780 ffffea0000a6d880 0000000000000003
>> raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
>> page dumped because: kasan: bad access detected
>> page_owner tracks the page as allocated
>> page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12c40(GFP_NOFS|__GFP_NOWARN|__GFP_NORETRY), pid 4437, tgid 4437 (udevd), ts 26581643327, free_ts 26581082061
>>  prep_new_page mm/page_alloc.c:2531 [inline]
>>  get_page_from_freelist+0x72b/0x7a0 mm/page_alloc.c:4283
>>  __alloc_pages+0x259/0x560 mm/page_alloc.c:5549
>>  alloc_slab_page+0xbd/0x190 mm/slub.c:1851
>>  allocate_slab+0x5e/0x3c0 mm/slub.c:1998
>>  new_slab mm/slub.c:2051 [inline]
>>  ___slab_alloc+0x7f4/0xeb0 mm/slub.c:3193
>>  __slab_alloc mm/slub.c:3292 [inline]
>>  __slab_alloc_node mm/slub.c:3345 [inline]
>>  slab_alloc_node mm/slub.c:3442 [inline]
>>  __kmem_cache_alloc_node+0x25b/0x340 mm/slub.c:3491
>>  __do_kmalloc_node mm/slab_common.c:967 [inline]
>>  __kmalloc+0x9e/0x190 mm/slab_common.c:981
>>  kmalloc include/linux/slab.h:584 [inline]
>>  kzalloc include/linux/slab.h:720 [inline]
>>  tomoyo_encode2 security/tomoyo/realpath.c:45 [inline]
>>  tomoyo_encode+0x26f/0x540 security/tomoyo/realpath.c:80
>>  tomoyo_realpath_from_path+0x5ae/0x5f0 security/tomoyo/realpath.c:283
>>  tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
>>  tomoyo_path_perm+0x280/0x680 security/tomoyo/file.c:822
>>  security_inode_getattr+0xc0/0x140 security/security.c:1375
>>  vfs_getattr fs/stat.c:161 [inline]
>>  vfs_statx+0x198/0x4b0 fs/stat.c:236
>>  vfs_fstatat fs/stat.c:270 [inline]
>>  __do_sys_newfstatat fs/stat.c:440 [inline]
>>  __se_sys_newfstatat+0x104/0x7b0 fs/stat.c:434
>>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>>  do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
>>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> page last free stack trace:
>>  reset_page_owner include/linux/page_owner.h:24 [inline]
>>  free_pages_prepare mm/page_alloc.c:1446 [inline]
>>  free_pcp_prepare+0x751/0x780 mm/page_alloc.c:1496
>>  free_unref_page_prepare mm/page_alloc.c:3369 [inline]
>>  free_unref_page+0x19/0x4c0 mm/page_alloc.c:3464
>>  free_pipe_info+0x302/0x380 fs/pipe.c:851
>>  put_pipe_info fs/pipe.c:711 [inline]
>>  pipe_release+0x232/0x310 fs/pipe.c:734
>>  __fput+0x3ba/0x880 fs/file_table.c:320
>>  task_work_run+0x243/0x300 kernel/task_work.c:179
>>  resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
>>  exit_to_user_mode_loop+0x134/0x160 kernel/entry/common.c:171
>>  exit_to_user_mode_prepare+0xad/0x110 kernel/entry/common.c:203
>>  __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
>>  syscall_exit_to_user_mode+0x2e/0x60 kernel/entry/common.c:296
>>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> 
>> Memory state around the buggy address:
>>  ffff888022c70f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>  ffff888022c70f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> >ffff888022c71000: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>>                          ^
>>  ffff888022c71080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>>  ffff888022c71100: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>> ==================================================================
>> 
>> 
>> ---
>> This report is generated by a bot. It may contain errors.
>> See https://goo.gl/tpsmEJ for more information about syzbot.
>> syzbot engineers can be reached at syzkaller@googlegroups.com.
>> 
>> syzbot will keep track of this issue. See:
>> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
>> syzbot can test patches for this issue, for details see:
>> https://goo.gl/tpsmEJ#testing-patches
>
> #syz test

want 2 args (repo, branch), got 1


^ permalink raw reply

* Re: [syzbot] KASAN: slab-out-of-bounds Write in copy_array
From: sdf @ 2022-12-20 19:56 UTC (permalink / raw)
  To: syzbot
  Cc: andrii, ast, bpf, daniel, davem, haoluo, hawk, john.fastabend,
	jolsa, kpsingh, kuba, linux-kernel, llvm, martin.lau, nathan,
	ndesaulniers, netdev, song, syzkaller-bugs, trix, yhs
In-Reply-To: <000000000000ab724705ee87e321@google.com>

[-- Attachment #1: Type: text/plain, Size: 7785 bytes --]

On 11/28, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    c35bd4e42885 Add linux-next specific files for 20221124
> git tree:       linux-next
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=13369dc5880000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=11e19c740a0b2926
> dashboard link: https://syzkaller.appspot.com/bug?extid=b1e1f7feb407b56d0355
> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1345a205880000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=124c644b880000
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/968fee464d14/disk-c35bd4e4.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/4f46fe801b5b/vmlinux-c35bd4e4.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/c2cdf8fb264e/bzImage-c35bd4e4.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+b1e1f7feb407b56d0355@syzkaller.appspotmail.com
> 
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in copy_array+0x96/0x100 kernel/bpf/verifier.c:1032
> Write of size 232 at addr ffff88801ed62600 by task syz-executor990/5290
> 
> CPU: 0 PID: 5290 Comm: syz-executor990 Not tainted 6.1.0-rc6-next-20221124-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
> Call Trace:
>  <TASK>
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0xd1/0x138 lib/dump_stack.c:106
>  print_address_description mm/kasan/report.c:253 [inline]
>  print_report+0x15e/0x45d mm/kasan/report.c:364
>  kasan_report+0xbf/0x1f0 mm/kasan/report.c:464
>  check_region_inline mm/kasan/generic.c:183 [inline]
>  kasan_check_range+0x141/0x190 mm/kasan/generic.c:189
>  memcpy+0x3d/0x60 mm/kasan/shadow.c:66
>  copy_array+0x96/0x100 kernel/bpf/verifier.c:1032
>  copy_verifier_state+0xa9/0xbe0 kernel/bpf/verifier.c:1210
>  pop_stack+0x8c/0x2f0 kernel/bpf/verifier.c:1273
>  do_check kernel/bpf/verifier.c:13733 [inline]
>  do_check_common+0x372b/0xc5e0 kernel/bpf/verifier.c:15991
>  do_check_main kernel/bpf/verifier.c:16054 [inline]
>  bpf_check+0x7371/0xad00 kernel/bpf/verifier.c:16624
>  bpf_prog_load+0x1543/0x2230 kernel/bpf/syscall.c:2619
>  __sys_bpf+0x1436/0x4ff0 kernel/bpf/syscall.c:4979
>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>  __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5081
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7fc18e7bbc29
> Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007ffd8f27a968 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc18e7bbc29
> RDX: 0000000000000048 RSI: 0000000020000200 RDI: 0000000000000005
> RBP: 00007fc18e77fdd0 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000ffffffff R11: 0000000000000246 R12: 00007fc18e77fe60
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>  </TASK>
> 
> Allocated by task 5290:
>  kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
>  kasan_set_track+0x25/0x30 mm/kasan/common.c:52
>  ____kasan_kmalloc mm/kasan/common.c:376 [inline]
>  ____kasan_kmalloc mm/kasan/common.c:335 [inline]
>  __kasan_krealloc+0x145/0x180 mm/kasan/common.c:444
>  kasan_krealloc include/linux/kasan.h:232 [inline]
>  __do_krealloc mm/slab_common.c:1348 [inline]
>  krealloc+0xa8/0x100 mm/slab_common.c:1385
>  push_jmp_history+0x89/0x260 kernel/bpf/verifier.c:2528
>  is_state_visited kernel/bpf/verifier.c:13269 [inline]
>  do_check kernel/bpf/verifier.c:13466 [inline]
>  do_check_common+0x4b47/0xc5e0 kernel/bpf/verifier.c:15991
>  do_check_main kernel/bpf/verifier.c:16054 [inline]
>  bpf_check+0x7371/0xad00 kernel/bpf/verifier.c:16624
>  bpf_prog_load+0x1543/0x2230 kernel/bpf/syscall.c:2619
>  __sys_bpf+0x1436/0x4ff0 kernel/bpf/syscall.c:4979
>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>  __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5081
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> The buggy address belongs to the object at ffff88801ed62600
>  which belongs to the cache kmalloc-256 of size 256
> The buggy address is located 0 bytes inside of
>  256-byte region [ffff88801ed62600, ffff88801ed62700)
> 
> The buggy address belongs to the physical page:
> page:ffffea00007b5880 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1ed62
> head:ffffea00007b5880 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
> flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
> raw: 00fff00000010200 ffff888012441b40 ffffea0000809f80 dead000000000002
> raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
> page_owner tracks the page as allocated
> page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 56, tgid 56 (kworker/u4:4), ts 7761288109, free_ts 0
>  prep_new_page mm/page_alloc.c:2541 [inline]
>  get_page_from_freelist+0x119c/0x2cd0 mm/page_alloc.c:4293
>  __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5551
>  alloc_pages+0x1aa/0x270 mm/mempolicy.c:2285
>  alloc_slab_page mm/slub.c:1833 [inline]
>  allocate_slab+0x25e/0x350 mm/slub.c:1980
>  new_slab mm/slub.c:2033 [inline]
>  ___slab_alloc+0xa91/0x1400 mm/slub.c:3211
>  __slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3310
>  slab_alloc_node mm/slub.c:3395 [inline]
>  __kmem_cache_alloc_node+0x1a9/0x430 mm/slub.c:3472
>  kmalloc_trace+0x26/0x60 mm/slab_common.c:1049
>  kmalloc include/linux/slab.h:571 [inline]
>  scsi_probe_and_add_lun+0x3ae/0x34d0 drivers/scsi/scsi_scan.c:1186
>  __scsi_scan_target+0x21f/0xda0 drivers/scsi/scsi_scan.c:1664
>  scsi_scan_channel drivers/scsi/scsi_scan.c:1752 [inline]
>  scsi_scan_channel+0x148/0x1e0 drivers/scsi/scsi_scan.c:1728
>  scsi_scan_host_selected+0x2e3/0x3b0 drivers/scsi/scsi_scan.c:1781
>  do_scsi_scan_host+0x1e8/0x260 drivers/scsi/scsi_scan.c:1920
>  do_scan_async+0x42/0x500 drivers/scsi/scsi_scan.c:1930
>  async_run_entry_fn+0x9c/0x530 kernel/async.c:127
>  process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
> page_owner free stack trace missing
> 
> Memory state around the buggy address:
>  ffff88801ed62500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88801ed62580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >ffff88801ed62600: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
>                                ^
>  ffff88801ed62680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88801ed62700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git c35bd4e42885

[-- Attachment #2: syz.patch --]
[-- Type: text/x-diff, Size: 3676 bytes --]

commit 589aa8077dcf413b06c78c6d8095496c98720c25
Author:     Stanislav Fomichev <sdf@google.com>
AuthorDate: Tue Dec 20 11:51:17 2022 -0800
Commit:     Stanislav Fomichev <sdf@google.com>
CommitDate: Tue Dec 20 11:51:17 2022 -0800

    Revert "mm: Make ksize() a reporting-only function"
    
    This reverts commit 38931d8989b5760b0bd17c9ec99e81986258e4cb.

diff --git a/mm/kasan/kasan_test.c b/mm/kasan/kasan_test.c
index 73684642c42d..0d59098f0876 100644
--- a/mm/kasan/kasan_test.c
+++ b/mm/kasan/kasan_test.c
@@ -783,30 +783,23 @@ static void kasan_global_oob_left(struct kunit *test)
 	KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p);
 }
 
-/* Check that ksize() does NOT unpoison whole object. */
+/* Check that ksize() makes the whole object accessible. */
 static void ksize_unpoisons_memory(struct kunit *test)
 {
 	char *ptr;
-	size_t size = 128 - KASAN_GRANULE_SIZE - 5;
-	size_t real_size;
+	size_t size = 123, real_size;
 
 	ptr = kmalloc(size, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
-
 	real_size = ksize(ptr);
-	KUNIT_EXPECT_GT(test, real_size, size);
 
 	OPTIMIZER_HIDE_VAR(ptr);
 
-	/* These accesses shouldn't trigger a KASAN report. */
-	ptr[0] = 'x';
-	ptr[size - 1] = 'x';
+	/* This access shouldn't trigger a KASAN report. */
+	ptr[size] = 'x';
 
-	/* These must trigger a KASAN report. */
-	if (IS_ENABLED(CONFIG_KASAN_GENERIC))
-		KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size]);
-	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size + 5]);
-	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[real_size - 1]);
+	/* This one must. */
+	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[real_size]);
 
 	kfree(ptr);
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7e96abf1bd7d..33b1886b06eb 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1333,11 +1333,11 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
 	void *ret;
 	size_t ks;
 
-	/* Check for double-free before calling ksize. */
+	/* Don't use instrumented ksize to allow precise KASAN poisoning. */
 	if (likely(!ZERO_OR_NULL_PTR(p))) {
 		if (!kasan_check_byte(p))
 			return NULL;
-		ks = ksize(p);
+		ks = kfence_ksize(p) ?: __ksize(p);
 	} else
 		ks = 0;
 
@@ -1405,10 +1405,8 @@ void kfree_sensitive(const void *p)
 	void *mem = (void *)p;
 
 	ks = ksize(mem);
-	if (ks) {
-		kasan_unpoison_range(mem, ks);
+	if (ks)
 		memzero_explicit(mem, ks);
-	}
 	kfree(mem);
 }
 EXPORT_SYMBOL(kfree_sensitive);
@@ -1429,11 +1427,13 @@ EXPORT_SYMBOL(kfree_sensitive);
  */
 size_t ksize(const void *objp)
 {
+	size_t size;
+
 	/*
-	 * We need to first check that the pointer to the object is valid.
-	 * The KASAN report printed from ksize() is more useful, then when
-	 * it's printed later when the behaviour could be undefined due to
-	 * a potential use-after-free or double-free.
+	 * We need to first check that the pointer to the object is valid, and
+	 * only then unpoison the memory. The report printed from ksize() is
+	 * more useful, then when it's printed later when the behaviour could
+	 * be undefined due to a potential use-after-free or double-free.
 	 *
 	 * We use kasan_check_byte(), which is supported for the hardware
 	 * tag-based KASAN mode, unlike kasan_check_read/write().
@@ -1447,7 +1447,13 @@ size_t ksize(const void *objp)
 	if (unlikely(ZERO_OR_NULL_PTR(objp)) || !kasan_check_byte(objp))
 		return 0;
 
-	return kfence_ksize(objp) ?: __ksize(objp);
+	size = kfence_ksize(objp) ?: __ksize(objp);
+	/*
+	 * We assume that ksize callers could use whole allocated area,
+	 * so we need to unpoison this area.
+	 */
+	kasan_unpoison_range(objp, size);
+	return size;
 }
 EXPORT_SYMBOL(ksize);
 

^ permalink raw reply related

* Re: [syzbot] KASAN: slab-out-of-bounds Write in copy_verifier_state
From: sdf @ 2022-12-20 19:57 UTC (permalink / raw)
  To: syzbot
  Cc: andrii, ast, bpf, daniel, davem, haoluo, hawk, john.fastabend,
	jolsa, kpsingh, kuba, linux-kernel, llvm, martin.lau, nathan,
	ndesaulniers, netdev, song, syzkaller-bugs, trix, yhs
In-Reply-To: <000000000000269f9a05f02be9d8@google.com>

[-- Attachment #1: Type: text/plain, Size: 8808 bytes --]

On 12/19, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    041fae9c105a Merge tag 'f2fs-for-6.2-rc1' of git://git.ker..
> git tree:       upstream
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=136be5d0480000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=e2f3d9d232a3cac5
> dashboard link: https://syzkaller.appspot.com/bug?extid=59af7bf76d795311da8c
> compiler:       Debian clang version 13.0.1-++20220126092033+75e33f71c2da-1~exp1~20220126212112.63, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1650d477880000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1305f993880000
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/e228bf558f91/disk-041fae9c.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/11eeb90801b7/vmlinux-041fae9c.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/1f9651f3b5bd/bzImage-041fae9c.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+59af7bf76d795311da8c@syzkaller.appspotmail.com
> 
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in copy_array kernel/bpf/verifier.c:1072 [inline]
> BUG: KASAN: slab-out-of-bounds in copy_verifier_state+0x130/0xbe0 kernel/bpf/verifier.c:1250
> Write of size 80 at addr ffff888022c71000 by task syz-executor186/5067
> 
> CPU: 0 PID: 5067 Comm: syz-executor186 Not tainted 6.1.0-syzkaller-10971-g041fae9c105a #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
> Call Trace:
>  <TASK>
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0x1e3/0x2d0 lib/dump_stack.c:106
>  print_address_description+0x74/0x340 mm/kasan/report.c:306
>  print_report+0x107/0x220 mm/kasan/report.c:417
>  kasan_report+0x139/0x170 mm/kasan/report.c:517
>  kasan_check_range+0x2a7/0x2e0 mm/kasan/generic.c:189
>  memcpy+0x3c/0x60 mm/kasan/shadow.c:66
>  copy_array kernel/bpf/verifier.c:1072 [inline]
>  copy_verifier_state+0x130/0xbe0 kernel/bpf/verifier.c:1250
>  pop_stack kernel/bpf/verifier.c:1314 [inline]
>  do_check+0x8e51/0x107b0 kernel/bpf/verifier.c:14031
>  do_check_common+0x909/0x1800 kernel/bpf/verifier.c:16289
>  do_check_main kernel/bpf/verifier.c:16352 [inline]
>  bpf_check+0x107e2/0x16170 kernel/bpf/verifier.c:16936
>  bpf_prog_load+0x1306/0x1be0 kernel/bpf/syscall.c:2619
>  __sys_bpf+0x396/0x6d0 kernel/bpf/syscall.c:4979
>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>  __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:5081
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7ff1fb190c29
> Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007ffeaae55678 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff1fb190c29
> RDX: 0000000000000048 RSI: 0000000020000200 RDI: 0000000000000005
> RBP: 00007ff1fb154dd0 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ff1fb154e60
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>  </TASK>
> 
> Allocated by task 5067:
>  kasan_save_stack mm/kasan/common.c:45 [inline]
>  kasan_set_track+0x4c/0x70 mm/kasan/common.c:52
>  ____kasan_kmalloc mm/kasan/common.c:371 [inline]
>  __kasan_krealloc+0xbf/0xf0 mm/kasan/common.c:439
>  kasan_krealloc include/linux/kasan.h:231 [inline]
>  __do_krealloc mm/slab_common.c:1361 [inline]
>  krealloc+0xb2/0x110 mm/slab_common.c:1398
>  push_jmp_history kernel/bpf/verifier.c:2592 [inline]
>  is_state_visited kernel/bpf/verifier.c:13552 [inline]
>  do_check+0x9433/0x107b0 kernel/bpf/verifier.c:13752
>  do_check_common+0x909/0x1800 kernel/bpf/verifier.c:16289
>  do_check_main kernel/bpf/verifier.c:16352 [inline]
>  bpf_check+0x107e2/0x16170 kernel/bpf/verifier.c:16936
>  bpf_prog_load+0x1306/0x1be0 kernel/bpf/syscall.c:2619
>  __sys_bpf+0x396/0x6d0 kernel/bpf/syscall.c:4979
>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>  __x64_sys_bpf+0x78/0x90 kernel/bpf/syscall.c:5081
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> The buggy address belongs to the object at ffff888022c71000
>  which belongs to the cache kmalloc-96 of size 96
> The buggy address is located 0 bytes inside of
>  96-byte region [ffff888022c71000, ffff888022c71060)
> 
> The buggy address belongs to the physical page:
> page:ffffea00008b1c40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22c71
> ksm flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
> raw: 00fff00000000200 ffff888012841780 ffffea0000a6d880 0000000000000003
> raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
> page_owner tracks the page as allocated
> page last allocated via order 0, migratetype Unmovable, gfp_mask 0x12c40(GFP_NOFS|__GFP_NOWARN|__GFP_NORETRY), pid 4437, tgid 4437 (udevd), ts 26581643327, free_ts 26581082061
>  prep_new_page mm/page_alloc.c:2531 [inline]
>  get_page_from_freelist+0x72b/0x7a0 mm/page_alloc.c:4283
>  __alloc_pages+0x259/0x560 mm/page_alloc.c:5549
>  alloc_slab_page+0xbd/0x190 mm/slub.c:1851
>  allocate_slab+0x5e/0x3c0 mm/slub.c:1998
>  new_slab mm/slub.c:2051 [inline]
>  ___slab_alloc+0x7f4/0xeb0 mm/slub.c:3193
>  __slab_alloc mm/slub.c:3292 [inline]
>  __slab_alloc_node mm/slub.c:3345 [inline]
>  slab_alloc_node mm/slub.c:3442 [inline]
>  __kmem_cache_alloc_node+0x25b/0x340 mm/slub.c:3491
>  __do_kmalloc_node mm/slab_common.c:967 [inline]
>  __kmalloc+0x9e/0x190 mm/slab_common.c:981
>  kmalloc include/linux/slab.h:584 [inline]
>  kzalloc include/linux/slab.h:720 [inline]
>  tomoyo_encode2 security/tomoyo/realpath.c:45 [inline]
>  tomoyo_encode+0x26f/0x540 security/tomoyo/realpath.c:80
>  tomoyo_realpath_from_path+0x5ae/0x5f0 security/tomoyo/realpath.c:283
>  tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
>  tomoyo_path_perm+0x280/0x680 security/tomoyo/file.c:822
>  security_inode_getattr+0xc0/0x140 security/security.c:1375
>  vfs_getattr fs/stat.c:161 [inline]
>  vfs_statx+0x198/0x4b0 fs/stat.c:236
>  vfs_fstatat fs/stat.c:270 [inline]
>  __do_sys_newfstatat fs/stat.c:440 [inline]
>  __se_sys_newfstatat+0x104/0x7b0 fs/stat.c:434
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> page last free stack trace:
>  reset_page_owner include/linux/page_owner.h:24 [inline]
>  free_pages_prepare mm/page_alloc.c:1446 [inline]
>  free_pcp_prepare+0x751/0x780 mm/page_alloc.c:1496
>  free_unref_page_prepare mm/page_alloc.c:3369 [inline]
>  free_unref_page+0x19/0x4c0 mm/page_alloc.c:3464
>  free_pipe_info+0x302/0x380 fs/pipe.c:851
>  put_pipe_info fs/pipe.c:711 [inline]
>  pipe_release+0x232/0x310 fs/pipe.c:734
>  __fput+0x3ba/0x880 fs/file_table.c:320
>  task_work_run+0x243/0x300 kernel/task_work.c:179
>  resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
>  exit_to_user_mode_loop+0x134/0x160 kernel/entry/common.c:171
>  exit_to_user_mode_prepare+0xad/0x110 kernel/entry/common.c:203
>  __syscall_exit_to_user_mode_work kernel/entry/common.c:285 [inline]
>  syscall_exit_to_user_mode+0x2e/0x60 kernel/entry/common.c:296
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> Memory state around the buggy address:
>  ffff888022c70f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  ffff888022c70f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >ffff888022c71000: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>                          ^
>  ffff888022c71080: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>  ffff888022c71100: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ==================================================================
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 041fae9c105a

[-- Attachment #2: syz.patch --]
[-- Type: text/x-diff, Size: 3676 bytes --]

commit 589aa8077dcf413b06c78c6d8095496c98720c25
Author:     Stanislav Fomichev <sdf@google.com>
AuthorDate: Tue Dec 20 11:51:17 2022 -0800
Commit:     Stanislav Fomichev <sdf@google.com>
CommitDate: Tue Dec 20 11:51:17 2022 -0800

    Revert "mm: Make ksize() a reporting-only function"
    
    This reverts commit 38931d8989b5760b0bd17c9ec99e81986258e4cb.

diff --git a/mm/kasan/kasan_test.c b/mm/kasan/kasan_test.c
index 73684642c42d..0d59098f0876 100644
--- a/mm/kasan/kasan_test.c
+++ b/mm/kasan/kasan_test.c
@@ -783,30 +783,23 @@ static void kasan_global_oob_left(struct kunit *test)
 	KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p);
 }
 
-/* Check that ksize() does NOT unpoison whole object. */
+/* Check that ksize() makes the whole object accessible. */
 static void ksize_unpoisons_memory(struct kunit *test)
 {
 	char *ptr;
-	size_t size = 128 - KASAN_GRANULE_SIZE - 5;
-	size_t real_size;
+	size_t size = 123, real_size;
 
 	ptr = kmalloc(size, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
-
 	real_size = ksize(ptr);
-	KUNIT_EXPECT_GT(test, real_size, size);
 
 	OPTIMIZER_HIDE_VAR(ptr);
 
-	/* These accesses shouldn't trigger a KASAN report. */
-	ptr[0] = 'x';
-	ptr[size - 1] = 'x';
+	/* This access shouldn't trigger a KASAN report. */
+	ptr[size] = 'x';
 
-	/* These must trigger a KASAN report. */
-	if (IS_ENABLED(CONFIG_KASAN_GENERIC))
-		KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size]);
-	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size + 5]);
-	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[real_size - 1]);
+	/* This one must. */
+	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[real_size]);
 
 	kfree(ptr);
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7e96abf1bd7d..33b1886b06eb 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1333,11 +1333,11 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
 	void *ret;
 	size_t ks;
 
-	/* Check for double-free before calling ksize. */
+	/* Don't use instrumented ksize to allow precise KASAN poisoning. */
 	if (likely(!ZERO_OR_NULL_PTR(p))) {
 		if (!kasan_check_byte(p))
 			return NULL;
-		ks = ksize(p);
+		ks = kfence_ksize(p) ?: __ksize(p);
 	} else
 		ks = 0;
 
@@ -1405,10 +1405,8 @@ void kfree_sensitive(const void *p)
 	void *mem = (void *)p;
 
 	ks = ksize(mem);
-	if (ks) {
-		kasan_unpoison_range(mem, ks);
+	if (ks)
 		memzero_explicit(mem, ks);
-	}
 	kfree(mem);
 }
 EXPORT_SYMBOL(kfree_sensitive);
@@ -1429,11 +1427,13 @@ EXPORT_SYMBOL(kfree_sensitive);
  */
 size_t ksize(const void *objp)
 {
+	size_t size;
+
 	/*
-	 * We need to first check that the pointer to the object is valid.
-	 * The KASAN report printed from ksize() is more useful, then when
-	 * it's printed later when the behaviour could be undefined due to
-	 * a potential use-after-free or double-free.
+	 * We need to first check that the pointer to the object is valid, and
+	 * only then unpoison the memory. The report printed from ksize() is
+	 * more useful, then when it's printed later when the behaviour could
+	 * be undefined due to a potential use-after-free or double-free.
 	 *
 	 * We use kasan_check_byte(), which is supported for the hardware
 	 * tag-based KASAN mode, unlike kasan_check_read/write().
@@ -1447,7 +1447,13 @@ size_t ksize(const void *objp)
 	if (unlikely(ZERO_OR_NULL_PTR(objp)) || !kasan_check_byte(objp))
 		return 0;
 
-	return kfence_ksize(objp) ?: __ksize(objp);
+	size = kfence_ksize(objp) ?: __ksize(objp);
+	/*
+	 * We assume that ksize callers could use whole allocated area,
+	 * so we need to unpoison this area.
+	 */
+	kasan_unpoison_range(objp, size);
+	return size;
 }
 EXPORT_SYMBOL(ksize);
 

^ permalink raw reply related

* Re: [syzbot] KASAN: slab-out-of-bounds Write in copy_array
From: syzbot @ 2022-12-20 20:11 UTC (permalink / raw)
  To: andrii, ast, bpf, daniel, davem, haoluo, hawk, john.fastabend,
	jolsa, kpsingh, kuba, linux-kernel, llvm, martin.lau, nathan,
	ndesaulniers, netdev, sdf, song, syzkaller-bugs, trix, yhs
In-Reply-To: <Y6ITdVe2DPTioPvc@google.com>

Hello,

syzbot tried to test the proposed patch but the build/boot failed:

failed to apply patch:
checking file mm/kasan/kasan_test.c
Hunk #1 FAILED at 783.
1 out of 1 hunk FAILED
checking file mm/slab_common.c
Hunk #1 succeeded at 1335 (offset 2 lines).
Hunk #2 succeeded at 1407 (offset 2 lines).
Hunk #3 succeeded at 1415 with fuzz 1 (offset -12 lines).
Hunk #4 succeeded at 1435 (offset -12 lines).



Tested on:

commit:         c35bd4e4 Add linux-next specific files for 20221124
git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
dashboard link: https://syzkaller.appspot.com/bug?extid=b1e1f7feb407b56d0355
compiler:       
patch:          https://syzkaller.appspot.com/x/patch.diff?x=17af7bf0480000


^ permalink raw reply

* [PATCH net ] vmxnet3: correctly report csum_level for encapsulated packet
From: Ronak Doshi @ 2022-12-20 20:25 UTC (permalink / raw)
  To: netdev
  Cc: stable, Ronak Doshi, VMware PV-Drivers Reviewers, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, open list

Commit dacce2be3312 ("vmxnet3: add geneve and vxlan tunnel offload
support") added support for encapsulation offload. However, the
pathc did not report correctly the csum_level for encapsulated packet.

This patch fixes this issue by reporting correct csum level for the
encapsulated packet.

Fixes: dacce2be3312 ("vmxnet3: add geneve and vxlan tunnel offload support")
Signed-off-by: Ronak Doshi <doshir@vmware.com>
Acked-by: Peng Li <lpeng@vmware.com>
---
 drivers/net/vmxnet3/vmxnet3_drv.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 6f1e560fb15c..56267c327f0b 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -1288,6 +1288,10 @@ vmxnet3_rx_csum(struct vmxnet3_adapter *adapter,
 		    (le32_to_cpu(gdesc->dword[3]) &
 		     VMXNET3_RCD_CSUM_OK) == VMXNET3_RCD_CSUM_OK) {
 			skb->ip_summed = CHECKSUM_UNNECESSARY;
+			if ((le32_to_cpu(gdesc->dword[0]) &
+				     (1UL << VMXNET3_RCD_HDR_INNER_SHIFT))) {
+				skb->csum_level = 1;
+			}
 			WARN_ON_ONCE(!(gdesc->rcd.tcp || gdesc->rcd.udp) &&
 				     !(le32_to_cpu(gdesc->dword[0]) &
 				     (1UL << VMXNET3_RCD_HDR_INNER_SHIFT)));
@@ -1297,6 +1301,10 @@ vmxnet3_rx_csum(struct vmxnet3_adapter *adapter,
 		} else if (gdesc->rcd.v6 && (le32_to_cpu(gdesc->dword[3]) &
 					     (1 << VMXNET3_RCD_TUC_SHIFT))) {
 			skb->ip_summed = CHECKSUM_UNNECESSARY;
+			if ((le32_to_cpu(gdesc->dword[0]) &
+				     (1UL << VMXNET3_RCD_HDR_INNER_SHIFT))) {
+				skb->csum_level = 1;
+			}
 			WARN_ON_ONCE(!(gdesc->rcd.tcp || gdesc->rcd.udp) &&
 				     !(le32_to_cpu(gdesc->dword[0]) &
 				     (1UL << VMXNET3_RCD_HDR_INNER_SHIFT)));
-- 
2.11.0


^ permalink raw reply related

* [PULL] Networking for v6.2-rc1
From: Jakub Kicinski @ 2022-12-20 20:30 UTC (permalink / raw)
  To: torvalds; +Cc: kuba, davem, netdev, linux-kernel, pabeni

Hi Linus!

Traffic is winding down significantly so let us pass our fixes to you
earlier than the usual Thu schedule.

We have a fix for the BPF issue we were looking at before 6.1 final,
no surprises there. RxRPC fixes were merged relatively late so there's
an outpour of follow ups. Last one worth mentioning is the tree-wide
fix for network file systems / in-tree socket users, to prevent nested
networking calls from corrupting socket memory allocator.

Happy holidays (and unless something urgent happens) - till next year!


The following changes since commit 7e68dd7d07a28faa2e6574dd6b9dbd90cdeaae91:

  Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next (2022-12-13 15:47:48 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-6.2-rc1

for you to fetch changes up to 19e72b064fc32cd58f6fc0b1eb64ac2e4f770e76:

  net: fec: check the return value of build_skb() (2022-12-20 11:33:24 -0800)

----------------------------------------------------------------
Including fixes from bpf, netfilter and can.

Current release - regressions:

 - bpf: synchronize dispatcher update with bpf_dispatcher_xdp_func

 - rxrpc:
  - fix security setting propagation
  - fix null-deref in rxrpc_unuse_local()
  - fix switched parameters in peer tracing

Current release - new code bugs:

 - rxrpc:
   - fix I/O thread startup getting skipped
   - fix locking issues in rxrpc_put_peer_locked()
   - fix I/O thread stop
   - fix uninitialised variable in rxperf server
   - fix the return value of rxrpc_new_incoming_call()

 - microchip: vcap: fix initialization of value and mask

 - nfp: fix unaligned io read of capabilities word

Previous releases - regressions:

 - stop in-kernel socket users from corrupting socket's task_frag

 - stream: purge sk_error_queue in sk_stream_kill_queues()

 - openvswitch: fix flow lookup to use unmasked key

 - dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()

 - devlink:
   - hold region lock when flushing snapshots
   - protect devlink dump by the instance lock

Previous releases - always broken:

 - bpf:
   - prevent leak of lsm program after failed attach
   - resolve fext program type when checking map compatibility

 - skbuff: account for tail adjustment during pull operations

 - macsec: fix net device access prior to holding a lock

 - bonding: switch back when high prio link up

 - netfilter: flowtable: really fix NAT IPv6 offload

 - enetc: avoid buffer leaks on xdp_do_redirect() failure

 - unix: fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()

 - dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

----------------------------------------------------------------
Arnd Bergmann (1):
      net: ethernet: ti: am65-cpsw: fix CONFIG_PM #ifdef

Arun Ramadoss (1):
      net: dsa: microchip: remove IRQF_TRIGGER_FALLING in request_threaded_irq

Benjamin Coddington (2):
      Treewide: Stop corrupting socket's task_frag
      net: simplify sk_page_frag

Biju Das (1):
      ravb: Fix "failed to switch device to config mode" message during unbind

Christophe JAILLET (1):
      myri10ge: Fix an error handling path in myri10ge_probe()

Cong Wang (1):
      net_sched: reject TCF_EM_SIMPLE case for complex ematch module

Daniel Golle (1):
      net: dsa: mt7530: remove redundant assignment

David Howells (9):
      rxrpc: Fix missing unlock in rxrpc_do_sendmsg()
      rxrpc: Fix security setting propagation
      rxrpc: Fix NULL deref in rxrpc_unuse_local()
      rxrpc: Fix I/O thread startup getting skipped
      rxrpc: Fix locking issues in rxrpc_put_peer_locked()
      rxrpc: Fix switched parameters in peer tracing
      rxrpc: Fix I/O thread stop
      rxrpc: rxperf: Fix uninitialised variable
      rxrpc: Fix the return value of rxrpc_new_incoming_call()

David S. Miller (3):
      Merge branch 'devlink-fixes'
      Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
      Merge branch 'rxrpc-fixes'

Donald Hunter (1):
      docs/bpf: Reword docs for BPF_MAP_TYPE_SK_STORAGE

Eelco Chaudron (1):
      openvswitch: Fix flow lookup to use unmasked key

Emeel Hakim (1):
      net: macsec: fix net device access prior to holding a lock

Eric Dumazet (1):
      net: stream: purge sk_error_queue in sk_stream_kill_queues()

Gaosheng Cui (1):
      net: stmmac: fix errno when create_singlethread_workqueue() fails

Guillaume Nault (1):
      net: Introduce sk_use_task_frag in struct sock.

Haibo Chen (1):
      can: flexcan: avoid unbalanced pm_runtime_enable warning

Hangbin Liu (2):
      bonding: add missed __rcu annotation for curr_active_slave
      bonding: do failover when high prio link up

Horatiu Vultur (1):
      net: microchip: vcap: Fix initialization of value and mask

Huanhuan Wang (1):
      nfp: fix unaligned io read of capabilities word

Jakub Kicinski (10):
      Merge branch 'bonding-fix-high-prio-not-effect-issue'
      Merge branch 'misdn-don-t-call-dev_kfree_skb-kfree_skb-under-spin_lock_irqsave'
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
      devlink: hold region lock when flushing snapshots
      selftests: devlink: fix the fd redirect in dummy_reporter_test
      selftests: devlink: add a warning for interfaces coming up
      Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
      devlink: protect devlink dump by the instance lock
      Merge branch 'stop-corrupting-socket-s-task_frag'
      Merge tag 'linux-can-fixes-for-6.2-20221219' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Jeremy Kerr (1):
      mctp: serial: Fix starting value for frame check sequence

Jiri Olsa (1):
      bpf: Synchronize dispatcher update with bpf_dispatcher_xdp_func

Jiri Slaby (SUSE) (1):
      wireguard: timers: cast enum limits members to int in prints

Kirill Tkhai (1):
      unix: Fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()

Li Qiong (1):
      ipvs: add a 'default' case in do_ip_vs_set_ctl()

Li Zetao (1):
      r6040: Fix kmemleak in probe and remove

Liang Li (1):
      selftests: bonding: add bonding prio option test

Marc Kleine-Budde (1):
      can: kvaser_usb: hydra: help gcc-13 to figure out cmd_len

Matt Johnston (1):
      mctp: Remove device type check at unregister

Milan Landaverde (1):
      bpf: prevent leak of lsm program after failed attach

Minsuk Kang (1):
      nfc: pn533: Clear nfc_target before being used

Muhammad Husaini Zulkifli (1):
      igc: Add checking for basetime less than zero

Qingfang DENG (1):
      netfilter: flowtable: really fix NAT IPv6 offload

Song Liu (1):
      selftests/bpf: Select CONFIG_FUNCTION_ERROR_INJECTION

Sriram Yagnaraman (1):
      netfilter: conntrack: document sctp timeouts

Subash Abhinov Kasiviswanathan (1):
      skbuff: Account for tail adjustment during pull operations

Tan Tee Min (3):
      igc: allow BaseTime 0 enrollment for Qbv
      igc: recalculate Qbv end_time by considering cycle time
      igc: Set Qbv start_time and end_time to end_time if not being configured in GCL

Toke Høiland-Jørgensen (2):
      bpf: Resolve fext program type when checking map compatibility
      selftests/bpf: Add a test for using a cpumap from an freplace-to-XDP program

Tony Nguyen (1):
      igb: Initialize mailbox message for VF reset

Vincent Mailhol (1):
      Documentation: devlink: add missing toc entry for etas_es58x devlink doc

Vinicius Costa Gomes (2):
      igc: Enhance Qbv scheduling by using first flag bit
      igc: Use strict cycles for Qbv scheduling

Vladimir Oltean (2):
      net: enetc: avoid buffer leaks on xdp_do_redirect() failure
      net: dsa: mv88e6xxx: avoid reg_lock deadlock in mv88e6xxx_setup_port()

Wei Fang (1):
      net: fec: check the return value of build_skb()

Yang Yingliang (3):
      mISDN: hfcsusb: don't call dev_kfree_skb/kfree_skb() under spin_lock_irqsave()
      mISDN: hfcpci: don't call dev_kfree_skb/kfree_skb() under spin_lock_irqsave()
      mISDN: hfcmulti: don't call dev_kfree_skb/kfree_skb() under spin_lock_irqsave()

Yonghong Song (1):
      selftests/bpf: Fix a selftest compilation error with CONFIG_SMP=n

 Documentation/bpf/map_sk_storage.rst               |  56 ++---
 Documentation/networking/devlink/index.rst         |   1 +
 Documentation/networking/nf_conntrack-sysctl.rst   |  33 +++
 drivers/block/drbd/drbd_receiver.c                 |   3 +
 drivers/block/nbd.c                                |   1 +
 drivers/isdn/hardware/mISDN/hfcmulti.c             |  19 +-
 drivers/isdn/hardware/mISDN/hfcpci.c               |  13 +-
 drivers/isdn/hardware/mISDN/hfcsusb.c              |  12 +-
 drivers/net/bonding/bond_main.c                    |  24 +-
 drivers/net/can/flexcan/flexcan-core.c             |  12 +-
 drivers/net/can/usb/kvaser_usb/kvaser_usb_hydra.c  |  33 ++-
 drivers/net/dsa/microchip/ksz_common.c             |   3 +-
 drivers/net/dsa/mt7530.c                           |   3 -
 drivers/net/dsa/mv88e6xxx/chip.c                   |   9 +-
 drivers/net/ethernet/freescale/enetc/enetc.c       |  35 +--
 drivers/net/ethernet/freescale/fec_main.c          |   8 +
 drivers/net/ethernet/intel/igb/igb_main.c          |   2 +-
 drivers/net/ethernet/intel/igc/igc.h               |   3 +
 drivers/net/ethernet/intel/igc/igc_defines.h       |   2 +
 drivers/net/ethernet/intel/igc/igc_main.c          | 210 +++++++++++++++---
 drivers/net/ethernet/intel/igc/igc_tsn.c           |  13 +-
 .../net/ethernet/microchip/vcap/vcap_api_debugfs.c |   2 +
 drivers/net/ethernet/myricom/myri10ge/myri10ge.c   |   1 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |   2 +-
 drivers/net/ethernet/rdc/r6040.c                   |   5 +-
 drivers/net/ethernet/renesas/ravb_main.c           |   2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |   1 +
 drivers/net/ethernet/ti/am65-cpsw-nuss.c           |   4 +-
 drivers/net/macsec.c                               |  34 +--
 drivers/net/mctp/mctp-serial.c                     |   6 +-
 drivers/net/wireguard/timers.c                     |   8 +-
 drivers/nfc/pn533/pn533.c                          |   4 +
 drivers/nvme/host/tcp.c                            |   1 +
 drivers/scsi/iscsi_tcp.c                           |   1 +
 drivers/usb/usbip/usbip_common.c                   |   1 +
 fs/cifs/connect.c                                  |   1 +
 fs/dlm/lowcomms.c                                  |   2 +
 fs/ocfs2/cluster/tcp.c                             |   1 +
 include/net/sock.h                                 |  10 +-
 include/trace/events/rxrpc.h                       |   2 +-
 kernel/bpf/core.c                                  |   5 +-
 kernel/bpf/dispatcher.c                            |   5 +
 kernel/bpf/syscall.c                               |   6 +-
 net/9p/trans_fd.c                                  |   1 +
 net/ceph/messenger.c                               |   1 +
 net/core/devlink.c                                 |   5 +
 net/core/skbuff.c                                  |   3 +
 net/core/sock.c                                    |   1 +
 net/core/stream.c                                  |   6 +
 net/mctp/device.c                                  |  14 +-
 net/netfilter/ipvs/ip_vs_ctl.c                     |   5 +
 net/netfilter/nf_flow_table_offload.c              |   6 +-
 net/openvswitch/datapath.c                         |  25 ++-
 net/rxrpc/ar-internal.h                            |   8 +-
 net/rxrpc/call_accept.c                            |  18 +-
 net/rxrpc/call_object.c                            |   1 +
 net/rxrpc/conn_client.c                            |   2 -
 net/rxrpc/io_thread.c                              |  10 +-
 net/rxrpc/local_object.c                           |   5 +-
 net/rxrpc/peer_event.c                             |  10 +-
 net/rxrpc/peer_object.c                            |  23 +-
 net/rxrpc/rxperf.c                                 |   2 +-
 net/rxrpc/security.c                               |   6 +-
 net/rxrpc/sendmsg.c                                |   2 +-
 net/sched/ematch.c                                 |   2 +
 net/sunrpc/xprtsock.c                              |   3 +
 net/unix/af_unix.c                                 |  11 +-
 net/xfrm/espintcp.c                                |   1 +
 tools/testing/selftests/bpf/config                 |   1 +
 .../selftests/bpf/prog_tests/fexit_bpf2bpf.c       |  48 ++++
 .../testing/selftests/bpf/progs/freplace_progmap.c |  24 ++
 tools/testing/selftests/bpf/progs/rcu_read_lock.c  |   8 +-
 .../selftests/bpf/progs/task_kfunc_failure.c       |   2 +-
 .../testing/selftests/drivers/net/bonding/Makefile |   3 +-
 .../selftests/drivers/net/bonding/option_prio.sh   | 245 +++++++++++++++++++++
 .../selftests/drivers/net/netdevsim/devlink.sh     |   4 +-
 .../drivers/net/netdevsim/devlink_trap.sh          |  13 ++
 77 files changed, 861 insertions(+), 257 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/freplace_progmap.c
 create mode 100755 tools/testing/selftests/drivers/net/bonding/option_prio.sh

^ permalink raw reply

* Re: [syzbot] KASAN: slab-out-of-bounds Write in copy_verifier_state
From: syzbot @ 2022-12-20 20:30 UTC (permalink / raw)
  To: andrii, ast, bpf, daniel, davem, haoluo, hawk, john.fastabend,
	jolsa, kpsingh, kuba, linux-kernel, llvm, martin.lau, nathan,
	ndesaulniers, netdev, sdf, song, syzkaller-bugs, trix, yhs
In-Reply-To: <Y6ITpE770+kZ63vp@google.com>

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: syzbot+59af7bf76d795311da8c@syzkaller.appspotmail.com

Tested on:

commit:         041fae9c Merge tag 'f2fs-for-6.2-rc1' of git://git.ker..
git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
console output: https://syzkaller.appspot.com/x/log.txt?x=15fc914f880000
kernel config:  https://syzkaller.appspot.com/x/.config?x=e2f3d9d232a3cac5
dashboard link: https://syzkaller.appspot.com/bug?extid=59af7bf76d795311da8c
compiler:       Debian clang version 13.0.1-++20220126092033+75e33f71c2da-1~exp1~20220126212112.63, GNU ld (GNU Binutils for Debian) 2.35.2
patch:          https://syzkaller.appspot.com/x/patch.diff?x=10cacf77880000

Note: testing is done by a robot and is best-effort only.

^ permalink raw reply

* [PATCH v4] epoll: use refcount to reduce ep_mutex contention
From: Paolo Abeni @ 2022-12-20 20:55 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Soheil Hassas Yeganeh, Al Viro, Davidlohr Bueso, Jason Baron,
	netdev, Carlos Maiolino, Eric Biggers, Jacob Keller

We are observing huge contention on the epmutex during an http
connection/rate test:

 83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
[...]
           |--66.96%--__fput
                      |--60.04%--eventpoll_release_file
                                 |--58.41%--__mutex_lock.isra.6
                                           |--56.56%--osq_lock

The application is multi-threaded, creates a new epoll entry for
each incoming connection, and does not delete it before the
connection shutdown - that is, before the connection's fd close().

Many different threads compete frequently for the epmutex lock,
affecting the overall performance.

To reduce the contention this patch introduces explicit reference counting
for the eventpoll struct. Each registered event acquires a reference,
and references are released at ep_remove() time.

Additionally, this introduces a new 'dying' flag to prevent races between
the EP file close() and the monitored file close().
ep_eventpoll_release() marks, under f_lock spinlock, each epitem as before
removing it, while EP file close() does not touch dying epitems.

The eventpoll struct is released by whoever - among EP file close() and
and the monitored file close() drops its last reference.

With all the above in place, we can drop the epmutex usage at disposal time.

Overall this produces a significant performance improvement in the
mentioned connection/rate scenario: the mutex operations disappear from
the topmost offenders in the perf report, and the measured connections/rate
grows by ~60%.

To make the change more readable this additionally renames ep_free() to
ep_clear_and_put(), and moves the actual memory cleanup in a separate
ep_free() helper.

Tested-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v4: (addresses comments from Eric Biggers and Jacob)
 - replace a refcount_t
 - rename the ep book-keeping helpers

v3 at:
https://lore.kernel.org/linux-fsdevel/1aedd7e87097bc4352ba658ac948c585a655785a.1669657846.git.pabeni@redhat.com/

v2 at:
https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/

v1 at:
https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/

Previous related effort at:
https://lore.kernel.org/linux-fsdevel/20190727113542.162213-1-cj.chengjian@huawei.com/
https://lkml.org/lkml/2017/10/28/81
---
 fs/eventpoll.c | 185 +++++++++++++++++++++++++++++++------------------
 1 file changed, 116 insertions(+), 69 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 64659b110973..a43ccb02133c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -57,13 +57,7 @@
  * we need a lock that will allow us to sleep. This lock is a
  * mutex (ep->mtx). It is acquired during the event transfer loop,
  * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
- * Then we also need a global mutex to serialize eventpoll_release_file()
- * and ep_free().
- * This mutex is acquired by ep_free() during the epoll file
- * cleanup path and it is also acquired by eventpoll_release_file()
- * if a file has been pushed inside an epoll set and it is then
- * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
- * It is also acquired when inserting an epoll fd onto another epoll
+ * The epmutex is acquired when inserting an epoll fd onto another epoll
  * fd. We do this so that we walk the epoll tree and ensure that this
  * insertion does not create a cycle of epoll file descriptors, which
  * could lead to deadlock. We need a global mutex to prevent two
@@ -153,6 +147,13 @@ struct epitem {
 	/* The file descriptor information this item refers to */
 	struct epoll_filefd ffd;
 
+	/*
+	 * Protected by file->f_lock, true for to-be-released epitem already
+	 * removed from the "struct file" items list; together with
+	 * eventpoll->refcount orchestrates "struct eventpoll" disposal
+	 */
+	bool dying;
+
 	/* List containing poll wait queues */
 	struct eppoll_entry *pwqlist;
 
@@ -217,6 +218,12 @@ struct eventpoll {
 	u64 gen;
 	struct hlist_head refs;
 
+	/*
+	 * usage count, used together with epitem->dying to
+	 * orchestrate the disposal of this struct
+	 */
+	refcount_t refcount;
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	/* used to track busy poll napi_id */
 	unsigned int napi_id;
@@ -240,9 +247,7 @@ struct ep_pqueue {
 /* Maximum number of epoll watched descriptors, per user */
 static long max_user_watches __read_mostly;
 
-/*
- * This mutex is used to serialize ep_free() and eventpoll_release_file().
- */
+/* Used for cycles detection */
 static DEFINE_MUTEX(epmutex);
 
 static u64 loop_check_gen = 0;
@@ -557,8 +562,7 @@ static void ep_remove_wait_queue(struct eppoll_entry *pwq)
 
 /*
  * This function unregisters poll callbacks from the associated file
- * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
- * ep_free).
+ * descriptor.  Must be called with "mtx" held.
  */
 static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 {
@@ -681,11 +685,38 @@ static void epi_rcu_free(struct rcu_head *head)
 	kmem_cache_free(epi_cache, epi);
 }
 
+static void ep_get(struct eventpoll *ep)
+{
+	refcount_inc(&ep->refcount);
+}
+
+/*
+ * Returns true if the event poll can be disposed
+ */
+static bool ep_refcount_dec_and_test(struct eventpoll *ep)
+{
+	if (!refcount_dec_and_test(&ep->refcount))
+		return false;
+
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
+	return true;
+}
+
+static void ep_free(struct eventpoll *ep)
+{
+	mutex_destroy(&ep->mtx);
+	free_uid(ep->user);
+	wakeup_source_unregister(ep->ws);
+	kfree(ep);
+}
+
 /*
  * Removes a "struct epitem" from the eventpoll RB tree and deallocates
  * all the associated resources. Must be called with "mtx" held.
+ * If the dying flag is set, do the removal only if force is true.
+ * Returns true if the eventpoll can be disposed.
  */
-static int ep_remove(struct eventpoll *ep, struct epitem *epi)
+static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
 {
 	struct file *file = epi->ffd.file;
 	struct epitems_head *to_free;
@@ -700,6 +731,11 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 
 	/* Remove the current item from the list of epoll hooks */
 	spin_lock(&file->f_lock);
+	if (epi->dying && !force) {
+		spin_unlock(&file->f_lock);
+		return false;
+	}
+
 	to_free = NULL;
 	head = file->f_ep;
 	if (head->first == &epi->fllink && !epi->fllink.next) {
@@ -733,28 +769,28 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 	call_rcu(&epi->rcu, epi_rcu_free);
 
 	percpu_counter_dec(&ep->user->epoll_watches);
+	return ep_refcount_dec_and_test(ep);
+}
 
-	return 0;
+/*
+ * ep_remove variant for callers owing an additional reference to the ep
+ */
+static void ep_remove_safe(struct eventpoll *ep, struct epitem *epi)
+{
+	WARN_ON_ONCE(__ep_remove(ep, epi, false));
 }
 
-static void ep_free(struct eventpoll *ep)
+static void ep_clear_and_put(struct eventpoll *ep)
 {
 	struct rb_node *rbp;
 	struct epitem *epi;
+	bool dispose;
 
 	/* We need to release all tasks waiting for these file */
 	if (waitqueue_active(&ep->poll_wait))
 		ep_poll_safewake(ep, NULL, 0);
 
-	/*
-	 * We need to lock this because we could be hit by
-	 * eventpoll_release_file() while we're freeing the "struct eventpoll".
-	 * We do not need to hold "ep->mtx" here because the epoll file
-	 * is on the way to be removed and no one has references to it
-	 * anymore. The only hit might come from eventpoll_release_file() but
-	 * holding "epmutex" is sufficient here.
-	 */
-	mutex_lock(&epmutex);
+	mutex_lock(&ep->mtx);
 
 	/*
 	 * Walks through the whole tree by unregistering poll callbacks.
@@ -768,25 +804,21 @@ static void ep_free(struct eventpoll *ep)
 
 	/*
 	 * Walks through the whole tree by freeing each "struct epitem". At this
-	 * point we are sure no poll callbacks will be lingering around, and also by
-	 * holding "epmutex" we can be sure that no file cleanup code will hit
-	 * us during this operation. So we can avoid the lock on "ep->lock".
-	 * We do not need to lock ep->mtx, either, we only do it to prevent
-	 * a lockdep warning.
+	 * point we are sure no poll callbacks will be lingering around.
+	 * Since we still own a reference to the eventpoll struct, the loop can't
+	 * dispose it.
 	 */
-	mutex_lock(&ep->mtx);
 	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
 		epi = rb_entry(rbp, struct epitem, rbn);
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		cond_resched();
 	}
+
+	dispose = ep_refcount_dec_and_test(ep);
 	mutex_unlock(&ep->mtx);
 
-	mutex_unlock(&epmutex);
-	mutex_destroy(&ep->mtx);
-	free_uid(ep->user);
-	wakeup_source_unregister(ep->ws);
-	kfree(ep);
+	if (dispose)
+		ep_free(ep);
 }
 
 static int ep_eventpoll_release(struct inode *inode, struct file *file)
@@ -794,7 +826,7 @@ static int ep_eventpoll_release(struct inode *inode, struct file *file)
 	struct eventpoll *ep = file->private_data;
 
 	if (ep)
-		ep_free(ep);
+		ep_clear_and_put(ep);
 
 	return 0;
 }
@@ -906,33 +938,35 @@ void eventpoll_release_file(struct file *file)
 {
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct hlist_node *next;
+	bool dispose;
 
 	/*
-	 * We don't want to get "file->f_lock" because it is not
-	 * necessary. It is not necessary because we're in the "struct file"
-	 * cleanup path, and this means that no one is using this file anymore.
-	 * So, for example, epoll_ctl() cannot hit here since if we reach this
-	 * point, the file counter already went to zero and fget() would fail.
-	 * The only hit might come from ep_free() but by holding the mutex
-	 * will correctly serialize the operation. We do need to acquire
-	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
-	 * from anywhere but ep_free().
-	 *
-	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
+	 * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
+	 * touching the epitems list before eventpoll_release_file() can access
+	 * the ep->mtx.
 	 */
-	mutex_lock(&epmutex);
-	if (unlikely(!file->f_ep)) {
-		mutex_unlock(&epmutex);
-		return;
-	}
-	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
+again:
+	spin_lock(&file->f_lock);
+	if (file->f_ep && file->f_ep->first) {
+		/* detach from ep tree */
+		epi = hlist_entry(file->f_ep->first, struct epitem, fllink);
+		epi->dying = true;
+		spin_unlock(&file->f_lock);
+
+		/*
+		 * ep access is safe as we still own a reference to the ep
+		 * struct
+		 */
 		ep = epi->ep;
-		mutex_lock_nested(&ep->mtx, 0);
-		ep_remove(ep, epi);
+		mutex_lock(&ep->mtx);
+		dispose = __ep_remove(ep, epi, true);
 		mutex_unlock(&ep->mtx);
+
+		if (dispose)
+			ep_free(ep);
+		goto again;
 	}
-	mutex_unlock(&epmutex);
+	spin_unlock(&file->f_lock);
 }
 
 static int ep_alloc(struct eventpoll **pep)
@@ -955,6 +989,7 @@ static int ep_alloc(struct eventpoll **pep)
 	ep->rbr = RB_ROOT_CACHED;
 	ep->ovflist = EP_UNACTIVE_PTR;
 	ep->user = user;
+	refcount_set(&ep->refcount, 1);
 
 	*pep = ep;
 
@@ -1223,10 +1258,10 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
 		 */
 		list_del_init(&wait->entry);
 		/*
-		 * ->whead != NULL protects us from the race with ep_free()
-		 * or ep_remove(), ep_remove_wait_queue() takes whead->lock
-		 * held by the caller. Once we nullify it, nothing protects
-		 * ep/epi or even wait.
+		 * ->whead != NULL protects us from the race with
+		 * ep_clear_and_put() or ep_remove(), ep_remove_wait_queue()
+		 * takes whead->lock held by the caller. Once we nullify it,
+		 * nothing protects ep/epi or even wait.
 		 */
 		smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
 	}
@@ -1496,16 +1531,22 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	if (tep)
 		mutex_unlock(&tep->mtx);
 
+	/*
+	 * ep_remove_safe() calls in the later error paths can't lead to
+	 * ep_free() as the ep file itself still holds an ep reference.
+	 */
+	ep_get(ep);
+
 	/* now check if we've created too many backpaths */
 	if (unlikely(full_check && reverse_path_check())) {
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		return -EINVAL;
 	}
 
 	if (epi->event.events & EPOLLWAKEUP) {
 		error = ep_create_wakeup_source(epi);
 		if (error) {
-			ep_remove(ep, epi);
+			ep_remove_safe(ep, epi);
 			return error;
 		}
 	}
@@ -1529,7 +1570,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	 * high memory pressure.
 	 */
 	if (unlikely(!epq.epi)) {
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		return -ENOMEM;
 	}
 
@@ -2025,7 +2066,7 @@ static int do_epoll_create(int flags)
 out_free_fd:
 	put_unused_fd(fd);
 out_free_ep:
-	ep_free(ep);
+	ep_clear_and_put(ep);
 	return error;
 }
 
@@ -2167,10 +2208,16 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 			error = -EEXIST;
 		break;
 	case EPOLL_CTL_DEL:
-		if (epi)
-			error = ep_remove(ep, epi);
-		else
+		if (epi) {
+			/*
+			 * The eventpoll itself is still alive: the refcount
+			 * can't go to zero here.
+			 */
+			ep_remove_safe(ep, epi);
+			error = 0;
+		} else {
 			error = -ENOENT;
+		}
 		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {
-- 
2.38.1


^ permalink raw reply related

* [PATCH net] fjes: Fix an error handling path in fjes_probe()
From: Christophe JAILLET @ 2022-12-20 20:57 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Taku Izumi
  Cc: linux-kernel, kernel-janitors, Christophe JAILLET, netdev

A netif_napi_add() call is hidden in fjes_sw_init(). It should be undone
by a corresponding netif_napi_del() call in the error handling path of the
probe, as already done inthe remove function.

Fixes: 265859309a76 ("fjes: NAPI polling function")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
---
 drivers/net/fjes/fjes_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/fjes/fjes_main.c b/drivers/net/fjes/fjes_main.c
index 2513be6d4e11..01b4c9c6adbd 100644
--- a/drivers/net/fjes/fjes_main.c
+++ b/drivers/net/fjes/fjes_main.c
@@ -1370,7 +1370,7 @@ static int fjes_probe(struct platform_device *plat_dev)
 	adapter->txrx_wq = alloc_workqueue(DRV_NAME "/txrx", WQ_MEM_RECLAIM, 0);
 	if (unlikely(!adapter->txrx_wq)) {
 		err = -ENOMEM;
-		goto err_free_netdev;
+		goto err_del_napi;
 	}
 
 	adapter->control_wq = alloc_workqueue(DRV_NAME "/control",
@@ -1431,6 +1431,8 @@ static int fjes_probe(struct platform_device *plat_dev)
 	destroy_workqueue(adapter->control_wq);
 err_free_txrx_wq:
 	destroy_workqueue(adapter->txrx_wq);
+err_del_napi:
+	netif_napi_del(&adapter->napi);
 err_free_netdev:
 	free_netdev(netdev);
 err_out:
-- 
2.34.1


^ permalink raw reply related

* Re: [syzbot] KASAN: slab-out-of-bounds Write in copy_array
From: sdf @ 2022-12-20 21:01 UTC (permalink / raw)
  To: syzbot
  Cc: andrii, ast, bpf, daniel, davem, haoluo, hawk, john.fastabend,
	jolsa, kpsingh, kuba, linux-kernel, llvm, martin.lau, nathan,
	ndesaulniers, netdev, song, syzkaller-bugs, trix, yhs
In-Reply-To: <000000000000ab724705ee87e321@google.com>

[-- Attachment #1: Type: text/plain, Size: 7785 bytes --]

On 11/28, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    c35bd4e42885 Add linux-next specific files for 20221124
> git tree:       linux-next
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=13369dc5880000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=11e19c740a0b2926
> dashboard link: https://syzkaller.appspot.com/bug?extid=b1e1f7feb407b56d0355
> compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1345a205880000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=124c644b880000
> 
> Downloadable assets:
> disk image: https://storage.googleapis.com/syzbot-assets/968fee464d14/disk-c35bd4e4.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/4f46fe801b5b/vmlinux-c35bd4e4.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/c2cdf8fb264e/bzImage-c35bd4e4.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+b1e1f7feb407b56d0355@syzkaller.appspotmail.com
> 
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in copy_array+0x96/0x100 kernel/bpf/verifier.c:1032
> Write of size 232 at addr ffff88801ed62600 by task syz-executor990/5290
> 
> CPU: 0 PID: 5290 Comm: syz-executor990 Not tainted 6.1.0-rc6-next-20221124-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
> Call Trace:
>  <TASK>
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0xd1/0x138 lib/dump_stack.c:106
>  print_address_description mm/kasan/report.c:253 [inline]
>  print_report+0x15e/0x45d mm/kasan/report.c:364
>  kasan_report+0xbf/0x1f0 mm/kasan/report.c:464
>  check_region_inline mm/kasan/generic.c:183 [inline]
>  kasan_check_range+0x141/0x190 mm/kasan/generic.c:189
>  memcpy+0x3d/0x60 mm/kasan/shadow.c:66
>  copy_array+0x96/0x100 kernel/bpf/verifier.c:1032
>  copy_verifier_state+0xa9/0xbe0 kernel/bpf/verifier.c:1210
>  pop_stack+0x8c/0x2f0 kernel/bpf/verifier.c:1273
>  do_check kernel/bpf/verifier.c:13733 [inline]
>  do_check_common+0x372b/0xc5e0 kernel/bpf/verifier.c:15991
>  do_check_main kernel/bpf/verifier.c:16054 [inline]
>  bpf_check+0x7371/0xad00 kernel/bpf/verifier.c:16624
>  bpf_prog_load+0x1543/0x2230 kernel/bpf/syscall.c:2619
>  __sys_bpf+0x1436/0x4ff0 kernel/bpf/syscall.c:4979
>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>  __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5081
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7fc18e7bbc29
> Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007ffd8f27a968 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
> RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc18e7bbc29
> RDX: 0000000000000048 RSI: 0000000020000200 RDI: 0000000000000005
> RBP: 00007fc18e77fdd0 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000ffffffff R11: 0000000000000246 R12: 00007fc18e77fe60
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
>  </TASK>
> 
> Allocated by task 5290:
>  kasan_save_stack+0x22/0x40 mm/kasan/common.c:45
>  kasan_set_track+0x25/0x30 mm/kasan/common.c:52
>  ____kasan_kmalloc mm/kasan/common.c:376 [inline]
>  ____kasan_kmalloc mm/kasan/common.c:335 [inline]
>  __kasan_krealloc+0x145/0x180 mm/kasan/common.c:444
>  kasan_krealloc include/linux/kasan.h:232 [inline]
>  __do_krealloc mm/slab_common.c:1348 [inline]
>  krealloc+0xa8/0x100 mm/slab_common.c:1385
>  push_jmp_history+0x89/0x260 kernel/bpf/verifier.c:2528
>  is_state_visited kernel/bpf/verifier.c:13269 [inline]
>  do_check kernel/bpf/verifier.c:13466 [inline]
>  do_check_common+0x4b47/0xc5e0 kernel/bpf/verifier.c:15991
>  do_check_main kernel/bpf/verifier.c:16054 [inline]
>  bpf_check+0x7371/0xad00 kernel/bpf/verifier.c:16624
>  bpf_prog_load+0x1543/0x2230 kernel/bpf/syscall.c:2619
>  __sys_bpf+0x1436/0x4ff0 kernel/bpf/syscall.c:4979
>  __do_sys_bpf kernel/bpf/syscall.c:5083 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5081 [inline]
>  __x64_sys_bpf+0x79/0xc0 kernel/bpf/syscall.c:5081
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> The buggy address belongs to the object at ffff88801ed62600
>  which belongs to the cache kmalloc-256 of size 256
> The buggy address is located 0 bytes inside of
>  256-byte region [ffff88801ed62600, ffff88801ed62700)
> 
> The buggy address belongs to the physical page:
> page:ffffea00007b5880 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1ed62
> head:ffffea00007b5880 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
> flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
> raw: 00fff00000010200 ffff888012441b40 ffffea0000809f80 dead000000000002
> raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
> page_owner tracks the page as allocated
> page last allocated via order 1, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 56, tgid 56 (kworker/u4:4), ts 7761288109, free_ts 0
>  prep_new_page mm/page_alloc.c:2541 [inline]
>  get_page_from_freelist+0x119c/0x2cd0 mm/page_alloc.c:4293
>  __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5551
>  alloc_pages+0x1aa/0x270 mm/mempolicy.c:2285
>  alloc_slab_page mm/slub.c:1833 [inline]
>  allocate_slab+0x25e/0x350 mm/slub.c:1980
>  new_slab mm/slub.c:2033 [inline]
>  ___slab_alloc+0xa91/0x1400 mm/slub.c:3211
>  __slab_alloc.constprop.0+0x56/0xa0 mm/slub.c:3310
>  slab_alloc_node mm/slub.c:3395 [inline]
>  __kmem_cache_alloc_node+0x1a9/0x430 mm/slub.c:3472
>  kmalloc_trace+0x26/0x60 mm/slab_common.c:1049
>  kmalloc include/linux/slab.h:571 [inline]
>  scsi_probe_and_add_lun+0x3ae/0x34d0 drivers/scsi/scsi_scan.c:1186
>  __scsi_scan_target+0x21f/0xda0 drivers/scsi/scsi_scan.c:1664
>  scsi_scan_channel drivers/scsi/scsi_scan.c:1752 [inline]
>  scsi_scan_channel+0x148/0x1e0 drivers/scsi/scsi_scan.c:1728
>  scsi_scan_host_selected+0x2e3/0x3b0 drivers/scsi/scsi_scan.c:1781
>  do_scsi_scan_host+0x1e8/0x260 drivers/scsi/scsi_scan.c:1920
>  do_scan_async+0x42/0x500 drivers/scsi/scsi_scan.c:1930
>  async_run_entry_fn+0x9c/0x530 kernel/async.c:127
>  process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
> page_owner free stack trace missing
> 
> Memory state around the buggy address:
>  ffff88801ed62500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88801ed62580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >ffff88801ed62600: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
>                                ^
>  ffff88801ed62680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88801ed62700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ==================================================================
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git e45fb347b630

[-- Attachment #2: syz.patch --]
[-- Type: text/x-diff, Size: 3618 bytes --]

commit d8e76e85481b7885ee86bb57a11274ae9b68454e
Author:     Stanislav Fomichev <sdf@google.com>
AuthorDate: Tue Dec 20 13:00:17 2022 -0800
Commit:     Stanislav Fomichev <sdf@google.com>
CommitDate: Tue Dec 20 13:00:17 2022 -0800

    Revert "mm: Make ksize() a reporting-only function"
    
    This reverts commit 38931d8989b5760b0bd17c9ec99e81986258e4cb.

diff --git a/mm/kasan/kasan_test.c b/mm/kasan/kasan_test.c
index 74cd80c12b25..d1439669d6bc 100644
--- a/mm/kasan/kasan_test.c
+++ b/mm/kasan/kasan_test.c
@@ -825,30 +825,23 @@ static void kasan_global_oob_left(struct kunit *test)
 	KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p);
 }
 
-/* Check that ksize() does NOT unpoison whole object. */
+/* Check that ksize() makes the whole object accessible. */
 static void ksize_unpoisons_memory(struct kunit *test)
 {
 	char *ptr;
-	size_t size = 128 - KASAN_GRANULE_SIZE - 5;
-	size_t real_size;
+	size_t size = 123, real_size;
 
 	ptr = kmalloc(size, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
-
 	real_size = ksize(ptr);
-	KUNIT_EXPECT_GT(test, real_size, size);
 
 	OPTIMIZER_HIDE_VAR(ptr);
 
-	/* These accesses shouldn't trigger a KASAN report. */
-	ptr[0] = 'x';
-	ptr[size - 1] = 'x';
+	/* This access shouldn't trigger a KASAN report. */
+	ptr[size] = 'x';
 
-	/* These must trigger a KASAN report. */
-	if (IS_ENABLED(CONFIG_KASAN_GENERIC))
-		KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size]);
-	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[size + 5]);
-	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[real_size - 1]);
+	/* This one must. */
+	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)ptr)[real_size]);
 
 	kfree(ptr);
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1cba98acc486..3e49bb830060 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1348,11 +1348,11 @@ __do_krealloc(const void *p, size_t new_size, gfp_t flags)
 	void *ret;
 	size_t ks;
 
-	/* Check for double-free before calling ksize. */
+	/* Don't use instrumented ksize to allow precise KASAN poisoning. */
 	if (likely(!ZERO_OR_NULL_PTR(p))) {
 		if (!kasan_check_byte(p))
 			return NULL;
-		ks = ksize(p);
+		ks = kfence_ksize(p) ?: __ksize(p);
 	} else
 		ks = 0;
 
@@ -1420,21 +1420,21 @@ void kfree_sensitive(const void *p)
 	void *mem = (void *)p;
 
 	ks = ksize(mem);
-	if (ks) {
-		kasan_unpoison_range(mem, ks);
+	if (ks)
 		memzero_explicit(mem, ks);
-	}
 	kfree(mem);
 }
 EXPORT_SYMBOL(kfree_sensitive);
 
 size_t ksize(const void *objp)
 {
+	size_t size;
+
 	/*
-	 * We need to first check that the pointer to the object is valid.
-	 * The KASAN report printed from ksize() is more useful, then when
-	 * it's printed later when the behaviour could be undefined due to
-	 * a potential use-after-free or double-free.
+	 * We need to first check that the pointer to the object is valid, and
+	 * only then unpoison the memory. The report printed from ksize() is
+	 * more useful, then when it's printed later when the behaviour could
+	 * be undefined due to a potential use-after-free or double-free.
 	 *
 	 * We use kasan_check_byte(), which is supported for the hardware
 	 * tag-based KASAN mode, unlike kasan_check_read/write().
@@ -1448,7 +1448,13 @@ size_t ksize(const void *objp)
 	if (unlikely(ZERO_OR_NULL_PTR(objp)) || !kasan_check_byte(objp))
 		return 0;
 
-	return kfence_ksize(objp) ?: __ksize(objp);
+	size = kfence_ksize(objp) ?: __ksize(objp);
+	/*
+	 * We assume that ksize callers could use whole allocated area,
+	 * so we need to unpoison this area.
+	 */
+	kasan_unpoison_range(objp, size);
+	return size;
 }
 EXPORT_SYMBOL(ksize);
 

^ permalink raw reply related

* Re: [syzbot] KASAN: slab-out-of-bounds Write in copy_array
From: syzbot @ 2022-12-20 21:25 UTC (permalink / raw)
  To: andrii, ast, bpf, daniel, davem, haoluo, hawk, john.fastabend,
	jolsa, kpsingh, kuba, linux-kernel, llvm, martin.lau, nathan,
	ndesaulniers, netdev, sdf, song, syzkaller-bugs, trix, yhs
In-Reply-To: <Y6Iipad5vz55tl2A@google.com>

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: syzbot+b1e1f7feb407b56d0355@syzkaller.appspotmail.com

Tested on:

commit:         e45fb347 Add linux-next specific files for 20221220
git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=13c134d7880000
kernel config:  https://syzkaller.appspot.com/x/.config?x=9445b3c551452920
dashboard link: https://syzkaller.appspot.com/bug?extid=b1e1f7feb407b56d0355
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
patch:          https://syzkaller.appspot.com/x/patch.diff?x=118c1408480000

Note: testing is done by a robot and is best-effort only.

^ permalink raw reply

* Re: [PATCH v4] epoll: use refcount to reduce ep_mutex contention
From: Soheil Hassas Yeganeh @ 2022-12-20 21:26 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: linux-fsdevel, Al Viro, Davidlohr Bueso, Jason Baron, netdev,
	Carlos Maiolino, Eric Biggers, Jacob Keller, Eric Dumazet
In-Reply-To: <9d8ad7995e51ad3aecdfe6f7f9e72231b8c9d3b5.1671569682.git.pabeni@redhat.com>

On Tue, Dec 20, 2022 at 3:55 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> We are observing huge contention on the epmutex during an http
> connection/rate test:
>
>  83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
> [...]
>            |--66.96%--__fput
>                       |--60.04%--eventpoll_release_file
>                                  |--58.41%--__mutex_lock.isra.6
>                                            |--56.56%--osq_lock
>
> The application is multi-threaded, creates a new epoll entry for
> each incoming connection, and does not delete it before the
> connection shutdown - that is, before the connection's fd close().
>
> Many different threads compete frequently for the epmutex lock,
> affecting the overall performance.
>
> To reduce the contention this patch introduces explicit reference counting
> for the eventpoll struct. Each registered event acquires a reference,
> and references are released at ep_remove() time.
>
> Additionally, this introduces a new 'dying' flag to prevent races between
> the EP file close() and the monitored file close().
> ep_eventpoll_release() marks, under f_lock spinlock, each epitem as before
> removing it, while EP file close() does not touch dying epitems.
>
> The eventpoll struct is released by whoever - among EP file close() and
> and the monitored file close() drops its last reference.
>
> With all the above in place, we can drop the epmutex usage at disposal time.
>
> Overall this produces a significant performance improvement in the
> mentioned connection/rate scenario: the mutex operations disappear from
> the topmost offenders in the perf report, and the measured connections/rate
> grows by ~60%.
>
> To make the change more readable this additionally renames ep_free() to
> ep_clear_and_put(), and moves the actual memory cleanup in a separate
> ep_free() helper.
>
> Tested-by: Xiumei Mu <xmu@redhat.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Thanks for all the improvements and updates!

> ---
> v4: (addresses comments from Eric Biggers and Jacob)
>  - replace a refcount_t
>  - rename the ep book-keeping helpers
>
> v3 at:
> https://lore.kernel.org/linux-fsdevel/1aedd7e87097bc4352ba658ac948c585a655785a.1669657846.git.pabeni@redhat.com/
>
> v2 at:
> https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/
>
> v1 at:
> https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/
>
> Previous related effort at:
> https://lore.kernel.org/linux-fsdevel/20190727113542.162213-1-cj.chengjian@huawei.com/
> https://lkml.org/lkml/2017/10/28/81
> ---
>  fs/eventpoll.c | 185 +++++++++++++++++++++++++++++++------------------
>  1 file changed, 116 insertions(+), 69 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 64659b110973..a43ccb02133c 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -57,13 +57,7 @@
>   * we need a lock that will allow us to sleep. This lock is a
>   * mutex (ep->mtx). It is acquired during the event transfer loop,
>   * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
> - * Then we also need a global mutex to serialize eventpoll_release_file()
> - * and ep_free().
> - * This mutex is acquired by ep_free() during the epoll file
> - * cleanup path and it is also acquired by eventpoll_release_file()
> - * if a file has been pushed inside an epoll set and it is then
> - * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
> - * It is also acquired when inserting an epoll fd onto another epoll
> + * The epmutex is acquired when inserting an epoll fd onto another epoll
>   * fd. We do this so that we walk the epoll tree and ensure that this
>   * insertion does not create a cycle of epoll file descriptors, which
>   * could lead to deadlock. We need a global mutex to prevent two
> @@ -153,6 +147,13 @@ struct epitem {
>         /* The file descriptor information this item refers to */
>         struct epoll_filefd ffd;
>
> +       /*
> +        * Protected by file->f_lock, true for to-be-released epitem already
> +        * removed from the "struct file" items list; together with
> +        * eventpoll->refcount orchestrates "struct eventpoll" disposal
> +        */
> +       bool dying;
> +
>         /* List containing poll wait queues */
>         struct eppoll_entry *pwqlist;
>
> @@ -217,6 +218,12 @@ struct eventpoll {
>         u64 gen;
>         struct hlist_head refs;
>
> +       /*
> +        * usage count, used together with epitem->dying to
> +        * orchestrate the disposal of this struct
> +        */
> +       refcount_t refcount;
> +
>  #ifdef CONFIG_NET_RX_BUSY_POLL
>         /* used to track busy poll napi_id */
>         unsigned int napi_id;
> @@ -240,9 +247,7 @@ struct ep_pqueue {
>  /* Maximum number of epoll watched descriptors, per user */
>  static long max_user_watches __read_mostly;
>
> -/*
> - * This mutex is used to serialize ep_free() and eventpoll_release_file().
> - */
> +/* Used for cycles detection */
>  static DEFINE_MUTEX(epmutex);
>
>  static u64 loop_check_gen = 0;
> @@ -557,8 +562,7 @@ static void ep_remove_wait_queue(struct eppoll_entry *pwq)
>
>  /*
>   * This function unregisters poll callbacks from the associated file
> - * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
> - * ep_free).
> + * descriptor.  Must be called with "mtx" held.
>   */
>  static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
>  {
> @@ -681,11 +685,38 @@ static void epi_rcu_free(struct rcu_head *head)
>         kmem_cache_free(epi_cache, epi);
>  }
>
> +static void ep_get(struct eventpoll *ep)
> +{
> +       refcount_inc(&ep->refcount);
> +}
> +
> +/*
> + * Returns true if the event poll can be disposed
> + */
> +static bool ep_refcount_dec_and_test(struct eventpoll *ep)
> +{
> +       if (!refcount_dec_and_test(&ep->refcount))
> +               return false;
> +
> +       WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
> +       return true;
> +}
> +
> +static void ep_free(struct eventpoll *ep)
> +{
> +       mutex_destroy(&ep->mtx);
> +       free_uid(ep->user);
> +       wakeup_source_unregister(ep->ws);
> +       kfree(ep);
> +}
> +
>  /*
>   * Removes a "struct epitem" from the eventpoll RB tree and deallocates
>   * all the associated resources. Must be called with "mtx" held.
> + * If the dying flag is set, do the removal only if force is true.
> + * Returns true if the eventpoll can be disposed.
>   */
> -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
>  {
>         struct file *file = epi->ffd.file;
>         struct epitems_head *to_free;
> @@ -700,6 +731,11 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>
>         /* Remove the current item from the list of epoll hooks */
>         spin_lock(&file->f_lock);
> +       if (epi->dying && !force) {
> +               spin_unlock(&file->f_lock);
> +               return false;
> +       }
> +
>         to_free = NULL;
>         head = file->f_ep;
>         if (head->first == &epi->fllink && !epi->fllink.next) {
> @@ -733,28 +769,28 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>         call_rcu(&epi->rcu, epi_rcu_free);
>
>         percpu_counter_dec(&ep->user->epoll_watches);
> +       return ep_refcount_dec_and_test(ep);
> +}
>
> -       return 0;
> +/*
> + * ep_remove variant for callers owing an additional reference to the ep
> + */
> +static void ep_remove_safe(struct eventpoll *ep, struct epitem *epi)
> +{
> +       WARN_ON_ONCE(__ep_remove(ep, epi, false));
>  }
>
> -static void ep_free(struct eventpoll *ep)
> +static void ep_clear_and_put(struct eventpoll *ep)
>  {
>         struct rb_node *rbp;
>         struct epitem *epi;
> +       bool dispose;
>
>         /* We need to release all tasks waiting for these file */
>         if (waitqueue_active(&ep->poll_wait))
>                 ep_poll_safewake(ep, NULL, 0);
>
> -       /*
> -        * We need to lock this because we could be hit by
> -        * eventpoll_release_file() while we're freeing the "struct eventpoll".
> -        * We do not need to hold "ep->mtx" here because the epoll file
> -        * is on the way to be removed and no one has references to it
> -        * anymore. The only hit might come from eventpoll_release_file() but
> -        * holding "epmutex" is sufficient here.
> -        */
> -       mutex_lock(&epmutex);
> +       mutex_lock(&ep->mtx);
>
>         /*
>          * Walks through the whole tree by unregistering poll callbacks.
> @@ -768,25 +804,21 @@ static void ep_free(struct eventpoll *ep)
>
>         /*
>          * Walks through the whole tree by freeing each "struct epitem". At this
> -        * point we are sure no poll callbacks will be lingering around, and also by
> -        * holding "epmutex" we can be sure that no file cleanup code will hit
> -        * us during this operation. So we can avoid the lock on "ep->lock".
> -        * We do not need to lock ep->mtx, either, we only do it to prevent
> -        * a lockdep warning.
> +        * point we are sure no poll callbacks will be lingering around.
> +        * Since we still own a reference to the eventpoll struct, the loop can't
> +        * dispose it.
>          */
> -       mutex_lock(&ep->mtx);
>         while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
>                 epi = rb_entry(rbp, struct epitem, rbn);
> -               ep_remove(ep, epi);
> +               ep_remove_safe(ep, epi);
>                 cond_resched();
>         }
> +
> +       dispose = ep_refcount_dec_and_test(ep);
>         mutex_unlock(&ep->mtx);
>
> -       mutex_unlock(&epmutex);
> -       mutex_destroy(&ep->mtx);
> -       free_uid(ep->user);
> -       wakeup_source_unregister(ep->ws);
> -       kfree(ep);
> +       if (dispose)
> +               ep_free(ep);
>  }
>
>  static int ep_eventpoll_release(struct inode *inode, struct file *file)
> @@ -794,7 +826,7 @@ static int ep_eventpoll_release(struct inode *inode, struct file *file)
>         struct eventpoll *ep = file->private_data;
>
>         if (ep)
> -               ep_free(ep);
> +               ep_clear_and_put(ep);
>
>         return 0;
>  }
> @@ -906,33 +938,35 @@ void eventpoll_release_file(struct file *file)
>  {
>         struct eventpoll *ep;
>         struct epitem *epi;
> -       struct hlist_node *next;
> +       bool dispose;
>
>         /*
> -        * We don't want to get "file->f_lock" because it is not
> -        * necessary. It is not necessary because we're in the "struct file"
> -        * cleanup path, and this means that no one is using this file anymore.
> -        * So, for example, epoll_ctl() cannot hit here since if we reach this
> -        * point, the file counter already went to zero and fget() would fail.
> -        * The only hit might come from ep_free() but by holding the mutex
> -        * will correctly serialize the operation. We do need to acquire
> -        * "ep->mtx" after "epmutex" because ep_remove() requires it when called
> -        * from anywhere but ep_free().
> -        *
> -        * Besides, ep_remove() acquires the lock, so we can't hold it here.
> +        * Use the 'dying' flag to prevent a concurrent ep_cleat_and_put() from
> +        * touching the epitems list before eventpoll_release_file() can access
> +        * the ep->mtx.
>          */
> -       mutex_lock(&epmutex);
> -       if (unlikely(!file->f_ep)) {
> -               mutex_unlock(&epmutex);
> -               return;
> -       }
> -       hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> +again:
> +       spin_lock(&file->f_lock);
> +       if (file->f_ep && file->f_ep->first) {
> +               /* detach from ep tree */
> +               epi = hlist_entry(file->f_ep->first, struct epitem, fllink);
> +               epi->dying = true;
> +               spin_unlock(&file->f_lock);
> +
> +               /*
> +                * ep access is safe as we still own a reference to the ep
> +                * struct
> +                */
>                 ep = epi->ep;
> -               mutex_lock_nested(&ep->mtx, 0);
> -               ep_remove(ep, epi);
> +               mutex_lock(&ep->mtx);
> +               dispose = __ep_remove(ep, epi, true);
>                 mutex_unlock(&ep->mtx);
> +
> +               if (dispose)
> +                       ep_free(ep);
> +               goto again;
>         }
> -       mutex_unlock(&epmutex);
> +       spin_unlock(&file->f_lock);
>  }
>
>  static int ep_alloc(struct eventpoll **pep)
> @@ -955,6 +989,7 @@ static int ep_alloc(struct eventpoll **pep)
>         ep->rbr = RB_ROOT_CACHED;
>         ep->ovflist = EP_UNACTIVE_PTR;
>         ep->user = user;
> +       refcount_set(&ep->refcount, 1);
>
>         *pep = ep;
>
> @@ -1223,10 +1258,10 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>                  */
>                 list_del_init(&wait->entry);
>                 /*
> -                * ->whead != NULL protects us from the race with ep_free()
> -                * or ep_remove(), ep_remove_wait_queue() takes whead->lock
> -                * held by the caller. Once we nullify it, nothing protects
> -                * ep/epi or even wait.
> +                * ->whead != NULL protects us from the race with
> +                * ep_clear_and_put() or ep_remove(), ep_remove_wait_queue()
> +                * takes whead->lock held by the caller. Once we nullify it,
> +                * nothing protects ep/epi or even wait.
>                  */
>                 smp_store_release(&ep_pwq_from_wait(wait)->whead, NULL);
>         }
> @@ -1496,16 +1531,22 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>         if (tep)
>                 mutex_unlock(&tep->mtx);
>
> +       /*
> +        * ep_remove_safe() calls in the later error paths can't lead to
> +        * ep_free() as the ep file itself still holds an ep reference.
> +        */
> +       ep_get(ep);
> +
>         /* now check if we've created too many backpaths */
>         if (unlikely(full_check && reverse_path_check())) {
> -               ep_remove(ep, epi);
> +               ep_remove_safe(ep, epi);
>                 return -EINVAL;
>         }
>
>         if (epi->event.events & EPOLLWAKEUP) {
>                 error = ep_create_wakeup_source(epi);
>                 if (error) {
> -                       ep_remove(ep, epi);
> +                       ep_remove_safe(ep, epi);
>                         return error;
>                 }
>         }
> @@ -1529,7 +1570,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>          * high memory pressure.
>          */
>         if (unlikely(!epq.epi)) {
> -               ep_remove(ep, epi);
> +               ep_remove_safe(ep, epi);
>                 return -ENOMEM;
>         }
>
> @@ -2025,7 +2066,7 @@ static int do_epoll_create(int flags)
>  out_free_fd:
>         put_unused_fd(fd);
>  out_free_ep:
> -       ep_free(ep);
> +       ep_clear_and_put(ep);
>         return error;
>  }
>
> @@ -2167,10 +2208,16 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
>                         error = -EEXIST;
>                 break;
>         case EPOLL_CTL_DEL:
> -               if (epi)
> -                       error = ep_remove(ep, epi);
> -               else
> +               if (epi) {
> +                       /*
> +                        * The eventpoll itself is still alive: the refcount
> +                        * can't go to zero here.
> +                        */
> +                       ep_remove_safe(ep, epi);
> +                       error = 0;
> +               } else {
>                         error = -ENOENT;
> +               }
>                 break;
>         case EPOLL_CTL_MOD:
>                 if (epi) {
> --
> 2.38.1
>

^ permalink raw reply

* [PATCH net] net: openvswitch: release vport resources on failure
From: Aaron Conole @ 2022-12-20 21:27 UTC (permalink / raw)
  To: netdev
  Cc: Pravin B Shelar, Jakub Kicinski, David S. Miller, Paolo Abeni,
	Eric Dumazet, Thomas Graf, dev, Eelco Chaudron, Ilya Maximets,
	wangchuanlei, linux-kernel

A recent commit introducing upcall packet accounting failed to properly
release the vport object when the per-cpu stats struct couldn't be
allocated.  This can cause dangling pointers to dp objects long after
they've been released.

Cc: Eelco Chaudron <echaudro@redhat.com>
Cc: wangchuanlei <wangchuanlei@inspur.com>
Fixes: 1933ea365aa7 ("net: openvswitch: Add support to count upcall packets")
Reported-by: syzbot+8f4e2dcfcb3209ac35f9@syzkaller.appspotmail.com
Signed-off-by: Aaron Conole <aconole@redhat.com>
---
 net/openvswitch/datapath.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 932bcf766d63..6774baf9e212 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -1854,7 +1854,7 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	vport->upcall_stats = netdev_alloc_pcpu_stats(struct vport_upcall_stats_percpu);
 	if (!vport->upcall_stats) {
 		err = -ENOMEM;
-		goto err_destroy_portids;
+		goto err_destroy_vport;
 	}
 
 	err = ovs_dp_cmd_fill_info(dp, reply, info->snd_portid,
@@ -1869,6 +1869,8 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	ovs_notify(&dp_datapath_genl_family, reply, info);
 	return 0;
 
+err_destroy_vport:
+	ovs_dp_detach_port(vport);
 err_destroy_portids:
 	kfree(rcu_dereference_raw(dp->upcall_portids));
 err_unlock_and_destroy_meters:
@@ -2316,7 +2318,7 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	vport->upcall_stats = netdev_alloc_pcpu_stats(struct vport_upcall_stats_percpu);
 	if (!vport->upcall_stats) {
 		err = -ENOMEM;
-		goto exit_unlock_free;
+		goto exit_unlock_free_vport;
 	}
 
 	err = ovs_vport_cmd_fill_info(vport, reply, genl_info_net(info),
@@ -2336,6 +2338,8 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	ovs_notify(&dp_vport_genl_family, reply, info);
 	return 0;
 
+exit_unlock_free_vport:
+	ovs_dp_detach_port(vport);
 exit_unlock_free:
 	ovs_unlock();
 	kfree_skb(reply);
-- 
2.31.1


^ permalink raw reply related

* Re: [PATCH net-next v1 4/4] net: phy: mxl-gpy: disable interrupts on GPY215 by default
From: Michael Walle @ 2022-12-20 21:39 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Xu Liang, Heiner Kallweit, Russell King, David S . Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, netdev, linux-kernel, devicetree
In-Reply-To: <Y6G5phSGSPk+7Dgj@lunn.ch>

Am 2022-12-20 14:33, schrieb Andrew Lunn:
>> > > > I think a better place for this test is in gpy_config_intr(), return
>> > > > -EOPNOTSUPP. phy_enable_interrupts() failing should then cause
>> > > > phy_request_interrupt() to use polling.
>> > >
>> > > Which will then print a warning, which might be misleading.
>> > > Or we disable the warning if -EOPNOTSUPP is returned?
>> >
>> > Disabling the warning is the right thing to do.
>> 
>> There is more to this. .config_intr() is also used in
>> phy_init_hw() and phy_drv_supports_irq(). The latter would
>> still return true in our case. I'm not sure that is correct.
>> 
>> After trying your suggestion, I'm still in favor of somehow
>> tell the phy core to force polling mode during probe() of the
>> driver.
> 
> The problem is that the MAC can set the interrupt number after the PHY
> probe has been called. e.g.
> 
> https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c#L524
> 
> The interrupt needs to be set by the time the PHY is connected to the
> MAC, which is often in the MAC open method, much later than the PHY
> probe.

Ok, then phydev->irq should be updated within the phy_attach_direct().
Something like the following:

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index e865be3d7f01..c6c5830f5214 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1537,6 +1537,14 @@ int phy_attach_direct(struct net_device *dev, 
struct phy_device *phydev,

         phydev->interrupts = PHY_INTERRUPT_DISABLED;

+       /* PHYs can request to use poll mode even though they have an
+        * associated interrupt line. This could be the case if they
+        * detect a broken interrupt handling.
+        */
+       if (phydev->drv->force_polling_mode &&
+           phydev->drv->force_polling_mode(phydev))
+               phydev->irq = PHY_POLL;
+
         /* Port is set to PORT_TP by default and the actual PHY driver 
will set
          * it to different value depending on the PHY configuration. If 
we have
          * the generic PHY driver we can't figure it out, thus set the 
old

That callback could be too specifc, I don't know. We could also have
phydev->drv->pre_attach() which then can update the phydev->irq itself.

Btw. the phy_attached_info() in the stmmac seems to be a leftover
from before the phylink conversion. phylink will print a similar info
but when the PHY is actually attached.

-michael

^ permalink raw reply related

* Re: [net-next] ipv6: fix routing cache overflow for raw sockets
From: Jonathan Maxwell @ 2022-12-20 21:48 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: davem, edumazet, kuba, yoshfuji, dsahern, netdev, linux-kernel
In-Reply-To: <9f145202ca6a59b48d4430ed26a7ab0fe4c5dfaf.camel@redhat.com>

On Tue, Dec 20, 2022 at 11:35 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> > Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> > route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> > consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> > these warnings:
> >
> > [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> > [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> > .
> > .
> > [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
>
> If I read correctly, the maximum number of dst that the raw socket can
> use this way is limited by the number of packets it allows via the
> sndbuf limit, right?
>

Yes, but in my test sndbuf limit is never hit so it clones a route for
every packet.

e.g:

output from C program sending 5000000 packets via a raw socket.

ip raw: total num pkts 5000000

# bpftrace -e 'kprobe:dst_alloc {@count[comm] = count()}'
Attaching 1 probe...

@count[a.out]: 5000009

> Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> ipvs, seg6?
>

Any call to ip6_pol_route(s) where no res.nh->fib_nh_gw_family is 0 can do it.
But we have only seen this for raw sockets so far.

Regards

Jon

> @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?
>
> Thanks,
>
> Paolo
>

^ permalink raw reply

* Re: [net-next] ipv6: fix routing cache overflow for raw sockets
From: Jonathan Maxwell @ 2022-12-20 21:55 UTC (permalink / raw)
  To: David Ahern
  Cc: Paolo Abeni, davem, edumazet, kuba, yoshfuji, netdev,
	linux-kernel
In-Reply-To: <bf56c3aa-85df-734d-f419-835a35e66e03@kernel.org>

On Wed, Dec 21, 2022 at 2:10 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 12/20/22 5:35 AM, Paolo Abeni wrote:
> > On Mon, 2022-12-19 at 10:48 +1100, Jon Maxwell wrote:
> >> Sending Ipv6 packets in a loop via a raw socket triggers an issue where a
> >> route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly
> >> consumes the Ipv6 max_size threshold which defaults to 4096 resulting in
> >> these warnings:
> >>
> >> [1]   99.187805] dst_alloc: 7728 callbacks suppressed
> >> [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> >> .
> >> .
> >> [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size.
> >
> > If I read correctly, the maximum number of dst that the raw socket can
> > use this way is limited by the number of packets it allows via the
> > sndbuf limit, right?
> >
> > Are other FLOWI_FLAG_KNOWN_NH users affected, too? e.g. nf_dup_ipv6,
> > ipvs, seg6?
> >
> > @DavidA: why do we need to create RTF_CACHE clones for KNOWN_NH flows?
> >
> > Thanks,
> >
> > Paolo
> >
>
> If I recall the details correctly: that sysctl limit was added back when
> ipv6 routes were managed as dst_entries and there was a desire to allow
> an admin to limit the memory consumed. At this point in time, IPv6 is
> more inline with IPv4 - a separate struct for fib entries from dst
> entries. That "Route cache is full" message is now out of date since
> this is dst_entries which have a gc mechanism.
>
> IPv4 does not limit the number of dst_entries that can be allocated
> (ip_rt_max_size is the sysctl variable behind the ipv4 version of
> max_size and it is a no-op). IPv6 can probably do the same here?
>

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index dbc224023977..701aba7feaf5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -6470,7 +6470,7 @@ static int __net_init ip6_route_net_init(struct net *net)
 #endif

        net->ipv6.sysctl.flush_delay = 0;
-       net->ipv6.sysctl.ip6_rt_max_size = 4096;
+       net->ipv6.sysctl.ip6_rt_max_size = INT_MAX;
        net->ipv6.sysctl.ip6_rt_gc_min_interval = HZ / 2;
        net->ipv6.sysctl.ip6_rt_gc_timeout = 60*HZ;
        net->ipv6.sysctl.ip6_rt_gc_interval = 30*HZ;

The above patch resolved it for the Ipv6 reproducer.

Would that be sufficient?

> I do not believe the suggested flag is the right change.

Regards

Jon

^ permalink raw reply related

* [PATCH bpf-next v5 00/17] xdp: hints via kfuncs
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Please see the first patch in the series for the overall
design and use-cases.

See the following email from Toke for the per-packet metadata overhead:
https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/T/#m49d48ea08d525ec88360c7d14c4d34fb0e45e798

Recent changes:

- Various updates to the documentation (Toke)

- !CONFIG_NET for bpf_dev_bound_resolve_kfunc (Martin)

- Comment about race when resolving multiple kfuncs vs attach (Martin)

- Move kfuncs check part under add_kfunc_call (Martin)

- Add missing freplace locks (via bpf_prog_dev_bound_inherit and
  bpf_prog_dev_bound_match) (Martin)

- Rework selftest to expect 0 HW timestamp instead of adding test-specific
  code to veth (Martin)

- Separate patches for offload.c refactoring (Martin)

- Remove sem unlock in __bpf_offload_dev_netdev_register (Martin)

- Rework error handling in bpf_prog_offload_init (Martin)

- Fix wrongly placed list_del_init and unroll
  bpf_dev_bound_try_remove_netdev (Martin)

- Drop locks from bpf_offload_init (Martin)

- Swap the order of bpf_dev_bound_netdev_unregister and
  dev_xdp_uninstall (Martin)

- Return HW timestamp in veth kfunc (Jesper)

- Various fixes around documentation (David)

- Don't define kfunc prototypes (David)

- More clear XDP_METADATA_KFUNC argument names (David)

Prior art (to record pros/cons for different approaches):

- Stable UAPI approach:
  https://lore.kernel.org/bpf/20220628194812.1453059-1-alexandr.lobakin@intel.com/
- Metadata+BTF_ID appoach:
  https://lore.kernel.org/bpf/166256538687.1434226.15760041133601409770.stgit@firesoul/
- v4:
  https://lore.kernel.org/bpf/20221213023605.737383-1-sdf@google.com/
- v3:
  https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/
- v2:
  https://lore.kernel.org/bpf/20221121182552.2152891-1-sdf@google.com/
- v1:
  https://lore.kernel.org/bpf/20221115030210.3159213-1-sdf@google.com/
- kfuncs v2 RFC:
  https://lore.kernel.org/bpf/20221027200019.4106375-1-sdf@google.com/
- kfuncs v1 RFC:
  https://lore.kernel.org/bpf/20221104032532.1615099-1-sdf@google.com/

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org

Stanislav Fomichev (13):
  bpf: Document XDP RX metadata
  bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
  bpf: Move offload initialization into late_initcall
  bpf: Reshuffle some parts of bpf/offload.c
  bpf: Introduce device-bound XDP programs
  selftests/bpf: Update expected test_offload.py messages
  bpf: XDP metadata RX kfuncs
  veth: Introduce veth_xdp_buff wrapper for xdp_buff
  veth: Support RX XDP metadata
  selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  net/mlx4_en: Introduce wrapper for xdp_buff
  net/mlx4_en: Support RX XDP metadata
  selftests/bpf: Simple program to dump XDP RX metadata

Toke Høiland-Jørgensen (4):
  bpf: Support consuming XDP HW metadata from fext programs
  xsk: Add cb area to struct xdp_buff_xsk
  net/mlx5e: Introduce wrapper for xdp_buff
  net/mlx5e: Support RX XDP metadata

 Documentation/networking/index.rst            |   1 +
 Documentation/networking/xdp-rx-metadata.rst  | 107 +++++
 drivers/net/ethernet/mellanox/mlx4/en_clock.c |  13 +-
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |   6 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  63 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |   5 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  11 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  26 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  11 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |  35 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   6 +
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  99 +++--
 drivers/net/netdevsim/bpf.c                   |   4 -
 drivers/net/veth.c                            |  87 ++--
 include/linux/bpf.h                           |  53 ++-
 include/linux/netdevice.h                     |   7 +
 include/net/xdp.h                             |  21 +
 include/net/xsk_buff_pool.h                   |   5 +
 include/uapi/linux/bpf.h                      |   5 +
 kernel/bpf/core.c                             |  12 +-
 kernel/bpf/offload.c                          | 404 +++++++++++------
 kernel/bpf/syscall.c                          |  39 +-
 kernel/bpf/verifier.c                         |  54 ++-
 net/bpf/test_run.c                            |   3 +
 net/core/dev.c                                |   9 +-
 net/core/filter.c                             |   2 +-
 net/core/xdp.c                                |  50 +++
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   8 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 412 ++++++++++++++++++
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  81 ++++
 .../selftests/bpf/progs/xdp_metadata.c        |  64 +++
 .../selftests/bpf/progs/xdp_metadata2.c       |  23 +
 tools/testing/selftests/bpf/test_offload.py   |  10 +-
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 405 +++++++++++++++++
 tools/testing/selftests/bpf/xdp_metadata.h    |  15 +
 38 files changed, 1883 insertions(+), 281 deletions(-)
 create mode 100644 Documentation/networking/xdp-rx-metadata.rst
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata2.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_metadata.h

-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply

* [PATCH bpf-next v5 01/17] bpf: Document XDP RX metadata
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

Document all current use-cases and assumptions.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/networking/index.rst           |   1 +
 Documentation/networking/xdp-rx-metadata.rst | 107 +++++++++++++++++++
 2 files changed, 108 insertions(+)
 create mode 100644 Documentation/networking/xdp-rx-metadata.rst

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 4f2d1f682a18..4ddcae33c336 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -120,6 +120,7 @@ Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
    xfrm_proc
    xfrm_sync
    xfrm_sysctl
+   xdp-rx-metadata
 
 .. only::  subproject and html
 
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
new file mode 100644
index 000000000000..37e8192d9b60
--- /dev/null
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -0,0 +1,107 @@
+===============
+XDP RX Metadata
+===============
+
+This document describes how an XDP program can access hardware metadata
+related to a packet using a set of helper functions, and how it can pass
+that metadata on to other consumers.
+
+General Design
+==============
+
+XDP has access to a set of kfuncs to manipulate the metadata in an XDP frame.
+Every device driver that wishes to expose additional packet metadata can
+implement these kfuncs. The set of kfuncs is declared in ``include/net/xdp.h``
+via ``XDP_METADATA_KFUNC_xxx``.
+
+Currently, the following kfuncs are supported. In the future, as more
+metadata is supported, this set will grow:
+
+- ``bpf_xdp_metadata_rx_timestamp`` returns a packet's RX timestamp
+- ``bpf_xdp_metadata_rx_hash`` returns a packet's RX hash
+
+The XDP program can use these kfuncs to read the metadata into stack
+variables for its own consumption. Or, to pass the metadata on to other
+consumers, an XDP program can store it into the metadata area carried
+ahead of the packet.
+
+Not all kfuncs have to be implemented by the device driver; when not
+implemented, the default ones that return ``-EOPNOTSUPP`` will be used.
+
+Within the XDP frame, the metadata layout is as follows::
+
+  +----------+-----------------+------+
+  | headroom | custom metadata | data |
+  +----------+-----------------+------+
+             ^                 ^
+             |                 |
+   xdp_buff->data_meta   xdp_buff->data
+
+The XDP program can store individual metadata items into this data_meta
+area in whichever format it chooses. Later consumers of the metadata
+will have to agree on the format by some out of band contract (like for
+the AF_XDP use case, see below).
+
+AF_XDP
+======
+
+``AF_XDP`` use-case implies that there is a contract between the BPF program
+that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and the final
+consumer. Thus the BPF program manually allocates a fixed number of
+bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
+of kfuncs to populate it. The userspace ``XSK`` consumer computes
+``xsk_umem__get_data() - METADATA_SIZE`` to locate its metadata.
+
+Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
+
+  +----------+-----------------+------+
+  | headroom | custom metadata | data |
+  +----------+-----------------+------+
+                               ^
+                               |
+                        rx_desc->address
+
+XDP_PASS
+========
+
+This is the path where the packets processed by the XDP program are passed
+into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
+contents. Currently, every driver has custom kernel code to parse
+the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
+conversion, and the XDP metadata is not used by the kernel when building
+skbs. However, TC-BPF programs can access the XDP metadata area using
+the data_meta pointer.
+
+In the future, we'd like to support a case where an XDP program
+can override some of the metadata used for building skbs.
+
+bpf_redirect_map
+================
+
+``bpf_redirect_map`` can redirect the frame to a different device.
+Some devices (like virtual ethernet links) support running a second XDP
+program after the redirect. However, the final consumer doesn't have
+access to the original hardware descriptor and can't access any of
+the original metadata. The same applies to XDP programs installed
+into devmaps and cpumaps.
+
+This means that for redirected packets only custom metadata is
+currently supported, which has to be prepared by the initial XDP program
+before redirect. If the frame is eventually passed to the kernel, the
+skb created from such a frame won't have any hardware metadata populated
+in its skb. And if such a packet is later redirected into an ``XSK``,
+that will also only have access to the custom metadata.
+
+
+bpf_tail_call
+=============
+
+Adding programs that access metadata kfuncs to the ``BPF_MAP_TYPE_PROG_ARRAY``
+is currently not supported.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/progs/xdp_metadata.c`` and
+``tools/testing/selftests/bpf/prog_tests/xdp_metadata.c`` for an example of
+BPF program that handles XDP metadata.
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 02/17] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, Jakub Kicinski
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

BPF offloading infra will be reused to implement
bound-but-not-offloaded bpf programs. Rename existing
helpers for clarity. No functional changes.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf.h   |  8 ++++----
 kernel/bpf/core.c     |  4 ++--
 kernel/bpf/offload.c  |  4 ++--
 kernel/bpf/syscall.c  | 22 +++++++++++-----------
 kernel/bpf/verifier.c | 18 +++++++++---------
 net/core/dev.c        |  4 ++--
 net/core/filter.c     |  2 +-
 7 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5fec2d1be6d7..630f259d0447 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2481,12 +2481,12 @@ void unpriv_ebpf_notify(int new_state);
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
 
-static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
+static inline bool bpf_prog_is_offloaded(const struct bpf_prog_aux *aux)
 {
 	return aux->offload_requested;
 }
 
-static inline bool bpf_map_is_dev_bound(struct bpf_map *map)
+static inline bool bpf_map_is_offloaded(struct bpf_map *map)
 {
 	return unlikely(map->ops == &bpf_map_offload_ops);
 }
@@ -2513,12 +2513,12 @@ static inline int bpf_prog_offload_init(struct bpf_prog *prog,
 	return -EOPNOTSUPP;
 }
 
-static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux)
+static inline bool bpf_prog_is_offloaded(struct bpf_prog_aux *aux)
 {
 	return false;
 }
 
-static inline bool bpf_map_is_dev_bound(struct bpf_map *map)
+static inline bool bpf_map_is_offloaded(struct bpf_map *map)
 {
 	return false;
 }
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 7f98dec6e90f..e9d64e26ef5b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2183,7 +2183,7 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 	 * valid program, which in this case would simply not
 	 * be JITed, but falls back to the interpreter.
 	 */
-	if (!bpf_prog_is_dev_bound(fp->aux)) {
+	if (!bpf_prog_is_offloaded(fp->aux)) {
 		*err = bpf_prog_alloc_jited_linfo(fp);
 		if (*err)
 			return fp;
@@ -2554,7 +2554,7 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 #endif
 	bpf_free_used_maps(aux);
 	bpf_free_used_btfs(aux);
-	if (bpf_prog_is_dev_bound(aux))
+	if (bpf_prog_is_offloaded(aux))
 		bpf_prog_offload_destroy(aux->prog);
 #ifdef CONFIG_PERF_EVENTS
 	if (aux->prog->has_callchain_buf)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 13e4efc971e6..f5769a8ecbee 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -549,7 +549,7 @@ static bool __bpf_offload_dev_match(struct bpf_prog *prog,
 	struct bpf_offload_netdev *ondev1, *ondev2;
 	struct bpf_prog_offload *offload;
 
-	if (!bpf_prog_is_dev_bound(prog->aux))
+	if (!bpf_prog_is_offloaded(prog->aux))
 		return false;
 
 	offload = prog->aux->offload;
@@ -581,7 +581,7 @@ bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map)
 	struct bpf_offloaded_map *offmap;
 	bool ret;
 
-	if (!bpf_map_is_dev_bound(map))
+	if (!bpf_map_is_offloaded(map))
 		return bpf_map_offload_neutral(map);
 	offmap = map_to_offmap(map);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 35972afb6850..13bc96035116 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -181,7 +181,7 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
 	int err;
 
 	/* Need to create a kthread, thus must support schedule */
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		return bpf_map_offload_update_elem(map, key, value, flags);
 	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
 		   map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
@@ -238,7 +238,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 	void *ptr;
 	int err;
 
-	if (bpf_map_is_dev_bound(map))
+	if (bpf_map_is_offloaded(map))
 		return bpf_map_offload_lookup_elem(map, key, value);
 
 	bpf_disable_instrumentation();
@@ -1483,7 +1483,7 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_delete_elem(map, key);
 		goto out;
 	} else if (IS_FD_PROG_ARRAY(map) ||
@@ -1547,7 +1547,7 @@ static int map_get_next_key(union bpf_attr *attr)
 	if (!next_key)
 		goto free_key;
 
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_get_next_key(map, key, next_key);
 		goto out;
 	}
@@ -1605,7 +1605,7 @@ int generic_map_delete_batch(struct bpf_map *map,
 				   map->key_size))
 			break;
 
-		if (bpf_map_is_dev_bound(map)) {
+		if (bpf_map_is_offloaded(map)) {
 			err = bpf_map_offload_delete_elem(map, key);
 			break;
 		}
@@ -1851,7 +1851,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
 		   map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 		   map->map_type == BPF_MAP_TYPE_LRU_HASH ||
 		   map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
-		if (!bpf_map_is_dev_bound(map)) {
+		if (!bpf_map_is_offloaded(map)) {
 			bpf_disable_instrumentation();
 			rcu_read_lock();
 			err = map->ops->map_lookup_and_delete_elem(map, key, value, attr->flags);
@@ -1944,7 +1944,7 @@ static int find_prog_type(enum bpf_prog_type type, struct bpf_prog *prog)
 	if (!ops)
 		return -EINVAL;
 
-	if (!bpf_prog_is_dev_bound(prog->aux))
+	if (!bpf_prog_is_offloaded(prog->aux))
 		prog->aux->ops = ops;
 	else
 		prog->aux->ops = &bpf_offload_prog_ops;
@@ -2255,7 +2255,7 @@ bool bpf_prog_get_ok(struct bpf_prog *prog,
 
 	if (prog->type != *attach_type)
 		return false;
-	if (bpf_prog_is_dev_bound(prog->aux) && !attach_drv)
+	if (bpf_prog_is_offloaded(prog->aux) && !attach_drv)
 		return false;
 
 	return true;
@@ -2598,7 +2598,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	atomic64_set(&prog->aux->refcnt, 1);
 	prog->gpl_compatible = is_gpl ? 1 : 0;
 
-	if (bpf_prog_is_dev_bound(prog->aux)) {
+	if (bpf_prog_is_offloaded(prog->aux)) {
 		err = bpf_prog_offload_init(prog, attr);
 		if (err)
 			goto free_prog_sec;
@@ -3997,7 +3997,7 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 			return -EFAULT;
 	}
 
-	if (bpf_prog_is_dev_bound(prog->aux)) {
+	if (bpf_prog_is_offloaded(prog->aux)) {
 		err = bpf_prog_offload_info_fill(&info, prog);
 		if (err)
 			return err;
@@ -4225,7 +4225,7 @@ static int bpf_map_get_info_by_fd(struct file *file,
 	}
 	info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
 
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
 		if (err)
 			return err;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index faa358b3d5d7..fdfdcab4a59d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13807,7 +13807,7 @@ static int do_check(struct bpf_verifier_env *env)
 			env->prev_log_len = env->log.len_used;
 		}
 
-		if (bpf_prog_is_dev_bound(env->prog->aux)) {
+		if (bpf_prog_is_offloaded(env->prog->aux)) {
 			err = bpf_prog_offload_verify_insn(env, env->insn_idx,
 							   env->prev_insn_idx);
 			if (err)
@@ -14287,7 +14287,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		}
 	}
 
-	if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) &&
+	if ((bpf_prog_is_offloaded(prog->aux) || bpf_map_is_offloaded(map)) &&
 	    !bpf_offload_prog_map_match(prog, map)) {
 		verbose(env, "offload device mismatch between prog and map\n");
 		return -EINVAL;
@@ -14768,7 +14768,7 @@ static int verifier_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
 	unsigned int orig_prog_len = env->prog->len;
 	int err;
 
-	if (bpf_prog_is_dev_bound(env->prog->aux))
+	if (bpf_prog_is_offloaded(env->prog->aux))
 		bpf_prog_offload_remove_insns(env, off, cnt);
 
 	err = bpf_remove_insns(env->prog, off, cnt);
@@ -14849,7 +14849,7 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
 		else
 			continue;
 
-		if (bpf_prog_is_dev_bound(env->prog->aux))
+		if (bpf_prog_is_offloaded(env->prog->aux))
 			bpf_prog_offload_replace_insn(env, i, &ja);
 
 		memcpy(insn, &ja, sizeof(ja));
@@ -15036,7 +15036,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		}
 	}
 
-	if (bpf_prog_is_dev_bound(env->prog->aux))
+	if (bpf_prog_is_offloaded(env->prog->aux))
 		return 0;
 
 	insn = env->prog->insnsi + delta;
@@ -15436,7 +15436,7 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 	int err = 0;
 
 	if (env->prog->jit_requested &&
-	    !bpf_prog_is_dev_bound(env->prog->aux)) {
+	    !bpf_prog_is_offloaded(env->prog->aux)) {
 		err = jit_subprogs(env);
 		if (err == 0)
 			return 0;
@@ -16923,7 +16923,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	if (ret < 0)
 		goto skip_full_check;
 
-	if (bpf_prog_is_dev_bound(env->prog->aux)) {
+	if (bpf_prog_is_offloaded(env->prog->aux)) {
 		ret = bpf_prog_offload_verifier_prep(env->prog);
 		if (ret)
 			goto skip_full_check;
@@ -16936,7 +16936,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	ret = do_check_subprogs(env);
 	ret = ret ?: do_check_main(env);
 
-	if (ret == 0 && bpf_prog_is_dev_bound(env->prog->aux))
+	if (ret == 0 && bpf_prog_is_offloaded(env->prog->aux))
 		ret = bpf_prog_offload_finalize(env);
 
 skip_full_check:
@@ -16971,7 +16971,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	/* do 32-bit optimization after insn patching has done so those patched
 	 * insns could be handled correctly.
 	 */
-	if (ret == 0 && !bpf_prog_is_dev_bound(env->prog->aux)) {
+	if (ret == 0 && !bpf_prog_is_offloaded(env->prog->aux)) {
 		ret = opt_subreg_zext_lo32_rnd_hi32(env, attr);
 		env->prog->aux->verifier_zext = bpf_jit_needs_zext() ? !ret
 								     : false;
diff --git a/net/core/dev.c b/net/core/dev.c
index b76fb37b381e..7d31c57ddd7e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9224,8 +9224,8 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
 			NL_SET_ERR_MSG(extack, "Native and generic XDP can't be active at the same time");
 			return -EEXIST;
 		}
-		if (!offload && bpf_prog_is_dev_bound(new_prog->aux)) {
-			NL_SET_ERR_MSG(extack, "Using device-bound program without HW_MODE flag is not supported");
+		if (!offload && bpf_prog_is_offloaded(new_prog->aux)) {
+			NL_SET_ERR_MSG(extack, "Using offloaded program without HW_MODE flag is not supported");
 			return -EINVAL;
 		}
 		if (new_prog->expected_attach_type == BPF_XDP_DEVMAP) {
diff --git a/net/core/filter.c b/net/core/filter.c
index c746e4d77214..6579ddd20325 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8731,7 +8731,7 @@ static bool xdp_is_valid_access(int off, int size,
 	}
 
 	if (type == BPF_WRITE) {
-		if (bpf_prog_is_dev_bound(prog->aux)) {
+		if (bpf_prog_is_offloaded(prog->aux)) {
 			switch (off) {
 			case offsetof(struct xdp_md, rx_queue_index):
 				return __is_valid_xdp_access(off, size);
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 03/17] bpf: Move offload initialization into late_initcall
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

So we don't have to initialize it manually from several paths.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 kernel/bpf/offload.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index f5769a8ecbee..621e8738f304 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -56,7 +56,6 @@ static const struct rhashtable_params offdevs_params = {
 };
 
 static struct rhashtable offdevs;
-static bool offdevs_inited;
 
 static int bpf_dev_offload_check(struct net_device *netdev)
 {
@@ -72,8 +71,6 @@ bpf_offload_find_netdev(struct net_device *netdev)
 {
 	lockdep_assert_held(&bpf_devs_lock);
 
-	if (!offdevs_inited)
-		return NULL;
 	return rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
 }
 
@@ -673,18 +670,6 @@ struct bpf_offload_dev *
 bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv)
 {
 	struct bpf_offload_dev *offdev;
-	int err;
-
-	down_write(&bpf_devs_lock);
-	if (!offdevs_inited) {
-		err = rhashtable_init(&offdevs, &offdevs_params);
-		if (err) {
-			up_write(&bpf_devs_lock);
-			return ERR_PTR(err);
-		}
-		offdevs_inited = true;
-	}
-	up_write(&bpf_devs_lock);
 
 	offdev = kzalloc(sizeof(*offdev), GFP_KERNEL);
 	if (!offdev)
@@ -710,3 +695,10 @@ void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev)
 	return offdev->priv;
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_priv);
+
+static int __init bpf_offload_init(void)
+{
+	return rhashtable_init(&offdevs, &offdevs_params);
+}
+
+late_initcall(bpf_offload_init);
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 05/17] bpf: Introduce device-bound XDP programs
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way
to associate a netdev with a BPF program at load time.

Some existing 'offloaded' routines are renamed to 'dev_bound' for
consistency with the rest.

netdevsim checks are dropped in favor of generic check in dev_xdp_attach.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/netdevsim/bpf.c    |   4 --
 include/linux/bpf.h            |  24 ++++++--
 include/uapi/linux/bpf.h       |   5 ++
 kernel/bpf/core.c              |   4 +-
 kernel/bpf/offload.c           | 109 +++++++++++++++++++++++++--------
 kernel/bpf/syscall.c           |   9 +--
 net/core/dev.c                 |   5 ++
 tools/include/uapi/linux/bpf.h |   5 ++
 8 files changed, 126 insertions(+), 39 deletions(-)

diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
index 50854265864d..f60eb97e3a62 100644
--- a/drivers/net/netdevsim/bpf.c
+++ b/drivers/net/netdevsim/bpf.c
@@ -315,10 +315,6 @@ nsim_setup_prog_hw_checks(struct netdevsim *ns, struct netdev_bpf *bpf)
 		NSIM_EA(bpf->extack, "xdpoffload of non-bound program");
 		return -EINVAL;
 	}
-	if (!bpf_offload_dev_match(bpf->prog, ns->netdev)) {
-		NSIM_EA(bpf->extack, "program bound to different dev");
-		return -EINVAL;
-	}
 
 	state = bpf->prog->aux->offload->dev_priv;
 	if (WARN_ON(strcmp(state->state, "xlated"))) {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 630f259d0447..4c9649e69993 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1261,7 +1261,8 @@ struct bpf_prog_aux {
 	enum bpf_prog_type saved_dst_prog_type;
 	enum bpf_attach_type saved_dst_attach_type;
 	bool verifier_zext; /* Zero extensions has been inserted by verifier. */
-	bool offload_requested;
+	bool dev_bound; /* Program is bound to the netdev. */
+	bool offload_requested; /* Program is bound and offloaded to the netdev. */
 	bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */
 	bool func_proto_unreliable;
 	bool sleepable;
@@ -2451,7 +2452,7 @@ void __bpf_free_used_maps(struct bpf_prog_aux *aux,
 bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
 
 int bpf_prog_offload_compile(struct bpf_prog *prog);
-void bpf_prog_offload_destroy(struct bpf_prog *prog);
+void bpf_prog_dev_bound_destroy(struct bpf_prog *prog);
 int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
 			       struct bpf_prog *prog);
 
@@ -2479,7 +2480,13 @@ bool bpf_offload_dev_match(struct bpf_prog *prog, struct net_device *netdev);
 void unpriv_ebpf_notify(int new_state);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
-int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
+int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr);
+void bpf_dev_bound_netdev_unregister(struct net_device *dev);
+
+static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
+{
+	return aux->dev_bound;
+}
 
 static inline bool bpf_prog_is_offloaded(const struct bpf_prog_aux *aux)
 {
@@ -2507,12 +2514,21 @@ void sock_map_unhash(struct sock *sk);
 void sock_map_destroy(struct sock *sk);
 void sock_map_close(struct sock *sk, long timeout);
 #else
-static inline int bpf_prog_offload_init(struct bpf_prog *prog,
+static inline int bpf_prog_dev_bound_init(struct bpf_prog *prog,
 					union bpf_attr *attr)
 {
 	return -EOPNOTSUPP;
 }
 
+static inline void bpf_dev_bound_netdev_unregister(struct net_device *dev)
+{
+}
+
+static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
+{
+	return false;
+}
+
 static inline bool bpf_prog_is_offloaded(struct bpf_prog_aux *aux)
 {
 	return false;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bc1a3d232ae4..d0b6ac896699 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_DEV_BOUND_ONLY is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access XDP metadata.
+ */
+#define BPF_F_XDP_DEV_BOUND_ONLY	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index e9d64e26ef5b..bafcb7a3ae6f 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2554,8 +2554,8 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 #endif
 	bpf_free_used_maps(aux);
 	bpf_free_used_btfs(aux);
-	if (bpf_prog_is_offloaded(aux))
-		bpf_prog_offload_destroy(aux->prog);
+	if (bpf_prog_is_dev_bound(aux))
+		bpf_prog_dev_bound_destroy(aux->prog);
 #ifdef CONFIG_PERF_EVENTS
 	if (aux->prog->has_callchain_buf)
 		put_callchain_buffers();
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index deb06498da0b..e803824c1ddd 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -41,7 +41,7 @@ struct bpf_offload_dev {
 struct bpf_offload_netdev {
 	struct rhash_head l;
 	struct net_device *netdev;
-	struct bpf_offload_dev *offdev;
+	struct bpf_offload_dev *offdev; /* NULL when bound-only */
 	struct list_head progs;
 	struct list_head maps;
 	struct list_head offdev_netdevs;
@@ -89,19 +89,17 @@ static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
 	INIT_LIST_HEAD(&ondev->progs);
 	INIT_LIST_HEAD(&ondev->maps);
 
-	down_write(&bpf_devs_lock);
 	err = rhashtable_insert_fast(&offdevs, &ondev->l, offdevs_params);
 	if (err) {
 		netdev_warn(netdev, "failed to register for BPF offload\n");
-		goto err_unlock_free;
+		goto err_free;
 	}
 
-	list_add(&ondev->offdev_netdevs, &offdev->netdevs);
-	up_write(&bpf_devs_lock);
+	if (offdev)
+		list_add(&ondev->offdev_netdevs, &offdev->netdevs);
 	return 0;
 
-err_unlock_free:
-	up_write(&bpf_devs_lock);
+err_free:
 	kfree(ondev);
 	return err;
 }
@@ -149,24 +147,26 @@ static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap)
 static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 						struct net_device *netdev)
 {
-	struct bpf_offload_netdev *ondev, *altdev;
+	struct bpf_offload_netdev *ondev, *altdev = NULL;
 	struct bpf_offloaded_map *offmap, *mtmp;
 	struct bpf_prog_offload *offload, *ptmp;
 
 	ASSERT_RTNL();
 
-	down_write(&bpf_devs_lock);
 	ondev = rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
 	if (WARN_ON(!ondev))
-		goto unlock;
+		return;
 
 	WARN_ON(rhashtable_remove_fast(&offdevs, &ondev->l, offdevs_params));
-	list_del(&ondev->offdev_netdevs);
 
 	/* Try to move the objects to another netdev of the device */
-	altdev = list_first_entry_or_null(&offdev->netdevs,
-					  struct bpf_offload_netdev,
-					  offdev_netdevs);
+	if (offdev) {
+		list_del(&ondev->offdev_netdevs);
+		altdev = list_first_entry_or_null(&offdev->netdevs,
+						  struct bpf_offload_netdev,
+						  offdev_netdevs);
+	}
+
 	if (altdev) {
 		list_for_each_entry(offload, &ondev->progs, offloads)
 			offload->netdev = altdev->netdev;
@@ -185,11 +185,9 @@ static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 	WARN_ON(!list_empty(&ondev->progs));
 	WARN_ON(!list_empty(&ondev->maps));
 	kfree(ondev);
-unlock:
-	up_write(&bpf_devs_lock);
 }
 
-int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
+int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 {
 	struct bpf_offload_netdev *ondev;
 	struct bpf_prog_offload *offload;
@@ -199,7 +197,7 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 	    attr->prog_type != BPF_PROG_TYPE_XDP)
 		return -EINVAL;
 
-	if (attr->prog_flags)
+	if (attr->prog_flags & ~BPF_F_XDP_DEV_BOUND_ONLY)
 		return -EINVAL;
 
 	offload = kzalloc(sizeof(*offload), GFP_USER);
@@ -214,11 +212,23 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 	if (err)
 		goto err_maybe_put;
 
+	prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_DEV_BOUND_ONLY);
+
 	down_write(&bpf_devs_lock);
 	ondev = bpf_offload_find_netdev(offload->netdev);
 	if (!ondev) {
-		err = -EINVAL;
-		goto err_unlock;
+		if (bpf_prog_is_offloaded(prog->aux)) {
+			err = -EINVAL;
+			goto err_unlock;
+		}
+
+		/* When only binding to the device, explicitly
+		 * create an entry in the hashtable.
+		 */
+		err = __bpf_offload_dev_netdev_register(NULL, offload->netdev);
+		if (err)
+			goto err_unlock;
+		ondev = bpf_offload_find_netdev(offload->netdev);
 	}
 	offload->offdev = ondev->offdev;
 	prog->aux->offload = offload;
@@ -321,12 +331,41 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
 	up_read(&bpf_devs_lock);
 }
 
-void bpf_prog_offload_destroy(struct bpf_prog *prog)
+static void __bpf_prog_dev_bound_destroy(struct bpf_prog *prog)
+{
+	struct bpf_prog_offload *offload = prog->aux->offload;
+
+	if (offload->dev_state)
+		offload->offdev->ops->destroy(prog);
+
+	/* Make sure BPF_PROG_GET_NEXT_ID can't find this dead program */
+	bpf_prog_free_id(prog, true);
+
+	kfree(offload);
+	prog->aux->offload = NULL;
+}
+
+void bpf_prog_dev_bound_destroy(struct bpf_prog *prog)
 {
+	struct bpf_offload_netdev *ondev;
+	struct net_device *netdev;
+
+	rtnl_lock();
 	down_write(&bpf_devs_lock);
-	if (prog->aux->offload)
-		__bpf_prog_offload_destroy(prog);
+	if (prog->aux->offload) {
+		list_del_init(&prog->aux->offload->offloads);
+
+		netdev = prog->aux->offload->netdev;
+		if (netdev) {
+			ondev = bpf_offload_find_netdev(netdev);
+			if (ondev && !ondev->offdev && list_empty(&ondev->progs))
+				__bpf_offload_dev_netdev_unregister(NULL, netdev);
+		}
+
+		__bpf_prog_dev_bound_destroy(prog);
+	}
 	up_write(&bpf_devs_lock);
+	rtnl_unlock();
 }
 
 static int bpf_prog_offload_translate(struct bpf_prog *prog)
@@ -621,7 +660,7 @@ static bool __bpf_offload_dev_match(struct bpf_prog *prog,
 	struct bpf_offload_netdev *ondev1, *ondev2;
 	struct bpf_prog_offload *offload;
 
-	if (!bpf_prog_is_offloaded(prog->aux))
+	if (!bpf_prog_is_dev_bound(prog->aux))
 		return false;
 
 	offload = prog->aux->offload;
@@ -667,14 +706,21 @@ bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map)
 int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
 				    struct net_device *netdev)
 {
-	return __bpf_offload_dev_netdev_register(offdev, netdev);
+	int err;
+
+	down_write(&bpf_devs_lock);
+	err = __bpf_offload_dev_netdev_register(offdev, netdev);
+	up_write(&bpf_devs_lock);
+	return err;
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_register);
 
 void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 				       struct net_device *netdev)
 {
+	down_write(&bpf_devs_lock);
 	__bpf_offload_dev_netdev_unregister(offdev, netdev);
+	up_write(&bpf_devs_lock);
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_unregister);
 
@@ -708,6 +754,19 @@ void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev)
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_priv);
 
+void bpf_dev_bound_netdev_unregister(struct net_device *dev)
+{
+	struct bpf_offload_netdev *ondev;
+
+	ASSERT_RTNL();
+
+	down_write(&bpf_devs_lock);
+	ondev = bpf_offload_find_netdev(dev);
+	if (ondev && !ondev->offdev)
+		__bpf_offload_dev_netdev_unregister(NULL, ondev->netdev);
+	up_write(&bpf_devs_lock);
+}
+
 static int __init bpf_offload_init(void)
 {
 	return rhashtable_init(&offdevs, &offdevs_params);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 13bc96035116..11c558be4992 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2491,7 +2491,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 				 BPF_F_TEST_STATE_FREQ |
 				 BPF_F_SLEEPABLE |
 				 BPF_F_TEST_RND_HI32 |
-				 BPF_F_XDP_HAS_FRAGS))
+				 BPF_F_XDP_HAS_FRAGS |
+				 BPF_F_XDP_DEV_BOUND_ONLY))
 		return -EINVAL;
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -2575,7 +2576,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	prog->aux->attach_btf = attach_btf;
 	prog->aux->attach_btf_id = attr->attach_btf_id;
 	prog->aux->dst_prog = dst_prog;
-	prog->aux->offload_requested = !!attr->prog_ifindex;
+	prog->aux->dev_bound = !!attr->prog_ifindex;
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
 
@@ -2598,8 +2599,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	atomic64_set(&prog->aux->refcnt, 1);
 	prog->gpl_compatible = is_gpl ? 1 : 0;
 
-	if (bpf_prog_is_offloaded(prog->aux)) {
-		err = bpf_prog_offload_init(prog, attr);
+	if (bpf_prog_is_dev_bound(prog->aux)) {
+		err = bpf_prog_dev_bound_init(prog, attr);
 		if (err)
 			goto free_prog_sec;
 	}
diff --git a/net/core/dev.c b/net/core/dev.c
index 7d31c57ddd7e..573b819c219a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9228,6 +9228,10 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
 			NL_SET_ERR_MSG(extack, "Using offloaded program without HW_MODE flag is not supported");
 			return -EINVAL;
 		}
+		if (bpf_prog_is_dev_bound(new_prog->aux) && !bpf_offload_dev_match(new_prog, dev)) {
+			NL_SET_ERR_MSG(extack, "Program bound to different device");
+			return -EINVAL;
+		}
 		if (new_prog->expected_attach_type == BPF_XDP_DEVMAP) {
 			NL_SET_ERR_MSG(extack, "BPF_XDP_DEVMAP programs can not be attached to a device");
 			return -EINVAL;
@@ -10830,6 +10834,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
 		dev_shutdown(dev);
 
 		dev_xdp_uninstall(dev);
+		bpf_dev_bound_netdev_unregister(dev);
 
 		netdev_offload_xstats_disable_all(dev);
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index bc1a3d232ae4..d0b6ac896699 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_DEV_BOUND_ONLY is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access XDP metadata.
+ */
+#define BPF_F_XDP_DEV_BOUND_ONLY	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 04/17] bpf: Reshuffle some parts of bpf/offload.c
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

To avoid adding forward declarations in the main patch, shuffle
some code around. No functional changes.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 kernel/bpf/offload.c | 222 +++++++++++++++++++++++--------------------
 1 file changed, 117 insertions(+), 105 deletions(-)

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 621e8738f304..deb06498da0b 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -74,6 +74,121 @@ bpf_offload_find_netdev(struct net_device *netdev)
 	return rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
 }
 
+static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
+					     struct net_device *netdev)
+{
+	struct bpf_offload_netdev *ondev;
+	int err;
+
+	ondev = kzalloc(sizeof(*ondev), GFP_KERNEL);
+	if (!ondev)
+		return -ENOMEM;
+
+	ondev->netdev = netdev;
+	ondev->offdev = offdev;
+	INIT_LIST_HEAD(&ondev->progs);
+	INIT_LIST_HEAD(&ondev->maps);
+
+	down_write(&bpf_devs_lock);
+	err = rhashtable_insert_fast(&offdevs, &ondev->l, offdevs_params);
+	if (err) {
+		netdev_warn(netdev, "failed to register for BPF offload\n");
+		goto err_unlock_free;
+	}
+
+	list_add(&ondev->offdev_netdevs, &offdev->netdevs);
+	up_write(&bpf_devs_lock);
+	return 0;
+
+err_unlock_free:
+	up_write(&bpf_devs_lock);
+	kfree(ondev);
+	return err;
+}
+
+static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
+{
+	struct bpf_prog_offload *offload = prog->aux->offload;
+
+	if (offload->dev_state)
+		offload->offdev->ops->destroy(prog);
+
+	/* Make sure BPF_PROG_GET_NEXT_ID can't find this dead program */
+	bpf_prog_free_id(prog, true);
+
+	list_del_init(&offload->offloads);
+	kfree(offload);
+	prog->aux->offload = NULL;
+}
+
+static int bpf_map_offload_ndo(struct bpf_offloaded_map *offmap,
+			       enum bpf_netdev_command cmd)
+{
+	struct netdev_bpf data = {};
+	struct net_device *netdev;
+
+	ASSERT_RTNL();
+
+	data.command = cmd;
+	data.offmap = offmap;
+	/* Caller must make sure netdev is valid */
+	netdev = offmap->netdev;
+
+	return netdev->netdev_ops->ndo_bpf(netdev, &data);
+}
+
+static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap)
+{
+	WARN_ON(bpf_map_offload_ndo(offmap, BPF_OFFLOAD_MAP_FREE));
+	/* Make sure BPF_MAP_GET_NEXT_ID can't find this dead map */
+	bpf_map_free_id(&offmap->map, true);
+	list_del_init(&offmap->offloads);
+	offmap->netdev = NULL;
+}
+
+static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
+						struct net_device *netdev)
+{
+	struct bpf_offload_netdev *ondev, *altdev;
+	struct bpf_offloaded_map *offmap, *mtmp;
+	struct bpf_prog_offload *offload, *ptmp;
+
+	ASSERT_RTNL();
+
+	down_write(&bpf_devs_lock);
+	ondev = rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
+	if (WARN_ON(!ondev))
+		goto unlock;
+
+	WARN_ON(rhashtable_remove_fast(&offdevs, &ondev->l, offdevs_params));
+	list_del(&ondev->offdev_netdevs);
+
+	/* Try to move the objects to another netdev of the device */
+	altdev = list_first_entry_or_null(&offdev->netdevs,
+					  struct bpf_offload_netdev,
+					  offdev_netdevs);
+	if (altdev) {
+		list_for_each_entry(offload, &ondev->progs, offloads)
+			offload->netdev = altdev->netdev;
+		list_splice_init(&ondev->progs, &altdev->progs);
+
+		list_for_each_entry(offmap, &ondev->maps, offloads)
+			offmap->netdev = altdev->netdev;
+		list_splice_init(&ondev->maps, &altdev->maps);
+	} else {
+		list_for_each_entry_safe(offload, ptmp, &ondev->progs, offloads)
+			__bpf_prog_offload_destroy(offload->prog);
+		list_for_each_entry_safe(offmap, mtmp, &ondev->maps, offloads)
+			__bpf_map_offload_destroy(offmap);
+	}
+
+	WARN_ON(!list_empty(&ondev->progs));
+	WARN_ON(!list_empty(&ondev->maps));
+	kfree(ondev);
+unlock:
+	up_write(&bpf_devs_lock);
+}
+
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 {
 	struct bpf_offload_netdev *ondev;
@@ -206,21 +321,6 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
 	up_read(&bpf_devs_lock);
 }
 
-static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
-{
-	struct bpf_prog_offload *offload = prog->aux->offload;
-
-	if (offload->dev_state)
-		offload->offdev->ops->destroy(prog);
-
-	/* Make sure BPF_PROG_GET_NEXT_ID can't find this dead program */
-	bpf_prog_free_id(prog, true);
-
-	list_del_init(&offload->offloads);
-	kfree(offload);
-	prog->aux->offload = NULL;
-}
-
 void bpf_prog_offload_destroy(struct bpf_prog *prog)
 {
 	down_write(&bpf_devs_lock);
@@ -340,22 +440,6 @@ int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
 const struct bpf_prog_ops bpf_offload_prog_ops = {
 };
 
-static int bpf_map_offload_ndo(struct bpf_offloaded_map *offmap,
-			       enum bpf_netdev_command cmd)
-{
-	struct netdev_bpf data = {};
-	struct net_device *netdev;
-
-	ASSERT_RTNL();
-
-	data.command = cmd;
-	data.offmap = offmap;
-	/* Caller must make sure netdev is valid */
-	netdev = offmap->netdev;
-
-	return netdev->netdev_ops->ndo_bpf(netdev, &data);
-}
-
 struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 {
 	struct net *net = current->nsproxy->net_ns;
@@ -405,15 +489,6 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap)
-{
-	WARN_ON(bpf_map_offload_ndo(offmap, BPF_OFFLOAD_MAP_FREE));
-	/* Make sure BPF_MAP_GET_NEXT_ID can't find this dead map */
-	bpf_map_free_id(&offmap->map, true);
-	list_del_init(&offmap->offloads);
-	offmap->netdev = NULL;
-}
-
 void bpf_map_offload_map_free(struct bpf_map *map)
 {
 	struct bpf_offloaded_map *offmap = map_to_offmap(map);
@@ -592,77 +667,14 @@ bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map)
 int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
 				    struct net_device *netdev)
 {
-	struct bpf_offload_netdev *ondev;
-	int err;
-
-	ondev = kzalloc(sizeof(*ondev), GFP_KERNEL);
-	if (!ondev)
-		return -ENOMEM;
-
-	ondev->netdev = netdev;
-	ondev->offdev = offdev;
-	INIT_LIST_HEAD(&ondev->progs);
-	INIT_LIST_HEAD(&ondev->maps);
-
-	down_write(&bpf_devs_lock);
-	err = rhashtable_insert_fast(&offdevs, &ondev->l, offdevs_params);
-	if (err) {
-		netdev_warn(netdev, "failed to register for BPF offload\n");
-		goto err_unlock_free;
-	}
-
-	list_add(&ondev->offdev_netdevs, &offdev->netdevs);
-	up_write(&bpf_devs_lock);
-	return 0;
-
-err_unlock_free:
-	up_write(&bpf_devs_lock);
-	kfree(ondev);
-	return err;
+	return __bpf_offload_dev_netdev_register(offdev, netdev);
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_register);
 
 void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 				       struct net_device *netdev)
 {
-	struct bpf_offload_netdev *ondev, *altdev;
-	struct bpf_offloaded_map *offmap, *mtmp;
-	struct bpf_prog_offload *offload, *ptmp;
-
-	ASSERT_RTNL();
-
-	down_write(&bpf_devs_lock);
-	ondev = rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
-	if (WARN_ON(!ondev))
-		goto unlock;
-
-	WARN_ON(rhashtable_remove_fast(&offdevs, &ondev->l, offdevs_params));
-	list_del(&ondev->offdev_netdevs);
-
-	/* Try to move the objects to another netdev of the device */
-	altdev = list_first_entry_or_null(&offdev->netdevs,
-					  struct bpf_offload_netdev,
-					  offdev_netdevs);
-	if (altdev) {
-		list_for_each_entry(offload, &ondev->progs, offloads)
-			offload->netdev = altdev->netdev;
-		list_splice_init(&ondev->progs, &altdev->progs);
-
-		list_for_each_entry(offmap, &ondev->maps, offloads)
-			offmap->netdev = altdev->netdev;
-		list_splice_init(&ondev->maps, &altdev->maps);
-	} else {
-		list_for_each_entry_safe(offload, ptmp, &ondev->progs, offloads)
-			__bpf_prog_offload_destroy(offload->prog);
-		list_for_each_entry_safe(offmap, mtmp, &ondev->maps, offloads)
-			__bpf_map_offload_destroy(offmap);
-	}
-
-	WARN_ON(!list_empty(&ondev->progs));
-	WARN_ON(!list_empty(&ondev->maps));
-	kfree(ondev);
-unlock:
-	up_write(&bpf_devs_lock);
+	__bpf_offload_dev_netdev_unregister(offdev, netdev);
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_unregister);
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* [PATCH bpf-next v5 06/17] selftests/bpf: Update expected test_offload.py messages
From: Stanislav Fomichev @ 2022-12-20 22:20 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev
In-Reply-To: <20221220222043.3348718-1-sdf@google.com>

Generic check has a different error message, update the selftest.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/test_offload.py | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py
index 7cb1bc05e5cf..40cba8d368d9 100755
--- a/tools/testing/selftests/bpf/test_offload.py
+++ b/tools/testing/selftests/bpf/test_offload.py
@@ -1039,7 +1039,7 @@ netns = []
     offload = bpf_pinned("/sys/fs/bpf/offload")
     ret, _, err = sim.set_xdp(offload, "drv", fail=False, include_stderr=True)
     fail(ret == 0, "attached offloaded XDP program to drv")
-    check_extack(err, "Using device-bound program without HW_MODE flag is not supported.", args)
+    check_extack(err, "Using offloaded program without HW_MODE flag is not supported.", args)
     rm("/sys/fs/bpf/offload")
     sim.wait_for_flush()
 
@@ -1088,12 +1088,12 @@ netns = []
     ret, _, err = sim.set_xdp(pinned, "offload",
                               fail=False, include_stderr=True)
     fail(ret == 0, "Pinned program loaded for a different device accepted")
-    check_extack_nsim(err, "program bound to different dev.", args)
+    check_extack(err, "Program bound to different device.", args)
     simdev2.remove()
     ret, _, err = sim.set_xdp(pinned, "offload",
                               fail=False, include_stderr=True)
     fail(ret == 0, "Pinned program loaded for a removed device accepted")
-    check_extack_nsim(err, "xdpoffload of non-bound program.", args)
+    check_extack(err, "Program bound to different device.", args)
     rm(pin_file)
     bpftool_prog_list_wait(expected=0)
 
@@ -1334,12 +1334,12 @@ netns = []
     ret, _, err = simA.set_xdp(progB, "offload", force=True, JSON=False,
                                fail=False, include_stderr=True)
     fail(ret == 0, "cross-ASIC program allowed")
-    check_extack_nsim(err, "program bound to different dev.", args)
+    check_extack(err, "Program bound to different device.", args)
     for d in simdevB.nsims:
         ret, _, err = d.set_xdp(progA, "offload", force=True, JSON=False,
                                 fail=False, include_stderr=True)
         fail(ret == 0, "cross-ASIC program allowed")
-        check_extack_nsim(err, "program bound to different dev.", args)
+        check_extack(err, "Program bound to different device.", args)
 
     start_test("Test multi-dev ASIC cross-dev map reuse...")
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox