From: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
To: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Alasdair McWilliam <alasdair.mcwilliam@outlook.com>,
"xdp-newbies@vger.kernel.org" <xdp-newbies@vger.kernel.org>
Subject: Re: ICE + XSK ZC - page faults on 6.1 LTS when process exits?
Date: Fri, 23 Aug 2024 16:09:23 +0200
Message-ID: <ZsiYE9j5DK79h1+/@boxer>
In-Reply-To: <CAJ8uoz0XANzvCwdJYUaY=CcK__AfL7x-FvjQKLCbngZT3_=2gw@mail.gmail.com>
On Fri, Aug 23, 2024 at 10:17:35AM +0200, Magnus Karlsson wrote:
> On Thu, 22 Aug 2024 at 18:25, Alasdair McWilliam
> <alasdair.mcwilliam@outlook.com> wrote:
> >
> > Hi,
> >
> > I've been testing apps that use XSK+ZC on ICE with newer builds of the 6.1 LTS kernel in preparation for some production upgrades, and I've started to notice some instability on newer versions. I can reproduce the issue easily in the lab.
> >
> > Config:
> > - Known good multi-threaded application (i.e. production grade)
> > - Uses eBPF and AF_XDP with zero copy to act as 'bump in wire' in network
> > - Xeons with Intel E810-CQDA2 (firmware: 3.20 0x8000d83e 1.3146.0)
> > - Effectively a vanilla rebuild of 6.1 using configs from el-repo project
> >
> > Scenario:
> > - Noticing hard kernel faults when shutting down application
> > - Can happen if the process is shut down via systemctl stop
> > - Can even happen with a simple kill -9 command to the PID
> > - Appears in builds after 6.1.87
> >
> > Tested kernels:
> > - 6.1.84: process exits smoothly
> > - 6.1.87: process exits smoothly
> > - 6.1.97: BUG: unable to handle page fault for address
> > - 6.1.106: BUG: unable to handle page fault for address
> >
> > Kdump log is below [1] from 6.1.106 but does seem to be the same in the earlier version.
> >
> > Can anyone advise if this is a known issue?
> >
> > I don't have any builds between 6.1.87 and 6.1.97 but I can spend some time trying to pinpoint the exact version things start to go wrong in, if it would help anyone better equipped than me to debug!
>
> Hi Alasdair,
>
> It would be of great help if you could pinpoint the exact version for
> this breakage. Hopefully we could then find the commit in the ice
> driver that breaks your app, since there should be just a handful of
> commits in the ice driver for any stable release.
$ git log --oneline v6.1.87..v6.1.97 drivers/net/ethernet/intel/ice/
dd37b86999fd ice: Fix VSI list rule with ICE_SW_LKUP_LAST type
224b69e8751c ice: avoid IRQ collision to fix init failure on ACPI S3 resume
531d85b4fb66 ice: move RDMA init to ice_idc.c
a62c50545b4d ice: remove af_xdp_zc_qps bitmap
447a5433bd1e ice: remove null checks before devm_kfree() calls
a388961be5ed ice: Introduce new parameters in ice_sched_node
17ccdebe5ac7 ice: fix iteration of TLVs in Preserved Fields Area
07cbc5512023 ice: fix accounting if a VLAN already exists
5ef3a27c6142 ice: Interpret .set_channels() input differently
90cbd4c081bb ice: remove unnecessary duplicate checks for VF VSI ID
59161a21cae0 ice: pass VSI pointer into ice_vc_isvalid_q_id
6a6ebec40820 ice: tc: allow zero flags in parsing tc flower

Can you revert a62c50545b4d and see if the issue persists?
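
Of the twelve ice commits listed above, only one subject line mentions AF_XDP, which is presumably why it is the first revert candidate. A minimal sketch of that shortlisting step follows. The keyword list and the `shortlist` helper are assumptions made for illustration and are not part of any kernel tooling; the log text is copied from the `git log --oneline` output above.

```python
# Hypothetical helper: shortlist commits from `git log --oneline` output
# whose subject line mentions an AF_XDP-related keyword.
LOG = """\
dd37b86999fd ice: Fix VSI list rule with ICE_SW_LKUP_LAST type
224b69e8751c ice: avoid IRQ collision to fix init failure on ACPI S3 resume
531d85b4fb66 ice: move RDMA init to ice_idc.c
a62c50545b4d ice: remove af_xdp_zc_qps bitmap
447a5433bd1e ice: remove null checks before devm_kfree() calls
a388961be5ed ice: Introduce new parameters in ice_sched_node
17ccdebe5ac7 ice: fix iteration of TLVs in Preserved Fields Area
07cbc5512023 ice: fix accounting if a VLAN already exists
5ef3a27c6142 ice: Interpret .set_channels() input differently
90cbd4c081bb ice: remove unnecessary duplicate checks for VF VSI ID
59161a21cae0 ice: pass VSI pointer into ice_vc_isvalid_q_id
6a6ebec40820 ice: tc: allow zero flags in parsing tc flower
"""

# Assumed keywords; widen the tuple if the shortlist comes back empty.
KEYWORDS = ("xdp", "xsk")


def shortlist(log_text, keywords=KEYWORDS):
    """Return (sha, subject) pairs whose subject contains any keyword."""
    hits = []
    for line in log_text.splitlines():
        sha, _, subject = line.partition(" ")
        if any(k in subject.lower() for k in keywords):
            hits.append((sha, subject))
    return hits


if __name__ == "__main__":
    for sha, subject in shortlist(LOG):
        print(sha, subject)
```

A keyword match only narrows the search. Whether the commit is responsible still has to be confirmed by reverting it and re-running the reproducer.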
>
> > Kind regards
> > Alasdair
> >
> > [1] kdump log
> >
> > [ 158.666867] BUG: unable to handle page fault for address: ffffa6510e5580c0
> > [ 158.666887] #PF: supervisor read access in kernel mode
> > [ 158.666896] #PF: error_code(0x0000) - not-present page
> > [ 158.666903] PGD 100000067 P4D 100000067 PUD 106dc4067 PMD 0
> > [ 158.666914] Oops: 0000 [#1] PREEMPT SMP PTI
> > [ 158.666922] CPU: 7 PID: 1808 Comm: tlndd.bin Kdump: loaded Tainted: G E 6.1.106-1.X.el9.x86_64 #1
> > [ 158.666940] Hardware name: Supermicro SYS-1028R-TDW/X10DDW-i, BIOS 3.2 12/16/2019
> > [ 158.666950] RIP: 0010:xp_free+0x11/0x80
> > [ 158.666962] Code: 8b 04 d0 48 83 e0 fe 48 01 f0 c3 cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 58 53 <48> 8b 47 58 48 39 c5 74 0d 5b 5d 41 5c 41 5d 41 5e c3 cc cc cc cc
> > [ 158.666985] RSP: 0018:ffffa65089e8b760 EFLAGS: 00010202
> > [ 158.666993] RAX: ffff8fcf077c0000 RBX: 0000000000000001 RCX: 0000000000000000
> > [ 158.667003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa6510e558068
> > [ 158.667012] RBP: ffffa6510e5580c0 R08: fffff8c50415a108 R09: ffff8fc7cac60000
> > [ 158.667022] R10: 0000000000000219 R11: ffffffffffffffff R12: 0000000000000fff
> > [ 158.667031] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8fc7c139d340
> > [ 158.667040] FS: 00007f8504996880(0000) GS:ffff8fcedfdc0000(0000) knlGS:0000000000000000
> > [ 158.667050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 158.667058] CR2: ffffa6510e5580c0 CR3: 00000001448e2002 CR4: 00000000001706e0
> > [ 158.667068] Call Trace:
> > [ 158.667075] <TASK>
> > [ 158.667082] ? show_trace_log_lvl+0x1c4/0x2df
> > [ 158.667094] ? show_trace_log_lvl+0x1c4/0x2df
> > [ 158.667103] ? ice_xsk_clean_rx_ring+0x39/0x60 [ice]
> > [ 158.667157] ? __die_body.cold+0x8/0xd
> > [ 158.667166] ? page_fault_oops+0xac/0x150
> > [ 158.667176] ? fixup_exception+0x22/0x340
> > [ 158.667185] ? exc_page_fault+0xb2/0x150
> > [ 158.667195] ? asm_exc_page_fault+0x22/0x30
> > [ 158.667206] ? xp_free+0x11/0x80
> > [ 158.667215] ice_xsk_clean_rx_ring+0x39/0x60 [ice]
> > [ 158.667250] ice_clean_rx_ring+0x157/0x180 [ice]
> > [ 158.667284] ice_down+0x172/0x2b0 [ice]
> > [ 158.667311] ? ice_xdp_setup_prog+0x3b0/0x3b0 [ice]
> > [ 158.667337] ice_xdp_setup_prog+0xe3/0x3b0 [ice]
> > [ 158.667364] ? ice_xdp_setup_prog+0x3b0/0x3b0 [ice]
> > [ 158.667391] dev_xdp_install+0xc7/0x100
> > [ 158.667402] dev_xdp_attach+0x1e0/0x560
> > [ 158.667412] do_setlink+0x7a8/0xc10
> > [ 158.667422] ? __nla_validate_parse+0x12b/0x1b0
> > [ 158.667436] __rtnl_newlink+0x540/0x650
> > [ 158.667446] rtnl_newlink+0x44/0x70
> > [ 158.667454] rtnetlink_rcv_msg+0x15c/0x3d0
> > [ 158.667477] ? rtnl_calcit.isra.0+0x140/0x140
> > [ 158.667485] netlink_rcv_skb+0x51/0x100
> > [ 158.667727] netlink_unicast+0x246/0x360
> > [ 158.667953] netlink_sendmsg+0x24e/0x4b0
> > [ 158.668173] __sock_sendmsg+0x62/0x70
> > [ 158.668389] ____sys_sendmsg+0x247/0x2d0
> > [ 158.668602] ? copy_msghdr_from_user+0x6d/0xa0
> > [ 158.668815] ___sys_sendmsg+0x88/0xd0
> > [ 158.669028] ? __sk_destruct+0x156/0x230
> > [ 158.669234] ? kmem_cache_free+0x134/0x300
> > [ 158.669437] ? rcu_nocb_try_bypass+0x4a/0x440
> > [ 158.669634] ? __sk_destruct+0x156/0x230
> > [ 158.669825] ? _raw_spin_unlock_irqrestore+0x23/0x40
> > [ 158.670010] ? mod_objcg_state+0xc9/0x2f0
> > [ 158.670186] ? refill_obj_stock+0xae/0x160
> > [ 158.670359] ? rseq_get_rseq_cs.isra.0+0x16/0x220
> > [ 158.670529] ? rcu_nocb_try_bypass+0x4a/0x440
> > [ 158.670696] ? rseq_ip_fixup+0x72/0x1e0
> > [ 158.670860] __sys_sendmsg+0x59/0xa0
> > [ 158.671021] ? syscall_trace_enter.constprop.0+0x11e/0x190
> > [ 158.671185] do_syscall_64+0x35/0x80
> > [ 158.671345] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > [ 158.671503] RIP: 0033:0x7f850510f917
> > [ 158.671658] Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
> > [ 158.671993] RSP: 002b:00007ffcc805f238 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> > [ 158.672171] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f850510f917
> > [ 158.672352] RDX: 0000000000000000 RSI: 000000000198e9e8 RDI: 0000000000000009
> > [ 158.672534] RBP: 0000000001933c00 R08: 0000000001935980 R09: 0000000000460e48
> > [ 158.672716] R10: 0000000000000011 R11: 0000000000000246 R12: 0000000001933c30
> > [ 158.672899] R13: 0000000000515fd8 R14: 000000000198e9d0 R15: 0000000000513690
> > [ 158.673086] </TASK>
> > [ 158.673269] Modules linked in: bonding(E) tls(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) rfkill(E) ip_set(E) nf_tables(E) libcrc32c(E) nfnetlink(E) vfat(E) fat(E) ipmi_ssif(E) intel_rapl_msr(E) intel_rapl_common(E) sb_edac(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) iTCO_wdt(E) intel_pmc_bxt(E) iTCO_vendor_support(E) kvm(E) irqbypass(E) rapl(E) intel_cstate(E) ast(E) intel_uncore(E) drm_vram_helper(E) drm_ttm_helper(E) ttm(E) pcspkr(E) mei_me(E) drm_kms_helper(E) i2c_i801(E) lpc_ich(E) mei(E) i2c_smbus(E) mxm_wmi(E) ioatdma(E) acpi_ipmi(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) acpi_pad(E) acpi_power_meter(E) joydev(E) drm(E) fuse(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) sg(E) ahci(E) crct10dif_pclmul(E) crc32_pclmul(E) libahci(E) crc32c_intel(E) ice(E)
> > [ 158.673314] polyval_clmulni(E) polyval_generic(E) igb(E) libata(E) ghash_clmulni_intel(E) i2c_algo_bit(E) dca(E) wmi(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
> > [ 158.675578] CR2: ffffa6510e5580c0
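
The register dump above can be cross-checked with a little arithmetic. The sketch below rests on a hand decoding of the bytes around the `<48>` marker in the Code: line, which is an assumption and not disassembler output: `48 8d 6f 58` is taken as `lea 0x58(%rdi),%rbp` and `48 8b 47 58` as `mov 0x58(%rdi),%rax`. If that decoding is right, the faulting access is a read at offset 0x58 from the pointer in RDI, and both RBP and CR2 should equal RDI + 0x58. The helper names are hypothetical; the error-code bit meanings are the architectural x86 ones.

```python
# Values copied from the oops above.
RDI = 0xffffa6510e558068
RBP = 0xffffa6510e5580c0
CR2 = 0xffffa6510e5580c0
ERROR_CODE = 0x0000

# Hand-decoded instruction bytes (assumption, see the text above).
LEA_RBP = bytes.fromhex("488d6f58")  # lea 0x58(%rdi),%rbp
MOV_RAX = bytes.fromhex("488b4758")  # mov 0x58(%rdi),%rax  <- faulting read


def disp8(insn):
    """Signed 8-bit displacement, the last byte of these 4-byte encodings."""
    return int.from_bytes(insn[-1:], "little", signed=True)


def decode_pf_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "present": bool(code & 0x1),  # 0: page not present
        "write": bool(code & 0x2),    # 0: read access
        "user": bool(code & 0x4),     # 0: supervisor (kernel) mode
    }


if __name__ == "__main__":
    fault_addr = RDI + disp8(MOV_RAX)
    print(hex(fault_addr), fault_addr == CR2, fault_addr == RBP)
    print(decode_pf_error(ERROR_CODE))
```

The decoded error code agrees with the two `#PF:` lines (supervisor read access, not-present page), and the computed address agrees with CR2. That is consistent with `xp_free()` having been handed a pointer into memory that was no longer mapped, though the arithmetic alone does not show which caller state was stale.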