Kernel 4.1.6 Panic due to slab corruption

linux-kernel.vger.kernel.org archive mirror

* Kernel 4.1.6 Panic due to slab corruption
@ 2015-09-07  8:41 Nikolay Borisov
  2015-09-07 10:37 ` Holger Hoffstätte
  2015-09-07 11:14 ` Nikolay Borisov
  0 siblings, 2 replies; 17+ messages in thread
From: Nikolay Borisov @ 2015-09-07  8:41 UTC (permalink / raw)
  To: Linux-Kernel@Vger. Kernel. Org; +Cc: Marian Marinov, SiteGround Operations, cl

Hello, 

On one of our servers I've observed a kernel panic
with the following backtrace:

[654405.527070] BUG: unable to handle kernel paging request at 0000000000028001
[654405.527076] IP: [<ffffffff81182a59>] kmem_cache_alloc_node+0x99/0x1e0
[654405.527085] PGD 14bef58067 PUD 2ab358067 PMD 0 
[654405.527089] Oops: 0000 [#11] SMP 
[654405.527093] Modules linked in: xt_multiport tcp_diag inet_diag act_police cls_basic sch_ingress scsi_transport_iscsi ipt_REJECT nf_reject_ipv4 xt_pkttype xt_state veth openvswitch xt_owner xt_conntrack iptable_filter iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler mpt2sas scsi_transport_sas raid_class
[654405.527145] CPU: 14 PID: 32267 Comm: httpd Tainted: G      D      L  4.1.6-clouder1 #1
[654405.527147] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0  07/09/2013
[654405.527149] task: ffff88139d3b1ec0 ti: ffff8808eda14000 task.ti: ffff8808eda14000
[654405.527151] RIP: 0010:[<ffffffff81182a59>]  [<ffffffff81182a59>] kmem_cache_alloc_node+0x99/0x1e0
[654405.527155] RSP: 0018:ffff88407fcc3a98  EFLAGS: 00210246
[654405.527156] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8814ce9acf80
[654405.527157] RDX: 00000000837ad864 RSI: 0000000000050200 RDI: 0000000000018ce0
[654405.527158] RBP: ffff88407fcc3af8 R08: ffff88407fcd8ce0 R09: ffffffffa033d990
[654405.527159] R10: ffff88058676fdd8 R11: 0000000000007b4a R12: ffff881fff807ac0
[654405.527161] R13: 0000000000028001 R14: 0000000000000001 R15: ffff881fff807ac0
[654405.527162] FS:  0000000000000000(0000) GS:ffff88407fcc0000(0063) knlGS:0000000055c832e0
[654405.527164] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[654405.527165] CR2: 0000000000028001 CR3: 0000001467b64000 CR4: 00000000000406e0
[654405.527166] Stack:
[654405.527167]  0000000000000000 0000000000000000 0000000000000000 ffff881ff2d05000
[654405.527170]  ffff88407fcc3ae8 00050200812b5903 ffff88407fcc3ae8 00000000000001a2
[654405.527172]  0000000000000001 ffff88058676fc60 ffff88058676fe80 0000000000001800
[654405.527175] Call Trace:
[654405.527177]  <IRQ> 
[654405.527184]  [<ffffffffa033d990>] ovs_flow_stats_update+0x110/0x160 [openvswitch]
[654405.527189]  [<ffffffffa033ae74>] ovs_dp_process_packet+0x64/0xf0 [openvswitch]
[654405.527193]  [<ffffffffa0345c60>] ? netdev_port_receive+0x110/0x110 [openvswitch]
[654405.527197]  [<ffffffffa0345c60>] ? netdev_port_receive+0x110/0x110 [openvswitch]
[654405.527201]  [<ffffffffa0344815>] ovs_vport_receive+0x85/0xb0 [openvswitch]
[654405.527207]  [<ffffffff812c7636>] ? blk_mq_free_hctx_request+0x36/0x40
[654405.527209]  [<ffffffff812c7671>] ? blk_mq_free_request+0x31/0x40
[654405.527214]  [<ffffffff8100c2f9>] ? read_tsc+0x9/0x10
[654405.527220]  [<ffffffff810b9f04>] ? ktime_get+0x54/0xc0
[654405.527225]  [<ffffffff813cf577>] ? put_device+0x17/0x20
[654405.527227]  [<ffffffffa0048a50>] ? tcf_act_police+0x150/0x210 [act_police]
[654405.527232]  [<ffffffff8150cdc1>] ? tcf_action_exec+0x51/0xa0
[654405.527235]  [<ffffffffa0011445>] ? basic_classify+0x75/0xe0 [cls_basic]
[654405.527237]  [<ffffffff815091d5>] ? tc_classify+0x55/0xc0
[654405.527241]  [<ffffffffa0345bed>] netdev_port_receive+0x9d/0x110 [openvswitch]
[654405.527245]  [<ffffffffa0345c94>] netdev_frame_hook+0x34/0x50 [openvswitch]
[654405.527250]  [<ffffffff814e58e6>] __netif_receive_skb_core+0x206/0x880
[654405.527252]  [<ffffffff814e5f87>] __netif_receive_skb+0x27/0x70
[654405.527254]  [<ffffffff814e60c1>] process_backlog+0xf1/0x1b0
[654405.527257]  [<ffffffff814e68d3>] napi_poll+0xd3/0x1c0
[654405.527259]  [<ffffffff814e6a50>] net_rx_action+0x90/0x1c0
[654405.527264]  [<ffffffff810595ab>] __do_softirq+0xfb/0x2a0
[654405.527270]  [<ffffffff815b269c>] do_softirq_own_stack+0x1c/0x30
[654405.527271]  <EOI> 
[654405.527273]  [<ffffffff810590b5>] do_softirq+0x55/0x60
[654405.527276]  [<ffffffff81059198>] __local_bh_enable_ip+0x88/0x90
[654405.527279]  [<ffffffff8152b062>] ip_finish_output+0x282/0x490
[654405.527281]  [<ffffffff8152b55b>] ip_output+0xab/0xc0
[654405.527283]  [<ffffffff8152ade0>] ? ip_finish_output_gso+0x4e0/0x4e0
[654405.527285]  [<ffffffff815296fb>] ip_local_out_sk+0x3b/0x50
[654405.527287]  [<ffffffff81529e0e>] ip_queue_xmit+0x14e/0x3c0
[654405.527291]  [<ffffffff815422d2>] tcp_transmit_skb+0x4c2/0x850
[654405.527294]  [<ffffffff81544c1d>] tcp_write_xmit+0x19d/0x670
[654405.527298]  [<ffffffff812f32d1>] ? copy_user_generic_string+0x31/0x40
[654405.527300]  [<ffffffff81545cd2>] __tcp_push_pending_frames+0x32/0xd0
[654405.527302]  [<ffffffff81532911>] tcp_push+0xf1/0x120
[654405.527304]  [<ffffffff815361f3>] tcp_sendmsg+0x373/0xb60
[654405.527307]  [<ffffffff811be0b3>] ? mntput+0x23/0x40
[654405.527310]  [<ffffffff811a7c32>] ? path_put+0x22/0x30
[654405.527315]  [<ffffffff81561272>] inet_sendmsg+0x42/0xb0
[654405.527317]  [<ffffffff81182e4e>] ? kmem_cache_alloc+0xee/0x1c0
[654405.527321]  [<ffffffff814c639d>] sock_sendmsg+0x4d/0x60
[654405.527324]  [<ffffffff814c64a6>] sock_write_iter+0xb6/0x100
[654405.527328]  [<ffffffff8119d9d0>] do_iter_readv_writev+0x60/0x90
[654405.527330]  [<ffffffff814c63f0>] ? kernel_sendmsg+0x40/0x40
[654405.527332]  [<ffffffff8119e354>] compat_do_readv_writev+0x174/0x1f0
[654405.527337]  [<ffffffff810aa6d9>] ? rcu_eqs_exit+0x79/0xb0
[654405.527339]  [<ffffffff810aa723>] ? rcu_user_exit+0x13/0x20
[654405.527342]  [<ffffffff8119e591>] compat_SyS_writev+0xc1/0x110
[654405.527346]  [<ffffffff811274a3>] ? context_tracking_user_enter+0x13/0x20
[654405.527349]  [<ffffffff815b2fc5>] sysenter_dispatch+0x7/0x25
[654405.527350] Code: 8b 00 48 c1 e8 38 41 39 c6 74 17 4c 89 c9 44 89 f2 8b 75 cc 4c 89 e7 e8 46 f6 ff ff 49 89 c5 eb 2b 90 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c 
[654405.527378] RIP  [<ffffffff81182a59>] kmem_cache_alloc_node+0x99/0x1e0
[654405.527381]  RSP <ffff88407fcc3a98>
[654405.527383] CR2: 0000000000028001

Before this panic there were several more "unable to handle kernel paging request" oopses, e.g.:

[654405.518482] BUG: unable to handle kernel paging request at 0000000000028001
[654405.518488] IP: [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
[654405.518496] PGD 364da24067 PUD 3733ae2067 PMD 0 
[654405.518501] Oops: 0000 [#10] SMP 
[654405.518504] Modules linked in: xt_multiport tcp_diag inet_diag act_police cls_basic sch_ingress scsi_transport_iscsi ipt_REJECT nf_reject_ipv4 xt_pkttype xt_state veth openvswitch xt_owner xt_conntrack iptable_filter iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler mpt2sas scsi_transport_sas raid_class
[654405.518555] CPU: 14 PID: 15732 Comm: guardian Tainted: G      D      L  4.1.6-clouder1 #1
[654405.518557] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0  07/09/2013
[654405.518559] task: ffff88373303e680 ti: ffff88369b388000 task.ti: ffff88369b388000
[654405.518560] RIP: 0010:[<ffffffff811824e5>]  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
[654405.518564] RSP: 0018:ffff88369b38bb48  EFLAGS: 00010282
[654405.518565] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[654405.518567] RDX: 00000000837ad864 RSI: 00000000000000d0 RDI: 0000000000018ce0
[654405.518568] RBP: ffff88369b38bb88 R08: ffff88407fcd8ce0 R09: ffffffff811c272c
[654405.518569] R10: ffff88369b38bb74 R11: ffff881f7c678db8 R12: ffff881fff807ac0
[654405.518570] R13: 0000000000028001 R14: ffff881fff807ac0 R15: 00000000000000d0
[654405.518572] FS:  00002b784bf66800(0000) GS:ffff88407fcc0000(0000) knlGS:0000000000000000
[654405.518574] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[654405.518575] CR2: 0000000000028001 CR3: 000000364d574000 CR4: 00000000000406e0
[654405.518576] Stack:
[654405.518578]  000000013a481c58 0000000000000020 ffff883600010000 ffff88245528ca00
[654405.518580]  ffffffff8120bc50 ffff881a3d3433c8 ffff88245528ca10 ffffffff81209ed0
[654405.518583]  ffff88369b38bbc8 ffffffff811c272c ffff88245528ca10 0000000000000000
[654405.518586] Call Trace:
[654405.518593]  [<ffffffff8120bc50>] ? proc_pid_follow_link+0x80/0x80
[654405.518596]  [<ffffffff81209ed0>] ? sched_autogroup_open+0x50/0x50
[654405.518601]  [<ffffffff811c272c>] single_open+0x3c/0xb0
[654405.518603]  [<ffffffff81209eeb>] proc_single_open+0x1b/0x20
[654405.518606]  [<ffffffff8119b69a>] do_dentry_open+0x22a/0x350
[654405.518608]  [<ffffffff8119b809>] vfs_open+0x49/0x50
[654405.518612]  [<ffffffff811ae652>] do_last+0x412/0x890
[654405.518615]  [<ffffffff81182e4e>] ? kmem_cache_alloc+0xee/0x1c0
[654405.518620]  [<ffffffff8129d6b6>] ? security_file_alloc+0x16/0x20
[654405.518623]  [<ffffffff811aeb62>] path_openat+0x92/0x470
[654405.518626]  [<ffffffff811ac753>] ? user_path_at_empty+0x63/0xa0
[654405.518628]  [<ffffffff811aef8a>] do_filp_open+0x4a/0xa0
[654405.518633]  [<ffffffff812fb140>] ? find_next_zero_bit+0x10/0x20
[654405.518637]  [<ffffffff811bb64c>] ? __alloc_fd+0xac/0x150
[654405.518640]  [<ffffffff8119ce9a>] do_sys_open+0x11a/0x230
[654405.518644]  [<ffffffff8101190e>] ? syscall_trace_enter_phase1+0x14e/0x160
[654405.518650]  [<ffffffff811274a3>] ? context_tracking_user_enter+0x13/0x20
[654405.518652]  [<ffffffff8119cfee>] SyS_open+0x1e/0x20
[654405.518656]  [<ffffffff815b0bee>] system_call_fastpath+0x12/0x71
[654405.518658] Code: 08 65 4c 03 05 5d 7c e8 7e 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 8c 00 00 00 48 85 c0 0f 84 83 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c 
[654405.518686] RIP  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
[654405.518689]  RSP <ffff88369b38bb48>
[654405.518690] CR2: 0000000000028001


[654405.511613] BUG: unable to handle kernel paging request at 0000000000028001
[654405.511619] IP: [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
[654405.511628] PGD 3f9a016067 PUD 3ee598c067 PMD 0 
[654405.511632] Oops: 0000 [#9] SMP 
[654405.511634] Modules linked in: xt_multiport tcp_diag inet_diag act_police cls_basic sch_ingress scsi_transport_iscsi ipt_REJECT nf_reject_ipv4 xt_pkttype xt_state veth openvswitch xt_owner xt_conntrack iptable_filter iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler mpt2sas scsi_transport_sas raid_class
[654405.511684] CPU: 14 PID: 14914 Comm: templar.pl Tainted: G      D      L  4.1.6-clouder1 #1
[654405.511687] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0  07/09/2013
[654405.511689] task: ffff881f46d8bd80 ti: ffff883ee583c000 task.ti: ffff883ee583c000
[654405.511690] RIP: 0010:[<ffffffff811824e5>]  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
[654405.511694] RSP: 0018:ffff883ee583fe38  EFLAGS: 00010282
[654405.511695] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff881f3e1f8540
[654405.511697] RDX: 00000000837ad864 RSI: 00000000000080d0 RDI: 0000000000018ce0
[654405.511698] RBP: ffff883ee583fe78 R08: ffff88407fcd8ce0 R09: ffffffff8129028f
[654405.511699] R10: 0000000000000008 R11: 0000000000000246 R12: ffff881fff807ac0
[654405.511701] R13: 0000000000028001 R14: ffff881fff807ac0 R15: 00000000000080d0
[654405.511703] FS:  00002b06256163a0(0000) GS:ffff88407fcc0000(0000) knlGS:0000000000000000
[654405.511704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[654405.511706] CR2: 0000000000028001 CR3: 0000003f520c4000 CR4: 00000000000406e0
[654405.511707] Stack:
[654405.511708]  0000000100000404 0000000000000020 ffff883ee583fe78 0000000000000000
[654405.511711]  0000000000001000 0000000000000001 0000000000018003 0000000000000001
[654405.511715]  ffff883ee583ff28 ffffffff8129028f 0000000000000001 00000000000007d0
[654405.511717] Call Trace:
[654405.511726]  [<ffffffff8129028f>] do_shmat+0x22f/0x4a0
[654405.511729]  [<ffffffff8129051c>] SyS_shmat+0x1c/0x30
[654405.511734]  [<ffffffff815b0bee>] system_call_fastpath+0x12/0x71
[654405.511736] Code: 08 65 4c 03 05 5d 7c e8 7e 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 8c 00 00 00 48 85 c0 0f 84 83 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c 
[654405.511763] RIP  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
[654405.511765]  RSP <ffff883ee583fe38>
[654405.511766] CR2: 0000000000028001

[654405.502947] BUG: unable to handle kernel paging request at 0000000000028001
[654405.502952] IP: [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
[654405.502961] PGD 1c7d1ba067 PUD 1d7c06d067 PMD 0 
[654405.502965] Oops: 0000 [#8] SMP 
[654405.502968] Modules linked in: xt_multiport tcp_diag inet_diag act_police cls_basic sch_ingress scsi_transport_iscsi ipt_REJECT nf_reject_ipv4 xt_pkttype xt_state veth openvswitch xt_owner xt_conntrack iptable_filter iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler mpt2sas scsi_transport_sas raid_class
[654405.503021] CPU: 14 PID: 1342 Comm: gather_daemon.p Tainted: G      D      L  4.1.6-clouder1 #1
[654405.503024] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0  07/09/2013
[654405.503026] task: ffff883dc1e170c0 ti: ffff881df4f80000 task.ti: ffff881df4f80000
[654405.503027] RIP: 0010:[<ffffffff811824e5>]  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
[654405.503031] RSP: 0018:ffff881df4f83a98  EFLAGS: 00010282
[654405.503033] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000001884e6d
[654405.503034] RDX: 00000000837ad864 RSI: 00000000000000d0 RDI: 0000000000018ce0
[654405.503035] RBP: ffff881df4f83ad8 R08: ffff88407fcd8ce0 R09: ffffffff811c272c
[654405.503037] R10: 0000000000000008 R11: 0000000000000001 R12: ffff881fff807ac0
[654405.503038] R13: 0000000000028001 R14: ffff881fff807ac0 R15: 00000000000000d0
[654405.503040] FS:  0000000000000000(0000) GS:ffff88407fcc0000(0063) knlGS:00000000558d2c00
[654405.503041] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[654405.503043] CR2: 0000000000028001 CR3: 0000001daa3cd000 CR4: 00000000000406e0
[654405.503044] Stack:
[654405.503046]  ffff883a856c0402 0000000000000020 ffff881df4f83af8 ffff8825209b0b00
[654405.503049]  ffffffff81212960 0000000000000000 ffffffff81212960 0000000000000000
[654405.503051]  ffff881df4f83b18 ffffffff811c272c ffffffff81212960 0000000000000000
[654405.503054] Call Trace:
[654405.503063]  [<ffffffff81212960>] ? get_iowait_time+0x70/0x70
[654405.503066]  [<ffffffff81212960>] ? get_iowait_time+0x70/0x70
[654405.503070]  [<ffffffff811c272c>] single_open+0x3c/0xb0
[654405.503073]  [<ffffffff81212960>] ? get_iowait_time+0x70/0x70
[654405.503075]  [<ffffffff81212960>] ? get_iowait_time+0x70/0x70
[654405.503077]  [<ffffffff811c27f0>] single_open_size+0x50/0x90
[654405.503080]  [<ffffffff811c1d20>] ? seq_release_private+0x60/0x60
[654405.503082]  [<ffffffff8121286a>] stat_open+0x4a/0x60
[654405.503085]  [<ffffffff81209574>] proc_reg_open+0x84/0x120
[654405.503088]  [<ffffffff812094f0>] ? proc_entry_rundown+0xa0/0xa0
[654405.503091]  [<ffffffff8119b69a>] do_dentry_open+0x22a/0x350
[654405.503093]  [<ffffffff8119b809>] vfs_open+0x49/0x50
[654405.503097]  [<ffffffff811ae652>] do_last+0x412/0x890
[654405.503102]  [<ffffffff8100c299>] ? sched_clock+0x9/0x10
[654405.503107]  [<ffffffff81084a7b>] ? sched_clock_cpu+0xab/0xc0
[654405.503110]  [<ffffffff81182e4e>] ? kmem_cache_alloc+0xee/0x1c0
[654405.503115]  [<ffffffff8129d6b6>] ? security_file_alloc+0x16/0x20
[654405.503118]  [<ffffffff811aeb62>] path_openat+0x92/0x470
[654405.503122]  [<ffffffff8108ff1f>] ? put_prev_task_fair+0x2f/0x50
[654405.503126]  [<ffffffff810b2931>] ? lock_hrtimer_base+0x31/0x60
[654405.503128]  [<ffffffff811aef8a>] do_filp_open+0x4a/0xa0
[654405.503132]  [<ffffffff812fb140>] ? find_next_zero_bit+0x10/0x20
[654405.503136]  [<ffffffff811bb64c>] ? __alloc_fd+0xac/0x150
[654405.503140]  [<ffffffff8119ce9a>] do_sys_open+0x11a/0x230
[654405.503145]  [<ffffffff810b9b2e>] ? getnstimeofday64+0xe/0x30
[654405.503150]  [<ffffffff811274a3>] ? context_tracking_user_enter+0x13/0x20
[654405.503154]  [<ffffffff811ee4cb>] compat_SyS_open+0x1b/0x20
[654405.503160]  [<ffffffff815b2fc5>] sysenter_dispatch+0x7/0x25
[654405.503162] Code: 08 65 4c 03 05 5d 7c e8 7e 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 8c 00 00 00 48 85 c0 0f 84 83 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c 
[654405.503191] RIP  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
[654405.503194]  RSP <ffff881df4f83a98>
[654405.503195] CR2: 0000000000028001
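
The reports above can be compared field by field. Below is a minimal Python sketch (the function name `oops_fields` and the choice of fields are illustrative assumptions, not part of any kernel tooling) that pulls the CPU number, task name, R12, R13 and CR2 out of one oops dump so that several dumps can be checked for shared values.

```python
import re

# Register names of interest; CR2 holds the faulting address on x86.
_REG = re.compile(r"\b(R12|R13|CR2): ([0-9a-f]{16})\b")
_CPU = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")

def oops_fields(text):
    """Return a dict of selected fields from one oops dump (illustrative helper)."""
    fields = {}
    m = _CPU.search(text)
    if m:
        fields["cpu"] = int(m.group(1))
        fields["pid"] = int(m.group(2))
        fields["comm"] = m.group(3)
    for name, value in _REG.findall(text):
        # Keep the first occurrence; CR2 is printed more than once per oops.
        fields.setdefault(name, int(value, 16))
    return fields

# A few lines copied from the first report above.
sample = """\
[654405.527145] CPU: 14 PID: 32267 Comm: httpd Tainted: G      D      L  4.1.6-clouder1 #1
[654405.527159] R10: ffff88058676fdd8 R11: 0000000000007b4a R12: ffff881fff807ac0
[654405.527161] R13: 0000000000028001 R14: 0000000000000001 R15: ffff881fff807ac0
[654405.527166] CR2: 0000000000028001 CR3: 0000001467b64000 CR4: 00000000000406e0
"""

f = oops_fields(sample)
print(f["cpu"], f["comm"], hex(f["R12"]), hex(f["R13"]), hex(f["CR2"]))
```

Run over the four dumps quoted above, this should report the same CPU (14), the same R12 (ffff881fff807ac0) and the same R13/CR2 value (0000000000028001) each time.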


I have more like these, but I believe those are enough. The
following pattern emerges from these failures:

1. All these failures happen when allocating a 32-byte struct,
which leads me to believe that the corruption is in the
kmalloc-32 slab cache.

2. Another thing that stands out is the faulting address:
the value 0000000000028001 appears in most of them. For the oops
that led to the panic, here is what the decoded code shows:

Code: 8b 00 48 c1 e8 38 41 39 c6 74 17 4c 89 c9 44 89 f2 8b 75 cc 4c 89 e7 e8 46 f6 ff ff 49 89 c5 eb 2b 90 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c

Code starting with the faulting instruction
===========================================
   0:	49 8b 5c 05 00       	mov    0x0(%r13,%rax,1),%rbx
   5:	48 8d 4a 01          	lea    0x1(%rdx),%rcx
   9:	4c 89 e8             	mov    %r13,%rax
   c:	65 48 0f c7 0f       	cmpxchg16b %gs:(%rdi)
  11:	0f 94 c0             	sete   %al
  14:	3c                   	.byte 0x3c

r13 takes part in computing the address from which rbx is loaded;
r13 = 0000000000028001, which matches CR2 since rax is 0.
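
The address computation in the faulting instruction can be checked by hand. The sketch below (Python; the helper name `effective_address` is an assumption made for illustration) evaluates the AT&T operand `0x0(%r13,%rax,1)` with the register values from the panic report and compares the result with CR2. It also checks alignment, on the assumption that a valid kmalloc object pointer is at least 8-byte aligned on x86-64.

```python
def effective_address(base, index, scale=1, disp=0):
    """Effective address of an x86 operand disp(base,index,scale), with 64-bit wraparound."""
    return (base + index * scale + disp) & 0xFFFFFFFFFFFFFFFF

# Register values copied from the panic report above.
r13 = 0x0000000000028001
rax = 0x0000000000000000
cr2 = 0x0000000000028001

addr = effective_address(base=r13, index=rax, scale=1, disp=0x0)
print(hex(addr), addr == cr2)

# Assumption: kmalloc objects are at least 8-byte aligned, so an odd
# value cannot be a valid object pointer.
print(addr % 8 == 0)
```

The odd low bit suggests that r13 never held a valid object pointer, which would be consistent with a corrupted freelist rather than a stale but once-valid pointer.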

Any ideas how to debug this? The first thing that comes to mind is
to boot the machine with slab merging disabled, in the hope
that this would narrow the scope of the memory corruption and
make the culprit easier to identify the next time it occurs.
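
To see how far slab merging currently widens the scope, one can list which caches share a slab with kmalloc-32. A minimal Python sketch follows, assuming that SLUB exposes its caches under /sys/kernel/slab and that merged caches appear there as symlinks to one shared directory; the function names `merged_groups` and `sharing_with` are hypothetical. Merging itself can be turned off with the `slub_nomerge` boot parameter.

```python
import os
from collections import defaultdict

def merged_groups(slab_dir="/sys/kernel/slab"):
    """Map each symlink target in slab_dir to the sorted cache names pointing at it."""
    groups = defaultdict(list)
    for name in os.listdir(slab_dir):
        path = os.path.join(slab_dir, name)
        if os.path.islink(path):
            # Assumption: merged caches are symlinks to one shared entry.
            target = os.path.basename(os.readlink(path))
            groups[target].append(name)
    return {target: sorted(names) for target, names in groups.items()}

def sharing_with(cache, slab_dir="/sys/kernel/slab"):
    """Return the other cache names that share a slab with `cache` (empty if none)."""
    for names in merged_groups(slab_dir).values():
        if cache in names:
            return [n for n in names if n != cache]
    return []
```

Calling `sharing_with("kmalloc-32")` on the affected machine would then give the list of caches whose users could also be writing into that slab; after booting with merging disabled the list should be empty.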

Here are the config options for the allocator in use: 

grep -i slub kernel-conf-4.1
# CONFIG_SLUB_DEBUG is not set
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SLUB_STATS is not set

If more information is needed I'm happy to provide it. 

Any help will be much appreciated.

Regards, 
Nikolay

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-07  8:41 Kernel 4.1.6 Panic due to slab corruption Nikolay Borisov
@ 2015-09-07 10:37 ` Holger Hoffstätte
  2015-09-07 11:30   ` Nikolay Borisov
  2015-09-07 11:14 ` Nikolay Borisov
  1 sibling, 1 reply; 17+ messages in thread
From: Holger Hoffstätte @ 2015-09-07 10:37 UTC (permalink / raw)
  To: linux-kernel

On Mon, 07 Sep 2015 11:41:17 +0300, Nikolay Borisov wrote:

> Hello, 
> 
> On one of our servers I've observed a kernel panic
> happening with the following backtrace:
> 
> [654405.527070] BUG: unable to handle kernel paging request at 0000000000028001
> [654405.527076] IP: [<ffffffff81182a59>] kmem_cache_alloc_node+0x99/0x1e0
> [654405.527085] PGD 14bef58067 PUD 2ab358067 PMD 0 

Interesting! I can't offer much help but had a similar panic just the other day
for no apparent reason while running a bunch of compiles. First time I've seen
this with 4.1.6:

Sep  5 20:42:02 ragnarok kernel: BUG: unable to handle kernel paging request at ffff8800e789b740
Sep  5 20:42:02 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: PGD 1aa2067 PUD 61f7fd067 PMD 0 
Sep  5 20:42:02 ragnarok kernel: Oops: 0000 [#1] SMP 
Sep  5 20:42:02 ragnarok kernel: Modules linked in: auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache sunrpc autofs4 sch_fq_codel snd_hda_codec_realtek x86_pkg_temp_thermal coretemp snd_hda_codec_generic crc32_pclmul crc32c_intel aesni_intel radeon aes_x86_64 glue_helper snd_hda_codec_hdmi lrw gf128mul ablk_helper cryptd i2c_algo_bit snd_usb_audio uvcvideo snd_hda_intel drm_kms_helper snd_hda_controller snd_hwdep videobuf2_vmalloc snd_usbmidi_lib videobuf2_memops snd_hda_codec videobuf2_core snd_rawmidi i2c_i801 ttm snd_hda_core v4l2_common snd_seq_device videodev snd_pcm usbhid drm snd_timer r8169 snd i2c_core mii soundcore parport_pc parport
Sep  5 20:42:02 ragnarok kernel: CPU: 0 PID: 32755 Comm: sh Not tainted 4.1.6 #1
Sep  5 20:42:02 ragnarok kernel: Hardware name: Gigabyte Technology Co., Ltd. P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
Sep  5 20:42:02 ragnarok kernel: task: ffff880569712e20 ti: ffff8804e4d90000 task.ti: ffff8804e4d90000
Sep  5 20:42:02 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RSP: 0018:ffff8804e4d93d88  EFLAGS: 00010282
Sep  5 20:42:02 ragnarok kernel: RAX: 0000000000000000 RBX: ffff8805e7eacce0 RCX: 000000000001f7e8
Sep  5 20:42:02 ragnarok kernel: RDX: 000000000001f7e7 RSI: 00000000000000d0 RDI: 0000000000018c70
Sep  5 20:42:02 ragnarok kernel: RBP: ffff8804e4d93dc8 R08: ffff88061f418c70 R09: 0000000000000000
Sep  5 20:42:02 ragnarok kernel: R10: ffffffff81748318 R11: ffffea00139bb500 R12: 00000000000000d0
Sep  5 20:42:02 ragnarok kernel: R13: ffff880606890600 R14: ffffffff8100d039 R15: ffff8800e789b740
Sep  5 20:42:02 ragnarok kernel: FS:  00007f9c1d2f2700(0000) GS:ffff88061f400000(0000) knlGS:0000000000000000
Sep  5 20:42:02 ragnarok kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  5 20:42:02 ragnarok kernel: CR2: ffff8800e789b740 CR3: 00000005f68ce000 CR4: 00000000000406f0
Sep  5 20:42:02 ragnarok kernel: Stack:
Sep  5 20:42:02 ragnarok kernel:  0000000000000000 ffff88061f7e6c00 0000000000000002 ffff8805e7eacce0
Sep  5 20:42:02 ragnarok kernel:  ffff880569712e20 0000000001200011 ffff8805e7eacce0 ffff880569712e20
Sep  5 20:42:02 ragnarok kernel:  ffff8804e4d93de8 ffffffff8100d039 0000000000000000 00007f9c1d2f29d0
Sep  5 20:42:02 ragnarok kernel: Call Trace:
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8100d039>] arch_dup_task_struct+0x69/0x170
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104de8f>] copy_process.part.8+0x14f/0x1760
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8113909f>] ? handle_mm_fault+0xd0f/0x13a0
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81171c14>] ? get_empty_filp+0xd4/0x1c0
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8105b63f>] ? recalc_sigpending+0x1f/0x60
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104f657>] do_fork+0xd7/0x370
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8105ed07>] ? sigprocmask+0x57/0x90
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104f976>] SyS_clone+0x16/0x20
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81571d17>] system_call_fastpath+0x12/0x6a
Sep  5 20:42:02 ragnarok kernel: Code: 65 4c 03 05 ee e3 ea 7e 49 83 78 10 00 4d 8b 38 0f 84 b0 00 00 00 4d 85 ff 0f 84 a7 00 00 00 49 63 45 20 48 8d 4a 01 49 8b 7d 00 <49> 8b 1c 07 4c 89 f8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b9 49 63 
Sep  5 20:42:02 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel:  RSP <ffff8804e4d93d88>
Sep  5 20:42:02 ragnarok kernel: CR2: ffff8800e789b740
Sep  5 20:42:02 ragnarok kernel: ---[ end trace e4478715791f5752 ]---
Sep  5 20:42:02 ragnarok kernel: BUG: unable to handle kernel paging request at ffff8800e789b740
Sep  5 20:42:02 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: PGD 1aa2067 PUD 61f7fd067 PMD 0 
Sep  5 20:42:02 ragnarok kernel: Oops: 0000 [#2] SMP 
Sep  5 20:42:02 ragnarok kernel: Modules linked in: auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache sunrpc autofs4 sch_fq_codel snd_hda_codec_realtek x86_pkg_temp_thermal coretemp snd_hda_codec_generic crc32_pclmul crc32c_intel aesni_intel radeon aes_x86_64 glue_helper snd_hda_codec_hdmi lrw gf128mul ablk_helper cryptd i2c_algo_bit snd_usb_audio uvcvideo snd_hda_intel drm_kms_helper snd_hda_controller snd_hwdep videobuf2_vmalloc snd_usbmidi_lib videobuf2_memops snd_hda_codec videobuf2_core snd_rawmidi i2c_i801 ttm snd_hda_core v4l2_common snd_seq_device videodev snd_pcm usbhid drm snd_timer r8169 snd i2c_core mii soundcore parport_pc parport
Sep  5 20:42:02 ragnarok kernel: CPU: 0 PID: 32550 Comm: sh Tainted: G      D         4.1.6 #1
Sep  5 20:42:02 ragnarok kernel: Hardware name: Gigabyte Technology Co., Ltd. P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
Sep  5 20:42:02 ragnarok kernel: task: ffff880602cd1ec0 ti: ffff8805b26ac000 task.ti: ffff8805b26ac000
Sep  5 20:42:02 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RSP: 0018:ffff8805b26afd88  EFLAGS: 00010282
Sep  5 20:42:02 ragnarok kernel: RAX: 0000000000000000 RBX: ffff8805e7ea8f60 RCX: 000000000001f7e8
Sep  5 20:42:02 ragnarok kernel: RDX: 000000000001f7e7 RSI: 00000000000000d0 RDI: 0000000000018c70
Sep  5 20:42:02 ragnarok kernel: RBP: ffff8805b26afdc8 R08: ffff88061f418c70 R09: 0000000000000000
Sep  5 20:42:02 ragnarok kernel: R10: ffffffff81748318 R11: ffffea0015a2ec00 R12: 00000000000000d0
Sep  5 20:42:02 ragnarok kernel: R13: ffff880606890600 R14: ffffffff8100d039 R15: ffff8800e789b740
Sep  5 20:42:02 ragnarok kernel: FS:  00007f9c1d2f2700(0000) GS:ffff88061f400000(0000) knlGS:0000000000000000
Sep  5 20:42:02 ragnarok kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep  5 20:42:02 ragnarok kernel: CR2: ffff8800e789b740 CR3: 00000005e3b76000 CR4: 00000000000406f0
Sep  5 20:42:02 ragnarok kernel: Stack:
Sep  5 20:42:02 ragnarok kernel:  0000000000000000 ffff88061f7e6c00 0000000000000002 ffff8805e7ea8f60
Sep  5 20:42:02 ragnarok kernel:  ffff880602cd1ec0 0000000001200011 ffff8805e7ea8f60 ffff880602cd1ec0
Sep  5 20:42:02 ragnarok kernel:  ffff8805b26afde8 ffffffff8100d039 0000000000000000 00007f9c1d2f29d0
Sep  5 20:42:02 ragnarok kernel: Call Trace:
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8100d039>] arch_dup_task_struct+0x69/0x170
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104de8f>] copy_process.part.8+0x14f/0x1760
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8126a936>] ? security_file_alloc+0x16/0x20
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81171c14>] ? get_empty_filp+0xd4/0x1c0
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81185966>] ? __d_instantiate+0x96/0xf0
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff812c6b1a>] ? find_next_zero_bit+0x1a/0x30
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8105b63f>] ? recalc_sigpending+0x1f/0x60
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104f657>] do_fork+0xd7/0x370
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8105ed07>] ? sigprocmask+0x57/0x90
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104f976>] SyS_clone+0x16/0x20
Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81571d17>] system_call_fastpath+0x12/0x6a
Sep  5 20:42:02 ragnarok kernel: Code: 65 4c 03 05 ee e3 ea 7e 49 83 78 10 00 4d 8b 38 0f 84 b0 00 00 00 4d 85 ff 0f 84 a7 00 00 00 49 63 45 20 48 8d 4a 01 49 8b 7d 00 <49> 8b 1c 07 4c 89 f8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b9 49 63 
Sep  5 20:42:02 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel:  RSP <ffff8805b26afd88>
Sep  5 20:42:02 ragnarok kernel: CR2: ffff8800e789b740
Sep  5 20:42:02 ragnarok kernel: ---[ end trace e4478715791f5753 ]---

..etc.

I also have all of

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLUB_CPU_PARTIAL=y

set.

Hope this helps somewhat.

Holger


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-07 10:37 ` Holger Hoffstätte
@ 2015-09-07 11:30   ` Nikolay Borisov
  2015-09-07 11:49     ` Holger Hoffstätte
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2015-09-07 11:30 UTC (permalink / raw)
  To: Holger Hoffstätte, linux-kernel

Hi,

If you have the vmlinux image for the kernel you were running at the
time the crash occurred, could you post the output of addr2line -f -e
path/to/vmlinux ffffffff8115bd4d to see if it also fails in
get_freepointer?
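
A minimal sketch of that lookup in Python, assuming binutils `addr2line` is on the PATH; the helper names are hypothetical and the vmlinux path is a placeholder. The `-i` flag is an addition to the command above: it asks addr2line to also print inlined frames, which matters if get_freepointer was inlined into kmem_cache_alloc.

```python
import re
import subprocess

_IP = re.compile(r"\[<([0-9a-f]{16})>\]\s+([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")

def parse_ip_line(line):
    """Extract (address, symbol, offset, size) from an oops 'IP:' or 'RIP' line."""
    m = _IP.search(line)
    if m is None:
        raise ValueError("no [<addr>] symbol+off/size found")
    addr, sym, off, size = m.groups()
    return int(addr, 16), sym, int(off, 16), int(size, 16)

def addr2line_argv(vmlinux, address):
    """Build the addr2line command; -f prints function names, -i prints inlined frames."""
    return ["addr2line", "-f", "-i", "-e", vmlinux, "%x" % address]

def resolve(vmlinux, address):
    """Run addr2line and return its output (needs a vmlinux built with debug info)."""
    return subprocess.check_output(addr2line_argv(vmlinux, address), text=True)

line = "IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160"
addr, sym, off, size = parse_ip_line(line)
print(addr2line_argv("path/to/vmlinux", addr))
```

Only the command line is built here; `resolve()` would have to be called with the real vmlinux from the crashed kernel to get a meaningful answer.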

Regards,
Nikolay

On 09/07/2015 01:37 PM, Holger Hoffstätte wrote:
> On Mon, 07 Sep 2015 11:41:17 +0300, Nikolay Borisov wrote:
> 
>> Hello, 
>>
>> On one of our servers I've observed a kernel panic
>> happening with the following backtrace:
>>
>> [654405.527070] BUG: unable to handle kernel paging request at 0000000000028001
>> [654405.527076] IP: [<ffffffff81182a59>] kmem_cache_alloc_node+0x99/0x1e0
>> [654405.527085] PGD 14bef58067 PUD 2ab358067 PMD 0 
> 
> Interesting! I can't offer much help but had a similar panic just the other day
> for no apparent reason while running a bunch of compiles. First time I've seen
> this with 4.1.6:
> 
> Sep  5 20:42:02 ragnarok kernel: BUG: unable to handle kernel paging request at ffff8800e789b740
> Sep  5 20:42:02 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
> Sep  5 20:42:02 ragnarok kernel: PGD 1aa2067 PUD 61f7fd067 PMD 0 
> Sep  5 20:42:02 ragnarok kernel: Oops: 0000 [#1] SMP 
> Sep  5 20:42:02 ragnarok kernel: Modules linked in: auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache sunrpc autofs4 sch_fq_codel snd_hda_codec_realtek x86_pkg_temp_thermal coretemp snd_hda_codec_generic crc32_pclmul crc32c_intel aesni_intel radeon aes_x86_64 glue_helper snd_hda_codec_hdmi lrw gf128mul ablk_helper cryptd i2c_algo_bit snd_usb_audio uvcvideo snd_hda_intel drm_kms_helper snd_hda_controller snd_hwdep videobuf2_vmalloc snd_usbmidi_lib videobuf2_memops snd_hda_codec videobuf2_core snd_rawmidi i2c_i801 ttm snd_hda_core v4l2_common snd_seq_device videodev snd_pcm usbhid drm snd_timer r8169 snd i2c_core mii soundcore parport_pc parport
> Sep  5 20:42:02 ragnarok kernel: CPU: 0 PID: 32755 Comm: sh Not tainted 4.1.6 #1
> Sep  5 20:42:02 ragnarok kernel: Hardware name: Gigabyte Technology Co., Ltd. P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
> Sep  5 20:42:02 ragnarok kernel: task: ffff880569712e20 ti: ffff8804e4d90000 task.ti: ffff8804e4d90000
> Sep  5 20:42:02 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
> Sep  5 20:42:02 ragnarok kernel: RSP: 0018:ffff8804e4d93d88  EFLAGS: 00010282
> Sep  5 20:42:02 ragnarok kernel: RAX: 0000000000000000 RBX: ffff8805e7eacce0 RCX: 000000000001f7e8
> Sep  5 20:42:02 ragnarok kernel: RDX: 000000000001f7e7 RSI: 00000000000000d0 RDI: 0000000000018c70
> Sep  5 20:42:02 ragnarok kernel: RBP: ffff8804e4d93dc8 R08: ffff88061f418c70 R09: 0000000000000000
> Sep  5 20:42:02 ragnarok kernel: R10: ffffffff81748318 R11: ffffea00139bb500 R12: 00000000000000d0
> Sep  5 20:42:02 ragnarok kernel: R13: ffff880606890600 R14: ffffffff8100d039 R15: ffff8800e789b740
> Sep  5 20:42:02 ragnarok kernel: FS:  00007f9c1d2f2700(0000) GS:ffff88061f400000(0000) knlGS:0000000000000000
> Sep  5 20:42:02 ragnarok kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep  5 20:42:02 ragnarok kernel: CR2: ffff8800e789b740 CR3: 00000005f68ce000 CR4: 00000000000406f0
> Sep  5 20:42:02 ragnarok kernel: Stack:
> Sep  5 20:42:02 ragnarok kernel:  0000000000000000 ffff88061f7e6c00 0000000000000002 ffff8805e7eacce0
> Sep  5 20:42:02 ragnarok kernel:  ffff880569712e20 0000000001200011 ffff8805e7eacce0 ffff880569712e20
> Sep  5 20:42:02 ragnarok kernel:  ffff8804e4d93de8 ffffffff8100d039 0000000000000000 00007f9c1d2f29d0
> Sep  5 20:42:02 ragnarok kernel: Call Trace:
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8100d039>] arch_dup_task_struct+0x69/0x170
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104de8f>] copy_process.part.8+0x14f/0x1760
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8113909f>] ? handle_mm_fault+0xd0f/0x13a0
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81171c14>] ? get_empty_filp+0xd4/0x1c0
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8105b63f>] ? recalc_sigpending+0x1f/0x60
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104f657>] do_fork+0xd7/0x370
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8105ed07>] ? sigprocmask+0x57/0x90
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104f976>] SyS_clone+0x16/0x20
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81571d17>] system_call_fastpath+0x12/0x6a
> Sep  5 20:42:02 ragnarok kernel: Code: 65 4c 03 05 ee e3 ea 7e 49 83 78 10 00 4d 8b 38 0f 84 b0 00 00 00 4d 85 ff 0f 84 a7 00 00 00 49 63 45 20 48 8d 4a 01 49 8b 7d 00 <49> 8b 1c 07 4c 89 f8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b9 49 63 
> Sep  5 20:42:02 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
> Sep  5 20:42:02 ragnarok kernel:  RSP <ffff8804e4d93d88>
> Sep  5 20:42:02 ragnarok kernel: CR2: ffff8800e789b740
> Sep  5 20:42:02 ragnarok kernel: ---[ end trace e4478715791f5752 ]---
> Sep  5 20:42:02 ragnarok kernel: BUG: unable to handle kernel paging request at ffff8800e789b740
> Sep  5 20:42:02 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
> Sep  5 20:42:02 ragnarok kernel: PGD 1aa2067 PUD 61f7fd067 PMD 0 
> Sep  5 20:42:02 ragnarok kernel: Oops: 0000 [#2] SMP 
> Sep  5 20:42:02 ragnarok kernel: Modules linked in: auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache sunrpc autofs4 sch_fq_codel snd_hda_codec_realtek x86_pkg_temp_thermal coretemp snd_hda_codec_generic crc32_pclmul crc32c_intel aesni_intel radeon aes_x86_64 glue_helper snd_hda_codec_hdmi lrw gf128mul ablk_helper cryptd i2c_algo_bit snd_usb_audio uvcvideo snd_hda_intel drm_kms_helper snd_hda_controller snd_hwdep videobuf2_vmalloc snd_usbmidi_lib videobuf2_memops snd_hda_codec videobuf2_core snd_rawmidi i2c_i801 ttm snd_hda_core v4l2_common snd_seq_device videodev snd_pcm usbhid drm snd_timer r8169 snd i2c_core mii soundcore parport_pc parport
> Sep  5 20:42:02 ragnarok kernel: CPU: 0 PID: 32550 Comm: sh Tainted: G      D         4.1.6 #1
> Sep  5 20:42:02 ragnarok kernel: Hardware name: Gigabyte Technology Co., Ltd. P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
> Sep  5 20:42:02 ragnarok kernel: task: ffff880602cd1ec0 ti: ffff8805b26ac000 task.ti: ffff8805b26ac000
> Sep  5 20:42:02 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
> Sep  5 20:42:02 ragnarok kernel: RSP: 0018:ffff8805b26afd88  EFLAGS: 00010282
> Sep  5 20:42:02 ragnarok kernel: RAX: 0000000000000000 RBX: ffff8805e7ea8f60 RCX: 000000000001f7e8
> Sep  5 20:42:02 ragnarok kernel: RDX: 000000000001f7e7 RSI: 00000000000000d0 RDI: 0000000000018c70
> Sep  5 20:42:02 ragnarok kernel: RBP: ffff8805b26afdc8 R08: ffff88061f418c70 R09: 0000000000000000
> Sep  5 20:42:02 ragnarok kernel: R10: ffffffff81748318 R11: ffffea0015a2ec00 R12: 00000000000000d0
> Sep  5 20:42:02 ragnarok kernel: R13: ffff880606890600 R14: ffffffff8100d039 R15: ffff8800e789b740
> Sep  5 20:42:02 ragnarok kernel: FS:  00007f9c1d2f2700(0000) GS:ffff88061f400000(0000) knlGS:0000000000000000
> Sep  5 20:42:02 ragnarok kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Sep  5 20:42:02 ragnarok kernel: CR2: ffff8800e789b740 CR3: 00000005e3b76000 CR4: 00000000000406f0
> Sep  5 20:42:02 ragnarok kernel: Stack:
> Sep  5 20:42:02 ragnarok kernel:  0000000000000000 ffff88061f7e6c00 0000000000000002 ffff8805e7ea8f60
> Sep  5 20:42:02 ragnarok kernel:  ffff880602cd1ec0 0000000001200011 ffff8805e7ea8f60 ffff880602cd1ec0
> Sep  5 20:42:02 ragnarok kernel:  ffff8805b26afde8 ffffffff8100d039 0000000000000000 00007f9c1d2f29d0
> Sep  5 20:42:02 ragnarok kernel: Call Trace:
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8100d039>] arch_dup_task_struct+0x69/0x170
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104de8f>] copy_process.part.8+0x14f/0x1760
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8126a936>] ? security_file_alloc+0x16/0x20
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81171c14>] ? get_empty_filp+0xd4/0x1c0
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81185966>] ? __d_instantiate+0x96/0xf0
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff812c6b1a>] ? find_next_zero_bit+0x1a/0x30
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8105b63f>] ? recalc_sigpending+0x1f/0x60
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104f657>] do_fork+0xd7/0x370
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8105ed07>] ? sigprocmask+0x57/0x90
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff8104f976>] SyS_clone+0x16/0x20
> Sep  5 20:42:02 ragnarok kernel:  [<ffffffff81571d17>] system_call_fastpath+0x12/0x6a
> Sep  5 20:42:02 ragnarok kernel: Code: 65 4c 03 05 ee e3 ea 7e 49 83 78 10 00 4d 8b 38 0f 84 b0 00 00 00 4d 85 ff 0f 84 a7 00 00 00 49 63 45 20 48 8d 4a 01 49 8b 7d 00 <49> 8b 1c 07 4c 89 f8 65 48 0f c7 0f 0f 94 c0 84 c0 74 b9 49 63 
> Sep  5 20:42:02 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
> Sep  5 20:42:02 ragnarok kernel:  RSP <ffff8805b26afd88>
> Sep  5 20:42:02 ragnarok kernel: CR2: ffff8800e789b740
> Sep  5 20:42:02 ragnarok kernel: ---[ end trace e4478715791f5753 ]---
> 
> ..etc.
> 
> I also have all of
> 
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
> CONFIG_SLUB_CPU_PARTIAL=y
> 
> set.
> 
> Hope this helps somewhat.
> 
> Holger

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-07 11:30   ` Nikolay Borisov
@ 2015-09-07 11:49     ` Holger Hoffstätte
  2015-09-07 11:58       ` Holger Hoffstätte
  0 siblings, 1 reply; 17+ messages in thread
From: Holger Hoffstätte @ 2015-09-07 11:49 UTC (permalink / raw)
  To: linux-kernel

On Mon, 07 Sep 2015 14:30:49 +0300, Nikolay Borisov wrote:

> If you have the vmlinux image for the kernel you were running at the
> time the crash occurred, could you post the output of addr2line -f -e
> path/to/vmlinux ffffffff8115bd4d to see if it also fails in
> get_freepointer?

Had to rebuild to get an uncompressed vmlinux, but here it is:

holger>addr2line -f -e vmlinux ffffffff8115bd4d
kmem_cache_alloc
??:?

Not sure how much we can learn from this. :}

-h


* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-07 11:49     ` Holger Hoffstätte
@ 2015-09-07 11:58       ` Holger Hoffstätte
  0 siblings, 0 replies; 17+ messages in thread
From: Holger Hoffstätte @ 2015-09-07 11:58 UTC (permalink / raw)
  To: linux-kernel

On Mon, 07 Sep 2015 11:49:12 +0000, Holger Hoffstätte wrote:

> On Mon, 07 Sep 2015 14:30:49 +0300, Nikolay Borisov wrote:
> 
>> If you have the vmlinux image for the kernel you were running at the
>> time the crash occurred, could you post the output of addr2line -f -e
>> path/to/vmlinux ffffffff8115bd4d to see if it also fails in
>> get_freepointer?
> 
> Had to rebuild to get an uncompressed vmlinux, but here it is:
> 
> holger>addr2line -f -e vmlinux ffffffff8115bd4d
> kmem_cache_alloc
> ??:?
> 
> Not sure how much we can learn from this. :}

Also for what it's worth the related splatter is all in the same place:

holger>zgrep ffffffff8115bd4d /var/log/kern.log.0.gz 
Sep  5 20:42:02 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:02 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:03 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:03 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:03 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:10 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:10 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:10 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:10 ragnarok kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:10 ragnarok kernel: RIP: 0010:[<ffffffff8115bd4d>]  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160
Sep  5 20:42:10 ragnarok kernel: RIP  [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160

CPU is a 4 core HT i7 (aka 8 vcores), and so I have 8 splats.
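
The splat count can also be read off the log mechanically instead of by eye. Below is a minimal Python sketch, assuming the oops lines look like the ones quoted in this thread; the function name count_oopses is made up for illustration. It counts only the "IP:" line of each oops, so the matching "RIP:" and "RIP" lines are not counted a second time.

```python
import re
from collections import Counter

# Matches the first line of the instruction-pointer report of an oops, e.g.
#   "... kernel: IP: [<ffffffff8115bd4d>] kmem_cache_alloc+0x6d/0x160"
# The lookbehind keeps "RIP: ..." lines from matching.
OOPS_IP = re.compile(r"(?<![A-Z])IP: \[<([0-9a-f]{16})>\] (\S+)")


def count_oopses(lines):
    """Return a Counter keyed by (address, symbol+offset), one hit per oops."""
    hits = Counter()
    for line in lines:
        match = OOPS_IP.search(line)
        if match:
            hits[(match.group(1), match.group(2))] += 1
    return hits
```

Feeding it the decompressed kern.log (for example via gzip.open(path, "rt")) should give one key per distinct faulting address, with the number of oopses as the value.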

thanks,

Holger


* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-07  8:41 Kernel 4.1.6 Panic due to slab corruption Nikolay Borisov
  2015-09-07 10:37 ` Holger Hoffstätte
@ 2015-09-07 11:14 ` Nikolay Borisov
  2015-09-08 13:58   ` Christoph Lameter
  1 sibling, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2015-09-07 11:14 UTC (permalink / raw)
  To: Linux-Kernel@Vger. Kernel. Org; +Cc: Marian Marinov, SiteGround Operations, cl

Did a bit more investigation and it turns out the
corruption is happening in slab_alloc_node, in the
'else' branch when get_freepointer is being called:

0xffffffff81182a50 <+144>:	movsxd rax,DWORD PTR [r12+0x20]
0xffffffff81182a55 <+149>:	mov    rdi,QWORD PTR [r12]
0xffffffff81182a59 <+153>:	mov    rbx,QWORD PTR [r13+rax*1+0x0]

The faulting instruction is at offset +153; running addr2line shows that
this is get_freepointer:

addr2line -f -e vmlinux-4.1.6-clouder1 ffffffff81182a59
get_freepointer
/home/projects/linux-stable/mm/slub.c:247

In this case the values of the arguments of this function are completely
bogus (or so it seems):

1. RAX is shown to be 0, and rax is supposed to hold the pointer to
struct kmem_cache. Curiously, there is no NULL pointer error, and if the
pointer really were NULL, the check on the return value of
slab_pre_alloc_hook would have terminated the function early.

2. The value of r13 (which holds the pointer to the first free object
from the freelist) is also bogus: 0000000000028001

I'm a bit puzzled as to why I am not getting a NULL ptr error. But in
any case it looks like the per-cpu slub cache freelist has been corrupted.
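
For what it's worth, the faulting address can be cross-checked against the register dump. The sketch below (Python, purely illustrative; the helper name effective_address is made up here) evaluates the memory operand of the faulting instruction, [r13+rax*1+0x0], with the RAX and R13 values from the oops and compares the result with CR2. Note that in the disassembly above rax is loaded from [r12+0x20] immediately before the fault, so one possible reading, which is an assumption and not confirmed here, is that rax carries a field of the structure r12 points to rather than the structure pointer itself. That would be consistent with the access faulting at 0x28001 instead of at address 0.

```python
def effective_address(base, index, scale=1, disp=0):
    """Effective address of an x86-64 memory operand [base + index*scale + disp]."""
    return (base + index * scale + disp) & 0xFFFFFFFFFFFFFFFF


# Register values copied from the oops (RAX, R13) and the faulting address (CR2).
RAX = 0x0000000000000000
R13 = 0x0000000000028001
CR2 = 0x0000000000028001

# Operand of the faulting instruction: mov rbx, QWORD PTR [r13+rax*1+0x0]
fault = effective_address(base=R13, index=RAX, scale=1, disp=0x0)
print(hex(fault), fault == CR2)
```

This prints "0x28001 True", i.e. the bad address comes entirely from r13.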

Doing addr2line on the other paging request failures also shows that the
issue is in the same function, get_freepointer:

addr2line -f -e vmlinux-4.1.6-clouder1 ffffffff811824e5
get_freepointer
/home/projects/linux-stable/mm/slub.c:247
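
With many oopses, the lookups can be batched, since addr2line accepts several addresses in one invocation and, with -f, prints two lines per address (function name, then file:line). A minimal Python sketch under those assumptions follows; the helper names build_addr2line_cmd, parse_addr2line and resolve are hypothetical, and the vmlinux path is whatever image matches the running kernel.

```python
import subprocess


def build_addr2line_cmd(vmlinux, addresses):
    """Command line for one addr2line run covering all addresses."""
    return ["addr2line", "-f", "-e", vmlinux, *addresses]


def parse_addr2line(output, addresses):
    """Pair each address with its (function, file:line) lines from -f output."""
    lines = output.splitlines()
    if len(lines) < 2 * len(addresses):
        raise ValueError("expected two output lines per address")
    return {
        addr: (lines[2 * i], lines[2 * i + 1])
        for i, addr in enumerate(addresses)
    }


def resolve(vmlinux, addresses):
    """Run addr2line (must be installed) and return the parsed mapping."""
    result = subprocess.run(
        build_addr2line_cmd(vmlinux, addresses),
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_addr2line(result.stdout, addresses)
```

For example, resolve("vmlinux-4.1.6-clouder1", ["ffffffff81182a59", "ffffffff811824e5"]) would be expected to map both addresses to get_freepointer, matching the two manual runs above.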

Regards,
Nikolay

On 09/07/2015 11:41 AM, Nikolay Borisov wrote:
> Hello, 
> 
> On one of our servers I've observed a kernel panic
> happening with the following backtrace:
> 
> [654405.527070] BUG: unable to handle kernel paging request at 0000000000028001
> [654405.527076] IP: [<ffffffff81182a59>] kmem_cache_alloc_node+0x99/0x1e0
> [654405.527085] PGD 14bef58067 PUD 2ab358067 PMD 0 
> [654405.527089] Oops: 0000 [#11] SMP 
> [654405.527093] Modules linked in: xt_multiport tcp_diag inet_diag act_police cls_basic sch_ingress scsi_transport_iscsi ipt_REJECT nf_reject_ipv4 xt_pkttype xt_state veth openvswitch xt_owner xt_conntrack iptable_filter iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler mpt2sas scsi_transport_sas raid_class
> [654405.527145] CPU: 14 PID: 32267 Comm: httpd Tainted: G      D      L  4.1.6-clouder1 #1
> [654405.527147] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0  07/09/2013
> [654405.527149] task: ffff88139d3b1ec0 ti: ffff8808eda14000 task.ti: ffff8808eda14000
> [654405.527151] RIP: 0010:[<ffffffff81182a59>]  [<ffffffff81182a59>] kmem_cache_alloc_node+0x99/0x1e0
> [654405.527155] RSP: 0018:ffff88407fcc3a98  EFLAGS: 00210246
> [654405.527156] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8814ce9acf80
> [654405.527157] RDX: 00000000837ad864 RSI: 0000000000050200 RDI: 0000000000018ce0
> [654405.527158] RBP: ffff88407fcc3af8 R08: ffff88407fcd8ce0 R09: ffffffffa033d990
> [654405.527159] R10: ffff88058676fdd8 R11: 0000000000007b4a R12: ffff881fff807ac0
> [654405.527161] R13: 0000000000028001 R14: 0000000000000001 R15: ffff881fff807ac0
> [654405.527162] FS:  0000000000000000(0000) GS:ffff88407fcc0000(0063) knlGS:0000000055c832e0
> [654405.527164] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> [654405.527165] CR2: 0000000000028001 CR3: 0000001467b64000 CR4: 00000000000406e0
> [654405.527166] Stack:
> [654405.527167]  0000000000000000 0000000000000000 0000000000000000 ffff881ff2d05000
> [654405.527170]  ffff88407fcc3ae8 00050200812b5903 ffff88407fcc3ae8 00000000000001a2
> [654405.527172]  0000000000000001 ffff88058676fc60 ffff88058676fe80 0000000000001800
> [654405.527175] Call Trace:
> [654405.527177]  <IRQ> 
> [654405.527184]  [<ffffffffa033d990>] ovs_flow_stats_update+0x110/0x160 [openvswitch]
> [654405.527189]  [<ffffffffa033ae74>] ovs_dp_process_packet+0x64/0xf0 [openvswitch]
> [654405.527193]  [<ffffffffa0345c60>] ? netdev_port_receive+0x110/0x110 [openvswitch]
> [654405.527197]  [<ffffffffa0345c60>] ? netdev_port_receive+0x110/0x110 [openvswitch]
> [654405.527201]  [<ffffffffa0344815>] ovs_vport_receive+0x85/0xb0 [openvswitch]
> [654405.527207]  [<ffffffff812c7636>] ? blk_mq_free_hctx_request+0x36/0x40
> [654405.527209]  [<ffffffff812c7671>] ? blk_mq_free_request+0x31/0x40
> [654405.527214]  [<ffffffff8100c2f9>] ? read_tsc+0x9/0x10
> [654405.527220]  [<ffffffff810b9f04>] ? ktime_get+0x54/0xc0
> [654405.527225]  [<ffffffff813cf577>] ? put_device+0x17/0x20
> [654405.527227]  [<ffffffffa0048a50>] ? tcf_act_police+0x150/0x210 [act_police]
> [654405.527232]  [<ffffffff8150cdc1>] ? tcf_action_exec+0x51/0xa0
> [654405.527235]  [<ffffffffa0011445>] ? basic_classify+0x75/0xe0 [cls_basic]
> [654405.527237]  [<ffffffff815091d5>] ? tc_classify+0x55/0xc0
> [654405.527241]  [<ffffffffa0345bed>] netdev_port_receive+0x9d/0x110 [openvswitch]
> [654405.527245]  [<ffffffffa0345c94>] netdev_frame_hook+0x34/0x50 [openvswitch]
> [654405.527250]  [<ffffffff814e58e6>] __netif_receive_skb_core+0x206/0x880
> [654405.527252]  [<ffffffff814e5f87>] __netif_receive_skb+0x27/0x70
> [654405.527254]  [<ffffffff814e60c1>] process_backlog+0xf1/0x1b0
> [654405.527257]  [<ffffffff814e68d3>] napi_poll+0xd3/0x1c0
> [654405.527259]  [<ffffffff814e6a50>] net_rx_action+0x90/0x1c0
> [654405.527264]  [<ffffffff810595ab>] __do_softirq+0xfb/0x2a0
> [654405.527270]  [<ffffffff815b269c>] do_softirq_own_stack+0x1c/0x30
> [654405.527271]  <EOI> 
> [654405.527273]  [<ffffffff810590b5>] do_softirq+0x55/0x60
> [654405.527276]  [<ffffffff81059198>] __local_bh_enable_ip+0x88/0x90
> [654405.527279]  [<ffffffff8152b062>] ip_finish_output+0x282/0x490
> [654405.527281]  [<ffffffff8152b55b>] ip_output+0xab/0xc0
> [654405.527283]  [<ffffffff8152ade0>] ? ip_finish_output_gso+0x4e0/0x4e0
> [654405.527285]  [<ffffffff815296fb>] ip_local_out_sk+0x3b/0x50
> [654405.527287]  [<ffffffff81529e0e>] ip_queue_xmit+0x14e/0x3c0
> [654405.527291]  [<ffffffff815422d2>] tcp_transmit_skb+0x4c2/0x850
> [654405.527294]  [<ffffffff81544c1d>] tcp_write_xmit+0x19d/0x670
> [654405.527298]  [<ffffffff812f32d1>] ? copy_user_generic_string+0x31/0x40
> [654405.527300]  [<ffffffff81545cd2>] __tcp_push_pending_frames+0x32/0xd0
> [654405.527302]  [<ffffffff81532911>] tcp_push+0xf1/0x120
> [654405.527304]  [<ffffffff815361f3>] tcp_sendmsg+0x373/0xb60
> [654405.527307]  [<ffffffff811be0b3>] ? mntput+0x23/0x40
> [654405.527310]  [<ffffffff811a7c32>] ? path_put+0x22/0x30
> [654405.527315]  [<ffffffff81561272>] inet_sendmsg+0x42/0xb0
> [654405.527317]  [<ffffffff81182e4e>] ? kmem_cache_alloc+0xee/0x1c0
> [654405.527321]  [<ffffffff814c639d>] sock_sendmsg+0x4d/0x60
> [654405.527324]  [<ffffffff814c64a6>] sock_write_iter+0xb6/0x100
> [654405.527328]  [<ffffffff8119d9d0>] do_iter_readv_writev+0x60/0x90
> [654405.527330]  [<ffffffff814c63f0>] ? kernel_sendmsg+0x40/0x40
> [654405.527332]  [<ffffffff8119e354>] compat_do_readv_writev+0x174/0x1f0
> [654405.527337]  [<ffffffff810aa6d9>] ? rcu_eqs_exit+0x79/0xb0
> [654405.527339]  [<ffffffff810aa723>] ? rcu_user_exit+0x13/0x20
> [654405.527342]  [<ffffffff8119e591>] compat_SyS_writev+0xc1/0x110
> [654405.527346]  [<ffffffff811274a3>] ? context_tracking_user_enter+0x13/0x20
> [654405.527349]  [<ffffffff815b2fc5>] sysenter_dispatch+0x7/0x25
> [654405.527350] Code: 8b 00 48 c1 e8 38 41 39 c6 74 17 4c 89 c9 44 89 f2 8b 75 cc 4c 89 e7 e8 46 f6 ff ff 49 89 c5 eb 2b 90 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c 
> [654405.527378] RIP  [<ffffffff81182a59>] kmem_cache_alloc_node+0x99/0x1e0
> [654405.527381]  RSP <ffff88407fcc3a98>
> [654405.527383] CR2: 0000000000028001
> 
> Before this occurs there are also several more "unable to handle kernel paging request" errors, e.g.:
> 
> [654405.518482] BUG: unable to handle kernel paging request at 0000000000028001
> [654405.518488] IP: [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
> [654405.518496] PGD 364da24067 PUD 3733ae2067 PMD 0 
> [654405.518501] Oops: 0000 [#10] SMP 
> [654405.518504] Modules linked in: xt_multiport tcp_diag inet_diag act_police cls_basic sch_ingress scsi_transport_iscsi ipt_REJECT nf_reject_ipv4 xt_pkttype xt_state veth openvswitch xt_owner xt_conntrack iptable_filter iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler mpt2sas scsi_transport_sas raid_class
> [654405.518555] CPU: 14 PID: 15732 Comm: guardian Tainted: G      D      L  4.1.6-clouder1 #1
> [654405.518557] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0  07/09/2013
> [654405.518559] task: ffff88373303e680 ti: ffff88369b388000 task.ti: ffff88369b388000
> [654405.518560] RIP: 0010:[<ffffffff811824e5>]  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
> [654405.518564] RSP: 0018:ffff88369b38bb48  EFLAGS: 00010282
> [654405.518565] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
> [654405.518567] RDX: 00000000837ad864 RSI: 00000000000000d0 RDI: 0000000000018ce0
> [654405.518568] RBP: ffff88369b38bb88 R08: ffff88407fcd8ce0 R09: ffffffff811c272c
> [654405.518569] R10: ffff88369b38bb74 R11: ffff881f7c678db8 R12: ffff881fff807ac0
> [654405.518570] R13: 0000000000028001 R14: ffff881fff807ac0 R15: 00000000000000d0
> [654405.518572] FS:  00002b784bf66800(0000) GS:ffff88407fcc0000(0000) knlGS:0000000000000000
> [654405.518574] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [654405.518575] CR2: 0000000000028001 CR3: 000000364d574000 CR4: 00000000000406e0
> [654405.518576] Stack:
> [654405.518578]  000000013a481c58 0000000000000020 ffff883600010000 ffff88245528ca00
> [654405.518580]  ffffffff8120bc50 ffff881a3d3433c8 ffff88245528ca10 ffffffff81209ed0
> [654405.518583]  ffff88369b38bbc8 ffffffff811c272c ffff88245528ca10 0000000000000000
> [654405.518586] Call Trace:
> [654405.518593]  [<ffffffff8120bc50>] ? proc_pid_follow_link+0x80/0x80
> [654405.518596]  [<ffffffff81209ed0>] ? sched_autogroup_open+0x50/0x50
> [654405.518601]  [<ffffffff811c272c>] single_open+0x3c/0xb0
> [654405.518603]  [<ffffffff81209eeb>] proc_single_open+0x1b/0x20
> [654405.518606]  [<ffffffff8119b69a>] do_dentry_open+0x22a/0x350
> [654405.518608]  [<ffffffff8119b809>] vfs_open+0x49/0x50
> [654405.518612]  [<ffffffff811ae652>] do_last+0x412/0x890
> [654405.518615]  [<ffffffff81182e4e>] ? kmem_cache_alloc+0xee/0x1c0
> [654405.518620]  [<ffffffff8129d6b6>] ? security_file_alloc+0x16/0x20
> [654405.518623]  [<ffffffff811aeb62>] path_openat+0x92/0x470
> [654405.518626]  [<ffffffff811ac753>] ? user_path_at_empty+0x63/0xa0
> [654405.518628]  [<ffffffff811aef8a>] do_filp_open+0x4a/0xa0
> [654405.518633]  [<ffffffff812fb140>] ? find_next_zero_bit+0x10/0x20
> [654405.518637]  [<ffffffff811bb64c>] ? __alloc_fd+0xac/0x150
> [654405.518640]  [<ffffffff8119ce9a>] do_sys_open+0x11a/0x230
> [654405.518644]  [<ffffffff8101190e>] ? syscall_trace_enter_phase1+0x14e/0x160
> [654405.518650]  [<ffffffff811274a3>] ? context_tracking_user_enter+0x13/0x20
> [654405.518652]  [<ffffffff8119cfee>] SyS_open+0x1e/0x20
> [654405.518656]  [<ffffffff815b0bee>] system_call_fastpath+0x12/0x71
> [654405.518658] Code: 08 65 4c 03 05 5d 7c e8 7e 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 8c 00 00 00 48 85 c0 0f 84 83 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c 
> [654405.518686] RIP  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
> [654405.518689]  RSP <ffff88369b38bb48>
> [654405.518690] CR2: 0000000000028001
> 
> 
> [654405.511613] BUG: unable to handle kernel paging request at 0000000000028001
> [654405.511619] IP: [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
> [654405.511628] PGD 3f9a016067 PUD 3ee598c067 PMD 0 
> [654405.511632] Oops: 0000 [#9] SMP 
> [654405.511634] Modules linked in: xt_multiport tcp_diag inet_diag act_police cls_basic sch_ingress scsi_transport_iscsi ipt_REJECT nf_reject_ipv4 xt_pkttype xt_state veth openvswitch xt_owner xt_conntrack iptable_filter iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler mpt2sas scsi_transport_sas raid_class
> [654405.511684] CPU: 14 PID: 14914 Comm: templar.pl Tainted: G      D      L  4.1.6-clouder1 #1
> [654405.511687] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0  07/09/2013
> [654405.511689] task: ffff881f46d8bd80 ti: ffff883ee583c000 task.ti: ffff883ee583c000
> [654405.511690] RIP: 0010:[<ffffffff811824e5>]  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
> [654405.511694] RSP: 0018:ffff883ee583fe38  EFLAGS: 00010282
> [654405.511695] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff881f3e1f8540
> [654405.511697] RDX: 00000000837ad864 RSI: 00000000000080d0 RDI: 0000000000018ce0
> [654405.511698] RBP: ffff883ee583fe78 R08: ffff88407fcd8ce0 R09: ffffffff8129028f
> [654405.511699] R10: 0000000000000008 R11: 0000000000000246 R12: ffff881fff807ac0
> [654405.511701] R13: 0000000000028001 R14: ffff881fff807ac0 R15: 00000000000080d0
> [654405.511703] FS:  00002b06256163a0(0000) GS:ffff88407fcc0000(0000) knlGS:0000000000000000
> [654405.511704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [654405.511706] CR2: 0000000000028001 CR3: 0000003f520c4000 CR4: 00000000000406e0
> [654405.511707] Stack:
> [654405.511708]  0000000100000404 0000000000000020 ffff883ee583fe78 0000000000000000
> [654405.511711]  0000000000001000 0000000000000001 0000000000018003 0000000000000001
> [654405.511715]  ffff883ee583ff28 ffffffff8129028f 0000000000000001 00000000000007d0
> [654405.511717] Call Trace:
> [654405.511726]  [<ffffffff8129028f>] do_shmat+0x22f/0x4a0
> [654405.511729]  [<ffffffff8129051c>] SyS_shmat+0x1c/0x30
> [654405.511734]  [<ffffffff815b0bee>] system_call_fastpath+0x12/0x71
> [654405.511736] Code: 08 65 4c 03 05 5d 7c e8 7e 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 8c 00 00 00 48 85 c0 0f 84 83 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c 
> [654405.511763] RIP  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
> [654405.511765]  RSP <ffff883ee583fe38>
> [654405.511766] CR2: 0000000000028001
> 
> [654405.502947] BUG: unable to handle kernel paging request at 0000000000028001
> [654405.502952] IP: [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
> [654405.502961] PGD 1c7d1ba067 PUD 1d7c06d067 PMD 0 
> [654405.502965] Oops: 0000 [#8] SMP 
> [654405.502968] Modules linked in: xt_multiport tcp_diag inet_diag act_police cls_basic sch_ingress scsi_transport_iscsi ipt_REJECT nf_reject_ipv4 xt_pkttype xt_state veth openvswitch xt_owner xt_conntrack iptable_filter iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c dm_mirror dm_region_hash dm_log iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 lpc_ich mfd_core igb i2c_algo_bit i2c_core ioatdma dca ipmi_devintf ipmi_si ipmi_msghandler mpt2sas scsi_transport_sas raid_class
> [654405.503021] CPU: 14 PID: 1342 Comm: gather_daemon.p Tainted: G      D      L  4.1.6-clouder1 #1
> [654405.503024] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0  07/09/2013
> [654405.503026] task: ffff883dc1e170c0 ti: ffff881df4f80000 task.ti: ffff881df4f80000
> [654405.503027] RIP: 0010:[<ffffffff811824e5>]  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
> [654405.503031] RSP: 0018:ffff881df4f83a98  EFLAGS: 00010282
> [654405.503033] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000001884e6d
> [654405.503034] RDX: 00000000837ad864 RSI: 00000000000000d0 RDI: 0000000000018ce0
> [654405.503035] RBP: ffff881df4f83ad8 R08: ffff88407fcd8ce0 R09: ffffffff811c272c
> [654405.503037] R10: 0000000000000008 R11: 0000000000000001 R12: ffff881fff807ac0
> [654405.503038] R13: 0000000000028001 R14: ffff881fff807ac0 R15: 00000000000000d0
> [654405.503040] FS:  0000000000000000(0000) GS:ffff88407fcc0000(0063) knlGS:00000000558d2c00
> [654405.503041] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
> [654405.503043] CR2: 0000000000028001 CR3: 0000001daa3cd000 CR4: 00000000000406e0
> [654405.503044] Stack:
> [654405.503046]  ffff883a856c0402 0000000000000020 ffff881df4f83af8 ffff8825209b0b00
> [654405.503049]  ffffffff81212960 0000000000000000 ffffffff81212960 0000000000000000
> [654405.503051]  ffff881df4f83b18 ffffffff811c272c ffffffff81212960 0000000000000000
> [654405.503054] Call Trace:
> [654405.503063]  [<ffffffff81212960>] ? get_iowait_time+0x70/0x70
> [654405.503066]  [<ffffffff81212960>] ? get_iowait_time+0x70/0x70
> [654405.503070]  [<ffffffff811c272c>] single_open+0x3c/0xb0
> [654405.503073]  [<ffffffff81212960>] ? get_iowait_time+0x70/0x70
> [654405.503075]  [<ffffffff81212960>] ? get_iowait_time+0x70/0x70
> [654405.503077]  [<ffffffff811c27f0>] single_open_size+0x50/0x90
> [654405.503080]  [<ffffffff811c1d20>] ? seq_release_private+0x60/0x60
> [654405.503082]  [<ffffffff8121286a>] stat_open+0x4a/0x60
> [654405.503085]  [<ffffffff81209574>] proc_reg_open+0x84/0x120
> [654405.503088]  [<ffffffff812094f0>] ? proc_entry_rundown+0xa0/0xa0
> [654405.503091]  [<ffffffff8119b69a>] do_dentry_open+0x22a/0x350
> [654405.503093]  [<ffffffff8119b809>] vfs_open+0x49/0x50
> [654405.503097]  [<ffffffff811ae652>] do_last+0x412/0x890
> [654405.503102]  [<ffffffff8100c299>] ? sched_clock+0x9/0x10
> [654405.503107]  [<ffffffff81084a7b>] ? sched_clock_cpu+0xab/0xc0
> [654405.503110]  [<ffffffff81182e4e>] ? kmem_cache_alloc+0xee/0x1c0
> [654405.503115]  [<ffffffff8129d6b6>] ? security_file_alloc+0x16/0x20
> [654405.503118]  [<ffffffff811aeb62>] path_openat+0x92/0x470
> [654405.503122]  [<ffffffff8108ff1f>] ? put_prev_task_fair+0x2f/0x50
> [654405.503126]  [<ffffffff810b2931>] ? lock_hrtimer_base+0x31/0x60
> [654405.503128]  [<ffffffff811aef8a>] do_filp_open+0x4a/0xa0
> [654405.503132]  [<ffffffff812fb140>] ? find_next_zero_bit+0x10/0x20
> [654405.503136]  [<ffffffff811bb64c>] ? __alloc_fd+0xac/0x150
> [654405.503140]  [<ffffffff8119ce9a>] do_sys_open+0x11a/0x230
> [654405.503145]  [<ffffffff810b9b2e>] ? getnstimeofday64+0xe/0x30
> [654405.503150]  [<ffffffff811274a3>] ? context_tracking_user_enter+0x13/0x20
> [654405.503154]  [<ffffffff811ee4cb>] compat_SyS_open+0x1b/0x20
> [654405.503160]  [<ffffffff815b2fc5>] sysenter_dispatch+0x7/0x25
> [654405.503162] Code: 08 65 4c 03 05 5d 7c e8 7e 4d 8b 28 49 8b 40 10 4d 85 ed 0f 84 8c 00 00 00 48 85 c0 0f 84 83 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c 
> [654405.503191] RIP  [<ffffffff811824e5>] kmem_cache_alloc_trace+0x75/0x1d0
> [654405.503194]  RSP <ffff881df4f83a98>
> [654405.503195] CR2: 0000000000028001
> 
> 
> I have more like these, but I believe those are enough. The
> following patterns stand out in those failures:
> 
> 1. All these failures happen when allocating a 32-byte struct,
> which leads me to believe that the corruption has happened in the
> kmalloc-32 slab cache.
> 
> 2. Another thing which stands out is the faulting address:
> the value 0000000000028001 predominates. For the oops where
> the panic occurred, here is what the decoded code shows:
> 
> Code: 8b 00 48 c1 e8 38 41 39 c6 74 17 4c 89 c9 44 89 f2 8b 75 cc 4c 89 e7 e8 46 f6 ff ff 49 89 c5 eb 2b 90 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 3c
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	49 8b 5c 05 00       	mov    0x0(%r13,%rax,1),%rbx
>    5:	48 8d 4a 01          	lea    0x1(%rdx),%rcx
>    9:	4c 89 e8             	mov    %r13,%rax
>    c:	65 48 0f c7 0f       	cmpxchg16b %gs:(%rdi)
>   11:	0f 94 c0             	sete   %al
>   14:	3c                   	.byte 0x3c
> 
> r13 takes part in the calculation of the address from which rbx is loaded,
> r13 =  0000000000028001
> 
> Any ideas how to debug this? The first thing that comes to mind, is
> to boot the machine with slab merging disabled, in the hopes
> that this would reduce the scope of the memory corruption and 
> the next time this occurs it would be easier to identify the culprit.
> 
> Here are the config options for the allocator in use: 
> 
> grep -i slub kernel-conf-4.1
> # CONFIG_SLUB_DEBUG is not set
> CONFIG_SLUB=y
> CONFIG_SLUB_CPU_PARTIAL=y
> # CONFIG_SLUB_STATS is not set
> 
> If more information is needed I'm happy to provide it. 
> 
> Any help will be much appreciated.
> 
> Regards, 
> Nikolay
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
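
For reference, the faulting load in the decoded code quoted above computes its address as base + index * scale + displacement. The sketch below redoes that arithmetic in shell. The r13 value is taken from the report; the rax value of 0 is an assumption, inferred only from CR2 being equal to r13, and the helper name effective_address is made up for illustration.

```shell
# Effective address of `mov 0x0(%r13,%rax,1),%rbx`:
# base + index * scale + displacement.
effective_address() {
    # $1 = base (r13), $2 = index (rax), $3 = scale, $4 = displacement
    printf '0x%016x\n' $(( $1 + $2 * $3 + $4 ))
}

# r13 comes from the report; rax = 0 is an assumption (CR2 == r13).
effective_address 0x28001 0 1 0
```

Under that assumption the load targets 0x28001 itself, an odd and very low address rather than an 8-byte-aligned kernel pointer, which would be consistent with a corrupted freelist pointer but does not say what overwrote it.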

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-07 11:14 ` Nikolay Borisov
@ 2015-09-08 13:58   ` Christoph Lameter
  2015-09-08 14:06     ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2015-09-08 13:58 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations

On Mon, 7 Sep 2015, Nikolay Borisov wrote:

> Did a bit more investigation and it turns out the
> corruption is happening in slab_alloc_node, in the
> 'else' branch when get_freepointer is being called:

Please reboot the system and specify

	slub_debug

on the kernel command line. This will enable additional diagnostics which
will allow tracking down the issue to the subsystem causing it.


Or rebuild with

CONFIG_SLUB_DEBUG_ON
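
As a rough illustration of the boot parameter mentioned above: Documentation/vm/slub.txt describes the form slub_debug=<options>,<slab name>, where F selects sanity checks and Z red zoning, and an empty option list means full debugging. The helper below only assembles such strings; its name slub_debug_param is hypothetical, and the parameter takes effect only on a kernel built with CONFIG_SLUB_DEBUG=y.

```shell
# Build a slub_debug boot parameter from option letters and an
# optional cache name (hypothetical helper, for illustration only).
slub_debug_param() {
    # $1 = option letters (empty means full debugging), $2 = cache name
    if [ -n "${2:-}" ]; then
        printf 'slub_debug=%s,%s\n' "${1:-}" "$2"
    elif [ -n "${1:-}" ]; then
        printf 'slub_debug=%s\n' "$1"
    else
        printf 'slub_debug\n'
    fi
}

slub_debug_param FZ kmalloc-32   # sanity checks + red zoning, one cache
slub_debug_param "" kmalloc-32   # full debugging, one cache
slub_debug_param FZ              # sanity checks + red zoning, all caches
```

The resulting string would be appended to the kernel command line in the boot loader configuration.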


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-08 13:58   ` Christoph Lameter
@ 2015-09-08 14:06     ` Nikolay Borisov
  2015-09-08 14:27       ` Christoph Lameter
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2015-09-08 14:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations

On 09/08/2015 04:58 PM, Christoph Lameter wrote:
> On Mon, 7 Sep 2015, Nikolay Borisov wrote:
> 
>> Did a bit more investigation and it turns out the
>> corruption is happening in slab_alloc_node, in the
>> 'else' branch when get_freepointer is being called:
> 
> Please reboot the system and specify
> 
> 	slub_debug
> 

Unfortunately I haven't found a way to reproduce it so the only option
would be to do this on a live server. However, the performance impact I
believe is going to be very prohibitive :(.  Alternatively what I could
do is probably leave merging on but enable debugging only for the
kmalloc-32 slab cache. Do you think this would provide enough
information to help track the corruption when it happens, without
impacting performance?

> on the kernel command line. This will enable additional diagnostics which
> will allow tracking down the issue to the subsystem causing it.
> 
> 
> Or rebuild with
> 
> CONFIG_SLUB_DEBUG_ON
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-08 14:06     ` Nikolay Borisov
@ 2015-09-08 14:27       ` Christoph Lameter
  2015-09-08 14:41         ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2015-09-08 14:27 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations

On Tue, 8 Sep 2015, Nikolay Borisov wrote:

> Unfortunately I haven't found a way to reproduce it so the only option
> would be to do this on a live server. However, the performance impact I
> believe is going to be very prohibitive :(.  Alternatively what I could
> do is probably leave merging on but enable debugging only for the
> kmalloc-32 slab cache. Do you think this would provide enough
> information to help track the corruption when it happens, without
> impacting performance?

You have read https://www.kernel.org/doc/Documentation/vm/slub.txt?

The problem now is that merging is on so it could be that the corruption
happens in one of the aliased caches. So maybe only kmalloc-32 won't do
much good.

Run

	slabinfo -a

(slabinfo.c is a tool in the kernel tree.)

to see the list of aliases for kmalloc-32.

You can also use slabinfo to enable some debugging at runtime. Just
enabling sanity checks may catch something that allows us to track this to
the subsystem.
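
If slabinfo is not at hand, the alias list can probably also be read straight from sysfs, since merged caches show up under /sys/kernel/slab as symlinks to one shared directory. The sketch below assumes that layout; the function name slab_aliases and the optional directory argument are made up for illustration.

```shell
# Print every cache name that resolves to the same sysfs directory as
# the given cache, i.e. its merge aliases (hypothetical helper).
slab_aliases() {
    # $1 = cache name, $2 = slab sysfs directory (default /sys/kernel/slab)
    dir=${2:-/sys/kernel/slab}
    target=$(readlink -f "$dir/$1") || return 1
    for entry in "$dir"/*; do
        name=${entry##*/}
        [ "$name" = "$1" ] && continue      # skip the cache itself
        [ -L "$entry" ] || continue         # aliases are symlinks
        [ "$(readlink -f "$entry")" = "$target" ] && printf '%s\n' "$name"
    done
    return 0
}

# Example on a live system:
#   slab_aliases kmalloc-32
```

The directory argument exists only so the function can be tried against a scratch directory instead of the real sysfs tree.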

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-08 14:27       ` Christoph Lameter
@ 2015-09-08 14:41         ` Nikolay Borisov
  2015-09-08 15:15           ` Christoph Lameter
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2015-09-08 14:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations



On 09/08/2015 05:27 PM, Christoph Lameter wrote:
> On Tue, 8 Sep 2015, Nikolay Borisov wrote:
> 
>> Unfortunately I haven't found a way to reproduce it so the only option
>> would be to do this on a live server. However, the performance impact I
>> believe is going to be very prohibitive :(.  Alternatively what I could
>> do is probably leave merging on but enable debugging only for the
>> kmalloc-32 slab cache. Do you think this would provide enough
>> information to help track the corruption when it happens, without
>> impacting performance?
> 
> You have read https://www.kernel.org/doc/Documentation/vm/slub.txt?

I've read that, and I'm also following the merge/nomerge thread on the DM
mailing list. I guess my understanding is wrong in that if multiple slab
caches are merged, then it's enough to just instrument the cache into
which they are being merged in order to have them all instrumented? I
guess that's not the case, so even though slab caches might be merged
they are still somehow considered different entities in the kernel?

> 
> 
> The problem now is that merging is on so it could be that the corruption
> happens in one of the aliased caches. So maybe only kmalloc-32 won't do
> much good.
> 
> Run
> 
> 	slabinfo -a
> 
> (slabinfo.c is a tool in the kernel tree.)
> 
> to see the list of aliases for kmalloc-32.
> 
> You can also use slabinfo to enable some debugging at runtime. Just
> enabling sanity checks may catch something that allows us to track this to
> the subsystem.

I will experiment with slabinfo.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-08 14:41         ` Nikolay Borisov
@ 2015-09-08 15:15           ` Christoph Lameter
  2015-09-09 11:40             ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2015-09-08 15:15 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations

On Tue, 8 Sep 2015, Nikolay Borisov wrote:

> > You have read https://www.kernel.org/doc/Documentation/vm/slub.txt?
>
> I've read that, and I'm also following the merge/nomerge thread on the DM
> mailing list. I guess my understanding is wrong in that if multiple slab
> caches are merged, then it's enough to just instrument the cache into
> which they are being merged in order to have them all instrumented? I
> guess that's not the case, so even though slab caches might be merged
> they are still somehow considered different entities in the kernel?

Enabling debugging on bootup disables merging.

If you switch debug options later then it's possible to affect all aliased
caches. But then these option changes are only allowed to be used in such
a way as to not affect the object layout. You can switch on sanity checks
and double free checks I believe but nothing else.
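
A minimal sketch of such a runtime switch, assuming the per-cache attribute files live under /sys/kernel/slab/<cache>/ (for example sanity_checks) and accept 0 or 1. The helper name set_slab_attr and its optional directory argument are hypothetical.

```shell
# Write a value to one attribute of a SLUB cache and read it back
# (hypothetical helper; needs root on a real system).
set_slab_attr() {
    # $1 = cache, $2 = attribute, $3 = value, $4 = slab sysfs directory
    file=${4:-/sys/kernel/slab}/$1/$2
    printf '%s\n' "$3" > "$file" || return 1
    cat "$file"
}

# Example on a live system:
#   set_slab_attr kmalloc-32 sanity_checks 1
```

Reading the attribute back after the write is a cheap way to confirm that the kernel accepted the change.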

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-08 15:15           ` Christoph Lameter
@ 2015-09-09 11:40             ` Nikolay Borisov
  2015-09-09 14:01               ` Christoph Lameter
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2015-09-09 11:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations



On 09/08/2015 06:15 PM, Christoph Lameter wrote:
> On Tue, 8 Sep 2015, Nikolay Borisov wrote:
> 
>>> You have read https://www.kernel.org/doc/Documentation/vm/slub.txt?
>>
>> I've read that, and I'm also following the merge/nomerge thread on the DM
>> mailing list. I guess my understanding is wrong in that if multiple slab
>> caches are merged, then it's enough to just instrument the cache into
>> which they are being merged in order to have them all instrumented? I
>> guess that's not the case, so even though slab caches might be merged
>> they are still somehow considered different entities in the kernel?
> 
> Enabling debugging on bootup disables merging.
> 
> If you switch debug options later then it's possible to affect all aliased
> caches. But then these option changes are only allowed to be used in such
> a way as to not affect the object layout. You can switch on sanity checks
> and double free checks I believe but nothing else.

I tried the following:

[root@kernighan vm]# ./slabinfo -da kmalloc-32
Cannot write to dma-kmalloc-32/sanity
[root@kernighan vm]# ./slabinfo -dF kmalloc-32
Cannot write to dma-kmalloc-32/sanity
[root@kernighan vm]# ./slabinfo -dz kmalloc-32
kmalloc-32 not empty cannot enable redzoning
[root@kernighan vm]# ./slabinfo -dp kmalloc-32
kmalloc-32 not empty cannot enable poisoning
[root@kernighan vm]# ./slabinfo -du kmalloc-32
kmalloc-32 not empty cannot enable tracking
[root@kernighan vm]# ./slabinfo -dt ^kmalloc-32$
kmalloc-32 can only enable trace for one slab at a time

I did, however, have success with enabling tracing but couldn't see where
the output is produced - tried dmesg and the ftrace buffer but nothing
turned up.

But it seems it is not possible to enable any debugging whatsoever, so I
will resort to doing it at boot time. In this case can you advise which
options might not result in very high performance degradation - I'm
thinking of sanity checking and maybe redzoning?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-09 11:40             ` Nikolay Borisov
@ 2015-09-09 14:01               ` Christoph Lameter
  2015-09-09 16:26                 ` Vlastimil Babka
  2015-09-10  6:12                 ` Nikolay Borisov
  0 siblings, 2 replies; 17+ messages in thread
From: Christoph Lameter @ 2015-09-09 14:01 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations

On Wed, 9 Sep 2015, Nikolay Borisov wrote:

> [root@kernighan vm]# ./slabinfo -da kmalloc-32
> Cannot write to dma-kmalloc-32/sanity
> [root@kernighan vm]# ./slabinfo -dF kmalloc-32
> Cannot write to dma-kmalloc-32/sanity
> [root@kernighan vm]# ./slabinfo -dz kmalloc-32
> kmalloc-32 not empty cannot enable redzoning
> [root@kernighan vm]# ./slabinfo -dp kmalloc-32
> kmalloc-32 not empty cannot enable poisoning
> [root@kernighan vm]# ./slabinfo -du kmalloc-32
> kmalloc-32 not empty cannot enable tracking
> [root@kernighan vm]# ./slabinfo -dt ^kmalloc-32$
> kmalloc-32 can only enable trace for one slab at a time


Hmmmm.. What's the problem here?

christoph@gentwo:/sys/kernel/slab/kmalloc-32$ ls -l
total 0
-r-------- 1 root root 4096 Sep  9 08:57 aliases
-r-------- 1 root root 4096 Sep  9 08:57 align
-r-------- 1 root root 4096 Sep  9 08:57 alloc_calls
-r-------- 1 root root 4096 Sep  9 08:57 cache_dma
-rw------- 1 root root 4096 Sep  9 08:57 cpu_partial
-r-------- 1 root root 4096 Sep  9 08:57 cpu_slabs
-r-------- 1 root root 4096 Sep  9 08:57 ctor
-r-------- 1 root root 4096 Sep  9 08:57 destroy_by_rcu
-r-------- 1 root root 4096 Sep  9 08:57 free_calls
-r-------- 1 root root 4096 Sep  9 08:57 hwcache_align
-rw------- 1 root root 4096 Sep  9 08:57 min_partial
-r-------- 1 root root 4096 Sep  9 08:57 objects
-r-------- 1 root root 4096 Sep  9 08:57 object_size
-r-------- 1 root root 4096 Sep  9 08:57 objects_partial
-r-------- 1 root root 4096 Sep  9 08:57 objs_per_slab
-rw------- 1 root root 4096 Sep  9 08:57 order
-r-------- 1 root root 4096 Sep  9 08:57 partial
-rw------- 1 root root 4096 Sep  9 08:57 poison
-rw------- 1 root root 4096 Sep  9 08:57 reclaim_account
-rw------- 1 root root 4096 Sep  9 08:57 red_zone
-rw------- 1 root root 4096 Sep  9 08:57 remote_node_defrag_ratio
-r-------- 1 root root 4096 Sep  9 08:57 reserved
-rw------- 1 root root 4096 Sep  9 08:57 sanity_checks
-rw------- 1 root root 4096 Sep  9 08:57 shrink
-r-------- 1 root root 4096 Sep  9 08:57 slabs
-r-------- 1 root root 4096 Sep  9 08:57 slabs_cpu_partial
-r-------- 1 root root 4096 Sep  9 08:57 slab_size
-rw------- 1 root root 4096 Sep  9 08:57 store_user
-r-------- 1 root root 4096 Sep  9 08:57 total_objects
-rw------- 1 root root 4096 Sep  9 08:57 trace
-rw------- 1 root root 4096 Sep  9 08:57 validate

Try

	echo 1 >sanity_checks


>
> I did, however, have success with enabling tracing but couldn't see where
> the output is produced - tried dmesg and the ftrace buffer but nothing
> turned up.

dmesg is the output channel for tracing.

What does:

	echo 1 >trace

do? Could crash the system due to overload of messages.

> But it seems it is not possible to enable any debugging whatsoever, so I
> will resort to doing it at boot time. In this case can you advise which
> options might not result in very high performance degradation - I'm
> thinking of sanity checking and maybe redzoning?

Sanity checking is ok. But I would think you should be fine with enabling
full debugging on the particular caches of interest.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-09 14:01               ` Christoph Lameter
@ 2015-09-09 16:26                 ` Vlastimil Babka
  2015-09-09 17:58                   ` Christoph Lameter
  2015-09-10  6:12                 ` Nikolay Borisov
  1 sibling, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2015-09-09 16:26 UTC (permalink / raw)
  To: Christoph Lameter, Nikolay Borisov
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations

On 09/09/2015 04:01 PM, Christoph Lameter wrote:
> On Wed, 9 Sep 2015, Nikolay Borisov wrote:
>
> What does:
>
> 	echo 1 >trace
>
> do? Could crash the system due to overload of messages.

Yes I've seen that happen. Did you consider hooking it to trace_printk() 
instead of printk()?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-09 16:26                 ` Vlastimil Babka
@ 2015-09-09 17:58                   ` Christoph Lameter
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2015-09-09 17:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Nikolay Borisov, Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations

On Wed, 9 Sep 2015, Vlastimil Babka wrote:

> On 09/09/2015 04:01 PM, Christoph Lameter wrote:
> > On Wed, 9 Sep 2015, Nikolay Borisov wrote:
> >
> > What does:
> >
> > 	echo 1 >trace
> >
> > do? Could crash the system due to overload of messages.
>
> Yes I've seen that happen. Did you consider hooking it to trace_printk()
> instead of printk()?

The code was there earlier than trace_printk as far as I can tell and the
formatting functions are based on printk.

Hmmm.. the pr_info in the trace() function could use trace_printk() but
that leaves the dump_stack().


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-09 14:01               ` Christoph Lameter
  2015-09-09 16:26                 ` Vlastimil Babka
@ 2015-09-10  6:12                 ` Nikolay Borisov
  2015-09-10 16:30                   ` Christoph Lameter
  1 sibling, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2015-09-10  6:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations



On 09/09/2015 05:01 PM, Christoph Lameter wrote:
> On Wed, 9 Sep 2015, Nikolay Borisov wrote:
> 
>> [root@kernighan vm]# ./slabinfo -da kmalloc-32
>> Cannot write to dma-kmalloc-32/sanity
>> [root@kernighan vm]# ./slabinfo -dF kmalloc-32
>> Cannot write to dma-kmalloc-32/sanity
>> [root@kernighan vm]# ./slabinfo -dz kmalloc-32
>> kmalloc-32 not empty cannot enable redzoning
>> [root@kernighan vm]# ./slabinfo -dp kmalloc-32
>> kmalloc-32 not empty cannot enable poisoning
>> [root@kernighan vm]# ./slabinfo -du kmalloc-32
>> kmalloc-32 not empty cannot enable tracking
>> [root@kernighan vm]# ./slabinfo -dt ^kmalloc-32$
>> kmalloc-32 can only enable trace for one slab at a time
> 
> 
> Hmmmm.. What's the problem here?
> 
> christoph@gentwo:/sys/kernel/slab/kmalloc-32$ ls -l
> total 0
> -r-------- 1 root root 4096 Sep  9 08:57 aliases
> -r-------- 1 root root 4096 Sep  9 08:57 align
> -r-------- 1 root root 4096 Sep  9 08:57 alloc_calls
> -r-------- 1 root root 4096 Sep  9 08:57 cache_dma
> -rw------- 1 root root 4096 Sep  9 08:57 cpu_partial
> -r-------- 1 root root 4096 Sep  9 08:57 cpu_slabs
> -r-------- 1 root root 4096 Sep  9 08:57 ctor
> -r-------- 1 root root 4096 Sep  9 08:57 destroy_by_rcu
> -r-------- 1 root root 4096 Sep  9 08:57 free_calls
> -r-------- 1 root root 4096 Sep  9 08:57 hwcache_align
> -rw------- 1 root root 4096 Sep  9 08:57 min_partial
> -r-------- 1 root root 4096 Sep  9 08:57 objects
> -r-------- 1 root root 4096 Sep  9 08:57 object_size
> -r-------- 1 root root 4096 Sep  9 08:57 objects_partial
> -r-------- 1 root root 4096 Sep  9 08:57 objs_per_slab
> -rw------- 1 root root 4096 Sep  9 08:57 order
> -r-------- 1 root root 4096 Sep  9 08:57 partial
> -rw------- 1 root root 4096 Sep  9 08:57 poison
> -rw------- 1 root root 4096 Sep  9 08:57 reclaim_account
> -rw------- 1 root root 4096 Sep  9 08:57 red_zone
> -rw------- 1 root root 4096 Sep  9 08:57 remote_node_defrag_ratio
> -r-------- 1 root root 4096 Sep  9 08:57 reserved
> -rw------- 1 root root 4096 Sep  9 08:57 sanity_checks
> -rw------- 1 root root 4096 Sep  9 08:57 shrink
> -r-------- 1 root root 4096 Sep  9 08:57 slabs
> -r-------- 1 root root 4096 Sep  9 08:57 slabs_cpu_partial
> -r-------- 1 root root 4096 Sep  9 08:57 slab_size
> -rw------- 1 root root 4096 Sep  9 08:57 store_user
> -r-------- 1 root root 4096 Sep  9 08:57 total_objects
> -rw------- 1 root root 4096 Sep  9 08:57 trace
> -rw------- 1 root root 4096 Sep  9 08:57 validate
> 
> Try
> 
> 	echo 1 >sanity_checks

[root@kernighan linux-stable]# cd /sys/kernel/slab/kmalloc-32/
[root@kernighan kmalloc-32]# echo 1 > sanity_checks
[root@kernighan kmalloc-32]# cat sanity_checks
1

So this works as expected when set by echo. Just for testing I then
tried the following:
[root@kernighan kmalloc-32]# slabinfo -d- kmalloc-32
kmalloc-32 not empty cannot disable sanity checks

[root@kernighan kmalloc-32]# echo 0 > sanity_checks
[root@kernighan kmalloc-32]# slabinfo -d- kmalloc-32

So it turns out slabinfo fails where the raw sysfs interface succeeds, strange?



> 
> 
>>
>> I did, however, have success with enabling tracing but couldn't see where
>> the output is produced - tried dmesg and the ftrace buffer but nothing
>> turned up.
> 
> dmesg is the output channel for tracing.
> 
> What does:
> 
> 	echo 1 >trace
> 
> do? Could crash the system due to overload of messages.

Didn't have that much luck with this one:
[root@kernighan kmalloc-32]# dmesg -c > /dev/null
[root@kernighan kmalloc-32]# echo 1 > trace
-bash: echo: write error: Invalid argument

> 
>> But it seems it is not possible to enable any debugging whatsoever, so I
>> will resort to doing it at boot time. In this case can you advise which
>> options might not result in very high performance degradation - I'm
>> thinking of sanity checking and maybe redzoning?
> 
> Sanity checking is ok. But I would think you should be fine with enabling
> full debugging on the particular caches of interest.

I was just thinking that if enabling debug options disables merging this
means it won't be sufficient to enable debugging on kmalloc-32 but
rather before enabling debugging I do need to check which caches were
aliased and enable debugging on those as well, correct?


Regards,
Nikolay

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Kernel 4.1.6 Panic due to slab corruption
  2015-09-10  6:12                 ` Nikolay Borisov
@ 2015-09-10 16:30                   ` Christoph Lameter
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2015-09-10 16:30 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Linux-Kernel@Vger. Kernel. Org, Marian Marinov,
	SiteGround Operations

On Thu, 10 Sep 2015, Nikolay Borisov wrote:

> > 	echo 1 >sanity_checks
>
> [root@kernighan linux-stable]# cd /sys/kernel/slab/kmalloc-32/
> [root@kernighan kmalloc-32]# echo 1 > sanity_checks
> [root@kernighan kmalloc-32]# cat sanity_checks
> 1
>
> So this works as expected when set by echo. Just for testing I then
> tried the following:
> [root@kernighan kmalloc-32]# slabinfo -d- kmalloc-32
> kmalloc-32 not empty cannot disable sanity checks
>
> [root@kernighan kmalloc-32]# echo 0 > sanity_checks
> [root@kernighan kmalloc-32]# slabinfo -d- kmalloc-32
>
> So it turns out slabinfo fails where the raw sysfs interface succeeds, strange?

Weird. slabinfo needs fixing.

> > do? Could crash the system due to overload of messages.
>
> Didn't have that much luck with this one:
> [root@kernighan kmalloc-32]# dmesg -c > /dev/null
> [root@kernighan kmalloc-32]# echo 1 > trace
> -bash: echo: write error: Invalid argument


Huh? There is no check that I am aware of in the slab code that would
return -EINVAL.

> > Sanity checking is ok. But I would think you should be fine with enabling
> > full debugging on the particular caches of interest.
>
> I was just thinking that if enabling debug options disables merging this
> means it won't be sufficient to enable debugging on kmalloc-32 but
> rather before enabling debugging I do need to check which caches were
> aliased and enable debugging on those as well, correct?

Correct.


^ permalink raw reply	[flat|nested] 17+ messages in thread

Thread overview: 17+ messages
2015-09-07  8:41 Kernel 4.1.6 Panic due to slab corruption Nikolay Borisov
2015-09-07 10:37 ` Holger Hoffstätte
2015-09-07 11:30   ` Nikolay Borisov
2015-09-07 11:49     ` Holger Hoffstätte
2015-09-07 11:58       ` Holger Hoffstätte
2015-09-07 11:14 ` Nikolay Borisov
2015-09-08 13:58   ` Christoph Lameter
2015-09-08 14:06     ` Nikolay Borisov
2015-09-08 14:27       ` Christoph Lameter
2015-09-08 14:41         ` Nikolay Borisov
2015-09-08 15:15           ` Christoph Lameter
2015-09-09 11:40             ` Nikolay Borisov
2015-09-09 14:01               ` Christoph Lameter
2015-09-09 16:26                 ` Vlastimil Babka
2015-09-09 17:58                   ` Christoph Lameter
2015-09-10  6:12                 ` Nikolay Borisov
2015-09-10 16:30                   ` Christoph Lameter
