From: Joe Lawrence
To: Thomas Gleixner
CC: Borislav Petkov, LKML, Ingo Molnar, Peter Anvin, Jiang Liu, Jeremiah Mahler, Guenter Roeck
Subject: Re: [patch 00/14] x86/irq: Plug various vector cleanup races
Date: Tue, 19 Jan 2016 22:57:35 -0500
Message-ID: <569F05AF.5070006@stratus.com>
In-Reply-To: <569CFE21.9010104@stratus.com>
References: <20151231155849.772553760@linutronix.de> <568A9157.9070402@stratus.com> <20160114103326.GG8496@pd.tnic> <569AB81D.9090904@stratus.com> <569CFE21.9010104@stratus.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 01/18/2016 10:00 AM, Joe Lawrence wrote:
[... snip ...]
> Hi Thomas,
>
> When logging in this morning and looking at the box running the 14
> patches + additional patch, I see it hit a hung task timeout in xhci USB
> code about 39 hours in. Stack trace below (looks to be waiting on a
> completion that never comes).
>
> I didn't see this when running only the *initial* 14 patches. Of
> course, before these irq cleanup fixes my tests never ran this long :)
> So it may or may not be related to the patchset, I'm still poking around
> the generated vmcore. Let me know if there is anything you might be
> interested in looking at from the wreckage.
>
> -- Joe
>
>
>
> INFO: task kworker/0:1:1506 blocked for more than 120 seconds.
> Tainted: P OE 4.3.0sra12+ #50
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/0:1 D 0000000000000000 0 1506 2 0x00000080
> Workqueue: usb_hub_wq hub_event
> ffff8801e46dba58 0000000000000046 ffff8810375dac00 ffff881038430000
> ffff8801e46dc000 ffff88025ac20440 ffff88025ac20438 ffff881038430000
> 0000000000000000 ffff8801e46dba70 ffffffff81659893 7fffffffffffffff
> Call Trace:
> [] schedule+0x33/0x80
> [] schedule_timeout+0x200/0x2a0
> [] ? internal_add_timer+0x71/0xb0
> [] ? mod_timer+0x114/0x210
> [] wait_for_completion+0xf1/0x130
> [] ? wake_up_q+0x70/0x70
> [] xhci_discover_or_reset_device+0x1e1/0x540
> [] hub_port_reset+0x3c8/0x590
> [] hub_port_init+0x525/0xb00
> [] hub_port_connect+0x328/0x940
> [] hub_event+0x63c/0xb00
> [] process_one_work+0x14c/0x3c0
> [] worker_thread+0x114/0x470
> [] ? __schedule+0x2af/0x8b0
> [] ? rescuer_thread+0x310/0x310
> [] kthread+0xd8/0xf0
> [] ? kthread_park+0x60/0x60
> [] ret_from_fork+0x3f/0x70
> [] ? kthread_park+0x60/0x60

Hi Thomas / Boris,

In an effort to exonerate the patchset, I instrumented xHCI to monitor
complementary wait_for_completion / complete calls in that driver,
hoping that an early exit in its probe might be simply skipping the
complete call ...
but I ended up collecting two new crashes in get_next_timer_interrupt:
(Again with proprietary and out-of-tree drivers loaded.)

general protection fault: 0000 [#1] SMP
Modules linked in: xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun matroxfb(OE) ccmod(POE) ftmod(OE) videosw(OE) [ ... snip ... ]
CPU: 10 PID: 0 Comm: swapper/10 Tainted: P OE 4.3.0sra13+ #53
Hardware name: Stratus ftServer 6800/G7LYY, BIOS BIOS Version 8.1:61 09/10/2015
task: ffff881038e35800 ti: ffff881038e3c000 task.ti: ffff881038e3c000
RIP: 0010:[] [] get_next_timer_interrupt+0x1a5/0x240
RSP: 0018:ffff881038e3fde0 EFLAGS: 00010002
RAX: ffff88103fa8e8b8 RBX: 000013629b0c5740 RCX: 000000014140a6d6
RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000001 RDI: 00000000010140a7
RBP: ffff881038e3fe30 R08: 6b6b6b6b6b6b6b6b R09: 0000000000000027
R10: 0000000000000027 R11: ffff881038e3fde8 R12: 000000010140a6d6
R13: ffff88103fa8e080 R14: ffff881038e3fe00 R15: 0000000000000040
FS: 0000000000000000(0000) GS:ffff88103fa80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcbf695a000 CR3: 0000002030582000 CR4: 00000000001406e0
Stack:
 ffff88103fa8e8b8 ffff88103fa8eab8 ffff88103fa8ecb8 ffff88103fa8eeb8
 c5067d2c236293b0 ffff88103fa8f500 ffff88103fa8d600 000013629b1632f3
 000000010140a6d6 000013629b0c5740 ffff881038e3fe88 ffffffff810f421f
Call Trace:
 [] tick_nohz_stop_sched_tick+0x1bf/0x2b0
 [] __tick_nohz_idle_enter+0x9f/0x150
 [] tick_nohz_idle_enter+0x3c/0x70
 [] cpu_startup_entry+0x9c/0x330
 [] start_secondary+0x160/0x1a0
Code: 95 38 0e 00 00 48 89 55 c8 48 8d 55 b0 4c 8d 5a 08 4c 8d 72 20 41 89 fa 41 83 e2 3f 45 89 d1 49 63 d1 48 8b 14 d0 48 85 d2 74 1e 42 2a 10 75 10 4c 8b 42 10 be 01 00 00 00 49 39 c8 49 0f 48
RIP [] get_next_timer_interrupt+0x1a5/0x240
 RSP

crash> dis -l get_next_timer_interrupt+0x1a5
/root/linus/kernel/time/timer.c: 1289
: testb $0x10,0x2a(%rdx)

RDX: 6b6b6b6b6b6b6b6b

1246 static unsigned long __next_timer_interrupt(struct tvec_base *base)
...
1251         struct timer_list *nte;
...
1277         /* Check tv2-tv5. */
1278         varray[0] = &base->tv2;
1279         varray[1] = &base->tv3;
1280         varray[2] = &base->tv4;
1281         varray[3] = &base->tv5;
...
1283         for (array = 0; array < 4; array++) {
1284                 struct tvec *varp = varray[array];
1285
1286                 index = slot = timer_jiffies & TVN_MASK;
1287                 do {
1288                         hlist_for_each_entry(nte, varp->vec + slot, entry) {
1289                                 if (nte->flags & TIMER_DEFERRABLE)

So the nte pointer contains the slub_debug poison pattern.

From the disassembly of __next_timer_interrupt, it looks like r13 is
used to store "base".

R13: ffff88103fa8e080

crash> struct tvec_base ffff88103fa8e080
struct tvec_base {
  lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0x1
          }
        }
      }
    }
  },
  running_timer = 0x0,
  timer_jiffies = 0x10140a6d7,
  next_timer = 0x10140a6d6,
  active_timers = 0x7,
  all_timers = 0x8,
  cpu = 0xa,
  migration_enabled = 0x1,
  nohz_active = 0x1,
  ...
  tv1 = {
    vec = {{
  ...
        first = 0xffff88203800dff8
  ...
  tv2 = {
    vec = {{
  ...
        first = 0xffff88100aa9a550    << index 39 points to slub poison
  ...
        first = 0xffff88103fa90ea0
  ...
  tv3 = {
    vec = {{
  ...
        first = 0xffff8807452b3928
  ...
        first = 0xffff88103fa8d380
  ...
  tv4 = {
    vec = {{
  ...
  tv5 = {
    vec = {{
  ...
        first = 0xffff88100c8955e8

crash> struct hlist_node 0xffff88203800dff8
struct hlist_node {
  next = 0x0,
  pprev = 0xffff88103fa8e790
}
crash> struct hlist_node 0xffff88100aa9a550
struct hlist_node {
  next = 0x6b6b6b6b6b6b6b6b,     << uhoh!
  pprev = 0x6b6b6b6b6b6b6b6b     <<
}
crash> struct hlist_node 0xffff88103fa90ea0
struct hlist_node {
  next = 0x0,
  pprev = 0xffff88103fa8e9f8
}
crash> struct hlist_node 0xffff8807452b3928
struct hlist_node {
  next = 0xffff88100aa9da68,
  pprev = 0xffff88103fa8eae0
}
crash> struct hlist_node 0xffff88103fa8d380
struct hlist_node {
  next = 0x0,
  pprev = 0xffff88103fa8eb58
}
crash> struct hlist_node 0xffff88100c8955e8
struct hlist_node {
  next = 0xffff881021e47598,
  pprev = 0xffff88103fa8eec0
}

crash utility confirms it in its "timer" display:

crash> timer
TVEC_BASES[9]: ffff88103fa4e080
JIFFIES
4315982678
EXPIRES TIMER_LIST FUNCTION
4315982681 ffff88203800ddb0 ffffffff8150e7a0
4315982973 ffff88103fa50ea0 ffffffff81092d90
4316267973 ffff88103fa4d380 ffffffff81041930
timer: invalid list entry: 6b6b6b6b6b6b6b6b
timer: ignoring faulty timer list at index 39 of timer array
timer: invalid list entry: 6b6b6b6b6b6b6b6b
timer: ignoring faulty timer list at index 39 of timer array
TVEC_BASES[10]: ffff88103fa8e080
JIFFIES
4315982678
EXPIRES TIMER_LIST FUNCTION
4315981531 ffff88203800dff8 ffffffff8150e7a0
4316034039 ffff88100aa9da68 ffffffff81092d90
4316034111 ffff8807452b3928 ffffffff81092d90
4316267970 ffff88103fa8d380 ffffffff81041930
4401397760 ffff88100c8955e8 ffffffff8160bbe0
4401397760 ffff881021e47598 ffffffff8160bbe0
7740398493674204011 ffff88100aa9a550 6b6b6b6b6b6b6b6b

crash> struct timer_list ffff88100aa9a550
struct timer_list {
  entry = {
    next = 0x6b6b6b6b6b6b6b6b,
    pprev = 0x6b6b6b6b6b6b6b6b
  },
  expires = 0x6b6b6b6b6b6b6b6b,
  function = 0x6b6b6b6b6b6b6b6b,
  data = 0x6b6b6b6b6b6b6b6b,
  flags = 0x6b6b6b6b,
  slack = 0x6b6b6b6b,
  start_pid = 0x6b6b6b6b,
  start_site = 0x6b6b6b6b6b6b6b6b,
  start_comm = "kkkkkkkkkkkkkkkk"
}

A second crash of the same signature occurred a few hours later.

Unfortunately I only have a single box to run these tests in what
amounts to an after-hours effort.
I started testing back in 4.3 but held off moving forward to avoid the
4.4 development cycle (and incidental issues that might have muddied
the waters).

That said, what would be the best way to proceed?

- Change device removal tests to avoid proprietary drivers. What about
  the other out-of-tree device drivers (mpt3sas, ixgbe, etc.)? These
  are open source, but contain much Stratus device removal paranoia
  that upstream hasn't adopted.

- Rebase evil(TM) proprietary/out-of-tree drivers against 4.4 or
  4.5rcX, apply this patchset and any other required device removal
  fixups.

If proprietary/out-of-tree drivers are a debugging deal breaker, I
understand. The platform offers a unique hotplug testbed, so I try to
contribute testing and bug reports where I feel they apply equally to
untainted upstream.

Regards,

-- Joe
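
For anyone following along without the vmcore, here is a minimal sketch
of two checks implied by the dump above: whether a value consists
entirely of the slub_debug free-poison byte, and which tv2 slot the scan
in __next_timer_interrupt starts from for a given base->timer_jiffies.
The helper names is_free_poison() and tv2_start_slot() are hypothetical.
The TVR_BITS/TVN_BITS values assume a kernel built with
CONFIG_BASE_SMALL=0, and the round-up before the shift is an assumption
about the cascade step in kernel/time/timer.c that is not shown in the
excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed timer wheel geometry (CONFIG_BASE_SMALL=0). */
#define TVR_BITS 8
#define TVN_BITS 6
#define TVR_SIZE (1u << TVR_BITS)
#define TVR_MASK (TVR_SIZE - 1)
#define TVN_MASK ((1u << TVN_BITS) - 1)

/* Byte that slub_debug writes over freed objects (ASCII 'k'). */
#define POISON_FREE 0x6b

/* Hypothetical helper: true if all eight bytes of v are POISON_FREE. */
bool is_free_poison(uint64_t v)
{
	return v == UINT64_C(0x0101010101010101) * POISON_FREE;
}

/*
 * Hypothetical helper: tv2 slot where the "Check tv2-tv5" loop starts.
 * Assumes timer_jiffies is first rounded up to the next tv1 wrap and
 * then shifted right by TVR_BITS before being masked with TVN_MASK.
 */
unsigned int tv2_start_slot(uint64_t timer_jiffies)
{
	unsigned int index = timer_jiffies & TVR_MASK;

	if (index)
		timer_jiffies += TVR_SIZE - index;
	timer_jiffies >>= TVR_BITS;

	return (unsigned int)(timer_jiffies & TVN_MASK);
}
```

Fed the values from the dump, tv2_start_slot(0x10140a6d7) comes out as
39 (0x27, the value also sitting in R09/R10), and is_free_poison() holds
both for RDX and for the bogus 7740398493674204011 expiry that crash
prints for the timer at ffff88100aa9a550.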