From mboxrd@z Thu Jan  1 00:00:00 1970
From: mr.nuke.me@gmail.com (Alex G.)
Date: Fri, 6 Apr 2018 14:08:38 -0500
Subject: IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
In-Reply-To: <20180406180445.GL10098@localhost.localdomain>
References: <719ea777-e57d-511e-52c5-cf83027d1fd0@gmail.com>
 <20180405224138.GH10098@localhost.localdomain>
 <20180405224830.GI10098@localhost.localdomain>
 <20180405230515.GJ10098@localhost.localdomain>
 <75edea4e-b961-82a1-3612-fc682a248819@gmail.com>
 <20180406153236.GK10098@localhost.localdomain>
 <94d77cb7-759f-595a-2264-37305dfa96c4@gmail.com>
 <20180406171622.aso3h6ydpmcdizl3@sbauer-Z170X-UD5>
 <93003ab7-f4a0-7e5d-f107-277df20f5566@gmail.com>
 <20180406180445.GL10098@localhost.localdomain>
Message-ID:

On 04/06/2018 01:04 PM, Keith Busch wrote:
> On Fri, Apr 06, 2018 at 12:46:06PM -0500, Alex G. wrote:
>> On 04/06/2018 12:16 PM, Scott Bauer wrote:
>>> You're using AER inject, right?
>>
>> No. I'm causing the errors in hardware with hot-unplug.
>
> I think Scott's still on the right track for this particular sighting.
> The AER handler looks unsafe under changing topologies. It might need to
> run under pci_lock_rescan_remove() before walking the bus to prevent races
> with the surprise removal, but it's not clear to me yet if holding that
> lock is okay to do in this context.

I think we have three mechanisms that can remove a device: the nvme
timeout, the Link Down interrupt, and AER. Link Down arrives 20-60 ms
after the link actually dies; during that window nvme keeps queueing IO,
which can cause a flood of PCIe errors, which in turn trigger AER
handling. I suspect there is a massive race condition somewhere, but I
don't yet have convincing evidence to prove it.

> This however does not appear to resemble your previous sightings. In your
> previous sightings, it looks like something has lost track of commands,
> and we're freeing the resources with them a second time.

I think they might be related.

Alex
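
P.S. For reference, below is a rough, untested sketch of what I understand
the locking suggestion to mean: take pci_lock_rescan_remove() around the
bus walk so a surprise removal can't tear the topology down underneath the
AER handler. The function and callback names are made up (this is not the
actual aerdrv code), and it doesn't answer the open question of whether
taking that sleeping lock is safe in this context.

#include <linux/pci.h>

/* Per-device error reporting callback; body omitted in this sketch. */
static int report_error_cb(struct pci_dev *dev, void *data)
{
	return 0;
}

/*
 * Hypothetical helper: walk the port's subordinate bus with the
 * rescan/remove lock held, so the topology stays stable for the
 * duration of the walk.
 */
static void aer_broadcast_locked(struct pci_dev *port, void *result_data)
{
	pci_lock_rescan_remove();
	if (port->subordinate)
		pci_walk_bus(port->subordinate, report_error_cb, result_data);
	pci_unlock_rescan_remove();
}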