* Re: [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls
       [not found] ` <20170606053142.GA25064@bhelgaas-glaptop.roam.corp.google.com>
@ 2017-06-06 10:48   ` Christoph Hellwig
  2017-06-06 21:14     ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2017-06-06 10:48 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Christoph Hellwig, rakesh, linux-pci, linux-nvme,
	Greg Kroah-Hartman, linux-kernel

On Tue, Jun 06, 2017 at 12:31:42AM -0500, Bjorn Helgaas wrote:
> OK, sorry to be dense; it's taking me a long time to work out the
> details here.  It feels like there should be a general principle to
> help figure out where we need locking, and it would be really awesome
> if we could include that in the changelog.  But it's not obvious to me
> what that principle would be.

The principle is very simple: every method in struct device_driver, or
in structures derived from it like struct pci_driver, MUST provide
exclusion vs ->remove, usually by using device_lock().

If we don't provide such an exclusion, the method call can race with a
removal in one form or another.

> But I'm still nervous because I think both threads will queue
> nvme_reset_work() work items for the same device, and I'm not sure
> they're prepared to run concurrently.

We had another bug in that area, and the fix for that is hopefully
going to go into the next 4.12-rc.

> I don't really think it should be the driver's responsibility to
> understand issues like this and worry about things like
> nvme_reset_work() running concurrently.  So I'm thinking maybe the PCI
> core needs to be a little stricter here, but I don't know exactly
> *how*.
>
> What do you think?

The driver core / bus driver must ensure that method calls don't race
with ->remove.  There is nothing the driver can do about it, and the
race is just as possible with explicit user removals or hardware
hotplug.

^ permalink raw reply	[flat|nested] 7+ messages in thread
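The exclusion rule Christoph states can be sketched outside the kernel. In the userspace mock below, a pthread mutex stands in for device_lock() and a function pointer stands in for a driver method such as ->reset_notify(); all names (`mock_dev`, `call_method`, `do_remove`) are illustrative, not the real PCI core code:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Mock of a bus device: the driver pointer is cleared by remove(). */
struct mock_dev {
    pthread_mutex_t lock;       /* device_lock() stand-in */
    int (*method)(void);        /* driver callback stand-in */
};

static int bound_method(void) { return 42; }

/* Bus core: take the lock, then re-check that a driver is still
 * bound before calling into it.  Without the lock, remove could
 * free the driver's state between the check and the call. */
static int call_method(struct mock_dev *dev)
{
    int ret = -1;
    pthread_mutex_lock(&dev->lock);
    if (dev->method)
        ret = dev->method();
    pthread_mutex_unlock(&dev->lock);
    return ret;
}

/* Removal runs under the same lock, so it cannot overlap an
 * in-flight method call. */
static void do_remove(struct mock_dev *dev)
{
    pthread_mutex_lock(&dev->lock);
    dev->method = NULL;         /* unbind: driver state goes away */
    pthread_mutex_unlock(&dev->lock);
}

int demo_exclusion(void)
{
    struct mock_dev dev = { .method = bound_method };
    pthread_mutex_init(&dev.lock, NULL);
    assert(call_method(&dev) == 42);  /* bound: method runs */
    do_remove(&dev);
    assert(call_method(&dev) == -1);  /* removed: call safely skipped */
    pthread_mutex_destroy(&dev.lock);
    return 0;
}
```

The lock-then-recheck shape is the whole point: holding the lock alone is not enough, because by the time the caller acquires it, ->remove may already have run.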
* Re: [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls
  2017-06-06 10:48 ` [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls Christoph Hellwig
@ 2017-06-06 21:14   ` Bjorn Helgaas
  2017-06-07 18:29     ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2017-06-06 21:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: rakesh, linux-pci, linux-nvme, Greg Kroah-Hartman, linux-kernel

On Tue, Jun 06, 2017 at 12:48:36PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 06, 2017 at 12:31:42AM -0500, Bjorn Helgaas wrote:
> > OK, sorry to be dense; it's taking me a long time to work out the
> > details here.  It feels like there should be a general principle to
> > help figure out where we need locking, and it would be really awesome
> > if we could include that in the changelog.  But it's not obvious to me
> > what that principle would be.
>
> The principle is very simple: every method in struct device_driver
> or structures derived from it like struct pci_driver MUST provide
> exclusion vs ->remove, usually by using device_lock().
>
> If we don't provide such an exclusion, the method call can race with
> a removal in one form or another.

So I guess the method here is
dev->driver->err_handler->reset_notify(), and the PCI core should be
holding device_lock() while calling it?  That makes sense to me;
thanks a lot for articulating that!

1) The current patch protects the err_handler->reset_notify() uses by
adding or expanding device_lock regions in the paths that lead to
pci_reset_notify().  Could we simplify it by doing the locking
directly in pci_reset_notify()?  Then it would be easy to verify the
locking, and we would be less likely to add new callers without the
proper locking.

2) Stating the rule explicitly helps look for other problems, and I
think we have a similar problem in all the pcie_portdrv_err_handler
methods.

These are all called in the AER do_recovery() path, and the functions
there, e.g., report_error_detected(), do hold device_lock().  But
pcie_portdrv_error_detected() propagates this to all the children, and
we *don't* hold the lock for the children.

Bjorn

^ permalink raw reply	[flat|nested] 7+ messages in thread
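Bjorn's suggestion (1) amounts to moving the lock inside the notification helper itself, so every caller is covered automatically. A hedged sketch of that shape, again with mock userspace types rather than the real kernel API:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Userspace mock; a pthread mutex stands in for device_lock(). */
struct mock_pci_dev {
    pthread_mutex_t lock;
    void (*reset_notify)(int prepare);  /* err_handler method stand-in */
    int notified;
};

static void driver_reset_notify(int prepare) { (void)prepare; }

/* The variant Bjorn suggests: the helper takes the device lock
 * itself, so new callers cannot forget the locking. */
static void mock_pci_reset_notify(struct mock_pci_dev *dev, int prepare)
{
    pthread_mutex_lock(&dev->lock);
    if (dev->reset_notify) {        /* driver may have been unbound */
        dev->reset_notify(prepare);
        dev->notified++;
    }
    pthread_mutex_unlock(&dev->lock);
}

int demo_notify(void)
{
    struct mock_pci_dev dev = { .reset_notify = driver_reset_notify };
    pthread_mutex_init(&dev.lock, NULL);
    mock_pci_reset_notify(&dev, 1);  /* prepare for reset */
    mock_pci_reset_notify(&dev, 0);  /* reset done */
    assert(dev.notified == 2);
    dev.reset_notify = NULL;         /* simulate ->remove having run */
    mock_pci_reset_notify(&dev, 1);  /* safely skipped */
    assert(dev.notified == 2);
    pthread_mutex_destroy(&dev.lock);
    return 0;
}
```

The trade-off surfaces in the next message: locking per call makes the locking easy to audit, but makes it impossible to hold the lock across a run of consecutive calls, which is why the patch under discussion takes the lock in the callers instead.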
* Re: [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls
  2017-06-06 21:14 ` Bjorn Helgaas
@ 2017-06-07 18:29   ` Christoph Hellwig
  2017-06-12 23:14     ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2017-06-07 18:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Christoph Hellwig, rakesh, linux-pci, linux-nvme,
	Greg Kroah-Hartman, linux-kernel

On Tue, Jun 06, 2017 at 04:14:43PM -0500, Bjorn Helgaas wrote:
> So I guess the method here is
> dev->driver->err_handler->reset_notify(), and the PCI core should be
> holding device_lock() while calling it?  That makes sense to me;
> thanks a lot for articulating that!

Yes.

> 1) The current patch protects the err_handler->reset_notify() uses by
> adding or expanding device_lock regions in the paths that lead to
> pci_reset_notify().  Could we simplify it by doing the locking
> directly in pci_reset_notify()?  Then it would be easy to verify the
> locking, and we would be less likely to add new callers without the
> proper locking.

We could do that, except that I'd rather hold the lock over a longer
period if we have many calls following each other.  I also have a
patch to actually kill pci_reset_notify() later in the series as well,
as the calling convention for it and ->reset_notify() is awkward:
depending on the prepare parameter, they do two entirely different
things.  That being said, I could also add new pci_reset_prepare() and
pci_reset_done() helpers.

> 2) Stating the rule explicitly helps look for other problems, and I
> think we have a similar problem in all the pcie_portdrv_err_handler
> methods.

Yes, I mentioned this earlier, and I also vaguely remember we got bug
reports from IBM on power for this a while ago.  I just don't feel
confident enough to touch all these without a good test plan.

^ permalink raw reply	[flat|nested] 7+ messages in thread
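The awkward convention Christoph describes, one callback doing two different things depending on a bool, and the proposed prepare/done split can be contrasted with a small mock (hypothetical signatures for illustration, not the code that was actually merged):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace mock of the two calling conventions. */
struct mock_dev { int frozen; };

/* Existing convention: one callback whose meaning flips on a flag. */
static void reset_notify(struct mock_dev *dev, bool prepare)
{
    if (prepare)
        dev->frozen = 1;   /* quiesce I/O before the reset */
    else
        dev->frozen = 0;   /* re-enable the device afterwards */
}

/* Proposed split: two self-describing entry points, no flag. */
static void reset_prepare(struct mock_dev *dev) { dev->frozen = 1; }
static void reset_done(struct mock_dev *dev)    { dev->frozen = 0; }

int demo_split(void)
{
    struct mock_dev a = { 0 }, b = { 0 };

    reset_notify(&a, true);   /* reader must know what "true" means */
    reset_prepare(&b);        /* intent is visible at the call site */
    assert(a.frozen == 1 && b.frozen == 1);

    reset_notify(&a, false);
    reset_done(&b);
    assert(a.frozen == 0 && b.frozen == 0);
    return 0;
}
```

The split version also makes it natural to hold a lock from prepare to done in the caller, which matches Christoph's preference for longer lock regions.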
* Re: [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls
  2017-06-07 18:29 ` Christoph Hellwig
@ 2017-06-12 23:14   ` Bjorn Helgaas
  2017-06-13  7:08     ` Christoph Hellwig
  2017-06-22 20:41     ` Guilherme G. Piccoli
  0 siblings, 2 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2017-06-12 23:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: rakesh, linux-pci, linux-nvme, Greg Kroah-Hartman, linux-kernel

On Wed, Jun 07, 2017 at 08:29:36PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 06, 2017 at 04:14:43PM -0500, Bjorn Helgaas wrote:
> > So I guess the method here is
> > dev->driver->err_handler->reset_notify(), and the PCI core should be
> > holding device_lock() while calling it?  That makes sense to me;
> > thanks a lot for articulating that!
>
> Yes.
>
> > 1) The current patch protects the err_handler->reset_notify() uses by
> > adding or expanding device_lock regions in the paths that lead to
> > pci_reset_notify().  Could we simplify it by doing the locking
> > directly in pci_reset_notify()?  Then it would be easy to verify the
> > locking, and we would be less likely to add new callers without the
> > proper locking.
>
> We could do that, except that I'd rather hold the lock over a longer
> period if we have many calls following each other.

My main concern is being able to verify the locking.  I think that is
much easier if the locking is adjacent to the method invocation.  But
if you just add a comment at the method invocation about where the
locking is, that should be sufficient.

> I also have
> a patch to actually kill pci_reset_notify() later in the series as
> well, as the calling convention for it and ->reset_notify() are
> awkward - depending on prepare parameter they do two entirely
> different things.  That being said I could also add new
> pci_reset_prepare() and pci_reset_done() helpers.

I like your pci_reset_notify() changes; they make that much clearer.
I don't think new helpers are necessary.

> > 2) Stating the rule explicitly helps look for other problems, and I
> > think we have a similar problem in all the pcie_portdrv_err_handler
> > methods.
>
> Yes, I mentioned this earlier, and I also vaguely remember we got
> bug reports from IBM on power for this a while ago.  I just don't
> feel confident enough to touch all these without a good test plan.

Hmmm.  I see your point, but I hate leaving a known bug unfixed.  I
wonder if some enterprising soul could tickle this bug by injecting
errors while removing and rescanning devices below the bridge?

Bjorn

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls
  2017-06-12 23:14 ` Bjorn Helgaas
@ 2017-06-13  7:08   ` Christoph Hellwig
  2017-06-13 14:05     ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2017-06-13 7:08 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Christoph Hellwig, rakesh, linux-pci, linux-nvme,
	Greg Kroah-Hartman, linux-kernel

On Mon, Jun 12, 2017 at 06:14:23PM -0500, Bjorn Helgaas wrote:
> My main concern is being able to verify the locking.  I think that is
> much easier if the locking is adjacent to the method invocation.  But
> if you just add a comment at the method invocation about where the
> locking is, that should be sufficient.

Ok.  I can add comments for all the methods as a separate patch,
similar to Documentation/vfs/Locking.

> > Yes, I mentioned this earlier, and I also vaguely remember we got
> > bug reports from IBM on power for this a while ago.  I just don't
> > feel confident enough to touch all these without a good test plan.
>
> Hmmm.  I see your point, but I hate leaving a known bug unfixed.  I
> wonder if some enterprising soul could tickle this bug by injecting
> errors while removing and rescanning devices below the bridge?

I'm completely loaded up at the moment, but this sounds like a good
idea.

In the meantime, how do you want to proceed with this patch?

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls
  2017-06-13  7:08 ` Christoph Hellwig
@ 2017-06-13 14:05   ` Bjorn Helgaas
  0 siblings, 0 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2017-06-13 14:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: rakesh, linux-pci, linux-nvme, Greg Kroah-Hartman, linux-kernel

On Tue, Jun 13, 2017 at 09:08:10AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 12, 2017 at 06:14:23PM -0500, Bjorn Helgaas wrote:
> > My main concern is being able to verify the locking.  I think that is
> > much easier if the locking is adjacent to the method invocation.  But
> > if you just add a comment at the method invocation about where the
> > locking is, that should be sufficient.
>
> Ok.  I can add comments for all the methods as a separate patch,
> similar to Documentation/vfs/Locking
>
> > > Yes, I mentioned this earlier, and I also vaguely remember we got
> > > bug reports from IBM on power for this a while ago.  I just don't
> > > feel confident enough to touch all these without a good test plan.
> >
> > Hmmm.  I see your point, but I hate leaving a known bug unfixed.  I
> > wonder if some enterprising soul could tickle this bug by injecting
> > errors while removing and rescanning devices below the bridge?
>
> I'm completely loaded up at the moment, but this sounds like a good
> idea.
>
> In the meantime how do you want to proceed with this patch?

Can you just add comments about the locking?  I'd prefer that in the
same patch that adds the locking, because that's what I had a hard
time reviewing.

I'm not thinking of anything fancy like
Documentation/filesystems/Locking; I'm just thinking of something
along the lines of "caller must hold pci_dev_lock() to protect
err_handler->reset_notify from racing ->remove()".

And the changelog should contain the general principle about the
locking strategy.

Bjorn

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls
  2017-06-12 23:14 ` Bjorn Helgaas
  2017-06-13  7:08 ` Christoph Hellwig
@ 2017-06-22 20:41 ` Guilherme G. Piccoli
  1 sibling, 0 replies; 7+ messages in thread
From: Guilherme G. Piccoli @ 2017-06-22 20:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Christoph Hellwig
  Cc: rakesh, linux-pci, linux-nvme, Greg Kroah-Hartman, linux-kernel

On 06/12/2017 08:14 PM, Bjorn Helgaas wrote:
> On Wed, Jun 07, 2017 at 08:29:36PM +0200, Christoph Hellwig wrote:
>> On Tue, Jun 06, 2017 at 04:14:43PM -0500, Bjorn Helgaas wrote:
>>> So I guess the method here is
>>> dev->driver->err_handler->reset_notify(), and the PCI core should be
>>> holding device_lock() while calling it?  That makes sense to me;
>>> thanks a lot for articulating that!
>>
>> Yes.
>>
>>> 1) The current patch protects the err_handler->reset_notify() uses by
>>> adding or expanding device_lock regions in the paths that lead to
>>> pci_reset_notify().  Could we simplify it by doing the locking
>>> directly in pci_reset_notify()?  Then it would be easy to verify the
>>> locking, and we would be less likely to add new callers without the
>>> proper locking.
>>
>> We could do that, except that I'd rather hold the lock over a longer
>> period if we have many calls following each other.
>
> My main concern is being able to verify the locking.  I think that is
> much easier if the locking is adjacent to the method invocation.  But
> if you just add a comment at the method invocation about where the
> locking is, that should be sufficient.
>
>> I also have
>> a patch to actually kill pci_reset_notify() later in the series as
>> well, as the calling convention for it and ->reset_notify() are
>> awkward - depending on prepare parameter they do two entirely
>> different things.  That being said I could also add new
>> pci_reset_prepare() and pci_reset_done() helpers.
>
> I like your pci_reset_notify() changes; they make that much clearer.
> I don't think new helpers are necessary.
>
>>> 2) Stating the rule explicitly helps look for other problems, and I
>>> think we have a similar problem in all the pcie_portdrv_err_handler
>>> methods.
>>
>> Yes, I mentioned this earlier, and I also vaguely remember we got
>> bug reports from IBM on power for this a while ago.  I just don't
>> feel confident enough to touch all these without a good test plan.
>
> Hmmm.  I see your point, but I hate leaving a known bug unfixed.  I
> wonder if some enterprising soul could tickle this bug by injecting
> errors while removing and rescanning devices below the bridge?

Well, although I don't consider myself an enterprising soul... heheh

I can test it; just CC me in the next spin and provide some comment on
how to test (or point me to the thread of the original report).

I guess I was the reporter of the issue: I tried a simple fix for our
case, and Christoph mentioned the issue was more generic and needed a
proper fix.  Hopefully this one is that fix!

Thanks,

Guilherme

> Bjorn

^ permalink raw reply	[flat|nested] 7+ messages in thread
Thread overview: 7+ messages
[not found] <20170601111039.8913-1-hch@lst.de>
[not found] ` <20170601111039.8913-2-hch@lst.de>
[not found] ` <20170606053142.GA25064@bhelgaas-glaptop.roam.corp.google.com>
2017-06-06 10:48 ` [PATCH 1/3] PCI: ensure the PCI device is locked over ->reset_notify calls Christoph Hellwig
2017-06-06 21:14 ` Bjorn Helgaas
2017-06-07 18:29 ` Christoph Hellwig
2017-06-12 23:14 ` Bjorn Helgaas
2017-06-13 7:08 ` Christoph Hellwig
2017-06-13 14:05 ` Bjorn Helgaas
2017-06-22 20:41 ` Guilherme G. Piccoli