* Unsynchronized access to spi bus by mmc_rescan? @ 2016-04-22 21:28 Rich Felker 2016-05-04 22:44 ` Rich Felker 0 siblings, 1 reply; 8+ messages in thread From: Rich Felker @ 2016-04-22 21:28 UTC (permalink / raw) To: linux-spi, linux-mmc I'm working on a driver for the J-core (http://j-core.org) SPI master, which is currently limited to PIO and using the spi-bitbang framework (probably not the right thing to use, but planned to change, and seems orthogonal to the issue at hand). We're using it to access an SD card via the mmc_spi mmc host driver, and experiencing crashes/corruption that I tracked down to mmc_rescan (we don't yet have an interrupt for media change) happening during SPI message transfers. Which locks are supposed to preclude this from happening? Is it likely something wrong our driver is using, or is there possibly a general bug in the MMC/SPI subsystem(s)? Rich ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Unsynchronized access to spi bus by mmc_rescan? 2016-04-22 21:28 Unsynchronized access to spi bus by mmc_rescan? Rich Felker @ 2016-05-04 22:44 ` Rich Felker [not found] ` <20160504224445.GW21636-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: Rich Felker @ 2016-05-04 22:44 UTC (permalink / raw) To: linux-spi, linux-mmc Ping. On Fri, Apr 22, 2016 at 05:28:42PM -0400, Rich Felker wrote: > I'm working on a driver for the J-core (http://j-core.org) SPI master, > which is currently limited to PIO and using the spi-bitbang framework > (probably not the right thing to use, but planned to change, and seems > orthogonal to the issue at hand). We're using it to access an SD card > via the mmc_spi mmc host driver, and experiencing crashes/corruption > that I tracked down to mmc_rescan (we don't yet have an interrupt for > media change) happening during SPI message transfers. Which locks are > supposed to preclude this from happening? Is it likely something wrong > our driver is using, or is there possibly a general bug in the MMC/SPI > subsystem(s)? > > Rich ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20160504224445.GW21636-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>]
* Re: Unsynchronized access to spi bus by mmc_rescan? [not found] ` <20160504224445.GW21636-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org> @ 2016-07-21 20:44 ` Rich Felker [not found] ` <20160721204451.GE15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: Rich Felker @ 2016-07-21 20:44 UTC (permalink / raw) To: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-mmc-u79uwXL29TY76Z2rM5mHXA On Wed, May 04, 2016 at 06:44:45PM -0400, Rich Felker wrote: > Ping. > > On Fri, Apr 22, 2016 at 05:28:42PM -0400, Rich Felker wrote: > > I'm working on a driver for the J-core (http://j-core.org) SPI master, > > which is currently limited to PIO and using the spi-bitbang framework > > (probably not the right thing to use, but planned to change, and seems > > orthogonal to the issue at hand). We're using it to access an SD card > > via the mmc_spi mmc host driver, and experiencing crashes/corruption > > that I tracked down to mmc_rescan (we don't yet have an interrupt for > > media change) happening during SPI message transfers. Which locks are > > supposed to preclude this from happening? Is it likely something wrong > > our driver is using, or is there possibly a general bug in the MMC/SPI > > subsystem(s)? Ping. I'm also experiencing this issue with multiple devices on the same spi bus; access seems completely unsynchronized. While it's hard to rule out the possibility of it being specific to the spi master driver, the driver just implements transfer_one and set_cs, so synchronization would necessarily have to happen at a higher level. Any tips on how I could further track down the cause? Rich -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20160721204451.GE15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>]
* Re: Unsynchronized access to spi bus by mmc_rescan? [not found] ` <20160721204451.GE15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org> @ 2016-07-21 21:40 ` Rich Felker [not found] ` <20160721214015.GG15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: Rich Felker @ 2016-07-21 21:40 UTC (permalink / raw) To: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-mmc-u79uwXL29TY76Z2rM5mHXA Cc: Heiko Stübner, Ernst Schwab, Grant Likely, Matt Fleming, Antonio Ospite, Mark Brown On Thu, Jul 21, 2016 at 04:44:51PM -0400, Rich Felker wrote: > On Wed, May 04, 2016 at 06:44:45PM -0400, Rich Felker wrote: > > Ping. > > > > On Fri, Apr 22, 2016 at 05:28:42PM -0400, Rich Felker wrote: > > > I'm working on a driver for the J-core (http://j-core.org) SPI master, > > > which is currently limited to PIO and using the spi-bitbang framework > > > (probably not the right thing to use, but planned to change, and seems > > > orthogonal to the issue at hand). We're using it to access an SD card > > > via the mmc_spi mmc host driver, and experiencing crashes/corruption > > > that I tracked down to mmc_rescan (we don't yet have an interrupt for > > > media change) happening during SPI message transfers. Which locks are > > > supposed to preclude this from happening? Is it likely something wrong > > > our driver is using, or is there possibly a general bug in the MMC/SPI > > > subsystem(s)? > > Ping. I'm also experiencing this issue with multiple devices on the > same spi bus; access seems completely unsynchronized. While it's hard > to rule out the possibility of it being specific to the spi master > driver, the driver just implements transfer_one and set_cs, so > synchronization would necessarily have to happen at a higher level. > Any tips on how I could further track down the cause? OK, I've been reading the code again, and unless I'm mistaken it's utter nonsense. The spi master device has a "bus lock flag" that's used in lieu of a recursive mutex, but it seems to be used entirely incorrectly. The logic I'm seeing in spi.c is essentially: if (some_task_has_spi_bus_locked) use_spi_master_without_locking(); There seems to be no check that the task that's using the master without locking is the same one that locked it! The relevant code in spi.c is in __pump_messages, where locking is if (!bus_locked) mutex_lock(&master->bus_lock_mutex); bus_locked was provided by the caller from master->bus_lock_flag. Also, master->bus_lock_flag is modified completely unsynchonized from read access to master->bus_lock_flag. There is no locking of master->bus_lock_mutex or master->bus_lock_spinlock around the read accesses to master->bus_lock_flag, and in one place the write (of 0) to master->bus_lock_flag takes place without holding the spinlock. The obvious fix would be using an actual recursive mutex to obtain recursive mutex semantics. I've Cc'd everybody who appears to have been involved in the original addition of the spi bus lock infrastructure and subsequent changes to it. Rich -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20160721214015.GG15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>]
* Re: Unsynchronized access to spi bus by mmc_rescan? [not found] ` <20160721214015.GG15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org> @ 2016-07-21 22:43 ` Mark Brown [not found] ` <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 0 siblings, 1 reply; 8+ messages in thread From: Mark Brown @ 2016-07-21 22:43 UTC (permalink / raw) To: Rich Felker Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-mmc-u79uwXL29TY76Z2rM5mHXA, Heiko Stübner, Ernst Schwab, Matt Fleming, Antonio Ospite [-- Attachment #1: Type: text/plain, Size: 2320 bytes --] On Thu, Jul 21, 2016 at 05:40:15PM -0400, Rich Felker wrote: > The spi master device has a "bus lock flag" that's used in lieu of a > recursive mutex, but it seems to be used entirely incorrectly. The > logic I'm seeing in spi.c is essentially: I am entirely unclear why you believe there is a recursive mutex here, that's not the purpose of the spi_bus_lock() API at all. > if (some_task_has_spi_bus_locked) > use_spi_master_without_locking(); > There seems to be no check that the task that's using the master > without locking is the same one that locked it! That's because there is no requirement that something that takes a bus lock be executing from a single context. It is just granting exclusive access to the bus, the caller is free to do those accesses in any manner it sees fit. > Also, master->bus_lock_flag is modified completely unsynchonized from > read access to master->bus_lock_flag. There is no locking of > master->bus_lock_mutex or master->bus_lock_spinlock around the read > accesses to master->bus_lock_flag, and in one place the write (of 0) > to master->bus_lock_flag takes place without holding the spinlock. The critical read access is in spi_async where we exclude users that don't hold the lock which *is* spinlocked; we should probably hold the spinlock on release but it's not so important since we know that we hold the lock when we call it and the mutex ends up providing protection. spi_sync() shouldn't be reading the flag in the first place as far as I can tell, it should unconditionally have the flag clear as locked callers have to use spi_sync_locked() but that will give us a fast path that doesn't check for the bus lock which starts to reveal the underlying issue. This underlying issue is that we are trying to use one mutex for two purposes, the existing mutex is mainly being used to serialize access to the physical bus but the externally visible bus lock is there to ensure that only one caller is able to queue new transfers which is not the same thing at all. It is completely irrelevant when we are pushing messages to the bus if we are doing so with the bus lock held. > The obvious fix would be using an actual recursive mutex to obtain > recursive mutex semantics. This would be broken since we need to be able to start transfers from atomic context. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]
* Re: Unsynchronized access to spi bus by mmc_rescan? [not found] ` <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> @ 2016-07-21 23:06 ` Rich Felker 2016-07-21 23:27 ` Mark Brown 2016-07-21 23:12 ` Mark Brown 1 sibling, 1 reply; 8+ messages in thread From: Rich Felker @ 2016-07-21 23:06 UTC (permalink / raw) To: Mark Brown Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-mmc-u79uwXL29TY76Z2rM5mHXA, Heiko Stübner, Ernst Schwab, Matt Fleming, Antonio Ospite On Thu, Jul 21, 2016 at 11:43:08PM +0100, Mark Brown wrote: > On Thu, Jul 21, 2016 at 05:40:15PM -0400, Rich Felker wrote: > > > The spi master device has a "bus lock flag" that's used in lieu of a > > recursive mutex, but it seems to be used entirely incorrectly. The > > logic I'm seeing in spi.c is essentially: > > I am entirely unclear why you believe there is a recursive mutex here, > that's not the purpose of the spi_bus_lock() API at all. I consider the "if not yet locked, take lock" pattern logically akin to a recursive mutex, but apologies if the language was imprecise. > > if (some_task_has_spi_bus_locked) > > use_spi_master_without_locking(); > > > There seems to be no check that the task that's using the master > > without locking is the same one that locked it! > > That's because there is no requirement that something that takes a bus > lock be executing from a single context. It is just granting exclusive > access to the bus, the caller is free to do those accesses in any manner > it sees fit. Well then what exactly is the contract for who can access it and who can't? What I'm experiencing is that, when the driver for one device on the spi bus has the bus locked, other devices access the bus with no synchronization at all, resulting in serious data corruption. The media-change-detect probe also does this even with a single mmc device on the bus. Reverting commits 24c8cd1b0812, 49023d2e4ead, and 556351f14e74 (the first two revert cleanly; the latter I did by hand) makes the problem go away. > > Also, master->bus_lock_flag is modified completely unsynchonized from > > read access to master->bus_lock_flag. There is no locking of > > master->bus_lock_mutex or master->bus_lock_spinlock around the read > > accesses to master->bus_lock_flag, and in one place the write (of 0) > > to master->bus_lock_flag takes place without holding the spinlock. > > The critical read access is in spi_async where we exclude users that > don't hold the lock which *is* spinlocked; we should probably hold the > spinlock on release but it's not so important since we know that we hold > the lock when we call it and the mutex ends up providing protection. > spi_sync() shouldn't be reading the flag in the first place as far as I > can tell, it should unconditionally have the flag clear as locked > callers have to use spi_sync_locked() but that will give us a fast path > that doesn't check for the bus lock which starts to reveal the > underlying issue. Ah, in that case maybe the actual bug is just one line of commit 24c8cd1b0812? Reverting just that also seems to make the problem go away. > This underlying issue is that we are trying to use one mutex for two > purposes, the existing mutex is mainly being used to serialize access to > the physical bus but the externally visible bus lock is there to ensure > that only one caller is able to queue new transfers which is not the > same thing at all. It is completely irrelevant when we are pushing > messages to the bus if we are doing so with the bus lock held. OK. I didn't understand what synchronization prevents messages from multiple drivers/sources from getting pushed at the same time, but apparently the problem was just that spi_sync was behaving like spi_sync_locked when the bus was already locked. > > The obvious fix would be using an actual recursive mutex to obtain > > recursive mutex semantics. > > This would be broken since we need to be able to start transfers from > atomic context. *Nod*, OK. Rich -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Unsynchronized access to spi bus by mmc_rescan? 2016-07-21 23:06 ` Rich Felker @ 2016-07-21 23:27 ` Mark Brown 0 siblings, 0 replies; 8+ messages in thread From: Mark Brown @ 2016-07-21 23:27 UTC (permalink / raw) To: Rich Felker Cc: linux-spi, linux-mmc, Heiko Stübner, Ernst Schwab, Matt Fleming, Antonio Ospite [-- Attachment #1: Type: text/plain, Size: 1770 bytes --] On Thu, Jul 21, 2016 at 07:06:15PM -0400, Rich Felker wrote: > On Thu, Jul 21, 2016 at 11:43:08PM +0100, Mark Brown wrote: > > On Thu, Jul 21, 2016 at 05:40:15PM -0400, Rich Felker wrote: > > spi_sync() shouldn't be reading the flag in the first place as far as I > > can tell, it should unconditionally have the flag clear as locked > > callers have to use spi_sync_locked() but that will give us a fast path > > that doesn't check for the bus lock which starts to reveal the > > underlying issue. > Ah, in that case maybe the actual bug is just one line of commit > 24c8cd1b0812? Reverting just that also seems to make the problem go > away. For those playing at home that's "spi: fix possible deadlock between internal bus locks and bus_lock_flag". That won't work entirely since that'd mean that we're not doing I/O locking to exclude the message pump in cases where we push data directly in spi_sync() with the bus lock held by a caller. That's probably going to work most of the time as most callers will be single threaded but they might not be. It should be saved by the checks to see if we're already processing a message inside the pump function but it's better to be clear about what we're doing I think, and we can clear cur_msg inside the message pump. But perhaps it's enough... anyway, I'm not going to think through all the cases properly at this time of night. Please include human readable descriptions of things like commits and issues being discussed in e-mail in your mails, this makes them much easier for humans to read especially when they have no internet access. I do frequently catch up on my mail on flights or while otherwise travelling so this is even more pressing for me than just being about making things a bit easier to read. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Unsynchronized access to spi bus by mmc_rescan? [not found] ` <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2016-07-21 23:06 ` Rich Felker @ 2016-07-21 23:12 ` Mark Brown 1 sibling, 0 replies; 8+ messages in thread From: Mark Brown @ 2016-07-21 23:12 UTC (permalink / raw) To: Rich Felker Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA, linux-mmc-u79uwXL29TY76Z2rM5mHXA, Heiko Stübner, Ernst Schwab, Matt Fleming, Antonio Ospite [-- Attachment #1: Type: text/plain, Size: 1050 bytes --] On Thu, Jul 21, 2016 at 11:43:08PM +0100, Mark Brown wrote: > This underlying issue is that we are trying to use one mutex for two > purposes, the existing mutex is mainly being used to serialize access to > the physical bus but the externally visible bus lock is there to ensure > that only one caller is able to queue new transfers which is not the > same thing at all. It is completely irrelevant when we are pushing > messages to the bus if we are doing so with the bus lock held. Just sent a test patch for this, completely untested given the hour but hopefully you can give it a spin if this is easy to reproduce for you. Rich, I also just noticed that you didn't CC any people on any of your postings prior to today. As I know we've discussed before you should *always* directly CC at least maintainers, you can not assume that anyone is going to see anything that is only on the lists. This was the same reason why your top posted content free pings didn't get anywhere, most likely none of the maintainers had seen any of your emails. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 473 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2016-07-21 23:27 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-04-22 21:28 Unsynchronized access to spi bus by mmc_rescan? Rich Felker 2016-05-04 22:44 ` Rich Felker [not found] ` <20160504224445.GW21636-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org> 2016-07-21 20:44 ` Rich Felker [not found] ` <20160721204451.GE15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org> 2016-07-21 21:40 ` Rich Felker [not found] ` <20160721214015.GG15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org> 2016-07-21 22:43 ` Mark Brown [not found] ` <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org> 2016-07-21 23:06 ` Rich Felker 2016-07-21 23:27 ` Mark Brown 2016-07-21 23:12 ` Mark Brown
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).