Unsynchronized access to spi bus by mmc

linux-spi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Unsynchronized access to spi bus by mmc_rescan?
@ 2016-04-22 21:28 Rich Felker
  2016-05-04 22:44 ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2016-04-22 21:28 UTC (permalink / raw)
  To: linux-spi, linux-mmc

I'm working on a driver for the J-core (http://j-core.org) SPI master,
which is currently limited to PIO and using the spi-bitbang framework
(probably not the right thing to use, but planned to change, and seems
orthogonal to the issue at hand). We're using it to access an SD card
via the mmc_spi mmc host driver, and experiencing crashes/corruption
that I tracked down to mmc_rescan (we don't yet have an interrupt for
media change) happening during SPI message transfers. Which locks are
supposed to preclude this from happening? Is it likely something wrong
our driver is using, or is there possibly a general bug in the MMC/SPI
subsystem(s)?

Rich

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Unsynchronized access to spi bus by mmc_rescan?
  2016-04-22 21:28 Unsynchronized access to spi bus by mmc_rescan? Rich Felker
@ 2016-05-04 22:44 ` Rich Felker
       [not found]   ` <20160504224445.GW21636-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2016-05-04 22:44 UTC (permalink / raw)
  To: linux-spi, linux-mmc

Ping.

On Fri, Apr 22, 2016 at 05:28:42PM -0400, Rich Felker wrote:
> I'm working on a driver for the J-core (http://j-core.org) SPI master,
> which is currently limited to PIO and using the spi-bitbang framework
> (probably not the right thing to use, but planned to change, and seems
> orthogonal to the issue at hand). We're using it to access an SD card
> via the mmc_spi mmc host driver, and experiencing crashes/corruption
> that I tracked down to mmc_rescan (we don't yet have an interrupt for
> media change) happening during SPI message transfers. Which locks are
> supposed to preclude this from happening? Is it likely something wrong
> our driver is using, or is there possibly a general bug in the MMC/SPI
> subsystem(s)?
> 
> Rich

^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <20160504224445.GW21636-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>]

* Re: Unsynchronized access to spi bus by mmc_rescan?
       [not found]   ` <20160504224445.GW21636-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
@ 2016-07-21 20:44     ` Rich Felker
       [not found]       ` <20160721204451.GE15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2016-07-21 20:44 UTC (permalink / raw)
  To: linux-spi-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA

On Wed, May 04, 2016 at 06:44:45PM -0400, Rich Felker wrote:
> Ping.
> 
> On Fri, Apr 22, 2016 at 05:28:42PM -0400, Rich Felker wrote:
> > I'm working on a driver for the J-core (http://j-core.org) SPI master,
> > which is currently limited to PIO and using the spi-bitbang framework
> > (probably not the right thing to use, but planned to change, and seems
> > orthogonal to the issue at hand). We're using it to access an SD card
> > via the mmc_spi mmc host driver, and experiencing crashes/corruption
> > that I tracked down to mmc_rescan (we don't yet have an interrupt for
> > media change) happening during SPI message transfers. Which locks are
> > supposed to preclude this from happening? Is it likely something wrong
> > our driver is using, or is there possibly a general bug in the MMC/SPI
> > subsystem(s)?

Ping. I'm also experiencing this issue with multiple devices on the
same spi bus; access seems completely unsynchronized. While it's hard
to rule out the possibility of it being specific to the spi master
driver, the driver just implements transfer_one and set_cs, so
synchronization would necessarily have to happen at a higher level.
Any tips on how I could further track down the cause?

Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <20160721204451.GE15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>]

* Re: Unsynchronized access to spi bus by mmc_rescan?
       [not found]       ` <20160721204451.GE15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
@ 2016-07-21 21:40         ` Rich Felker
       [not found]           ` <20160721214015.GG15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2016-07-21 21:40 UTC (permalink / raw)
  To: linux-spi-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA
  Cc: Heiko Stübner, Ernst Schwab, Grant Likely, Matt Fleming,
	Antonio Ospite, Mark Brown

On Thu, Jul 21, 2016 at 04:44:51PM -0400, Rich Felker wrote:
> On Wed, May 04, 2016 at 06:44:45PM -0400, Rich Felker wrote:
> > Ping.
> > 
> > On Fri, Apr 22, 2016 at 05:28:42PM -0400, Rich Felker wrote:
> > > I'm working on a driver for the J-core (http://j-core.org) SPI master,
> > > which is currently limited to PIO and using the spi-bitbang framework
> > > (probably not the right thing to use, but planned to change, and seems
> > > orthogonal to the issue at hand). We're using it to access an SD card
> > > via the mmc_spi mmc host driver, and experiencing crashes/corruption
> > > that I tracked down to mmc_rescan (we don't yet have an interrupt for
> > > media change) happening during SPI message transfers. Which locks are
> > > supposed to preclude this from happening? Is it likely something wrong
> > > our driver is using, or is there possibly a general bug in the MMC/SPI
> > > subsystem(s)?
> 
> Ping. I'm also experiencing this issue with multiple devices on the
> same spi bus; access seems completely unsynchronized. While it's hard
> to rule out the possibility of it being specific to the spi master
> driver, the driver just implements transfer_one and set_cs, so
> synchronization would necessarily have to happen at a higher level.
> Any tips on how I could further track down the cause?

OK, I've been reading the code again, and unless I'm mistaken it's
utter nonsense.

The spi master device has a "bus lock flag" that's used in lieu of a
recursive mutex, but it seems to be used entirely incorrectly. The
logic I'm seeing in spi.c is essentially:

	if (some_task_has_spi_bus_locked)
		use_spi_master_without_locking();

There seems to be no check that the task that's using the master
without locking is the same one that locked it!

The relevant code in spi.c is in __pump_messages, where locking is

	if (!bus_locked)
		mutex_lock(&master->bus_lock_mutex);

bus_locked was provided by the caller from master->bus_lock_flag.

Also, master->bus_lock_flag is modified completely unsynchonized from
read access to master->bus_lock_flag. There is no locking of
master->bus_lock_mutex or master->bus_lock_spinlock around the read
accesses to master->bus_lock_flag, and in one place the write (of 0)
to master->bus_lock_flag takes place without holding the spinlock.

The obvious fix would be using an actual recursive mutex to obtain
recursive mutex semantics.

I've Cc'd everybody who appears to have been involved in the original
addition of the spi bus lock infrastructure and subsequent changes to
it.

Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <20160721214015.GG15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>]

* Re: Unsynchronized access to spi bus by mmc_rescan?
       [not found]           ` <20160721214015.GG15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
@ 2016-07-21 22:43             ` Mark Brown
       [not found]               ` <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Brown @ 2016-07-21 22:43 UTC (permalink / raw)
  To: Rich Felker
  Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA, Heiko Stübner,
	Ernst Schwab, Matt Fleming, Antonio Ospite

[-- Attachment #1: Type: text/plain, Size: 2320 bytes --]

On Thu, Jul 21, 2016 at 05:40:15PM -0400, Rich Felker wrote:

> The spi master device has a "bus lock flag" that's used in lieu of a
> recursive mutex, but it seems to be used entirely incorrectly. The
> logic I'm seeing in spi.c is essentially:

I am entirely unclear why you believe there is a recursive mutex here,
that's not the purpose of the spi_bus_lock() API at all.

> 	if (some_task_has_spi_bus_locked)
> 		use_spi_master_without_locking();

> There seems to be no check that the task that's using the master
> without locking is the same one that locked it!

That's because there is no requirement that something that takes a bus
lock be executing from a single context.  It is just granting exclusive
access to the bus, the caller is free to do those accesses in any manner
it sees fit.

> Also, master->bus_lock_flag is modified completely unsynchonized from
> read access to master->bus_lock_flag. There is no locking of
> master->bus_lock_mutex or master->bus_lock_spinlock around the read
> accesses to master->bus_lock_flag, and in one place the write (of 0)
> to master->bus_lock_flag takes place without holding the spinlock.

The critical read access is in spi_async where we exclude users that
don't hold the lock which *is* spinlocked; we should probably hold the
spinlock on release but it's not so important since we know that we hold
the lock when we call it and the mutex ends up providing protection.
spi_sync() shouldn't be reading the flag in the first place as far as I
can tell, it should unconditionally have the flag clear as locked
callers have to use spi_sync_locked() but that will give us a fast path
that doesn't check for the bus lock which starts to reveal the
underlying issue.

This underlying issue is that we are trying to use one mutex for two
purposes, the existing mutex is mainly being used to serialize access to
the physical bus but the externally visible bus lock is there to ensure
that only one caller is able to queue new transfers which is not the
same thing at all.  It is completely irrelevant when we are pushing
messages to the bus if we are doing so with the bus lock held.

> The obvious fix would be using an actual recursive mutex to obtain
> recursive mutex semantics.

This would be broken since we need to be able to start transfers from
atomic context.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>]

* Re: Unsynchronized access to spi bus by mmc_rescan?
       [not found]               ` <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2016-07-21 23:06                 ` Rich Felker
  2016-07-21 23:27                   ` Mark Brown
  2016-07-21 23:12                 ` Mark Brown
  1 sibling, 1 reply; 8+ messages in thread
From: Rich Felker @ 2016-07-21 23:06 UTC (permalink / raw)
  To: Mark Brown
  Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA, Heiko Stübner,
	Ernst Schwab, Matt Fleming, Antonio Ospite

On Thu, Jul 21, 2016 at 11:43:08PM +0100, Mark Brown wrote:
> On Thu, Jul 21, 2016 at 05:40:15PM -0400, Rich Felker wrote:
> 
> > The spi master device has a "bus lock flag" that's used in lieu of a
> > recursive mutex, but it seems to be used entirely incorrectly. The
> > logic I'm seeing in spi.c is essentially:
> 
> I am entirely unclear why you believe there is a recursive mutex here,
> that's not the purpose of the spi_bus_lock() API at all.

I consider the "if not yet locked, take lock" pattern logically akin
to a recursive mutex, but apologies if the language was imprecise.

> > 	if (some_task_has_spi_bus_locked)
> > 		use_spi_master_without_locking();
> 
> > There seems to be no check that the task that's using the master
> > without locking is the same one that locked it!
> 
> That's because there is no requirement that something that takes a bus
> lock be executing from a single context.  It is just granting exclusive
> access to the bus, the caller is free to do those accesses in any manner
> it sees fit.

Well then what exactly is the contract for who can access it and who
can't? What I'm experiencing is that, when the driver for one device
on the spi bus has the bus locked, other devices access the bus with
no synchronization at all, resulting in serious data corruption. The
media-change-detect probe also does this even with a single mmc device
on the bus.

Reverting commits 24c8cd1b0812, 49023d2e4ead, and 556351f14e74 (the
first two revert cleanly; the latter I did by hand) makes the problem
go away.

> > Also, master->bus_lock_flag is modified completely unsynchonized from
> > read access to master->bus_lock_flag. There is no locking of
> > master->bus_lock_mutex or master->bus_lock_spinlock around the read
> > accesses to master->bus_lock_flag, and in one place the write (of 0)
> > to master->bus_lock_flag takes place without holding the spinlock.
> 
> The critical read access is in spi_async where we exclude users that
> don't hold the lock which *is* spinlocked; we should probably hold the
> spinlock on release but it's not so important since we know that we hold
> the lock when we call it and the mutex ends up providing protection.
> spi_sync() shouldn't be reading the flag in the first place as far as I
> can tell, it should unconditionally have the flag clear as locked
> callers have to use spi_sync_locked() but that will give us a fast path
> that doesn't check for the bus lock which starts to reveal the
> underlying issue.

Ah, in that case maybe the actual bug is just one line of commit
24c8cd1b0812? Reverting just that also seems to make the problem go
away.

> This underlying issue is that we are trying to use one mutex for two
> purposes, the existing mutex is mainly being used to serialize access to
> the physical bus but the externally visible bus lock is there to ensure
> that only one caller is able to queue new transfers which is not the
> same thing at all.  It is completely irrelevant when we are pushing
> messages to the bus if we are doing so with the bus lock held.

OK. I didn't understand what synchronization prevents messages from
multiple drivers/sources from getting pushed at the same time, but
apparently the problem was just that spi_sync was behaving like
spi_sync_locked when the bus was already locked.

> > The obvious fix would be using an actual recursive mutex to obtain
> > recursive mutex semantics.
> 
> This would be broken since we need to be able to start transfers from
> atomic context.

*Nod*, OK.

Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Unsynchronized access to spi bus by mmc_rescan?
  2016-07-21 23:06                 ` Rich Felker
@ 2016-07-21 23:27                   ` Mark Brown
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Brown @ 2016-07-21 23:27 UTC (permalink / raw)
  To: Rich Felker
  Cc: linux-spi, linux-mmc, Heiko Stübner, Ernst Schwab,
	Matt Fleming, Antonio Ospite

[-- Attachment #1: Type: text/plain, Size: 1770 bytes --]

On Thu, Jul 21, 2016 at 07:06:15PM -0400, Rich Felker wrote:
> On Thu, Jul 21, 2016 at 11:43:08PM +0100, Mark Brown wrote:
> > On Thu, Jul 21, 2016 at 05:40:15PM -0400, Rich Felker wrote:

> > spi_sync() shouldn't be reading the flag in the first place as far as I
> > can tell, it should unconditionally have the flag clear as locked
> > callers have to use spi_sync_locked() but that will give us a fast path
> > that doesn't check for the bus lock which starts to reveal the
> > underlying issue.

> Ah, in that case maybe the actual bug is just one line of commit
> 24c8cd1b0812? Reverting just that also seems to make the problem go
> away.

For those playing at home that's "spi: fix possible deadlock between
internal bus locks and bus_lock_flag".  That won't work entirely since
that'd mean that we're not doing I/O locking to exclude the message pump
in cases where we push data directly in spi_sync() with the bus lock
held by a caller.  That's probably going to work most of the time as
most callers will be single threaded but they might not be.  It should
be saved by the checks to see if we're already processing a message
inside the pump function but it's better to be clear about what we're
doing I think, and we can clear cur_msg inside the message pump.  But
perhaps it's enough...  anyway, I'm not going to think through all the
cases properly at this time of night.

Please include human readable descriptions of things like commits and
issues being discussed in e-mail in your mails, this makes them much
easier for humans to read especially when they have no internet access.
I do frequently catch up on my mail on flights or while otherwise
travelling so this is even more pressing for me than just being about
making things a bit easier to read.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Unsynchronized access to spi bus by mmc_rescan?
       [not found]               ` <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  2016-07-21 23:06                 ` Rich Felker
@ 2016-07-21 23:12                 ` Mark Brown
  1 sibling, 0 replies; 8+ messages in thread
From: Mark Brown @ 2016-07-21 23:12 UTC (permalink / raw)
  To: Rich Felker
  Cc: linux-spi-u79uwXL29TY76Z2rM5mHXA,
	linux-mmc-u79uwXL29TY76Z2rM5mHXA, Heiko Stübner,
	Ernst Schwab, Matt Fleming, Antonio Ospite

[-- Attachment #1: Type: text/plain, Size: 1050 bytes --]

On Thu, Jul 21, 2016 at 11:43:08PM +0100, Mark Brown wrote:

> This underlying issue is that we are trying to use one mutex for two
> purposes, the existing mutex is mainly being used to serialize access to
> the physical bus but the externally visible bus lock is there to ensure
> that only one caller is able to queue new transfers which is not the
> same thing at all.  It is completely irrelevant when we are pushing
> messages to the bus if we are doing so with the bus lock held.

Just sent a test patch for this, completely untested given the hour but
hopefully you can give it a spin if this is easy to reproduce for you.

Rich, I also just noticed that you didn't CC any people on any of your
postings prior to today.  As I know we've discussed before you should
*always* directly CC at least maintainers, you can not assume that
anyone is going to see anything that is only on the lists.  This was the
same reason why your top posted content free pings didn't get anywhere,
most likely none of the maintainers had seen any of your emails.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-07-21 23:27 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-04-22 21:28 Unsynchronized access to spi bus by mmc_rescan? Rich Felker
2016-05-04 22:44 ` Rich Felker
     [not found]   ` <20160504224445.GW21636-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
2016-07-21 20:44     ` Rich Felker
     [not found]       ` <20160721204451.GE15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
2016-07-21 21:40         ` Rich Felker
     [not found]           ` <20160721214015.GG15995-C3MtFaGISjmo6RMmaWD+6Sb1p8zYI1N1@public.gmane.org>
2016-07-21 22:43             ` Mark Brown
     [not found]               ` <20160721224308.GT6509-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
2016-07-21 23:06                 ` Rich Felker
2016-07-21 23:27                   ` Mark Brown
2016-07-21 23:12                 ` Mark Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).