* st.ko with usb-storage problems
@ 2003-09-15 0:41 Matthew Dharm
2003-09-15 18:08 ` Kai Makisara
0 siblings, 1 reply; 5+ messages in thread
From: Matthew Dharm @ 2003-09-15 0:41 UTC (permalink / raw)
To: USB Developers, Linux SCSI list, USB Storage List
[-- Attachment #1: Type: text/plain, Size: 1797 bytes --]
(NOTE: Cross-posted to several lists. Watch your replies.)
For a while now in 2.5 and 2.6, I've been getting reports of an
incompatibility between st.ko and the usb-storage driver. I've finally
managed to figure out the problem; it only affects certain hardware, and
only on certain operations.
USB is a packet-oriented protocol. Important to the protocol is the
concept of a packet being full or not -- in a given transfer, all packets
should be full except the last one. Since usb-storage maps the SCSI
command's scatter-gather list for direct I/O, each scatter-gather segment
(except for the last one) must be a multiple of the packet size. That
value can vary, but is always a power of 2 (I've personally seen up to
512 bytes).
When st.ko goes to do a read, it will create a scatter-gather list with
(what looks like) a mapping of the user-space buffer. This often results
in not-so-nicely sized segments, which are often a multiple of 8, but not
64, 128, or 512 (which are the common USB packet sizes). This causes a
problem for the USB host-controller hardware, and the transfer fails.
Why this doesn't happen on a write is something of a mystery to me, but it
doesn't seem to happen.
If the module flag 'try_direct_io' is set to 0 (not the default), then
st.ko doesn't generate scatter-gather lists at all, and all is well.
It's not immediately clear to me how to fix this. First, I need to figure
out why 2.4 kernels don't show this problem, I think. Help from people who
understand st.ko would be appreciated.
Matt
--
Matthew Dharm Home: mdharm-usb@one-eyed-alien.net
Maintainer, Linux USB Mass Storage Driver
THEY CASTRATED MY QUAKE BITS! I WANT THEM BACK!!!!
-- Greg
User Friendly, 3/27/1998
[-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --]
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: st.ko with usb-storage problems
2003-09-15 0:41 st.ko with usb-storage problems Matthew Dharm
@ 2003-09-15 18:08 ` Kai Makisara
2003-09-15 19:52 ` serial_number and serial_number_at_timeout in 2.4 Charlie Brett
0 siblings, 1 reply; 5+ messages in thread
From: Kai Makisara @ 2003-09-15 18:08 UTC (permalink / raw)
To: Matthew Dharm; +Cc: USB Developers, Linux SCSI list, USB Storage List
On Sun, 14 Sep 2003, Matthew Dharm wrote:
> (NOTE: Cross-posted to several lists. Watch your replies.)
>
> For a while now in 2.5 and 2.6, I've been getting reports of an
> incompatibility between st.ko and the usb-storage driver. I've finally
> managed to figure out the problem; it only affects certain hardware, and
> only on certain operations.
>
...
> When st.ko goes to do a read, it will create a scatter-gather list with
> (what looks like) a mapping of the user-space buffer. This often results
> in not-so-nicely sized segments, which are often a multiple of 8, but not
> 64, 128, or 512 (which are the common USB packet sizes). This causes a
> problem for the USB host-controller hardware, and the transfer fails.
>
The glibc malloc() function returns memory allocated at 8 byte boundaries
(16 bytes for 64-bit systems).
> Why this doesn't happen on a write is something of a mystery to me, but it
> doesn't seem to happen.
>
The user space buffer alignment depends on the application. Was the write
buffer accidentally or on purpose properly aligned? This is the only
explanation that comes to my mind.
> If the module flag 'try_direct_io' is set to 0 (not the default), then
> st.ko doesn't generate scatter-gather lists at all, and all is well.
>
st can use s/g lists also when try_direct_io. It tries to allocate the
driver buffer using as big chunks as possible and this is why the buffer
is usually contiguous in physical memory for reasonable buffer sizes. The
driver buffer is allocated using get_free_pages() and so each segment size
should be divisible by 512 even if s/g is used.
> It's not immediately clear to me how to fix this. First, I need to figure
> out why 2.4 kernels don't show this problem, I think. Help from people who
> understand st.ko would be appreciated.
>
The 2.4 driver does not do direct transfers between user buffer and
device. It was added in 2.5.
The simple solution would be for st to set try_direct_io to zero if the
HBA does not support arbitrary alignment. The problem is that (AFAIK) this
information is not currently available to the high level driver.
--
Kai
^ permalink raw reply [flat|nested] 5+ messages in thread
* serial_number and serial_number_at_timeout in 2.4
2003-09-15 18:08 ` Kai Makisara
@ 2003-09-15 19:52 ` Charlie Brett
2003-09-15 21:09 ` Mike Anderson
0 siblings, 1 reply; 5+ messages in thread
From: Charlie Brett @ 2003-09-15 19:52 UTC (permalink / raw)
To: Linux SCSI list; +Cc: dann frazier
We discovered what appears to be a race problem in scsi_done() on the 2.4
kernel. On MP systems, we found the following to occur:
1) A command is scheduled through scsi_dispatch_cmd(). This will start the
timeout timer and call queuecommand().
2) The command completes and the interrupt handler acquires the lock.
3) The timeout occurs before scsi_done()[scsi_old_done()] is called (which
would have turned off the timeout).
4) The timeout routine, scsi_old_times_out(), waits for the lock.
In scsi_done() the following lines are executed:
SCpnt->serial_number = 0;
SCpnt->serial_number_at_timeout = 0;
5) If, it is decided that the command should be retried, the lock is
released to call scsi_dispatch_cmd().
6) As soon as the lock is released, the timeout routine acquires the lock
and starts to complete. It will either call scsi_abort() or scsi_reset(). In
each of the routines, serial_number and serial_number_at_timeout are
compared, which are now both 0, so the routines continue (calling
scsi_done() a second time).
To minimize the changes, we would like propose the addition of a check to
see if the serial number is zero at the beginning of scsi_old_times_out().
This would prevent to continuation of an abort or reset on a command that
scsi_done() has already processed.
Note: If scsi_dispatch_cmd() does start, before the timeout routine, then a
new serial number is assigned, which cause the current test to work.
Any thoughts?
Thanks,
Charlie Brett - Hewlett Packard
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: serial_number and serial_number_at_timeout in 2.4
2003-09-15 19:52 ` serial_number and serial_number_at_timeout in 2.4 Charlie Brett
@ 2003-09-15 21:09 ` Mike Anderson
2003-09-16 14:08 ` Charlie Brett
0 siblings, 1 reply; 5+ messages in thread
From: Mike Anderson @ 2003-09-15 21:09 UTC (permalink / raw)
To: Charlie Brett; +Cc: Linux SCSI list, dann frazier
Charlie Brett [cfb@ldl.fc.hp.com] wrote:
> We discovered what appears to be a race problem in scsi_done() on the 2.4
> kernel. On MP systems, we found the following to occur:
>
> 1) A command is scheduled through scsi_dispatch_cmd(). This will start the
> timeout timer and call queuecommand().
> 2) The command completes and the interrupt handler acquires the lock.
> 3) The timeout occurs before scsi_done()[scsi_old_done()] is called (which
> would have turned off the timeout).
> 4) The timeout routine, scsi_old_times_out(), waits for the lock.
> In scsi_done() the following lines are executed:
>
> SCpnt->serial_number = 0;
> SCpnt->serial_number_at_timeout = 0;
>
> 5) If, it is decided that the command should be retried, the lock is
> released to call scsi_dispatch_cmd().
> 6) As soon as the lock is released, the timeout routine acquires the lock
> and starts to complete. It will either call scsi_abort() or scsi_reset(). In
> each of the routines, serial_number and serial_number_at_timeout are
> compared, which are now both 0, so the routines continue (calling
> scsi_done() a second time).
>
> To minimize the changes, we would like propose the addition of a check to
> see if the serial number is zero at the beginning of scsi_old_times_out().
> This would prevent to continuation of an abort or reset on a command that
> scsi_done() has already processed.
>
> Note: If scsi_dispatch_cmd() does start, before the timeout routine, then a
> new serial number is assigned, which cause the current test to work.
>
> Any thoughts?
It has been a while since I looked at that the old error handler, but
with your change you would still have an issue with the call to
scsi_old_done from the driver if scsi_old_times_out aquired the lock
first. As the update_timeout function does not past the return value
back from scsi_delete_timer so scsi_old_done does not know the timeout
is already running.
Moving to the new error handler would be better answer, but depending on
your driver maybe plugging the old error handler is your only option.
-andmike
--
Michael Anderson
andmike@us.ibm.com
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: serial_number and serial_number_at_timeout in 2.4
2003-09-15 21:09 ` Mike Anderson
@ 2003-09-16 14:08 ` Charlie Brett
0 siblings, 0 replies; 5+ messages in thread
From: Charlie Brett @ 2003-09-16 14:08 UTC (permalink / raw)
To: Linux SCSI list
You're correct about the reverse scenerio. We were trying to keep the fix as
simple as possible. With that in mind I would like to go back to our
original thought and propose changing the 2 locations in scsi_abort() where
the serial numbers are compared to also test for serial_number having a zero
value. I haven't quite figured out why the second test is there, it really
doesn't make sense, and if either one of the fields changed between the
first test and the second test, there's a serious problem (considering the
lock was held during this time). There is also one test in scsi_reset() that
could be changed as well. This should not affect how it behaves when it is
called by scsi_done(), since it only does this with the
SCSI_RESET_ASYNCHRONOUS cleared, and the test is not performed. Adding the
test to scsi_reset() does reduce the possibility of collision from the scsi
error handler (scsi_eh_*), which can call scsi_reset() from
scsi_unjam_host() without the lock held.
So now, how about in scsi_abort() and scsi_reset(), changing:
if (SCpnt->serial_number != SCpnt->serial_number_at_timeout) {
return 0;
}
to:
if (!SCpnt->serial_number ||
(SCpnt->serial_number != SCpnt->serial_number_at_timeout)) {
return 0;
}
Note: The scsi_done() function I was referring to is ->scsi_done() which in
this case is scsi_old_done().
Charlie Brett - Hewlett Packard
----- Original Message -----
From: "Mike Anderson" <andmike@us.ibm.com>
To: "Charlie Brett" <cfb@ldl.fc.hp.com>
Cc: "Linux SCSI list" <linux-scsi@vger.kernel.org>; "dann frazier"
<dannf@hp.com>
Sent: Monday, September 15, 2003 3:09 PM
Subject: Re: serial_number and serial_number_at_timeout in 2.4
> Charlie Brett [cfb@ldl.fc.hp.com] wrote:
> > We discovered what appears to be a race problem in scsi_done() on the
2.4
> > kernel. On MP systems, we found the following to occur:
> >
> > 1) A command is scheduled through scsi_dispatch_cmd(). This will start
the
> > timeout timer and call queuecommand().
> > 2) The command completes and the interrupt handler acquires the lock.
> > 3) The timeout occurs before scsi_done()[scsi_old_done()] is called
(which
> > would have turned off the timeout).
> > 4) The timeout routine, scsi_old_times_out(), waits for the lock.
> > In scsi_done() the following lines are executed:
> >
> > SCpnt->serial_number = 0;
> > SCpnt->serial_number_at_timeout = 0;
> >
> > 5) If, it is decided that the command should be retried, the lock is
> > released to call scsi_dispatch_cmd().
> > 6) As soon as the lock is released, the timeout routine acquires the
lock
> > and starts to complete. It will either call scsi_abort() or
scsi_reset(). In
> > each of the routines, serial_number and serial_number_at_timeout are
> > compared, which are now both 0, so the routines continue (calling
> > scsi_done() a second time).
> >
> > To minimize the changes, we would like propose the addition of a check
to
> > see if the serial number is zero at the beginning of
scsi_old_times_out().
> > This would prevent to continuation of an abort or reset on a command
that
> > scsi_done() has already processed.
> >
> > Note: If scsi_dispatch_cmd() does start, before the timeout routine,
then a
> > new serial number is assigned, which cause the current test to work.
> >
> > Any thoughts?
>
> It has been a while since I looked at that the old error handler, but
> with your change you would still have an issue with the call to
> scsi_old_done from the driver if scsi_old_times_out aquired the lock
> first. As the update_timeout function does not past the return value
> back from scsi_delete_timer so scsi_old_done does not know the timeout
> is already running.
>
> Moving to the new error handler would be better answer, but depending on
> your driver maybe plugging the old error handler is your only option.
>
> -andmike
> --
> Michael Anderson
> andmike@us.ibm.com
>
>
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2003-09-16 14:08 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-09-15 0:41 st.ko with usb-storage problems Matthew Dharm
2003-09-15 18:08 ` Kai Makisara
2003-09-15 19:52 ` serial_number and serial_number_at_timeout in 2.4 Charlie Brett
2003-09-15 21:09 ` Mike Anderson
2003-09-16 14:08 ` Charlie Brett
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox