linux-ide.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* ide-io.c, ide_do_request -- race condition?
@ 2004-07-05 22:51 Max T. Woodbury
  2004-07-07 19:43 ` Bartlomiej Zolnierkiewicz
  0 siblings, 1 reply; 16+ messages in thread
From: Max T. Woodbury @ 2004-07-05 22:51 UTC (permalink / raw)
  To: linux-ide

I have a question about a specific statement in the ide-io.c
code.  However, I know that I need to establish a context
for that question, so please be patient with the fairly long
discussion that follows.

I recently decided to install Linux (Fedora Core 1 to be
specific) on a venerably old (Thinkpad 760ED) laptop.
The process proved troublesome.  Part way through the
installation, the file system became corrupted, throwing
read errors on a block in the installed packages (i.e.
RPM) database.  I did a bad block scan, zeroed the partition
and tried again.  Twice.  The list of bad blocks was NOT
consistent from time to time.  That eliminated the disk
drive as the source of the problem.  A fourth attempt
produced a clue.  I was watching the installation using
"top" and it took a bit longer for the corruption to occur.
Three attempts later and I had a clean installation.  The
trick was to put a heavy computational load on the machine
("while [[ 0 == 0 ]]; do echo -n; done &" x 5) during the
installation.  However the installation took about 6 hours
as a result, two and a half to three times the normal time.

The problem did not end there.  Updating from the 
installation base to the current patch level also corrupted
the file system.  It took three more tries, from scratch, to
get a usable system.

Surprisingly, a kernel build did NOT produce any corruption.
This pretty much eliminated memory as a source of the problem.
The build process is a much more memory intense process than
the installation process.  It would have blown up faster than
the installations if there was a memory problem.

(The fact that the machine runs other OSs without noticeable
problems is also an indication that the underlying hardware
is in working order.  Only the system software and disk
drive changed between the two setups and I have explained
why I do not think it is the disk drive.)

I concluded that there was a race condition someplace in
the disk drivers sequence and went a hunting through the 
driver code starting from the IDE end.

In ide-io.c there is a block comment just before the place
where the i/o request setup routine is called.  It notes
that some older chipsets do not like to be interrupted
during the setup process.  It also notes that 'massive
fs corruption" will result if the setup process is
interrupted.  (Hmmmm.)  A search of the kernel mailing list
archives found one note on this piece of code where commands
were getting lost (rarely) in an SMP environment.  I thought
that this would be a good piece of code to fine tooth.

The code was more or less what I expected.  A pair of calls
locked the register set by turing the relevant interrupt off
before the setup and back on when it was done.  They were the
obvious places to do the SMP synchronization and inspection
proved that that was in fact the case so premature interrupts
should not be a problem if everything else was kosher.

What I found next was very surprising given the comments.
Local interrupts were then turned back on!  I expected to see
ALL interrupts turned off here because the setup had to be
atomic, but the opposite was what the code did.  What is going
on here?

This is 2.4.22 code, but it has not been changed for 2.4.26.
There is some significant changes with 2.6.7, but it is worse
if anything.  A little more explanation is probably in order.

At the top of the routine all interrupts are turned off and
the hardware group busy bit is set.  This bit is protected by
a spinlock, so there should be no need to lock out interrupts
while manipulating it.  There are a few other conditions
checked next, none requiring interrupt lockout.  So a whole
bunch of code has been executing under interrupt lockout when
there was no need for the lockout.  Not a huge problem, just
strange.  Also, in 2.6, the lockout has to begin before the
routine is called which is why I said 2.6 was worse.

Then comes the block comment, the all-CPU lockout of completion
interrupts, and just when the comment suggests that all 
interrupts should be turned off, they are turned back on.

I can understand the problem the comment might be addressing.
The interface could well be sensitive to the timing of the
over-all load sequence or a read from the status register
checking for 'ready' could bollix the whole setup process.
This final setup phase really should be an atomic operation.

I've done a little 'playing' with this code.  First I tried
just removing the enable call.  It seems to have had absolutely
no effect.  The system did NOT hang because the interrupt lock
out did not end.  The file system corruption showed up again
on applying a subsequent update.  I've also made a slightly
more venturesome change.  I pulled the disable at the head of
the routine and put it just before the setup call and moved
the enable to after the setup call.  I've seen no problems
with this variant, but I might not see any with the hardware
I have.  A machine with more than one IDE interface on the
same IRQ line might show a problem, particularly if it was a
multi-processor running SMP.  (I have an SMP machine, but not
one with three IDE controllers.)

I haven't dug around the Linux kernel as much as I probably
should have.  (That does not mean I am not familiar with
kernels.  I did a lot of digging through PDP 11 and VAX
operating system code including RSX-11 A, B, D and M, 
RSTS-11, RT-11, DOS-11 and VMS plus some bare metal [paper 
tape loader!] applications.)  I can see that the setup 
process for an IDE controller could be fairly lengthy and 
that interrupt latency could become a problem, and enabling
local interrupts would relieve the problem, but not at the
cost of corrupting system integrity?

Could someone else fine tooth this?  Just to make sure I'm
not missing something major (or minor)...

1) Modify the FC1 installation environment so that the
   correct lockout sequence is used.  This will require that
   I build a new boot floppy.

2) Do another installation without using the system load
   trick.  This should be fairly definitive since I have not
   be able to do a complete system load without problem in
   half a dozen attempts.

3) Do the updates without the system load trick.

Finally, if this proves to be an acceptable cure to my
problem,

A) Does there need to be a way to turn this fix off when
   it is not needed? (and how do you tell if it is needed?
   boot command line option?  Blacklist?)

B) Should it be included in the standard distribution?

C) What is the procedure for getting it included?

I've been going through the linux-ide archives and noticed
that there have been a number of mystery fs corruption issues
that just disappeared.  This might be related.  There was also
a DMA problem that might have been relevant, but I know it does
not apply in this case since "hdparm" shows DMA turned off by
default on this machine.

Max T.E. Woodbury
max@mtew.isa-geek.net

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: ide-io.c, ide_do_request -- race condition?
@ 2004-07-11 14:38 Max T. Woodbury
  0 siblings, 0 replies; 16+ messages in thread
From: Max T. Woodbury @ 2004-07-11 14:38 UTC (permalink / raw)
  To: linux-ide

Bartlomiej Zolnierkiewicz wrote:
> On Saturday 10 of July 2004 21:25, Max T. Woodbury wrote:
> > Bartlomiej Zolnierkiewicz wrote:
> > > On Tuesday 06 of July 2004 00:51, Max T. Woodbury wrote:
> > > > (The fact that the machine runs other OSs without noticeable
> > > > problems is also an indication that the underlying hardware
> > > > is in working order.  Only the system software and disk
> > > > drive changed between the two setups and I have explained
> > > > why I do not think it is the disk drive.)
> > >
> > > disk drive changed?  please explain
> >
> > The drives in the Thinkpad 760 are mounted in caddies that can
> > be easily exchanged when the power is off.  I have three drives.
> > One runs the machine as a GPS, the second as a code development
> > Windows box and the third is my Linux code development machine.
> > I'm having a fair amount of trouble getting the Linux setup to do
> > what I want it to do.  Not only did the Linux install flake out,
> > but I still can't get the PCMCIA sockets working, but that's another
> > issue for another list and I haven't quite got enough information
> > on that set of problems to make a request for help useful...  In
> > order to get to the internet with Linux I have to use its docking
> > station.  No such problem with Windoze.  (Yeah, absolutely
> > disgusting but that's what's happening.)
>
> Are you sure that 'Linux' disk is okay?
> http://smartmontool.sf.net

Yes.  The CDs all passed the scanner RedHat provided and they installed
correctly on a desktop system. The drive I used for linux was zeroed
and it passed a read/write badblock scan on the whole disk and several
on the partition in question.  Still, it could be that the drive is
getting the data into its buffer correctly but not out to the magnetic
media, but the fact that the errors move around so much indicates that
it is not the media itself that is bad.  My next comment is on the
fringes of my competence and is mostly opinion:  If the drive is
failing to actually transfer the data from its buffer to the media,
there should be an error indication of some sort and it should be passed
back to the OS where it can be logged.  I didn't see any such code, but
I'm not sure I'd recognize it if I saw it.

I do wish there were a write check option.  It does nasty things to
performance, but improves system integrity.  Of course I haven't checked
the upper layers in the system yet.  There might be such an option at the
block cache or file system levels and I'd not have seen it yet. It's
been about four years since I went through that code last so I'm foggy on
the details and it has almost certainly changed since I looked at it.

> > > __cli() there is just "paranoia" and it is gone in 2.6 kernels
> >
> > That's not quite correct.  There is a check and a BUG() call to assure
> > that interrupts are disabled on entry in the 2.6 code I've seen.  If I
> > understand the new code correctly, you've replaced the single interrupt
> > disable call at the top of this routine by a bunch of similar calls
> > elsewhere before entering this routine.  That would make interrupt latency
> > worse, not better.
>
> This is not correct - __cli() is really just a "paranoia", you may remove
> it if you like and it shouldn't change anything (but we would like to know
> if it changes something ie. fixes fs corruption :-).

Hmm. At the place it was, it was harmless paranoia, but I thought you might
not have been being paranoid enough.  You should not have taken the lockout
off until after the command was actually started.  But doing that was not
enough to fix the problem...

I did a quick scan of the IRQ level code in the arch/i386 tree to get an idea o
f what it did and recognized the pattern from other OSs I've seen, so I did not
look at all the details closely.  I think I saw CLI and STI pairs that might
have nulled the CLI in ide-io.c in there, but I did not try to figure exactly
which conditionals were on or off so I was not certain of the exact set of code
that was operational.  As I said, I was scanning for an overall pattern.

> Please take a look at generic_unplug_device() in drivers/block/ll_rw_blk.c:

If I remember correctly, that's the block cache layer.  The last time I looked
at it, it didn't do hot-plug, so I need to look at it again...  I'll do as you
suggest.

> > > > bunch of code has been executing under interrupt lockout when
> > > > there was no need for the lockout.  Not a huge problem, just
> > > > strange.  Also, in 2.6, the lockout has to begin before the
> > > > routine is called which is why I said 2.6 was worse.
> > >
> > > 2.6 is much better - you have one spinlock per block queue while
> > > in 2.4 you have one global spinlock (io_request_lock) for all
> > > block requests.
> >
> > Yep.  That's a little courser than the model I was using on the never
> > completed OS design I did in the early 70s, but it is better than the
> > single global lock in 2.4 and way better than the design of many other
> > OSs I've waded into.  Still, you've got a complete interrupt lockout
> > in place at the top of this routine which has two bad effects: 1) the
> > interrupt latency is longer and 2) there is no one place to turn it off
> > any longer.
>
> 'lockout' happens earlier both in 2.4 and 2.6 -> generic_unplug_device().

Hmm.  I'll have to look at that layer again.  I think we're talking about
a different level of lockout.  I suspect that you mean that this is a
critical region of code and only one thread of execution can be in it at
one time.  I was thinking 'atomic' in the sense that it could not be
interrupted and that its timing was consistent.  I'm not sure I am using
the term in the same way everyone else does.

> > > > I've been going through the linux-ide archives and noticed
> > > > that there have been a number of mystery fs corruption issues
> > > > that just disappeared.  This might be related.  There was also
> > > > a DMA problem that might have been relevant, but I know it does
> > > > not apply in this case since "hdparm" shows DMA turned off by
> > > > default on this machine.
> > >
> > > dmesg output would be helpful, the same goes for lspci output
> >
> > That is an important part of this issue.  Nothing shows in dmesg
> > until it is much too late.  The read errors get reported, but no
> > write errors.  There should be a 'pirntk' in 'ide_abort' and
> > 'idedisk_abort' (I may have the routine names wrong, I'm doing
> > this from memory) but there isn't,  (I'll post a patch for that fix
> > if you want.) so I can't tell if the problem is coming down from
> > the upper layers.  I also think there should be a 'printk' associated
> > with the posting of the immediate stop command.  (Again, this is from
> > memory.  I'll post a patch with all this if you want me to.  It will
> > not fix any problems, but might shed light.)
>
> I believe that dmesg/lspci would be useful for me or other people reading
> this because it allows us to know a bit more about this specific hardware
> ('Thinkpad 760' is really not enough).

Yep. lspci will be OS version independent.  I'll get you that shortly.
(I booted the original FC1 distribution in an attempt to get the PCMCIA
stuff to work and it promptly (well after 12 hours) messed up the file
system.)  I'm in the process of cleaning up the disk again.

'dmesg' is somewhat dependent on the OS features configured.  I'll get
you the one from the installation boot disk where the problem was first
noticed.  I take it you want the initial boot part, not the part after
rc.sysinit or whatever is driving the installation process takes over.
It also makes a difference if the machine is docked or undocked.  I'll
get you one of the undocked machine.  Both configurations have the problem
but the undocked configuration is simpler.

 > > Still, this is an important problem.  File system corruption is just not
> > something an OS should allow to happen unless the user does something
> > extreme.
>
> Without more info we won't go further in solving this issue.

I really don't expect you to solve the problem for me.  I have to do
most of the work myself.  I'd recognized that I was on the borders of my
competence and have asked for advice.  I recognized that I was making
assumptions about the code that I needed to check.  While you haven't
exactly answered the questions I asked, you have responded in a way that
provides most of the information I need and pointed me to more places to
look for answers.

There was also the possibility that this problem has an impact beyond my
personal involvement.  I'm not certain that it does, but it would be
irresponsible of me to ignore such a possibility.  (And I just arrogant
enough that I think I have such a responsibility...)

Thank you for your help so far.

Max

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: ide-io.c, ide_do_request -- race condition?
@ 2004-07-12 17:52 Max T. Woodbury
  2004-07-12 18:35 ` Eric D. Mudama
  0 siblings, 1 reply; 16+ messages in thread
From: Max T. Woodbury @ 2004-07-12 17:52 UTC (permalink / raw)
  To: linux-ide

Bartlomiej Zolnierkiewicz wrote:
> 
> On Monday 12 of July 2004 17:15, Max T. Woodbury wrote:
> > PIIXa: IDE controller at PCI slot 00:01.0
> > PIIXa: chipset revision 2
> > PIIXa: not 100% native mode: will probe irqs later
> > PIIXa: neither IDE port enabled (BIOS)
> 
> Do you have IDE ports disabled in BIOS?
> 
> This prevents piix IDE driver from working and instead you are using
> generic IDE driver which is much slower (no DMA!) and may be unsafe.
> 
> Bartlomiej

The Thinkpad 760 BIOS setup does not make this easy.  It's this pretty
GUI thingy with a humming bird for a cursor and practically no text
anywhere.  Very international.  There's a more comprehensive control
program under 'doz.  I'll try to find something on the thinkpad lists...
I don't remember seeing anything in the user docs on this.

Still, why would PIO mode be unsafe?  (I can see slower, but I don't
expect speed from this beast.  Oh well.  Thanks for the pointer.)

Hmm.  Going to DMA would almost certainly make the symptoms disappear
because the I/O timing would change and whatever was screwing up the
I/O would happen at a non-critical point.  Much like the fact that I
could get the symptoms to disappear by by putting a computational load
on the machine.  It would NOT actually solve the underlying problem,
whatever that problem really is.

Max

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2004-07-16 17:58 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-07-05 22:51 ide-io.c, ide_do_request -- race condition? Max T. Woodbury
2004-07-07 19:43 ` Bartlomiej Zolnierkiewicz
2004-07-10 19:25   ` Max T. Woodbury
2004-07-10 20:07     ` Bartlomiej Zolnierkiewicz
2004-07-11 15:02       ` Max T. Woodbury
2004-07-12 15:15       ` Max T. Woodbury
2004-07-12 15:47         ` Bartlomiej Zolnierkiewicz
  -- strict thread matches above, loose matches on Subject: below --
2004-07-11 14:38 Max T. Woodbury
2004-07-12 17:52 Max T. Woodbury
2004-07-12 18:35 ` Eric D. Mudama
2004-07-16  6:12   ` Max T. Woodbury
2004-07-16  7:02     ` Jens Axboe
2004-07-16 16:33       ` Max T. Woodbury
2004-07-16 17:57         ` Jens Axboe
2004-07-16  7:06     ` Jeff Garzik
2004-07-16 17:45     ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).