public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Device Driver Etiquette
@ 2007-06-01 17:47 Matthew Fredrickson
  2007-06-01 18:38 ` Lee Revell
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Matthew Fredrickson @ 2007-06-01 17:47 UTC (permalink / raw)
  To: linux-kernel

Greetings,

I maintain a device driver that has been bitten by the transition to 4K 
stacks.  It is a T1/E1 line interface PCI card driver that is 
maintained outside of the kernel, although is used by a significant 
number of people.  The card has a part for doing echo cancellation, but 
it is accessed through a vendor API which (when we received it) was 
quite stack heavy.  It used the stack for a number of large data 
structures, although I have moved a great deal of them (particularly 
the larger ones) onto the heap instead of the stack.  Although this has 
reduced stack usage to the point where it is usable within 4K stacks, 
on some code paths, it can still use quite a bit of stack space (though 
under 4K) for local variables and a handful of small data structures.  
The problem is that in order to initialize and use the echo canceler, 
there is a firmware load portion which takes a noticeable period of 
time (~5-10 seconds).  That is done through this stack heavy portion of 
the code.

The problem that I am seeing is that while this card is loading its 
firmware, there are other device interrupts that occur, and if enough 
levels of interrupt happen, it overflows the stack.  Oh, the firmware 
load occurs in the context of an ioctl, FWIW.

My solution was to disable interrupts while the firmware was being 
loaded using local_irq_save/local_irq_restore, although that seems to 
be a naughty thing to do, especially given the long period of time it 
takes to load the firmware on the card.

My question is this: is there a way to either work around the problem I 
am seeing with the stack without recompiling the kernel with 8K stack 
size or without disabling irqs for such a long period of time (which I 
think is not a nice thing to do either) OR is it acceptable (although 
not nice) to simply fix it this way, by disabling irqs while it loads 
the firmware?

Thanks,
Matthew Fredrickson
Kernel Developer
Digium, Inc.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Device Driver Etiquette
  2007-06-01 17:47 Device Driver Etiquette Matthew Fredrickson
@ 2007-06-01 18:38 ` Lee Revell
  2007-06-01 22:28   ` H. Peter Anvin
  2007-06-02 15:05 ` Satyam Sharma
  2007-06-02 23:22 ` Arjan van de Ven
  2 siblings, 1 reply; 10+ messages in thread
From: Lee Revell @ 2007-06-01 18:38 UTC (permalink / raw)
  To: Matthew Fredrickson; +Cc: linux-kernel

On 6/1/07, Matthew Fredrickson <creslin@digium.com> wrote:
> is it acceptable (although
> not nice) to simply fix it this way, by disabling irqs while it loads
> the firmware?
>

I would say to just disable IRQs while loading firmware.  Almost every
server I maintain has some vendor driver which generates a "many lost
ticks!" message on load.  As long as it's only done at module load
time it should be fine.

Of course the best solution is to just get the driver into mainline.

Lee

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Device Driver Etiquette
  2007-06-01 18:38 ` Lee Revell
@ 2007-06-01 22:28   ` H. Peter Anvin
  0 siblings, 0 replies; 10+ messages in thread
From: H. Peter Anvin @ 2007-06-01 22:28 UTC (permalink / raw)
  To: Lee Revell; +Cc: Matthew Fredrickson, linux-kernel

Lee Revell wrote:
> On 6/1/07, Matthew Fredrickson <creslin@digium.com> wrote:
>> is it acceptable (although
>> not nice) to simply fix it this way, by disabling irqs while it loads
>> the firmware?
>>
> 
> I would say to just disable IRQs while loading firmware.  Almost every
> server I maintain has some vendor driver which generates a "many lost
> ticks!" message on load.  As long as it's only done at module load
> time it should be fine.
> 
> Of course the best solution is to just get the driver into mainline.
> 

Disable interrupts for 5-10 seconds (see OP)?  You're nuts.

That might be only a minor disaster on an SMP system, assuming IRQ
balancing sends them elsewhere, but on a uniprocessor system you're
effectively shooting the machine in the head.

It should be possible to streamline the code to not hog the stack that way.

	-hpa

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Device Driver Etiquette
@ 2007-06-01 23:16 Daniel J Blueman
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel J Blueman @ 2007-06-01 23:16 UTC (permalink / raw)
  To: Matthew Fredrickson; +Cc: Lee Revell, Linux Kernel

On 1 Jun, 19:40, "Lee Revell" <rlrevell@joe-job.com> wrote:
> On 6/1/07, Matthew Fredrickson <creslin@digium.com> wrote:
>
> > is it acceptable (although
> > not nice) to simply fix it this way, by disabling irqs while it loads
> > the firmware?
>
> I would say to just disable IRQs while loading firmware.  Almost every
> server I maintain has some vendor driver which generates a "many lost
> ticks!" message on load.  As long as it's only done at module load
> time it should be fine.

For anything ~10s or more, you'll probably also need to call the timer
update function to prevent soft lockup warning being generated.

> Of course the best solution is to just get the driver into mainline.
>
> Lee
-- 
Daniel J Blueman

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Device Driver Etiquette
  2007-06-01 17:47 Device Driver Etiquette Matthew Fredrickson
  2007-06-01 18:38 ` Lee Revell
@ 2007-06-02 15:05 ` Satyam Sharma
  2007-06-02 23:22 ` Arjan van de Ven
  2 siblings, 0 replies; 10+ messages in thread
From: Satyam Sharma @ 2007-06-02 15:05 UTC (permalink / raw)
  To: Matthew Fredrickson; +Cc: linux-kernel

Hi Matthew,

On 6/1/07, Matthew Fredrickson <creslin@digium.com> wrote:
> Greetings,
>
> I maintain a device driver that has been bitten by the transition to 4K
> stacks.  It is a T1/E1 line interface PCI card driver that is
> maintained outside of the kernel, although is used by a significant
> number of people.  The card has a part for doing echo cancellation, but
> it is accessed through a vendor API which (when we received it) was
> quite stack heavy.  It used the stack for a number of large data
> structures, although I have moved a great deal of them (particularly
> the larger ones) onto the heap instead of the stack.  Although this has
> reduced stack usage to the point where it is usable within 4K stacks,
> on some code paths, it can still use quite a bit of stack space (though
> under 4K) for local variables and a handful of small data structures.
> The problem is that in order to initialize and use the echo canceler,
> there is a firmware load portion which takes a noticeable period of
> time (~5-10 seconds).  That is done through this stack heavy portion of
> the code.

As Peter suggested earlier, the best strategy would be to
cleanup the code and make it stack-light in the first place ...

> My question is this: is there a way to either work around the problem I
> am seeing with the stack without recompiling the kernel with 8K stack
> size or without disabling irqs for such a long period of time (which I
> think is not a nice thing to do either)

There are standard mechanisms to push off long duration or
sleep-capable work from interrupts-disabled contexts to user
contexts. But those contexts can also, obviously, be interrupted,
so if you're stack-heavy and suffer a concurrent interrupt (which
you most definitely will in any codepath this long), there's always
the possibility of stack overflowing. I don't really see a better
solution to this than using bigger stacks or reducing your usage
of the same, but of course, better suggestions would be difficult
to give without looking at the code in question.

> OR is it acceptable (although
> not nice) to simply fix it this way, by disabling irqs while it loads
> the firmware?

IMHO, *not*. You could try this, in any case, and learn it the
hard way :-)

Satyam

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Device Driver Etiquette
       [not found] <12467610.159201180797054019.JavaMail.root@jupiler.digium.com>
@ 2007-06-02 15:12 ` Matt Fredrickson
  2007-06-03 11:22   ` Daniel J Blueman
  0 siblings, 1 reply; 10+ messages in thread
From: Matt Fredrickson @ 2007-06-02 15:12 UTC (permalink / raw)
  To: Daniel J Blueman; +Cc: Lee Revell, Linux Kernel, hpa


----- "Daniel J Blueman" <daniel.blueman@gmail.com> wrote:
> On 1 Jun, 19:40, "Lee Revell" <rlrevell@joe-job.com> wrote:
> > On 6/1/07, Matthew Fredrickson <creslin@digium.com> wrote:
> >
> > > is it acceptable (although
> > > not nice) to simply fix it this way, by disabling irqs while it
> loads
> > > the firmware?
> >
> > I would say to just disable IRQs while loading firmware.  Almost
> every
> > server I maintain has some vendor driver which generates a "many
> lost
> > ticks!" message on load.  As long as it's only done at module load
> > time it should be fine.
> 
> For anything ~10s or more, you'll probably also need to call the
> timer
> update function to prevent soft lockup warning being generated.

Ahhh... so there is a way to get rid of that cursed message.  I forgot to mention this in my original message, the only place that I had seen this problem is on a certain machine (Dell 2950) with a certain distribution (FC6) kernel.  I had trimmed the code down to fit in 4K stacks already quite a bit.

I believe what actually made it crash and overflow the stack (the straw that breaks the camels back, so to speak) was the intermittent triggering of the softlockup detector.  I think if I can disable that while the firmware is loading that will fix the stack overflow issues and correct my problem.

Matthew Fredrickson

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Device Driver Etiquette
  2007-06-01 17:47 Device Driver Etiquette Matthew Fredrickson
  2007-06-01 18:38 ` Lee Revell
  2007-06-02 15:05 ` Satyam Sharma
@ 2007-06-02 23:22 ` Arjan van de Ven
  2007-06-03  1:45   ` Satyam Sharma
  2 siblings, 1 reply; 10+ messages in thread
From: Arjan van de Ven @ 2007-06-02 23:22 UTC (permalink / raw)
  To: Matthew Fredrickson; +Cc: linux-kernel

On Fri, 2007-06-01 at 12:47 -0500, Matthew Fredrickson wrote:

> My question is this: is there a way to either work around the problem I 
> am seeing with the stack without recompiling the kernel with 8K stack 
> size or without disabling irqs for such a long period of time (which I 
> think is not a nice thing to do either) OR is it acceptable (although 
> not nice) to simply fix it this way, by disabling irqs while it loads 
> the firmware?

I wonder if you're chasing ghosts; 4K stack kernels have a seperate
stack for interrupts so... those should be safe.

Btw, you forgot to post a pointer to the source code of your driver, so
it's a lot harder for us (read: impossible) to give you good advice..

Greetings,
   Arjan van de Ven
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Device Driver Etiquette
  2007-06-02 23:22 ` Arjan van de Ven
@ 2007-06-03  1:45   ` Satyam Sharma
  0 siblings, 0 replies; 10+ messages in thread
From: Satyam Sharma @ 2007-06-03  1:45 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Matthew Fredrickson, linux-kernel

On 6/3/07, Arjan van de Ven <arjan@infradead.org> wrote:
> On Fri, 2007-06-01 at 12:47 -0500, Matthew Fredrickson wrote:
>
> > My question is this: is there a way to either work around the problem I
> > am seeing with the stack without recompiling the kernel with 8K stack
> > size or without disabling irqs for such a long period of time (which I
> > think is not a nice thing to do either) OR is it acceptable (although
> > not nice) to simply fix it this way, by disabling irqs while it loads
> > the firmware?
>
> I wonder if you're chasing ghosts; 4K stack kernels have a seperate
> stack for interrupts so... those should be safe.

Argh, yes, thanks for reminding/correcting all of us :-)

> Btw, you forgot to post a pointer to the source code of your driver, so
> it's a lot harder for us (read: impossible) to give you good advice..

Right, and dmesg / debug output of when things go wrong is also
needed at the very least to confirm whether or not there is really a
stack overflow issue here in the first place.

Satyam

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Device Driver Etiquette
       [not found] <17916027.160971180851607681.JavaMail.root@jupiler.digium.com>
@ 2007-06-03  6:20 ` Matt Fredrickson
  0 siblings, 0 replies; 10+ messages in thread
From: Matt Fredrickson @ 2007-06-03  6:20 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel


----- "Arjan van de Ven" <arjan@infradead.org> wrote:
> On Fri, 2007-06-01 at 12:47 -0500, Matthew Fredrickson wrote:
> 
> > My question is this: is there a way to either work around the
> problem I 
> > am seeing with the stack without recompiling the kernel with 8K
> stack 
> > size or without disabling irqs for such a long period of time (which
> I 
> > think is not a nice thing to do either) OR is it acceptable
> (although 
> > not nice) to simply fix it this way, by disabling irqs while it
> loads 
> > the firmware?
> 
> I wonder if you're chasing ghosts; 4K stack kernels have a seperate
> stack for interrupts so... those should be safe.
> 
> Btw, you forgot to post a pointer to the source code of your driver,
> so
> it's a lot harder for us (read: impossible) to give you good advice..

As someone mentioned in another post, I believe what is causing this problem is a combination of factors.  The triggering of the softlockup detector seems to be what pushes it over the edge.  I think I will try as suggested to change the timer for it so that it does not trigger while the card is initializing.

If you'd like to see the source, here's a link:
http://svn.digium.com/view/zaptel/trunk/wct4xxp/

I can't give you the stack trace right now, since I'm not in the office, but it begins in the vpm450_init function and goes into the Octasic API functions, in the ChannelOpen routines as well as the chip initialization routines (the ones that load the firmware.

---
Matthew Fredrickson
Kernel Developer
Digium, Inc.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Device Driver Etiquette
  2007-06-02 15:12 ` Matt Fredrickson
@ 2007-06-03 11:22   ` Daniel J Blueman
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel J Blueman @ 2007-06-03 11:22 UTC (permalink / raw)
  To: Matt Fredrickson; +Cc: Lee Revell, Linux Kernel, hpa

On 02/06/07, Matt Fredrickson <creslin@digium.com> wrote:
> ----- "Daniel J Blueman" <daniel.blueman@gmail.com> wrote:
> > On 1 Jun, 19:40, "Lee Revell" <rlrevell@joe-job.com> wrote:
> > > On 6/1/07, Matthew Fredrickson <creslin@digium.com> wrote:
> > >
> > > > is it acceptable (although
> > > > not nice) to simply fix it this way, by disabling irqs while it
> > loads
> > > > the firmware?
> > >
> > > I would say to just disable IRQs while loading firmware.  Almost
> > every
> > > server I maintain has some vendor driver which generates a "many
> > lost
> > > ticks!" message on load.  As long as it's only done at module load
> > > time it should be fine.
> >
> > For anything ~10s or more, you'll probably also need to call the
> > timer
> > update function to prevent soft lockup warning being generated.
>
> Ahhh... so there is a way to get rid of that cursed message.  I forgot to mention this in my original message, the only place that I had seen this problem is on a certain machine (Dell 2950) with a certain distribution (FC6) kernel.  I had trimmed the code down to fit in 4K stacks already quite a bit.
>
> I believe what actually made it crash and overflow the stack (the straw that breaks the
> camels back, so to speak) was the intermittent triggering of the softlockup detector.
>  I think if I can disable that while the firmware is loading that will fix the stack overflow
> issues and correct my problem.

It may not help, since the NMIs used to detect soft lockups will still
be received; the soft lockup update call just updates the per-cpu
timer, but there may be less chance of overrunning the end of the
stack from not getting the stack trace.

There is no substitute for implementing your own firmware uploading
mechanism or getting this one addressed though ;-) .

Dan

> Matthew Fredrickson
-- 
Daniel J Blueman

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2007-06-03 11:22 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-06-01 17:47 Device Driver Etiquette Matthew Fredrickson
2007-06-01 18:38 ` Lee Revell
2007-06-01 22:28   ` H. Peter Anvin
2007-06-02 15:05 ` Satyam Sharma
2007-06-02 23:22 ` Arjan van de Ven
2007-06-03  1:45   ` Satyam Sharma
  -- strict thread matches above, loose matches on Subject: below --
2007-06-01 23:16 Daniel J Blueman
     [not found] <12467610.159201180797054019.JavaMail.root@jupiler.digium.com>
2007-06-02 15:12 ` Matt Fredrickson
2007-06-03 11:22   ` Daniel J Blueman
     [not found] <17916027.160971180851607681.JavaMail.root@jupiler.digium.com>
2007-06-03  6:20 ` Matt Fredrickson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox