Re: Scheduler latency problems when using NAND

public inbox for linux-kernel@vger.kernel.org

* Re: Scheduler latency problems when using NAND
       [not found] <20100929221401.GA32583@postdiluvian.org>
@ 2010-09-30  4:56 ` Artem Bityutskiy
       [not found] ` <4CA3D92E.9060109@call-direct.com.au>
  1 sibling, 0 replies; 5+ messages in thread
From: Artem Bityutskiy @ 2010-09-30  4:56 UTC
  To: Mark Mason; +Cc: linux-mtd, linux-kernel

On Wed, 2010-09-29 at 18:14 -0400, Mark Mason wrote:
> Hi all,
> 
> I hope this is the right place for this question.  I'm having some
> problems with scheduler latency when using UBIFS, and I'm hoping for
> some suggestions.

Hi Mark, this e-mail is not specific to UBIFS, so I suggest you keep
lkml in CC.

I cannot really suggest much. Off the top of my head: try enabling
preemption in your kernel. But in general, it sounds like you actually
need the RT tree. There is also the ftrace latency tracer; try using
it.

> Linux 2.6.29-6, with a newer MTD, dating from probably around six
> months ago.  Embedded PowerPC 8315, with built-in NAND controller,
> using nand/fsl_elbc_nand.c.  NAND is a Samsung K9WAG08U1B two-die
> stack (one package with two chip selects), 2Gbyte x 8 bit.  The system
> has plenty of memory, but is short on CPU.
> 
> The application is storing streaming video, almost entirely large
> sequential files, roughly 250K to 15M, to a 1.6G filesystem.  There's
> no seeking or rewriting, just creat, write, close, repeat.  No
> compression is used on the filesystem.
> 
> The problem I'm seeing is excessively large scheduler latency when
> data is flushed to NAND.
> 
> Originally this had been happening during erases.  I noticed that
> hundreds of erases (up to around 700) were being issued in rapid
> succession, and I was seeing other threads unable to run for sometimes
> as much as the expected 7 seconds (I measured 1.1 ms per erase).  To
> address this, I split the erase command in two halves - FIR_OP_CM0 |
> FIR_OP_PA | FIR_OP_CM2 and FIR_OP_CW1 | FIR_OP_RS - with schedule()
> called in between.  This had the effect of issuing the erase, calling
> schedule(), then waiting for the erase to complete if it hadn't
> already, but usually it had.
> 
> I'm surprised this helped so much, since the calling thread should
> have been put to sleep for the duration of the erase by the call to
> wait_event_timeout(), but it definitely did - I guess it was the
> explicit schedule().
> 
> The erases are no longer a significant bottleneck, but now the writes
> are.  A page program takes 200us, which seems too short for an
> explicit schedule(), and I am seeing periods with the busy line
> asserted in back-to-back 200us chunks for most of a second.
> 
> I have played with thread priorities a bit, but I wound up with too
> many threads being "most important".  There is some hardware that
> can't tolerate large latencies, and unfortunately the existing code
> base doesn't have enough separation between critical and non-critical
> tasks to allow us to run just the critical stuff at a higher priority.
> 
> On average, the system can keep up with the load, but it has problems
> with the burstiness of the flushes to NAND, so I'm hoping for some
> ideas to smooth the traffic out, or even a totally different way to
> approach the problem.  I tried lowering the priority of the UBI
> background thread, the failure mode there is pretty obvious.  I tried
> lowering dirty_writeback_centisecs, that helped a little bit, but not
> enough, and there's also a SATA drive, although a smaller commit
> interval probably wouldn't bother it since the traffic is similar.
> 
> I'm contemplating something along the lines of a smaller commit
> interval, an even higher background thread priority, and a sleep with
> a schedule during the page program, but that many extra context
> switches are liable to be a problem - there's no L2 cache on this CPU,
> so context switches are extra expensive.
> 
> Does anyone have any suggestions, ideas, hints, advice, etc?
> 
> Thanks!

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)



[parent not found: <4CA3D92E.9060109@call-direct.com.au>]

* Re: Scheduler latency problems when using NAND
       [not found] ` <4CA3D92E.9060109@call-direct.com.au>
@ 2010-10-09 17:42   ` Mark Mason
  2010-10-10  7:56     ` Joakim Tjernlund
  2010-10-11 22:54     ` Iwo Mergler
  0 siblings, 2 replies; 5+ messages in thread
From: Mark Mason @ 2010-10-09 17:42 UTC
  To: Iwo Mergler; +Cc: linux-mtd, linux-kernel

Iwo Mergler <iwo@call-direct.com.au> wrote:

> Mark Mason wrote:
> > Hi all,
> > 
> > I hope this is the right place for this question.  I'm having some
> > problems with scheduler latency when using UBIFS, and I'm hoping for
> > some suggestions.
> <snip>
> > The application is storing streaming video, almost entirely large
> > sequential files, roughly 250K to 15M, to a 1.6G filesystem.  There's
> > no seeking or rewriting, just creat, write, close, repeat.  No
> > compression is used on the filesystem.
> > 
> > The problem I'm seeing is excessively large scheduler latency when
> > data is flushed to NAND.
> <snip>
> > Does anyone have any suggestions, ideas, hints, advice, etc?
> 
> The Linux block cache is optimised for mechanical hard drives,
> to minimise seek times. Some of the assumptions don't make much
> sense with FLASH and streaming storage.
> 
> Maybe try to flush data whenever you have written a few blocks'
> worth. Or have a look at the O_DIRECT flag (or madvise), although
> I don't know how it interacts with UBIFS.

I tried lowering dirty_writeback_centisecs and dirty_expire_centisecs.
The latency dropped when I used values around 1 or 2 (down from 500 and
3000), but it's still a problem.
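
For reference, the "flush whenever you have written a few blocks' worth"
idea can be sketched from userspace roughly as follows.  This is only a
sketch under assumptions: write_chunked() and flush_every are made-up
names, fdatasync() is just one possible flush primitive, and how UBIFS
behaves under this pattern has not been checked here.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to fd, calling fdatasync() each time at least
 * flush_every bytes have been written since the last flush, so dirty data
 * is pushed out in small steps instead of one large burst.
 * Returns 0 on success, -1 on error (errno is set).
 * write_chunked and flush_every are hypothetical names for illustration.
 */
int write_chunked(int fd, const char *buf, size_t len, size_t flush_every)
{
    size_t done = 0;
    size_t unflushed = 0;

    while (done < len) {
        size_t want = len - done;
        if (flush_every && want > flush_every)
            want = flush_every;

        ssize_t n = write(fd, buf + done, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        if (n == 0) {
            errno = EIO;        /* no progress: give up rather than spin */
            return -1;
        }
        done += (size_t)n;
        unflushed += (size_t)n;

        if (flush_every && unflushed >= flush_every) {
            if (fdatasync(fd) < 0)
                return -1;
            unflushed = 0;
        }
    }

    /* Flush the tail so nothing is left over for a later burst. */
    if (unflushed && fdatasync(fd) < 0)
        return -1;
    return 0;
}
```

A caller would pick flush_every as a small multiple of the NAND erase
block or page size; the right value is something to measure, not assume.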

> You could use a real filesystem to store the metadata for your
> circular storage partition (file name, length, offset).
> 
> Maybe use raw UBI so you don't have to worry about bad blocks.
> 
> Either way, the time to erase a block and write a single page
> is predictable and you can do it as soon as you get the data.

A custom filesystem would be good, but I still hold out hope that
somebody has already fought this battle for me.

Regardless, it looks like I have a genuine hardware problem on my
hands, and it's one that I would expect other people to have, although
I suspect it wouldn't be an issue with reasonable flash loads.

The flash driver (fsl_elbc_nand.c) goes to sleep right after it issues
a page program, and a context switch to another high priority thread
takes place promptly.  This thread is often one that reads from
another (video) chip on the same bus as the flash (the MPC8315 LBC).
The flash asserts its BUSY line while the page program is in
operation.  When the other thread comes along to read from the video chip,
it's held off for the 200us duration of the page program (the LBC
controller for the video chip is running in UPM mode, so the BUSY line
is a BUSY line and not a TA line, in case any 83xx junkies are reading
this).

What I see on a logic analyzer is the BUSY line held by the flash for
200us, a single 32 bit read of the video chip (broken up into two 16
bit reads for the 16 bit bus), then another 200us BUSY from the flash,
two more 16 bit reads, etc, all the way to the end of the logic
analyzer screen.
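
To put a number on that trace: if every 32 bit read has to wait out one
200us page program first, the video chip is limited to about 5000 reads
per second, or roughly 20 kB/s.  A small sketch of the arithmetic (the
function name is made up, and the time taken by the read itself is
ignored):

```c
#include <stdint.h>

/*
 * Upper bound on bytes per second when every bus read of bytes_per_read
 * bytes first has to wait out one BUSY period of busy_us microseconds.
 * Hypothetical helper; the duration of the read itself is ignored.
 * Returns 0 when busy_us is 0, since this model gives no bound then.
 */
uint64_t stalled_read_bytes_per_sec(uint32_t bytes_per_read, uint32_t busy_us)
{
    if (busy_us == 0)
        return 0;
    return (uint64_t)bytes_per_read * 1000000u / busy_us;
}
```

With the values from the trace, stalled_read_bytes_per_sec(4, 200)
evaluates to 20000.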

What I think is happening is that the flash background thread is
running very efficiently - it comes in, issues a page program, and
relinquishes the CPU.  The thread reading the video chip then runs,
stalls for 200us waiting for a single read, gets its read, then is
preempted for the flash BGT.

My guess is that the scheduler sees the flash background thread as
running almost not at all, and the video thread as running a lot more,
although it's stalled for most of the first 200us of its time slice on
a single bus transaction, so it can't really do anything with its time
slice.  I further suspect that the scheduler is dynamically adjusting
the priorities to boost the flash BGT, since it's using much less CPU
time than the video thread, even though the video thread can't use
most of its time slice.  Can someone tell me if this makes sense?

I tried some messing with priorities, but ultimately the flash has to
run, and it has to run frequently in very short bursts to issue all of
the page programs.  The video chip needs to be serviced promptly, so
there is always a significant chance that it will run right after a
page program is issued.

I tried disabling preemption for the duration of a transfer in the
video driver so it wouldn't get preempted once it had waited its 200us
to get the bus.  It helped, but ultimately there's still a 200us delay
to perform a sequence of operations that usually take somewhere
between 5 and 30 us.

Most devices that would sit on the bus will only assert the BUSY line
while the chip select is held.  The NAND, however, asserts BUSY for
the duration of a page program.  So I could add a gate to the BUSY
line to gate it with the chip select.  This would require a respin of
the board, and I'm not 100% certain that it wouldn't confuse the
8315's NAND controller.  I'm going to try this next week.  I could
also take the bus controller out of flash (FCM) mode, depop the
resistors to the NAND's BUSY line, and have the NAND layer talk to the
NAND like it's just a chip on a plain old bus, and poll for BUSY.
This option might be better than it sounds, since the delays are very
predictable.
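
The polling variant could look something like the sketch below.  It is
not taken from fsl_elbc_nand.c or any other real driver:
nand_wait_ready_polled(), the busy callback and the poll budget are all
hypothetical, and the fake device exists only so the loop can be
exercised in userspace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true while the device still reports busy (hypothetical hook). */
typedef bool (*busy_fn)(void *ctx);

/*
 * Poll is_busy at most max_polls times.
 * Returns the number of busy polls seen before the device became ready,
 * or -1 if it was still busy after max_polls polls.
 */
long nand_wait_ready_polled(busy_fn is_busy, void *ctx, long max_polls)
{
    for (long i = 0; i < max_polls; i++) {
        if (!is_busy(ctx))
            return i;
        /* A real driver would delay here, e.g. udelay() or cpu_relax(). */
    }
    return -1;
}

/* Fake device for demonstration: reports busy for *ctx more polls. */
bool fake_busy(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return true;
    }
    return false;
}
```

Because a page program takes a fairly constant 200us, the poll budget
can be sized from the datasheet maximum plus some margin, and a -1
return can be treated as a device error.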

Any words of wisdom?


* Re: Scheduler latency problems when using NAND
  2010-10-09 17:42   ` Mark Mason
@ 2010-10-10  7:56     ` Joakim Tjernlund
  2010-10-11 22:54     ` Iwo Mergler
  1 sibling, 0 replies; 5+ messages in thread
From: Joakim Tjernlund @ 2010-10-10  7:56 UTC
  To: Mark Mason; +Cc: Iwo Mergler, linux-kernel, linux-mtd


>
> Iwo Mergler <iwo@call-direct.com.au> wrote:
>
> > Mark Mason wrote:
> > > Hi all,
> > >
> > > I hope this is the right place for this question.  I'm having some
> > > problems with scheduler latency when using UBIFS, and I'm hoping for
> > > some suggestions.
> > <snip>
> > > The application is storing streaming video, almost entirely large
> > > sequential files, roughly 250K to 15M, to a 1.6G filesystem.  There's
> > > no seeking or rewriting, just creat, write, close, repeat.  No
> > > compression is used on the filesystem.
> > >
> > > The problem I'm seeing is excessively large scheduler latency when
> > > data is flushed to NAND.
> > <snip>
> > > Does anyone have any suggestions, ideas, hints, advice, etc?
> >
> > The Linux block cache is optimised for mechanical hard drives,
> > to minimise seek times. Some of the assumptions don't make much
> > sense with FLASH and streaming storage.
> >
> > Maybe try to flush data whenever you have written a few blocks'
> > worth. Or have a look at the O_DIRECT flag (or madvise), although
> > I don't know how it interacts with UBIFS.
>
> I tried lowering dirty_writeback_centisecs and dirty_expire_centisecs.
> The latency dropped when I used values around 1 or 2 (down from 500 and
> 3000), but it's still a problem.
>
> > You could use a real filesystem to store the metadata for your
> > circular storage partition (file name, length, offset).
> >
> > Maybe use raw UBI so you don't have to worry about bad blocks.
> >
> > Either way, the time to erase a block and write a single page
> > is predictable and you can do it as soon as you get the data.
>
> A custom filesystem would be good, but I still hold out hope that
> somebody has already fought this battle for me.
>
> Regardless, it looks like I have a genuine hardware problem on my
> hands, and it's one that I would expect other people to have, although
> I suspect it wouldn't be an issue with reasonable flash loads.
>
> The flash driver (fsl_elbc_nand.c) goes to sleep right after it issues
> a page program, and a context switch to another high priority thread
> takes place promptly.  This thread is often one that reads from
> another (video) chip on the same bus as the flash (the MPC8315 LBC).
> The flash asserts its BUSY line while the page program is in
> operation.  When the other thread comes along to read from the video chip,
> it's held off for the 200us duration of the page program (the LBC
> controller for the video chip is running in UPM mode, so the BUSY line
> is a BUSY line and not a TA line, in case any 83xx junkies are reading
> this).
>
> What I see on a logic analyzer is the BUSY line held by the flash for
> 200us, a single 32 bit read of the video chip (broken up into two 16
> bit reads for the 16 bit bus), then another 200us BUSY from the flash,
> two more 16 bit reads, etc, all the way to the end of the logic
> analyzer screen.
>
> What I think is happening is that the flash background thread is
> running very efficiently - it comes in, issues a page program, and
> relinquishes the CPU.  The thread reading the video chip then runs,
> stalls for 200us waiting for a single read, gets its read, then is
> preempted for the flash BGT.
>
> My guess is that the scheduler sees the flash background thread as
> running almost not at all, and the video thread as running a lot more,
> although it's stalled for most of the first 200us of its time slice on
> a single bus transaction, so it can't really do anything with its time
> slice.  I further suspect that the scheduler is dynamically adjusting
> the priorities to boost the flash BGT, since it's using much less CPU
> time than the video thread, even though the video thread can't use
> most of its time slice.  Can someone tell me if this makes sense?
>
> I tried some messing with priorities, but ultimately the flash has to
> run, and it has to run frequently in very short bursts to issue all of
> the page programs.  The video chip needs to be serviced promptly, so
> there is always a significant chance that it will run right after a
> page program is issued.
>
> I tried disabling preemption for the duration of a transfer in the
> video driver so it wouldn't get preempted once it had waited its 200us
> to get the bus.  It helped, but ultimately there's still a 200us delay
> to perform a sequence of operations that usually take somewhere
> between 5 and 30 us.
>
> Most devices that would sit on the bus will only assert the BUSY line
> while the chip select is held.  The NAND, however, asserts BUSY for
> the duration of a page program.  So I could add a gate to the BUSY
> line to gate it with the chip select.  This would require a respin of
> the board, and I'm not 100% certain that it wouldn't confuse the
> 8315's NAND controller.  I'm going to try this next week.  I could
> also take the bus controller out of flash (FCM) mode, depop the
> resistors to the NAND's BUSY line, and have the NAND layer talk to the
> NAND like it's just a chip on a plain old bus, and poll for BUSY.
> This option might be better than it sounds, since the delays are very
> predictable.
>
> Any words of wisdom?

Freescale mentioned last time they visited us that there is a bus
problem with the NAND controller and it sounds similar to what
you are describing. They mentioned a workaround too so you
should contact them for details, I really can't say much more.

 Jocke



* Re: Scheduler latency problems when using NAND
  2010-10-09 17:42   ` Mark Mason
  2010-10-10  7:56     ` Joakim Tjernlund
@ 2010-10-11 22:54     ` Iwo Mergler
  2010-10-12 16:05       ` Mark Mason
  1 sibling, 1 reply; 5+ messages in thread
From: Iwo Mergler @ 2010-10-11 22:54 UTC
  To: Mark Mason; +Cc: linux-mtd, linux-kernel

Mark Mason wrote:
> The flash driver (fsl_elbc_nand.c) goes to sleep right after it issues
> a page program, and a context switch to another high priority thread
> takes place promptly.  This thread is often one that reads from
> another (video) chip on the same bus as the flash (the MPC8315 LBC).
> The flash asserts its BUSY line while the page program is in
> operation.  When the other thread comes along to read from the video chip,
> it's held off for the 200us duration of the page program (the LBC
> controller for the video chip is running in UPM mode, so the BUSY line
> is a BUSY line and not a TA line, in case any 83xx junkies are reading
> this).
> 
> What I see on a logic analyzer is the BUSY line held by the flash for
> 200us, a single 32 bit read of the video chip (broken up into two 16
> bit reads for the 16 bit bus), then another 200us BUSY from the flash,
> two more 16 bit reads, etc, all the way to the end of the logic
> analyzer screen.

I don't know your controller, but I'm surprised that the FLASH write
can stall the bus like that. It seems a high price to pay for not
having to implement some FLASH write interrupt or wait. Are you sure
that this is the recommended way to connect a FLASH?

As you said, it looks like it may be worthwhile to abandon the NAND
controller and implement a software driver via an SRAM bus mode.

Maybe even to the extent of reading the video controller *while*
polling the BUSY line. You could wake up the FLASH thread from the
video driver when the write is done.

> 
> What I think is happening is that the flash background thread is
> running very efficiently - it comes in, issues a page program, and
> relinquishes the CPU.  The thread reading the video chip then runs,
> stalls for 200us waiting for a single read, gets its read, then is
> preempted for the flash BGT.
> 
> My guess is that the scheduler sees the flash background thread as
> running almost not at all, and the video thread as running a lot more,
> although it's stalled for most of the first 200us of its time slice on
> a single bus transaction, so it can't really do anything with its time
> slice.  I further suspect that the scheduler is dynamically adjusting
> the priorities to boost the flash BGT, since it's using much less CPU
> time than the video thread, even though the video thread can't use
> most of its time slice.  Can someone tell me if this makes sense?

I'm not sure about this, but I thought the scheduler only does priority
escalation when running userspace threads. For kernel thread scheduling,
you get extra priority levels where the higher priority thread always
wins if it's ready to run.


Best regards,

Iwo




* Re: Scheduler latency problems when using NAND
  2010-10-11 22:54     ` Iwo Mergler
@ 2010-10-12 16:05       ` Mark Mason
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Mason @ 2010-10-12 16:05 UTC
  To: Iwo Mergler; +Cc: linux-mtd, linux-kernel

Iwo Mergler <iwo@call-direct.com.au> wrote:

> Mark Mason wrote:
>
> > What I see on a logic analyzer is the BUSY line held by the flash for
> > 200us, a single 32 bit read of the video chip (broken up into two 16
> > bit reads for the 16 bit bus), then another 200us BUSY from the flash,
> > two more 16 bit reads, etc, all the way to the end of the logic
> > analyzer screen.
> 
> I don't know your controller, but I'm surprised that the FLASH write
> can stall the bus like that. It seems a high price to pay for not
> having to implement some FLASH write interrupt or wait. Are you sure
> that this is the recommended way to connect a FLASH?

I don't think it's the controller's fault.  BUSY is a signal provided by
the flash, and its purpose is to hold off the controller while the
NAND is busy.  It seems strange that the signal remains asserted when
the chip isn't selected, but if it didn't, the controller would have
to poll the chip by periodically selecting it and checking whether
BUSY had deasserted.  The controller, running in flash mode, requires
the BUSY line to work.

A saner approach might be to connect the BUSY line to an interrupt,
and have the interrupt wake the NAND BGT up, but it's too late for
that now, since the hardware's already built.

Usually NAND is used for things like booting, config and log files,
etc, which is just 200us every now and then, so this wouldn't be a
problem.  It's only a problem since we need really high bandwidth to
the flash.

I got a fast reply from Freescale - this is the way it works and there
is no workaround.

I'll try shutting the FCM off and accessing the device as a plain
memory device.

Thanks for the help!


Thread overview: 5+ messages
     [not found] <20100929221401.GA32583@postdiluvian.org>
2010-09-30  4:56 ` Scheduler latency problems when using NAND Artem Bityutskiy
     [not found] ` <4CA3D92E.9060109@call-direct.com.au>
2010-10-09 17:42   ` Mark Mason
2010-10-10  7:56     ` Joakim Tjernlund
2010-10-11 22:54     ` Iwo Mergler
2010-10-12 16:05       ` Mark Mason
