* Cached NAND reads and UBIFS @ 2016-07-13 12:30 Richard Weinberger 2016-07-13 12:43 ` Boris Brezillon 2016-07-13 16:54 ` Brian Norris 0 siblings, 2 replies; 8+ messages in thread From: Richard Weinberger @ 2016-07-13 12:30 UTC (permalink / raw) To: Artem Bityutskiy, Brian Norris, Boris Brezillon Cc: linux-mtd@lists.infradead.org Hi! As discussed on IRC, Boris and I figured that on our target UBIFS is sometimes very slow. E.g. deleting a 1GiB file right after a reboot takes more than 30 seconds. When deleting a file with a cold TNC UBIFS has to look up a lot of znodes on the flash. For every single znode lookup UBIFS requests a few bytes from the flash. This is slow. After some investigation we found out that the NAND read cache is disabled when the NAND driver supports reading subpages. So we removed the NAND_SUBPAGE_READ flag from the driver and suddenly lookups were fast. Really fast. Deleting a 1GiB file took less than 5 seconds. Since on our MLC NAND a page is 16KiB, many znodes can be read very fast directly out of the NAND read cache. The read cache helps here a lot because in the regular case UBIFS' index nodes are linearly stored in a LEB. The TNC seems to assume that it can do a lot of short reads since the NAND read cache will help. But as soon as subpage reads are possible this assumption is no longer true. Now we're not sure what to do: should we implement bulk reading in the TNC code or improve NAND read caching? Thanks, //richard ^ permalink raw reply [flat|nested] 8+ messages in thread
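For a rough feel for the effect described above, here is a minimal, self-contained sketch (plain user-space C, not kernel code; the 16KiB page size and the ~200-byte znode read size are assumptions based on the numbers above, and lookups are assumed to walk an index LEB mostly sequentially) that counts flash read operations with and without a one-page read cache:

#include <stdio.h>

#define NAND_PAGE_SIZE (16 * 1024)  /* assumed MLC NAND page size */
#define CHUNK_SIZE     200          /* assumed average znode read size */
#define NCHUNKS        5000         /* number of index-node lookups */

int main(void)
{
	long cached_reads = 0, uncached_reads = 0;
	long last_page = -1;

	for (long i = 0; i < NCHUNKS; i++) {
		long page = (i * CHUNK_SIZE) / NAND_PAGE_SIZE;

		/* No cache: every lookup goes to the flash. */
		uncached_reads++;

		/* One-page cache: only a page change goes to the flash. */
		if (page != last_page) {
			cached_reads++;
			last_page = page;
		}
	}

	printf("flash reads without cache: %ld\n", uncached_reads);
	printf("flash reads with one-page cache: %ld\n", cached_reads);
	return 0;
}

With these assumed numbers the cached case needs about 62 page reads instead of 5000 flash operations, roughly an 80x reduction in this toy model (chunks that cross a page boundary would add one read each and are ignored here). The real-world gain is smaller because other costs remain, but it shows why losing the cache hurts so much.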
* Re: Cached NAND reads and UBIFS 2016-07-13 12:30 Cached NAND reads and UBIFS Richard Weinberger @ 2016-07-13 12:43 ` Boris Brezillon 2016-07-13 13:13 ` Artem Bityutskiy 2016-07-13 16:54 ` Brian Norris 1 sibling, 1 reply; 8+ messages in thread From: Boris Brezillon @ 2016-07-13 12:43 UTC (permalink / raw) To: Richard Weinberger Cc: Artem Bityutskiy, Brian Norris, linux-mtd@lists.infradead.org On Wed, 13 Jul 2016 14:30:01 +0200 Richard Weinberger <richard@nod.at> wrote: > Hi! > > As discussed on IRC, Boris and I figured that on our target UBIFS is sometimes > very slow. > i.e. deleting a 1GiB file right after a reboot takes more than 30 seconds. > > When deleting a file with a cold TNC UBIFS has to lookup a lot of znodes > on the flash. > For every single znode lookup UBIFS requests a few bytes from the flash. > This is slow. > > After some investigation we found out that the NAND read cache is disabled > when the NAND driver supports reading subpages. > So we removed the NAND_SUBPAGE_READ flag from the driver and suddenly > lookups were fast. Really fast. Deleting a 1GiB took less than 5 seconds. > Since on our MLC NAND a page is 16KiB many znodes can be read very fast > directly out of the NAND read cache. > The read cache helps here a lot because in the regular case UBIFS' index > nodes are linearly stored in a LEB. > > The TNC seems to assume that it can do a lot of short reads since the NAND > read cache will help. > But as soon subpage reads are possible this assumption is no longer true. > > Now we're not sure what do do, should we implement bulk reading in the TNC > code or improve NAND read caching? Hm, NAND page caching is something I'd like to get rid of at some point, and this for several reasons: 1/ it brings some confusion in NAND controller drivers, which don't know when they are allowed to use chip->buffer, and what to do with ->pagebuf in this case 2/ caching is already implemented at the FS level, so I'm not sure we really need another level of caching at the MTD/NAND level (except for those specific use cases where the MTD user relies on this caching to improve accesses to small contiguous chunks) 3/ it hides the real number of bitflips in a given page: say someone is reading over and over the same page, the MTD user will never be able to detect when the number of bitflips exceeds the threshold. This should not be a problem in the real world, because MTD users are unlikely to always read the same page without reading other pages in the meantime, but still, I think it adds some confusion, especially if one wants to write a test that reads over and over the same page to see the impact of read-disturb. Regards, Boris ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Cached NAND reads and UBIFS 2016-07-13 12:43 ` Boris Brezillon @ 2016-07-13 13:13 ` Artem Bityutskiy 2016-07-14 7:58 ` Boris Brezillon 0 siblings, 1 reply; 8+ messages in thread From: Artem Bityutskiy @ 2016-07-13 13:13 UTC (permalink / raw) To: Boris Brezillon, Richard Weinberger Cc: Brian Norris, linux-mtd@lists.infradead.org On Wed, 2016-07-13 at 14:43 +0200, Boris Brezillon wrote: > On Wed, 13 Jul 2016 14:30:01 +0200 > Richard Weinberger <richard@nod.at> wrote: > > > Hi! > > > > As discussed on IRC, Boris and I figured that on our target UBIFS > > is sometimes > > very slow. > > i.e. deleting a 1GiB file right after a reboot takes more than 30 > > seconds. > > > > When deleting a file with a cold TNC UBIFS has to lookup a lot of > > znodes > > on the flash. > > For every single znode lookup UBIFS requests a few bytes from the > > flash. > > This is slow. > > > > After some investigation we found out that the NAND read cache is > > disabled > > when the NAND driver supports reading subpages. > > So we removed the NAND_SUBPAGE_READ flag from the driver and > > suddenly > > lookups were fast. Really fast. Deleting a 1GiB took less than 5 > > seconds. > > Since on our MLC NAND a page is 16KiB many znodes can be read very > > fast > > directly out of the NAND read cache. > > The read cache helps here a lot because in the regular case UBIFS' > > index > > nodes are linearly stored in a LEB. > > > > The TNC seems to assume that it can do a lot of short reads since > > the NAND > > read cache will help. > > But as soon subpage reads are possible this assumption is no longer > > true. > > > > Now we're not sure what do do, should we implement bulk reading in > > the TNC > > code or improve NAND read caching? > > Hm, NAND page caching is something I'd like to get rid of at some > point, and this for several reasons: > > 1/ it brings some confusion in NAND controller drivers, where those > don't know when they are allowed to use chip->buffer, and what to do > with ->pagebuf in this case Yes, it adds complexity because it is not a separate caching layer but rather "built-in" into the logic, sprinkled around. > 2/ caching is already implemented at the FS level, so I'm not sure we > really need another level of caching at the MTD/NAND level (except > for > those specific use cases where the MTD user relies on this caching to > improve accesses to small contiguous chunks) Well, FS is caching stuff, but device level caching is still useful. E.g., UBI decides to move things around, things get cached, and when UBIFS reads the things, it picks them from the cache. Disk blocks are also cached in Linux separately from the FS level cache. > 3/ it hides the real number of bitflips in a given page: say someone > is > reading over and over the same page, the MTD user will never be able > to > detect when the number of bitflips exceed the threshold. This should > not be a problem in real world, because MTD users are unlikely to > always > read the same page without reading other pages in the meantime, but > still, I think it adds some confusion, especially if one wants to > write > a test that reads over and over the same page to see the impact of > read-disturb. Well, I think this is not a blocker problem, more of a complication that caching introduces. Indeed, I was working with different kinds of caches, e.g., implementing my own custom caching for my custom user-space scripts, and caches always introduce extra complexity. That's the price to pay.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Cached NAND reads and UBIFS 2016-07-13 13:13 ` Artem Bityutskiy @ 2016-07-14 7:58 ` Boris Brezillon 0 siblings, 0 replies; 8+ messages in thread From: Boris Brezillon @ 2016-07-14 7:58 UTC (permalink / raw) To: Artem Bityutskiy; +Cc: Richard Weinberger, Brian Norris, linux-mtd@lists.infradead.org On Wed, 13 Jul 2016 16:13:09 +0300 Artem Bityutskiy <dedekind1@gmail.com> wrote: > On Wed, 2016-07-13 at 14:43 +0200, Boris Brezillon wrote: > > On Wed, 13 Jul 2016 14:30:01 +0200 > > Richard Weinberger <richard@nod.at> wrote: > > > > > Hi! > > > > > > As discussed on IRC, Boris and I figured that on our target UBIFS > > > is sometimes > > > very slow. > > > i.e. deleting a 1GiB file right after a reboot takes more than 30 > > > seconds. > > > > > > When deleting a file with a cold TNC UBIFS has to lookup a lot of > > > znodes > > > on the flash. > > > For every single znode lookup UBIFS requests a few bytes from the > > > flash. > > > This is slow. > > > > > > After some investigation we found out that the NAND read cache is > > > disabled > > > when the NAND driver supports reading subpages. > > > So we removed the NAND_SUBPAGE_READ flag from the driver and > > > suddenly > > > lookups were fast. Really fast. Deleting a 1GiB took less than 5 > > > seconds. > > > Since on our MLC NAND a page is 16KiB many znodes can be read very > > > fast > > > directly out of the NAND read cache. > > > The read cache helps here a lot because in the regular case UBIFS' > > > index > > > nodes are linearly stored in a LEB. > > > > > > The TNC seems to assume that it can do a lot of short reads since > > > the NAND > > > read cache will help. > > > But as soon subpage reads are possible this assumption is no longer > > > true. > > > > > > Now we're not sure what do do, should we implement bulk reading in > > > the TNC > > > code or improve NAND read caching? > > > > Hm, NAND page caching is something I'd like to get rid of at some > > point, and this for several reasons: > > > > 1/ it brings some confusion in NAND controller drivers, where those > > don't know when they are allowed to use chip->buffer, and what to do > > with ->pagebuf in this case > > Yes, it adds complexity because it is not a separate caching layer but > rather "built-in" into the logic, sprinkled around. Yep. > > > 2/ caching is already implemented at the FS level, so I'm not sure we > > really need another level of caching at the MTD/NAND level (except > > for > > those specific use cases where the MTD user relies on this caching to > > improve accesses to small contiguous chunks) > > Well, FS is caching stuff, but device level caching is still useful. > E.g., UBI decides to move things around, things get cached, and when > UBIFS reads the things, it picks them from the cache. Yes, I don't deny the usefulness of caches in general, but with the MTD stack, it's not clear who is caching what, and I fear that we'll end up with different layers caching the same thing, thus increasing the memory consumption. > > Disk blocks are also cached in Linux separately from the FS level > cache. That's true, except they both use the same mechanism ("Page Cache"). I don't know it very well, but I thought a page cached at the block device level could be reused by the FS for its own cache if it needs to point to the same data. Anyway, letting each MTD component implement its own caching logic is not a good solution IMO, so maybe we should consider this mtdcache layer you were suggesting on IRC...
> > > 3/ it hides the real number of bitflips in a given page: say someone > > is > > reading over and over the same page, the MTD user will never be able > > to > > detect when the number of bitflips exceed the threshold. This should > > not be a problem in real world, because MTD users are unlikely to > > always > > read the same page without reading other pages in the meantime, but > > still, I think it adds some confusion, especially if one wants to > > write > > a test that reads over and over the same page to see the impact of > > read-disturb. > > Well, I think this is not a blocker problem, more of a complication > that caching introduces. Indeed, I was working with different kind of > caches, e.g., implementing my own custom caching for my custom user- > space scripts, and caches always introduces extra complexity. That's > the price to pay. Well, having a way to bypass the cache would be clearer than having to know which page is cached (or if we decide to enhance MTD/NAND caching, which pages are cached) and making sure this page is replaced by another one in the cache before reading it again. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Cached NAND reads and UBIFS 2016-07-13 12:30 Cached NAND reads and UBIFS Richard Weinberger 2016-07-13 12:43 ` Boris Brezillon @ 2016-07-13 16:54 ` Brian Norris 2016-07-14 8:33 ` Boris Brezillon 1 sibling, 1 reply; 8+ messages in thread From: Brian Norris @ 2016-07-13 16:54 UTC (permalink / raw) To: Richard Weinberger Cc: Artem Bityutskiy, Boris Brezillon, linux-mtd@lists.infradead.org On Wed, Jul 13, 2016 at 02:30:01PM +0200, Richard Weinberger wrote: > The TNC seems to assume that it can do a lot of short reads since the NAND > read cache will help. For my info: by "short reads" are you talking page-size/aligned? Subpage-sized/aligned? Or none of the above? > But as soon subpage reads are possible this assumption is no longer true. This sounds similar to the problem we were just discussing, about a bug/oversight/suboptimality in nand_do_read_ops()'s use of an extra buffer + memcpy(). I guess the difference here is that a sub-page read can't fill the cache as-is (we have no way to mark the cache as partially valid), so we just skip it. To "fixup" the cache for subpages, I guess we'd have to split it into subpages... > Now we're not sure what do do, should we implement bulk reading in the TNC > code or improve NAND read caching? Seems like we should improve the caching, either in a layer above, or just in the NAND layer. But, we haven't really figured out how to advertise the minimum (optimal) read size to the MTD user, right? We have at best writesize >> subpage_sft but that's really the minimum write size, not read size, right? So even if we're trying to shore up the read cacheability, we should better define the constraints first, so that we know what we can guarantee an MTD user, and what we can't. e.g., we will cache on size/alignment FOO, and anything not aligned to that is not guaranteed. On a potentially conflicting note, I think nand_do_read_ops() is too complicated. I think Boris expressed a similar sentiment. So if we can find a good way to factor the caching code to be less entangled with the mechanics of reading, that'd be a win for me. Brian ^ permalink raw reply [flat|nested] 8+ messages in thread
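To make the "split it into subpages" idea above concrete, here is a hypothetical sketch -- none of these names exist in nand_base, and the 16KiB page / 512-byte ECC chunk sizes are assumptions -- of a page buffer with a per-ECC-chunk validity bitmap, so that subpage reads could fill the cache partially instead of bypassing it:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CACHE_PAGE_SIZE  (16 * 1024)	/* assumed NAND page size */
#define CACHE_CHUNK_SIZE 512		/* assumed ECC chunk size, 32 chunks/page */

struct subpage_cache {
	int pagenr;			/* cached page number, -1 = empty */
	unsigned long valid;		/* bit n set = chunk n holds good data */
	unsigned char data[CACHE_PAGE_SIZE];
};

static bool chunk_cached(const struct subpage_cache *c, int pagenr, int chunk)
{
	return c->pagenr == pagenr && (c->valid & (1UL << chunk));
}

static void cache_store_chunk(struct subpage_cache *c, int pagenr, int chunk,
			      const void *src)
{
	if (c->pagenr != pagenr) {
		/* New page: every previously cached chunk is stale. */
		c->pagenr = pagenr;
		c->valid = 0;
	}
	memcpy(c->data + (size_t)chunk * CACHE_CHUNK_SIZE, src, CACHE_CHUNK_SIZE);
	c->valid |= 1UL << chunk;
}

A full-page read would simply set all the bits at once; the buffer would still have to be invalidated on writes and erases, just like the existing single-page cache.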
* Re: Cached NAND reads and UBIFS 2016-07-13 16:54 ` Brian Norris @ 2016-07-14 8:33 ` Boris Brezillon 2016-07-14 13:06 ` Richard Weinberger 0 siblings, 1 reply; 8+ messages in thread From: Boris Brezillon @ 2016-07-14 8:33 UTC (permalink / raw) To: Brian Norris Cc: Richard Weinberger, Artem Bityutskiy, linux-mtd@lists.infradead.org On Wed, 13 Jul 2016 09:54:22 -0700 Brian Norris <computersforpeace@gmail.com> wrote: > On Wed, Jul 13, 2016 at 02:30:01PM +0200, Richard Weinberger wrote: > > The TNC seems to assume that it can do a lot of short reads since the NAND > > read cache will help. > > For my info: by "short reads" are you talking page-size/aligned? > Subpage-sized/aligned? Or none of the above? None of them. IIRC, the size was around 200 bytes and not necessarily page or ECC-chunk aligned. > > > But as soon subpage reads are possible this assumption is no longer true. > > This sounds similar to the problem we were just discussing, about a > bug/oversight/suboptimality in nand_do_read_ops()'s use of an extra > buffer + memcpy(). I guess the difference here is that a sub-page read > can't fill the cache as-is (we have no way to mark the cache as > partially valid), so we just skip it. To "fixup" the cache > for subpages, I guess we'd have to split it into subpages... Well, it's not exactly the same problem. For the EC/VID headers we know that we'll only need to read 64 bytes from the whole page. Here, we don't know: the code might well read the whole page in small chunks, and using subpage read in this case is inefficient: each time you have to issue a READ command to store the page in the internal NAND cache, and then I/Os + ECC calculation for the ECC chunk (note that modern NAND controllers are able to pipeline I/Os and ECC calculation, thus making full-page reads more efficient than partial ones, especially when you end up reading the whole page). This is why this decision to use full-page vs ECC-chunk (AKA subpage) reads can only be made at the MTD user level (in this particular case, only UBIFS can predict whether we are likely to need the whole page content or only a sub-part of it). > > > Now we're not sure what do do, should we implement bulk reading in the TNC > > code or improve NAND read caching? > > Seems like we should improve the caching, either in a layer above, or > just in the NAND layer. I think we all agree on that one :). > > But, we haven't really figured out how to advertise the minimum > (optimal) read size to the MTD user, right? We have at best > > writesize >> subpage_sft > > but that's really the minimum write size, not read size, right? Correct, the minimum write-size (when the controller supports individual ECC chunk read) is an ECC chunk (usually 512 or 1024 bytes). But again, the minimum read size is not necessarily the most optimal one (see my explanation above), which means we'll have to differentiate the minimum read size from the optimal read size (the one providing the best throughput). > So even > if we're trying to shore up the read cacheability, we should better > define the constraints first, so that we know what we can guarantee an > MTD user, and what we can't. e.g., we will cache on size/alignment FOO, > and anything not aligned to that is not guaranteed. The read request will be ECC chunk (or page) aligned anyway, and the core will use an intermediate buffer when the user request does not match this constraint, so we'll always end up with aligned content that we can store in the cache.
> > On a potentially conflicting note, I think nand_do_read_ops() is too > complicated. I think Boris expressed a similar sentiment. So if we can > find a good way to factor the caching code to be less entangled with the > mechanics of reading, that'd be a win for me. I fully agree, and I'm not even sure this caching mechanism should be implemented in the NAND layer. Regards, Boris ^ permalink raw reply [flat|nested] 8+ messages in thread
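To illustrate the minimum-versus-optimal read size distinction made above, here is a purely hypothetical sketch -- neither the structure nor the helper exists in the current MTD API; it only shows what such hints could look like and how a user might consume them:

#include <stdbool.h>

/* Hypothetical read-granularity hints an MTD device could advertise. */
struct mtd_read_hints {
	unsigned int min_read_size;	/* one ECC chunk, e.g. 512 or 1024 bytes */
	unsigned int opt_read_size;	/* typically the full page: controllers that
					 * pipeline I/O and ECC read a whole page
					 * nearly as cheaply as a single chunk */
};

/*
 * A user that expects to walk a region more or less linearly (e.g. a cold-TNC
 * scan of an index LEB) would round its requests up to opt_read_size; a user
 * that really needs only a few bytes (e.g. UBI reading an EC/VID header) would
 * stay at min_read_size.
 */
static unsigned int pick_read_size(const struct mtd_read_hints *hints, bool bulk)
{
	return bulk ? hints->opt_read_size : hints->min_read_size;
}

Whether such hints belong in the MTD API at all, or whether the decision stays entirely with the MTD user as argued above, is exactly the open question in this thread.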
* Re: Cached NAND reads and UBIFS 2016-07-14 8:33 ` Boris Brezillon @ 2016-07-14 13:06 ` Richard Weinberger 2016-07-15 8:30 ` Boris Brezillon 0 siblings, 1 reply; 8+ messages in thread From: Richard Weinberger @ 2016-07-14 13:06 UTC (permalink / raw) To: Boris Brezillon, Brian Norris Cc: Artem Bityutskiy, linux-mtd@lists.infradead.org On 14.07.2016 at 10:33, Boris Brezillon wrote: >>> Now we're not sure what do do, should we implement bulk reading in the TNC >>> code or improve NAND read caching? >> >> Seems like we should improve the caching, either in a layer above, or >> just in the NAND layer. > > I think we all agree on that one :). Today I found some time to implement a PoC of a single page cache directly in UBI. IMHO UBI is the right layer for that. NAND is too low level and UBIFS too high level. With a few lines of code I was able to speed up UBIFS TNC lookups a lot, in fact it gave me the same speedup as caching in the NAND layer does. As a next step I'll tidy the code and add a debugfs interface to expose cache hit/miss counters such that we can experiment with different workloads and users on top of UBIFS. /me thinks of squashfs+ubiblock which is also rather common these days. Maybe caching more than one page helps too. Thanks, //richard ^ permalink raw reply [flat|nested] 8+ messages in thread
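Richard's PoC itself is not posted in this thread, so the following is only a rough sketch of the idea as described: a one-page read cache consulted on the UBI read path, with hit/miss counters for the planned debugfs interface. All names, the hook point and the page size are assumptions, not the actual patch:

#include <string.h>

#define UBI_CACHE_PAGE_SIZE (16 * 1024)	/* assumed NAND page size */

/* Hypothetical single-page read cache sitting between UBI and MTD. */
struct ubi_read_cache {
	int pnum;			/* physical eraseblock of the cached page */
	int page;			/* page within that PEB, -1 = empty */
	unsigned long hits, misses;	/* counters to expose via debugfs */
	unsigned char buf[UBI_CACHE_PAGE_SIZE];
};

/* Low-level reader: fetch 'len' bytes at 'offset' inside PEB 'pnum'. */
typedef int (*peb_read_fn)(int pnum, int offset, int len, void *buf);

static int ubi_cached_read(struct ubi_read_cache *c, peb_read_fn raw_read,
			   int pnum, int offset, int len, void *dst)
{
	int page = offset / UBI_CACHE_PAGE_SIZE;
	int in_page = offset % UBI_CACHE_PAGE_SIZE;
	int err;

	/* Requests crossing a page boundary simply bypass the cache. */
	if (in_page + len > UBI_CACHE_PAGE_SIZE)
		return raw_read(pnum, offset, len, dst);

	if (c->pnum != pnum || c->page != page) {
		err = raw_read(pnum, page * UBI_CACHE_PAGE_SIZE,
			       UBI_CACHE_PAGE_SIZE, c->buf);
		if (err)
			return err;
		c->pnum = pnum;
		c->page = page;
		c->misses++;
	} else {
		c->hits++;
	}

	memcpy(dst, c->buf + in_page, len);
	return 0;
}

The cache would have to be dropped whenever the cached PEB is written or erased, and, as noted earlier in the thread, data served from it gets no fresh ECC pass, so bitflip reporting for that page is only as recent as the last real flash read.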
* Re: Cached NAND reads and UBIFS 2016-07-14 13:06 ` Richard Weinberger @ 2016-07-15 8:30 ` Boris Brezillon 0 siblings, 0 replies; 8+ messages in thread From: Boris Brezillon @ 2016-07-15 8:30 UTC (permalink / raw) To: Richard Weinberger Cc: Brian Norris, Artem Bityutskiy, linux-mtd@lists.infradead.org Hi Richard, On Thu, 14 Jul 2016 15:06:48 +0200 Richard Weinberger <richard@nod.at> wrote: > Am 14.07.2016 um 10:33 schrieb Boris Brezillon: > >>> Now we're not sure what do do, should we implement bulk reading in the TNC > >>> code or improve NAND read caching? > >> > >> Seems like we should improve the caching, either in a layer above, or > >> just in the NAND layer. > > > > I think we all agree on that one :). > > Today I found some time to implement a PoC of a single page cache directly > in UBI. Oh, that was fast. I'm looking forward to seeing this implementation. > IMHO UBI is the right layer for that. NAND is too low and UBIFS too high > level. > With a few lines of case I was able to speedup UBIFS TNC lookups a lot, > in fact it gave me the same speed up as caching in the NAND layer does. As explained in a previous answer, assuming the user will always need the content of a full page is not necessarily the good option, even if, as seen with the UBIFS example, it's usually what we want. In our use case, only UBIFS can know/guess how much data will be needed in each page. The good thing is that caching at the UBI level allows you to use sub-page reads for EC/VID headers. > > As next step I'll tidy the code, add a debugfs interface to expose cache hit/miss > counters such that we can experiment with different workloads and users ontop > of UBIFS. /me thinks of squashfs+ubiblock which is also rather common these days. Yep, it might help, especially if the squashfs 'block size' is less than the NAND page size. > > Maybe caching more than one page helps too. That would also help, though IMO it requires using the 'page cache' mechanism so that the system can reclaim cached pages when it is under memory pressure. Anyway, thanks for investigating this option, let's see what Artem and Brian think about this approach. Regards, Boris ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread -- Thread overview: 8+ messages
2016-07-13 12:30 Cached NAND reads and UBIFS Richard Weinberger
2016-07-13 12:43 ` Boris Brezillon
2016-07-13 13:13 ` Artem Bityutskiy
2016-07-14 7:58 ` Boris Brezillon
2016-07-13 16:54 ` Brian Norris
2016-07-14 8:33 ` Boris Brezillon
2016-07-14 13:06 ` Richard Weinberger
2016-07-15 8:30 ` Boris Brezillon