* Cached NAND reads and UBIFS @ 2016-07-13 12:30 Richard Weinberger 2016-07-13 12:43 ` Boris Brezillon 2016-07-13 16:54 ` Brian Norris 0 siblings, 2 replies; 8+ messages in thread From: Richard Weinberger @ 2016-07-13 12:30 UTC (permalink / raw) To: Artem Bityutskiy, Brian Norris, Boris Brezillon Cc: linux-mtd@lists.infradead.org Hi! As discussed on IRC, Boris and I figured that on our target UBIFS is sometimes very slow. E.g. deleting a 1GiB file right after a reboot takes more than 30 seconds. When deleting a file with a cold TNC UBIFS has to look up a lot of znodes on the flash. For every single znode lookup UBIFS requests a few bytes from the flash. This is slow. After some investigation we found out that the NAND read cache is disabled when the NAND driver supports reading subpages. So we removed the NAND_SUBPAGE_READ flag from the driver and suddenly lookups were fast. Really fast. Deleting a 1GiB file took less than 5 seconds. Since on our MLC NAND a page is 16KiB, many znodes can be read very fast directly out of the NAND read cache. The read cache helps here a lot because in the regular case UBIFS' index nodes are linearly stored in a LEB. The TNC seems to assume that it can do a lot of short reads since the NAND read cache will help. But as soon as subpage reads are possible this assumption is no longer true. Now we're not sure what to do: should we implement bulk reading in the TNC code or improve NAND read caching? Thanks, //richard ^ permalink raw reply [flat|nested] 8+ messages in thread
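For a rough feel for the effect described above, here is a minimal, self-contained sketch (plain user-space C, not kernel code; the 16KiB page size and the ~200-byte znode read size are assumptions based on the numbers above, and lookups are assumed to walk an index LEB mostly sequentially) that counts flash read operations with and without a one-page read cache:

#include <stdio.h>

#define NAND_PAGE_SIZE (16 * 1024)  /* assumed MLC NAND page size */
#define CHUNK_SIZE     200          /* assumed average znode read size */
#define NCHUNKS        5000         /* number of index-node lookups */

int main(void)
{
	long cached_reads = 0, uncached_reads = 0;
	long last_page = -1;

	for (long i = 0; i < NCHUNKS; i++) {
		long page = (i * CHUNK_SIZE) / NAND_PAGE_SIZE;

		/* No cache: every lookup goes to the flash. */
		uncached_reads++;

		/* One-page cache: only a page change goes to the flash. */
		if (page != last_page) {
			cached_reads++;
			last_page = page;
		}
	}

	printf("flash reads without cache: %ld\n", uncached_reads);
	printf("flash reads with one-page cache: %ld\n", cached_reads);
	return 0;
}

With these assumed numbers the cached case needs about 62 page reads instead of 5000 flash operations, roughly an 80x reduction in this toy model (chunks that cross a page boundary would add one read each and are ignored here). The real-world gain is smaller because other costs remain, but it shows why losing the cache hurts so much.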
* Re: Cached NAND reads and UBIFS 2016-07-13 12:30 Cached NAND reads and UBIFS Richard Weinberger @ 2016-07-13 12:43 ` Boris Brezillon 2016-07-13 13:13 ` Artem Bityutskiy 2016-07-13 16:54 ` Brian Norris 1 sibling, 1 reply; 8+ messages in thread From: Boris Brezillon @ 2016-07-13 12:43 UTC (permalink / raw) To: Richard Weinberger Cc: Artem Bityutskiy, Brian Norris, linux-mtd@lists.infradead.org On Wed, 13 Jul 2016 14:30:01 +0200 Richard Weinberger <richard@nod.at> wrote: > Hi! > > As discussed on IRC, Boris and I figured that on our target UBIFS is sometimes > very slow. > i.e. deleting a 1GiB file right after a reboot takes more than 30 seconds. > > When deleting a file with a cold TNC UBIFS has to lookup a lot of znodes > on the flash. > For every single znode lookup UBIFS requests a few bytes from the flash. > This is slow. > > After some investigation we found out that the NAND read cache is disabled > when the NAND driver supports reading subpages. > So we removed the NAND_SUBPAGE_READ flag from the driver and suddenly > lookups were fast. Really fast. Deleting a 1GiB took less than 5 seconds. > Since on our MLC NAND a page is 16KiB many znodes can be read very fast > directly out of the NAND read cache. > The read cache helps here a lot because in the regular case UBIFS' index > nodes are linearly stored in a LEB. > > The TNC seems to assume that it can do a lot of short reads since the NAND > read cache will help. > But as soon subpage reads are possible this assumption is no longer true. > > Now we're not sure what do do, should we implement bulk reading in the TNC > code or improve NAND read caching? Hm, NAND page caching is something I'd like to get rid of at some point, and this for several reasons: 1/ it brings some confusion in NAND controller drivers, which don't know when they are allowed to use chip->buffer, and what to do with ->pagebuf in this case 2/ caching is already implemented at the FS level, so I'm not sure we really need another level of caching at the MTD/NAND level (except for those specific use cases where the MTD user relies on this caching to improve accesses to small contiguous chunks) 3/ it hides the real number of bitflips in a given page: say someone is reading over and over the same page, the MTD user will never be able to detect when the number of bitflips exceeds the threshold. This should not be a problem in the real world, because MTD users are unlikely to always read the same page without reading other pages in the meantime, but still, I think it adds some confusion, especially if one wants to write a test that reads over and over the same page to see the impact of read-disturb. Regards, Boris ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Cached NAND reads and UBIFS 2016-07-13 12:43 ` Boris Brezillon @ 2016-07-13 13:13 ` Artem Bityutskiy 2016-07-14 7:58 ` Boris Brezillon 0 siblings, 1 reply; 8+ messages in thread From: Artem Bityutskiy @ 2016-07-13 13:13 UTC (permalink / raw) To: Boris Brezillon, Richard Weinberger Cc: Brian Norris, linux-mtd@lists.infradead.org On Wed, 2016-07-13 at 14:43 +0200, Boris Brezillon wrote: > On Wed, 13 Jul 2016 14:30:01 +0200 > Richard Weinberger <richard@nod.at> wrote: > > > Hi! > > > > As discussed on IRC, Boris and I figured that on our target UBIFS > > is sometimes > > very slow. > > i.e. deleting a 1GiB file right after a reboot takes more than 30 > > seconds. > > > > When deleting a file with a cold TNC UBIFS has to lookup a lot of > > znodes > > on the flash. > > For every single znode lookup UBIFS requests a few bytes from the > > flash. > > This is slow. > > > > After some investigation we found out that the NAND read cache is > > disabled > > when the NAND driver supports reading subpages. > > So we removed the NAND_SUBPAGE_READ flag from the driver and > > suddenly > > lookups were fast. Really fast. Deleting a 1GiB took less than 5 > > seconds. > > Since on our MLC NAND a page is 16KiB many znodes can be read very > > fast > > directly out of the NAND read cache. > > The read cache helps here a lot because in the regular case UBIFS' > > index > > nodes are linearly stored in a LEB. > > > > The TNC seems to assume that it can do a lot of short reads since > > the NAND > > read cache will help. > > But as soon subpage reads are possible this assumption is no longer > > true. > > > > Now we're not sure what do do, should we implement bulk reading in > > the TNC > > code or improve NAND read caching? > > Hm, NAND page caching is something I'd like to get rid of at some > point, and this for several reasons: > > 1/ it brings some confusion in NAND controller drivers, where those > don't know when they are allowed to use chip->buffer, and what to do > with ->pagebuf in this case Yes, it adds complexity because it is not a separate caching layer but rather "built-in" into the logic, sprinkled around. > 2/ caching is already implemented at the FS level, so I'm not sure we > really need another level of caching at the MTD/NAND level (except > for > those specific use cases where the MTD user relies on this caching to > improve accesses to small contiguous chunks) Well, FS is caching stuff, but device level caching is still useful. E.g., UBI decides to move things around, things get cached, and when UBIFS reads the things, it picks them from the cache. Disk blocks are also cached in Linux separately from the FS level cache. > 3/ it hides the real number of bitflips in a given page: say someone > is > reading over and over the same page, the MTD user will never be able > to > detect when the number of bitflips exceed the threshold. This should > not be a problem in real world, because MTD users are unlikely to > always > read the same page without reading other pages in the meantime, but > still, I think it adds some confusion, especially if one wants to > write > a test that reads over and over the same page to see the impact of > read-disturb. Well, I think this is not a blocker problem, more of a complication that caching introduces. Indeed, I was working with different kinds of caches, e.g., implementing my own custom caching for my custom user-space scripts, and caches always introduce extra complexity. That's the price to pay.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Cached NAND reads and UBIFS 2016-07-13 13:13 ` Artem Bityutskiy @ 2016-07-14 7:58 ` Boris Brezillon 0 siblings, 0 replies; 8+ messages in thread From: Boris Brezillon @ 2016-07-14 7:58 UTC (permalink / raw) To: Artem Bityutskiy; +Cc: Richard Weinberger, Brian Norris, linux-mtd@lists.infradead.org On Wed, 13 Jul 2016 16:13:09 +0300 Artem Bityutskiy <dedekind1@gmail.com> wrote: > On Wed, 2016-07-13 at 14:43 +0200, Boris Brezillon wrote: > > On Wed, 13 Jul 2016 14:30:01 +0200 > > Richard Weinberger <richard@nod.at> wrote: > > > > > Hi! > > > > > > As discussed on IRC, Boris and I figured that on our target UBIFS > > > is sometimes > > > very slow. > > > i.e. deleting a 1GiB file right after a reboot takes more than 30 > > > seconds. > > > > > > When deleting a file with a cold TNC UBIFS has to lookup a lot of > > > znodes > > > on the flash. > > > For every single znode lookup UBIFS requests a few bytes from the > > > flash. > > > This is slow. > > > > > > After some investigation we found out that the NAND read cache is > > > disabled > > > when the NAND driver supports reading subpages. > > > So we removed the NAND_SUBPAGE_READ flag from the driver and > > > suddenly > > > lookups were fast. Really fast. Deleting a 1GiB took less than 5 > > > seconds. > > > Since on our MLC NAND a page is 16KiB many znodes can be read very > > > fast > > > directly out of the NAND read cache. > > > The read cache helps here a lot because in the regular case UBIFS' > > > index > > > nodes are linearly stored in a LEB. > > > > > > The TNC seems to assume that it can do a lot of short reads since > > > the NAND > > > read cache will help. > > > But as soon subpage reads are possible this assumption is no longer > > > true. > > > > > > Now we're not sure what do do, should we implement bulk reading in > > > the TNC > > > code or improve NAND read caching? > > > > Hm, NAND page caching is something I'd like to get rid of at some > > point, and this for several reasons: > > > > 1/ it brings some confusion in NAND controller drivers, where those > > don't know when they are allowed to use chip->buffer, and what to do > > with ->pagebuf in this case > > Yes, it adds complexity because it is not a separate caching layer but > rather "built-in" into the logic, sprinkled around. Yep. > > > 2/ caching is already implemented at the FS level, so I'm not sure we > > really need another level of caching at the MTD/NAND level (except > > for > > those specific use cases where the MTD user relies on this caching to > > improve accesses to small contiguous chunks) > > Well, FS is caching stuff, but device level caching is still useful. > E.g., UBI decides to move things around, things get cached, and when > UBIFS reads the things, it picks them from the cache. Yes, I don't deny the usefulness of caches in general, but with the MTD stack, it's not clear who is caching what, and I fear that we'll end up with different layers caching the same thing, thus increasing the memory consumption. > > Disk blocks are also cached in Linux separately from the FS level > cache. That's true, except they both use the same mechanism ("Page Cache"). I don't know it very well, but I thought a page cached at the block device level could be reused by the FS for its own cache if it needs to point to the same data. Anyway, letting each MTD component implement its own caching logic is not a good solution IMO, so maybe we should consider this mtdcache layer you were suggesting on IRC...
> > > 3/ it hides the real number of bitflips in a given page: say someone > > is > > reading over and over the same page, the MTD user will never be able > > to > > detect when the number of bitflips exceed the threshold. This should > > not be a problem in real world, because MTD users are unlikely to > > always > > read the same page without reading other pages in the meantime, but > > still, I think it adds some confusion, especially if one wants to > > write > > a test that reads over and over the same page to see the impact of > > read-disturb. > > Well, I think this is not a blocker problem, more of a complication > that caching introduces. Indeed, I was working with different kind of > caches, e.g., implementing my own custom caching for my custom user- > space scripts, and caches always introduces extra complexity. That's > the price to pay. Well, having a way to bypass the cache would be clearer than having to know which page is cached (or if we decide to enhance MTD/NAND caching, which pages are cached) and making sure this page is replaced by another one in the cache before reading it again. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Cached NAND reads and UBIFS 2016-07-13 12:30 Cached NAND reads and UBIFS Richard Weinberger 2016-07-13 12:43 ` Boris Brezillon @ 2016-07-13 16:54 ` Brian Norris 2016-07-14 8:33 ` Boris Brezillon 1 sibling, 1 reply; 8+ messages in thread From: Brian Norris @ 2016-07-13 16:54 UTC (permalink / raw) To: Richard Weinberger Cc: Artem Bityutskiy, Boris Brezillon, linux-mtd@lists.infradead.org On Wed, Jul 13, 2016 at 02:30:01PM +0200, Richard Weinberger wrote: > The TNC seems to assume that it can do a lot of short reads since the NAND > read cache will help. For my info: by "short reads" are you talking page-size/aligned? Subpage-sized/aligned? Or none of the above? > But as soon subpage reads are possible this assumption is no longer true. This sounds similar to the problem we were just discussing, about a bug/oversight/suboptimality in nand_do_read_ops()'s use of an extra buffer + memcpy(). I guess the difference here is that a sub-page read can't fill the cache as-is (we have no way to mark the cache as partially valid), so we just skip it. To "fixup" the cache for subpages, I guess we'd have to split it into subpages... > Now we're not sure what do do, should we implement bulk reading in the TNC > code or improve NAND read caching? Seems like we should improve the caching, either in a layer above, or just in the NAND layer. But, we haven't really figured out how to advertise the minimum (optimal) read size to the MTD user, right? We have at best writesize >> subpage_sft but that's really the minimum write size, not read size, right? So even if we're trying to shore up the read cacheability, we should better define the constraints first, so that we know what we can guarantee an MTD user, and what we can't. e.g., we will cache on size/alignment FOO, and anything not aligned to that is not guaranteed. On a potentially conflicting note, I think nand_do_read_ops() is too complicated. I think Boris expressed a similar sentiment. So if we can find a good way to factor the caching code to be less entangled with the mechanics of reading, that'd be a win for me. Brian ^ permalink raw reply [flat|nested] 8+ messages in thread
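To make the "split it into subpages" idea above concrete, here is a hypothetical sketch -- none of these names exist in nand_base, and the 16KiB page / 512-byte ECC chunk sizes are assumptions -- of a page buffer with a per-ECC-chunk validity bitmap, so that subpage reads could fill the cache partially instead of bypassing it:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CACHE_PAGE_SIZE  (16 * 1024)	/* assumed NAND page size */
#define CACHE_CHUNK_SIZE 512		/* assumed ECC chunk size, 32 chunks/page */

struct subpage_cache {
	int pagenr;			/* cached page number, -1 = empty */
	unsigned long valid;		/* bit n set = chunk n holds good data */
	unsigned char data[CACHE_PAGE_SIZE];
};

static bool chunk_cached(const struct subpage_cache *c, int pagenr, int chunk)
{
	return c->pagenr == pagenr && (c->valid & (1UL << chunk));
}

static void cache_store_chunk(struct subpage_cache *c, int pagenr, int chunk,
			      const void *src)
{
	if (c->pagenr != pagenr) {
		/* New page: every previously cached chunk is stale. */
		c->pagenr = pagenr;
		c->valid = 0;
	}
	memcpy(c->data + (size_t)chunk * CACHE_CHUNK_SIZE, src, CACHE_CHUNK_SIZE);
	c->valid |= 1UL << chunk;
}

A full-page read would simply set all the bits at once; the buffer would still have to be invalidated on writes and erases, just like the existing single-page cache.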
* Re: Cached NAND reads and UBIFS 2016-07-13 16:54 ` Brian Norris @ 2016-07-14 8:33 ` Boris Brezillon 2016-07-14 13:06 ` Richard Weinberger 0 siblings, 1 reply; 8+ messages in thread From: Boris Brezillon @ 2016-07-14 8:33 UTC (permalink / raw) To: Brian Norris Cc: Richard Weinberger, Artem Bityutskiy, linux-mtd@lists.infradead.org On Wed, 13 Jul 2016 09:54:22 -0700 Brian Norris <computersforpeace@gmail.com> wrote: > On Wed, Jul 13, 2016 at 02:30:01PM +0200, Richard Weinberger wrote: > > The TNC seems to assume that it can do a lot of short reads since the NAND > > read cache will help. > > For my info: by "short reads" are you talking page-size/aligned? > Subpage-sized/aligned? Or none of the above? None of them. IIRC, the size was around 200 bytes and not necessarily page or ECC-chunk aligned. > > > But as soon subpage reads are possible this assumption is no longer true. > > This sounds similar to the problem we were just discussing, about a > bug/oversight/suboptimality in nand_do_read_ops()'s use of an extra > buffer + memcpy(). I guess the difference here is that a sub-page read > can't fill the cache as-is (we have no way to mark the cache as > partially valid), so we just skip it. To "fixup" the cache > for subpages, I guess we'd have to split it into subpages... Well, it's not exactly the same problem. For the EC/VID headers we know that we'll only need to read 64 bytes from the whole page. Here, we don't know: the code might well read the whole page in small chunks, and using subpage read in this case is inefficient: each time you have to issue a READ command to store the page in the internal NAND cache, and then I/Os + ECC calculation for the ECC chunk (note that modern NAND controllers are able to pipeline I/Os and ECC calculation, thus making full-page reads more efficient than partial ones, especially when you end up reading the whole page). This is why this decision to use full-page vs ECC-chunk (AKA subpage) reads can only be made at the MTD user level (in this particular case, only UBIFS can predict whether we are likely to need the whole page content or only a sub-part of it). > > > Now we're not sure what do do, should we implement bulk reading in the TNC > > code or improve NAND read caching? > > Seems like we should improve the caching, either in a layer above, or > just in the NAND layer. I think we all agree on that one :). > > But, we haven't really figured out how to advertise the minimum > (optimal) read size to the MTD user, right? We have at best > > writesize >> subpage_sft > > but that's really the minimum write size, not read size, right? Correct, the minimum write-size (when the controller supports individual ECC chunk read) is an ECC chunk (usually 512 or 1024 bytes). But again, the minimum read size is not necessarily the most optimal one (see my explanation above), which means we'll have to differentiate the minimum read size from the optimal read size (the one providing the best throughput). > So even > if we're trying to shore up the read cacheability, we should better > define the constraints first, so that we know what we can guarantee an > MTD user, and what we can't. e.g., we will cache on size/alignment FOO, > and anything not aligned to that is not guaranteed. The read request will be ECC chunk (or page) aligned anyway, and the core will use an intermediate buffer when the user request does not match this constraint, so we'll always end up with aligned content that we can store in the cache.
> > On a potentially conflicting note, I think nand_do_read_ops() is too > complicated. I think Boris expressed a similar sentiment. So if we can > find a good way to factor the caching code to be less entangled with the > mechanics of reading, that'd be a win for me. I fully agree, and I'm not even sure this caching mechanism should be implemented in the NAND layer. Regards, Boris ^ permalink raw reply [flat|nested] 8+ messages in thread
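To illustrate the minimum-versus-optimal read size distinction made above, here is a purely hypothetical sketch -- neither the structure nor the helper exists in the current MTD API; it only shows what such hints could look like and how a user might consume them:

#include <stdbool.h>

/* Hypothetical read-granularity hints an MTD device could advertise. */
struct mtd_read_hints {
	unsigned int min_read_size;	/* one ECC chunk, e.g. 512 or 1024 bytes */
	unsigned int opt_read_size;	/* typically the full page: controllers that
					 * pipeline I/O and ECC read a whole page
					 * nearly as cheaply as a single chunk */
};

/*
 * A user that expects to walk a region more or less linearly (e.g. a cold-TNC
 * scan of an index LEB) would round its requests up to opt_read_size; a user
 * that really needs only a few bytes (e.g. UBI reading an EC/VID header) would
 * stay at min_read_size.
 */
static unsigned int pick_read_size(const struct mtd_read_hints *hints, bool bulk)
{
	return bulk ? hints->opt_read_size : hints->min_read_size;
}

Whether such hints belong in the MTD API at all, or whether the decision stays entirely with the MTD user as argued above, is exactly the open question in this thread.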
* Re: Cached NAND reads and UBIFS 2016-07-14 8:33 ` Boris Brezillon @ 2016-07-14 13:06 ` Richard Weinberger 2016-07-15 8:30 ` Boris Brezillon 0 siblings, 1 reply; 8+ messages in thread From: Richard Weinberger @ 2016-07-14 13:06 UTC (permalink / raw) To: Boris Brezillon, Brian Norris Cc: Artem Bityutskiy, linux-mtd@lists.infradead.org On 14.07.2016 at 10:33, Boris Brezillon wrote: >>> Now we're not sure what do do, should we implement bulk reading in the TNC >>> code or improve NAND read caching? >> >> Seems like we should improve the caching, either in a layer above, or >> just in the NAND layer. > > I think we all agree on that one :). Today I found some time to implement a PoC of a single page cache directly in UBI. IMHO UBI is the right layer for that. NAND is too low level and UBIFS too high level. With a few lines of code I was able to speed up UBIFS TNC lookups a lot, in fact it gave me the same speedup as caching in the NAND layer does. As a next step I'll tidy the code and add a debugfs interface to expose cache hit/miss counters such that we can experiment with different workloads and users on top of UBIFS. /me thinks of squashfs+ubiblock which is also rather common these days. Maybe caching more than one page helps too. Thanks, //richard ^ permalink raw reply [flat|nested] 8+ messages in thread
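Richard's PoC itself is not posted in this thread, so the following is only a rough sketch of the idea as described: a one-page read cache consulted on the UBI read path, with hit/miss counters for the planned debugfs interface. All names, the hook point and the page size are assumptions, not the actual patch:

#include <string.h>

#define UBI_CACHE_PAGE_SIZE (16 * 1024)	/* assumed NAND page size */

/* Hypothetical single-page read cache sitting between UBI and MTD. */
struct ubi_read_cache {
	int pnum;			/* physical eraseblock of the cached page */
	int page;			/* page within that PEB, -1 = empty */
	unsigned long hits, misses;	/* counters to expose via debugfs */
	unsigned char buf[UBI_CACHE_PAGE_SIZE];
};

/* Low-level reader: fetch 'len' bytes at 'offset' inside PEB 'pnum'. */
typedef int (*peb_read_fn)(int pnum, int offset, int len, void *buf);

static int ubi_cached_read(struct ubi_read_cache *c, peb_read_fn raw_read,
			   int pnum, int offset, int len, void *dst)
{
	int page = offset / UBI_CACHE_PAGE_SIZE;
	int in_page = offset % UBI_CACHE_PAGE_SIZE;
	int err;

	/* Requests crossing a page boundary simply bypass the cache. */
	if (in_page + len > UBI_CACHE_PAGE_SIZE)
		return raw_read(pnum, offset, len, dst);

	if (c->pnum != pnum || c->page != page) {
		err = raw_read(pnum, page * UBI_CACHE_PAGE_SIZE,
			       UBI_CACHE_PAGE_SIZE, c->buf);
		if (err)
			return err;
		c->pnum = pnum;
		c->page = page;
		c->misses++;
	} else {
		c->hits++;
	}

	memcpy(dst, c->buf + in_page, len);
	return 0;
}

The cache would have to be dropped whenever the cached PEB is written or erased, and, as noted earlier in the thread, data served from it gets no fresh ECC pass, so bitflip reporting for that page is only as recent as the last real flash read.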
* Re: Cached NAND reads and UBIFS 2016-07-14 13:06 ` Richard Weinberger @ 2016-07-15 8:30 ` Boris Brezillon 0 siblings, 0 replies; 8+ messages in thread From: Boris Brezillon @ 2016-07-15 8:30 UTC (permalink / raw) To: Richard Weinberger Cc: Brian Norris, Artem Bityutskiy, linux-mtd@lists.infradead.org Hi Richard, On Thu, 14 Jul 2016 15:06:48 +0200 Richard Weinberger <richard@nod.at> wrote: > Am 14.07.2016 um 10:33 schrieb Boris Brezillon: > >>> Now we're not sure what do do, should we implement bulk reading in the TNC > >>> code or improve NAND read caching? > >> > >> Seems like we should improve the caching, either in a layer above, or > >> just in the NAND layer. > > > > I think we all agree on that one :). > > Today I found some time to implement a PoC of a single page cache directly > in UBI. Oh, that was fast. I'm looking forward to seeing this implementation. > IMHO UBI is the right layer for that. NAND is too low and UBIFS too high > level. > With a few lines of case I was able to speedup UBIFS TNC lookups a lot, > in fact it gave me the same speed up as caching in the NAND layer does. As explained in a previous answer, assuming the user will always need the content of a full page is not necessarily the good option, even if, as seen with the UBIFS example, it's usually what we want. In our use case, only UBIFS can know/guess how much data will be needed in each page. The good thing is that caching at the UBI level allows you to use sub-page reads for EC/VID headers. > > As next step I'll tidy the code, add a debugfs interface to expose cache hit/miss > counters such that we can experiment with different workloads and users ontop > of UBIFS. /me thinks of squashfs+ubiblock which is also rather common these days. Yep, it might help, especially if the squashfs 'block size' is less than the NAND page size. > > Maybe caching more than one page helps too. That would also help, though IMO it requires using the 'page cache' mechanism so that the system can reclaim cached pages when it is under memory pressure. Anyway, thanks for investigating this option, let's see what Artem and Brian think about this approach. Regards, Boris ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread -- Thread overview: 8+ messages
2016-07-13 12:30 Cached NAND reads and UBIFS Richard Weinberger
2016-07-13 12:43 ` Boris Brezillon
2016-07-13 13:13 ` Artem Bityutskiy
2016-07-14 7:58 ` Boris Brezillon
2016-07-13 16:54 ` Brian Norris
2016-07-14 8:33 ` Boris Brezillon
2016-07-14 13:06 ` Richard Weinberger
2016-07-15 8:30 ` Boris Brezillon