Date: Sun, 12 Jun 2016 13:11:42 +0200
From: Boris Brezillon
To: "George Spelvin"
Cc: computersforpeace@gmail.com, linux-kernel@vger.kernel.org,
 linux-mtd@lists.infradead.org, richard@nod.at,
 "Bean Huo 霍斌斌 (beanhuo)"
Subject: Re: [PATCH 2/4] mtd: nand: implement two pairing scheme
Message-ID: <20160612131142.293ff800@bbrezillon>
In-Reply-To: <20160612092313.4046.qmail@ns.sciencehorizons.net>
References: <20160612092019.79b57b7f@bbrezillon>
 <20160612092313.4046.qmail@ns.sciencehorizons.net>
List-Id: Linux MTD discussion mailing list

On 12 Jun 2016 05:23:13 -0400
"George Spelvin" wrote:

> >> It also applies an offset of +1, to avoid negative numbers and the
> >> problems of signed divides.
>
> > It seems to cover all cases.
>
> I wasn't sure why you used a signed int for the interface.

No real reason other than consistency with other prototypes where page
is always expressed as an integer.

>
> (Another thing I thought of, but am less sure of, is packing the group
> and pair numbers into a register-passable int rather than a structure.
> Even 2 bits for the group is probably the most that will ever be needed,
> but it's easy to say the low 4 bits are the group and the high 28 are
> the pair.  Just create a few access macros to pull them apart.

We could indeed do that, but again, do we really need to optimize
things like that?

>
> This was inspired by Linus's hash_len abstraction, recently moved to
> )
>
> >> (or you could add an mtd->write_per_erase field).
>
> > Okay. Actually I'd like to avoid adding new 'conversion' fields to the
> > mtd_info struct. Not sure we are really improving perfs when doing that,
> > since what takes long is the I/O ops between the flash and the
> > controller, not the conversion operations.
>
> Well, yes, but you may need to do conversion ops for in-memory cache
> lookups or searching for free blocks, or wear-levelling computations,
> all of which may involve a great many conversions per actual I/O.

That's true, even if I don't think it makes such a big difference
(there are not that many paired-page manipulations that are not
followed by read/write accesses, and this is where the contention is).

>
> (In hindsight, I'd wish for writesize and write_per_erase, and not
> store erasesize explicitly.  Not only is the multiply more efficient,
> but it abolishes the error of an erase size which is not a multiple of
> the write size by making it impossible.)

That's also true. Actually I was thinking about adding inline functions
to retrieve the eraseblock and page size instead of letting people
manipulate the ->writesize/erasesize fields directly. This way we would
be able to rework the internal representation.

>
> > Can we have a boolean to make it clearer?
> >
> > bool lastpage = ((page + 1) * mtd->writesize) == mtd->erasesize;
>
> An improvement IMHO.  You can use the same name in all four functions
> to make the equivalence clear.
>
> > Also, the page update is quite obscure for people that did not have the
> > explanation you gave above. Can we make it
>
> > /*
> >  * The first and last pages are not surrounded by other pages,
> >  * and are thus less sensitive to read/write disturbance.
> >  * That's why NAND vendors decided to use a different distance
> >  * for these 2 specific cases, which complicates a bit the
> >  * pairing scheme logic.
>
> Um... this is, as far as I can tell, complete nonsense.
Actually this was pure guessing, because I never had a real explanation
for these weird pairing schemes.

>
> I realize you know this about a thousand times better than I do, so
> I'm hesitant to make such a strong statement, but one thing that I do
> know is that paired pages are stored in the exact same transistors.
> The pairing is purely a logical addressing distance.  The physical
> distance is exactly zero.
>
> The question is why they chose this particular *logical* addressing
> scheme, and I believe the reason is write bandwidth for the common case
> of streaming writes to consecutive pages.
>
> The obvious thing to do is pair consecutive even and odd pages (pages 0 and 1,
> then 2 and 3, then...), but that makes it hard to pipeline programming of the
> two pages.  You can't start programming page 1 until page 0 is finished.
>
> The next obvious thing is stride-2: 0<->2, 1<->3, 4<->6, 5<->7, etc.

Yes I understand that one.

>
> This is done in some MLC chips.  See p. 98 of this Micron data sheet:
> http://pdf.datasheet.directory/datasheets-0/micron_technology/MT29F32G08CBACAWP_C.pdf
> which has a stride-4 pairing.  0..3 pair with 4..7, then 8..11 with
> 12..15, and so on.
>
> However, it's desirable to alternate group-0 and group-1 pages, since
> the write operations are rather different and even take different amounts
> of time.  Alternating them makes it possible to:
> 1) Possibly overlap parts of the writes that use different on-chip resources,
> 2) Average the non-overlapping times for minimum jitter.

Okay, that's actually a good reason, and probably the part I was
missing to explain these non-log2 distance schemes, which lead to
heterogeneous distances (the first and last sets of pages don't have
the same stride as the others).

>
> This leads naturally to the stride-3 solution.
> You want to minimize the
> stride because you can read both pages in a pair with one read disturbance,
> and the shorter the distance, the more likely you'll want both pages
> (and the less buffering you'll need to make both available).
>
> Stride-3 does have those two awkward edge cases, and changing the
> stride is simply the simplest way to special-case them.

Yep. Still, I've seen weird things while working on modern MLC NANDs
which make me think the pairing scheme is also there to help mitigate
the write-disturb effect, but I might be wrong.

The behavior I'm describing here has been observed on Hynix
(H27QCG8T2E5R-BCF) and Toshiba (TC58TEG5DCLTA00) NANDs so far.
When I write the 2 pages in a pair, but not the following page, I see a
high number of bitflips in the last programmed page until the next page
is programmed.

Let's take a real example. My NAND exposes a stride-3 pairing scheme.
When I only program pages 0, 1 and 2, page 2 shows a high number of
bitflips until page 3 is programmed.
Actually, I don't remember if the number decreases after programming
page 3 or 4, but my guess is that the NAND is accounting for future
write-disturb when programming a page in group 1, which makes this page
unreliable until the subsequent page(s) have been programmed.

What's your opinion on that?

>
> > Thanks for your valuable review/suggestions.
> >
> > Just out of curiosity, why are you interested in the pairing scheme
> > concept? Are you working with NANDs?
>
> Not at present, but I do embedded hardware and might some day.

Okay. You seem pretty well aware of MLC/TLC NAND constraints, and you
already have a good idea of how things work. Good to have someone like
you reviewing this stuff.

>
> Also, the data sheets are a real PITA to find.  I have yet to
> see an actual data sheet that documents the stride-3 pairing scheme.

Yes, that's a real problem.
Here is a Samsung NAND data sheet describing stride-3 [1], and a Hynix
one describing stride-6 [2].

[1] http://dl.btc.pl/kamami_wa/k9gbg08u0a_ds.pdf
[2] http://www.szyuda88.com/uploadfile/cfile/201061714220663.pdf

--
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
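
For reference, here is a rough sketch of two ideas from the discussion
above: packing group and pair into a single int (low 4 bits for the
group, high 28 for the pair), and the page-to-pair conversion for a
stride-3 scheme using the lastpage boolean. All names below are
hypothetical and not part of the MTD API. The layout itself is an
assumption: 0<->2, 1<->4, 3<->6, and so on, with the last pair at
distance 2 and an even number of pages per block (at least 4).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of the MTD API. */
#define PAIR_GROUP_BITS	4u
#define PAIR_GROUP_MASK	((1u << PAIR_GROUP_BITS) - 1u)

/* Low 4 bits hold the group, high 28 bits hold the pair. */
static inline uint32_t pairing_pack(uint32_t pair, uint32_t group)
{
	assert(group <= PAIR_GROUP_MASK && pair < (1u << 28));
	return (pair << PAIR_GROUP_BITS) | group;
}

static inline uint32_t pairing_group(uint32_t info)
{
	return info & PAIR_GROUP_MASK;
}

static inline uint32_t pairing_pair(uint32_t info)
{
	return info >> PAIR_GROUP_BITS;
}

/*
 * Assumed stride-3 layout (even page count, at least 4 pages per block):
 *   group 0: pages 0, 1, 3, 5, ...
 *   group 1: pages 2, 4, 6, ..., plus the last page of the block
 * The last pair uses a distance of 2 instead of 3.
 */
static uint32_t dist3_page_to_info(uint32_t page, uint32_t writesize,
				   uint32_t erasesize)
{
	bool lastpage = ((page + 1) * writesize) == erasesize;
	uint32_t dist = lastpage ? 2 : 3;

	/* Page 0 and odd pages (except the last one) are in group 0. */
	if (!lastpage && (page == 0 || (page & 1)))
		return pairing_pack((page + 1) / 2, 0);

	/* Even pages and the last page are in group 1. */
	return pairing_pack((page + 1 - dist) / 2, 1);
}
```

With 8 pages of 4096 bytes per block, this maps pages 0..7 to the pairs
(0,2), (1,4), (3,6) and (5,7). Whether packing is worth it compared to
a plain struct is the open question raised above.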