* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout [not found] ` <51AD9140.90500@freescale.com> @ 2013-06-05 18:01 ` Brian Norris 2013-06-05 21:08 ` Brian Norris 0 siblings, 1 reply; 4+ messages in thread From: Brian Norris @ 2013-06-05 18:01 UTC (permalink / raw) To: Huang Shijie; +Cc: Artem Bityutskiy, linux-mtd, Linux Kernel, Kevin Cernekee On Tue, Jun 4, 2013 at 12:03 AM, Huang Shijie <b32955@freescale.com> wrote: > 于 2013年06月04日 09:46, Brian Norris 写道: >> After various tests, it seems simply that the timeout is not long enough >> for my system; increasing it by a few jiffies prevented all failures >> (testing for 12+ hours). There is no harm in increasing the timeout, but >> there is harm in having it too short, as evidenced here. >> > I like the patch1 and patch 2. > > But extending the timeout from 1ms to 10ms is like a workaround. :) I was afraid you might say that; that's why I stuck the first two patches first ;) > From the NOR's spec, even the maximum write-to-buffer only costs several > hundreds us, > such as 200us. > > I GUESS your problem is caused by the timer system, not the MTD code. I > ever met this type of bug. I suspected similarly, but I didn't (until now) believe that's the case here. See below. > The bug is in the kernel 3.5.7, but the latest kernel has fixed it with > NO_HZ_IDLE/NO_HZ_COMMON features. Did you track your bug down to a particular commit? 3.5.7 is the stable kernel; do you know what mainline rev it showed up in? I'm not quite interested in backporting all of the new 3.10 features! > I do not meet the issue the latest linux-next tree. > > I try to describe the jiffies bug with my poor english: > > [1] background: > CONFIG_HZ=100, CONFIG_NO_HZ=y > > [2] call nand_wait() when we write a nand page. > > [3] The jiffies was not updated at a _even_ speed. > > In the nand_wait(), you wait for 20ms(2 jiffies) for a page write, > and the timeout occurs during the page write. Of course, you think that > we have already waited for 20ms. > But in actually, we only waited for 1ms or less! > How do i know this? I use the gettimeofday to check the real time when > the timeout occur. I suspected this very type of thing, since this has come up in a few different contexts. And for some time, with a number of different checks, it appeared that this *wasn't* the case. But while writing this very email, I had the bright idea that my time checkpoint was in slightly the wrong place; so sure enough, I found that I was timing out after only 72519 ns! (That is, 72 us, or well below the max write buffer time.) I'm testing on MIPS with a 3.3 kernel, by the way, but I believe this sort of bug has been around a while. > [4] if i disable the local timer, the bug disappears. > > So, could you check the real time when the timeout occurs? > > > > Btw: My NOR's timeout is proved to be a silicon bug by Micron. Interesting. Brian ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout 2013-06-05 18:01 ` [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout Brian Norris @ 2013-06-05 21:08 ` Brian Norris 2013-06-05 22:39 ` Imre Deak 2013-06-06 2:20 ` Huang Shijie 0 siblings, 2 replies; 4+ messages in thread From: Brian Norris @ 2013-06-05 21:08 UTC (permalink / raw) To: Huang Shijie Cc: Artem Bityutskiy, linux-mtd, Linux Kernel, Kevin Cernekee, Arnd Bergmann, Imre Deak Adding a few others For reference, this thread started with this patch: http://lists.infradead.org/pipermail/linux-mtd/2013-June/047164.html On Wed, Jun 5, 2013 at 11:01 AM, Brian Norris <computersforpeace@gmail.com> wrote: > On Tue, Jun 4, 2013 at 12:03 AM, Huang Shijie <b32955@freescale.com> wrote: >> 于 2013年06月04日 09:46, Brian Norris 写道: >>> After various tests, it seems simply that the timeout is not long enough >>> for my system; increasing it by a few jiffies prevented all failures >>> (testing for 12+ hours). There is no harm in increasing the timeout, but >>> there is harm in having it too short, as evidenced here. >>> >> I like the patch1 and patch 2. >> >> But extending the timeout from 1ms to 10ms is like a workaround. :) > > I was afraid you might say that; that's why I stuck the first two > patches first ;) ... >> I GUESS your problem is caused by the timer system, not the MTD code. I >> ever met this type of bug. ... >> I try to describe the jiffies bug with my poor english: >> >> [1] background: >> CONFIG_HZ=100, CONFIG_NO_HZ=y >> >> [2] call nand_wait() when we write a nand page. >> >> [3] The jiffies was not updated at a _even_ speed. >> >> In the nand_wait(), you wait for 20ms(2 jiffies) for a page write, >> and the timeout occurs during the page write. Of course, you think that >> we have already waited for 20ms. >> But in actually, we only waited for 1ms or less! >> How do i know this? I use the gettimeofday to check the real time when >> the timeout occur. > > I suspected this very type of thing, since this has come up in a few > different contexts. And for some time, with a number of different > checks, it appeared that this *wasn't* the case. But while writing > this very email, I had the bright idea that my time checkpoint was in > slightly the wrong place; so sure enough, I found that I was timing > out after only 72519 ns! (That is, 72 us, or well below the max write > buffer time.) So I can confirm that with the 1ms timeout, I actually am sometimes timing out at 40 to 70 microseconds. I think this may have multiple causes: (1) uneven timer interrupts, as suggested by Huang? (2) a jiffies timeout of 1 is two short (with HZ=1000, msecs_to_jiffies(1) is 1) Regarding reason (2): My thought (which matches with Imre's comments from his [1]) is that one problem here is that we do not know how long it will be until the *next* timer tick -- "waiting 1 jiffy" is really just waiting until the next timer tick, which very well might be in 40us! So the correct timeout calculation is something like: uWriteTimeout = msecs_to_jiffies(1) + 1; or with Imre's proposed methods (not merged upstream yet), just: uWriteTimeout = msecs_to_jiffies_timeout(1); Thoughts? Note that a 2-jiffy timeout does not, in fact, totally resolve my problems; with a timeout of 2 jiffies, I still get a timeout that (according to getnstimeofday()) occurs after only 56us. It does decrease its rate of occurrence, but Huang may still be right that reason (1) is involved. Brian [1] http://marc.info/?l=linux-kernel&m=136854294730957 ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout 2013-06-05 21:08 ` Brian Norris @ 2013-06-05 22:39 ` Imre Deak 2013-06-06 2:20 ` Huang Shijie 1 sibling, 0 replies; 4+ messages in thread From: Imre Deak @ 2013-06-05 22:39 UTC (permalink / raw) To: Brian Norris Cc: Huang Shijie, Artem Bityutskiy, linux-mtd, Linux Kernel, Kevin Cernekee, Arnd Bergmann On Wed, 2013-06-05 at 14:08 -0700, Brian Norris wrote: > Adding a few others > > For reference, this thread started with this patch: > > http://lists.infradead.org/pipermail/linux-mtd/2013-June/047164.html > > On Wed, Jun 5, 2013 at 11:01 AM, Brian Norris > <computersforpeace@gmail.com> wrote: > > On Tue, Jun 4, 2013 at 12:03 AM, Huang Shijie <b32955@freescale.com> wrote: > >> 于 2013年06月04日 09:46, Brian Norris 写道: > >>> After various tests, it seems simply that the timeout is not long enough > >>> for my system; increasing it by a few jiffies prevented all failures > >>> (testing for 12+ hours). There is no harm in increasing the timeout, but > >>> there is harm in having it too short, as evidenced here. > >>> > >> I like the patch1 and patch 2. > >> > >> But extending the timeout from 1ms to 10ms is like a workaround. :) > > > > I was afraid you might say that; that's why I stuck the first two > > patches first ;) > ... > >> I GUESS your problem is caused by the timer system, not the MTD code. I > >> ever met this type of bug. > ... > >> I try to describe the jiffies bug with my poor english: > >> > >> [1] background: > >> CONFIG_HZ=100, CONFIG_NO_HZ=y > >> > >> [2] call nand_wait() when we write a nand page. > >> > >> [3] The jiffies was not updated at a _even_ speed. > >> > >> In the nand_wait(), you wait for 20ms(2 jiffies) for a page write, > >> and the timeout occurs during the page write. Of course, you think that > >> we have already waited for 20ms. > >> But in actually, we only waited for 1ms or less! > >> How do i know this? I use the gettimeofday to check the real time when > >> the timeout occur. > > > > I suspected this very type of thing, since this has come up in a few > > different contexts. And for some time, with a number of different > > checks, it appeared that this *wasn't* the case. But while writing > > this very email, I had the bright idea that my time checkpoint was in > > slightly the wrong place; so sure enough, I found that I was timing > > out after only 72519 ns! (That is, 72 us, or well below the max write > > buffer time.) > > So I can confirm that with the 1ms timeout, I actually am sometimes > timing out at 40 to 70 microseconds. I think this may have multiple > causes: > (1) uneven timer interrupts, as suggested by Huang? > (2) a jiffies timeout of 1 is two short (with HZ=1000, msecs_to_jiffies(1) is 1) > > Regarding reason (2): > > My thought (which matches with Imre's comments from his [1]) is that > one problem here is that we do not know how long it will be until the > *next* timer tick -- "waiting 1 jiffy" is really just waiting until > the next timer tick, which very well might be in 40us! So the correct > timeout calculation is something like: > > uWriteTimeout = msecs_to_jiffies(1) + 1; > > or with Imre's proposed methods (not merged upstream yet), just: > > uWriteTimeout = msecs_to_jiffies_timeout(1); > > Thoughts? I think what you describe at (2) wouldn't cause a premature timeout in your case. The driver uses the returned jiffy value something like the following in all cases (before applying the patch with the +1 change): uWriteTimeout = msecs_to_jiffies(1); timeout = jiffies + uWriteTimeout; while (!condition) if (time_after(jiffies, timeout)) return -ETIMEDOUT; Here using time_after() as opposed to time_after_eq() serves as an implicit +1 and thus guarantees that you wait at least 1 msec. A bit off-topic: Though using msecs_to_jiffies() is not a problem here, I think in this case and almost always it would need less thinking and thus be safer to still use msecs_to_jiffies_timeout(). A rare exception would be when the +1 adjustment would accumulate to a significant error, like in the following polling loop: for (i = 0; i <= 50; i++) { if (poll_condition) return 0; schedule_timeout(msecs_to_jiffies(1)); } return -ETIMEDOUT; Here on HZ=1000 we would time out in average after 100 msec using msecs_to_jiffies_timeout(1), whereas the intention was 50 msecs. --Imre > Note that a 2-jiffy timeout does not, in fact, totally resolve my > problems; with a timeout of 2 jiffies, I still get a timeout that > (according to getnstimeofday()) occurs after only 56us. It does > decrease its rate of occurrence, but Huang may still be right that > reason (1) is involved. > > Brian > > [1] http://marc.info/?l=linux-kernel&m=136854294730957 ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout 2013-06-05 21:08 ` Brian Norris 2013-06-05 22:39 ` Imre Deak @ 2013-06-06 2:20 ` Huang Shijie 1 sibling, 0 replies; 4+ messages in thread From: Huang Shijie @ 2013-06-06 2:20 UTC (permalink / raw) To: Brian Norris Cc: Artem Bityutskiy, linux-mtd, Linux Kernel, Kevin Cernekee, Arnd Bergmann, Imre Deak 于 2013年06月06日 05:08, Brian Norris 写道: > Note that a 2-jiffy timeout does not, in fact, totally resolve my > problems; with a timeout of 2 jiffies, I still get a timeout that > (according to getnstimeofday()) occurs after only 56us. It does since the 2-jiffy does not resolve your problem, i suggest you try the latest linux-next tree. thanks Huang Shijie ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2013-06-06 2:17 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1370310406-413-1-git-send-email-computersforpeace@gmail.com>
[not found] ` <1370310406-413-3-git-send-email-computersforpeace@gmail.com>
[not found] ` <51AD9140.90500@freescale.com>
2013-06-05 18:01 ` [PATCH 3/3] mtd: cfi_cmdset_0002: increase do_write_buffer() timeout Brian Norris
2013-06-05 21:08 ` Brian Norris
2013-06-05 22:39 ` Imre Deak
2013-06-06 2:20 ` Huang Shijie
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox