On Wed, Mar 25, 2009 at 01:48:53AM +1100, Nick Piggin wrote:
> On Monday 23 March 2009 03:53:29 Jos Houtman wrote:
> > On 3/21/09 11:53 AM, "Andrew Morton" wrote:
> > > On Fri, 20 Mar 2009 19:26:06 +0100 Jos Houtman wrote:
> > >> Hi,
> > >>
> > >> We have hit a problem where the page-cache writeback algorithm is not
> > >> keeping up. When memory gets low this results in very irregular
> > >> performance drops.
> > >>
> > >> Our setup is as follows:
> > >> 30 x quad-core machines with 64GB RAM.
> > >> These are single-purpose machines running MySQL.
> > >> Kernel version: 2.6.28.7
> > >> A dedicated SSD drive for the ext2 database partition.
> > >> Noop scheduler for the SSD drive.
> > >>
> > >> The current hypothesis is as follows:
> > >> The wb_kupdate function does not write enough dirty pages, which
> > >> allows the number of dirty pages to grow to the dirty_background
> > >> limit. When memory is low, background_writeout() comes around and
> > >> forcefully writes dirty pages to disk.
> > >> This forced write fills the disk queue and starves the read calls
> > >> that MySQL is trying to do: basically killing performance for a few
> > >> seconds. This pattern repeats as soon as the cleared memory is
> > >> filled again.
> > >>
> > >> Decreasing dirty_writeback_centisecs to 100 doesn't help.
> > >>
> > >> I don't know why this is, but I did some preliminary tracing using
> > >> systemtap and it seems that the majority of the time wb_kupdate
> > >> decides to do nothing.
> > >>
> > >> Doubling /sys/block/sdb/queue/nr_requests to 256 seems to help a
> > >> bit: the number of dirty pages is increasing more slowly.
> > >> But I am unsure of the side effects and am afraid of worsening the
> > >> starvation problem for MySQL.
> > >>
> > >> I am very much willing to work on this issue and see it fixed, but
> > >> would like to tap into the knowledge of people here.
> > >> So:
> > >> * Have more people seen this or similar issues?
> > >> * Is the hypothesis above a viable one?
> > >> * Suggestions/pointers for further research and statistics I should
> > >>   measure to improve the understanding of this problem.
> > >
> > > I don't think that noop-iosched tries to do anything to prevent
> > > writes-starve-reads. Do you get better behaviour from any of the
> > > other IO schedulers?
> >
> > I did a quick stress test and cfq does not immediately seem to hurt
> > performance, although some of my colleagues have tested this in the
> > past with the opposite results (which is why we use noop).
> >
> > But regardless of the scheduler, the real problem is the writeback
> > algorithm not keeping up.
> > We can grow 600K dirty pages during the day, and only ~300K are
> > flushed to disk during the night hours.
> >
> > A quick look at the writeback algorithm led me to expect wb_kupdate()
> > to flush ~1024 pages every 5 seconds, which is almost 3GB per hour.
> > It obviously does not manage to do this in our setup.
> >
> > I don't believe the speed of the SSD to be the problem: running sync
> > manually takes only a few minutes to flush 800K dirty pages to disk.
>
> kupdate surely should just continue to keep trying to write back pages
> so long as there are more old pages to clean, and the queue isn't
> congested. That seems to be the intention anyway: MAX_WRITEBACK_PAGES
> is just the number to write back in a single call, but you see
> nr_to_write is set to the number of dirty pages in the system.
>
> On your system, what must be happening is that more_io is not being
> set. The logic in fs/fs-writeback.c might be busted.

Hi Jos,

I prepared a debugging patch for 2.6.28. (I cannot observe writeback
problems on my local ext2 mount.)
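As a quick sanity check before tracing (a minimal sketch, not part of the patch): the dirty-page growth can be confirmed from /proc/vmstat by sampling nr_dirty twice and differencing. The snapshot values below are made-up so the arithmetic is self-contained; on a live system take the two readings some minutes apart, and note that the 4 KiB page size is an assumption (x86 default).

```shell
# Made-up snapshot values stand in for two reads of /proc/vmstat;
# on a real system use:  d1=$(awk '/^nr_dirty /{print $2}' /proc/vmstat)
d1=120000   # nr_dirty at time t0 (pages)
d2=150000   # nr_dirty at time t1 (pages)

delta=$((d2 - d1))
# 4 KiB pages assumed: pages * 4 / 1024 = MiB
echo "nr_dirty grew by $delta pages (~$((delta * 4 / 1024)) MiB)"
```

If that delta keeps climbing between kupdate runs, it matches the "wb_kupdate decides to do nothing" observation above.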
You can view the states of all dirty inodes by doing

        modprobe filecache
        echo ls dirty > /proc/filecache
        cat /proc/filecache

The 'age' field shows (jiffies - inode->dirtied_when), which may also be
useful for debugging Jeff and Ian's case (if it keeps growing, then
dirtied_when is stuck).

The detailed dirty writeback traces can be retrieved by doing

        echo 1 > /proc/sys/fs/dirty_debug
        sleep 6s
        echo 0 > /proc/sys/fs/dirty_debug
        dmesg

The dmesg trace should help identify the bug in periodic writeback.

Thanks,
Fengguang