* [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
  From: Mel Gorman @ 2010-08-26 15:14 UTC
  To: linux-mm, linux-fsdevel
  Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
      Wu Fengguang, Jan Kara, linux-kernel

congestion_wait() is a bit stupid in that it goes to sleep even when there
is no congestion. This causes stalls in a number of situations and may be
partially responsible for bug reports about desktop interactivity.

This patch series aims to account for these unnecessary calls to
congestion_wait() and to avoid going to sleep when there is no congestion.
Patches 1 and 2 add instrumentation related to congestion which should be
reusable by alternative solutions to congestion_wait. Patch 3 calls
cond_resched() instead of going to sleep if there is no congestion; the
logic is sketched below, just before the per-machine reports.

Once again, I shoved this through performance tests. Unlike previous
tests, I ran this on a ported version of my usual test suite that should
be suitable for release soon. It's not quite as good as my old set but
it's sufficient for this and related series. The tests I ran were
kernbench, vmr-stream, iozone, hackbench-sockets, hackbench-pipes,
netperf-udp, netperf-tcp, sysbench and stress-highalloc. Sysbench was a
read/write test and stress-highalloc is the usual test stressing the
number of high-order allocations that can be made while the system is
under severe stress. The suite contains the necessary analysis scripts as
well and I'd release it now except the documentation blows.

x86:    Intel Pentium D 3GHz with 3G RAM (no-brand machine)
x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
ppc64:  PPC970MP 2.5GHz with 3GB RAM (a Terrasoft Powerstation)

The disks on all of them were single disks and not particularly fast.
Comparison was between 2.6.36-rc1 with patches 1 and 2 applied for
instrumentation (traceonly) and a second run with patch 3 also applied
(nocongest).

In all cases, kernbench, hackbench, STREAM and iozone did not show any
performance difference because none of them were pressuring the system
enough to be calling congestion_wait(), so I won't post their results.
About all worth noting for them is that nothing horrible appeared to
break.

In the analysis scripts, I record an unnecessary sleep as one that took
place when there was no congestion. The post-processing scripts for
cond_resched() will only count an uncongested call to congestion_wait()
as unnecessary if the process actually gets scheduled. Ordinarily, we'd
expect it to continue uninterrupted.

One vague concern I have is that when too many pages are isolated, we
call congestion_wait(). This could now actively spin in the loop for its
quantum before calling cond_resched(). If it's called with no congestion,
it's hard to know what the proper thing to do there is.
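For reference while reading the reports, the patch 3 logic is roughly the
following. This is a sketch of the concept only, building on the
nr_bdi_congested counter that patch 2 introduces; it is not the literal
patch and the early-return value in particular is a simplification:

long congestion_wait(int sync, long timeout)
{
	long ret;
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[sync];

	/*
	 * If no BDI is congested, no wakeup event is coming and sleeping
	 * for the full timeout is pointless. Yield the CPU instead.
	 */
	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
		cond_resched();
		return timeout;
	}

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	ret = io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);

	return ret;
}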
X86

Sysbench on this machine was not stressed enough to call
congestion_wait() so I'll just discuss the stress-highalloc test. This is
the full report from the test suite:

STRESS-HIGHALLOC
                 stress-highalloc   stress-highalloc
                   traceonly-v1r1     nocongest-v1r1
Pass 1           70.00 ( 0.00%)     72.00 ( 2.00%)
Pass 2           72.00 ( 0.00%)     72.00 ( 0.00%)
At Rest          74.00 ( 0.00%)     73.00 (-1.00%)

FTrace Reclaim Statistics: vmscan
                                       traceonly-v1r1     nocongest-v1r1
Direct reclaims                                   409                755
Direct reclaim pages scanned                   185585             212524
Direct reclaim write file async I/O               442                554
Direct reclaim write anon async I/O             31789              27074
Direct reclaim write file sync I/O                 17                 23
Direct reclaim write anon sync I/O              17825              15013
Wake kswapd requests                              895               1274
Kswapd wakeups                                    387                432
Kswapd pages scanned                         16373859           12892992
Kswapd reclaim write file async I/O             29267              18188
Kswapd reclaim write anon async I/O           1243386            1080234
Kswapd reclaim write file sync I/O                  0                  0
Kswapd reclaim write anon sync I/O                  0                  0
Time stalled direct reclaim (seconds)         4479.04            3446.81
Time kswapd awake (seconds)                   2229.99            1218.52

Total pages scanned                          16559444           13105516
%age total pages scanned/written                7.99%              8.71%
%age file pages scanned/written                 0.18%              0.14%
Percentage Time Spent Direct Reclaim           74.99%             69.54%
Percentage Time kswapd Awake                   41.78%             28.57%

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                  474                 38
Direct number schedule    waited                    0               9478
Direct time   congest     waited              21564ms             3732ms
Direct time   schedule    waited                  0ms                4ms
Direct unnecessary wait                           434                  1
KSwapd number congest     waited                   68                  0
KSwapd number schedule    waited                    0                  0
KSwapd time   schedule    waited                  0ms                0ms
KSwapd time   congest     waited               5424ms                0ms
Kswapd unnecessary wait                            44                  0

MMTests Statistics: duration
User/Sys Time Running Test (seconds)          1493.97            1509.88
Total Elapsed Time (seconds)                  5337.71            4265.07

Allocations under stress were slightly better but by and large there is
no significant difference in success rates. The test completed 1072
seconds faster, which is a pretty decent speedup. Scanning rates in
reclaim were higher but that is somewhat expected because we weren't
going to sleep as much. Time stalled in reclaim for both direct reclaim
and kswapd was reduced, which is pretty significant. In terms of
congestion_wait, the time spent asleep was massively reduced - by about
17 seconds for direct reclaim and 5 seconds for kswapd. cond_resched()
is called a number of times instead, of course, but the time spent being
scheduled was a mere 4ms. Overall, this looked positive.

X86-64

Sysbench again wasn't under enough pressure so here is the
stress-highalloc test.
STRESS-HIGHALLOC
                 stress-highalloc   stress-highalloc
                   traceonly-v1r1     nocongest-v1r1
Pass 1           69.00 ( 0.00%)     73.00 ( 4.00%)
Pass 2           71.00 ( 0.00%)     74.00 ( 3.00%)
At Rest          72.00 ( 0.00%)     75.00 ( 3.00%)

FTrace Reclaim Statistics: vmscan
                                       traceonly-v1r1     nocongest-v1r1
Direct reclaims                                   646               1091
Direct reclaim pages scanned                    94779             102392
Direct reclaim write file async I/O               164                216
Direct reclaim write anon async I/O             12162              15413
Direct reclaim write file sync I/O                 64                 45
Direct reclaim write anon sync I/O               5366               6987
Wake kswapd requests                             3950               3912
Kswapd wakeups                                    613                579
Kswapd pages scanned                          7544412            7267203
Kswapd reclaim write file async I/O             14660              16256
Kswapd reclaim write anon async I/O            964824            1065445
Kswapd reclaim write file sync I/O                  0                  0
Kswapd reclaim write anon sync I/O                  0                  0
Time stalled direct reclaim (seconds)         3279.00            3564.59
Time kswapd awake (seconds)                   1445.70            1870.70

Total pages scanned                           7639191            7369595
%age total pages scanned/written               13.05%             14.99%
%age file pages scanned/written                 0.19%              0.22%
Percentage Time Spent Direct Reclaim           70.48%             72.04%
Percentage Time kswapd Awake                   35.62%             42.94%

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                  801                 97
Direct number schedule    waited                    0              16079
Direct time   congest     waited              37448ms             9004ms
Direct time   schedule    waited                  0ms                0ms
Direct unnecessary wait                           696                  0
KSwapd number congest     waited                   10                  1
KSwapd number schedule    waited                    0                  0
KSwapd time   schedule    waited                  0ms                0ms
KSwapd time   congest     waited                900ms              100ms
Kswapd unnecessary wait                             6                  0

MMTests Statistics: duration
User/Sys Time Running Test (seconds)          1373.11             1383.7
Total Elapsed Time (seconds)                  4058.33            4356.47

Success rates were slightly higher again; not by a massive amount, but
some. Time to complete the test was unfortunately increased slightly
though and I'm not sure where that is coming from. The increased number
of successful allocations would account for some of it because the
system is under greater memory pressure as a result of the allocations.
Scanning rates are comparable. Writing back files from reclaim was
slightly increased, which I believe is due to less time being spent
asleep, so there was a smaller window for the flusher threads to do
their work. Reducing that is the responsibility of another series.
Again, the time spent asleep in congestion_wait() is reduced by a large
amount - 28 seconds for direct reclaim - and none of the cond_resched()
calls resulted in sleep time. Overall, this seems reasonable.
PPC64

Unlike the other two machines, sysbench called congestion_wait() a few
times so here are the full results for sysbench:

SYSBENCH
          traceonly-v1r1      nocongest-v1r1
 1     5307.36 ( 0.00%)    5349.58 ( 0.79%)
 2     9886.45 ( 0.00%)   10274.78 ( 3.78%)
 3    14165.01 ( 0.00%)   14210.64 ( 0.32%)
 4    16239.12 ( 0.00%)   16201.46 (-0.23%)
 5    15337.09 ( 0.00%)   15541.56 ( 1.32%)
 6    14763.64 ( 0.00%)   15805.80 ( 6.59%)
 7    14216.69 ( 0.00%)   15023.57 ( 5.37%)
 8    13749.62 ( 0.00%)   14492.34 ( 5.12%)
 9    13647.75 ( 0.00%)   13969.77 ( 2.31%)
10    13275.70 ( 0.00%)   13495.08 ( 1.63%)
11    13324.91 ( 0.00%)   12879.81 (-3.46%)
12    13169.23 ( 0.00%)   12967.36 (-1.56%)
13    12896.20 ( 0.00%)   12981.43 ( 0.66%)
14    12793.44 ( 0.00%)   12768.26 (-0.20%)
15    12627.98 ( 0.00%)   12522.86 (-0.84%)
16    12228.54 ( 0.00%)   12352.07 ( 1.00%)

FTrace Reclaim Statistics: vmscan
                                       traceonly-v1r1     nocongest-v1r1
Direct reclaims                                     0                  0
Direct reclaim pages scanned                        0                  0
Direct reclaim write file async I/O                 0                  0
Direct reclaim write anon async I/O                 0                  0
Direct reclaim write file sync I/O                  0                  0
Direct reclaim write anon sync I/O                  0                  0
Wake kswapd requests                                0                  0
Kswapd wakeups                                    202                194
Kswapd pages scanned                          5990987            5618709
Kswapd reclaim write file async I/O                24                 16
Kswapd reclaim write anon async I/O              1509               1564
Kswapd reclaim write file sync I/O                  0                  0
Kswapd reclaim write anon sync I/O                  0                  0
Time stalled direct reclaim (seconds)            0.00               0.00
Time kswapd awake (seconds)                    174.23             152.17

Total pages scanned                           5990987            5618709
%age total pages scanned/written                0.03%              0.03%
%age file pages scanned/written                 0.00%              0.00%
Percentage Time Spent Direct Reclaim            0.00%              0.00%
Percentage Time kswapd Awake                    2.80%              2.60%

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                    0                  0
Direct number schedule    waited                    0                  0
Direct time   congest     waited                  0ms                0ms
Direct time   schedule    waited                  0ms                0ms
Direct unnecessary wait                             0                  0
KSwapd number congest     waited                   10                  3
KSwapd number schedule    waited                    0                  0
KSwapd time   schedule    waited                  0ms                0ms
KSwapd time   congest     waited                800ms              300ms
Kswapd unnecessary wait                             6                  0

Performance is improved by a decent margin although I didn't check if it
was statistically significant or not. The time kswapd spent asleep was
slightly reduced.
STRESS-HIGHALLOC
                 stress-highalloc   stress-highalloc
                   traceonly-v1r1     nocongest-v1r1
Pass 1           40.00 ( 0.00%)     35.00 (-5.00%)
Pass 2           50.00 ( 0.00%)     45.00 (-5.00%)
At Rest          61.00 ( 0.00%)     64.00 ( 3.00%)

FTrace Reclaim Statistics: vmscan
                                       traceonly-v1r1     nocongest-v1r1
Direct reclaims                                   166                926
Direct reclaim pages scanned                   167920             183644
Direct reclaim write file async I/O               391                412
Direct reclaim write anon async I/O             31563              31986
Direct reclaim write file sync I/O                 54                 52
Direct reclaim write anon sync I/O              21696              17087
Wake kswapd requests                              123                128
Kswapd wakeups                                    143                143
Kswapd pages scanned                          3899414            4229450
Kswapd reclaim write file async I/O             12392              13098
Kswapd reclaim write anon async I/O            673260             709817
Kswapd reclaim write file sync I/O                  0                  0
Kswapd reclaim write anon sync I/O                  0                  0
Time stalled direct reclaim (seconds)         1595.13            1692.18
Time kswapd awake (seconds)                   1114.00            1210.48

Total pages scanned                           4067334            4413094
%age total pages scanned/written               18.18%             17.50%
%age file pages scanned/written                 0.32%              0.31%
Percentage Time Spent Direct Reclaim           45.89%             47.50%
Percentage Time kswapd Awake                   46.09%             48.04%

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                  233                 16
Direct number schedule    waited                    0               1323
Direct time   congest     waited              10164ms             1600ms
Direct time   schedule    waited                  0ms                0ms
Direct unnecessary wait                           218                  0
KSwapd number congest     waited                   11                 13
KSwapd number schedule    waited                    0                  3
KSwapd time   schedule    waited                  0ms                0ms
KSwapd time   congest     waited               1100ms             1244ms
Kswapd unnecessary wait                             0                  0

MMTests Statistics: duration
User/Sys Time Running Test (seconds)          1880.56            1870.36
Total Elapsed Time (seconds)                  2417.17            2519.51

Allocation success rates are slightly down but on PPC64 they are always
very difficult and I have other ideas on how allocation success rates
could be improved. What is more important is that, again, the time spent
asleep due to congestion_wait() was reduced for direct reclaimers. The
results here aren't as positive as on the other two machines but they
still seem acceptable.

Broadly speaking, I think sleeping in congestion_wait() has been
responsible for some bugs related to stalls under large amounts of IO,
particularly read IO, so we need to do something about it. These tests
seem overall positive but it'd be interesting to hear whether someone
with a workload that stalls in congestion_wait() unnecessarily is helped
by this patch. Desktop interactivity would be harder to test because I
think it has multiple root causes, of which congestion_wait is just one.
I've included Christian Ehrhardt in the cc because he had a bug back in
April that was rooted in congestion_wait() that I think this might help,
and hopefully he can provide hard data for a workload with lots of IO
but constrained memory. I cc'd Johannes because we were discussing
congestion_wait() at LSF/MM and he might have some thoughts, and I think
I was talking to Jan briefly about congestion_wait() as well. As this
affects writeback, Wu and fsdevel might have some opinions.

 include/trace/events/writeback.h |   22 ++++++++++++++++++++++
 mm/backing-dev.c                 |   31 ++++++++++++++++++++++++++-----
 2 files changed, 48 insertions(+), 5 deletions(-)
* [PATCH 1/3] writeback: Account for time spent congestion_waited
  From: Mel Gorman @ 2010-08-26 15:14 UTC
  To: linux-mm, linux-fsdevel
  Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
      Wu Fengguang, Jan Kara, linux-kernel

There is strong evidence to indicate a lot of time is being spent in
congestion_wait(), some of it unnecessarily. This patch adds a
tracepoint for congestion_wait to record when congestion_wait() occurred
and how long was spent.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/writeback.h |   17 +++++++++++++++++
 mm/backing-dev.c                 |    4 ++++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index f345f66..e3bee61 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -153,6 +153,23 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_written);
 DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_congest_waited,
+
+	TP_PROTO(unsigned int usec_delayed),
+
+	TP_ARGS(usec_delayed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	usec_delayed	)
+	),
+
+	TP_fast_assign(
+		__entry->usec_delayed	= usec_delayed;
+	),
+
+	TP_printk("usec_delayed=%u", __entry->usec_delayed)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index eaa4a5b..7ae33e2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -759,12 +759,16 @@ EXPORT_SYMBOL(set_bdi_congested);
 long congestion_wait(int sync, long timeout)
 {
 	long ret;
+	unsigned long start = jiffies;
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
+
+	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start));
+
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-- 
1.7.1
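A usage note rather than part of the patch: the event above is defined
in the "writeback" trace group, so with debugfs mounted in the usual
place it should be controllable at runtime via
/sys/kernel/debug/tracing/events/writeback/writeback_congest_waited/enable,
after which each congestion_wait() emits a trace line carrying
usec_delayed=<usecs> in line with the TP_printk format above. The exact
path depends on where debugfs is mounted.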
* Re: [PATCH 1/3] writeback: Account for time spent congestion_waited
  From: Minchan Kim @ 2010-08-26 17:23 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:14PM +0100, Mel Gorman wrote:
> There is strong evidence to indicate a lot of time is being spent in
> congestion_wait(), some of it unnecessarily. This patch adds a
> tracepoint for congestion_wait to record when congestion_wait() occurred
> and how long was spent.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

I think adding the tracepoint is enough for now, at least until this
issue is solved.

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 1/3] writeback: Account for time spent congestion_waited
  From: Johannes Weiner @ 2010-08-26 18:10 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:14PM +0100, Mel Gorman wrote:
> There is strong evidence to indicate a lot of time is being spent in
> congestion_wait(), some of it unnecessarily. This patch adds a
> tracepoint for congestion_wait to record when congestion_wait() occurred
> and how long was spent.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-26 15:14 UTC
  To: linux-mm, linux-fsdevel
  Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
      Wu Fengguang, Jan Kara, linux-kernel

If congestion_wait() is called when there is no congestion, the caller
will wait for the full timeout. This can cause unreasonable and
unnecessary stalls. There are a number of potential modifications that
could be made to wake sleepers but this patch measures how serious the
problem is. It keeps count of how many congested BDIs there are. If
congestion_wait() is called with no BDIs congested, the tracepoint will
record that the wait was unnecessary.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/writeback.h |   11 ++++++++---
 mm/backing-dev.c                 |   15 ++++++++++++---
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index e3bee61..03bb04b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
 
 TRACE_EVENT(writeback_congest_waited,
 
-	TP_PROTO(unsigned int usec_delayed),
+	TP_PROTO(unsigned int usec_delayed, bool unnecessary),
 
-	TP_ARGS(usec_delayed),
+	TP_ARGS(usec_delayed, unnecessary),
 
 	TP_STRUCT__entry(
 		__field(	unsigned int,	usec_delayed	)
+		__field(	unsigned int,	unnecessary	)
 	),
 
 	TP_fast_assign(
 		__entry->usec_delayed	= usec_delayed;
+		__entry->unnecessary	= unnecessary;
 	),
 
-	TP_printk("usec_delayed=%u", __entry->usec_delayed)
+	TP_printk("usec_delayed=%u unnecessary=%d",
+		__entry->usec_delayed,
+		__entry->unnecessary
+	)
 );
 
 #endif /* _TRACE_WRITEBACK_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7ae33e2..a49167f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
+static atomic_t nr_bdi_congested[2];
 
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
@@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	clear_bit(bit, &bdi->state);
+	if (test_and_clear_bit(bit, &bdi->state))
+		atomic_dec(&nr_bdi_congested[sync]);
 	smp_mb__after_clear_bit();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
@@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	enum bdi_state bit;
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	set_bit(bit, &bdi->state);
+	if (!test_and_set_bit(bit, &bdi->state))
+		atomic_inc(&nr_bdi_congested[sync]);
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
@@ -760,14 +763,20 @@ long congestion_wait(int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
+	bool unnecessary = false;
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
+	/* Check if this call to congestion_wait was necessary */
+	if (atomic_read(&nr_bdi_congested[sync]) == 0)
+		unnecessary = true;
+
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 
-	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start));
+	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
+					unnecessary);
 
 	return ret;
 }
-- 
1.7.1
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Minchan Kim @ 2010-08-26 17:35 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> If congestion_wait() is called when there is no congestion, the caller
> will wait for the full timeout. This can cause unreasonable and
> unnecessary stalls. There are a number of potential modifications that
> could be made to wake sleepers but this patch measures how serious the
> problem is. It keeps count of how many congested BDIs there are. If
> congestion_wait() is called with no BDIs congested, the tracepoint will
> record that the wait was unnecessary.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/trace/events/writeback.h |   11 ++++++++---
>  mm/backing-dev.c                 |   15 ++++++++++++---
>  2 files changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index e3bee61..03bb04b 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
>  
>  TRACE_EVENT(writeback_congest_waited,
>  
> -	TP_PROTO(unsigned int usec_delayed),
> +	TP_PROTO(unsigned int usec_delayed, bool unnecessary),
>  
> -	TP_ARGS(usec_delayed),
> +	TP_ARGS(usec_delayed, unnecessary),
>  
>  	TP_STRUCT__entry(
>  		__field(	unsigned int,	usec_delayed	)
> +		__field(	unsigned int,	unnecessary	)
>  	),
>  
>  	TP_fast_assign(
>  		__entry->usec_delayed	= usec_delayed;
> +		__entry->unnecessary	= unnecessary;
>  	),
>  
> -	TP_printk("usec_delayed=%u", __entry->usec_delayed)
> +	TP_printk("usec_delayed=%u unnecessary=%d",
> +		__entry->usec_delayed,
> +		__entry->unnecessary
> +	)
>  );
>  
>  #endif /* _TRACE_WRITEBACK_H */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7ae33e2..a49167f 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
>  	};
> +static atomic_t nr_bdi_congested[2];
>  
>  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
>  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> -	clear_bit(bit, &bdi->state);
> +	if (test_and_clear_bit(bit, &bdi->state))
> +		atomic_dec(&nr_bdi_congested[sync]);

Hmm. congestion_wait's semantics are "wait for _a_ backing_dev to become
uncongested", but this seems to consider whole backing devs. Is that
your intention? Or am I missing something?

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-26 17:41 UTC
  To: Minchan Kim
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Fri, Aug 27, 2010 at 02:35:34AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/trace/events/writeback.h |   11 ++++++++---
> >  mm/backing-dev.c                 |   15 ++++++++++++---
> >  2 files changed, 20 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> > index e3bee61..03bb04b 100644
> > --- a/include/trace/events/writeback.h
> > +++ b/include/trace/events/writeback.h
> > @@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
> >  
> >  TRACE_EVENT(writeback_congest_waited,
> >  
> > -	TP_PROTO(unsigned int usec_delayed),
> > +	TP_PROTO(unsigned int usec_delayed, bool unnecessary),
> >  
> > -	TP_ARGS(usec_delayed),
> > +	TP_ARGS(usec_delayed, unnecessary),
> >  
> >  	TP_STRUCT__entry(
> >  		__field(	unsigned int,	usec_delayed	)
> > +		__field(	unsigned int,	unnecessary	)
> >  	),
> >  
> >  	TP_fast_assign(
> >  		__entry->usec_delayed	= usec_delayed;
> > +		__entry->unnecessary	= unnecessary;
> >  	),
> >  
> > -	TP_printk("usec_delayed=%u", __entry->usec_delayed)
> > +	TP_printk("usec_delayed=%u unnecessary=%d",
> > +		__entry->usec_delayed,
> > +		__entry->unnecessary
> > +	)
> >  );
> >  
> >  #endif /* _TRACE_WRITEBACK_H */
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 7ae33e2..a49167f 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> >  	};
> > +static atomic_t nr_bdi_congested[2];
> >  
> >  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  {
> > @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  	wait_queue_head_t *wqh = &congestion_wqh[sync];
> >  
> >  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> > -	clear_bit(bit, &bdi->state);
> > +	if (test_and_clear_bit(bit, &bdi->state))
> > +		atomic_dec(&nr_bdi_congested[sync]);
>
> Hmm. congestion_wait's semantics are "wait for _a_ backing_dev to become
> uncongested", but this seems to consider whole backing devs. Is that
> your intention? Or am I missing something?
>

Not whole backing devs, all backing devs. This is intentional. If
congestion_wait() is called with 0 BDIs congested, we sleep the full
timeout because a wakeup event will not occur - this is a bad scenario.
To know if 0 BDIs were congested, one could either walk all the BDIs
checking their status or maintain a counter like nr_bdi_congested, which
is what I decided on (the rejected walk is sketched below for
comparison).
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
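For comparison, the BDI walk described above would look something like
the following. This is an illustration only, assuming the RCU-protected
bdi_list in mm/backing-dev.c; it is not part of the series:

static bool any_bdi_congested(int sync)
{
	enum bdi_state bit = sync ? BDI_sync_congested : BDI_async_congested;
	struct backing_dev_info *bdi;
	bool congested = false;

	/* O(number of BDIs) on every call into congestion_wait() */
	rcu_read_lock();
	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
		if (test_bit(bit, &bdi->state)) {
			congested = true;
			break;
		}
	}
	rcu_read_unlock();

	return congested;
}

The nr_bdi_congested counter keeps the check in congestion_wait() O(1),
at the cost of one extra atomic operation whenever a BDI changes
congestion state.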
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Johannes Weiner @ 2010-08-26 18:29 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> If congestion_wait() is called when there is no congestion, the caller
> will wait for the full timeout. This can cause unreasonable and
> unnecessary stalls. There are a number of potential modifications that
> could be made to wake sleepers but this patch measures how serious the
> problem is. It keeps count of how many congested BDIs there are. If
> congestion_wait() is called with no BDIs congested, the tracepoint will
> record that the wait was unnecessary.

I am not convinced that unnecessary is the right word. On a workload
without any IO (i.e. no congestion_wait() necessary, ever), I noticed
the VM regressing both in time and in reclaiming the right pages when
simply removing congestion_wait() from the direct reclaim paths (the
one in __alloc_pages_slowpath and the other one in
do_try_to_free_pages).

So just being stupid and waiting for the timeout in direct reclaim
while kswapd can make progress seemed to do a better job for that
load.

I can not exactly pinpoint the reason for that behaviour; it would be
nice if somebody had an idea.

So personally I think it's a good idea to get an insight on the use of
congestion_wait() [patch 1] but I don't agree with changing its
behaviour just yet, or judging its usefulness solely on whether it
correctly waits for bdi congestion.
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-26 20:31 UTC
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
>
> I am not convinced that unnecessary is the right word. On a workload
> without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> the VM regressing both in time and in reclaiming the right pages when
> simply removing congestion_wait() from the direct reclaim paths (the
> one in __alloc_pages_slowpath and the other one in
> do_try_to_free_pages).
>
> So just being stupid and waiting for the timeout in direct reclaim
> while kswapd can make progress seemed to do a better job for that
> load.
>
> I can not exactly pinpoint the reason for that behaviour; it would be
> nice if somebody had an idea.
>

There is a possibility that the behaviour in that case was due to flusher
threads doing the writes rather than direct reclaim queueing pages for IO
in an inefficient manner. So the stall is stupid but happens to work out
well because flusher threads get the chance to do work.

> So personally I think it's a good idea to get an insight on the use of
> congestion_wait() [patch 1] but I don't agree with changing its
> behaviour just yet, or judging its usefulness solely on whether it
> correctly waits for bdi congestion.
>

Unfortunately, I strongly suspect that some of the desktop stalls seen
during IO (one of which involved no writes) were due to calling
congestion_wait and waiting the full timeout where no writes are going
on.

It gets potentially worse too. Let's say we have a system with many BDIs
of different speeds - e.g. an SSD on one end of the spectrum and a USB
flash drive on the other. The congestion for writes could be on the USB
flash drive but, due to low memory, the allocator, direct reclaimers and
kswapd go to sleep periodically on congestion_wait for the USB stick
even though the bulk of the pages needing reclaim are backed by the SSD.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Shaohua Li @ 2010-08-27 2:12 UTC
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
      Andrew Morton, Christian Ehrhardt, Wu, Fengguang, Jan Kara,
      linux-kernel@vger.kernel.org

On Fri, 2010-08-27 at 04:31 +0800, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called when there is no congestion, the caller
> > > will wait for the full timeout. This can cause unreasonable and
> > > unnecessary stalls. There are a number of potential modifications that
> > > could be made to wake sleepers but this patch measures how serious the
> > > problem is. It keeps count of how many congested BDIs there are. If
> > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > record that the wait was unnecessary.
> >
> > I am not convinced that unnecessary is the right word. On a workload
> > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > the VM regressing both in time and in reclaiming the right pages when
> > simply removing congestion_wait() from the direct reclaim paths (the
> > one in __alloc_pages_slowpath and the other one in
> > do_try_to_free_pages).
> >
> > So just being stupid and waiting for the timeout in direct reclaim
> > while kswapd can make progress seemed to do a better job for that
> > load.
> >
> > I can not exactly pinpoint the reason for that behaviour; it would be
> > nice if somebody had an idea.
>
> There is a possibility that the behaviour in that case was due to flusher
> threads doing the writes rather than direct reclaim queueing pages for IO
> in an inefficient manner. So the stall is stupid but happens to work out
> well because flusher threads get the chance to do work.

If this is the case, we already have the queue congested. Removing
congestion_wait() might cause a regression, but either your change or
congestion_wait_check() should not have the regression, as we do check
if the bdi is congested.

Thanks,
Shaohua
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-27 9:20 UTC
  To: Shaohua Li
  Cc: Johannes Weiner, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
      Andrew Morton, Christian Ehrhardt, Wu, Fengguang, Jan Kara,
      linux-kernel@vger.kernel.org

On Fri, Aug 27, 2010 at 10:12:10AM +0800, Shaohua Li wrote:
> On Fri, 2010-08-27 at 04:31 +0800, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called when there is no congestion, the caller
> > > > will wait for the full timeout. This can cause unreasonable and
> > > > unnecessary stalls. There are a number of potential modifications that
> > > > could be made to wake sleepers but this patch measures how serious the
> > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > record that the wait was unnecessary.
> > >
> > > I am not convinced that unnecessary is the right word. On a workload
> > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > the VM regressing both in time and in reclaiming the right pages when
> > > simply removing congestion_wait() from the direct reclaim paths (the
> > > one in __alloc_pages_slowpath and the other one in
> > > do_try_to_free_pages).
> > >
> > > So just being stupid and waiting for the timeout in direct reclaim
> > > while kswapd can make progress seemed to do a better job for that
> > > load.
> > >
> > > I can not exactly pinpoint the reason for that behaviour; it would be
> > > nice if somebody had an idea.
> >
> > There is a possibility that the behaviour in that case was due to flusher
> > threads doing the writes rather than direct reclaim queueing pages for IO
> > in an inefficient manner. So the stall is stupid but happens to work out
> > well because flusher threads get the chance to do work.
>
> If this is the case, we already have the queue congested.

Not necessarily. The fact that with the full series we sometimes call
cond_resched(), indicating that there was no congestion when
congestion_wait() was called, proves that. We might have some IO on the
queue but it's not congested.

Also, there is no guarantee that the congested queue is one we care
about. If we are reclaiming main memory and the congested queue is a
USB stick, we do not necessarily need to stall.

> Removing congestion_wait() might cause a regression, but either your
> change or congestion_wait_check() should not have the regression, as we
> do check if the bdi is congested.
>

What congestion_wait_check()? If there is no congestion and no writes,
congestion is the wrong event to sleep on.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Johannes Weiner @ 2010-08-27 8:16 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called when there is no congestion, the caller
> > > will wait for the full timeout. This can cause unreasonable and
> > > unnecessary stalls. There are a number of potential modifications that
> > > could be made to wake sleepers but this patch measures how serious the
> > > problem is. It keeps count of how many congested BDIs there are. If
> > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > record that the wait was unnecessary.
> >
> > I am not convinced that unnecessary is the right word. On a workload
> > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > the VM regressing both in time and in reclaiming the right pages when
> > simply removing congestion_wait() from the direct reclaim paths (the
> > one in __alloc_pages_slowpath and the other one in
> > do_try_to_free_pages).
> >
> > So just being stupid and waiting for the timeout in direct reclaim
> > while kswapd can make progress seemed to do a better job for that
> > load.
> >
> > I can not exactly pinpoint the reason for that behaviour; it would be
> > nice if somebody had an idea.
>
> There is a possibility that the behaviour in that case was due to flusher
> threads doing the writes rather than direct reclaim queueing pages for IO
> in an inefficient manner. So the stall is stupid but happens to work out
> well because flusher threads get the chance to do work.

The workload was accessing a large sparse-file through mmap, so there
wasn't much IO in the first place.

And I experimented on the latest -mmotm where direct reclaim wouldn't
do writeback by itself anymore, but kick the flushers.

> > So personally I think it's a good idea to get an insight on the use of
> > congestion_wait() [patch 1] but I don't agree with changing its
> > behaviour just yet, or judging its usefulness solely on whether it
> > correctly waits for bdi congestion.
>
> Unfortunately, I strongly suspect that some of the desktop stalls seen
> during IO (one of which involved no writes) were due to calling
> congestion_wait and waiting the full timeout where no writes are going
> on.

Oh, I am in full agreement here! Removing those congestion_wait() as
described above showed a reduction in peak latency. The dilemma is
only that it increased the overall walltime of the load.

And the scanning behaviour deteriorated, as in having increased
scanning pressure on other zones than the unpatched kernel did.

So I think very much that we need a fix. congestion_wait() causes
stalls and relying on random sleeps for the current reclaim behaviour
can not be the solution, at all.

I just don't think we can remove it based on the argument that it
doesn't do what it is supposed to do, when it does other things right
that it is not supposed to do ;-)
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-27 9:24 UTC
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called when there is no congestion, the caller
> > > > will wait for the full timeout. This can cause unreasonable and
> > > > unnecessary stalls. There are a number of potential modifications that
> > > > could be made to wake sleepers but this patch measures how serious the
> > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > record that the wait was unnecessary.
> > >
> > > I am not convinced that unnecessary is the right word. On a workload
> > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > the VM regressing both in time and in reclaiming the right pages when
> > > simply removing congestion_wait() from the direct reclaim paths (the
> > > one in __alloc_pages_slowpath and the other one in
> > > do_try_to_free_pages).
> > >
> > > So just being stupid and waiting for the timeout in direct reclaim
> > > while kswapd can make progress seemed to do a better job for that
> > > load.
> > >
> > > I can not exactly pinpoint the reason for that behaviour; it would be
> > > nice if somebody had an idea.
> >
> > There is a possibility that the behaviour in that case was due to flusher
> > threads doing the writes rather than direct reclaim queueing pages for IO
> > in an inefficient manner. So the stall is stupid but happens to work out
> > well because flusher threads get the chance to do work.
>
> The workload was accessing a large sparse-file through mmap, so there
> wasn't much IO in the first place.
>

Then waiting on congestion was the totally wrong thing to do. We were
effectively calling sleep(HZ/10) and magically this was helping in some
undefined manner. Do you know *which* caller of congestion_wait() was
the most important to you?

> And I experimented on the latest -mmotm where direct reclaim wouldn't
> do writeback by itself anymore, but kick the flushers.
>

What were the results? I'm preparing a full series incorporating a
number of patches in this area to see how they behave in aggregate.

> > > So personally I think it's a good idea to get an insight on the use of
> > > congestion_wait() [patch 1] but I don't agree with changing its
> > > behaviour just yet, or judging its usefulness solely on whether it
> > > correctly waits for bdi congestion.
> >
> > Unfortunately, I strongly suspect that some of the desktop stalls seen
> > during IO (one of which involved no writes) were due to calling
> > congestion_wait and waiting the full timeout where no writes are going
> > on.
>
> Oh, I am in full agreement here! Removing those congestion_wait() as
> described above showed a reduction in peak latency. The dilemma is
> only that it increased the overall walltime of the load.
>

Do you know why? Leaving in random sleeps hardly seems to be the right
approach.

> And the scanning behaviour deteriorated, as in having increased
> scanning pressure on other zones than the unpatched kernel did.
>

Probably because it was scanning more but not finding what it needed.
There is a condition other than congestion it is having trouble with. In
some respects, I think if we change congestion_wait() as I propose, we
may see a case where CPU usage is higher because it's now encountering
the unspecified reclaim problem we have.

> So I think very much that we need a fix. congestion_wait() causes
> stalls and relying on random sleeps for the current reclaim behaviour
> can not be the solution, at all.
>
> I just don't think we can remove it based on the argument that it
> doesn't do what it is supposed to do, when it does other things right
> that it is not supposed to do ;-)
>

We are not removing it, we are just stopping it going to sleep for
stupid reasons. If we find that wall time is increasing as a result, we
have a path to figuring out what the real underlying problem is instead
of sweeping it under the rug. congestion_wait() is causing other
problems, such as Christian's bug of massive IO regressions, because it
was sleeping when it shouldn't.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Johannes Weiner @ 2010-08-30 13:19 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called when there is no congestion, the caller
> > > > > will wait for the full timeout. This can cause unreasonable and
> > > > > unnecessary stalls. There are a number of potential modifications that
> > > > > could be made to wake sleepers but this patch measures how serious the
> > > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > > record that the wait was unnecessary.
> > > >
> > > > I am not convinced that unnecessary is the right word. On a workload
> > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > > the VM regressing both in time and in reclaiming the right pages when
> > > > simply removing congestion_wait() from the direct reclaim paths (the
> > > > one in __alloc_pages_slowpath and the other one in
> > > > do_try_to_free_pages).
> > > >
> > > > So just being stupid and waiting for the timeout in direct reclaim
> > > > while kswapd can make progress seemed to do a better job for that
> > > > load.
> > > >
> > > > I can not exactly pinpoint the reason for that behaviour; it would be
> > > > nice if somebody had an idea.
> > >
> > > There is a possibility that the behaviour in that case was due to flusher
> > > threads doing the writes rather than direct reclaim queueing pages for IO
> > > in an inefficient manner. So the stall is stupid but happens to work out
> > > well because flusher threads get the chance to do work.
> >
> > The workload was accessing a large sparse-file through mmap, so there
> > wasn't much IO in the first place.
>
> Then waiting on congestion was the totally wrong thing to do. We were
> effectively calling sleep(HZ/10) and magically this was helping in some
> undefined manner. Do you know *which* caller of congestion_wait() was
> the most important to you?

Removing congestion_wait() in do_try_to_free_pages() definitely
worsens reclaim behaviour for this workload:

1. wallclock time of the testrun increases by 11%

2. the scanners do a worse job and go for the wrong zone:

-pgalloc_dma             79597
-pgalloc_dma32       134465902
+pgalloc_dma            297089
+pgalloc_dma32       134247237

-pgsteal_dma             77501
-pgsteal_dma32       133939446
+pgsteal_dma            294998
+pgsteal_dma32       133722312

-pgscan_kswapd_dma      145897
-pgscan_kswapd_dma32 266141381
+pgscan_kswapd_dma      287981
+pgscan_kswapd_dma32 186647637

-pgscan_direct_dma        9666
-pgscan_direct_dma32   1758655
+pgscan_direct_dma      302495
+pgscan_direct_dma32  80947179

-pageoutrun            1768531
-allocstall                614
+pageoutrun            1927451
+allocstall               8566

I attached the full vmstat contents below. Also the test program,
which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))

> > > > So personally I think it's a good idea to get an insight on the use of
> > > > congestion_wait() [patch 1] but I don't agree with changing its
> > > > behaviour just yet, or judging its usefulness solely on whether it
> > > > correctly waits for bdi congestion.
> > >
> > > Unfortunately, I strongly suspect that some of the desktop stalls seen
> > > during IO (one of which involved no writes) were due to calling
> > > congestion_wait and waiting the full timeout where no writes are going
> > > on.
> >
> > Oh, I am in full agreement here! Removing those congestion_wait() as
> > described above showed a reduction in peak latency. The dilemma is
> > only that it increased the overall walltime of the load.
>
> Do you know why? Leaving in random sleeps hardly seems to be the right
> approach.

I am still trying to find out what's going wrong.

> > And the scanning behaviour deteriorated, as in having increased
> > scanning pressure on other zones than the unpatched kernel did.
>
> Probably because it was scanning more but not finding what it needed.
> There is a condition other than congestion it is having trouble with. In
> some respects, I think if we change congestion_wait() as I propose, we
> may see a case where CPU usage is higher because it's now encountering
> the unspecified reclaim problem we have.

Exactly.

> > So I think very much that we need a fix. congestion_wait() causes
> > stalls and relying on random sleeps for the current reclaim behaviour
> > can not be the solution, at all.
> >
> > I just don't think we can remove it based on the argument that it
> > doesn't do what it is supposed to do, when it does other things right
> > that it is not supposed to do ;-)
>
> We are not removing it, we are just stopping it going to sleep for
> stupid reasons. If we find that wall time is increasing as a result, we
> have a path to figuring out what the real underlying problem is instead
> of sweeping it under the rug.

Well, for that testcase it is in effect the same as a removal, as
there's never congestion.

But again: I agree with your changes per se, I just don't think they
should get merged as long as they knowingly catalyze a problem that has
yet to be identified.
[-- Attachment #2: mapped-file-stream.c --]

/*
 * mapped-file-stream: each child maps a large sparse temporary file
 * and faults in one byte of every page.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

static int start_process(unsigned long nr_bytes)
{
	char filename[] = "/tmp/clog-XXXXXX";
	unsigned long i;
	char *map;
	int fd;

	/* Create an unlinked, sparse temporary file of the requested size */
	fd = mkstemp(filename);
	unlink(filename);
	if (fd == -1) {
		perror("mkstemp()");
		return -1;
	}
	if (ftruncate(fd, nr_bytes)) {
		perror("ftruncate()");
		return -1;
	}
	map = mmap(NULL, nr_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap()");
		return -1;
	}
	/* Defeat readahead so every access is a fresh page fault */
	if (madvise(map, nr_bytes, MADV_RANDOM)) {
		perror("madvise()");
		return -1;
	}
	/* Stop until the parent has forked all children, then touch pages */
	kill(getpid(), SIGSTOP);
	for (i = 0; i < nr_bytes; i += 4096)
		((volatile char *)map)[i];
	close(fd);
	return 0;
}

static int do_test(unsigned long nr_procs, unsigned long nr_bytes)
{
	pid_t procs[nr_procs];
	unsigned long i;
	int dummy;

	/* Fork each child and wait until it has stopped itself */
	for (i = 0; i < nr_procs; i++) {
		switch ((procs[i] = fork())) {
		case -1:
			kill(0, SIGKILL);
			perror("fork()");
			return -1;
		case 0:
			return start_process(nr_bytes);
		default:
			waitpid(procs[i], &dummy, WUNTRACED);
			break;
		}
	}
	/* Start all the stopped children at once */
	kill(0, SIGCONT);
	for (i = 0; i < nr_procs; i++)
		waitpid(procs[i], &dummy, 0);
	return 0;
}

static int xstrtoul(const char *str, unsigned long *valuep)
{
	unsigned long value;
	char *endp;

	value = strtoul(str, &endp, 0);
	if (*endp || (value == ULONG_MAX && errno == ERANGE))
		return -1;
	*valuep = value;
	return 0;
}

int main(int ac, char **av)
{
	unsigned long nr_procs, nr_bytes;

	if (ac != 3)
		goto usage;
	if (xstrtoul(av[1], &nr_procs))
		goto usage;
	if (xstrtoul(av[2], &nr_bytes))
		goto usage;
	setbuf(stdout, NULL);
	setbuf(stderr, NULL);
	return !!do_test(nr_procs, nr_bytes);
usage:
	fprintf(stderr, "usage: %s nr_procs nr_bytes\n", av[0]);
	return 1;
}

[-- Attachment #3: vmstat.a.2 --]

[-- Attachment #4: vmstat.b.2 --]
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  2010-08-30 13:19 ` Johannes Weiner
@ 2010-08-31 15:02   ` Mel Gorman
  2010-09-02 15:49     ` Johannes Weiner
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2010-08-31 15:02 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > > > If congestion_wait() is called when there is no congestion, the caller
> > > > > > will wait for the full timeout. This can cause unreasonable and
> > > > > > unnecessary stalls. There are a number of potential modifications that
> > > > > > could be made to wake sleepers but this patch measures how serious the
> > > > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > > > record that the wait was unnecessary.
> > > > >
> > > > > I am not convinced that unnecessary is the right word. On a workload
> > > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > > > the VM regressing both in time and in reclaiming the right pages when
> > > > > simply removing congestion_wait() from the direct reclaim paths (the
> > > > > one in __alloc_pages_slowpath and the other one in
> > > > > do_try_to_free_pages).
> > > > >
> > > > > So just being stupid and waiting for the timeout in direct reclaim
> > > > > while kswapd can make progress seemed to do a better job for that
> > > > > load.
> > > > >
> > > > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > > > nice if somebody had an idea.
> > > > >
> > > >
> > > > There is a possibility that the behaviour in that case was due to flusher
> > > > threads doing the writes rather than direct reclaim queueing pages for IO
> > > > in an inefficient manner. So the stall is stupid but happens to work out
> > > > well because flusher threads get the chance to do work.
> > >
> > > The workload was accessing a large sparse-file through mmap, so there
> > > wasn't much IO in the first place.
> > >
> >
> > Then waiting on congestion was the totally wrong thing to do. We were
> > effectively calling sleep(HZ/10) and magically this was helping in some
> > undefined manner. Do you know *which* caller of congestion_wait() was
> > the most important to you?
>
> Removing congestion_wait() in do_try_to_free_pages() definitely
> worsens reclaim behaviour for this workload:
>
> 1. wallclock time of the testrun increases by 11%
>
> 2. the scanners do a worse job and go for the wrong zone:
>
> -pgalloc_dma 79597
> -pgalloc_dma32 134465902
> +pgalloc_dma 297089
> +pgalloc_dma32 134247237
>
> -pgsteal_dma 77501
> -pgsteal_dma32 133939446
> +pgsteal_dma 294998
> +pgsteal_dma32 133722312
>
> -pgscan_kswapd_dma 145897
> -pgscan_kswapd_dma32 266141381
> +pgscan_kswapd_dma 287981
> +pgscan_kswapd_dma32 186647637
>
> -pgscan_direct_dma 9666
> -pgscan_direct_dma32 1758655
> +pgscan_direct_dma 302495
> +pgscan_direct_dma32 80947179
>
> -pageoutrun 1768531
> -allocstall 614
> +pageoutrun 1927451
> +allocstall 8566
>
> I attached the full vmstat contents below. Also the test program,
> which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
>

Excellent stuff. I didn't look at your vmstat output because it was for
an old patch and you have already highlighted the problems related to
the workload. Chances are, I'd just reach the same conclusions. What is
interesting is your workload.

> > > > > So personally I think it's a good idea to get an insight on the use of
> > > > > congestion_wait() [patch 1] but I don't agree with changing its
> > > > > behaviour just yet, or judging its usefulness solely on whether it
> > > > > correctly waits for bdi congestion.
> > > > >
> > > >
> > > > Unfortunately, I strongly suspect that some of the desktop stalls seen during
> > > > IO (one of which involved no writes) were due to calling congestion_wait
> > > > and waiting the full timeout where no writes are going on.
> > >
> > > Oh, I am in full agreement here! Removing those congestion_wait() as
> > > described above showed a reduction in peak latency. The dilemma is
> > > only that it increased the overall walltime of the load.
> > >
> >
> > Do you know why? Because leaving in random sleeps() hardly seems to be
> > the right approach.
>
> I am still trying to find out what's going wrong.
>
> > > And the scanning behaviour deteriorated, as in having increased
> > > scanning pressure on other zones than the unpatched kernel did.
> > >
> >
> > Probably because it was scanning more but not finding what it needed.
> > There is a condition other than congestion it is having trouble with. In
> > some respects, I think if we change congestion_wait() as I propose,
> > we may see a case where CPU usage is higher because it's now
> > encountering the unspecified reclaim problem we have.
>
> Exactly.
>
> > > So I think very much that we need a fix. congestion_wait() causes
> > > stalls and relying on random sleeps for the current reclaim behaviour
> > > can not be the solution, at all.
> > >
> > > I just don't think we can remove it based on the argument that it
> > > doesn't do what it is supposed to do, when it does other things right
> > > that it is not supposed to do ;-)
> > >
> >
> > We are not removing it, we are just stopping it going to sleep for
> > stupid reasons. If we find that wall time is increasing as a result, we
> > have a path to figuring out what the real underlying problem is instead
> > of sweeping it under the rug.
>
> Well, for that testcase it is in effect the same as a removal as
> there's never congestion.
>
> But again: I agree with your changes per se, I just don't think they
> should get merged as long as they knowingly catalyze a problem that
> has yet to be identified.

Ok, well there was some significant feedback on why wholesale changing of
congestion_wait() reached too far and I've incorporated that feedback.
I also integrated your workload into my testsuite (btw, because there is no
license the script has to download it from a google archive. I might get
back to you on licensing this so it can be made a permanent part of the suite).

These are the results just for your workload on the only machine I had
available with a lot of disk. There are a bunch of kernels because I'm testing
a superset of different series posted recently. The nocongest column is an
unreleased patch that has congestion_wait() and wait_iff_congested() that
only goes to sleep if there is real congestion or a lot of writeback going
on. Rather than worrying about the patch contents for now, let's consider
the results for just your workload.

The report is in 4 parts. The first is the vmstat counter differences as
a result of running your test. What counts as good and bad here is open
to interpretation. The second part is based on the vmscan tracepoints.
The third part is based on the congestion tracepoints and the final part
reports CPU usage and elapsed time.

MICRO
                       traceonly-v1r4          nocongest-v1r4          lowlumpy-v1r4           nodirect-v1r4
pgalloc_dma              89409.00 (  0.00%)     47750.00 ( -87.24%)     47430.00 ( -88.51%)     47246.00 ( -89.24%)
pgalloc_dma32        101407571.00 (  0.00%) 101518722.00 (   0.11%) 101502059.00 (   0.09%) 101511868.00 (   0.10%)
pgalloc_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma              74529.00 (  0.00%)     43386.00 ( -71.78%)     43213.00 ( -72.47%)     42691.00 ( -74.58%)
pgsteal_dma32        100666955.00 (  0.00%) 100712596.00 (   0.05%) 100712537.00 (   0.05%) 100713305.00 (   0.05%)
pgsteal_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma       118198.00 (  0.00%)     47370.00 (-149.52%)     49515.00 (-138.71%)     46134.00 (-156.21%)
pgscan_kswapd_dma32  177619794.00 (  0.00%) 161549938.00 (  -9.95%) 161679701.00 (  -9.86%) 156657926.00 ( -13.38%)
pgscan_kswapd_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma        27128.00 (  0.00%)     39215.00 (  30.82%)     36561.00 (  25.80%)     38806.00 (  30.09%)
pgscan_direct_dma32   23927492.00 (  0.00%)  40122173.00 (  40.36%)  39997463.00 (  40.18%)  45041626.00 (  46.88%)
pgscan_direct_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun              756020.00 (  0.00%)    903192.00 (  16.29%)    899965.00 (  15.99%)    868055.00 (  12.91%)
allocstall                2722.00 (  0.00%)     70156.00 (  96.12%)     67554.00 (  95.97%)     87691.00 (  96.90%)

So, the allocstall counts go up of course because it is incremented
every time direct reclaim is entered and nocongest is only going to
sleep when there is congestion or significant writeback. I don't see
this as being necessarily bad.

Direct scanning rates go up a bit as you'd expect - again because we
are sleeping less. It's interesting that the number of pages reclaimed is
reduced, implying that despite higher scanning rates, there is less
reclaim activity.

It's debatable if this is good or not because higher scanning rates in
themselves are not bad but fewer pages reclaimed seems positive so let's
see what the rest of the reports look like.
FTrace Reclaim Statistics: vmscan
                                        traceonly-v1r4  nocongest-v1r4  lowlumpy-v1r4  nodirect-v1r4
Direct reclaims                                   2722           70156          67554          87691
Direct reclaim pages scanned                  23955333        40161426       40034132       45080524
Direct reclaim write file async I/O                  0               0              0              0
Direct reclaim write anon async I/O                  0               0              0              0
Direct reclaim write file sync I/O                   0               0              0              0
Direct reclaim write anon sync I/O                   0               0              0              0
Wake kswapd requests                           2718040        17801688       17622777       18379572
Kswapd wakeups                                      24               1              1              1
Kswapd pages scanned                         177738381       161597313      161729224      156704078
Kswapd reclaim write file async I/O                  0               0              0              0
Kswapd reclaim write anon async I/O                  0               0              0              0
Kswapd reclaim write file sync I/O                   0               0              0              0
Kswapd reclaim write anon sync I/O                   0               0              0              0
Time stalled direct reclaim (seconds)           247.97           76.97          77.15          76.63
Time kswapd awake (seconds)                     489.17          400.20         403.19         390.08

Total pages scanned                          201693714       201758739      201763356      201784602
%age total pages scanned/written                 0.00%           0.00%          0.00%          0.00%
%age file pages scanned/written                  0.00%           0.00%          0.00%          0.00%
Percentage Time Spent Direct Reclaim            41.41%          16.03%         15.96%         16.32%
Percentage Time kswapd Awake                    98.76%          98.94%         98.96%         98.87%

Interesting: kswapd is now staying awake (it woke up only once) even though
the total time awake was reduced, and it looks like the far greater number
of wakeup requests was what kept it awake. Despite the higher scan
rates from direct reclaim, the time actually spent direct reclaiming is
significantly reduced.

Scanning rates and the number of times we direct reclaim go up but, as we
finish work a lot faster, it would seem that we are doing less work overall.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited                      3664               0              0              0
Direct time congest waited                    247636ms             0ms            0ms            0ms
Direct full congest waited                        3081               0              0              0
Direct number conditional waited                     0           47587          45659          58779
Direct time conditional waited                     0ms             0ms            0ms            0ms
Direct full conditional waited                    3081               0              0              0
KSwapd number congest waited                      1448             949            909            981
KSwapd time congest waited                    118552ms         31652ms        32780ms        38732ms
KSwapd full congest waited                        1056              90            115            147
KSwapd number conditional waited                     0               0              0              0
KSwapd time conditional waited                     0ms             0ms            0ms            0ms
KSwapd full conditional waited                    1056              90            115            147

congest waited is congestion_wait() and conditional waited is
wait_iff_congested(). Look at what happens to the congest waited times
for direct reclaim - they disappeared and, despite the number of times
wait_iff_congested() was called, it never actually decided it needed to
sleep. kswapd is still congestion waiting but the time it spent is
reduced.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)             350.9          403.27         406.12         393.02
Total Elapsed Time (seconds)                    495.29          404.47         407.44         394.53

This is plain old time. The same test completes 91 seconds faster.

Ordinarily at this point I would be preparing to do a full series report
including the other benchmarks but I'm interested in seeing whether there is
a significantly different reading of the above report as to whether it
is a "good" or "bad" result.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
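A note on reading the MICRO table above: the percentages appear to be computed
relative to the later kernel's value rather than the baseline, i.e.
(new - old) / new * 100. For example, pgalloc_dma gives
(47750 - 89409) / 47750 = -87.24% and allocstall gives
(70156 - 2722) / 70156 = 96.12%, both matching the columns. This is inferred
from the numbers themselves; the testsuite's report scripts are not shown in
the thread.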
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  2010-08-31 15:02 ` Mel Gorman
@ 2010-09-02 15:49   ` Johannes Weiner
  2010-09-02 18:28     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2010-09-02 15:49 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Tue, Aug 31, 2010 at 04:02:07PM +0100, Mel Gorman wrote:
> On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> > Removing congestion_wait() in do_try_to_free_pages() definitely
> > worsens reclaim behaviour for this workload:
> >
> > 1. wallclock time of the testrun increases by 11%
> >
> > 2. the scanners do a worse job and go for the wrong zone:
> >
> > -pgalloc_dma 79597
> > -pgalloc_dma32 134465902
> > +pgalloc_dma 297089
> > +pgalloc_dma32 134247237
> >
> > -pgsteal_dma 77501
> > -pgsteal_dma32 133939446
> > +pgsteal_dma 294998
> > +pgsteal_dma32 133722312
> >
> > -pgscan_kswapd_dma 145897
> > -pgscan_kswapd_dma32 266141381
> > +pgscan_kswapd_dma 287981
> > +pgscan_kswapd_dma32 186647637
> >
> > -pgscan_direct_dma 9666
> > -pgscan_direct_dma32 1758655
> > +pgscan_direct_dma 302495
> > +pgscan_direct_dma32 80947179
> >
> > -pageoutrun 1768531
> > -allocstall 614
> > +pageoutrun 1927451
> > +allocstall 8566
> >
> > I attached the full vmstat contents below. Also the test program,
> > which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
>
> Excellent stuff. I didn't look at your vmstat output because it was for
> an old patch and you have already highlighted the problems related to
> the workload. Chances are, I'd just reach the same conclusions. What is
> interesting is your workload.

[...]

> Ok, well there was some significant feedback on why wholesale changing of
> congestion_wait() reached too far and I've incorporated that feedback. I
> also integrated your workload into my testsuite (btw, because there is no
> license the script has to download it from a google archive. I might get
> back to you on licensing this so it can be made a permanent part of the suite).

Oh, certainly, feel free to add the following file header:

/*
 * Copyright (c) 2010 Johannes Weiner
 * Code released under the GNU GPLv2.
 */

> These are the results just for your workload on the only machine I had
> available with a lot of disk. There are a bunch of kernels because I'm testing
> a superset of different series posted recently. The nocongest column is an
> unreleased patch that has congestion_wait() and wait_iff_congested() that
> only goes to sleep if there is real congestion or a lot of writeback going
> on. Rather than worrying about the patch contents for now, let's consider
> the results for just your workload.
>
> The report is in 4 parts. The first is the vmstat counter differences as
> a result of running your test. What counts as good and bad here is open
> to interpretation. The second part is based on the vmscan tracepoints.
> The third part is based on the congestion tracepoints and the final part
> reports CPU usage and elapsed time.
>
> MICRO
>                        traceonly-v1r4          nocongest-v1r4          lowlumpy-v1r4           nodirect-v1r4
> pgalloc_dma              89409.00 (  0.00%)     47750.00 ( -87.24%)     47430.00 ( -88.51%)     47246.00 ( -89.24%)
> pgalloc_dma32        101407571.00 (  0.00%) 101518722.00 (   0.11%) 101502059.00 (   0.09%) 101511868.00 (   0.10%)
> pgalloc_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgsteal_dma              74529.00 (  0.00%)     43386.00 ( -71.78%)     43213.00 ( -72.47%)     42691.00 ( -74.58%)
> pgsteal_dma32        100666955.00 (  0.00%) 100712596.00 (   0.05%) 100712537.00 (   0.05%) 100713305.00 (   0.05%)
> pgsteal_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_kswapd_dma       118198.00 (  0.00%)     47370.00 (-149.52%)     49515.00 (-138.71%)     46134.00 (-156.21%)
> pgscan_kswapd_dma32  177619794.00 (  0.00%) 161549938.00 (  -9.95%) 161679701.00 (  -9.86%) 156657926.00 ( -13.38%)
> pgscan_kswapd_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_direct_dma        27128.00 (  0.00%)     39215.00 (  30.82%)     36561.00 (  25.80%)     38806.00 (  30.09%)
> pgscan_direct_dma32   23927492.00 (  0.00%)  40122173.00 (  40.36%)  39997463.00 (  40.18%)  45041626.00 (  46.88%)
> pgscan_direct_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pageoutrun              756020.00 (  0.00%)    903192.00 (  16.29%)    899965.00 (  15.99%)    868055.00 (  12.91%)
> allocstall                2722.00 (  0.00%)     70156.00 (  96.12%)     67554.00 (  95.97%)     87691.00 (  96.90%)
>
> So, the allocstall counts go up of course because it is incremented
> every time direct reclaim is entered and nocongest is only going to
> sleep when there is congestion or significant writeback. I don't see
> this as being necessarily bad.

Agreed. Also the dma zone is less allocated from, which I suppose is
only the second zone in the zonelist, after dma32. So allocations
succeed more often from the first-choice zone with your patches.

> Direct scanning rates go up a bit as you'd expect - again because we
> are sleeping less. It's interesting that the number of pages reclaimed is
> reduced, implying that despite higher scanning rates, there is less
> reclaim activity.
>
> It's debatable if this is good or not because higher scanning rates in
> themselves are not bad but fewer pages reclaimed seems positive so let's
> see what the rest of the reports look like.
>
> FTrace Reclaim Statistics: vmscan
>                                         traceonly-v1r4  nocongest-v1r4  lowlumpy-v1r4  nodirect-v1r4
> Direct reclaims                                   2722           70156          67554          87691
> Direct reclaim pages scanned                  23955333        40161426       40034132       45080524
> Direct reclaim write file async I/O                  0               0              0              0
> Direct reclaim write anon async I/O                  0               0              0              0
> Direct reclaim write file sync I/O                   0               0              0              0
> Direct reclaim write anon sync I/O                   0               0              0              0
> Wake kswapd requests                           2718040        17801688       17622777       18379572
> Kswapd wakeups                                      24               1              1              1
> Kswapd pages scanned                         177738381       161597313      161729224      156704078
> Kswapd reclaim write file async I/O                  0               0              0              0
> Kswapd reclaim write anon async I/O                  0               0              0              0
> Kswapd reclaim write file sync I/O                   0               0              0              0
> Kswapd reclaim write anon sync I/O                   0               0              0              0
> Time stalled direct reclaim (seconds)           247.97           76.97          77.15          76.63
> Time kswapd awake (seconds)                     489.17          400.20         403.19         390.08
>
> Total pages scanned                          201693714       201758739      201763356      201784602
> %age total pages scanned/written                 0.00%           0.00%          0.00%          0.00%
> %age file pages scanned/written                  0.00%           0.00%          0.00%          0.00%
> Percentage Time Spent Direct Reclaim            41.41%          16.03%         15.96%         16.32%
> Percentage Time kswapd Awake                    98.76%          98.94%         98.96%         98.87%
>
> Interesting: kswapd is now staying awake (it woke up only once) even though
> the total time awake was reduced, and it looks like the far greater number
> of wakeup requests was what kept it awake. Despite the higher scan
> rates from direct reclaim, the time actually spent direct reclaiming is
> significantly reduced.
>
> Scanning rates and the number of times we direct reclaim go up but, as we
> finish work a lot faster, it would seem that we are doing less work overall.

I do not reach the same conclusion here. More pages are scanned
overall on the same workload, so we _are_ doing more work.

The result for this single-threaded workload improves because CPU-time
is not the issue when the only runnable process needs memory.

But we are in fact becoming less efficient at reclaim, so it would
make sense to also test how this interacts with other processes that
do need the CPU concurrently.

> FTrace Reclaim Statistics: congestion_wait
> Direct number congest waited                      3664               0              0              0
> Direct time congest waited                    247636ms             0ms            0ms            0ms
> Direct full congest waited                        3081               0              0              0
> Direct number conditional waited                     0           47587          45659          58779
> Direct time conditional waited                     0ms             0ms            0ms            0ms
> Direct full conditional waited                    3081               0              0              0
> KSwapd number congest waited                      1448             949            909            981
> KSwapd time congest waited                    118552ms         31652ms        32780ms        38732ms
> KSwapd full congest waited                        1056              90            115            147
> KSwapd number conditional waited                     0               0              0              0
> KSwapd time conditional waited                     0ms             0ms            0ms            0ms
> KSwapd full conditional waited                    1056              90            115            147
>
> congest waited is congestion_wait() and conditional waited is
> wait_iff_congested(). Look at what happens to the congest waited times
> for direct reclaim - they disappeared and, despite the number of times
> wait_iff_congested() was called, it never actually decided it needed to
> sleep. kswapd is still congestion waiting but the time it spent is
> reduced.
>
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)             350.9          403.27         406.12         393.02
> Total Elapsed Time (seconds)                    495.29          404.47         407.44         394.53
>
> This is plain old time. The same test completes 91 seconds faster.
> Ordinarily at this point I would be preparing to do a full series report
> including the other benchmarks but I'm interested in seeing whether there is
> a significantly different reading of the above report as to whether it
> is a "good" or "bad" result.

I think one interesting piece that is missing is whether the
scanned/reclaimed ratio went up. Do you have the kswapd_steal counter
value still available to calculate that ratio?

A "good" result would be, IMO, if that ratio did not get worse, while
at the same time having reclaim perform better due to reduced sleeps.

Another aspect to look out for is increased overreclaim: the total
number of allocations went up (I suppose the sum of reclaimed pages as
well), which means reclaim became more eager and created more
throwout-refault churn. Those refaults were from a sparse file here, but
a slow backing dev would have more impact on wall clock time.
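The scanned/reclaimed ratio Johannes asks for can be derived from /proc/vmstat
captures without extra tracepoints. A rough sketch of the arithmetic -- my own,
not the testsuite's script; deltas between before/after captures are what you
would really want for a single run:

  # reclaim efficiency = pages stolen / pages scanned, from one capture
  awk '/^pgscan_/  { scan  += $2 }
       /^pgsteal_/ { steal += $2 }
       END { printf "scan %d steal %d ratio %.4f\n", scan, steal, steal/scan }' /proc/vmstat

The kswapd_steal counter Johannes mentions would similarly let the kswapd and
direct shares of the steals be separated out. Mel does effectively this
calculation in his reply below.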
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  2010-09-02 15:49 ` Johannes Weiner
@ 2010-09-02 18:28   ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2010-09-02 18:28 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Sep 02, 2010 at 05:49:54PM +0200, Johannes Weiner wrote:
> On Tue, Aug 31, 2010 at 04:02:07PM +0100, Mel Gorman wrote:
> > On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> > > Removing congestion_wait() in do_try_to_free_pages() definitely
> > > worsens reclaim behaviour for this workload:
> > >
> > > 1. wallclock time of the testrun increases by 11%
> > >
> > > 2. the scanners do a worse job and go for the wrong zone:
> > >
> > > -pgalloc_dma 79597
> > > -pgalloc_dma32 134465902
> > > +pgalloc_dma 297089
> > > +pgalloc_dma32 134247237
> > >
> > > -pgsteal_dma 77501
> > > -pgsteal_dma32 133939446
> > > +pgsteal_dma 294998
> > > +pgsteal_dma32 133722312
> > >
> > > -pgscan_kswapd_dma 145897
> > > -pgscan_kswapd_dma32 266141381
> > > +pgscan_kswapd_dma 287981
> > > +pgscan_kswapd_dma32 186647637
> > >
> > > -pgscan_direct_dma 9666
> > > -pgscan_direct_dma32 1758655
> > > +pgscan_direct_dma 302495
> > > +pgscan_direct_dma32 80947179
> > >
> > > -pageoutrun 1768531
> > > -allocstall 614
> > > +pageoutrun 1927451
> > > +allocstall 8566
> > >
> > > I attached the full vmstat contents below. Also the test program,
> > > which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
> >
> > Excellent stuff. I didn't look at your vmstat output because it was for
> > an old patch and you have already highlighted the problems related to
> > the workload. Chances are, I'd just reach the same conclusions. What is
> > interesting is your workload.
>
> [...]
>
> > Ok, well there was some significant feedback on why wholesale changing of
> > congestion_wait() reached too far and I've incorporated that feedback. I
> > also integrated your workload into my testsuite (btw, because there is no
> > license the script has to download it from a google archive. I might get
> > back to you on licensing this so it can be made a permanent part of the suite).
>
> Oh, certainly, feel free to add the following file header:
>
> /*
>  * Copyright (c) 2010 Johannes Weiner
>  * Code released under the GNU GPLv2.
>  */
>

Super, thanks. Downloading off a mail archive is a bit contorted :)

> > These are the results just for your workload on the only machine I had
> > available with a lot of disk. There are a bunch of kernels because I'm testing
> > a superset of different series posted recently. The nocongest column is an
> > unreleased patch that has congestion_wait() and wait_iff_congested() that
> > only goes to sleep if there is real congestion or a lot of writeback going
> > on. Rather than worrying about the patch contents for now, let's consider
> > the results for just your workload.
> >
> > The report is in 4 parts. The first is the vmstat counter differences as
> > a result of running your test. What counts as good and bad here is open
> > to interpretation. The second part is based on the vmscan tracepoints.
> > The third part is based on the congestion tracepoints and the final part
> > reports CPU usage and elapsed time.
> >
> > MICRO
> >                        traceonly-v1r4          nocongest-v1r4          lowlumpy-v1r4           nodirect-v1r4
> > pgalloc_dma              89409.00 (  0.00%)     47750.00 ( -87.24%)     47430.00 ( -88.51%)     47246.00 ( -89.24%)
> > pgalloc_dma32        101407571.00 (  0.00%) 101518722.00 (   0.11%) 101502059.00 (   0.09%) 101511868.00 (   0.10%)
> > pgalloc_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgsteal_dma              74529.00 (  0.00%)     43386.00 ( -71.78%)     43213.00 ( -72.47%)     42691.00 ( -74.58%)
> > pgsteal_dma32        100666955.00 (  0.00%) 100712596.00 (   0.05%) 100712537.00 (   0.05%) 100713305.00 (   0.05%)
> > pgsteal_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_kswapd_dma       118198.00 (  0.00%)     47370.00 (-149.52%)     49515.00 (-138.71%)     46134.00 (-156.21%)
> > pgscan_kswapd_dma32  177619794.00 (  0.00%) 161549938.00 (  -9.95%) 161679701.00 (  -9.86%) 156657926.00 ( -13.38%)
> > pgscan_kswapd_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_direct_dma        27128.00 (  0.00%)     39215.00 (  30.82%)     36561.00 (  25.80%)     38806.00 (  30.09%)
> > pgscan_direct_dma32   23927492.00 (  0.00%)  40122173.00 (  40.36%)  39997463.00 (  40.18%)  45041626.00 (  46.88%)
> > pgscan_direct_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pageoutrun              756020.00 (  0.00%)    903192.00 (  16.29%)    899965.00 (  15.99%)    868055.00 (  12.91%)
> > allocstall                2722.00 (  0.00%)     70156.00 (  96.12%)     67554.00 (  95.97%)     87691.00 (  96.90%)
> >
> > So, the allocstall counts go up of course because it is incremented
> > every time direct reclaim is entered and nocongest is only going to
> > sleep when there is congestion or significant writeback. I don't see
> > this as being necessarily bad.
>
> Agreed. Also the dma zone is less allocated from, which I suppose is
> only the second zone in the zonelist, after dma32. So allocations
> succeed more often from the first-choice zone with your patches.
>

Which is good, sort of.

> > Direct scanning rates go up a bit as you'd expect - again because we
> > are sleeping less. It's interesting that the number of pages reclaimed is
> > reduced, implying that despite higher scanning rates, there is less
> > reclaim activity.
> >
> > It's debatable if this is good or not because higher scanning rates in
> > themselves are not bad but fewer pages reclaimed seems positive so let's
> > see what the rest of the reports look like.
> >
> > FTrace Reclaim Statistics: vmscan
> >                                         traceonly-v1r4  nocongest-v1r4  lowlumpy-v1r4  nodirect-v1r4
> > Direct reclaims                                   2722           70156          67554          87691
> > Direct reclaim pages scanned                  23955333        40161426       40034132       45080524
> > Direct reclaim write file async I/O                  0               0              0              0
> > Direct reclaim write anon async I/O                  0               0              0              0
> > Direct reclaim write file sync I/O                   0               0              0              0
> > Direct reclaim write anon sync I/O                   0               0              0              0
> > Wake kswapd requests                           2718040        17801688       17622777       18379572
> > Kswapd wakeups                                      24               1              1              1
> > Kswapd pages scanned                         177738381       161597313      161729224      156704078
> > Kswapd reclaim write file async I/O                  0               0              0              0
> > Kswapd reclaim write anon async I/O                  0               0              0              0
> > Kswapd reclaim write file sync I/O                   0               0              0              0
> > Kswapd reclaim write anon sync I/O                   0               0              0              0
> > Time stalled direct reclaim (seconds)           247.97           76.97          77.15          76.63
> > Time kswapd awake (seconds)                     489.17          400.20         403.19         390.08
> >
> > Total pages scanned                          201693714       201758739      201763356      201784602
> > %age total pages scanned/written                 0.00%           0.00%          0.00%          0.00%
> > %age file pages scanned/written                  0.00%           0.00%          0.00%          0.00%
> > Percentage Time Spent Direct Reclaim            41.41%          16.03%         15.96%         16.32%
> > Percentage Time kswapd Awake                    98.76%          98.94%         98.96%         98.87%
> >
> > Interesting: kswapd is now staying awake (it woke up only once) even though
> > the total time awake was reduced, and it looks like the far greater number
> > of wakeup requests was what kept it awake. Despite the higher scan
> > rates from direct reclaim, the time actually spent direct reclaiming is
> > significantly reduced.
> >
> > Scanning rates and the number of times we direct reclaim go up but, as we
> > finish work a lot faster, it would seem that we are doing less work overall.
>
> I do not reach the same conclusion here. More pages are scanned
> overall on the same workload, so we _are_ doing more work.
>

We did more scanning but we finished sooner so the processor would sleep
sooner, consume less power etc. How good or bad this is depends on what
we mean by work but I am of the opinion that finishing sooner is better
overall. That said...

> The result for this single-threaded workload improves because CPU-time
> is not the issue when the only runnable process needs memory.
>
> But we are in fact becoming less efficient at reclaim, so it would
> make sense to also test how this interacts with other processes that
> do need the CPU concurrently.
>

This is indeed a problem. Measuring multiple workloads and drawing
sensible conclusions is going to be a real pain. I'd at least try
running multiple instances of your workload first and seeing how it
interacts.

> > FTrace Reclaim Statistics: congestion_wait
> > Direct number congest waited                      3664               0              0              0
> > Direct time congest waited                    247636ms             0ms            0ms            0ms
> > Direct full congest waited                        3081               0              0              0
> > Direct number conditional waited                     0           47587          45659          58779
> > Direct time conditional waited                     0ms             0ms            0ms            0ms
> > Direct full conditional waited                    3081               0              0              0
> > KSwapd number congest waited                      1448             949            909            981
> > KSwapd time congest waited                    118552ms         31652ms        32780ms        38732ms
> > KSwapd full congest waited                        1056              90            115            147
> > KSwapd number conditional waited                     0               0              0              0
> > KSwapd time conditional waited                     0ms             0ms            0ms            0ms
> > KSwapd full conditional waited                    1056              90            115            147
> >
> > congest waited is congestion_wait() and conditional waited is
> > wait_iff_congested().
> > Look at what happens to the congest waited times
> > for direct reclaim - they disappeared and, despite the number of times
> > wait_iff_congested() was called, it never actually decided it needed to
> > sleep. kswapd is still congestion waiting but the time it spent is
> > reduced.
> >
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)             350.9          403.27         406.12         393.02
> > Total Elapsed Time (seconds)                    495.29          404.47         407.44         394.53
> >
> > This is plain old time. The same test completes 91 seconds faster.
> >
> > Ordinarily at this point I would be preparing to do a full series report
> > including the other benchmarks but I'm interested in seeing whether there is
> > a significantly different reading of the above report as to whether it
> > is a "good" or "bad" result.
>
> I think one interesting piece that is missing is whether the
> scanned/reclaimed ratio went up. Do you have the kswapd_steal counter
> value still available to calculate that ratio?
>

Knowing the scanning/reclaimed ratio would be nice. The tracepoints as
they stand are oriented around the scanning/write ratio and assume that
discarding clean pages is not that big a deal. I did capture vmstat
before and after the test, which is not as fine-grained as the
tracepoints, but it'll do:

Vanilla scan:      203050870
Vanilla steal:     102099263
Nocongest scan:    203117496
Nocongest steal:   102114301

Vanilla ratio:   0.5028
Nocongest ratio: 0.5027

So here is an interesting point. While direct scanning rates went up,
overall scanning rates were approximately the same. All that changed
really was who was doing the scanning, which is a timing issue. The
reclaim-to-scanning ratios were *very* close together for the vanilla
and nocongest kernels but nocongest completed far faster.

> A "good" result would be, IMO, if that ratio did not get worse, while
> at the same time having reclaim perform better due to reduced sleeps.
>

So, the ratio did not get worse. The scanning rates were the same
although who was scanning changed and I'm undecided as to whether that
is significant or not. Overall, the test completed faster which is good.

> Another aspect to look out for is increased overreclaim: the total
> number of allocations went up (I suppose the sum of reclaimed pages as
> well), which means reclaim became more eager and created more
> throwout-refault churn. Those refaults were from a sparse file here, but
> a slow backing dev would have more impact on wall clock time.
>

I can capture the reclaim rates handily enough - it just needs another
tracepoint to keep everything in step. Refaults would be trickier, so I
would tend to rely on wall time to tell me if the wrong reclaim
decisions were being made.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  2010-08-26 18:29 ` Johannes Weiner
  2010-08-26 20:31   ` Mel Gorman
@ 2010-08-29 16:03   ` Minchan Kim
  1 sibling, 0 replies; 38+ messages in thread
From: Minchan Kim @ 2010-08-29 16:03 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

Hi, Hannes.

On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
>
> I am not convinced that unnecessary is the right word. On a workload
> without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> the VM regressing both in time and in reclaiming the right pages when
> simply removing congestion_wait() from the direct reclaim paths (the
> one in __alloc_pages_slowpath and the other one in
> do_try_to_free_pages).

Not exactly the same as your experiment, but I had a similar experience.
I did an experiment on swapout. The system had lots of anon pages but
almost no file pages and had already started to swap out, meaning the
system had no free memory. In this case, I forked a new process which
mmaps some MB of pages and touches them, so the VM should swap out some
MB of pages for the process. I measured the time until touching the
pages completed.

Sometimes it's fast, sometimes it's slow; the time gap is almost a
factor of two. The interesting thing is that when it is fast, many of
the pages are reclaimed by kswapd.

Ah, I used swap on a ramdisk and reserved the swap pages by touching
them before starting the experiment, so I would say it's not a _flushd_
effect.

> So just being stupid and waiting for the timeout in direct reclaim
> while kswapd can make progress seemed to do a better job for that
> load.
>
> I can not exactly pinpoint the reason for that behaviour, it would be
> nice if somebody had an idea.

I thought the cause is that direct reclaim reclaims just 32 pages at a
time while kswapd can reclaim many pages in batches. But I didn't look
at it any more as I was busy. Does it make sense?

-- 
Kind regards,
Minchan Kim
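For context on Minchan's 32-page figure: in kernels of this era direct reclaim
targets a fixed small batch while kswapd keeps scanning until zone watermarks
are met, which is the asymmetry he is describing. From memory of the 2.6.36
sources -- verify against your own tree:

  /* include/linux/swap.h */
  #define SWAP_CLUSTER_MAX 32UL

  /* mm/vmscan.c: try_to_free_pages() sets up direct reclaim with
   *     .nr_to_reclaim = SWAP_CLUSTER_MAX,
   * so each direct reclaimer bails after ~32 reclaimed pages, while
   * balance_pgdat() loops until the zone watermarks are satisfied. */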
* [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 15:14 [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Mel Gorman
  2010-08-26 15:14 ` [PATCH 1/3] writeback: Account for time spent congestion_waited Mel Gorman
  2010-08-26 15:14 ` [PATCH 2/3] writeback: Record if the congestion was unnecessary Mel Gorman
@ 2010-08-26 15:14 ` Mel Gorman
  2010-08-26 17:38   ` Minchan Kim
  2010-08-27  5:13   ` Dave Chinner
  2010-08-26 17:20 ` [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Minchan Kim
  3 siblings, 2 replies; 38+ messages in thread
From: Mel Gorman @ 2010-08-26 15:14 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

If congestion_wait() is called with no BDIs congested, the caller will
sleep for the full timeout and this is an unnecessary sleep. This patch
checks if there are BDIs congested. If so, it goes to sleep as normal.
If not, it calls cond_resched() to ensure the caller is not hogging the
CPU longer than its quota but otherwise will not sleep.

This is aimed at reducing some of the major desktop stalls reported during
IO. For example, while kswapd is operating, it calls congestion_wait()
but it could just have been reclaiming clean page cache pages with no
congestion. Without this patch, it would sleep for a full timeout but after
this patch, it'll just call schedule() if it has been on the CPU too long.
Similar logic applies to direct reclaimers that are not making enough
progress.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/backing-dev.c |   20 ++++++++++++++------
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index a49167f..6abe860 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-	/* Check if this call to congestion_wait was necessary */
-	if (atomic_read(&nr_bdi_congested[sync]) == 0)
+	/*
+	 * If there is no congestion, there is no point sleeping on the queue.
+	 * This call was unnecessary but in case we are spinning due to a bad
+	 * caller, at least call cond_resched() and sleep if our CPU quota
+	 * has expired
+	 */
+	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
 		unnecessary = true;
-
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+		cond_resched();
+		ret = 0;
+	} else {
+		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		ret = io_schedule_timeout(timeout);
+		finish_wait(wqh, &wait);
+	}
 
 	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
 				unnecessary);
-- 
1.7.1
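For readers without the call sites to hand: callers pass the sync/async queue
and a timeout in jiffies, so a typical invocation from the reclaim paths being
discussed looks like the line below. Timeouts of HZ/10 and HZ/50 are both
common in kernels of this era; this is illustrative, not a quote of a specific
line:

  congestion_wait(BLK_RW_ASYNC, HZ/10);

With the patch above, such a call returns immediately (after a cond_resched())
whenever nr_bdi_congested[sync] is zero, instead of sleeping the full 100ms.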
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 15:14 ` [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs Mel Gorman
@ 2010-08-26 17:38   ` Minchan Kim
  2010-08-26 17:42     ` Mel Gorman
  1 sibling, 1 reply; 38+ messages in thread
From: Minchan Kim @ 2010-08-26 17:38 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDIs congested, the caller will
> sleep for the full timeout and this is an unnecessary sleep. This patch
> checks if there are BDIs congested. If so, it goes to sleep as normal.
> If not, it calls cond_resched() to ensure the caller is not hogging the
> CPU longer than its quota but otherwise will not sleep.
>
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/backing-dev.c |   20 ++++++++++++++------
>  1 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index a49167f..6abe860 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c

The function's description should be changed since we don't wait for
the next write any more.

> @@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
>  	DEFINE_WAIT(wait);
>  	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
> -	/* Check if this call to congestion_wait was necessary */
> -	if (atomic_read(&nr_bdi_congested[sync]) == 0)
> +	/*
> +	 * If there is no congestion, there is no point sleeping on the queue.
> +	 * This call was unnecessary but in case we are spinning due to a bad
> +	 * caller, at least call cond_resched() and sleep if our CPU quota
> +	 * has expired
> +	 */
> +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> 		unnecessary = true;
> -
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> -	ret = io_schedule_timeout(timeout);
> -	finish_wait(wqh, &wait);
> +		cond_resched();
> +		ret = 0;

"ret = timeout" would be more appropriate, considering
io_schedule_timeout()'s return value.

> +	} else {
> +		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +		ret = io_schedule_timeout(timeout);
> +		finish_wait(wqh, &wait);
> +	}
>
> 	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
> 				unnecessary);
> --
> 1.7.1

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 17:38 ` Minchan Kim
@ 2010-08-26 17:42   ` Mel Gorman
  2010-08-26 18:17     ` Johannes Weiner
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2010-08-26 17:42 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called with no BDIs congested, the caller will
> > sleep for the full timeout and this is an unnecessary sleep. This patch
> > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > If not, it calls cond_resched() to ensure the caller is not hogging the
> > CPU longer than its quota but otherwise will not sleep.
> >
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/backing-dev.c |   20 ++++++++++++++------
> >  1 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index a49167f..6abe860 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
>
> The function's description should be changed since we don't wait for
> the next write any more.
>

My bad. I need to check that "next write" thing. It doesn't appear to be
happening but maybe that side of things just broke somewhere in the
distant past. I lack context of how this is meant to work so maybe
someone will educate me.

> > @@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
> >  	DEFINE_WAIT(wait);
> >  	wait_queue_head_t *wqh = &congestion_wqh[sync];
> >  
> > -	/* Check if this call to congestion_wait was necessary */
> > -	if (atomic_read(&nr_bdi_congested[sync]) == 0)
> > +	/*
> > +	 * If there is no congestion, there is no point sleeping on the queue.
> > +	 * This call was unnecessary but in case we are spinning due to a bad
> > +	 * caller, at least call cond_resched() and sleep if our CPU quota
> > +	 * has expired
> > +	 */
> > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > 		unnecessary = true;
> > -
> > -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > -	ret = io_schedule_timeout(timeout);
> > -	finish_wait(wqh, &wait);
> > +		cond_resched();
> > +		ret = 0;
>
> "ret = timeout" would be more appropriate, considering
> io_schedule_timeout()'s return value.
>

Good point, will fix.

> > +	} else {
> > +		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > +		ret = io_schedule_timeout(timeout);
> > +		finish_wait(wqh, &wait);
> > +	}
> >
> > 	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
> > 				unnecessary);

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
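Folding both of Minchan's comments in, the uncongested branch would presumably
end up something like this in a respin -- a sketch of a likely v2, not a posted
patch. The ret = timeout convention matches io_schedule_timeout() returning
the time remaining when a sleeper is woken immediately:

  if (atomic_read(&nr_bdi_congested[sync]) == 0) {
  	unnecessary = true;
  	cond_resched();
  	/* We did not sleep, so the whole timeout "remains" */
  	ret = timeout;
  } else {
  	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
  	ret = io_schedule_timeout(timeout);
  	finish_wait(wqh, &wait);
  }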
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 17:42 ` Mel Gorman
@ 2010-08-26 18:17   ` Johannes Weiner
  2010-08-26 20:23     ` Mel Gorman
  2010-08-27  1:42     ` Wu Fengguang
  0 siblings, 2 replies; 38+ messages in thread
From: Johannes Weiner @ 2010-08-26 18:17 UTC (permalink / raw)
To: Mel Gorman
Cc: Minchan Kim, linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called with no BDIs congested, the caller will
> > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > CPU longer than its quota but otherwise will not sleep.
> > >
> > > This is aimed at reducing some of the major desktop stalls reported during
> > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > but it could just have been reclaiming clean page cache pages with no
> > > congestion. Without this patch, it would sleep for a full timeout but after
> > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > Similar logic applies to direct reclaimers that are not making enough
> > > progress.
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  mm/backing-dev.c |   20 ++++++++++++++------
> > >  1 files changed, 14 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > index a49167f..6abe860 100644
> > > --- a/mm/backing-dev.c
> > > +++ b/mm/backing-dev.c
> >
> > The function's description should be changed since we don't wait for
> > the next write any more.
>
> My bad. I need to check that "next write" thing. It doesn't appear to be
> happening but maybe that side of things just broke somewhere in the
> distant past. I lack context of how this is meant to work so maybe
> someone will educate me.

On every retired IO request the congestion state on the bdi is checked
and the congestion waitqueue woken up.

So without congestion, we still only wait until the next write
retires, but without any IO, we sleep the full timeout.

Check __freed_request() in block/blk-core.c.
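The path Johannes describes looks roughly like this in 2.6.36-era sources --
paraphrased from memory, so consult block/blk-core.c and mm/backing-dev.c for
the authoritative code:

  /* block/blk-core.c: a retired request can take the queue below the
   * congestion-off threshold, clearing the bdi's congested state */
  static void __freed_request(struct request_queue *q, int sync)
  {
  	struct request_list *rl = &q->rq;

  	if (rl->count[sync] < queue_congestion_off_threshold(q))
  		blk_clear_queue_congested(q, sync);
  	/* ... */
  }

  /* mm/backing-dev.c: clearing congestion wakes congestion_wait() sleepers */
  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
  {
  	wait_queue_head_t *wqh = &congestion_wqh[sync];

  	clear_bit(sync ? BDI_sync_congested : BDI_async_congested,
  		  &bdi->state);
  	smp_mb__after_clear_bit();
  	if (waitqueue_active(wqh))
  		wake_up(wqh);
  }

So with IO in flight, a congestion_wait() sleeper is woken as soon as the
queue drains below the threshold; with no IO at all, nothing ever triggers the
wakeup and the full timeout elapses, which is exactly the case Mel's series
targets.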
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 18:17 ` Johannes Weiner
@ 2010-08-26 20:23   ` Mel Gorman
  2010-08-27  1:11     ` Wu Fengguang
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2010-08-26 20:23 UTC (permalink / raw)
To: Johannes Weiner
Cc: Minchan Kim, linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > CPU longer than its quota but otherwise will not sleep.
> > > >
> > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > but it could just have been reclaiming clean page cache pages with no
> > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > Similar logic applies to direct reclaimers that are not making enough
> > > > progress.
> > > >
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > >  mm/backing-dev.c |   20 ++++++++++++++------
> > > >  1 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > index a49167f..6abe860 100644
> > > > --- a/mm/backing-dev.c
> > > > +++ b/mm/backing-dev.c
> > >
> > > The function's description should be changed since we don't wait for
> > > the next write any more.
> >
> > My bad. I need to check that "next write" thing. It doesn't appear to be
> > happening but maybe that side of things just broke somewhere in the
> > distant past. I lack context of how this is meant to work so maybe
> > someone will educate me.
>
> On every retired IO request the congestion state on the bdi is checked
> and the congestion waitqueue woken up.
>
> So without congestion, we still only wait until the next write
> retires, but without any IO, we sleep the full timeout.
>
> Check __freed_request() in block/blk-core.c.
>

Seems reasonable. Still, if there is no write IO going on and no
congestion there seems to be no point going to sleep for the full
timeout. It still feels wrong.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 20:23 ` Mel Gorman
@ 2010-08-27  1:11   ` Wu Fengguang
  2010-08-27  9:34     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Wu Fengguang @ 2010-08-27 1:11 UTC (permalink / raw)
To: Mel Gorman
Cc: Johannes Weiner, Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel@vger.kernel.org, Li Shaohua

On Fri, Aug 27, 2010 at 04:23:24AM +0800, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > > CPU longer than its quota but otherwise will not sleep.
> > > > >
> > > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > > but it could just have been reclaiming clean page cache pages with no
> > > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > > Similar logic applies to direct reclaimers that are not making enough
> > > > > progress.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > ---
> > > > >  mm/backing-dev.c |   20 ++++++++++++++------
> > > > >  1 files changed, 14 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > > index a49167f..6abe860 100644
> > > > > --- a/mm/backing-dev.c
> > > > > +++ b/mm/backing-dev.c
> > > >
> > > > The function's description should be changed since we don't wait for
> > > > the next write any more.
> > >
> > > My bad. I need to check that "next write" thing. It doesn't appear to be
> > > happening but maybe that side of things just broke somewhere in the
> > > distant past. I lack context of how this is meant to work so maybe
> > > someone will educate me.
> >
> > On every retired IO request the congestion state on the bdi is checked
> > and the congestion waitqueue woken up.
> >
> > So without congestion, we still only wait until the next write
> > retires, but without any IO, we sleep the full timeout.
> >
> > Check __freed_request() in block/blk-core.c.
> >
>
> Seems reasonable. Still, if there is no write IO going on and no
> congestion there seems to be no point going to sleep for the full
> timeout. It still feels wrong.

Yeah the stupid sleeping feels wrong. However there are ~20
congestion_wait() callers spread randomly in VM, FS and block drivers.
Many of them may have been added by rule of thumb, however what if some
of them happen to depend on the old stupid sleeping behavior? Obviously
you've done extensive tests on the page reclaim paths, however that's
far from enough to cover the wider changes made by this patch.

We may have to do the conversions case by case. Options include
converting to congestion_wait_check() (see
http://lkml.org/lkml/2010/8/18/292) or other waiting schemes.
Thanks, Fengguang
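[Editor's note: the congestion_wait_check() proposal referenced above is not reproduced in this thread. As a minimal sketch of the idea -- assuming it differs from congestion_wait() only in bailing out when no BDI is congested -- it might look like the following. This is a hypothetical reconstruction, not the posted patch:]

/*
 * Hypothetical sketch: wait as congestion_wait() does while some BDI
 * is congested, but when nothing is congested just yield the CPU and
 * return immediately instead of sleeping for the full timeout.
 */
long congestion_wait_check(int sync, long timeout)
{
	long ret;
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[sync];

	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
		cond_resched();
		return timeout;		/* did not sleep at all */
	}

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	ret = io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);
	return ret;
}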
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-27 1:11 ` Wu Fengguang @ 2010-08-27 9:34 ` Mel Gorman 0 siblings, 0 replies; 38+ messages in thread From: Mel Gorman @ 2010-08-27 9:34 UTC (permalink / raw) To: Wu Fengguang Cc: Johannes Weiner, Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel@vger.kernel.org, Li Shaohua On Fri, Aug 27, 2010 at 09:11:06AM +0800, Wu Fengguang wrote: > On Fri, Aug 27, 2010 at 04:23:24AM +0800, Mel Gorman wrote: > > On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote: > > > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote: > > > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote: > > > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > > > > > > If congestion_wait() is called with no BDIs congested, the caller will > > > > > > sleep for the full timeout and this is an unnecessary sleep. This patch > > > > > > checks if there are BDIs congested. If so, it goes to sleep as normal. > > > > > > If not, it calls cond_resched() to ensure the caller is not hogging the > > > > > > CPU longer than its quota but otherwise will not sleep. > > > > > > > > > > > > This is aimed at reducing some of the major desktop stalls reported during > > > > > > IO. For example, while kswapd is operating, it calls congestion_wait() > > > > > > but it could just have been reclaiming clean page cache pages with no > > > > > > congestion. Without this patch, it would sleep for a full timeout but after > > > > > > this patch, it'll just call schedule() if it has been on the CPU too long. > > > > > > Similar logic applies to direct reclaimers that are not making enough > > > > > > progress. > > > > > > > > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > > > --- > > > > > > mm/backing-dev.c | 20 ++++++++++++++------ > > > > > > 1 files changed, 14 insertions(+), 6 deletions(-) > > > > > > > > > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > > > > > index a49167f..6abe860 100644 > > > > > > --- a/mm/backing-dev.c > > > > > > +++ b/mm/backing-dev.c > > > > > > > > > > Function's decripton should be changed since we don't wait next write any more. > > > > > > > > > > > > > My bad. I need to check that "next write" thing. It doesn't appear to be > > > > happening but maybe that side of things just broke somewhere in the > > > > distant past. I lack context of how this is meant to work so maybe > > > > someone will educate me. > > > > > > On every retired io request the congestion state on the bdi is checked > > > and the congestion waitqueue woken up. > > > > > > So without congestion, we still only wait until the next write > > > retires, but without any IO, we sleep the full timeout. > > > > > > Check __freed_requests() in block/blk-core.c. > > > > > > > Seems reasonable. Still, if there is no write IO going on and no > > congestion there seems to be no point going to sleep for the full > > timeout. It still feels wrong. > > Yeah the stupid sleeping feels wrong. However there are ~20 > congestion_wait() callers spread randomly in VM, FS and block drivers. > Many of them may be added by rule of thumb, however what if some of > them happen to depend on the old stupid sleeping behavior? Obviously > you've done extensive tests on the page reclaim paths, however that's > far from enough to cover the wider changes made by this patch. > > We may have to do the conversions case by case. 
> Converting to congestion_wait_check() (see http://lkml.org/lkml/2010/8/18/292) or
> other waiting schemes.

I am taking this direction now.

-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-26 18:17 ` Johannes Weiner 2010-08-26 20:23 ` Mel Gorman @ 2010-08-27 1:42 ` Wu Fengguang 2010-08-27 9:37 ` Mel Gorman 1 sibling, 1 reply; 38+ messages in thread From: Wu Fengguang @ 2010-08-27 1:42 UTC (permalink / raw) To: Johannes Weiner Cc: Mel Gorman, Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel@vger.kernel.org, Li Shaohua, Rik van Riel On Fri, Aug 27, 2010 at 02:17:35AM +0800, Johannes Weiner wrote: > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote: > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote: > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > > > > If congestion_wait() is called with no BDIs congested, the caller will > > > > sleep for the full timeout and this is an unnecessary sleep. This patch > > > > checks if there are BDIs congested. If so, it goes to sleep as normal. > > > > If not, it calls cond_resched() to ensure the caller is not hogging the > > > > CPU longer than its quota but otherwise will not sleep. > > > > > > > > This is aimed at reducing some of the major desktop stalls reported during > > > > IO. For example, while kswapd is operating, it calls congestion_wait() > > > > but it could just have been reclaiming clean page cache pages with no > > > > congestion. Without this patch, it would sleep for a full timeout but after > > > > this patch, it'll just call schedule() if it has been on the CPU too long. > > > > Similar logic applies to direct reclaimers that are not making enough > > > > progress. > > > > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > --- > > > > mm/backing-dev.c | 20 ++++++++++++++------ > > > > 1 files changed, 14 insertions(+), 6 deletions(-) > > > > > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > > > index a49167f..6abe860 100644 > > > > --- a/mm/backing-dev.c > > > > +++ b/mm/backing-dev.c > > > > > > Function's decripton should be changed since we don't wait next write any more. > > > > > > > My bad. I need to check that "next write" thing. It doesn't appear to be > > happening but maybe that side of things just broke somewhere in the > > distant past. I lack context of how this is meant to work so maybe > > someone will educate me. > > On every retired io request the congestion state on the bdi is checked > and the congestion waitqueue woken up. > > So without congestion, we still only wait until the next write > retires, but without any IO, we sleep the full timeout. > > Check __freed_requests() in block/blk-core.c. congestion_wait() is tightly related with pageout() and writeback, however it may have some intention for the no-IO case as well. - if write congested, maybe we are doing too much pageout(), so wait. it might also reduce some get_request_wait() stalls (the normal way is to explicitly check for congestion before doing write out). - if any write completes, it may free some PG_reclaim pages, so proceed. (when not congested) - if no IO at all, the 100ms sleep might still prevent a page reclaimer from stealing lots of slices from a busy computing program that involves no page allocation at all. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
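[Editor's note: the "explicit check" mentioned in Wu's first point is the usual pattern of testing the backing device before queueing more writeout, along these lines. This is an illustrative fragment only; actual call sites vary:]

/*
 * Illustrative fragment: a writer tests the backing device and backs
 * off when it is already congested, rather than piling more requests
 * onto a congested queue.
 */
struct backing_dev_info *bdi = mapping->backing_dev_info;

if (bdi_write_congested(bdi))
	return 0;	/* defer this writeout; retry later */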
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-27 1:42 ` Wu Fengguang @ 2010-08-27 9:37 ` Mel Gorman 0 siblings, 0 replies; 38+ messages in thread From: Mel Gorman @ 2010-08-27 9:37 UTC (permalink / raw) To: Wu Fengguang Cc: Johannes Weiner, Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel@vger.kernel.org, Li Shaohua, Rik van Riel On Fri, Aug 27, 2010 at 09:42:54AM +0800, Wu Fengguang wrote: > On Fri, Aug 27, 2010 at 02:17:35AM +0800, Johannes Weiner wrote: > > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote: > > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote: > > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > > > > > If congestion_wait() is called with no BDIs congested, the caller will > > > > > sleep for the full timeout and this is an unnecessary sleep. This patch > > > > > checks if there are BDIs congested. If so, it goes to sleep as normal. > > > > > If not, it calls cond_resched() to ensure the caller is not hogging the > > > > > CPU longer than its quota but otherwise will not sleep. > > > > > > > > > > This is aimed at reducing some of the major desktop stalls reported during > > > > > IO. For example, while kswapd is operating, it calls congestion_wait() > > > > > but it could just have been reclaiming clean page cache pages with no > > > > > congestion. Without this patch, it would sleep for a full timeout but after > > > > > this patch, it'll just call schedule() if it has been on the CPU too long. > > > > > Similar logic applies to direct reclaimers that are not making enough > > > > > progress. > > > > > > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > > --- > > > > > mm/backing-dev.c | 20 ++++++++++++++------ > > > > > 1 files changed, 14 insertions(+), 6 deletions(-) > > > > > > > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > > > > index a49167f..6abe860 100644 > > > > > --- a/mm/backing-dev.c > > > > > +++ b/mm/backing-dev.c > > > > > > > > Function's decripton should be changed since we don't wait next write any more. > > > > > > > > > > My bad. I need to check that "next write" thing. It doesn't appear to be > > > happening but maybe that side of things just broke somewhere in the > > > distant past. I lack context of how this is meant to work so maybe > > > someone will educate me. > > > > On every retired io request the congestion state on the bdi is checked > > and the congestion waitqueue woken up. > > > > So without congestion, we still only wait until the next write > > retires, but without any IO, we sleep the full timeout. > > > > Check __freed_requests() in block/blk-core.c. > > congestion_wait() is tightly related with pageout() and writeback, > however it may have some intention for the no-IO case as well. > > - if write congested, maybe we are doing too much pageout(), so wait. > it might also reduce some get_request_wait() stalls (the normal way > is to explicitly check for congestion before doing write out). > > - if any write completes, it may free some PG_reclaim pages, so proceed. > (when not congested) > For these cases, would it make sense for wait_iff_congested() to compare nr_writeback to nr_inactive and decide to wait on congestion if more than half the inactive list is in writeback? > - if no IO at all, the 100ms sleep might still prevent a page reclaimer > from stealing lots of slices from a busy computing program that > involves no page allocation at all. 
I don't think this is a very strong argument because cond_resched() is being called.

-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab
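[Editor's note: to make the heuristic Mel floats above concrete -- comparing writeback pages against the inactive list before deciding to sleep -- a wait_iff_congested() could consult the zone counters along these lines. Hypothetical sketch; no such check exists in the tree at this point and the helper name is invented:]

/*
 * Hypothetical helper: treat the zone as worth sleeping on only if
 * more than half of its inactive pages are currently under writeback,
 * i.e. write completions are plausibly the event being waited for.
 */
static bool zone_mostly_writeback(struct zone *zone, int file)
{
	unsigned long inactive = zone_page_state(zone,
			file ? NR_INACTIVE_FILE : NR_INACTIVE_ANON);
	unsigned long writeback = zone_page_state(zone, NR_WRITEBACK);

	return writeback * 2 > inactive;
}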
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-26 15:14 ` [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs Mel Gorman 2010-08-26 17:38 ` Minchan Kim @ 2010-08-27 5:13 ` Dave Chinner 2010-08-27 9:33 ` Mel Gorman 1 sibling, 1 reply; 38+ messages in thread From: Dave Chinner @ 2010-08-27 5:13 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > If congestion_wait() is called with no BDIs congested, the caller will > sleep for the full timeout and this is an unnecessary sleep. That, I think, is an invalid assumption. congestion_wait is used in some places as a backoff mechanism that waits for some IO work to be done, with congestion disappearing being a indication that progress has been made and so we can retry sooner than the entire timeout. For example, if _xfs_buf_lookup_pages() fails to allocate page cache pages for a buffer, it will kick the xfsbufd to writeback dirty buffers (so they can be freed) and immediately enter congestion_wait(). If there isn't congestion when we enter congestion_wait(), we still want to give the xfsbufds a chance to clean some pages before we retry the allocation for the new buffer. Removing the congestion_wait() sleep behaviour will effectively _increase_ memory pressure with XFS on fast disk subsystems because it now won't backoff between failed allocation attempts... Perhaps a congestion_wait_iff_congested() variant is needed for the VM? I can certainly see how it benefits the VM from a latency perspective, but it is the opposite behaviour that is expected in other places... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
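[Editor's note: the XFS retry loop Dave describes has roughly this shape -- an abridged sketch of that era's fs/xfs/linux-2.6/xfs_buf.c, with error handling and statistics elided:]

/*
 * Abridged sketch of _xfs_buf_lookup_pages(): on allocation failure,
 * kick xfsbufd to write back dirty buffers, then use congestion_wait()
 * purely as a backoff before retrying -- congested or not.
 */
retry:
	page = alloc_page(gfp_mask);
	if (unlikely(page == NULL)) {
		xfsbufd_wakeup(NULL, 0, gfp_mask);
		congestion_wait(BLK_RW_ASYNC, HZ/50);
		goto retry;
	}

[If congestion_wait() returns immediately here, the loop degenerates into hammering alloc_page() with no time for xfsbufd to make progress, which is the regression Dave is warning about.]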
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-27 5:13 ` Dave Chinner @ 2010-08-27 9:33 ` Mel Gorman 0 siblings, 0 replies; 38+ messages in thread From: Mel Gorman @ 2010-08-27 9:33 UTC (permalink / raw) To: Dave Chinner Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel On Fri, Aug 27, 2010 at 03:13:16PM +1000, Dave Chinner wrote: > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > > If congestion_wait() is called with no BDIs congested, the caller will > > sleep for the full timeout and this is an unnecessary sleep. > > That, I think, is an invalid assumption. congestion_wait is used in > some places as a backoff mechanism that waits for some IO work to be > done, with congestion disappearing being a indication that progress > has been made and so we can retry sooner than the entire timeout. > As it's write IO rather than some IO, I wonder if that's really the right thing to do. However, I accept your (and others) point that converting all congestion_wait() callers may be too much of a change. > For example, if _xfs_buf_lookup_pages() fails to allocate page cache > pages for a buffer, it will kick the xfsbufd to writeback dirty > buffers (so they can be freed) and immediately enter > congestion_wait(). If there isn't congestion when we enter > congestion_wait(), we still want to give the xfsbufds a chance to > clean some pages before we retry the allocation for the new buffer. > Removing the congestion_wait() sleep behaviour will effectively > _increase_ memory pressure with XFS on fast disk subsystems because > it now won't backoff between failed allocation attempts... > > Perhaps a congestion_wait_iff_congested() variant is needed for the > VM? I can certainly see how it benefits the VM from a latency > perspective, but it is the opposite behaviour that is expected in > other places... > I'm added a wait_iff_congested() and updated a few of the VM callers. I changed a fairly minimum number of what appeared to be the obvious ones to change. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
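[Editor's note: in caller terms, the conversion Mel describes amounts to hunks of this shape in the reclaim paths. This is a hypothetical hunk -- the signature of wait_iff_congested() is assumed to mirror congestion_wait() for the sketch, and the posted series may differ:]

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ ... @@
-	congestion_wait(BLK_RW_ASYNC, HZ/10);
+	wait_iff_congested(BLK_RW_ASYNC, HZ/10);

[Callers that want a pure rate limit, like the XFS backoff sketched above, keep plain congestion_wait() and its unconditional sleep.]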
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-26 15:14 [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Mel Gorman ` (2 preceding siblings ...) 2010-08-26 15:14 ` [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs Mel Gorman @ 2010-08-26 17:20 ` Minchan Kim 2010-08-26 17:31 ` Mel Gorman 2010-08-27 1:21 ` Wu Fengguang 3 siblings, 2 replies; 38+ messages in thread From: Minchan Kim @ 2010-08-26 17:20 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote: > congestion_wait() is a bit stupid in that it goes to sleep even when there > is no congestion. This causes stalls in a number of situations and may be > partially responsible for bug reports about desktop interactivity. > > This patch series aims to account for these unnecessary congestion_waits() > and to avoid going to sleep when there is no congestion available. Patches > 1 and 2 add instrumentation related to congestion which should be reuable > by alternative solutions to congestion_wait. Patch 3 calls cond_resched() > instead of going to sleep if there is no congestion. > > Once again, I shoved this through performance test. Unlike previous tests, > I ran this on a ported version of my usual test-suite that should be suitable > for release soon. It's not quite as good as my old set but it's sufficient > for this and related series. The tests I ran were kernbench vmr-stream > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is > the usual stress the number of high order allocations that can be made while > the system is under severe stress. The suite contains the necessary analysis > scripts as well and I'd release it now except the documentation blows. > > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine) > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine) > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation) > > The disks on all of them were single disks and not particularly fast. > > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for > instrumentation and a second test with patch 3 applied. > > In all cases, kernbench, hackbench, STREAM and iozone did not show any > performance difference because none of them were pressuring the system > enough to be calling congestion_wait() so I won't post the results. > About all worth noting for them is that nothing horrible appeared to break. > > In the analysis scripts, I record unnecessary sleeps to be a sleep that > had no congestion. The post-processing scripts for cond_resched() will only > count an uncongested call to congestion_wait() as unnecessary if the process > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted. > > One vague concern I have is when too many pages are isolated, we call > congestion_wait(). This could now actively spin in the loop for its quanta > before calling cond_resched(). If it's calling with no congestion, it's > hard to know what the proper thing to do there is. Suddenly, many processes could enter into the direct reclaim path by another reason(ex, fork bomb) regradless of congestion. backing dev congestion is just one of them. 
I think that if congestion_wait returns without calling io_schedule_timeout because of your patch, too_many_isolated can call schedule_timeout to wait for the system to calm down, preventing OOM killing.

How about this? If you don't mind, I will send the patch based on this patch series after your patches settle down. Or could you add it to your patch series? But I admit this hardly affects your experiment.
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-26 17:20 ` [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Minchan Kim @ 2010-08-26 17:31 ` Mel Gorman 2010-08-26 17:50 ` Minchan Kim 2010-08-27 1:21 ` Wu Fengguang 1 sibling, 1 reply; 38+ messages in thread From: Mel Gorman @ 2010-08-26 17:31 UTC (permalink / raw) To: Minchan Kim Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, Aug 27, 2010 at 02:20:38AM +0900, Minchan Kim wrote: > On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote: > > congestion_wait() is a bit stupid in that it goes to sleep even when there > > is no congestion. This causes stalls in a number of situations and may be > > partially responsible for bug reports about desktop interactivity. > > > > This patch series aims to account for these unnecessary congestion_waits() > > and to avoid going to sleep when there is no congestion available. Patches > > 1 and 2 add instrumentation related to congestion which should be reuable > > by alternative solutions to congestion_wait. Patch 3 calls cond_resched() > > instead of going to sleep if there is no congestion. > > > > Once again, I shoved this through performance test. Unlike previous tests, > > I ran this on a ported version of my usual test-suite that should be suitable > > for release soon. It's not quite as good as my old set but it's sufficient > > for this and related series. The tests I ran were kernbench vmr-stream > > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench > > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is > > the usual stress the number of high order allocations that can be made while > > the system is under severe stress. The suite contains the necessary analysis > > scripts as well and I'd release it now except the documentation blows. > > > > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine) > > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine) > > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation) > > > > The disks on all of them were single disks and not particularly fast. > > > > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for > > instrumentation and a second test with patch 3 applied. > > > > In all cases, kernbench, hackbench, STREAM and iozone did not show any > > performance difference because none of them were pressuring the system > > enough to be calling congestion_wait() so I won't post the results. > > About all worth noting for them is that nothing horrible appeared to break. > > > > In the analysis scripts, I record unnecessary sleeps to be a sleep that > > had no congestion. The post-processing scripts for cond_resched() will only > > count an uncongested call to congestion_wait() as unnecessary if the process > > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted. > > > > One vague concern I have is when too many pages are isolated, we call > > congestion_wait(). This could now actively spin in the loop for its quanta > > before calling cond_resched(). If it's calling with no congestion, it's > > hard to know what the proper thing to do there is. > > Suddenly, many processes could enter into the direct reclaim path by another > reason(ex, fork bomb) regradless of congestion. backing dev congestion is > just one of them. 
> This situation applys with or without this series, right? > I think if congestion_wait returns without calling io_schedule_timeout > by your patch, too_many_isolated can schedule_timeout to wait for the system's > calm to preventing OOM killing. > More likely, to stop a loop in too_many_isolated() consuming CPU time it can do nothing with. > How about this? > > If you don't mind, I will send the patch based on this patch series > after your patch settle down or Could you add this to your patch series? > But I admit this doesn't almost affect your experiment. > I think it's a related topic so could belong with the series. > From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001 > From: Minchan Kim <minchan.kim@gmail.com> > Date: Fri, 27 Aug 2010 02:06:45 +0900 > Subject: [PATCH] Wait regardless of congestion if too many pages are isolated > > Suddenly, many processes could enter into the direct reclaim path > regradless of congestion. backing dev congestion is just one of them. > But current implementation calls congestion_wait if too many pages are isolated. > > if congestion_wait returns without calling io_schedule_timeout, > too_many_isolated can schedule_timeout to wait for the system's calm > to preventing OOM killing. > I think the reasoning here might be a little off. How about; If many processes enter direct reclaim or memory compaction, too many pages can get isolated. In this situation, too_many_isolated() can call congestion_wait() but if there is no congestion, it fails to go to sleep and instead spins until it's quota expires. This patch checks if congestion_wait() returned without sleeping. If it did because there was no congestion, it unconditionally goes to sleep instead of hogging the CPU. > Signed-off-by: Minchan Kim <minchan.kim@gmail.com> > --- > mm/backing-dev.c | 5 ++--- > mm/compaction.c | 6 +++++- > mm/vmscan.c | 6 +++++- > 3 files changed, 12 insertions(+), 5 deletions(-) > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index 6abe860..9431bca 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested); > * @timeout: timeout in jiffies > * > * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit > - * write congestion. If no backing_devs are congested then just wait for the > - * next write to be completed. > + * write congestion. If no backing_devs are congested then just returns. > */ > long congestion_wait(int sync, long timeout) > { > @@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout) > if (atomic_read(&nr_bdi_congested[sync]) == 0) { > unnecessary = true; > cond_resched(); > - ret = 0; > + ret = timeout; > } else { > prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > ret = io_schedule_timeout(timeout); > diff --git a/mm/compaction.c b/mm/compaction.c > index 94cce51..7370683 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone, > * delay for some time until fewer pages are isolated > */ > while (unlikely(too_many_isolated(zone))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + long timeout = HZ/10; > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > + set_current_state(TASK_INTERRUPTIBLE); > + schedule_timeout(timeout); > + } > We don't really need the timeout variable here but I see what you are at. It's unfortunate to just go to sleep for HZ/10 but if it's not congestion, we do not have any other event to wake up on at the moment. 
We'd have to introduce a too_many_isolated waitqueue that is kicked if pages are put back on the LRU. This is better than spinning though. > if (fatal_signal_pending(current)) > return 0; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 3109ff7..f5e3e28 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > unsigned long nr_dirty; > while (unlikely(too_many_isolated(zone, file, sc))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + long timeout = HZ/10; > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > + set_current_state(TASK_INTERRUPTIBLE); > + schedule_timeout(timeout); > + } > > /* We are about to die and free our memory. Return now. */ > if (fatal_signal_pending(current)) This seems very reasonable. I'll review it more carefully tomorrow and if I spot nothing horrible, I'll add it onto the series. I'm not sure I'm hitting the too_many_isolated() case but I cannot think of a better alternative without adding more waitqueues. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
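[Editor's note: the waitqueue idea Mel mentions above, sketched for illustration. This is entirely hypothetical -- every name here is invented and nothing like it exists in the tree:]

/*
 * Hypothetical sketch: reclaimers blocked by too_many_isolated()
 * sleep on a queue and are woken when isolated pages return to the
 * LRU, rather than polling on a fixed timeout.
 */
static DECLARE_WAIT_QUEUE_HEAD(isolated_wqh);

/* would be called from putback_lru_page() and friends */
static void isolated_pages_returned(void)
{
	if (waitqueue_active(&isolated_wqh))
		wake_up(&isolated_wqh);
}

/* caller side, replacing the congestion_wait() poll */
wait_event_timeout(isolated_wqh,
		   !too_many_isolated(zone, file, sc), HZ/10);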
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-26 17:31 ` Mel Gorman @ 2010-08-26 17:50 ` Minchan Kim 0 siblings, 0 replies; 38+ messages in thread From: Minchan Kim @ 2010-08-26 17:50 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, Aug 26, 2010 at 06:31:47PM +0100, Mel Gorman wrote: > On Fri, Aug 27, 2010 at 02:20:38AM +0900, Minchan Kim wrote: > > On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote: > > > congestion_wait() is a bit stupid in that it goes to sleep even when there > > > is no congestion. This causes stalls in a number of situations and may be > > > partially responsible for bug reports about desktop interactivity. > > > > > > This patch series aims to account for these unnecessary congestion_waits() > > > and to avoid going to sleep when there is no congestion available. Patches > > > 1 and 2 add instrumentation related to congestion which should be reuable > > > by alternative solutions to congestion_wait. Patch 3 calls cond_resched() > > > instead of going to sleep if there is no congestion. > > > > > > Once again, I shoved this through performance test. Unlike previous tests, > > > I ran this on a ported version of my usual test-suite that should be suitable > > > for release soon. It's not quite as good as my old set but it's sufficient > > > for this and related series. The tests I ran were kernbench vmr-stream > > > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench > > > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is > > > the usual stress the number of high order allocations that can be made while > > > the system is under severe stress. The suite contains the necessary analysis > > > scripts as well and I'd release it now except the documentation blows. > > > > > > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine) > > > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine) > > > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation) > > > > > > The disks on all of them were single disks and not particularly fast. > > > > > > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for > > > instrumentation and a second test with patch 3 applied. > > > > > > In all cases, kernbench, hackbench, STREAM and iozone did not show any > > > performance difference because none of them were pressuring the system > > > enough to be calling congestion_wait() so I won't post the results. > > > About all worth noting for them is that nothing horrible appeared to break. > > > > > > In the analysis scripts, I record unnecessary sleeps to be a sleep that > > > had no congestion. The post-processing scripts for cond_resched() will only > > > count an uncongested call to congestion_wait() as unnecessary if the process > > > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted. > > > > > > One vague concern I have is when too many pages are isolated, we call > > > congestion_wait(). This could now actively spin in the loop for its quanta > > > before calling cond_resched(). If it's calling with no congestion, it's > > > hard to know what the proper thing to do there is. > > > > Suddenly, many processes could enter into the direct reclaim path by another > > reason(ex, fork bomb) regradless of congestion. 
backing dev congestion is > > just one of them. > > > > This situation applys with or without this series, right? I think the situation applys with this series. That's because old behavior was calling schedule regardless of I/O congested as seeing io_schedule_timeout. But you are changing it now as calling it conditionally. > > > I think if congestion_wait returns without calling io_schedule_timeout > > by your patch, too_many_isolated can schedule_timeout to wait for the system's > > calm to preventing OOM killing. > > > > More likely, to stop a loop in too_many_isolated() consuming CPU time it > can do nothing with. > > > How about this? > > > > If you don't mind, I will send the patch based on this patch series > > after your patch settle down or Could you add this to your patch series? > > But I admit this doesn't almost affect your experiment. > > > > I think it's a related topic so could belong with the series. > > > From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001 > > From: Minchan Kim <minchan.kim@gmail.com> > > Date: Fri, 27 Aug 2010 02:06:45 +0900 > > Subject: [PATCH] Wait regardless of congestion if too many pages are isolated > > > > Suddenly, many processes could enter into the direct reclaim path > > regradless of congestion. backing dev congestion is just one of them. > > But current implementation calls congestion_wait if too many pages are isolated. > > > > if congestion_wait returns without calling io_schedule_timeout, > > too_many_isolated can schedule_timeout to wait for the system's calm > > to preventing OOM killing. > > > > I think the reasoning here might be a little off. How about; > > If many processes enter direct reclaim or memory compaction, too many pages > can get isolated. In this situation, too_many_isolated() can call > congestion_wait() but if there is no congestion, it fails to go to sleep > and instead spins until it's quota expires. > > This patch checks if congestion_wait() returned without sleeping. If it > did because there was no congestion, it unconditionally goes to sleep > instead of hogging the CPU. That's good to me. :) > > > Signed-off-by: Minchan Kim <minchan.kim@gmail.com> > > --- > > mm/backing-dev.c | 5 ++--- > > mm/compaction.c | 6 +++++- > > mm/vmscan.c | 6 +++++- > > 3 files changed, 12 insertions(+), 5 deletions(-) > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > index 6abe860..9431bca 100644 > > --- a/mm/backing-dev.c > > +++ b/mm/backing-dev.c > > @@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested); > > * @timeout: timeout in jiffies > > * > > * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit > > - * write congestion. If no backing_devs are congested then just wait for the > > - * next write to be completed. > > + * write congestion. If no backing_devs are congested then just returns. 
> > */ > > long congestion_wait(int sync, long timeout) > > { > > @@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout) > > if (atomic_read(&nr_bdi_congested[sync]) == 0) { > > unnecessary = true; > > cond_resched(); > > - ret = 0; > > + ret = timeout; > > } else { > > prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > > ret = io_schedule_timeout(timeout); > > diff --git a/mm/compaction.c b/mm/compaction.c > > index 94cce51..7370683 100644 > > --- a/mm/compaction.c > > +++ b/mm/compaction.c > > @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone, > > * delay for some time until fewer pages are isolated > > */ > > while (unlikely(too_many_isolated(zone))) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + long timeout = HZ/10; > > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > > + set_current_state(TASK_INTERRUPTIBLE); > > + schedule_timeout(timeout); > > + } > > > > We don't really need the timeout variable here but I see what you are > at. It's unfortunate to just go to sleep for HZ/10 but if it's not > congestion, we do not have any other event to wake up on at the moment. > We'd have to introduce a too_many_isolated waitqueue that is kicked if > pages are put back on the LRU. I thought it firstly but first of all, let's make sure how often this situation happens and it's really serious problem. I means it's rather overkill. > > This is better than spinning though. > > > if (fatal_signal_pending(current)) > > return 0; > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 3109ff7..f5e3e28 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > unsigned long nr_dirty; > > while (unlikely(too_many_isolated(zone, file, sc))) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + long timeout = HZ/10; > > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > > + set_current_state(TASK_INTERRUPTIBLE); > > + schedule_timeout(timeout); > > + } > > > > /* We are about to die and free our memory. Return now. */ > > if (fatal_signal_pending(current)) > > This seems very reasonable. I'll review it more carefully tomorrow and if I > spot nothing horrible, I'll add it onto the series. I'm not sure I'm hitting > the too_many_isolated() case but I cannot think of a better alternative > without adding more waitqueues. Thanks. Mel. > > -- > Mel Gorman > Part-time Phd Student Linux Technology Center > University of Limerick IBM Dublin Software Lab -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-26 17:20 ` [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Minchan Kim 2010-08-26 17:31 ` Mel Gorman @ 2010-08-27 1:21 ` Wu Fengguang 2010-08-27 1:41 ` Minchan Kim 2010-08-27 9:38 ` Mel Gorman 1 sibling, 2 replies; 38+ messages in thread From: Wu Fengguang @ 2010-08-27 1:21 UTC (permalink / raw) To: Minchan Kim Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li Shaohua Minchan, It's much cleaner to keep the unchanged congestion_wait() and add a congestion_wait_check() for converting problematic wait sites. The too_many_isolated() wait is merely a protective mechanism, I won't bother to improve it at the cost of more code. Thanks, Fengguang > diff --git a/mm/compaction.c b/mm/compaction.c > index 94cce51..7370683 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone, > * delay for some time until fewer pages are isolated > */ > while (unlikely(too_many_isolated(zone))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + long timeout = HZ/10; > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > + set_current_state(TASK_INTERRUPTIBLE); > + schedule_timeout(timeout); > + } > > if (fatal_signal_pending(current)) > return 0; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 3109ff7..f5e3e28 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > unsigned long nr_dirty; > while (unlikely(too_many_isolated(zone, file, sc))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + long timeout = HZ/10; > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > + set_current_state(TASK_INTERRUPTIBLE); > + schedule_timeout(timeout); > + } > > /* We are about to die and free our memory. Return now. */ > if (fatal_signal_pending(current)) > -- > 1.7.0.5 > > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 1:21 ` Wu Fengguang @ 2010-08-27 1:41 ` Minchan Kim 2010-08-27 1:50 ` Wu Fengguang 0 siblings, 1 reply; 38+ messages in thread From: Minchan Kim @ 2010-08-27 1:41 UTC (permalink / raw) To: Wu Fengguang Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li Shaohua

Hi, Wu.

On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> Minchan,
>
> It's much cleaner to keep the unchanged congestion_wait() and add a
> congestion_wait_check() for converting problematic wait sites. The
> too_many_isolated() wait is merely a protective mechanism, I won't
> bother to improve it at the cost of more code.

Do you mean something like the following?

while (unlikely(too_many_isolated(zone, file, sc))) {
	congestion_wait_check(BLK_RW_ASYNC, HZ/10);

	/* We are about to die and free our memory. Return now. */
	if (fatal_signal_pending(current))
		return SWAP_CLUSTER_MAX;
}

> Thanks,
> Fengguang

-- Kind regards, Minchan Kim
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 1:41 ` Minchan Kim @ 2010-08-27 1:50 ` Wu Fengguang 2010-08-27 2:02 ` Minchan Kim 0 siblings, 1 reply; 38+ messages in thread From: Wu Fengguang @ 2010-08-27 1:50 UTC (permalink / raw) To: Minchan Kim Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li, Shaohua

On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
> Hi, Wu.
>
> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > Minchan,
> >
> > It's much cleaner to keep the unchanged congestion_wait() and add a
> > congestion_wait_check() for converting problematic wait sites. The
> > too_many_isolated() wait is merely a protective mechanism, I won't
> > bother to improve it at the cost of more code.
>
> Do you mean something like the following?

No, I mean do not change the too_many_isolated() related code at all :)
And use congestion_wait_check() in other places where we can prove there is a problem that can rightly be fixed by converting to congestion_wait_check().

> while (unlikely(too_many_isolated(zone, file, sc))) {
> 	congestion_wait_check(BLK_RW_ASYNC, HZ/10);
>
> 	/* We are about to die and free our memory. Return now. */
> 	if (fatal_signal_pending(current))
> 		return SWAP_CLUSTER_MAX;
> }

Thanks, Fengguang
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 1:50 ` Wu Fengguang @ 2010-08-27 2:02 ` Minchan Kim 2010-08-27 4:34 ` Wu Fengguang 0 siblings, 1 reply; 38+ messages in thread From: Minchan Kim @ 2010-08-27 2:02 UTC (permalink / raw) To: Wu Fengguang Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li, Shaohua

On Fri, Aug 27, 2010 at 10:50 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
>> Hi, Wu.
>>
>> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > Minchan,
>> >
>> > It's much cleaner to keep the unchanged congestion_wait() and add a
>> > congestion_wait_check() for converting problematic wait sites. The
>> > too_many_isolated() wait is merely a protective mechanism, I won't
>> > bother to improve it at the cost of more code.
>>
>> Do you mean something like the following?
>
> No, I mean do not change the too_many_isolated() related code at all :)
> And use congestion_wait_check() in other places where we can prove
> there is a problem that can rightly be fixed by converting to
> congestion_wait_check().

I always have trouble understanding your comments. Apparently, my eyes have a problem. ;(

This patch depends on Mel's series. With congestion_wait changed to just return when there is no congestion, we would get CPU hogging in too_many_isolated. I think the same would apply with Li's congestion_wait_check, too. If congestion_wait stays unchanged, we don't need this patch.

Still, maybe I can't understand your comment. Sorry.

-- Kind regards, Minchan Kim
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 2:02 ` Minchan Kim @ 2010-08-27 4:34 ` Wu Fengguang 0 siblings, 0 replies; 38+ messages in thread From: Wu Fengguang @ 2010-08-27 4:34 UTC (permalink / raw) To: Minchan Kim Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li, Shaohua On Fri, Aug 27, 2010 at 10:02:52AM +0800, Minchan Kim wrote: > On Fri, Aug 27, 2010 at 10:50 AM, Wu Fengguang <fengguang.wu@intel.com> wrote: > > On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote: > >> Hi, Wu. > >> > >> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote: > >> > Minchan, > >> > > >> > It's much cleaner to keep the unchanged congestion_wait() and add a > >> > congestion_wait_check() for converting problematic wait sites. The > >> > too_many_isolated() wait is merely a protective mechanism, I won't > >> > bother to improve it at the cost of more code. > >> > >> You means following as? > > > > No, I mean do not change the too_many_isolated() related code at all :) > > And to use congestion_wait_check() in other places that we can prove > > there is a problem that can be rightly fixed by changing to > > congestion_wait_check(). > > I always suffer from understanding your comment. > Apparently, my eyes have a problem. ;( > This patch is dependent of Mel's series. > With changing congestion_wait with just return when no congestion, it > would have CPU hogging in too_many_isolated. I think it would apply in > Li's congestion_wait_check, too. > If no change is current congestion_wait, we doesn't need this patch. > > Still, maybe I can't understand your comment. Sorry. Sorry! The confusion must come from the modified congestion_wait() by Mel. My proposal is to _not_ modify congestion_wait(), but add another congestion_wait_check() which won't sleep 100ms when no IO. In this way, the following chunks become unnecessary. --- a/mm/compaction.c +++ b/mm/compaction.c @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone, * delay for some time until fewer pages are isolated */ while (unlikely(too_many_isolated(zone))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + long timeout = HZ/10; + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(timeout); + } if (fatal_signal_pending(current)) return 0; diff --git a/mm/vmscan.c b/mm/vmscan.c index 3109ff7..f5e3e28 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, unsigned long nr_dirty; while (unlikely(too_many_isolated(zone, file, sc))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + long timeout = HZ/10; + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(timeout); + } /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 1:21 ` Wu Fengguang 2010-08-27 1:41 ` Minchan Kim @ 2010-08-27 9:38 ` Mel Gorman 1 sibling, 0 replies; 38+ messages in thread From: Mel Gorman @ 2010-08-27 9:38 UTC (permalink / raw) To: Wu Fengguang Cc: Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li Shaohua

On Fri, Aug 27, 2010 at 09:21:47AM +0800, Wu Fengguang wrote:
> Minchan,
>
> It's much cleaner to keep the unchanged congestion_wait() and add a
> congestion_wait_check() for converting problematic wait sites. The
> too_many_isolated() wait is merely a protective mechanism, I won't
> bother to improve it at the cost of more code.

This is what I've done. I dropped the patch again and am using wait_iff_congested(). I left the too_many_isolated() callers as congestion_wait().

-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab