* [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
  From: Mel Gorman @ 2010-08-26 15:14 UTC
  To: linux-mm, linux-fsdevel
  Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
      Wu Fengguang, Jan Kara, linux-kernel

congestion_wait() is a bit stupid in that it goes to sleep even when there
is no congestion. This causes stalls in a number of situations and may be
partially responsible for bug reports about desktop interactivity.

This patch series aims to account for these unnecessary calls to
congestion_wait() and to avoid going to sleep when there is no congestion.
Patches 1 and 2 add instrumentation related to congestion which should be
reusable by alternative solutions to congestion_wait. Patch 3 calls
cond_resched() instead of going to sleep if there is no congestion; the
logic is sketched below, just before the per-machine reports.

Once again, I shoved this through performance tests. Unlike previous
tests, I ran this on a ported version of my usual test suite that should
be suitable for release soon. It's not quite as good as my old set but
it's sufficient for this and related series. The tests I ran were
kernbench, vmr-stream, iozone, hackbench-sockets, hackbench-pipes,
netperf-udp, netperf-tcp, sysbench and stress-highalloc. Sysbench was a
read/write test and stress-highalloc is the usual test stressing the
number of high-order allocations that can be made while the system is
under severe stress. The suite contains the necessary analysis scripts as
well and I'd release it now except the documentation blows.

x86:    Intel Pentium D 3GHz with 3G RAM (no-brand machine)
x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
ppc64:  PPC970MP 2.5GHz with 3GB RAM (a Terrasoft Powerstation)

The disks on all of them were single disks and not particularly fast.
Comparison was between 2.6.36-rc1 with patches 1 and 2 applied for
instrumentation (traceonly) and a second run with patch 3 also applied
(nocongest).

In all cases, kernbench, hackbench, STREAM and iozone did not show any
performance difference because none of them were pressuring the system
enough to be calling congestion_wait(), so I won't post their results.
About all worth noting for them is that nothing horrible appeared to
break.

In the analysis scripts, I record an unnecessary sleep as one that took
place when there was no congestion. The post-processing scripts for
cond_resched() will only count an uncongested call to congestion_wait()
as unnecessary if the process actually gets scheduled. Ordinarily, we'd
expect it to continue uninterrupted.

One vague concern I have is that when too many pages are isolated, we
call congestion_wait(). This could now actively spin in the loop for its
quantum before calling cond_resched(). If it's called with no congestion,
it's hard to know what the proper thing to do there is.
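For reference while reading the reports, the patch 3 logic is roughly the
following. This is a sketch of the concept only, building on the
nr_bdi_congested counter that patch 2 introduces; it is not the literal
patch and the early-return value in particular is a simplification:

long congestion_wait(int sync, long timeout)
{
	long ret;
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[sync];

	/*
	 * If no BDI is congested, no wakeup event is coming and sleeping
	 * for the full timeout is pointless. Yield the CPU instead.
	 */
	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
		cond_resched();
		return timeout;
	}

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	ret = io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);

	return ret;
}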
X86

Sysbench on this machine was not stressed enough to call
congestion_wait() so I'll just discuss the stress-highalloc test. This is
the full report from the test suite:

STRESS-HIGHALLOC
                 stress-highalloc   stress-highalloc
                   traceonly-v1r1     nocongest-v1r1
Pass 1           70.00 ( 0.00%)     72.00 ( 2.00%)
Pass 2           72.00 ( 0.00%)     72.00 ( 0.00%)
At Rest          74.00 ( 0.00%)     73.00 (-1.00%)

FTrace Reclaim Statistics: vmscan
                                       traceonly-v1r1     nocongest-v1r1
Direct reclaims                                   409                755
Direct reclaim pages scanned                   185585             212524
Direct reclaim write file async I/O               442                554
Direct reclaim write anon async I/O             31789              27074
Direct reclaim write file sync I/O                 17                 23
Direct reclaim write anon sync I/O              17825              15013
Wake kswapd requests                              895               1274
Kswapd wakeups                                    387                432
Kswapd pages scanned                         16373859           12892992
Kswapd reclaim write file async I/O             29267              18188
Kswapd reclaim write anon async I/O           1243386            1080234
Kswapd reclaim write file sync I/O                  0                  0
Kswapd reclaim write anon sync I/O                  0                  0
Time stalled direct reclaim (seconds)         4479.04            3446.81
Time kswapd awake (seconds)                   2229.99            1218.52

Total pages scanned                          16559444           13105516
%age total pages scanned/written                7.99%              8.71%
%age file pages scanned/written                 0.18%              0.14%
Percentage Time Spent Direct Reclaim           74.99%             69.54%
Percentage Time kswapd Awake                   41.78%             28.57%

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                  474                 38
Direct number schedule    waited                    0               9478
Direct time   congest     waited              21564ms             3732ms
Direct time   schedule    waited                  0ms                4ms
Direct unnecessary wait                           434                  1
KSwapd number congest     waited                   68                  0
KSwapd number schedule    waited                    0                  0
KSwapd time   schedule    waited                  0ms                0ms
KSwapd time   congest     waited               5424ms                0ms
Kswapd unnecessary wait                            44                  0

MMTests Statistics: duration
User/Sys Time Running Test (seconds)          1493.97            1509.88
Total Elapsed Time (seconds)                  5337.71            4265.07

Allocations under stress were slightly better but by and large there is
no significant difference in success rates. The test completed 1072
seconds faster, which is a pretty decent speedup. Scanning rates in
reclaim were higher but that is somewhat expected because we weren't
going to sleep as much. Time stalled in reclaim for both direct reclaim
and kswapd was reduced, which is pretty significant. In terms of
congestion_wait, the time spent asleep was massively reduced - by about
17 seconds for direct reclaim and 5 seconds for kswapd. cond_resched()
is called a number of times instead, of course, but the time spent being
scheduled was a mere 4ms. Overall, this looked positive.

X86-64

Sysbench again wasn't under enough pressure so here is the
stress-highalloc test.
STRESS-HIGHALLOC
                 stress-highalloc   stress-highalloc
                   traceonly-v1r1     nocongest-v1r1
Pass 1           69.00 ( 0.00%)     73.00 ( 4.00%)
Pass 2           71.00 ( 0.00%)     74.00 ( 3.00%)
At Rest          72.00 ( 0.00%)     75.00 ( 3.00%)

FTrace Reclaim Statistics: vmscan
                                       traceonly-v1r1     nocongest-v1r1
Direct reclaims                                   646               1091
Direct reclaim pages scanned                    94779             102392
Direct reclaim write file async I/O               164                216
Direct reclaim write anon async I/O             12162              15413
Direct reclaim write file sync I/O                 64                 45
Direct reclaim write anon sync I/O               5366               6987
Wake kswapd requests                             3950               3912
Kswapd wakeups                                    613                579
Kswapd pages scanned                          7544412            7267203
Kswapd reclaim write file async I/O             14660              16256
Kswapd reclaim write anon async I/O            964824            1065445
Kswapd reclaim write file sync I/O                  0                  0
Kswapd reclaim write anon sync I/O                  0                  0
Time stalled direct reclaim (seconds)         3279.00            3564.59
Time kswapd awake (seconds)                   1445.70            1870.70

Total pages scanned                           7639191            7369595
%age total pages scanned/written               13.05%             14.99%
%age file pages scanned/written                 0.19%              0.22%
Percentage Time Spent Direct Reclaim           70.48%             72.04%
Percentage Time kswapd Awake                   35.62%             42.94%

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                  801                 97
Direct number schedule    waited                    0              16079
Direct time   congest     waited              37448ms             9004ms
Direct time   schedule    waited                  0ms                0ms
Direct unnecessary wait                           696                  0
KSwapd number congest     waited                   10                  1
KSwapd number schedule    waited                    0                  0
KSwapd time   schedule    waited                  0ms                0ms
KSwapd time   congest     waited                900ms              100ms
Kswapd unnecessary wait                             6                  0

MMTests Statistics: duration
User/Sys Time Running Test (seconds)          1373.11             1383.7
Total Elapsed Time (seconds)                  4058.33            4356.47

Success rates were slightly higher again; not by a massive amount, but
some. Time to complete the test was unfortunately increased slightly
though and I'm not sure where that is coming from. The increased number
of successful allocations would account for some of it because the
system is under greater memory pressure as a result of the allocations.
Scanning rates are comparable. Writing back files from reclaim was
slightly increased, which I believe is due to less time being spent
asleep, so there was a smaller window for the flusher threads to do
their work. Reducing that is the responsibility of another series.
Again, the time spent asleep in congestion_wait() is reduced by a large
amount - 28 seconds for direct reclaim - and none of the cond_resched()
calls resulted in sleep time. Overall, this seems reasonable.
PPC64

Unlike the other two machines, sysbench called congestion_wait() a few
times so here are the full results for sysbench:

SYSBENCH
          traceonly-v1r1      nocongest-v1r1
 1     5307.36 ( 0.00%)    5349.58 ( 0.79%)
 2     9886.45 ( 0.00%)   10274.78 ( 3.78%)
 3    14165.01 ( 0.00%)   14210.64 ( 0.32%)
 4    16239.12 ( 0.00%)   16201.46 (-0.23%)
 5    15337.09 ( 0.00%)   15541.56 ( 1.32%)
 6    14763.64 ( 0.00%)   15805.80 ( 6.59%)
 7    14216.69 ( 0.00%)   15023.57 ( 5.37%)
 8    13749.62 ( 0.00%)   14492.34 ( 5.12%)
 9    13647.75 ( 0.00%)   13969.77 ( 2.31%)
10    13275.70 ( 0.00%)   13495.08 ( 1.63%)
11    13324.91 ( 0.00%)   12879.81 (-3.46%)
12    13169.23 ( 0.00%)   12967.36 (-1.56%)
13    12896.20 ( 0.00%)   12981.43 ( 0.66%)
14    12793.44 ( 0.00%)   12768.26 (-0.20%)
15    12627.98 ( 0.00%)   12522.86 (-0.84%)
16    12228.54 ( 0.00%)   12352.07 ( 1.00%)

FTrace Reclaim Statistics: vmscan
                                       traceonly-v1r1     nocongest-v1r1
Direct reclaims                                     0                  0
Direct reclaim pages scanned                        0                  0
Direct reclaim write file async I/O                 0                  0
Direct reclaim write anon async I/O                 0                  0
Direct reclaim write file sync I/O                  0                  0
Direct reclaim write anon sync I/O                  0                  0
Wake kswapd requests                                0                  0
Kswapd wakeups                                    202                194
Kswapd pages scanned                          5990987            5618709
Kswapd reclaim write file async I/O                24                 16
Kswapd reclaim write anon async I/O              1509               1564
Kswapd reclaim write file sync I/O                  0                  0
Kswapd reclaim write anon sync I/O                  0                  0
Time stalled direct reclaim (seconds)            0.00               0.00
Time kswapd awake (seconds)                    174.23             152.17

Total pages scanned                           5990987            5618709
%age total pages scanned/written                0.03%              0.03%
%age file pages scanned/written                 0.00%              0.00%
Percentage Time Spent Direct Reclaim            0.00%              0.00%
Percentage Time kswapd Awake                    2.80%              2.60%

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                    0                  0
Direct number schedule    waited                    0                  0
Direct time   congest     waited                  0ms                0ms
Direct time   schedule    waited                  0ms                0ms
Direct unnecessary wait                             0                  0
KSwapd number congest     waited                   10                  3
KSwapd number schedule    waited                    0                  0
KSwapd time   schedule    waited                  0ms                0ms
KSwapd time   congest     waited                800ms              300ms
Kswapd unnecessary wait                             6                  0

Performance is improved by a decent margin although I didn't check if it
was statistically significant or not. The time kswapd spent asleep was
slightly reduced.
STRESS-HIGHALLOC
                 stress-highalloc   stress-highalloc
                   traceonly-v1r1     nocongest-v1r1
Pass 1           40.00 ( 0.00%)     35.00 (-5.00%)
Pass 2           50.00 ( 0.00%)     45.00 (-5.00%)
At Rest          61.00 ( 0.00%)     64.00 ( 3.00%)

FTrace Reclaim Statistics: vmscan
                                       traceonly-v1r1     nocongest-v1r1
Direct reclaims                                   166                926
Direct reclaim pages scanned                   167920             183644
Direct reclaim write file async I/O               391                412
Direct reclaim write anon async I/O             31563              31986
Direct reclaim write file sync I/O                 54                 52
Direct reclaim write anon sync I/O              21696              17087
Wake kswapd requests                              123                128
Kswapd wakeups                                    143                143
Kswapd pages scanned                          3899414            4229450
Kswapd reclaim write file async I/O             12392              13098
Kswapd reclaim write anon async I/O            673260             709817
Kswapd reclaim write file sync I/O                  0                  0
Kswapd reclaim write anon sync I/O                  0                  0
Time stalled direct reclaim (seconds)         1595.13            1692.18
Time kswapd awake (seconds)                   1114.00            1210.48

Total pages scanned                           4067334            4413094
%age total pages scanned/written               18.18%             17.50%
%age file pages scanned/written                 0.32%              0.31%
Percentage Time Spent Direct Reclaim           45.89%             47.50%
Percentage Time kswapd Awake                   46.09%             48.04%

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                  233                 16
Direct number schedule    waited                    0               1323
Direct time   congest     waited              10164ms             1600ms
Direct time   schedule    waited                  0ms                0ms
Direct unnecessary wait                           218                  0
KSwapd number congest     waited                   11                 13
KSwapd number schedule    waited                    0                  3
KSwapd time   schedule    waited                  0ms                0ms
KSwapd time   congest     waited               1100ms             1244ms
Kswapd unnecessary wait                             0                  0

MMTests Statistics: duration
User/Sys Time Running Test (seconds)          1880.56            1870.36
Total Elapsed Time (seconds)                  2417.17            2519.51

Allocation success rates are slightly down but on PPC64 they are always
very difficult and I have other ideas on how allocation success rates
could be improved. What is more important is that, again, the time spent
asleep due to congestion_wait() was reduced for direct reclaimers. The
results here aren't as positive as on the other two machines but they
still seem acceptable.

Broadly speaking, I think sleeping in congestion_wait() has been
responsible for some bugs related to stalls under large amounts of IO,
particularly read IO, so we need to do something about it. These tests
seem overall positive but it'd be interesting to hear whether someone
with a workload that stalls in congestion_wait() unnecessarily is helped
by this patch. Desktop interactivity would be harder to test because I
think it has multiple root causes, of which congestion_wait is just one.
I've included Christian Ehrhardt in the cc because he had a bug back in
April that was rooted in congestion_wait() that I think this might help,
and hopefully he can provide hard data for a workload with lots of IO
but constrained memory. I cc'd Johannes because we were discussing
congestion_wait() at LSF/MM and he might have some thoughts, and I think
I was talking to Jan briefly about congestion_wait() as well. As this
affects writeback, Wu and fsdevel might have some opinions.

 include/trace/events/writeback.h |   22 ++++++++++++++++++++++
 mm/backing-dev.c                 |   31 ++++++++++++++++++++++++++-----
 2 files changed, 48 insertions(+), 5 deletions(-)
* [PATCH 1/3] writeback: Account for time spent congestion_waited
  From: Mel Gorman @ 2010-08-26 15:14 UTC
  To: linux-mm, linux-fsdevel
  Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
      Wu Fengguang, Jan Kara, linux-kernel

There is strong evidence to indicate a lot of time is being spent in
congestion_wait(), some of it unnecessarily. This patch adds a
tracepoint for congestion_wait to record when congestion_wait() occurred
and how long was spent.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/writeback.h |   17 +++++++++++++++++
 mm/backing-dev.c                 |    4 ++++
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index f345f66..e3bee61 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -153,6 +153,23 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_written);
 DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_congest_waited,
+
+	TP_PROTO(unsigned int usec_delayed),
+
+	TP_ARGS(usec_delayed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	usec_delayed	)
+	),
+
+	TP_fast_assign(
+		__entry->usec_delayed	= usec_delayed;
+	),
+
+	TP_printk("usec_delayed=%u", __entry->usec_delayed)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index eaa4a5b..7ae33e2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -759,12 +759,16 @@ EXPORT_SYMBOL(set_bdi_congested);
 long congestion_wait(int sync, long timeout)
 {
 	long ret;
+	unsigned long start = jiffies;
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
+
+	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start));
+
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-- 
1.7.1
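A usage note rather than part of the patch: the event above is defined
in the "writeback" trace group, so with debugfs mounted in the usual
place it should be controllable at runtime via
/sys/kernel/debug/tracing/events/writeback/writeback_congest_waited/enable,
after which each congestion_wait() emits a trace line carrying
usec_delayed=<usecs> in line with the TP_printk format above. The exact
path depends on where debugfs is mounted.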
* Re: [PATCH 1/3] writeback: Account for time spent congestion_waited
  From: Minchan Kim @ 2010-08-26 17:23 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:14PM +0100, Mel Gorman wrote:
> There is strong evidence to indicate a lot of time is being spent in
> congestion_wait(), some of it unnecessarily. This patch adds a
> tracepoint for congestion_wait to record when congestion_wait() occurred
> and how long was spent.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

I think adding the tracepoint is enough for now, at least until this
issue is solved.

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 1/3] writeback: Account for time spent congestion_waited
  From: Johannes Weiner @ 2010-08-26 18:10 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:14PM +0100, Mel Gorman wrote:
> There is strong evidence to indicate a lot of time is being spent in
> congestion_wait(), some of it unnecessarily. This patch adds a
> tracepoint for congestion_wait to record when congestion_wait() occurred
> and how long was spent.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-26 15:14 UTC
  To: linux-mm, linux-fsdevel
  Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
      Wu Fengguang, Jan Kara, linux-kernel

If congestion_wait() is called when there is no congestion, the caller
will wait for the full timeout. This can cause unreasonable and
unnecessary stalls. There are a number of potential modifications that
could be made to wake sleepers but this patch measures how serious the
problem is. It keeps count of how many congested BDIs there are. If
congestion_wait() is called with no BDIs congested, the tracepoint will
record that the wait was unnecessary.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/writeback.h |   11 ++++++++---
 mm/backing-dev.c                 |   15 ++++++++++++---
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index e3bee61..03bb04b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
 
 TRACE_EVENT(writeback_congest_waited,
 
-	TP_PROTO(unsigned int usec_delayed),
+	TP_PROTO(unsigned int usec_delayed, bool unnecessary),
 
-	TP_ARGS(usec_delayed),
+	TP_ARGS(usec_delayed, unnecessary),
 
 	TP_STRUCT__entry(
 		__field(	unsigned int,	usec_delayed	)
+		__field(	unsigned int,	unnecessary	)
 	),
 
 	TP_fast_assign(
 		__entry->usec_delayed	= usec_delayed;
+		__entry->unnecessary	= unnecessary;
 	),
 
-	TP_printk("usec_delayed=%u", __entry->usec_delayed)
+	TP_printk("usec_delayed=%u unnecessary=%d",
+		__entry->usec_delayed,
+		__entry->unnecessary
+	)
 );
 
 #endif /* _TRACE_WRITEBACK_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7ae33e2..a49167f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
+static atomic_t nr_bdi_congested[2];
 
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
@@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	clear_bit(bit, &bdi->state);
+	if (test_and_clear_bit(bit, &bdi->state))
+		atomic_dec(&nr_bdi_congested[sync]);
 	smp_mb__after_clear_bit();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
@@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	enum bdi_state bit;
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	set_bit(bit, &bdi->state);
+	if (!test_and_set_bit(bit, &bdi->state))
+		atomic_inc(&nr_bdi_congested[sync]);
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
@@ -760,14 +763,20 @@ long congestion_wait(int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
+	bool unnecessary = false;
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
+	/* Check if this call to congestion_wait was necessary */
+	if (atomic_read(&nr_bdi_congested[sync]) == 0)
+		unnecessary = true;
+
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 
-	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start));
+	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
+					unnecessary);
 
 	return ret;
 }
-- 
1.7.1
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Minchan Kim @ 2010-08-26 17:35 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> If congestion_wait() is called when there is no congestion, the caller
> will wait for the full timeout. This can cause unreasonable and
> unnecessary stalls. There are a number of potential modifications that
> could be made to wake sleepers but this patch measures how serious the
> problem is. It keeps count of how many congested BDIs there are. If
> congestion_wait() is called with no BDIs congested, the tracepoint will
> record that the wait was unnecessary.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/trace/events/writeback.h |   11 ++++++++---
>  mm/backing-dev.c                 |   15 ++++++++++++---
>  2 files changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index e3bee61..03bb04b 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
>  
>  TRACE_EVENT(writeback_congest_waited,
>  
> -	TP_PROTO(unsigned int usec_delayed),
> +	TP_PROTO(unsigned int usec_delayed, bool unnecessary),
>  
> -	TP_ARGS(usec_delayed),
> +	TP_ARGS(usec_delayed, unnecessary),
>  
>  	TP_STRUCT__entry(
>  		__field(	unsigned int,	usec_delayed	)
> +		__field(	unsigned int,	unnecessary	)
>  	),
>  
>  	TP_fast_assign(
>  		__entry->usec_delayed	= usec_delayed;
> +		__entry->unnecessary	= unnecessary;
>  	),
>  
> -	TP_printk("usec_delayed=%u", __entry->usec_delayed)
> +	TP_printk("usec_delayed=%u unnecessary=%d",
> +		__entry->usec_delayed,
> +		__entry->unnecessary
> +	)
>  );
>  
>  #endif /* _TRACE_WRITEBACK_H */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7ae33e2..a49167f 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
>  	};
> +static atomic_t nr_bdi_congested[2];
>  
>  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
>  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> -	clear_bit(bit, &bdi->state);
> +	if (test_and_clear_bit(bit, &bdi->state))
> +		atomic_dec(&nr_bdi_congested[sync]);

Hmm. congestion_wait's semantics are "wait for _a_ backing_dev to become
uncongested", but this seems to consider whole backing devs. Is that
your intention? Or am I missing something?

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-26 17:41 UTC
  To: Minchan Kim
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Fri, Aug 27, 2010 at 02:35:34AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/trace/events/writeback.h |   11 ++++++++---
> >  mm/backing-dev.c                 |   15 ++++++++++++---
> >  2 files changed, 20 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> > index e3bee61..03bb04b 100644
> > --- a/include/trace/events/writeback.h
> > +++ b/include/trace/events/writeback.h
> > @@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
> >  
> >  TRACE_EVENT(writeback_congest_waited,
> >  
> > -	TP_PROTO(unsigned int usec_delayed),
> > +	TP_PROTO(unsigned int usec_delayed, bool unnecessary),
> >  
> > -	TP_ARGS(usec_delayed),
> > +	TP_ARGS(usec_delayed, unnecessary),
> >  
> >  	TP_STRUCT__entry(
> >  		__field(	unsigned int,	usec_delayed	)
> > +		__field(	unsigned int,	unnecessary	)
> >  	),
> >  
> >  	TP_fast_assign(
> >  		__entry->usec_delayed	= usec_delayed;
> > +		__entry->unnecessary	= unnecessary;
> >  	),
> >  
> > -	TP_printk("usec_delayed=%u", __entry->usec_delayed)
> > +	TP_printk("usec_delayed=%u unnecessary=%d",
> > +		__entry->usec_delayed,
> > +		__entry->unnecessary
> > +	)
> >  );
> >  
> >  #endif /* _TRACE_WRITEBACK_H */
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 7ae33e2..a49167f 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> >  	};
> > +static atomic_t nr_bdi_congested[2];
> >  
> >  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  {
> > @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  	wait_queue_head_t *wqh = &congestion_wqh[sync];
> >  
> >  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> > -	clear_bit(bit, &bdi->state);
> > +	if (test_and_clear_bit(bit, &bdi->state))
> > +		atomic_dec(&nr_bdi_congested[sync]);
>
> Hmm. congestion_wait's semantics are "wait for _a_ backing_dev to become
> uncongested", but this seems to consider whole backing devs. Is that
> your intention? Or am I missing something?
>

Not whole backing devs, all backing devs. This is intentional. If
congestion_wait() is called with 0 BDIs congested, we sleep the full
timeout because a wakeup event will not occur - this is a bad scenario.
To know if 0 BDIs were congested, one could either walk all the BDIs
checking their status or maintain a counter like nr_bdi_congested, which
is what I decided on (the rejected walk is sketched below for
comparison).
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
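For comparison, the BDI walk described above would look something like
the following. This is an illustration only, assuming the RCU-protected
bdi_list in mm/backing-dev.c; it is not part of the series:

static bool any_bdi_congested(int sync)
{
	enum bdi_state bit = sync ? BDI_sync_congested : BDI_async_congested;
	struct backing_dev_info *bdi;
	bool congested = false;

	/* O(number of BDIs) on every call into congestion_wait() */
	rcu_read_lock();
	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
		if (test_bit(bit, &bdi->state)) {
			congested = true;
			break;
		}
	}
	rcu_read_unlock();

	return congested;
}

The nr_bdi_congested counter keeps the check in congestion_wait() O(1),
at the cost of one extra atomic operation whenever a BDI changes
congestion state.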
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Johannes Weiner @ 2010-08-26 18:29 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> If congestion_wait() is called when there is no congestion, the caller
> will wait for the full timeout. This can cause unreasonable and
> unnecessary stalls. There are a number of potential modifications that
> could be made to wake sleepers but this patch measures how serious the
> problem is. It keeps count of how many congested BDIs there are. If
> congestion_wait() is called with no BDIs congested, the tracepoint will
> record that the wait was unnecessary.

I am not convinced that unnecessary is the right word. On a workload
without any IO (i.e. no congestion_wait() necessary, ever), I noticed
the VM regressing both in time and in reclaiming the right pages when
simply removing congestion_wait() from the direct reclaim paths (the
one in __alloc_pages_slowpath and the other one in
do_try_to_free_pages).

So just being stupid and waiting for the timeout in direct reclaim
while kswapd can make progress seemed to do a better job for that
load.

I can not exactly pinpoint the reason for that behaviour; it would be
nice if somebody had an idea.

So personally I think it's a good idea to get an insight on the use of
congestion_wait() [patch 1] but I don't agree with changing its
behaviour just yet, or judging its usefulness solely on whether it
correctly waits for bdi congestion.
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-26 20:31 UTC
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
>
> I am not convinced that unnecessary is the right word. On a workload
> without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> the VM regressing both in time and in reclaiming the right pages when
> simply removing congestion_wait() from the direct reclaim paths (the
> one in __alloc_pages_slowpath and the other one in
> do_try_to_free_pages).
>
> So just being stupid and waiting for the timeout in direct reclaim
> while kswapd can make progress seemed to do a better job for that
> load.
>
> I can not exactly pinpoint the reason for that behaviour; it would be
> nice if somebody had an idea.
>

There is a possibility that the behaviour in that case was due to flusher
threads doing the writes rather than direct reclaim queueing pages for IO
in an inefficient manner. So the stall is stupid but happens to work out
well because flusher threads get the chance to do work.

> So personally I think it's a good idea to get an insight on the use of
> congestion_wait() [patch 1] but I don't agree with changing its
> behaviour just yet, or judging its usefulness solely on whether it
> correctly waits for bdi congestion.
>

Unfortunately, I strongly suspect that some of the desktop stalls seen
during IO (one of which involved no writes) were due to calling
congestion_wait and waiting the full timeout where no writes are going
on.

It gets potentially worse too. Let's say we have a system with many BDIs
of different speeds - e.g. an SSD on one end of the spectrum and a USB
flash drive on the other. The congestion for writes could be on the USB
flash drive but, due to low memory, the allocator, direct reclaimers and
kswapd go to sleep periodically on congestion_wait for the USB stick
even though the bulk of the pages needing reclaim are backed by the SSD.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Shaohua Li @ 2010-08-27 2:12 UTC
  To: Mel Gorman
  Cc: Johannes Weiner, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
      Andrew Morton, Christian Ehrhardt, Wu, Fengguang, Jan Kara,
      linux-kernel@vger.kernel.org

On Fri, 2010-08-27 at 04:31 +0800, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called when there is no congestion, the caller
> > > will wait for the full timeout. This can cause unreasonable and
> > > unnecessary stalls. There are a number of potential modifications that
> > > could be made to wake sleepers but this patch measures how serious the
> > > problem is. It keeps count of how many congested BDIs there are. If
> > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > record that the wait was unnecessary.
> >
> > I am not convinced that unnecessary is the right word. On a workload
> > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > the VM regressing both in time and in reclaiming the right pages when
> > simply removing congestion_wait() from the direct reclaim paths (the
> > one in __alloc_pages_slowpath and the other one in
> > do_try_to_free_pages).
> >
> > So just being stupid and waiting for the timeout in direct reclaim
> > while kswapd can make progress seemed to do a better job for that
> > load.
> >
> > I can not exactly pinpoint the reason for that behaviour; it would be
> > nice if somebody had an idea.
>
> There is a possibility that the behaviour in that case was due to flusher
> threads doing the writes rather than direct reclaim queueing pages for IO
> in an inefficient manner. So the stall is stupid but happens to work out
> well because flusher threads get the chance to do work.

If this is the case, we already have the queue congested. Removing
congestion_wait() might cause a regression, but either your change or
congestion_wait_check() should not have the regression, as we do check
if the bdi is congested.

Thanks,
Shaohua
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-27 9:20 UTC
  To: Shaohua Li
  Cc: Johannes Weiner, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
      Andrew Morton, Christian Ehrhardt, Wu, Fengguang, Jan Kara,
      linux-kernel@vger.kernel.org

On Fri, Aug 27, 2010 at 10:12:10AM +0800, Shaohua Li wrote:
> On Fri, 2010-08-27 at 04:31 +0800, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called when there is no congestion, the caller
> > > > will wait for the full timeout. This can cause unreasonable and
> > > > unnecessary stalls. There are a number of potential modifications that
> > > > could be made to wake sleepers but this patch measures how serious the
> > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > record that the wait was unnecessary.
> > >
> > > I am not convinced that unnecessary is the right word. On a workload
> > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > the VM regressing both in time and in reclaiming the right pages when
> > > simply removing congestion_wait() from the direct reclaim paths (the
> > > one in __alloc_pages_slowpath and the other one in
> > > do_try_to_free_pages).
> > >
> > > So just being stupid and waiting for the timeout in direct reclaim
> > > while kswapd can make progress seemed to do a better job for that
> > > load.
> > >
> > > I can not exactly pinpoint the reason for that behaviour; it would be
> > > nice if somebody had an idea.
> >
> > There is a possibility that the behaviour in that case was due to flusher
> > threads doing the writes rather than direct reclaim queueing pages for IO
> > in an inefficient manner. So the stall is stupid but happens to work out
> > well because flusher threads get the chance to do work.
>
> If this is the case, we already have the queue congested.

Not necessarily. The fact that with the full series we sometimes call
cond_resched(), indicating that there was no congestion when
congestion_wait() was called, proves that. We might have some IO on the
queue but it's not congested.

Also, there is no guarantee that the congested queue is one we care
about. If we are reclaiming main memory and the congested queue is a
USB stick, we do not necessarily need to stall.

> Removing congestion_wait() might cause a regression, but either your
> change or congestion_wait_check() should not have the regression, as we
> do check if the bdi is congested.
>

What congestion_wait_check()? If there is no congestion and no writes,
congestion is the wrong event to sleep on.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Johannes Weiner @ 2010-08-27 8:16 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called when there is no congestion, the caller
> > > will wait for the full timeout. This can cause unreasonable and
> > > unnecessary stalls. There are a number of potential modifications that
> > > could be made to wake sleepers but this patch measures how serious the
> > > problem is. It keeps count of how many congested BDIs there are. If
> > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > record that the wait was unnecessary.
> >
> > I am not convinced that unnecessary is the right word. On a workload
> > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > the VM regressing both in time and in reclaiming the right pages when
> > simply removing congestion_wait() from the direct reclaim paths (the
> > one in __alloc_pages_slowpath and the other one in
> > do_try_to_free_pages).
> >
> > So just being stupid and waiting for the timeout in direct reclaim
> > while kswapd can make progress seemed to do a better job for that
> > load.
> >
> > I can not exactly pinpoint the reason for that behaviour; it would be
> > nice if somebody had an idea.
>
> There is a possibility that the behaviour in that case was due to flusher
> threads doing the writes rather than direct reclaim queueing pages for IO
> in an inefficient manner. So the stall is stupid but happens to work out
> well because flusher threads get the chance to do work.

The workload was accessing a large sparse-file through mmap, so there
wasn't much IO in the first place.

And I experimented on the latest -mmotm where direct reclaim wouldn't
do writeback by itself anymore, but kick the flushers.

> > So personally I think it's a good idea to get an insight on the use of
> > congestion_wait() [patch 1] but I don't agree with changing its
> > behaviour just yet, or judging its usefulness solely on whether it
> > correctly waits for bdi congestion.
>
> Unfortunately, I strongly suspect that some of the desktop stalls seen
> during IO (one of which involved no writes) were due to calling
> congestion_wait and waiting the full timeout where no writes are going
> on.

Oh, I am in full agreement here! Removing those congestion_wait() as
described above showed a reduction in peak latency. The dilemma is
only that it increased the overall walltime of the load.

And the scanning behaviour deteriorated, as in having increased
scanning pressure on other zones than the unpatched kernel did.

So I think very much that we need a fix. congestion_wait() causes
stalls and relying on random sleeps for the current reclaim behaviour
can not be the solution, at all.

I just don't think we can remove it based on the argument that it
doesn't do what it is supposed to do, when it does other things right
that it is not supposed to do ;-)
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Mel Gorman @ 2010-08-27 9:24 UTC
  To: Johannes Weiner
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called when there is no congestion, the caller
> > > > will wait for the full timeout. This can cause unreasonable and
> > > > unnecessary stalls. There are a number of potential modifications that
> > > > could be made to wake sleepers but this patch measures how serious the
> > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > record that the wait was unnecessary.
> > >
> > > I am not convinced that unnecessary is the right word. On a workload
> > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > the VM regressing both in time and in reclaiming the right pages when
> > > simply removing congestion_wait() from the direct reclaim paths (the
> > > one in __alloc_pages_slowpath and the other one in
> > > do_try_to_free_pages).
> > >
> > > So just being stupid and waiting for the timeout in direct reclaim
> > > while kswapd can make progress seemed to do a better job for that
> > > load.
> > >
> > > I can not exactly pinpoint the reason for that behaviour; it would be
> > > nice if somebody had an idea.
> >
> > There is a possibility that the behaviour in that case was due to flusher
> > threads doing the writes rather than direct reclaim queueing pages for IO
> > in an inefficient manner. So the stall is stupid but happens to work out
> > well because flusher threads get the chance to do work.
>
> The workload was accessing a large sparse-file through mmap, so there
> wasn't much IO in the first place.
>

Then waiting on congestion was the totally wrong thing to do. We were
effectively calling sleep(HZ/10) and magically this was helping in some
undefined manner. Do you know *which* caller of congestion_wait() was
the most important to you?

> And I experimented on the latest -mmotm where direct reclaim wouldn't
> do writeback by itself anymore, but kick the flushers.
>

What were the results? I'm preparing a full series incorporating a
number of patches in this area to see how they behave in aggregate.

> > > So personally I think it's a good idea to get an insight on the use of
> > > congestion_wait() [patch 1] but I don't agree with changing its
> > > behaviour just yet, or judging its usefulness solely on whether it
> > > correctly waits for bdi congestion.
> >
> > Unfortunately, I strongly suspect that some of the desktop stalls seen
> > during IO (one of which involved no writes) were due to calling
> > congestion_wait and waiting the full timeout where no writes are going
> > on.
>
> Oh, I am in full agreement here! Removing those congestion_wait() as
> described above showed a reduction in peak latency. The dilemma is
> only that it increased the overall walltime of the load.
>

Do you know why? Leaving in random sleeps hardly seems to be the right
approach.

> And the scanning behaviour deteriorated, as in having increased
> scanning pressure on other zones than the unpatched kernel did.
>

Probably because it was scanning more but not finding what it needed.
There is a condition other than congestion it is having trouble with. In
some respects, I think if we change congestion_wait() as I propose, we
may see a case where CPU usage is higher because it's now encountering
the unspecified reclaim problem we have.

> So I think very much that we need a fix. congestion_wait() causes
> stalls and relying on random sleeps for the current reclaim behaviour
> can not be the solution, at all.
>
> I just don't think we can remove it based on the argument that it
> doesn't do what it is supposed to do, when it does other things right
> that it is not supposed to do ;-)
>

We are not removing it, we are just stopping it going to sleep for
stupid reasons. If we find that wall time is increasing as a result, we
have a path to figuring out what the real underlying problem is instead
of sweeping it under the rug. congestion_wait() is causing other
problems, such as Christian's bug of massive IO regressions, because it
was sleeping when it shouldn't.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  From: Johannes Weiner @ 2010-08-30 13:19 UTC
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
      Wu Fengguang, Jan Kara, linux-kernel

On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called when there is no congestion, the caller
> > > > > will wait for the full timeout. This can cause unreasonable and
> > > > > unnecessary stalls. There are a number of potential modifications that
> > > > > could be made to wake sleepers but this patch measures how serious the
> > > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > > record that the wait was unnecessary.
> > > >
> > > > I am not convinced that unnecessary is the right word. On a workload
> > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > > the VM regressing both in time and in reclaiming the right pages when
> > > > simply removing congestion_wait() from the direct reclaim paths (the
> > > > one in __alloc_pages_slowpath and the other one in
> > > > do_try_to_free_pages).
> > > >
> > > > So just being stupid and waiting for the timeout in direct reclaim
> > > > while kswapd can make progress seemed to do a better job for that
> > > > load.
> > > >
> > > > I can not exactly pinpoint the reason for that behaviour; it would be
> > > > nice if somebody had an idea.
> > >
> > > There is a possibility that the behaviour in that case was due to flusher
> > > threads doing the writes rather than direct reclaim queueing pages for IO
> > > in an inefficient manner. So the stall is stupid but happens to work out
> > > well because flusher threads get the chance to do work.
> >
> > The workload was accessing a large sparse-file through mmap, so there
> > wasn't much IO in the first place.
>
> Then waiting on congestion was the totally wrong thing to do. We were
> effectively calling sleep(HZ/10) and magically this was helping in some
> undefined manner. Do you know *which* caller of congestion_wait() was
> the most important to you?

Removing congestion_wait() in do_try_to_free_pages() definitely
worsens reclaim behaviour for this workload:

1. wallclock time of the testrun increases by 11%

2. the scanners do a worse job and go for the wrong zone:

-pgalloc_dma             79597
-pgalloc_dma32       134465902
+pgalloc_dma            297089
+pgalloc_dma32       134247237

-pgsteal_dma             77501
-pgsteal_dma32       133939446
+pgsteal_dma            294998
+pgsteal_dma32       133722312

-pgscan_kswapd_dma      145897
-pgscan_kswapd_dma32 266141381
+pgscan_kswapd_dma      287981
+pgscan_kswapd_dma32 186647637

-pgscan_direct_dma        9666
-pgscan_direct_dma32   1758655
+pgscan_direct_dma      302495
+pgscan_direct_dma32  80947179

-pageoutrun            1768531
-allocstall                614
+pageoutrun            1927451
+allocstall               8566

I attached the full vmstat contents below. Also the test program,
which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))

> > > > So personally I think it's a good idea to get an insight on the use of
> > > > congestion_wait() [patch 1] but I don't agree with changing its
> > > > behaviour just yet, or judging its usefulness solely on whether it
> > > > correctly waits for bdi congestion.
> > >
> > > Unfortunately, I strongly suspect that some of the desktop stalls seen
> > > during IO (one of which involved no writes) were due to calling
> > > congestion_wait and waiting the full timeout where no writes are going
> > > on.
> >
> > Oh, I am in full agreement here! Removing those congestion_wait() as
> > described above showed a reduction in peak latency. The dilemma is
> > only that it increased the overall walltime of the load.
>
> Do you know why? Leaving in random sleeps hardly seems to be the right
> approach.

I am still trying to find out what's going wrong.

> > And the scanning behaviour deteriorated, as in having increased
> > scanning pressure on other zones than the unpatched kernel did.
>
> Probably because it was scanning more but not finding what it needed.
> There is a condition other than congestion it is having trouble with. In
> some respects, I think if we change congestion_wait() as I propose, we
> may see a case where CPU usage is higher because it's now encountering
> the unspecified reclaim problem we have.

Exactly.

> > So I think very much that we need a fix. congestion_wait() causes
> > stalls and relying on random sleeps for the current reclaim behaviour
> > can not be the solution, at all.
> >
> > I just don't think we can remove it based on the argument that it
> > doesn't do what it is supposed to do, when it does other things right
> > that it is not supposed to do ;-)
>
> We are not removing it, we are just stopping it going to sleep for
> stupid reasons. If we find that wall time is increasing as a result, we
> have a path to figuring out what the real underlying problem is instead
> of sweeping it under the rug.

Well, for that testcase it is in effect the same as a removal, as
there's never congestion.

But again: I agree with your changes per se, I just don't think they
should get merged as long as they knowingly catalyze a problem that has
yet to be identified.
[-- Attachment #2: mapped-file-stream.c --]

/*
 * mapped-file-stream: each child maps a large sparse temporary file
 * and faults in one byte of every page.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>

static int start_process(unsigned long nr_bytes)
{
	char filename[] = "/tmp/clog-XXXXXX";
	unsigned long i;
	char *map;
	int fd;

	/* Create an unlinked, sparse temporary file of the requested size */
	fd = mkstemp(filename);
	unlink(filename);
	if (fd == -1) {
		perror("mkstemp()");
		return -1;
	}
	if (ftruncate(fd, nr_bytes)) {
		perror("ftruncate()");
		return -1;
	}
	map = mmap(NULL, nr_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap()");
		return -1;
	}
	/* Defeat readahead so every access is a fresh page fault */
	if (madvise(map, nr_bytes, MADV_RANDOM)) {
		perror("madvise()");
		return -1;
	}
	/* Stop until the parent has forked all children, then touch pages */
	kill(getpid(), SIGSTOP);
	for (i = 0; i < nr_bytes; i += 4096)
		((volatile char *)map)[i];
	close(fd);
	return 0;
}

static int do_test(unsigned long nr_procs, unsigned long nr_bytes)
{
	pid_t procs[nr_procs];
	unsigned long i;
	int dummy;

	/* Fork each child and wait until it has stopped itself */
	for (i = 0; i < nr_procs; i++) {
		switch ((procs[i] = fork())) {
		case -1:
			kill(0, SIGKILL);
			perror("fork()");
			return -1;
		case 0:
			return start_process(nr_bytes);
		default:
			waitpid(procs[i], &dummy, WUNTRACED);
			break;
		}
	}
	/* Start all the stopped children at once */
	kill(0, SIGCONT);
	for (i = 0; i < nr_procs; i++)
		waitpid(procs[i], &dummy, 0);
	return 0;
}

static int xstrtoul(const char *str, unsigned long *valuep)
{
	unsigned long value;
	char *endp;

	value = strtoul(str, &endp, 0);
	if (*endp || (value == ULONG_MAX && errno == ERANGE))
		return -1;
	*valuep = value;
	return 0;
}

int main(int ac, char **av)
{
	unsigned long nr_procs, nr_bytes;

	if (ac != 3)
		goto usage;
	if (xstrtoul(av[1], &nr_procs))
		goto usage;
	if (xstrtoul(av[2], &nr_bytes))
		goto usage;
	setbuf(stdout, NULL);
	setbuf(stderr, NULL);
	return !!do_test(nr_procs, nr_bytes);
usage:
	fprintf(stderr, "usage: %s nr_procs nr_bytes\n", av[0]);
	return 1;
}

[-- Attachment #3: vmstat.a.2 --]

[-- Attachment #4: vmstat.b.2 --]
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  2010-08-30 13:19 ` Johannes Weiner
@ 2010-08-31 15:02   ` Mel Gorman
  2010-09-02 15:49     ` Johannes Weiner
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2010-08-31 15:02 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > > > If congestion_wait() is called when there is no congestion, the caller
> > > > > > will wait for the full timeout. This can cause unreasonable and
> > > > > > unnecessary stalls. There are a number of potential modifications that
> > > > > > could be made to wake sleepers but this patch measures how serious the
> > > > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > > > record that the wait was unnecessary.
> > > > >
> > > > > I am not convinced that unnecessary is the right word. On a workload
> > > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > > > the VM regressing both in time and in reclaiming the right pages when
> > > > > simply removing congestion_wait() from the direct reclaim paths (the
> > > > > one in __alloc_pages_slowpath and the other one in
> > > > > do_try_to_free_pages).
> > > > >
> > > > > So just being stupid and waiting for the timeout in direct reclaim
> > > > > while kswapd can make progress seemed to do a better job for that
> > > > > load.
> > > > >
> > > > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > > > nice if somebody had an idea.
> > > > >
> > > >
> > > > There is a possibility that the behaviour in that case was due to flusher
> > > > threads doing the writes rather than direct reclaim queueing pages for IO
> > > > in an inefficient manner. So the stall is stupid but happens to work out
> > > > well because flusher threads get the chance to do work.
> > >
> > > The workload was accessing a large sparse-file through mmap, so there
> > > wasn't much IO in the first place.
> > >
> >
> > Then waiting on congestion was the totally wrong thing to do. We were
> > effectively calling sleep(HZ/10) and magically this was helping in some
> > undefined manner. Do you know *which* caller of congestion_wait() was
> > the most important to you?
>
> Removing congestion_wait() in do_try_to_free_pages() definitely
> worsens reclaim behaviour for this workload:
>
> 1. wallclock time of the testrun increases by 11%
>
> 2. the scanners do a worse job and go for the wrong zone:
>
> -pgalloc_dma 79597
> -pgalloc_dma32 134465902
> +pgalloc_dma 297089
> +pgalloc_dma32 134247237
>
> -pgsteal_dma 77501
> -pgsteal_dma32 133939446
> +pgsteal_dma 294998
> +pgsteal_dma32 133722312
>
> -pgscan_kswapd_dma 145897
> -pgscan_kswapd_dma32 266141381
> +pgscan_kswapd_dma 287981
> +pgscan_kswapd_dma32 186647637
>
> -pgscan_direct_dma 9666
> -pgscan_direct_dma32 1758655
> +pgscan_direct_dma 302495
> +pgscan_direct_dma32 80947179
>
> -pageoutrun 1768531
> -allocstall 614
> +pageoutrun 1927451
> +allocstall 8566
>
> I attached the full vmstat contents below. Also the test program,
> which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
>

Excellent stuff. I didn't look at your vmstat output because it was for
an old patch and you have already highlighted the problems related to
the workload. Chances are, I'd just reach the same conclusions. What is
interesting is your workload.

> > > > > So personally I think it's a good idea to get an insight on the use of
> > > > > congestion_wait() [patch 1] but I don't agree with changing its
> > > > > behaviour just yet, or judging its usefulness solely on whether it
> > > > > correctly waits for bdi congestion.
> > > > >
> > > >
> > > > Unfortunately, I strongly suspect that some of the desktop stalls seen during
> > > > IO (one of which involved no writes) were due to calling congestion_wait
> > > > and waiting the full timeout where no writes are going on.
> > >
> > > Oh, I am in full agreement here! Removing those congestion_wait() as
> > > described above showed a reduction in peak latency. The dilemma is
> > > only that it increased the overall walltime of the load.
> > >
> >
> > Do you know why? Because leaving in random sleeps() hardly seems to be
> > the right approach.
>
> I am still trying to find out what's going wrong.
>
> > > And the scanning behaviour deteriorated, as in having increased
> > > scanning pressure on other zones than the unpatched kernel did.
> > >
> >
> > Probably because it was scanning more but not finding what it needed.
> > There is a condition other than congestion it is having trouble with. In
> > some respects, I think if we change congestion_wait() as I propose,
> > we may see a case where CPU usage is higher because it's now
> > encountering the unspecified reclaim problem we have.
>
> Exactly.
>
> > > So I think very much that we need a fix. congestion_wait() causes
> > > stalls and relying on random sleeps for the current reclaim behaviour
> > > can not be the solution, at all.
> > >
> > > I just don't think we can remove it based on the argument that it
> > > doesn't do what it is supposed to do, when it does other things right
> > > that it is not supposed to do ;-)
> > >
> >
> > We are not removing it, we are just stopping it going to sleep for
> > stupid reasons. If we find that wall time is increasing as a result, we
> > have a path to figuring out what the real underlying problem is instead
> > of sweeping it under the rug.
>
> Well, for that testcase it is in effect the same as a removal as
> there's never congestion.
>
> But again: I agree with your changes per se, I just don't think they
> should get merged as long as they knowingly catalyze a problem that
> has yet to be identified.

Ok, well there was some significant feedback on why wholesale changing of
congestion_wait() reached too far and I've incorporated that feedback.
I also integrated your workload into my testsuite (btw, because there is no
license the script has to download it from a google archive. I might get
back to you on licensing this so it can be made a permanent part of the suite).

These are the results just for your workload on the only machine I had
available with a lot of disk. There are a bunch of kernels because I'm testing
a superset of different series posted recently. The nocongest column is an
unreleased patch that has congestion_wait() and wait_iff_congested() that
only goes to sleep if there is real congestion or a lot of writeback going
on. Rather than worrying about the patch contents for now, let's consider
the results for just your workload.

The report is in 4 parts. The first is the vmstat counter differences as
a result of running your test. What counts as good and bad here is open
to interpretation. The second part is based on the vmscan tracepoints.
The third part is based on the congestion tracepoints and the final part
reports CPU usage and elapsed time.

MICRO
                       traceonly-v1r4          nocongest-v1r4          lowlumpy-v1r4           nodirect-v1r4
pgalloc_dma              89409.00 (  0.00%)     47750.00 ( -87.24%)     47430.00 ( -88.51%)     47246.00 ( -89.24%)
pgalloc_dma32        101407571.00 (  0.00%) 101518722.00 (   0.11%) 101502059.00 (   0.09%) 101511868.00 (   0.10%)
pgalloc_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma              74529.00 (  0.00%)     43386.00 ( -71.78%)     43213.00 ( -72.47%)     42691.00 ( -74.58%)
pgsteal_dma32        100666955.00 (  0.00%) 100712596.00 (   0.05%) 100712537.00 (   0.05%) 100713305.00 (   0.05%)
pgsteal_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma       118198.00 (  0.00%)     47370.00 (-149.52%)     49515.00 (-138.71%)     46134.00 (-156.21%)
pgscan_kswapd_dma32  177619794.00 (  0.00%) 161549938.00 (  -9.95%) 161679701.00 (  -9.86%) 156657926.00 ( -13.38%)
pgscan_kswapd_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma        27128.00 (  0.00%)     39215.00 (  30.82%)     36561.00 (  25.80%)     38806.00 (  30.09%)
pgscan_direct_dma32   23927492.00 (  0.00%)  40122173.00 (  40.36%)  39997463.00 (  40.18%)  45041626.00 (  46.88%)
pgscan_direct_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun              756020.00 (  0.00%)    903192.00 (  16.29%)    899965.00 (  15.99%)    868055.00 (  12.91%)
allocstall                2722.00 (  0.00%)     70156.00 (  96.12%)     67554.00 (  95.97%)     87691.00 (  96.90%)

So, the allocstall counts go up of course because it is incremented
every time direct reclaim is entered and nocongest is only going to
sleep when there is congestion or significant writeback. I don't see
this as being necessarily bad.

Direct scanning rates go up a bit as you'd expect - again because we
are sleeping less. It's interesting that the number of pages reclaimed is
reduced, implying that despite higher scanning rates, there is less
reclaim activity.

It's debatable if this is good or not because higher scanning rates in
themselves are not bad but fewer pages reclaimed seems positive so let's
see what the rest of the reports look like.
FTrace Reclaim Statistics: vmscan
                                        traceonly-v1r4  nocongest-v1r4  lowlumpy-v1r4  nodirect-v1r4
Direct reclaims                                   2722           70156          67554          87691
Direct reclaim pages scanned                  23955333        40161426       40034132       45080524
Direct reclaim write file async I/O                  0               0              0              0
Direct reclaim write anon async I/O                  0               0              0              0
Direct reclaim write file sync I/O                   0               0              0              0
Direct reclaim write anon sync I/O                   0               0              0              0
Wake kswapd requests                           2718040        17801688       17622777       18379572
Kswapd wakeups                                      24               1              1              1
Kswapd pages scanned                         177738381       161597313      161729224      156704078
Kswapd reclaim write file async I/O                  0               0              0              0
Kswapd reclaim write anon async I/O                  0               0              0              0
Kswapd reclaim write file sync I/O                   0               0              0              0
Kswapd reclaim write anon sync I/O                   0               0              0              0
Time stalled direct reclaim (seconds)           247.97           76.97          77.15          76.63
Time kswapd awake (seconds)                     489.17          400.20         403.19         390.08

Total pages scanned                          201693714       201758739      201763356      201784602
%age total pages scanned/written                 0.00%           0.00%          0.00%          0.00%
%age file pages scanned/written                  0.00%           0.00%          0.00%          0.00%
Percentage Time Spent Direct Reclaim            41.41%          16.03%         15.96%         16.32%
Percentage Time kswapd Awake                    98.76%          98.94%         98.96%         98.87%

Interesting: kswapd is now staying awake (it woke up only once) even though
the total time awake was reduced, and it looks like the far greater number
of wakeup requests was what kept it awake. Despite the higher scan
rates from direct reclaim, the time actually spent direct reclaiming is
significantly reduced.

Scanning rates and the number of times we direct reclaim go up but, as we
finish work a lot faster, it would seem that we are doing less work overall.

FTrace Reclaim Statistics: congestion_wait
Direct number congest waited                      3664               0              0              0
Direct time congest waited                    247636ms             0ms            0ms            0ms
Direct full congest waited                        3081               0              0              0
Direct number conditional waited                     0           47587          45659          58779
Direct time conditional waited                     0ms             0ms            0ms            0ms
Direct full conditional waited                    3081               0              0              0
KSwapd number congest waited                      1448             949            909            981
KSwapd time congest waited                    118552ms         31652ms        32780ms        38732ms
KSwapd full congest waited                        1056              90            115            147
KSwapd number conditional waited                     0               0              0              0
KSwapd time conditional waited                     0ms             0ms            0ms            0ms
KSwapd full conditional waited                    1056              90            115            147

congest waited is congestion_wait() and conditional waited is
wait_iff_congested(). Look at what happens to the congest waited times
for direct reclaim - they disappeared and, despite the number of times
wait_iff_congested() was called, it never actually decided it needed to
sleep. kswapd is still congestion waiting but the time it spent is
reduced.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)             350.9          403.27         406.12         393.02
Total Elapsed Time (seconds)                    495.29          404.47         407.44         394.53

This is plain old time. The same test completes 91 seconds faster.

Ordinarily at this point I would be preparing to do a full series report
including the other benchmarks but I'm interested in seeing whether there is
a significantly different reading of the above report as to whether it
is a "good" or "bad" result.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
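A note on reading the MICRO table above: the percentages appear to be computed
relative to the later kernel's value rather than the baseline, i.e.
(new - old) / new * 100. For example, pgalloc_dma gives
(47750 - 89409) / 47750 = -87.24% and allocstall gives
(70156 - 2722) / 70156 = 96.12%, both matching the columns. This is inferred
from the numbers themselves; the testsuite's report scripts are not shown in
the thread.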
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  2010-08-31 15:02 ` Mel Gorman
@ 2010-09-02 15:49   ` Johannes Weiner
  2010-09-02 18:28     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Johannes Weiner @ 2010-09-02 15:49 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Tue, Aug 31, 2010 at 04:02:07PM +0100, Mel Gorman wrote:
> On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> > Removing congestion_wait() in do_try_to_free_pages() definitely
> > worsens reclaim behaviour for this workload:
> >
> > 1. wallclock time of the testrun increases by 11%
> >
> > 2. the scanners do a worse job and go for the wrong zone:
> >
> > -pgalloc_dma 79597
> > -pgalloc_dma32 134465902
> > +pgalloc_dma 297089
> > +pgalloc_dma32 134247237
> >
> > -pgsteal_dma 77501
> > -pgsteal_dma32 133939446
> > +pgsteal_dma 294998
> > +pgsteal_dma32 133722312
> >
> > -pgscan_kswapd_dma 145897
> > -pgscan_kswapd_dma32 266141381
> > +pgscan_kswapd_dma 287981
> > +pgscan_kswapd_dma32 186647637
> >
> > -pgscan_direct_dma 9666
> > -pgscan_direct_dma32 1758655
> > +pgscan_direct_dma 302495
> > +pgscan_direct_dma32 80947179
> >
> > -pageoutrun 1768531
> > -allocstall 614
> > +pageoutrun 1927451
> > +allocstall 8566
> >
> > I attached the full vmstat contents below. Also the test program,
> > which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
>
> Excellent stuff. I didn't look at your vmstat output because it was for
> an old patch and you have already highlighted the problems related to
> the workload. Chances are, I'd just reach the same conclusions. What is
> interesting is your workload.

[...]

> Ok, well there was some significant feedback on why wholesale changing of
> congestion_wait() reached too far and I've incorporated that feedback. I
> also integrated your workload into my testsuite (btw, because there is no
> license the script has to download it from a google archive. I might get
> back to you on licensing this so it can be made a permanent part of the suite).

Oh, certainly, feel free to add the following file header:

/*
 * Copyright (c) 2010 Johannes Weiner
 * Code released under the GNU GPLv2.
 */

> These are the results just for your workload on the only machine I had
> available with a lot of disk. There are a bunch of kernels because I'm testing
> a superset of different series posted recently. The nocongest column is an
> unreleased patch that has congestion_wait() and wait_iff_congested() that
> only goes to sleep if there is real congestion or a lot of writeback going
> on. Rather than worrying about the patch contents for now, let's consider
> the results for just your workload.
>
> The report is in 4 parts. The first is the vmstat counter differences as
> a result of running your test. What counts as good and bad here is open
> to interpretation. The second part is based on the vmscan tracepoints.
> The third part is based on the congestion tracepoints and the final part
> reports CPU usage and elapsed time.
>
> MICRO
>                        traceonly-v1r4          nocongest-v1r4          lowlumpy-v1r4           nodirect-v1r4
> pgalloc_dma              89409.00 (  0.00%)     47750.00 ( -87.24%)     47430.00 ( -88.51%)     47246.00 ( -89.24%)
> pgalloc_dma32        101407571.00 (  0.00%) 101518722.00 (   0.11%) 101502059.00 (   0.09%) 101511868.00 (   0.10%)
> pgalloc_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgsteal_dma              74529.00 (  0.00%)     43386.00 ( -71.78%)     43213.00 ( -72.47%)     42691.00 ( -74.58%)
> pgsteal_dma32        100666955.00 (  0.00%) 100712596.00 (   0.05%) 100712537.00 (   0.05%) 100713305.00 (   0.05%)
> pgsteal_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_kswapd_dma       118198.00 (  0.00%)     47370.00 (-149.52%)     49515.00 (-138.71%)     46134.00 (-156.21%)
> pgscan_kswapd_dma32  177619794.00 (  0.00%) 161549938.00 (  -9.95%) 161679701.00 (  -9.86%) 156657926.00 ( -13.38%)
> pgscan_kswapd_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_direct_dma        27128.00 (  0.00%)     39215.00 (  30.82%)     36561.00 (  25.80%)     38806.00 (  30.09%)
> pgscan_direct_dma32   23927492.00 (  0.00%)  40122173.00 (  40.36%)  39997463.00 (  40.18%)  45041626.00 (  46.88%)
> pgscan_direct_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pageoutrun              756020.00 (  0.00%)    903192.00 (  16.29%)    899965.00 (  15.99%)    868055.00 (  12.91%)
> allocstall                2722.00 (  0.00%)     70156.00 (  96.12%)     67554.00 (  95.97%)     87691.00 (  96.90%)
>
> So, the allocstall counts go up of course because it is incremented
> every time direct reclaim is entered and nocongest is only going to
> sleep when there is congestion or significant writeback. I don't see
> this as being necessarily bad.

Agreed. Also the dma zone is less allocated from, which I suppose is
only the second zone in the zonelist, after dma32. So allocations
succeed more often from the first-choice zone with your patches.

> Direct scanning rates go up a bit as you'd expect - again because we
> are sleeping less. It's interesting that the number of pages reclaimed is
> reduced, implying that despite higher scanning rates, there is less
> reclaim activity.
>
> It's debatable if this is good or not because higher scanning rates in
> themselves are not bad but fewer pages reclaimed seems positive so let's
> see what the rest of the reports look like.
>
> FTrace Reclaim Statistics: vmscan
>                                         traceonly-v1r4  nocongest-v1r4  lowlumpy-v1r4  nodirect-v1r4
> Direct reclaims                                   2722           70156          67554          87691
> Direct reclaim pages scanned                  23955333        40161426       40034132       45080524
> Direct reclaim write file async I/O                  0               0              0              0
> Direct reclaim write anon async I/O                  0               0              0              0
> Direct reclaim write file sync I/O                   0               0              0              0
> Direct reclaim write anon sync I/O                   0               0              0              0
> Wake kswapd requests                           2718040        17801688       17622777       18379572
> Kswapd wakeups                                      24               1              1              1
> Kswapd pages scanned                         177738381       161597313      161729224      156704078
> Kswapd reclaim write file async I/O                  0               0              0              0
> Kswapd reclaim write anon async I/O                  0               0              0              0
> Kswapd reclaim write file sync I/O                   0               0              0              0
> Kswapd reclaim write anon sync I/O                   0               0              0              0
> Time stalled direct reclaim (seconds)           247.97           76.97          77.15          76.63
> Time kswapd awake (seconds)                     489.17          400.20         403.19         390.08
>
> Total pages scanned                          201693714       201758739      201763356      201784602
> %age total pages scanned/written                 0.00%           0.00%          0.00%          0.00%
> %age file pages scanned/written                  0.00%           0.00%          0.00%          0.00%
> Percentage Time Spent Direct Reclaim            41.41%          16.03%         15.96%         16.32%
> Percentage Time kswapd Awake                    98.76%          98.94%         98.96%         98.87%
>
> Interesting: kswapd is now staying awake (it woke up only once) even though
> the total time awake was reduced, and it looks like the far greater number
> of wakeup requests was what kept it awake. Despite the higher scan
> rates from direct reclaim, the time actually spent direct reclaiming is
> significantly reduced.
>
> Scanning rates and the number of times we direct reclaim go up but, as we
> finish work a lot faster, it would seem that we are doing less work overall.

I do not reach the same conclusion here. More pages are scanned
overall on the same workload, so we _are_ doing more work.

The result for this single-threaded workload improves because CPU-time
is not the issue when the only runnable process needs memory.

But we are in fact becoming less efficient at reclaim, so it would
make sense to also test how this interacts with other processes that
do need the CPU concurrently.

> FTrace Reclaim Statistics: congestion_wait
> Direct number congest waited                      3664               0              0              0
> Direct time congest waited                    247636ms             0ms            0ms            0ms
> Direct full congest waited                        3081               0              0              0
> Direct number conditional waited                     0           47587          45659          58779
> Direct time conditional waited                     0ms             0ms            0ms            0ms
> Direct full conditional waited                    3081               0              0              0
> KSwapd number congest waited                      1448             949            909            981
> KSwapd time congest waited                    118552ms         31652ms        32780ms        38732ms
> KSwapd full congest waited                        1056              90            115            147
> KSwapd number conditional waited                     0               0              0              0
> KSwapd time conditional waited                     0ms             0ms            0ms            0ms
> KSwapd full conditional waited                    1056              90            115            147
>
> congest waited is congestion_wait() and conditional waited is
> wait_iff_congested(). Look at what happens to the congest waited times
> for direct reclaim - they disappeared and, despite the number of times
> wait_iff_congested() was called, it never actually decided it needed to
> sleep. kswapd is still congestion waiting but the time it spent is
> reduced.
>
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)             350.9          403.27         406.12         393.02
> Total Elapsed Time (seconds)                    495.29          404.47         407.44         394.53
>
> This is plain old time. The same test completes 91 seconds faster.
> Ordinarily at this point I would be preparing to do a full series report
> including the other benchmarks but I'm interested in seeing whether there is
> a significantly different reading of the above report as to whether it
> is a "good" or "bad" result.

I think one interesting piece that is missing is whether the
scanned/reclaimed ratio went up. Do you have the kswapd_steal counter
value still available to calculate that ratio?

A "good" result would be, IMO, if that ratio did not get worse, while
at the same time having reclaim perform better due to reduced sleeps.

Another aspect to look out for is increased overreclaim: the total
number of allocations went up (I suppose the sum of reclaimed pages as
well), which means reclaim became more eager and created more
throwout-refault churn. Those refaults were from a sparse file here, but
a slow backing dev would have more impact on wall clock time.
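The scanned/reclaimed ratio Johannes asks for can be derived from /proc/vmstat
captures without extra tracepoints. A rough sketch of the arithmetic -- my own,
not the testsuite's script; deltas between before/after captures are what you
would really want for a single run:

  # reclaim efficiency = pages stolen / pages scanned, from one capture
  awk '/^pgscan_/  { scan  += $2 }
       /^pgsteal_/ { steal += $2 }
       END { printf "scan %d steal %d ratio %.4f\n", scan, steal, steal/scan }' /proc/vmstat

The kswapd_steal counter Johannes mentions would similarly let the kswapd and
direct shares of the steals be separated out. Mel does effectively this
calculation in his reply below.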
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  2010-09-02 15:49 ` Johannes Weiner
@ 2010-09-02 18:28   ` Mel Gorman
  0 siblings, 0 replies; 38+ messages in thread
From: Mel Gorman @ 2010-09-02 18:28 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Sep 02, 2010 at 05:49:54PM +0200, Johannes Weiner wrote:
> On Tue, Aug 31, 2010 at 04:02:07PM +0100, Mel Gorman wrote:
> > On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> > > Removing congestion_wait() in do_try_to_free_pages() definitely
> > > worsens reclaim behaviour for this workload:
> > >
> > > 1. wallclock time of the testrun increases by 11%
> > >
> > > 2. the scanners do a worse job and go for the wrong zone:
> > >
> > > -pgalloc_dma 79597
> > > -pgalloc_dma32 134465902
> > > +pgalloc_dma 297089
> > > +pgalloc_dma32 134247237
> > >
> > > -pgsteal_dma 77501
> > > -pgsteal_dma32 133939446
> > > +pgsteal_dma 294998
> > > +pgsteal_dma32 133722312
> > >
> > > -pgscan_kswapd_dma 145897
> > > -pgscan_kswapd_dma32 266141381
> > > +pgscan_kswapd_dma 287981
> > > +pgscan_kswapd_dma32 186647637
> > >
> > > -pgscan_direct_dma 9666
> > > -pgscan_direct_dma32 1758655
> > > +pgscan_direct_dma 302495
> > > +pgscan_direct_dma32 80947179
> > >
> > > -pageoutrun 1768531
> > > -allocstall 614
> > > +pageoutrun 1927451
> > > +allocstall 8566
> > >
> > > I attached the full vmstat contents below. Also the test program,
> > > which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
> >
> > Excellent stuff. I didn't look at your vmstat output because it was for
> > an old patch and you have already highlighted the problems related to
> > the workload. Chances are, I'd just reach the same conclusions. What is
> > interesting is your workload.
>
> [...]
>
> > Ok, well there was some significant feedback on why wholesale changing of
> > congestion_wait() reached too far and I've incorporated that feedback. I
> > also integrated your workload into my testsuite (btw, because there is no
> > license the script has to download it from a google archive. I might get
> > back to you on licensing this so it can be made a permanent part of the suite).
>
> Oh, certainly, feel free to add the following file header:
>
> /*
>  * Copyright (c) 2010 Johannes Weiner
>  * Code released under the GNU GPLv2.
>  */
>

Super, thanks. Downloading off a mail archive is a bit contorted :)

> > These are the results just for your workload on the only machine I had
> > available with a lot of disk. There are a bunch of kernels because I'm testing
> > a superset of different series posted recently. The nocongest column is an
> > unreleased patch that has congestion_wait() and wait_iff_congested() that
> > only goes to sleep if there is real congestion or a lot of writeback going
> > on. Rather than worrying about the patch contents for now, let's consider
> > the results for just your workload.
> >
> > The report is in 4 parts. The first is the vmstat counter differences as
> > a result of running your test. What counts as good and bad here is open
> > to interpretation. The second part is based on the vmscan tracepoints.
> > The third part is based on the congestion tracepoints and the final part
> > reports CPU usage and elapsed time.
> >
> > MICRO
> >                        traceonly-v1r4          nocongest-v1r4          lowlumpy-v1r4           nodirect-v1r4
> > pgalloc_dma              89409.00 (  0.00%)     47750.00 ( -87.24%)     47430.00 ( -88.51%)     47246.00 ( -89.24%)
> > pgalloc_dma32        101407571.00 (  0.00%) 101518722.00 (   0.11%) 101502059.00 (   0.09%) 101511868.00 (   0.10%)
> > pgalloc_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgsteal_dma              74529.00 (  0.00%)     43386.00 ( -71.78%)     43213.00 ( -72.47%)     42691.00 ( -74.58%)
> > pgsteal_dma32        100666955.00 (  0.00%) 100712596.00 (   0.05%) 100712537.00 (   0.05%) 100713305.00 (   0.05%)
> > pgsteal_normal               0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_kswapd_dma       118198.00 (  0.00%)     47370.00 (-149.52%)     49515.00 (-138.71%)     46134.00 (-156.21%)
> > pgscan_kswapd_dma32  177619794.00 (  0.00%) 161549938.00 (  -9.95%) 161679701.00 (  -9.86%) 156657926.00 ( -13.38%)
> > pgscan_kswapd_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_direct_dma        27128.00 (  0.00%)     39215.00 (  30.82%)     36561.00 (  25.80%)     38806.00 (  30.09%)
> > pgscan_direct_dma32   23927492.00 (  0.00%)  40122173.00 (  40.36%)  39997463.00 (  40.18%)  45041626.00 (  46.88%)
> > pgscan_direct_normal         0.00 (  0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pageoutrun              756020.00 (  0.00%)    903192.00 (  16.29%)    899965.00 (  15.99%)    868055.00 (  12.91%)
> > allocstall                2722.00 (  0.00%)     70156.00 (  96.12%)     67554.00 (  95.97%)     87691.00 (  96.90%)
> >
> > So, the allocstall counts go up of course because it is incremented
> > every time direct reclaim is entered and nocongest is only going to
> > sleep when there is congestion or significant writeback. I don't see
> > this as being necessarily bad.
>
> Agreed. Also the dma zone is less allocated from, which I suppose is
> only the second zone in the zonelist, after dma32. So allocations
> succeed more often from the first-choice zone with your patches.
>

Which is good, sort of.

> > Direct scanning rates go up a bit as you'd expect - again because we
> > are sleeping less. It's interesting that the number of pages reclaimed is
> > reduced, implying that despite higher scanning rates, there is less
> > reclaim activity.
> >
> > It's debatable if this is good or not because higher scanning rates in
> > themselves are not bad but fewer pages reclaimed seems positive so let's
> > see what the rest of the reports look like.
> >
> > FTrace Reclaim Statistics: vmscan
> >                                         traceonly-v1r4  nocongest-v1r4  lowlumpy-v1r4  nodirect-v1r4
> > Direct reclaims                                   2722           70156          67554          87691
> > Direct reclaim pages scanned                  23955333        40161426       40034132       45080524
> > Direct reclaim write file async I/O                  0               0              0              0
> > Direct reclaim write anon async I/O                  0               0              0              0
> > Direct reclaim write file sync I/O                   0               0              0              0
> > Direct reclaim write anon sync I/O                   0               0              0              0
> > Wake kswapd requests                           2718040        17801688       17622777       18379572
> > Kswapd wakeups                                      24               1              1              1
> > Kswapd pages scanned                         177738381       161597313      161729224      156704078
> > Kswapd reclaim write file async I/O                  0               0              0              0
> > Kswapd reclaim write anon async I/O                  0               0              0              0
> > Kswapd reclaim write file sync I/O                   0               0              0              0
> > Kswapd reclaim write anon sync I/O                   0               0              0              0
> > Time stalled direct reclaim (seconds)           247.97           76.97          77.15          76.63
> > Time kswapd awake (seconds)                     489.17          400.20         403.19         390.08
> >
> > Total pages scanned                          201693714       201758739      201763356      201784602
> > %age total pages scanned/written                 0.00%           0.00%          0.00%          0.00%
> > %age file pages scanned/written                  0.00%           0.00%          0.00%          0.00%
> > Percentage Time Spent Direct Reclaim            41.41%          16.03%         15.96%         16.32%
> > Percentage Time kswapd Awake                    98.76%          98.94%         98.96%         98.87%
> >
> > Interesting: kswapd is now staying awake (it woke up only once) even though
> > the total time awake was reduced, and it looks like the far greater number
> > of wakeup requests was what kept it awake. Despite the higher scan
> > rates from direct reclaim, the time actually spent direct reclaiming is
> > significantly reduced.
> >
> > Scanning rates and the number of times we direct reclaim go up but, as we
> > finish work a lot faster, it would seem that we are doing less work overall.
>
> I do not reach the same conclusion here. More pages are scanned
> overall on the same workload, so we _are_ doing more work.
>

We did more scanning but we finished sooner so the processor would sleep
sooner, consume less power etc. How good or bad this is depends on what
we mean by work but I am of the opinion that finishing sooner is better
overall. That said...

> The result for this single-threaded workload improves because CPU-time
> is not the issue when the only runnable process needs memory.
>
> But we are in fact becoming less efficient at reclaim, so it would
> make sense to also test how this interacts with other processes that
> do need the CPU concurrently.
>

This is indeed a problem. Measuring multiple workloads and drawing
sensible conclusions is going to be a real pain. I'd at least try
running multiple instances of your workload first and seeing how it
interacts.

> > FTrace Reclaim Statistics: congestion_wait
> > Direct number congest waited                      3664               0              0              0
> > Direct time congest waited                    247636ms             0ms            0ms            0ms
> > Direct full congest waited                        3081               0              0              0
> > Direct number conditional waited                     0           47587          45659          58779
> > Direct time conditional waited                     0ms             0ms            0ms            0ms
> > Direct full conditional waited                    3081               0              0              0
> > KSwapd number congest waited                      1448             949            909            981
> > KSwapd time congest waited                    118552ms         31652ms        32780ms        38732ms
> > KSwapd full congest waited                        1056              90            115            147
> > KSwapd number conditional waited                     0               0              0              0
> > KSwapd time conditional waited                     0ms             0ms            0ms            0ms
> > KSwapd full conditional waited                    1056              90            115            147
> >
> > congest waited is congestion_wait() and conditional waited is
> > wait_iff_congested().
> > Look at what happens to the congest waited times
> > for direct reclaim - they disappeared and, despite the number of times
> > wait_iff_congested() was called, it never actually decided it needed to
> > sleep. kswapd is still congestion waiting but the time it spent is
> > reduced.
> >
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)             350.9          403.27         406.12         393.02
> > Total Elapsed Time (seconds)                    495.29          404.47         407.44         394.53
> >
> > This is plain old time. The same test completes 91 seconds faster.
> >
> > Ordinarily at this point I would be preparing to do a full series report
> > including the other benchmarks but I'm interested in seeing whether there is
> > a significantly different reading of the above report as to whether it
> > is a "good" or "bad" result.
>
> I think one interesting piece that is missing is whether the
> scanned/reclaimed ratio went up. Do you have the kswapd_steal counter
> value still available to calculate that ratio?
>

Knowing the scanning/reclaimed ratio would be nice. The tracepoints as
they stand are oriented around the scanning/write ratio and assume that
discarding clean pages is not that big a deal. I did capture vmstat
before and after the test, which is not as fine-grained as the
tracepoints, but it'll do:

Vanilla scan:      203050870
Vanilla steal:     102099263
Nocongest scan:    203117496
Nocongest steal:   102114301

Vanilla ratio:   0.5028
Nocongest ratio: 0.5027

So here is an interesting point. While direct scanning rates went up,
overall scanning rates were approximately the same. All that changed
really was who was doing the scanning, which is a timing issue. The
reclaim-to-scanning ratios were *very* close together for the vanilla
and nocongest kernels but nocongest completed far faster.

> A "good" result would be, IMO, if that ratio did not get worse, while
> at the same time having reclaim perform better due to reduced sleeps.
>

So, the ratio did not get worse. The scanning rates were the same
although who was scanning changed and I'm undecided as to whether that
is significant or not. Overall, the test completed faster which is good.

> Another aspect to look out for is increased overreclaim: the total
> number of allocations went up (I suppose the sum of reclaimed pages as
> well), which means reclaim became more eager and created more
> throwout-refault churn. Those refaults were from a sparse file here, but
> a slow backing dev would have more impact on wall clock time.
>

I can capture the reclaim rates handily enough - it just needs another
tracepoint to keep everything in step. Refaults would be trickier, so I
would tend to rely on wall time to tell me if the wrong reclaim
decisions were being made.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
  2010-08-26 18:29 ` Johannes Weiner
  2010-08-26 20:31   ` Mel Gorman
@ 2010-08-29 16:03   ` Minchan Kim
  1 sibling, 0 replies; 38+ messages in thread
From: Minchan Kim @ 2010-08-29 16:03 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

Hi, Hannes.

On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
>
> I am not convinced that unnecessary is the right word. On a workload
> without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> the VM regressing both in time and in reclaiming the right pages when
> simply removing congestion_wait() from the direct reclaim paths (the
> one in __alloc_pages_slowpath and the other one in
> do_try_to_free_pages).

Not exactly the same as your experiment, but I had a similar experience.
I did an experiment on swapout. The system had lots of anon pages but
almost no file pages and had already started to swap out, meaning the
system had no free memory. In this case, I forked a new process which
mmaps some MB of pages and touches them, so the VM should swap out some
MB of pages for the process. I measured the time until touching the
pages completed.

Sometimes it's fast, sometimes it's slow; the time gap is almost a
factor of two. The interesting thing is that when it is fast, many of
the pages are reclaimed by kswapd.

Ah, I used swap on a ramdisk and reserved the swap pages by touching
them before starting the experiment, so I would say it's not a _flushd_
effect.

> So just being stupid and waiting for the timeout in direct reclaim
> while kswapd can make progress seemed to do a better job for that
> load.
>
> I can not exactly pinpoint the reason for that behaviour, it would be
> nice if somebody had an idea.

I thought the cause is that direct reclaim reclaims just 32 pages at a
time while kswapd can reclaim many pages in batches. But I didn't look
at it any more as I was busy. Does it make sense?

-- 
Kind regards,
Minchan Kim
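For context on Minchan's 32-page figure: in kernels of this era direct reclaim
targets a fixed small batch while kswapd keeps scanning until zone watermarks
are met, which is the asymmetry he is describing. From memory of the 2.6.36
sources -- verify against your own tree:

  /* include/linux/swap.h */
  #define SWAP_CLUSTER_MAX 32UL

  /* mm/vmscan.c: try_to_free_pages() sets up direct reclaim with
   *     .nr_to_reclaim = SWAP_CLUSTER_MAX,
   * so each direct reclaimer bails after ~32 reclaimed pages, while
   * balance_pgdat() loops until the zone watermarks are satisfied. */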
* [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 15:14 [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Mel Gorman
  2010-08-26 15:14 ` [PATCH 1/3] writeback: Account for time spent congestion_waited Mel Gorman
  2010-08-26 15:14 ` [PATCH 2/3] writeback: Record if the congestion was unnecessary Mel Gorman
@ 2010-08-26 15:14 ` Mel Gorman
  2010-08-26 17:38   ` Minchan Kim
  2010-08-27  5:13   ` Dave Chinner
  2010-08-26 17:20 ` [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Minchan Kim
  3 siblings, 2 replies; 38+ messages in thread
From: Mel Gorman @ 2010-08-26 15:14 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

If congestion_wait() is called with no BDIs congested, the caller will
sleep for the full timeout and this is an unnecessary sleep. This patch
checks if there are BDIs congested. If so, it goes to sleep as normal.
If not, it calls cond_resched() to ensure the caller is not hogging the
CPU longer than its quota but otherwise will not sleep.

This is aimed at reducing some of the major desktop stalls reported during
IO. For example, while kswapd is operating, it calls congestion_wait()
but it could just have been reclaiming clean page cache pages with no
congestion. Without this patch, it would sleep for a full timeout but after
this patch, it'll just call schedule() if it has been on the CPU too long.
Similar logic applies to direct reclaimers that are not making enough
progress.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/backing-dev.c |   20 ++++++++++++++------
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index a49167f..6abe860 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-	/* Check if this call to congestion_wait was necessary */
-	if (atomic_read(&nr_bdi_congested[sync]) == 0)
+	/*
+	 * If there is no congestion, there is no point sleeping on the queue.
+	 * This call was unnecessary but in case we are spinning due to a bad
+	 * caller, at least call cond_resched() and sleep if our CPU quota
+	 * has expired
+	 */
+	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
 		unnecessary = true;
-
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+		cond_resched();
+		ret = 0;
+	} else {
+		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		ret = io_schedule_timeout(timeout);
+		finish_wait(wqh, &wait);
+	}
 
 	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
 				unnecessary);
-- 
1.7.1
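For readers without the call sites to hand: callers pass the sync/async queue
and a timeout in jiffies, so a typical invocation from the reclaim paths being
discussed looks like the line below. Timeouts of HZ/10 and HZ/50 are both
common in kernels of this era; this is illustrative, not a quote of a specific
line:

  congestion_wait(BLK_RW_ASYNC, HZ/10);

With the patch above, such a call returns immediately (after a cond_resched())
whenever nr_bdi_congested[sync] is zero, instead of sleeping the full 100ms.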
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 15:14 ` [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs Mel Gorman
@ 2010-08-26 17:38   ` Minchan Kim
  2010-08-26 17:42     ` Mel Gorman
  1 sibling, 1 reply; 38+ messages in thread
From: Minchan Kim @ 2010-08-26 17:38 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDIs congested, the caller will
> sleep for the full timeout and this is an unnecessary sleep. This patch
> checks if there are BDIs congested. If so, it goes to sleep as normal.
> If not, it calls cond_resched() to ensure the caller is not hogging the
> CPU longer than its quota but otherwise will not sleep.
>
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/backing-dev.c |   20 ++++++++++++++------
>  1 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index a49167f..6abe860 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c

The function's description should be changed since we don't wait for
the next write any more.

> @@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
>  	DEFINE_WAIT(wait);
>  	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
> -	/* Check if this call to congestion_wait was necessary */
> -	if (atomic_read(&nr_bdi_congested[sync]) == 0)
> +	/*
> +	 * If there is no congestion, there is no point sleeping on the queue.
> +	 * This call was unnecessary but in case we are spinning due to a bad
> +	 * caller, at least call cond_resched() and sleep if our CPU quota
> +	 * has expired
> +	 */
> +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> 		unnecessary = true;
> -
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> -	ret = io_schedule_timeout(timeout);
> -	finish_wait(wqh, &wait);
> +		cond_resched();
> +		ret = 0;

"ret = timeout" would be more appropriate, considering
io_schedule_timeout()'s return value.

> +	} else {
> +		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +		ret = io_schedule_timeout(timeout);
> +		finish_wait(wqh, &wait);
> +	}
>
> 	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
> 				unnecessary);
> --
> 1.7.1

-- 
Kind regards,
Minchan Kim
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 17:38 ` Minchan Kim
@ 2010-08-26 17:42   ` Mel Gorman
  2010-08-26 18:17     ` Johannes Weiner
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2010-08-26 17:42 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel

On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called with no BDIs congested, the caller will
> > sleep for the full timeout and this is an unnecessary sleep. This patch
> > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > If not, it calls cond_resched() to ensure the caller is not hogging the
> > CPU longer than its quota but otherwise will not sleep.
> >
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/backing-dev.c |   20 ++++++++++++++------
> >  1 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index a49167f..6abe860 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
>
> The function's description should be changed since we don't wait for
> the next write any more.
>

My bad. I need to check that "next write" thing. It doesn't appear to be
happening but maybe that side of things just broke somewhere in the
distant past. I lack context of how this is meant to work so maybe
someone will educate me.

> > @@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
> >  	DEFINE_WAIT(wait);
> >  	wait_queue_head_t *wqh = &congestion_wqh[sync];
> >  
> > -	/* Check if this call to congestion_wait was necessary */
> > -	if (atomic_read(&nr_bdi_congested[sync]) == 0)
> > +	/*
> > +	 * If there is no congestion, there is no point sleeping on the queue.
> > +	 * This call was unnecessary but in case we are spinning due to a bad
> > +	 * caller, at least call cond_resched() and sleep if our CPU quota
> > +	 * has expired
> > +	 */
> > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > 		unnecessary = true;
> > -
> > -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > -	ret = io_schedule_timeout(timeout);
> > -	finish_wait(wqh, &wait);
> > +		cond_resched();
> > +		ret = 0;
>
> "ret = timeout" would be more appropriate, considering
> io_schedule_timeout()'s return value.
>

Good point, will fix.

> > +	} else {
> > +		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > +		ret = io_schedule_timeout(timeout);
> > +		finish_wait(wqh, &wait);
> > +	}
> >
> > 	trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
> > 				unnecessary);

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
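Folding both of Minchan's comments in, the uncongested branch would presumably
end up something like this in a respin -- a sketch of a likely v2, not a posted
patch. The ret = timeout convention matches io_schedule_timeout() returning
the time remaining when a sleeper is woken immediately:

  if (atomic_read(&nr_bdi_congested[sync]) == 0) {
  	unnecessary = true;
  	cond_resched();
  	/* We did not sleep, so the whole timeout "remains" */
  	ret = timeout;
  } else {
  	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
  	ret = io_schedule_timeout(timeout);
  	finish_wait(wqh, &wait);
  }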
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 17:42 ` Mel Gorman
@ 2010-08-26 18:17   ` Johannes Weiner
  2010-08-26 20:23     ` Mel Gorman
  2010-08-27  1:42     ` Wu Fengguang
  0 siblings, 2 replies; 38+ messages in thread
From: Johannes Weiner @ 2010-08-26 18:17 UTC (permalink / raw)
To: Mel Gorman
Cc: Minchan Kim, linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called with no BDIs congested, the caller will
> > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > CPU longer than its quota but otherwise will not sleep.
> > >
> > > This is aimed at reducing some of the major desktop stalls reported during
> > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > but it could just have been reclaiming clean page cache pages with no
> > > congestion. Without this patch, it would sleep for a full timeout but after
> > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > Similar logic applies to direct reclaimers that are not making enough
> > > progress.
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  mm/backing-dev.c |   20 ++++++++++++++------
> > >  1 files changed, 14 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > index a49167f..6abe860 100644
> > > --- a/mm/backing-dev.c
> > > +++ b/mm/backing-dev.c
> >
> > The function's description should be changed since we don't wait for
> > the next write any more.
>
> My bad. I need to check that "next write" thing. It doesn't appear to be
> happening but maybe that side of things just broke somewhere in the
> distant past. I lack context of how this is meant to work so maybe
> someone will educate me.

On every retired IO request the congestion state on the bdi is checked
and the congestion waitqueue woken up.

So without congestion, we still only wait until the next write
retires, but without any IO, we sleep the full timeout.

Check __freed_request() in block/blk-core.c.
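The path Johannes describes looks roughly like this in 2.6.36-era sources --
paraphrased from memory, so consult block/blk-core.c and mm/backing-dev.c for
the authoritative code:

  /* block/blk-core.c: a retired request can take the queue below the
   * congestion-off threshold, clearing the bdi's congested state */
  static void __freed_request(struct request_queue *q, int sync)
  {
  	struct request_list *rl = &q->rq;

  	if (rl->count[sync] < queue_congestion_off_threshold(q))
  		blk_clear_queue_congested(q, sync);
  	/* ... */
  }

  /* mm/backing-dev.c: clearing congestion wakes congestion_wait() sleepers */
  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
  {
  	wait_queue_head_t *wqh = &congestion_wqh[sync];

  	clear_bit(sync ? BDI_sync_congested : BDI_async_congested,
  		  &bdi->state);
  	smp_mb__after_clear_bit();
  	if (waitqueue_active(wqh))
  		wake_up(wqh);
  }

So with IO in flight, a congestion_wait() sleeper is woken as soon as the
queue drains below the threshold; with no IO at all, nothing ever triggers the
wakeup and the full timeout elapses, which is exactly the case Mel's series
targets.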
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 18:17 ` Johannes Weiner
@ 2010-08-26 20:23   ` Mel Gorman
  2010-08-27  1:11     ` Wu Fengguang
  0 siblings, 1 reply; 38+ messages in thread
From: Mel Gorman @ 2010-08-26 20:23 UTC (permalink / raw)
To: Johannes Weiner
Cc: Minchan Kim, linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel

On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > CPU longer than its quota but otherwise will not sleep.
> > > >
> > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > but it could just have been reclaiming clean page cache pages with no
> > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > Similar logic applies to direct reclaimers that are not making enough
> > > > progress.
> > > >
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > >  mm/backing-dev.c |   20 ++++++++++++++------
> > > >  1 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > index a49167f..6abe860 100644
> > > > --- a/mm/backing-dev.c
> > > > +++ b/mm/backing-dev.c
> > >
> > > The function's description should be changed since we don't wait for
> > > the next write any more.
> >
> > My bad. I need to check that "next write" thing. It doesn't appear to be
> > happening but maybe that side of things just broke somewhere in the
> > distant past. I lack context of how this is meant to work so maybe
> > someone will educate me.
>
> On every retired IO request the congestion state on the bdi is checked
> and the congestion waitqueue woken up.
>
> So without congestion, we still only wait until the next write
> retires, but without any IO, we sleep the full timeout.
>
> Check __freed_request() in block/blk-core.c.
>

Seems reasonable. Still, if there is no write IO going on and no
congestion there seems to be no point going to sleep for the full
timeout. It still feels wrong.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
  2010-08-26 20:23 ` Mel Gorman
@ 2010-08-27  1:11   ` Wu Fengguang
  2010-08-27  9:34     ` Mel Gorman
  0 siblings, 1 reply; 38+ messages in thread
From: Wu Fengguang @ 2010-08-27 1:11 UTC (permalink / raw)
To: Mel Gorman
Cc: Johannes Weiner, Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel@vger.kernel.org, Li Shaohua

On Fri, Aug 27, 2010 at 04:23:24AM +0800, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > > CPU longer than its quota but otherwise will not sleep.
> > > > >
> > > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > > but it could just have been reclaiming clean page cache pages with no
> > > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > > Similar logic applies to direct reclaimers that are not making enough
> > > > > progress.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > ---
> > > > >  mm/backing-dev.c |   20 ++++++++++++++------
> > > > >  1 files changed, 14 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > > index a49167f..6abe860 100644
> > > > > --- a/mm/backing-dev.c
> > > > > +++ b/mm/backing-dev.c
> > > >
> > > > The function's description should be changed since we don't wait for
> > > > the next write any more.
> > >
> > > My bad. I need to check that "next write" thing. It doesn't appear to be
> > > happening but maybe that side of things just broke somewhere in the
> > > distant past. I lack context of how this is meant to work so maybe
> > > someone will educate me.
> >
> > On every retired IO request the congestion state on the bdi is checked
> > and the congestion waitqueue woken up.
> >
> > So without congestion, we still only wait until the next write
> > retires, but without any IO, we sleep the full timeout.
> >
> > Check __freed_request() in block/blk-core.c.
> >
>
> Seems reasonable. Still, if there is no write IO going on and no
> congestion there seems to be no point going to sleep for the full
> timeout. It still feels wrong.

Yeah the stupid sleeping feels wrong. However there are ~20
congestion_wait() callers spread randomly in VM, FS and block drivers.
Many of them may have been added by rule of thumb, however what if some
of them happen to depend on the old stupid sleeping behavior? Obviously
you've done extensive tests on the page reclaim paths, however that's
far from enough to cover the wider changes made by this patch.

We may have to do the conversions case by case. Options include
converting to congestion_wait_check() (see
http://lkml.org/lkml/2010/8/18/292) or other waiting schemes.
Thanks, Fengguang
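[Editor's note: the congestion_wait_check() proposal referenced above is not reproduced in this thread. As a minimal sketch of the idea -- assuming it differs from congestion_wait() only in bailing out when no BDI is congested -- it might look like the following. This is a hypothetical reconstruction, not the posted patch:]

/*
 * Hypothetical sketch: wait as congestion_wait() does while some BDI
 * is congested, but when nothing is congested just yield the CPU and
 * return immediately instead of sleeping for the full timeout.
 */
long congestion_wait_check(int sync, long timeout)
{
	long ret;
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[sync];

	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
		cond_resched();
		return timeout;		/* did not sleep at all */
	}

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	ret = io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);
	return ret;
}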
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-27 1:11 ` Wu Fengguang @ 2010-08-27 9:34 ` Mel Gorman 0 siblings, 0 replies; 38+ messages in thread From: Mel Gorman @ 2010-08-27 9:34 UTC (permalink / raw) To: Wu Fengguang Cc: Johannes Weiner, Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel@vger.kernel.org, Li Shaohua On Fri, Aug 27, 2010 at 09:11:06AM +0800, Wu Fengguang wrote: > On Fri, Aug 27, 2010 at 04:23:24AM +0800, Mel Gorman wrote: > > On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote: > > > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote: > > > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote: > > > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > > > > > > If congestion_wait() is called with no BDIs congested, the caller will > > > > > > sleep for the full timeout and this is an unnecessary sleep. This patch > > > > > > checks if there are BDIs congested. If so, it goes to sleep as normal. > > > > > > If not, it calls cond_resched() to ensure the caller is not hogging the > > > > > > CPU longer than its quota but otherwise will not sleep. > > > > > > > > > > > > This is aimed at reducing some of the major desktop stalls reported during > > > > > > IO. For example, while kswapd is operating, it calls congestion_wait() > > > > > > but it could just have been reclaiming clean page cache pages with no > > > > > > congestion. Without this patch, it would sleep for a full timeout but after > > > > > > this patch, it'll just call schedule() if it has been on the CPU too long. > > > > > > Similar logic applies to direct reclaimers that are not making enough > > > > > > progress. > > > > > > > > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > > > --- > > > > > > mm/backing-dev.c | 20 ++++++++++++++------ > > > > > > 1 files changed, 14 insertions(+), 6 deletions(-) > > > > > > > > > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > > > > > index a49167f..6abe860 100644 > > > > > > --- a/mm/backing-dev.c > > > > > > +++ b/mm/backing-dev.c > > > > > > > > > > Function's decripton should be changed since we don't wait next write any more. > > > > > > > > > > > > > My bad. I need to check that "next write" thing. It doesn't appear to be > > > > happening but maybe that side of things just broke somewhere in the > > > > distant past. I lack context of how this is meant to work so maybe > > > > someone will educate me. > > > > > > On every retired io request the congestion state on the bdi is checked > > > and the congestion waitqueue woken up. > > > > > > So without congestion, we still only wait until the next write > > > retires, but without any IO, we sleep the full timeout. > > > > > > Check __freed_requests() in block/blk-core.c. > > > > > > > Seems reasonable. Still, if there is no write IO going on and no > > congestion there seems to be no point going to sleep for the full > > timeout. It still feels wrong. > > Yeah the stupid sleeping feels wrong. However there are ~20 > congestion_wait() callers spread randomly in VM, FS and block drivers. > Many of them may be added by rule of thumb, however what if some of > them happen to depend on the old stupid sleeping behavior? Obviously > you've done extensive tests on the page reclaim paths, however that's > far from enough to cover the wider changes made by this patch. > > We may have to do the conversions case by case. 
> Converting to congestion_wait_check() (see http://lkml.org/lkml/2010/8/18/292) or
> other waiting schemes.

I am taking this direction now.

-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-26 18:17 ` Johannes Weiner 2010-08-26 20:23 ` Mel Gorman @ 2010-08-27 1:42 ` Wu Fengguang 2010-08-27 9:37 ` Mel Gorman 1 sibling, 1 reply; 38+ messages in thread From: Wu Fengguang @ 2010-08-27 1:42 UTC (permalink / raw) To: Johannes Weiner Cc: Mel Gorman, Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel@vger.kernel.org, Li Shaohua, Rik van Riel On Fri, Aug 27, 2010 at 02:17:35AM +0800, Johannes Weiner wrote: > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote: > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote: > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > > > > If congestion_wait() is called with no BDIs congested, the caller will > > > > sleep for the full timeout and this is an unnecessary sleep. This patch > > > > checks if there are BDIs congested. If so, it goes to sleep as normal. > > > > If not, it calls cond_resched() to ensure the caller is not hogging the > > > > CPU longer than its quota but otherwise will not sleep. > > > > > > > > This is aimed at reducing some of the major desktop stalls reported during > > > > IO. For example, while kswapd is operating, it calls congestion_wait() > > > > but it could just have been reclaiming clean page cache pages with no > > > > congestion. Without this patch, it would sleep for a full timeout but after > > > > this patch, it'll just call schedule() if it has been on the CPU too long. > > > > Similar logic applies to direct reclaimers that are not making enough > > > > progress. > > > > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > --- > > > > mm/backing-dev.c | 20 ++++++++++++++------ > > > > 1 files changed, 14 insertions(+), 6 deletions(-) > > > > > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > > > index a49167f..6abe860 100644 > > > > --- a/mm/backing-dev.c > > > > +++ b/mm/backing-dev.c > > > > > > Function's decripton should be changed since we don't wait next write any more. > > > > > > > My bad. I need to check that "next write" thing. It doesn't appear to be > > happening but maybe that side of things just broke somewhere in the > > distant past. I lack context of how this is meant to work so maybe > > someone will educate me. > > On every retired io request the congestion state on the bdi is checked > and the congestion waitqueue woken up. > > So without congestion, we still only wait until the next write > retires, but without any IO, we sleep the full timeout. > > Check __freed_requests() in block/blk-core.c. congestion_wait() is tightly related with pageout() and writeback, however it may have some intention for the no-IO case as well. - if write congested, maybe we are doing too much pageout(), so wait. it might also reduce some get_request_wait() stalls (the normal way is to explicitly check for congestion before doing write out). - if any write completes, it may free some PG_reclaim pages, so proceed. (when not congested) - if no IO at all, the 100ms sleep might still prevent a page reclaimer from stealing lots of slices from a busy computing program that involves no page allocation at all. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
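[Editor's note: the "explicit check" mentioned in Wu's first point is the usual pattern of testing the backing device before queueing more writeout, along these lines. This is an illustrative fragment only; actual call sites vary:]

/*
 * Illustrative fragment: a writer tests the backing device and backs
 * off when it is already congested, rather than piling more requests
 * onto a congested queue.
 */
struct backing_dev_info *bdi = mapping->backing_dev_info;

if (bdi_write_congested(bdi))
	return 0;	/* defer this writeout; retry later */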
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-27 1:42 ` Wu Fengguang @ 2010-08-27 9:37 ` Mel Gorman 0 siblings, 0 replies; 38+ messages in thread From: Mel Gorman @ 2010-08-27 9:37 UTC (permalink / raw) To: Wu Fengguang Cc: Johannes Weiner, Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel@vger.kernel.org, Li Shaohua, Rik van Riel On Fri, Aug 27, 2010 at 09:42:54AM +0800, Wu Fengguang wrote: > On Fri, Aug 27, 2010 at 02:17:35AM +0800, Johannes Weiner wrote: > > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote: > > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote: > > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > > > > > If congestion_wait() is called with no BDIs congested, the caller will > > > > > sleep for the full timeout and this is an unnecessary sleep. This patch > > > > > checks if there are BDIs congested. If so, it goes to sleep as normal. > > > > > If not, it calls cond_resched() to ensure the caller is not hogging the > > > > > CPU longer than its quota but otherwise will not sleep. > > > > > > > > > > This is aimed at reducing some of the major desktop stalls reported during > > > > > IO. For example, while kswapd is operating, it calls congestion_wait() > > > > > but it could just have been reclaiming clean page cache pages with no > > > > > congestion. Without this patch, it would sleep for a full timeout but after > > > > > this patch, it'll just call schedule() if it has been on the CPU too long. > > > > > Similar logic applies to direct reclaimers that are not making enough > > > > > progress. > > > > > > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > > --- > > > > > mm/backing-dev.c | 20 ++++++++++++++------ > > > > > 1 files changed, 14 insertions(+), 6 deletions(-) > > > > > > > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > > > > index a49167f..6abe860 100644 > > > > > --- a/mm/backing-dev.c > > > > > +++ b/mm/backing-dev.c > > > > > > > > Function's decripton should be changed since we don't wait next write any more. > > > > > > > > > > My bad. I need to check that "next write" thing. It doesn't appear to be > > > happening but maybe that side of things just broke somewhere in the > > > distant past. I lack context of how this is meant to work so maybe > > > someone will educate me. > > > > On every retired io request the congestion state on the bdi is checked > > and the congestion waitqueue woken up. > > > > So without congestion, we still only wait until the next write > > retires, but without any IO, we sleep the full timeout. > > > > Check __freed_requests() in block/blk-core.c. > > congestion_wait() is tightly related with pageout() and writeback, > however it may have some intention for the no-IO case as well. > > - if write congested, maybe we are doing too much pageout(), so wait. > it might also reduce some get_request_wait() stalls (the normal way > is to explicitly check for congestion before doing write out). > > - if any write completes, it may free some PG_reclaim pages, so proceed. > (when not congested) > For these cases, would it make sense for wait_iff_congested() to compare nr_writeback to nr_inactive and decide to wait on congestion if more than half the inactive list is in writeback? > - if no IO at all, the 100ms sleep might still prevent a page reclaimer > from stealing lots of slices from a busy computing program that > involves no page allocation at all. 
I don't think this is a very strong argument because cond_resched() is being called.

-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab
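[Editor's note: to make the heuristic Mel floats above concrete -- comparing writeback pages against the inactive list before deciding to sleep -- a wait_iff_congested() could consult the zone counters along these lines. Hypothetical sketch; no such check exists in the tree at this point and the helper name is invented:]

/*
 * Hypothetical helper: treat the zone as worth sleeping on only if
 * more than half of its inactive pages are currently under writeback,
 * i.e. write completions are plausibly the event being waited for.
 */
static bool zone_mostly_writeback(struct zone *zone, int file)
{
	unsigned long inactive = zone_page_state(zone,
			file ? NR_INACTIVE_FILE : NR_INACTIVE_ANON);
	unsigned long writeback = zone_page_state(zone, NR_WRITEBACK);

	return writeback * 2 > inactive;
}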
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-26 15:14 ` [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs Mel Gorman 2010-08-26 17:38 ` Minchan Kim @ 2010-08-27 5:13 ` Dave Chinner 2010-08-27 9:33 ` Mel Gorman 1 sibling, 1 reply; 38+ messages in thread From: Dave Chinner @ 2010-08-27 5:13 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > If congestion_wait() is called with no BDIs congested, the caller will > sleep for the full timeout and this is an unnecessary sleep. That, I think, is an invalid assumption. congestion_wait is used in some places as a backoff mechanism that waits for some IO work to be done, with congestion disappearing being a indication that progress has been made and so we can retry sooner than the entire timeout. For example, if _xfs_buf_lookup_pages() fails to allocate page cache pages for a buffer, it will kick the xfsbufd to writeback dirty buffers (so they can be freed) and immediately enter congestion_wait(). If there isn't congestion when we enter congestion_wait(), we still want to give the xfsbufds a chance to clean some pages before we retry the allocation for the new buffer. Removing the congestion_wait() sleep behaviour will effectively _increase_ memory pressure with XFS on fast disk subsystems because it now won't backoff between failed allocation attempts... Perhaps a congestion_wait_iff_congested() variant is needed for the VM? I can certainly see how it benefits the VM from a latency perspective, but it is the opposite behaviour that is expected in other places... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
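[Editor's note: the XFS retry loop Dave describes has roughly this shape -- an abridged sketch of that era's fs/xfs/linux-2.6/xfs_buf.c, with error handling and statistics elided:]

/*
 * Abridged sketch of _xfs_buf_lookup_pages(): on allocation failure,
 * kick xfsbufd to write back dirty buffers, then use congestion_wait()
 * purely as a backoff before retrying -- congested or not.
 */
retry:
	page = alloc_page(gfp_mask);
	if (unlikely(page == NULL)) {
		xfsbufd_wakeup(NULL, 0, gfp_mask);
		congestion_wait(BLK_RW_ASYNC, HZ/50);
		goto retry;
	}

[If congestion_wait() returns immediately here, the loop degenerates into hammering alloc_page() with no time for xfsbufd to make progress, which is the regression Dave is warning about.]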
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs 2010-08-27 5:13 ` Dave Chinner @ 2010-08-27 9:33 ` Mel Gorman 0 siblings, 0 replies; 38+ messages in thread From: Mel Gorman @ 2010-08-27 9:33 UTC (permalink / raw) To: Dave Chinner Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel On Fri, Aug 27, 2010 at 03:13:16PM +1000, Dave Chinner wrote: > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote: > > If congestion_wait() is called with no BDIs congested, the caller will > > sleep for the full timeout and this is an unnecessary sleep. > > That, I think, is an invalid assumption. congestion_wait is used in > some places as a backoff mechanism that waits for some IO work to be > done, with congestion disappearing being a indication that progress > has been made and so we can retry sooner than the entire timeout. > As it's write IO rather than some IO, I wonder if that's really the right thing to do. However, I accept your (and others) point that converting all congestion_wait() callers may be too much of a change. > For example, if _xfs_buf_lookup_pages() fails to allocate page cache > pages for a buffer, it will kick the xfsbufd to writeback dirty > buffers (so they can be freed) and immediately enter > congestion_wait(). If there isn't congestion when we enter > congestion_wait(), we still want to give the xfsbufds a chance to > clean some pages before we retry the allocation for the new buffer. > Removing the congestion_wait() sleep behaviour will effectively > _increase_ memory pressure with XFS on fast disk subsystems because > it now won't backoff between failed allocation attempts... > > Perhaps a congestion_wait_iff_congested() variant is needed for the > VM? I can certainly see how it benefits the VM from a latency > perspective, but it is the opposite behaviour that is expected in > other places... > I'm added a wait_iff_congested() and updated a few of the VM callers. I changed a fairly minimum number of what appeared to be the obvious ones to change. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
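[Editor's note: in caller terms, the conversion Mel describes amounts to hunks of this shape in the reclaim paths. This is a hypothetical hunk -- the signature of wait_iff_congested() is assumed to mirror congestion_wait() for the sketch, and the posted series may differ:]

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ ... @@
-	congestion_wait(BLK_RW_ASYNC, HZ/10);
+	wait_iff_congested(BLK_RW_ASYNC, HZ/10);

[Callers that want a pure rate limit, like the XFS backoff sketched above, keep plain congestion_wait() and its unconditional sleep.]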
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-26 15:14 [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Mel Gorman ` (2 preceding siblings ...) 2010-08-26 15:14 ` [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs Mel Gorman @ 2010-08-26 17:20 ` Minchan Kim 2010-08-26 17:31 ` Mel Gorman 2010-08-27 1:21 ` Wu Fengguang 3 siblings, 2 replies; 38+ messages in thread From: Minchan Kim @ 2010-08-26 17:20 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote: > congestion_wait() is a bit stupid in that it goes to sleep even when there > is no congestion. This causes stalls in a number of situations and may be > partially responsible for bug reports about desktop interactivity. > > This patch series aims to account for these unnecessary congestion_waits() > and to avoid going to sleep when there is no congestion available. Patches > 1 and 2 add instrumentation related to congestion which should be reuable > by alternative solutions to congestion_wait. Patch 3 calls cond_resched() > instead of going to sleep if there is no congestion. > > Once again, I shoved this through performance test. Unlike previous tests, > I ran this on a ported version of my usual test-suite that should be suitable > for release soon. It's not quite as good as my old set but it's sufficient > for this and related series. The tests I ran were kernbench vmr-stream > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is > the usual stress the number of high order allocations that can be made while > the system is under severe stress. The suite contains the necessary analysis > scripts as well and I'd release it now except the documentation blows. > > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine) > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine) > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation) > > The disks on all of them were single disks and not particularly fast. > > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for > instrumentation and a second test with patch 3 applied. > > In all cases, kernbench, hackbench, STREAM and iozone did not show any > performance difference because none of them were pressuring the system > enough to be calling congestion_wait() so I won't post the results. > About all worth noting for them is that nothing horrible appeared to break. > > In the analysis scripts, I record unnecessary sleeps to be a sleep that > had no congestion. The post-processing scripts for cond_resched() will only > count an uncongested call to congestion_wait() as unnecessary if the process > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted. > > One vague concern I have is when too many pages are isolated, we call > congestion_wait(). This could now actively spin in the loop for its quanta > before calling cond_resched(). If it's calling with no congestion, it's > hard to know what the proper thing to do there is. Suddenly, many processes could enter into the direct reclaim path by another reason(ex, fork bomb) regradless of congestion. backing dev congestion is just one of them. 
I think that if congestion_wait returns without calling io_schedule_timeout because of your patch, too_many_isolated can call schedule_timeout to wait for the system to calm down, preventing OOM killing.

How about this? If you don't mind, I will send the patch based on this patch series after your patches settle down. Or could you add it to your patch series? But I admit this hardly affects your experiment.
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-26 17:20 ` [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Minchan Kim @ 2010-08-26 17:31 ` Mel Gorman 2010-08-26 17:50 ` Minchan Kim 2010-08-27 1:21 ` Wu Fengguang 1 sibling, 1 reply; 38+ messages in thread From: Mel Gorman @ 2010-08-26 17:31 UTC (permalink / raw) To: Minchan Kim Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Fri, Aug 27, 2010 at 02:20:38AM +0900, Minchan Kim wrote: > On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote: > > congestion_wait() is a bit stupid in that it goes to sleep even when there > > is no congestion. This causes stalls in a number of situations and may be > > partially responsible for bug reports about desktop interactivity. > > > > This patch series aims to account for these unnecessary congestion_waits() > > and to avoid going to sleep when there is no congestion available. Patches > > 1 and 2 add instrumentation related to congestion which should be reuable > > by alternative solutions to congestion_wait. Patch 3 calls cond_resched() > > instead of going to sleep if there is no congestion. > > > > Once again, I shoved this through performance test. Unlike previous tests, > > I ran this on a ported version of my usual test-suite that should be suitable > > for release soon. It's not quite as good as my old set but it's sufficient > > for this and related series. The tests I ran were kernbench vmr-stream > > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench > > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is > > the usual stress the number of high order allocations that can be made while > > the system is under severe stress. The suite contains the necessary analysis > > scripts as well and I'd release it now except the documentation blows. > > > > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine) > > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine) > > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation) > > > > The disks on all of them were single disks and not particularly fast. > > > > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for > > instrumentation and a second test with patch 3 applied. > > > > In all cases, kernbench, hackbench, STREAM and iozone did not show any > > performance difference because none of them were pressuring the system > > enough to be calling congestion_wait() so I won't post the results. > > About all worth noting for them is that nothing horrible appeared to break. > > > > In the analysis scripts, I record unnecessary sleeps to be a sleep that > > had no congestion. The post-processing scripts for cond_resched() will only > > count an uncongested call to congestion_wait() as unnecessary if the process > > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted. > > > > One vague concern I have is when too many pages are isolated, we call > > congestion_wait(). This could now actively spin in the loop for its quanta > > before calling cond_resched(). If it's calling with no congestion, it's > > hard to know what the proper thing to do there is. > > Suddenly, many processes could enter into the direct reclaim path by another > reason(ex, fork bomb) regradless of congestion. backing dev congestion is > just one of them. 
> This situation applys with or without this series, right? > I think if congestion_wait returns without calling io_schedule_timeout > by your patch, too_many_isolated can schedule_timeout to wait for the system's > calm to preventing OOM killing. > More likely, to stop a loop in too_many_isolated() consuming CPU time it can do nothing with. > How about this? > > If you don't mind, I will send the patch based on this patch series > after your patch settle down or Could you add this to your patch series? > But I admit this doesn't almost affect your experiment. > I think it's a related topic so could belong with the series. > From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001 > From: Minchan Kim <minchan.kim@gmail.com> > Date: Fri, 27 Aug 2010 02:06:45 +0900 > Subject: [PATCH] Wait regardless of congestion if too many pages are isolated > > Suddenly, many processes could enter into the direct reclaim path > regradless of congestion. backing dev congestion is just one of them. > But current implementation calls congestion_wait if too many pages are isolated. > > if congestion_wait returns without calling io_schedule_timeout, > too_many_isolated can schedule_timeout to wait for the system's calm > to preventing OOM killing. > I think the reasoning here might be a little off. How about; If many processes enter direct reclaim or memory compaction, too many pages can get isolated. In this situation, too_many_isolated() can call congestion_wait() but if there is no congestion, it fails to go to sleep and instead spins until it's quota expires. This patch checks if congestion_wait() returned without sleeping. If it did because there was no congestion, it unconditionally goes to sleep instead of hogging the CPU. > Signed-off-by: Minchan Kim <minchan.kim@gmail.com> > --- > mm/backing-dev.c | 5 ++--- > mm/compaction.c | 6 +++++- > mm/vmscan.c | 6 +++++- > 3 files changed, 12 insertions(+), 5 deletions(-) > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > index 6abe860..9431bca 100644 > --- a/mm/backing-dev.c > +++ b/mm/backing-dev.c > @@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested); > * @timeout: timeout in jiffies > * > * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit > - * write congestion. If no backing_devs are congested then just wait for the > - * next write to be completed. > + * write congestion. If no backing_devs are congested then just returns. > */ > long congestion_wait(int sync, long timeout) > { > @@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout) > if (atomic_read(&nr_bdi_congested[sync]) == 0) { > unnecessary = true; > cond_resched(); > - ret = 0; > + ret = timeout; > } else { > prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > ret = io_schedule_timeout(timeout); > diff --git a/mm/compaction.c b/mm/compaction.c > index 94cce51..7370683 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone, > * delay for some time until fewer pages are isolated > */ > while (unlikely(too_many_isolated(zone))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + long timeout = HZ/10; > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > + set_current_state(TASK_INTERRUPTIBLE); > + schedule_timeout(timeout); > + } > We don't really need the timeout variable here but I see what you are at. It's unfortunate to just go to sleep for HZ/10 but if it's not congestion, we do not have any other event to wake up on at the moment. 
We'd have to introduce a too_many_isolated waitqueue that is kicked if pages are put back on the LRU. This is better than spinning though. > if (fatal_signal_pending(current)) > return 0; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 3109ff7..f5e3e28 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > unsigned long nr_dirty; > while (unlikely(too_many_isolated(zone, file, sc))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + long timeout = HZ/10; > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > + set_current_state(TASK_INTERRUPTIBLE); > + schedule_timeout(timeout); > + } > > /* We are about to die and free our memory. Return now. */ > if (fatal_signal_pending(current)) This seems very reasonable. I'll review it more carefully tomorrow and if I spot nothing horrible, I'll add it onto the series. I'm not sure I'm hitting the too_many_isolated() case but I cannot think of a better alternative without adding more waitqueues. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
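[Editor's note: the waitqueue idea Mel mentions above, sketched for illustration. This is entirely hypothetical -- every name here is invented and nothing like it exists in the tree:]

/*
 * Hypothetical sketch: reclaimers blocked by too_many_isolated()
 * sleep on a queue and are woken when isolated pages return to the
 * LRU, rather than polling on a fixed timeout.
 */
static DECLARE_WAIT_QUEUE_HEAD(isolated_wqh);

/* would be called from putback_lru_page() and friends */
static void isolated_pages_returned(void)
{
	if (waitqueue_active(&isolated_wqh))
		wake_up(&isolated_wqh);
}

/* caller side, replacing the congestion_wait() poll */
wait_event_timeout(isolated_wqh,
		   !too_many_isolated(zone, file, sc), HZ/10);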
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-26 17:31 ` Mel Gorman @ 2010-08-26 17:50 ` Minchan Kim 0 siblings, 0 replies; 38+ messages in thread From: Minchan Kim @ 2010-08-26 17:50 UTC (permalink / raw) To: Mel Gorman Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki On Thu, Aug 26, 2010 at 06:31:47PM +0100, Mel Gorman wrote: > On Fri, Aug 27, 2010 at 02:20:38AM +0900, Minchan Kim wrote: > > On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote: > > > congestion_wait() is a bit stupid in that it goes to sleep even when there > > > is no congestion. This causes stalls in a number of situations and may be > > > partially responsible for bug reports about desktop interactivity. > > > > > > This patch series aims to account for these unnecessary congestion_waits() > > > and to avoid going to sleep when there is no congestion available. Patches > > > 1 and 2 add instrumentation related to congestion which should be reuable > > > by alternative solutions to congestion_wait. Patch 3 calls cond_resched() > > > instead of going to sleep if there is no congestion. > > > > > > Once again, I shoved this through performance test. Unlike previous tests, > > > I ran this on a ported version of my usual test-suite that should be suitable > > > for release soon. It's not quite as good as my old set but it's sufficient > > > for this and related series. The tests I ran were kernbench vmr-stream > > > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench > > > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is > > > the usual stress the number of high order allocations that can be made while > > > the system is under severe stress. The suite contains the necessary analysis > > > scripts as well and I'd release it now except the documentation blows. > > > > > > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine) > > > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine) > > > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation) > > > > > > The disks on all of them were single disks and not particularly fast. > > > > > > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for > > > instrumentation and a second test with patch 3 applied. > > > > > > In all cases, kernbench, hackbench, STREAM and iozone did not show any > > > performance difference because none of them were pressuring the system > > > enough to be calling congestion_wait() so I won't post the results. > > > About all worth noting for them is that nothing horrible appeared to break. > > > > > > In the analysis scripts, I record unnecessary sleeps to be a sleep that > > > had no congestion. The post-processing scripts for cond_resched() will only > > > count an uncongested call to congestion_wait() as unnecessary if the process > > > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted. > > > > > > One vague concern I have is when too many pages are isolated, we call > > > congestion_wait(). This could now actively spin in the loop for its quanta > > > before calling cond_resched(). If it's calling with no congestion, it's > > > hard to know what the proper thing to do there is. > > > > Suddenly, many processes could enter into the direct reclaim path by another > > reason(ex, fork bomb) regradless of congestion. 
backing dev congestion is > > just one of them. > > > > This situation applys with or without this series, right? I think the situation applys with this series. That's because old behavior was calling schedule regardless of I/O congested as seeing io_schedule_timeout. But you are changing it now as calling it conditionally. > > > I think if congestion_wait returns without calling io_schedule_timeout > > by your patch, too_many_isolated can schedule_timeout to wait for the system's > > calm to preventing OOM killing. > > > > More likely, to stop a loop in too_many_isolated() consuming CPU time it > can do nothing with. > > > How about this? > > > > If you don't mind, I will send the patch based on this patch series > > after your patch settle down or Could you add this to your patch series? > > But I admit this doesn't almost affect your experiment. > > > > I think it's a related topic so could belong with the series. > > > From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001 > > From: Minchan Kim <minchan.kim@gmail.com> > > Date: Fri, 27 Aug 2010 02:06:45 +0900 > > Subject: [PATCH] Wait regardless of congestion if too many pages are isolated > > > > Suddenly, many processes could enter into the direct reclaim path > > regradless of congestion. backing dev congestion is just one of them. > > But current implementation calls congestion_wait if too many pages are isolated. > > > > if congestion_wait returns without calling io_schedule_timeout, > > too_many_isolated can schedule_timeout to wait for the system's calm > > to preventing OOM killing. > > > > I think the reasoning here might be a little off. How about; > > If many processes enter direct reclaim or memory compaction, too many pages > can get isolated. In this situation, too_many_isolated() can call > congestion_wait() but if there is no congestion, it fails to go to sleep > and instead spins until it's quota expires. > > This patch checks if congestion_wait() returned without sleeping. If it > did because there was no congestion, it unconditionally goes to sleep > instead of hogging the CPU. That's good to me. :) > > > Signed-off-by: Minchan Kim <minchan.kim@gmail.com> > > --- > > mm/backing-dev.c | 5 ++--- > > mm/compaction.c | 6 +++++- > > mm/vmscan.c | 6 +++++- > > 3 files changed, 12 insertions(+), 5 deletions(-) > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c > > index 6abe860..9431bca 100644 > > --- a/mm/backing-dev.c > > +++ b/mm/backing-dev.c > > @@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested); > > * @timeout: timeout in jiffies > > * > > * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit > > - * write congestion. If no backing_devs are congested then just wait for the > > - * next write to be completed. > > + * write congestion. If no backing_devs are congested then just returns. 
> > */ > > long congestion_wait(int sync, long timeout) > > { > > @@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout) > > if (atomic_read(&nr_bdi_congested[sync]) == 0) { > > unnecessary = true; > > cond_resched(); > > - ret = 0; > > + ret = timeout; > > } else { > > prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE); > > ret = io_schedule_timeout(timeout); > > diff --git a/mm/compaction.c b/mm/compaction.c > > index 94cce51..7370683 100644 > > --- a/mm/compaction.c > > +++ b/mm/compaction.c > > @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone, > > * delay for some time until fewer pages are isolated > > */ > > while (unlikely(too_many_isolated(zone))) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + long timeout = HZ/10; > > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > > + set_current_state(TASK_INTERRUPTIBLE); > > + schedule_timeout(timeout); > > + } > > > > We don't really need the timeout variable here but I see what you are > at. It's unfortunate to just go to sleep for HZ/10 but if it's not > congestion, we do not have any other event to wake up on at the moment. > We'd have to introduce a too_many_isolated waitqueue that is kicked if > pages are put back on the LRU. I thought it firstly but first of all, let's make sure how often this situation happens and it's really serious problem. I means it's rather overkill. > > This is better than spinning though. > > > if (fatal_signal_pending(current)) > > return 0; > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 3109ff7..f5e3e28 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > unsigned long nr_dirty; > > while (unlikely(too_many_isolated(zone, file, sc))) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + long timeout = HZ/10; > > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > > + set_current_state(TASK_INTERRUPTIBLE); > > + schedule_timeout(timeout); > > + } > > > > /* We are about to die and free our memory. Return now. */ > > if (fatal_signal_pending(current)) > > This seems very reasonable. I'll review it more carefully tomorrow and if I > spot nothing horrible, I'll add it onto the series. I'm not sure I'm hitting > the too_many_isolated() case but I cannot think of a better alternative > without adding more waitqueues. Thanks. Mel. > > -- > Mel Gorman > Part-time Phd Student Linux Technology Center > University of Limerick IBM Dublin Software Lab -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-26 17:20 ` [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Minchan Kim 2010-08-26 17:31 ` Mel Gorman @ 2010-08-27 1:21 ` Wu Fengguang 2010-08-27 1:41 ` Minchan Kim 2010-08-27 9:38 ` Mel Gorman 1 sibling, 2 replies; 38+ messages in thread From: Wu Fengguang @ 2010-08-27 1:21 UTC (permalink / raw) To: Minchan Kim Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li Shaohua Minchan, It's much cleaner to keep the unchanged congestion_wait() and add a congestion_wait_check() for converting problematic wait sites. The too_many_isolated() wait is merely a protective mechanism, I won't bother to improve it at the cost of more code. Thanks, Fengguang > diff --git a/mm/compaction.c b/mm/compaction.c > index 94cce51..7370683 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone, > * delay for some time until fewer pages are isolated > */ > while (unlikely(too_many_isolated(zone))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + long timeout = HZ/10; > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > + set_current_state(TASK_INTERRUPTIBLE); > + schedule_timeout(timeout); > + } > > if (fatal_signal_pending(current)) > return 0; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 3109ff7..f5e3e28 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > unsigned long nr_dirty; > while (unlikely(too_many_isolated(zone, file, sc))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + long timeout = HZ/10; > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { > + set_current_state(TASK_INTERRUPTIBLE); > + schedule_timeout(timeout); > + } > > /* We are about to die and free our memory. Return now. */ > if (fatal_signal_pending(current)) > -- > 1.7.0.5 > > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 1:21 ` Wu Fengguang @ 2010-08-27 1:41 ` Minchan Kim 2010-08-27 1:50 ` Wu Fengguang 0 siblings, 1 reply; 38+ messages in thread From: Minchan Kim @ 2010-08-27 1:41 UTC (permalink / raw) To: Wu Fengguang Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li Shaohua

Hi, Wu.

On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> Minchan,
>
> It's much cleaner to keep the unchanged congestion_wait() and add a
> congestion_wait_check() for converting problematic wait sites. The
> too_many_isolated() wait is merely a protective mechanism, I won't
> bother to improve it at the cost of more code.

Do you mean something like the following?

while (unlikely(too_many_isolated(zone, file, sc))) {
	congestion_wait_check(BLK_RW_ASYNC, HZ/10);

	/* We are about to die and free our memory. Return now. */
	if (fatal_signal_pending(current))
		return SWAP_CLUSTER_MAX;
}

> Thanks,
> Fengguang

-- Kind regards, Minchan Kim
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 1:41 ` Minchan Kim @ 2010-08-27 1:50 ` Wu Fengguang 2010-08-27 2:02 ` Minchan Kim 0 siblings, 1 reply; 38+ messages in thread From: Wu Fengguang @ 2010-08-27 1:50 UTC (permalink / raw) To: Minchan Kim Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li, Shaohua

On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
> Hi, Wu.
>
> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > Minchan,
> >
> > It's much cleaner to keep the unchanged congestion_wait() and add a
> > congestion_wait_check() for converting problematic wait sites. The
> > too_many_isolated() wait is merely a protective mechanism, I won't
> > bother to improve it at the cost of more code.
>
> Do you mean something like the following?

No, I mean do not change the too_many_isolated() related code at all :)
And use congestion_wait_check() in other places where we can prove there is a problem that can rightly be fixed by converting to congestion_wait_check().

> while (unlikely(too_many_isolated(zone, file, sc))) {
> 	congestion_wait_check(BLK_RW_ASYNC, HZ/10);
>
> 	/* We are about to die and free our memory. Return now. */
> 	if (fatal_signal_pending(current))
> 		return SWAP_CLUSTER_MAX;
> }

Thanks, Fengguang
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 1:50 ` Wu Fengguang @ 2010-08-27 2:02 ` Minchan Kim 2010-08-27 4:34 ` Wu Fengguang 0 siblings, 1 reply; 38+ messages in thread From: Minchan Kim @ 2010-08-27 2:02 UTC (permalink / raw) To: Wu Fengguang Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li, Shaohua

On Fri, Aug 27, 2010 at 10:50 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
>> Hi, Wu.
>>
>> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > Minchan,
>> >
>> > It's much cleaner to keep the unchanged congestion_wait() and add a
>> > congestion_wait_check() for converting problematic wait sites. The
>> > too_many_isolated() wait is merely a protective mechanism, I won't
>> > bother to improve it at the cost of more code.
>>
>> Do you mean something like the following?
>
> No, I mean do not change the too_many_isolated() related code at all :)
> And use congestion_wait_check() in other places where we can prove
> there is a problem that can rightly be fixed by converting to
> congestion_wait_check().

I always have trouble understanding your comments. Apparently, my eyes have a problem. ;(

This patch depends on Mel's series. With congestion_wait changed to just return when there is no congestion, we would get CPU hogging in too_many_isolated. I think the same would apply with Li's congestion_wait_check, too. If congestion_wait stays unchanged, we don't need this patch.

Still, maybe I can't understand your comment. Sorry.

-- Kind regards, Minchan Kim
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 2:02 ` Minchan Kim @ 2010-08-27 4:34 ` Wu Fengguang 0 siblings, 0 replies; 38+ messages in thread From: Wu Fengguang @ 2010-08-27 4:34 UTC (permalink / raw) To: Minchan Kim Cc: Mel Gorman, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li, Shaohua On Fri, Aug 27, 2010 at 10:02:52AM +0800, Minchan Kim wrote: > On Fri, Aug 27, 2010 at 10:50 AM, Wu Fengguang <fengguang.wu@intel.com> wrote: > > On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote: > >> Hi, Wu. > >> > >> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote: > >> > Minchan, > >> > > >> > It's much cleaner to keep the unchanged congestion_wait() and add a > >> > congestion_wait_check() for converting problematic wait sites. The > >> > too_many_isolated() wait is merely a protective mechanism, I won't > >> > bother to improve it at the cost of more code. > >> > >> You means following as? > > > > No, I mean do not change the too_many_isolated() related code at all :) > > And to use congestion_wait_check() in other places that we can prove > > there is a problem that can be rightly fixed by changing to > > congestion_wait_check(). > > I always suffer from understanding your comment. > Apparently, my eyes have a problem. ;( > This patch is dependent of Mel's series. > With changing congestion_wait with just return when no congestion, it > would have CPU hogging in too_many_isolated. I think it would apply in > Li's congestion_wait_check, too. > If no change is current congestion_wait, we doesn't need this patch. > > Still, maybe I can't understand your comment. Sorry. Sorry! The confusion must come from the modified congestion_wait() by Mel. My proposal is to _not_ modify congestion_wait(), but add another congestion_wait_check() which won't sleep 100ms when no IO. In this way, the following chunks become unnecessary. --- a/mm/compaction.c +++ b/mm/compaction.c @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone, * delay for some time until fewer pages are isolated */ while (unlikely(too_many_isolated(zone))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + long timeout = HZ/10; + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(timeout); + } if (fatal_signal_pending(current)) return 0; diff --git a/mm/vmscan.c b/mm/vmscan.c index 3109ff7..f5e3e28 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, unsigned long nr_dirty; while (unlikely(too_many_isolated(zone, file, sc))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + long timeout = HZ/10; + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) { + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(timeout); + } /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion 2010-08-27 1:21 ` Wu Fengguang 2010-08-27 1:41 ` Minchan Kim @ 2010-08-27 9:38 ` Mel Gorman 1 sibling, 0 replies; 38+ messages in thread From: Mel Gorman @ 2010-08-27 9:38 UTC (permalink / raw) To: Wu Fengguang Cc: Minchan Kim, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Andrew Morton, Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel@vger.kernel.org, Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li Shaohua

On Fri, Aug 27, 2010 at 09:21:47AM +0800, Wu Fengguang wrote:
> Minchan,
>
> It's much cleaner to keep the unchanged congestion_wait() and add a
> congestion_wait_check() for converting problematic wait sites. The
> too_many_isolated() wait is merely a protective mechanism, I won't
> bother to improve it at the cost of more code.

This is what I've done. I dropped the patch again and am using wait_iff_congested(). I left the too_many_isolated() callers as congestion_wait().

-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab