From: Wu Fengguang <fengguang.wu@intel.com>
To: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
LKML <linux-kernel@vger.kernel.org>,
Feng Tang <feng.tang@intel.com>
Subject: [PATCH] nfs: writeback pages wait queue
Date: Sat, 19 Nov 2011 20:52:53 +0800
Message-ID: <20111119125253.GA5483@localhost>

The generic writeback routines are moving away from congestion_wait()
in favor of get_request_wait(), i.e. waiting on the block queues.

Introduce the missing writeback wait queue for NFS; otherwise its
writeback pages will grow greedily, exhausting all PG_dirty pages.

Tests show that it can effectively reduce stalls in the disk-network
pipeline, improve performance and reduce delays.

The test cases are basically

	for run in 1 2 3
	for nr_dd in 1 10 100
	for dirty_thresh in 10M 100M 1000M 2G
		start $nr_dd dd's writing to a 1-disk mem=12G NFS server

During all tests, nfs_congestion_kb is set to 1/8 dirty_thresh.

  3.2.0-rc1  3.2.0-rc1-ioless-full+
(w/o patch)  (w/ patch)
-----------  ------------------------
      20.66     +136.7%         48.90  thresh=1000M/nfs-100dd-1
      20.82     +147.5%         51.52  thresh=1000M/nfs-100dd-2
      20.57     +129.8%         47.26  thresh=1000M/nfs-100dd-3
      35.96      +96.5%         70.67  thresh=1000M/nfs-10dd-1
      37.47      +89.1%         70.85  thresh=1000M/nfs-10dd-2
      34.55     +106.1%         71.21  thresh=1000M/nfs-10dd-3
      58.24      +28.2%         74.63  thresh=1000M/nfs-1dd-1
      59.83      +18.6%         70.93  thresh=1000M/nfs-1dd-2
      58.30      +31.4%         76.61  thresh=1000M/nfs-1dd-3
      23.69      -10.0%         21.33  thresh=100M/nfs-100dd-1
      23.59       -1.7%         23.19  thresh=100M/nfs-100dd-2
      23.94       -1.0%         23.70  thresh=100M/nfs-100dd-3
      27.06       -0.0%         27.06  thresh=100M/nfs-10dd-1
      25.43       +4.8%         26.66  thresh=100M/nfs-10dd-2
      27.21       -0.8%         26.99  thresh=100M/nfs-10dd-3
      53.82       +4.4%         56.17  thresh=100M/nfs-1dd-1
      55.80       +4.2%         58.12  thresh=100M/nfs-1dd-2
      55.75       +2.9%         57.37  thresh=100M/nfs-1dd-3
      15.47       +1.3%         15.68  thresh=10M/nfs-10dd-1
      16.09       -3.5%         15.53  thresh=10M/nfs-10dd-2
      15.09       -0.9%         14.96  thresh=10M/nfs-10dd-3
      26.65      +13.0%         30.10  thresh=10M/nfs-1dd-1
      25.09       +7.7%         27.02  thresh=10M/nfs-1dd-2
      27.16       +3.3%         28.06  thresh=10M/nfs-1dd-3
      27.51      +78.6%         49.11  thresh=2G/nfs-100dd-1
      22.46     +131.6%         52.01  thresh=2G/nfs-100dd-2
      12.95     +289.8%         50.50  thresh=2G/nfs-100dd-3
      42.28      +81.0%         76.52  thresh=2G/nfs-10dd-1
      40.33      +78.8%         72.10  thresh=2G/nfs-10dd-2
      42.52      +67.6%         71.27  thresh=2G/nfs-10dd-3
      62.27      +34.6%         83.84  thresh=2G/nfs-1dd-1
      60.10      +35.6%         81.48  thresh=2G/nfs-1dd-2
      66.29      +17.5%         77.88  thresh=2G/nfs-1dd-3
    1164.97      +41.6%       1649.19  TOTAL write_bw

The local queue time for WRITE RPCs could be reduced by several orders of
magnitude!

  3.2.0-rc1  3.2.0-rc1-ioless-full+
-----------  ------------------------
   90226.82      -99.9%         92.07  thresh=1000M/nfs-100dd-1
   88904.27      -99.9%         80.21  thresh=1000M/nfs-100dd-2
   97436.73      -99.9%         87.32  thresh=1000M/nfs-100dd-3
   62167.19      -99.3%        444.25  thresh=1000M/nfs-10dd-1
   64150.34      -99.2%        539.38  thresh=1000M/nfs-10dd-2
   78675.54      -99.3%        540.27  thresh=1000M/nfs-10dd-3
    5372.84      +57.8%       8477.45  thresh=1000M/nfs-1dd-1
   10245.66      -51.2%       4995.71  thresh=1000M/nfs-1dd-2
    4744.06     +109.1%       9919.55  thresh=1000M/nfs-1dd-3
    1727.29       -9.6%       1562.16  thresh=100M/nfs-100dd-1
    2183.49       +4.4%       2280.21  thresh=100M/nfs-100dd-2
    2201.49       +3.7%       2281.92  thresh=100M/nfs-100dd-3
    6213.73      +19.9%       7448.13  thresh=100M/nfs-10dd-1
    8127.01       +3.2%       8387.06  thresh=100M/nfs-10dd-2
    7255.35       +4.4%       7571.11  thresh=100M/nfs-10dd-3
    1144.67      +20.4%       1378.01  thresh=100M/nfs-1dd-1
    1010.02      +19.0%       1202.22  thresh=100M/nfs-1dd-2
     906.33      +15.8%       1049.76  thresh=100M/nfs-1dd-3
     642.82      +17.3%        753.80  thresh=10M/nfs-10dd-1
     766.82      -21.7%        600.18  thresh=10M/nfs-10dd-2
     575.95      +16.5%        670.85  thresh=10M/nfs-10dd-3
      21.91      +71.0%         37.47  thresh=10M/nfs-1dd-1
      16.70     +105.3%         34.29  thresh=10M/nfs-1dd-2
      19.05      -71.3%          5.47  thresh=10M/nfs-1dd-3
  123877.11      -99.0%       1187.27  thresh=2G/nfs-100dd-1
  122353.65      -98.8%       1505.84  thresh=2G/nfs-100dd-2
  101140.82      -98.4%       1641.03  thresh=2G/nfs-100dd-3
   78248.51      -98.9%        892.00  thresh=2G/nfs-10dd-1
   84589.42      -98.6%       1212.17  thresh=2G/nfs-10dd-2
   89684.95      -99.4%        495.28  thresh=2G/nfs-10dd-3
   10405.39       -6.9%       9684.57  thresh=2G/nfs-1dd-1
   16151.86      -48.5%       8316.69  thresh=2G/nfs-1dd-2
   16119.17      -49.0%       8214.84  thresh=2G/nfs-1dd-3
 1177306.98      -92.1%      93588.50  TOTAL nfs_write_queue_time
The average COMMIT size is not impacted too much.

  3.2.0-rc1  3.2.0-rc1-ioless-full+
-----------  ------------------------
       5.56      +44.9%          8.06  thresh=1000M/nfs-100dd-1
       4.14     +109.1%          8.67  thresh=1000M/nfs-100dd-2
       5.46      +16.3%          6.35  thresh=1000M/nfs-100dd-3
      52.04       -8.4%         47.70  thresh=1000M/nfs-10dd-1
      52.33      -13.8%         45.09  thresh=1000M/nfs-10dd-2
      51.72       -9.2%         46.98  thresh=1000M/nfs-10dd-3
     484.63       -8.6%        443.16  thresh=1000M/nfs-1dd-1
     492.42       -8.2%        452.26  thresh=1000M/nfs-1dd-2
     493.13      -11.4%        437.15  thresh=1000M/nfs-1dd-3
      32.52      -72.9%          8.80  thresh=100M/nfs-100dd-1
      36.15      +26.1%         45.58  thresh=100M/nfs-100dd-2
      38.33       +0.4%         38.49  thresh=100M/nfs-100dd-3
       5.67       +0.5%          5.69  thresh=100M/nfs-10dd-1
       5.74       -1.1%          5.68  thresh=100M/nfs-10dd-2
       5.69       +0.9%          5.74  thresh=100M/nfs-10dd-3
      44.91       -1.0%         44.45  thresh=100M/nfs-1dd-1
      44.22       -0.6%         43.96  thresh=100M/nfs-1dd-2
      44.18       +0.2%         44.28  thresh=100M/nfs-1dd-3
       1.42       +1.1%          1.43  thresh=10M/nfs-10dd-1
       1.48       +0.3%          1.48  thresh=10M/nfs-10dd-2
       1.43       -1.0%          1.42  thresh=10M/nfs-10dd-3
       5.51       -6.8%          5.14  thresh=10M/nfs-1dd-1
       5.91       -8.1%          5.43  thresh=10M/nfs-1dd-2
       5.44       +3.0%          5.61  thresh=10M/nfs-1dd-3
       8.80       +6.6%          9.38  thresh=2G/nfs-100dd-1
       8.51      +65.2%         14.06  thresh=2G/nfs-100dd-2
      15.28      -13.2%         13.27  thresh=2G/nfs-100dd-3
     105.12      -24.9%         78.99  thresh=2G/nfs-10dd-1
     101.90       -9.1%         92.60  thresh=2G/nfs-10dd-2
     106.24      -29.7%         74.65  thresh=2G/nfs-10dd-3
     909.85       +0.4%        913.68  thresh=2G/nfs-1dd-1
    1030.45      -18.3%        841.68  thresh=2G/nfs-1dd-2
    1016.56      -11.6%        898.36  thresh=2G/nfs-1dd-3
    5222.74      -10.1%       4695.25  TOTAL nfs_commit_size
And here is the list of overall numbers.

  3.2.0-rc1  3.2.0-rc1-ioless-full+
-----------  ------------------------
    1164.97      +41.6%       1649.19  TOTAL write_bw
   54799.00      +25.0%      68500.00  TOTAL nfs_nr_commits
 3543263.00       -3.3%    3425418.00  TOTAL nfs_nr_writes
    5222.74      -10.1%       4695.25  TOTAL nfs_commit_size
       7.62      +89.2%         14.42  TOTAL nfs_write_size
 1177306.98      -92.1%      93588.50  TOTAL nfs_write_queue_time
    5977.02      -16.0%       5019.34  TOTAL nfs_write_rtt_time
 1183360.15      -91.7%      98645.74  TOTAL nfs_write_execute_time
   51186.59      -62.5%      19170.98  TOTAL nfs_commit_queue_time
   81801.14       +3.6%      84735.19  TOTAL nfs_commit_rtt_time
  133015.32      -21.9%     103926.05  TOTAL nfs_commit_execute_time

[Feng: throttle at the coarser granularity of each ->writepages call
rather than on each page, for better performance and to avoid a
throttled-before-send-rpc deadlock]

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/client.c | 2
fs/nfs/write.c | 84 +++++++++++++++++++++++++++++++-----
include/linux/nfs_fs_sb.h | 1
3 files changed, 77 insertions(+), 10 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2011-10-20 23:08:17.000000000 +0800
+++ linux-next/fs/nfs/write.c 2011-10-20 23:45:59.000000000 +0800
@@ -190,11 +190,64 @@ static int wb_priority(struct writeback_
  * NFS congestion control
  */
+#define NFS_WAIT_PAGES (1024L >> (PAGE_SHIFT - 10))
 int nfs_congestion_kb;
-#define NFS_CONGESTION_ON_THRESH (nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH \
-	(NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, struct backing_dev_info *bdi)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_ASYNC);
+	else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_congested(int is_sync,
+			       struct backing_dev_info *bdi,
+			       wait_queue_head_t *wqh)
+{
+	int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+	DEFINE_WAIT(wait);
+
+	if (!test_bit(waitbit, &bdi->state))
+		return;
+
+	for (;;) {
+		prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+		if (!test_bit(waitbit, &bdi->state))
+			break;
+
+		io_schedule();
+	}
+	finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr,
+				 struct backing_dev_info *bdi,
+				 wait_queue_head_t *wqh)
+{
+	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+	if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_sync_congested, &bdi->state))
+			clear_bdi_congested(bdi, BLK_RW_SYNC);
+		if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+			wake_up(&wqh[BLK_RW_SYNC]);
+	}
+	if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_async_congested, &bdi->state))
+			clear_bdi_congested(bdi, BLK_RW_ASYNC);
+		if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+			wake_up(&wqh[BLK_RW_ASYNC]);
+	}
+}
 static int nfs_set_page_writeback(struct page *page)
 {
@@ -205,11 +258,8 @@ static int nfs_set_page_writeback(struct
 		struct nfs_server *nfss = NFS_SERVER(inode);
 		page_cache_get(page);
-		if (atomic_long_inc_return(&nfss->writeback) >
-				NFS_CONGESTION_ON_THRESH) {
-			set_bdi_congested(&nfss->backing_dev_info,
-						BLK_RW_ASYNC);
-		}
+		nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+				  &nfss->backing_dev_info);
 	}
 	return ret;
 }
@@ -221,8 +271,10 @@ static void nfs_end_page_writeback(struc
 	end_page_writeback(page);
 	page_cache_release(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+	nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+			     &nfss->backing_dev_info,
+			     nfss->writeback_wait);
 }
 static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock)
@@ -323,10 +375,17 @@ static int nfs_writepage_locked(struct p
 int nfs_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 	ret = nfs_writepage_locked(page, wbc);
 	unlock_page(page);
+
+	nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
@@ -342,6 +401,7 @@ static int nfs_writepages_callback(struc
 int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	unsigned long *bitlock = &NFS_I(inode)->flags;
 	struct nfs_pageio_descriptor pgio;
 	int err;
@@ -358,6 +418,10 @@ int nfs_writepages(struct address_space
 	err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
 	nfs_pageio_complete(&pgio);
+	nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
 	smp_mb__after_clear_bit();
 	wake_up_bit(bitlock, NFS_INO_FLUSHING);
--- linux-next.orig/include/linux/nfs_fs_sb.h 2011-10-20 23:08:17.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h 2011-10-20 23:45:12.000000000 +0800
@@ -102,6 +102,7 @@ struct nfs_server {
 	struct nfs_iostats __percpu *io_stats; /* I/O statistics */
 	struct backing_dev_info backing_dev_info;
 	atomic_long_t writeback; /* number of writeback pages */
+	wait_queue_head_t writeback_wait[2];
 	int flags; /* various flags */
 	unsigned int caps; /* server capabilities */
 	unsigned int rsize; /* read size */
--- linux-next.orig/fs/nfs/client.c 2011-10-20 23:08:17.000000000 +0800
+++ linux-next/fs/nfs/client.c 2011-10-20 23:45:12.000000000 +0800
@@ -1066,6 +1066,8 @@ static struct nfs_server *nfs_alloc_serv
 	INIT_LIST_HEAD(&server->layouts);
 	atomic_set(&server->active, 0);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
 	server->io_stats = nfs_alloc_iostats();
 	if (!server->io_stats) {