From: Peter Staubach <staubach@redhat.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>,
NFS list <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH] flow control for WRITE requests
Date: Fri, 23 Jan 2009 16:23:54 -0500 [thread overview]
Message-ID: <497A356A.8000506@redhat.com> (raw)
In-Reply-To: <25989A0A-2E69-4761-B6DA-AE1938B1792E@oracle.com>
Chuck Lever wrote:
> On Jan 23, 2009, at Jan 23, 2009, 4:00 PM, Peter Staubach wrote:
>> Chuck Lever wrote:
>>> How does this play with memory pressure?
>>>
>>> I would expect the generic background page flushers, like pdflush
>>> and kswapd, could still cause out-of-order writes.
>>>
>>
>> I believe that they all end up calling into the NFS client
>> code via the nfs_writepages() interface with the writeback
>> control struct set to write out pages sequentially, possibly
>> limiting the number of pages to be written, but usually not.
>>
>> I ran this code on a 1GB system, writing out 2GB+ files, and
>> the only out of order issues that I saw were generated by
>> the pdflush and the application doing the writing itself
>> intermixing the generation of WRITE requests. This was
>> addressed by the introduction of NFS_INO_FLUSHING and the
>> code in nfs_writepages() which uses it.
>>
>> I also ran the code on a 4GB system with the changes,
>> writing the same sized files, andno out of order WRITE
>> requests were seen.
>>
>> I believe that Nick Piggin solved a bunch of these issues
>> recently and the NFS_INO_FLUSHING patch along with his
>> patches seems to have resolved the issues.
>
> That's good news.
>
> How about ^C -- any weirdness there? pages left pinned on the client,
> fractured writes on the server, application hangs?
>
Nope, nothing like that. Signals should cause things to unwind
normally.
> Have you tried this patch with a multi-threaded write-intensive async
> I/O workload? I'm sure you know which one I'm referring to. ;-)
>
No, probably not. :-)
> Would it make sense to key the new nfs_max_outstanding_writes sysctl
> to the RPC slot table size for that mount point? Another tunable...
> that doesn't self-tune... Ugh. How did you arrive at the default
> value of 16?
>
Yes, I didn't like the new tunable either. I needed to be able
to adjust it though. For my situation, I had a 48GB memory
client, but the right value of the tunable seemed to be, 2.
Sigh.
> CITI had been trying to improve write performance on transatlantic
> 10Gb links. Would be useful to test in that environment.
>
That, I don't know.
ps
>> Thanx...
>>
>> ps
>>
>>> On Jan 23, 2009, at Jan 23, 2009, 2:10 PM, Peter Staubach wrote:
>>>
>>>> Hi.
>>>>
>>>> Attached is a patch which implements some flow control for the
>>>> NFS client to control dirty pages. The flow control is
>>>> implemented on a per-file basis and causes dirty pages to be
>>>> written out when the client can detect that the application is
>>>> writing in a serial fashion and has dirtied enough pages to
>>>> fill a complete over the wire transfer.
>>>>
>>>> This work was precipitated by working on a situation where a
>>>> server at a customer site was not able to adequately handle
>>>> the behavior of the Linux NFS client. This particular server
>>>> required that all data to the file written to the file be
>>>> written in a strictly serial fashion. It also had problems
>>>> handling the Linux NFS client semantic of caching a large
>>>> amount of data and then sending out that data all at once.
>>>>
>>>> The sequential ordering problem was resolved by a previous
>>>> patch which was submitted to the linux-nfs list. This patch
>>>> addresses the capacity problem.
>>>>
>>>> The problem is resolved by sending WRITE requests much
>>>> earlier in the process of the application writing to the file.
>>>> The client keeps track of the number of dirty pages associated
>>>> with the file and also the last offset of the data being
>>>> written. When the client detects that a full over the wire
>>>> transfer could be constructed and that the application is
>>>> writing sequentially, then it generates an UNSTABLE write to
>>>> server for the currently dirty data.
>>>>
>>>> The client also keeps track of the number of these WRITE
>>>> requests which have been generated. It flow controls based
>>>> on a configurable maximum. This keeps the client from
>>>> completely overwhelming the server.
>>>>
>>>> A nice side effect of the framework is that the issue of
>>>> stat()'ing a file being written can be handled much more
>>>> quickly than before. The amount of data that must be
>>>> transmitted to the server to satisfy the "latest mtime"
>>>> requirement is limited. Also, the application writing to
>>>> the file is blocked until the over the wire GETATTR is
>>>> completed. This allows the GETATTR to be send and the
>>>> response received without competing with the data being
>>>> written.
>>>>
>>>> I did not do formal performance testing, but in my
>>>> watching testing, I did not see any performance regressions.
>>>>
>>>> As a side note -- the more natural model of flow control
>>>> would seem to be at the client/server level instead of
>>>> the per-file level. However, that level was too coarse
>>>> with the particular server that was required to be used
>>>> because its requirements were at the per-file level.
>>>>
>>>> Thanx...
>>>>
>>>> ps
>>>>
>>>> Signed-off-by: Peter Staubach <staubach@redhat.com>
>>>> --- linux-2.6.28.i686/fs/nfs/inode.c.org
>>>> +++ linux-2.6.28.i686/fs/nfs/inode.c
>>>> @@ -486,8 +486,10 @@ void nfs_setattr_update_inode(struct ino
>>>> int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct
>>>> kstat *stat)
>>>> {
>>>> struct inode *inode = dentry->d_inode;
>>>> - int need_atime = NFS_I(inode)->cache_validity &
>>>> NFS_INO_INVALID_ATIME;
>>>> + struct nfs_inode *nfsi = NFS_I(inode);
>>>> + int need_atime = nfsi->cache_validity & NFS_INO_INVALID_ATIME;
>>>> int err;
>>>> + int inc_outstanding_writes = nfs_max_outstanding_writes;
>>>>
>>>> /*
>>>> * Flush out writes to the server in order to update c/mtime.
>>>> @@ -497,9 +499,8 @@ int nfs_getattr(struct vfsmount *mnt, st
>>>> * nfs_wb_nocommit.
>>>> */
>>>> if (S_ISREG(inode->i_mode)) {
>>>> - mutex_lock(&inode->i_mutex);
>>>> + atomic_add(inc_outstanding_writes, &nfsi->writes);
>>>> nfs_wb_nocommit(inode);
>>>> - mutex_unlock(&inode->i_mutex);
>>>> }
>>>>
>>>> /*
>>>> @@ -523,6 +524,10 @@ int nfs_getattr(struct vfsmount *mnt, st
>>>> generic_fillattr(inode, stat);
>>>> stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
>>>> }
>>>> + if (S_ISREG(inode->i_mode)) {
>>>> + atomic_sub(inc_outstanding_writes, &nfsi->writes);
>>>> + wake_up(&nfsi->writes_wq);
>>>> + }
>>>> return err;
>>>> }
>>>>
>>>> @@ -1288,9 +1293,13 @@ static void init_once(void *foo)
>>>> INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC);
>>>> nfsi->ncommit = 0;
>>>> nfsi->npages = 0;
>>>> + atomic_set(&nfsi->ndirty, 0);
>>>> atomic_set(&nfsi->silly_count, 1);
>>>> INIT_HLIST_HEAD(&nfsi->silly_list);
>>>> init_waitqueue_head(&nfsi->waitqueue);
>>>> + atomic_set(&nfsi->writes, 0);
>>>> + init_waitqueue_head(&nfsi->writes_wq);
>>>> + nfsi->wrpos = 0;
>>>> nfs4_init_once(nfsi);
>>>> }
>>>>
>>>> --- linux-2.6.28.i686/fs/nfs/write.c.org
>>>> +++ linux-2.6.28.i686/fs/nfs/write.c
>>>> @@ -197,7 +197,9 @@ static int nfs_set_page_writeback(struct
>>>> if (!ret) {
>>>> struct inode *inode = page->mapping->host;
>>>> struct nfs_server *nfss = NFS_SERVER(inode);
>>>> + struct nfs_inode *nfsi = NFS_I(inode);
>>>>
>>>> + atomic_dec(&nfsi->ndirty);
>>>> if (atomic_long_inc_return(&nfss->writeback) >
>>>> NFS_CONGESTION_ON_THRESH)
>>>> set_bdi_congested(&nfss->backing_dev_info, WRITE);
>>>> @@ -310,6 +312,33 @@ static int nfs_writepages_callback(struc
>>>> return ret;
>>>> }
>>>>
>>>> +int nfs_max_outstanding_writes = 16;
>>>> +
>>>> +static void nfs_inc_outstanding_writes(struct inode *inode)
>>>> +{
>>>> + struct nfs_inode *nfsi = NFS_I(inode);
>>>> +
>>>> + atomic_inc(&nfsi->writes);
>>>> +}
>>>> +
>>>> +static void nfs_dec_outstanding_writes(struct inode *inode)
>>>> +{
>>>> + struct nfs_inode *nfsi = NFS_I(inode);
>>>> +
>>>> + if (atomic_dec_return(&nfsi->writes) <
>>>> nfs_max_outstanding_writes)
>>>> + wake_up(&nfsi->writes_wq);
>>>> +}
>>>> +
>>>> +void nfs_wait_for_outstanding_writes(struct inode *inode)
>>>> +{
>>>> + struct nfs_inode *nfsi = NFS_I(inode);
>>>> +
>>>> + if (nfs_max_outstanding_writes) {
>>>> + wait_event(nfsi->writes_wq,
>>>> + atomic_read(&nfsi->writes) < nfs_max_outstanding_writes);
>>>> + }
>>>> +}
>>>> +
>>>> int nfs_writepages(struct address_space *mapping, struct
>>>> writeback_control *wbc)
>>>> {
>>>> struct inode *inode = mapping->host;
>>>> @@ -369,6 +398,7 @@ static int nfs_inode_add_request(struct
>>>> SetPagePrivate(req->wb_page);
>>>> set_page_private(req->wb_page, (unsigned long)req);
>>>> nfsi->npages++;
>>>> + atomic_inc(&nfsi->ndirty);
>>>> kref_get(&req->wb_kref);
>>>> radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index,
>>>> NFS_PAGE_TAG_LOCKED);
>>>> @@ -405,6 +435,10 @@ static void nfs_inode_remove_request(str
>>>> static void
>>>> nfs_mark_request_dirty(struct nfs_page *req)
>>>> {
>>>> + struct inode *inode = req->wb_context->path.dentry->d_inode;
>>>> + struct nfs_inode *nfsi = NFS_I(inode);
>>>> +
>>>> + atomic_inc(&nfsi->ndirty);
>>>> __set_page_dirty_nobuffers(req->wb_page);
>>>> }
>>>>
>>>> @@ -633,6 +667,7 @@ static struct nfs_page *nfs_try_to_updat
>>>> req->wb_bytes = end - req->wb_offset;
>>>> else
>>>> req->wb_bytes = rqend - req->wb_offset;
>>>> + atomic_inc(&NFS_I(inode)->ndirty);
>>>> out_unlock:
>>>> spin_unlock(&inode->i_lock);
>>>> return req;
>>>> @@ -855,6 +890,8 @@ static int nfs_write_rpcsetup(struct nfs
>>>> count,
>>>> (unsigned long long)data->args.offset);
>>>>
>>>> + nfs_inc_outstanding_writes(inode);
>>>> +
>>>> task = rpc_run_task(&task_setup_data);
>>>> if (IS_ERR(task))
>>>> return PTR_ERR(task);
>>>> @@ -1130,7 +1167,7 @@ int nfs_writeback_done(struct rpc_task *
>>>> */
>>>> status = NFS_PROTO(data->inode)->write_done(task, data);
>>>> if (status != 0)
>>>> - return status;
>>>> + goto out;
>>>> nfs_add_stats(data->inode, NFSIOS_SERVERWRITTENBYTES, resp->count);
>>>>
>>>> #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
>>>> @@ -1186,7 +1223,9 @@ int nfs_writeback_done(struct rpc_task *
>>>> /* Can't do anything about it except throw an error. */
>>>> task->tk_status = -EIO;
>>>> }
>>>> - return 0;
>>>> +out:
>>>> + nfs_dec_outstanding_writes(data->inode);
>>>> + return status;
>>>> }
>>>>
>>>>
>>>> @@ -1546,6 +1585,29 @@ int nfs_wb_page(struct inode *inode, str
>>>> return nfs_wb_page_priority(inode, page, FLUSH_STABLE);
>>>> }
>>>>
>>>> +/*
>>>> + * Start the WRITE requests for dirty pages on their way.
>>>> + * This is used when a sufficient number of dirty pages
>>>> + * have accumulated.
>>>> + */
>>>> +int nfs_wb_interim(struct inode *inode)
>>>> +{
>>>> + struct address_space *mapping = inode->i_mapping;
>>>> + struct writeback_control wbc = {
>>>> + .bdi = mapping->backing_dev_info,
>>>> + .sync_mode = WB_SYNC_NONE,
>>>> + .nr_to_write = LONG_MAX,
>>>> + .range_start = 0,
>>>> + .range_end = LLONG_MAX,
>>>> + };
>>>> + int ret;
>>>> +
>>>> + ret = nfs_writepages(mapping, &wbc);
>>>> + if (ret < 0)
>>>> + __mark_inode_dirty(inode, I_DIRTY_PAGES);
>>>> + return ret;
>>>> +}
>>>> +
>>>> int __init nfs_init_writepagecache(void)
>>>> {
>>>> nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
>>>> --- linux-2.6.28.i686/fs/nfs/sysctl.c.org
>>>> +++ linux-2.6.28.i686/fs/nfs/sysctl.c
>>>> @@ -58,6 +58,14 @@ static ctl_table nfs_cb_sysctls[] = {
>>>> .mode = 0644,
>>>> .proc_handler = &proc_dointvec,
>>>> },
>>>> + {
>>>> + .ctl_name = CTL_UNNUMBERED,
>>>> + .procname = "nfs_max_outstanding_writes",
>>>> + .data = &nfs_max_outstanding_writes,
>>>> + .maxlen = sizeof(int),
>>>> + .mode = 0644,
>>>> + .proc_handler = &proc_dointvec,
>>>> + },
>>>> { .ctl_name = 0 }
>>>> };
>>>>
>>>> --- linux-2.6.28.i686/fs/nfs/file.c.org
>>>> +++ linux-2.6.28.i686/fs/nfs/file.c
>>>> @@ -512,11 +512,17 @@ static int nfs_need_sync_write(struct fi
>>>> return 0;
>>>> }
>>>>
>>>> +static int nfs_is_serial(struct inode *inode, loff_t pos)
>>>> +{
>>>> + return NFS_I(inode)->wrpos == pos;
>>>> +}
>>>> +
>>>> static ssize_t nfs_file_write(struct kiocb *iocb, const struct
>>>> iovec *iov,
>>>> unsigned long nr_segs, loff_t pos)
>>>> {
>>>> - struct dentry * dentry = iocb->ki_filp->f_path.dentry;
>>>> - struct inode * inode = dentry->d_inode;
>>>> + struct dentry *dentry = iocb->ki_filp->f_path.dentry;
>>>> + struct inode *inode = dentry->d_inode;
>>>> + struct nfs_inode *nfsi = NFS_I(inode);
>>>> ssize_t result;
>>>> size_t count = iov_length(iov, nr_segs);
>>>>
>>>> @@ -530,6 +536,13 @@ static ssize_t nfs_file_write(struct kio
>>>> result = -EBUSY;
>>>> if (IS_SWAPFILE(inode))
>>>> goto out_swapfile;
>>>> +
>>>> + result = count;
>>>> + if (!count)
>>>> + goto out;
>>>> +
>>>> + nfs_wait_for_outstanding_writes(inode);
>>>> +
>>>> /*
>>>> * O_APPEND implies that we must revalidate the file length.
>>>> */
>>>> @@ -539,17 +552,22 @@ static ssize_t nfs_file_write(struct kio
>>>> goto out;
>>>> }
>>>>
>>>> - result = count;
>>>> - if (!count)
>>>> - goto out;
>>>> -
>>>> nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, count);
>>>> result = generic_file_aio_write(iocb, iov, nr_segs, pos);
>>>> /* Return error values for O_SYNC and IS_SYNC() */
>>>> - if (result >= 0 && nfs_need_sync_write(iocb->ki_filp, inode)) {
>>>> - int err =
>>>> nfs_do_fsync(nfs_file_open_context(iocb->ki_filp), inode);
>>>> - if (err < 0)
>>>> - result = err;
>>>> + if (result >= 0) {
>>>> + if (nfs_need_sync_write(iocb->ki_filp, inode)) {
>>>> + int err;
>>>> + err = nfs_do_fsync(nfs_file_open_context(iocb->ki_filp),
>>>> + inode);
>>>> + if (err < 0)
>>>> + result = err;
>>>> + } else if (nfs_max_outstanding_writes &&
>>>> + nfs_is_serial(inode, pos) &&
>>>> + atomic_read(&nfsi->ndirty) >= NFS_SERVER(inode)->wpages)
>>>> + nfs_wb_interim(inode);
>>>> + if (result > 0)
>>>> + nfsi->wrpos = pos + result;
>>>> }
>>>> out:
>>>> return result;
>>>> --- linux-2.6.28.i686/include/linux/nfs_fs.h.org
>>>> +++ linux-2.6.28.i686/include/linux/nfs_fs.h
>>>> @@ -168,6 +168,7 @@ struct nfs_inode {
>>>>
>>>> unsigned long ncommit,
>>>> npages;
>>>> + atomic_t ndirty;
>>>>
>>>> /* Open contexts for shared mmap writes */
>>>> struct list_head open_files;
>>>> @@ -186,6 +187,9 @@ struct nfs_inode {
>>>> fmode_t delegation_state;
>>>> struct rw_semaphore rwsem;
>>>> #endif /* CONFIG_NFS_V4*/
>>>> + atomic_t writes; /* number of outstanding WRITEs */
>>>> + wait_queue_head_t writes_wq;
>>>> + loff_t wrpos; /* position after last WRITE */
>>>> struct inode vfs_inode;
>>>> };
>>>>
>>>> @@ -459,12 +463,14 @@ extern void nfs_unblock_sillyrename(stru
>>>> * linux/fs/nfs/write.c
>>>> */
>>>> extern int nfs_congestion_kb;
>>>> +extern int nfs_max_outstanding_writes;
>>>> extern int nfs_writepage(struct page *page, struct
>>>> writeback_control *wbc);
>>>> extern int nfs_writepages(struct address_space *, struct
>>>> writeback_control *);
>>>> extern int nfs_flush_incompatible(struct file *file, struct page
>>>> *page);
>>>> extern int nfs_updatepage(struct file *, struct page *, unsigned
>>>> int, unsigned int);
>>>> -extern int nfs_writeback_done(struct rpc_task *, struct
>>>> nfs_write_data *);
>>>> +extern int nfs_writeback_done(struct rpc_task *, struct
>>>> nfs_write_data *);
>>>> extern void nfs_writedata_release(void *);
>>>> +extern void nfs_wait_for_outstanding_writes(struct inode *);
>>>>
>>>> /*
>>>> * Try to write back everything synchronously (but check the
>>>> @@ -475,6 +481,7 @@ extern int nfs_wb_all(struct inode *inod
>>>> extern int nfs_wb_nocommit(struct inode *inode);
>>>> extern int nfs_wb_page(struct inode *inode, struct page* page);
>>>> extern int nfs_wb_page_cancel(struct inode *inode, struct page* page);
>>>> +extern int nfs_wb_interim(struct inode *);
>>>> #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
>>>> extern int nfs_commit_inode(struct inode *, int);
>>>> extern struct nfs_write_data *nfs_commitdata_alloc(void);
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
next prev parent reply other threads:[~2009-01-23 21:24 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-23 19:10 [PATCH] flow control for WRITE requests Peter Staubach
2009-01-23 19:35 ` Chuck Lever
2009-01-23 21:00 ` Peter Staubach
2009-01-23 21:13 ` Chuck Lever
2009-01-23 21:23 ` Peter Staubach [this message]
2009-01-23 21:29 ` Peter Staubach
2009-01-23 21:38 ` Chuck Lever
2009-01-23 21:49 ` Peter Staubach
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=497A356A.8000506@redhat.com \
--to=staubach@redhat.com \
--cc=chuck.lever@oracle.com \
--cc=linux-nfs@vger.kernel.org \
--cc=trond.myklebust@fys.uio.no \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox