public inbox for linux-nfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Peter Staubach <staubach@redhat.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>,
	NFS list <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH] flow control for WRITE requests
Date: Fri, 23 Jan 2009 16:00:29 -0500	[thread overview]
Message-ID: <497A2FED.4080407@redhat.com> (raw)
In-Reply-To: <BE8426F8-1170-4484-A2A0-B7989F3D3EBE@oracle.com>

Chuck Lever wrote:
> How does this play with memory pressure?
>
> I would expect the generic background page flushers, like pdflush and 
> kswapd, could still cause out-of-order writes.
>

I believe that they all end up calling into the NFS client
code via the nfs_writepages() interface with the writeback
control struct set to write out pages sequentially, possibly
limiting the number of pages to be written, but usually not.

I ran this code on a 1GB system, writing out 2GB+ files, and
the only out of order issues that I saw were generated by
the pdflush and the application doing the writing itself
intermixing the generation of WRITE requests.  This was
addressed by the introduction of NFS_INO_FLUSHING and the
code in nfs_writepages() which uses it.

I also ran the code on a 4GB system with the changes,
writing the same sized files, andno out of order WRITE
requests were seen.

I believe that Nick Piggin solved a bunch of these issues
recently and the NFS_INO_FLUSHING patch along with his
patches seems to have resolved the issues.

    Thanx...

       ps

> On Jan 23, 2009, at Jan 23, 2009, 2:10 PM, Peter Staubach wrote:
>
>> Hi.
>>
>> Attached is a patch which implements some flow control for the
>> NFS client to control dirty pages.  The flow control is
>> implemented on a per-file basis and causes dirty pages to be
>> written out when the client can detect that the application is
>> writing in a serial fashion and has dirtied enough pages to
>> fill a complete over the wire transfer.
>>
>> This work was precipitated by working on a situation where a
>> server at a customer site was not able to adequately handle
>> the behavior of the Linux NFS client.  This particular server
>> required that all data to the file written to the file be
>> written in a strictly serial fashion.  It also had problems
>> handling the Linux NFS client semantic of caching a large
>> amount of data and then sending out that data all at once.
>>
>> The sequential ordering problem was resolved by a previous
>> patch which was submitted to the linux-nfs list.  This patch
>> addresses the capacity problem.
>>
>> The problem is resolved by sending WRITE requests much
>> earlier in the process of the application writing to the file.
>> The client keeps track of the number of dirty pages associated
>> with the file and also the last offset of the data being
>> written.  When the client detects that a full over the wire
>> transfer could be constructed and that the application is
>> writing sequentially, then it generates an UNSTABLE write to
>> server for the currently dirty data.
>>
>> The client also keeps track of the number of these WRITE
>> requests which have been generated.  It flow controls based
>> on a configurable maximum.  This keeps the client from
>> completely overwhelming the server.
>>
>> A nice side effect of the framework is that the issue of
>> stat()'ing a file being written can be handled much more
>> quickly than before.  The amount of data that must be
>> transmitted to the server to satisfy the "latest mtime"
>> requirement is limited.  Also, the application writing to
>> the file is blocked until the over the wire GETATTR is
>> completed.  This allows the GETATTR to be send and the
>> response received without competing with the data being
>> written.
>>
>> I did not do formal performance testing, but in my
>> watching testing, I did not see any performance regressions.
>>
>> As a side note -- the more natural model of flow control
>> would seem to be at the client/server level instead of
>> the per-file level.  However, that level was too coarse
>> with the particular server that was required to be used
>> because its requirements were at the per-file level.
>>
>>   Thanx...
>>
>>      ps
>>
>> Signed-off-by: Peter Staubach <staubach@redhat.com>
>> --- linux-2.6.28.i686/fs/nfs/inode.c.org
>> +++ linux-2.6.28.i686/fs/nfs/inode.c
>> @@ -486,8 +486,10 @@ void nfs_setattr_update_inode(struct ino
>> int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct 
>> kstat *stat)
>> {
>>     struct inode *inode = dentry->d_inode;
>> -    int need_atime = NFS_I(inode)->cache_validity & 
>> NFS_INO_INVALID_ATIME;
>> +    struct nfs_inode *nfsi = NFS_I(inode);
>> +    int need_atime = nfsi->cache_validity & NFS_INO_INVALID_ATIME;
>>     int err;
>> +    int inc_outstanding_writes = nfs_max_outstanding_writes;
>>
>>     /*
>>      * Flush out writes to the server in order to update c/mtime.
>> @@ -497,9 +499,8 @@ int nfs_getattr(struct vfsmount *mnt, st
>>      * nfs_wb_nocommit.
>>      */
>>     if (S_ISREG(inode->i_mode)) {
>> -        mutex_lock(&inode->i_mutex);
>> +        atomic_add(inc_outstanding_writes, &nfsi->writes);
>>         nfs_wb_nocommit(inode);
>> -        mutex_unlock(&inode->i_mutex);
>>     }
>>
>>     /*
>> @@ -523,6 +524,10 @@ int nfs_getattr(struct vfsmount *mnt, st
>>         generic_fillattr(inode, stat);
>>         stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
>>     }
>> +    if (S_ISREG(inode->i_mode)) {
>> +        atomic_sub(inc_outstanding_writes, &nfsi->writes);
>> +        wake_up(&nfsi->writes_wq);
>> +    }
>>     return err;
>> }
>>
>> @@ -1288,9 +1293,13 @@ static void init_once(void *foo)
>>     INIT_RADIX_TREE(&nfsi->nfs_page_tree, GFP_ATOMIC);
>>     nfsi->ncommit = 0;
>>     nfsi->npages = 0;
>> +    atomic_set(&nfsi->ndirty, 0);
>>     atomic_set(&nfsi->silly_count, 1);
>>     INIT_HLIST_HEAD(&nfsi->silly_list);
>>     init_waitqueue_head(&nfsi->waitqueue);
>> +    atomic_set(&nfsi->writes, 0);
>> +    init_waitqueue_head(&nfsi->writes_wq);
>> +    nfsi->wrpos = 0;
>>     nfs4_init_once(nfsi);
>> }
>>
>> --- linux-2.6.28.i686/fs/nfs/write.c.org
>> +++ linux-2.6.28.i686/fs/nfs/write.c
>> @@ -197,7 +197,9 @@ static int nfs_set_page_writeback(struct
>>     if (!ret) {
>>         struct inode *inode = page->mapping->host;
>>         struct nfs_server *nfss = NFS_SERVER(inode);
>> +        struct nfs_inode *nfsi = NFS_I(inode);
>>
>> +        atomic_dec(&nfsi->ndirty);
>>         if (atomic_long_inc_return(&nfss->writeback) >
>>                 NFS_CONGESTION_ON_THRESH)
>>             set_bdi_congested(&nfss->backing_dev_info, WRITE);
>> @@ -310,6 +312,33 @@ static int nfs_writepages_callback(struc
>>     return ret;
>> }
>>
>> +int nfs_max_outstanding_writes = 16;
>> +
>> +static void nfs_inc_outstanding_writes(struct inode *inode)
>> +{
>> +    struct nfs_inode *nfsi = NFS_I(inode);
>> +
>> +    atomic_inc(&nfsi->writes);
>> +}
>> +
>> +static void nfs_dec_outstanding_writes(struct inode *inode)
>> +{
>> +    struct nfs_inode *nfsi = NFS_I(inode);
>> +
>> +    if (atomic_dec_return(&nfsi->writes) < nfs_max_outstanding_writes)
>> +        wake_up(&nfsi->writes_wq);
>> +}
>> +
>> +void nfs_wait_for_outstanding_writes(struct inode *inode)
>> +{
>> +    struct nfs_inode *nfsi = NFS_I(inode);
>> +
>> +    if (nfs_max_outstanding_writes) {
>> +        wait_event(nfsi->writes_wq,
>> +            atomic_read(&nfsi->writes) < nfs_max_outstanding_writes);
>> +    }
>> +}
>> +
>> int nfs_writepages(struct address_space *mapping, struct 
>> writeback_control *wbc)
>> {
>>     struct inode *inode = mapping->host;
>> @@ -369,6 +398,7 @@ static int nfs_inode_add_request(struct
>>     SetPagePrivate(req->wb_page);
>>     set_page_private(req->wb_page, (unsigned long)req);
>>     nfsi->npages++;
>> +    atomic_inc(&nfsi->ndirty);
>>     kref_get(&req->wb_kref);
>>     radix_tree_tag_set(&nfsi->nfs_page_tree, req->wb_index,
>>                 NFS_PAGE_TAG_LOCKED);
>> @@ -405,6 +435,10 @@ static void nfs_inode_remove_request(str
>> static void
>> nfs_mark_request_dirty(struct nfs_page *req)
>> {
>> +    struct inode *inode = req->wb_context->path.dentry->d_inode;
>> +    struct nfs_inode *nfsi = NFS_I(inode);
>> +
>> +    atomic_inc(&nfsi->ndirty);
>>     __set_page_dirty_nobuffers(req->wb_page);
>> }
>>
>> @@ -633,6 +667,7 @@ static struct nfs_page *nfs_try_to_updat
>>         req->wb_bytes = end - req->wb_offset;
>>     else
>>         req->wb_bytes = rqend - req->wb_offset;
>> +    atomic_inc(&NFS_I(inode)->ndirty);
>> out_unlock:
>>     spin_unlock(&inode->i_lock);
>>     return req;
>> @@ -855,6 +890,8 @@ static int nfs_write_rpcsetup(struct nfs
>>         count,
>>         (unsigned long long)data->args.offset);
>>
>> +    nfs_inc_outstanding_writes(inode);
>> +
>>     task = rpc_run_task(&task_setup_data);
>>     if (IS_ERR(task))
>>         return PTR_ERR(task);
>> @@ -1130,7 +1167,7 @@ int nfs_writeback_done(struct rpc_task *
>>      */
>>     status = NFS_PROTO(data->inode)->write_done(task, data);
>>     if (status != 0)
>> -        return status;
>> +        goto out;
>>     nfs_add_stats(data->inode, NFSIOS_SERVERWRITTENBYTES, resp->count);
>>
>> #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
>> @@ -1186,7 +1223,9 @@ int nfs_writeback_done(struct rpc_task *
>>         /* Can't do anything about it except throw an error. */
>>         task->tk_status = -EIO;
>>     }
>> -    return 0;
>> +out:
>> +    nfs_dec_outstanding_writes(data->inode);
>> +    return status;
>> }
>>
>>
>> @@ -1546,6 +1585,29 @@ int nfs_wb_page(struct inode *inode, str
>>     return nfs_wb_page_priority(inode, page, FLUSH_STABLE);
>> }
>>
>> +/*
>> + * Start the WRITE requests for dirty pages on their way.
>> + * This is used when a sufficient number of dirty pages
>> + * have accumulated.
>> + */
>> +int nfs_wb_interim(struct inode *inode)
>> +{
>> +    struct address_space *mapping = inode->i_mapping;
>> +    struct writeback_control wbc = {
>> +        .bdi = mapping->backing_dev_info,
>> +        .sync_mode = WB_SYNC_NONE,
>> +        .nr_to_write = LONG_MAX,
>> +        .range_start = 0,
>> +        .range_end = LLONG_MAX,
>> +    };
>> +    int ret;
>> +
>> +    ret = nfs_writepages(mapping, &wbc);
>> +    if (ret < 0)
>> +        __mark_inode_dirty(inode, I_DIRTY_PAGES);
>> +    return ret;
>> +}
>> +
>> int __init nfs_init_writepagecache(void)
>> {
>>     nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
>> --- linux-2.6.28.i686/fs/nfs/sysctl.c.org
>> +++ linux-2.6.28.i686/fs/nfs/sysctl.c
>> @@ -58,6 +58,14 @@ static ctl_table nfs_cb_sysctls[] = {
>>         .mode        = 0644,
>>         .proc_handler    = &proc_dointvec,
>>     },
>> +    {
>> +        .ctl_name = CTL_UNNUMBERED,
>> +        .procname = "nfs_max_outstanding_writes",
>> +        .data = &nfs_max_outstanding_writes,
>> +        .maxlen = sizeof(int),
>> +        .mode = 0644,
>> +        .proc_handler = &proc_dointvec,
>> +    },
>>     { .ctl_name = 0 }
>> };
>>
>> --- linux-2.6.28.i686/fs/nfs/file.c.org
>> +++ linux-2.6.28.i686/fs/nfs/file.c
>> @@ -512,11 +512,17 @@ static int nfs_need_sync_write(struct fi
>>     return 0;
>> }
>>
>> +static int nfs_is_serial(struct inode *inode, loff_t pos)
>> +{
>> +    return NFS_I(inode)->wrpos == pos;
>> +}
>> +
>> static ssize_t nfs_file_write(struct kiocb *iocb, const struct iovec 
>> *iov,
>>                 unsigned long nr_segs, loff_t pos)
>> {
>> -    struct dentry * dentry = iocb->ki_filp->f_path.dentry;
>> -    struct inode * inode = dentry->d_inode;
>> +    struct dentry *dentry = iocb->ki_filp->f_path.dentry;
>> +    struct inode *inode = dentry->d_inode;
>> +    struct nfs_inode *nfsi = NFS_I(inode);
>>     ssize_t result;
>>     size_t count = iov_length(iov, nr_segs);
>>
>> @@ -530,6 +536,13 @@ static ssize_t nfs_file_write(struct kio
>>     result = -EBUSY;
>>     if (IS_SWAPFILE(inode))
>>         goto out_swapfile;
>> +
>> +    result = count;
>> +    if (!count)
>> +        goto out;
>> +
>> +    nfs_wait_for_outstanding_writes(inode);
>> +
>>     /*
>>      * O_APPEND implies that we must revalidate the file length.
>>      */
>> @@ -539,17 +552,22 @@ static ssize_t nfs_file_write(struct kio
>>             goto out;
>>     }
>>
>> -    result = count;
>> -    if (!count)
>> -        goto out;
>> -
>>     nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, count);
>>     result = generic_file_aio_write(iocb, iov, nr_segs, pos);
>>     /* Return error values for O_SYNC and IS_SYNC() */
>> -    if (result >= 0 && nfs_need_sync_write(iocb->ki_filp, inode)) {
>> -        int err = nfs_do_fsync(nfs_file_open_context(iocb->ki_filp), 
>> inode);
>> -        if (err < 0)
>> -            result = err;
>> +    if (result >= 0) {
>> +        if (nfs_need_sync_write(iocb->ki_filp, inode)) {
>> +            int err;
>> +            err = nfs_do_fsync(nfs_file_open_context(iocb->ki_filp),
>> +                    inode);
>> +            if (err < 0)
>> +                result = err;
>> +        } else if (nfs_max_outstanding_writes &&
>> +            nfs_is_serial(inode, pos) &&
>> +            atomic_read(&nfsi->ndirty) >= NFS_SERVER(inode)->wpages)
>> +            nfs_wb_interim(inode);
>> +        if (result > 0)
>> +            nfsi->wrpos = pos + result;
>>     }
>> out:
>>     return result;
>> --- linux-2.6.28.i686/include/linux/nfs_fs.h.org
>> +++ linux-2.6.28.i686/include/linux/nfs_fs.h
>> @@ -168,6 +168,7 @@ struct nfs_inode {
>>
>>     unsigned long        ncommit,
>>                 npages;
>> +    atomic_t        ndirty;
>>
>>     /* Open contexts for shared mmap writes */
>>     struct list_head    open_files;
>> @@ -186,6 +187,9 @@ struct nfs_inode {
>>     fmode_t             delegation_state;
>>     struct rw_semaphore    rwsem;
>> #endif /* CONFIG_NFS_V4*/
>> +    atomic_t        writes;    /* number of outstanding WRITEs */
>> +    wait_queue_head_t    writes_wq;
>> +    loff_t            wrpos;    /* position after last WRITE */
>>     struct inode        vfs_inode;
>> };
>>
>> @@ -459,12 +463,14 @@ extern void nfs_unblock_sillyrename(stru
>>  * linux/fs/nfs/write.c
>>  */
>> extern int  nfs_congestion_kb;
>> +extern int  nfs_max_outstanding_writes;
>> extern int  nfs_writepage(struct page *page, struct writeback_control 
>> *wbc);
>> extern int  nfs_writepages(struct address_space *, struct 
>> writeback_control *);
>> extern int  nfs_flush_incompatible(struct file *file, struct page 
>> *page);
>> extern int  nfs_updatepage(struct file *, struct page *, unsigned 
>> int, unsigned int);
>> -extern int nfs_writeback_done(struct rpc_task *, struct 
>> nfs_write_data *);
>> +extern int  nfs_writeback_done(struct rpc_task *, struct 
>> nfs_write_data *);
>> extern void nfs_writedata_release(void *);
>> +extern void nfs_wait_for_outstanding_writes(struct inode *);
>>
>> /*
>>  * Try to write back everything synchronously (but check the
>> @@ -475,6 +481,7 @@ extern int nfs_wb_all(struct inode *inod
>> extern int nfs_wb_nocommit(struct inode *inode);
>> extern int nfs_wb_page(struct inode *inode, struct page* page);
>> extern int nfs_wb_page_cancel(struct inode *inode, struct page* page);
>> +extern int nfs_wb_interim(struct inode *);
>> #if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
>> extern int  nfs_commit_inode(struct inode *, int);
>> extern struct nfs_write_data *nfs_commitdata_alloc(void);
>
> --Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
>


  reply	other threads:[~2009-01-23 21:00 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-01-23 19:10 [PATCH] flow control for WRITE requests Peter Staubach
2009-01-23 19:35 ` Chuck Lever
2009-01-23 21:00   ` Peter Staubach [this message]
2009-01-23 21:13     ` Chuck Lever
2009-01-23 21:23       ` Peter Staubach
2009-01-23 21:29       ` Peter Staubach
2009-01-23 21:38         ` Chuck Lever
2009-01-23 21:49           ` Peter Staubach

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=497A2FED.4080407@redhat.com \
    --to=staubach@redhat.com \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trond.myklebust@fys.uio.no \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox