public inbox for linux-kernel@vger.kernel.org
From: Chuck Lever <cel@citi.umich.edu>
To: Andrew Morton <akpm@osdl.org>
Cc: cel@netapp.com, linux-kernel@vger.kernel.org, trond.myklebust@fys.uio.no
Subject: Re: [PATCH 3/6] nfs: Eliminate nfs_get_user_pages()
Date: Fri, 19 May 2006 15:18:04 -0400	[thread overview]
Message-ID: <446E19EC.1070902@citi.umich.edu> (raw)
In-Reply-To: <20060519111734.523232b4.akpm@osdl.org>

Andrew Morton wrote:
> Chuck Lever <cel@netapp.com> wrote:
>> Neil Brown observed that the kmalloc() in nfs_get_user_pages() is more
>> likely to fail if the I/O is large enough to require the allocation of more
>> than a single page to keep track of all the pinned pages in the user's
>> buffer.
>>
>> Instead of tracking one large page array per dreq/iocb, track pages per
>> nfs_read/write_data, just like the cached I/O path does.  An array for
>> pages is already allocated for us by nfs_readdata_alloc() (and the write
>> and commit equivalents).
>>
>> This is also required for adding support for vectored I/O to the NFS direct
>> I/O path.
>>
>> The original reason to pin the user buffer and allocate all the NFS data
>> structures before trying to schedule I/O was to ensure all needed resources
>> are allocated on the client before starting to send requests.  This reduces
>> the chance that resource exhaustion on the client will cause a short read
>> or write.
>>
>> On the other hand, for an application making very large application I/O
>> requests, this means that it will be nearly impossible for the application
>> to make forward progress on a resource-limited client.
>>
>> Thus, moving the buffer pinning functionality into the I/O scheduling
>> loops should be good for scalability.  The next patch will do the same for
>> NFS data structure allocation.
>>
>> +static void nfs_release_user_pages(struct page **pages, int npages)
>>  {
>> -	int result = -ENOMEM;
>> -	unsigned long page_count;
>> -	size_t array_size;
>> -
>> -	page_count = (user_addr + size + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> -	page_count -= user_addr >> PAGE_SHIFT;
>> -
>> -	array_size = (page_count * sizeof(struct page *));
>> -	*pages = kmalloc(array_size, GFP_KERNEL);
>> -	if (*pages) {
>> -		down_read(&current->mm->mmap_sem);
>> -		result = get_user_pages(current, current->mm, user_addr,
>> -					page_count, (rw == READ), 0,
>> -					*pages, NULL);
>> -		up_read(&current->mm->mmap_sem);
>> -		if (result != page_count) {
>> -			/*
>> -			 * If we got fewer pages than expected from
>> -			 * get_user_pages(), the user buffer runs off the
>> -			 * end of a mapping; return EFAULT.
>> -			 */
>> -			if (result >= 0) {
>> -				nfs_free_user_pages(*pages, result, 0);
>> -				result = -EFAULT;
>> -			} else
>> -				kfree(*pages);
>> -			*pages = NULL;
>> -		}
>> -	}
>> -	return result;
>> +	int i;
>> +	for (i = 0; i < npages; i++)
>> +		page_cache_release(pages[i]);
>>  }
> 
> If `npages' is negative, this does the right thing.
> 
>> +		result = get_user_pages(current, current->mm, user_addr,
>> +					data->npages, 1, 0, data->pagevec, NULL);
>> +		up_read(&current->mm->mmap_sem);
>> +		if (unlikely(result < data->npages))
>> +			goto out_err;
>> ...
>> +out_err:
>> +	nfs_release_user_pages(data->pagevec, result);
> 
> And `npages' can indeed be negative.

I fixed this by making all of these "unsigned long".
get_user_pages() returns an unsigned long result, so all of these
comparisons should now work correctly.

nfs_count_pages() now also returns an unsigned long, but I don't see how 
it is possible for it to compute a negative value.

> So.  No bug there, but the code is a little unobvious and fragile - if
> someone were to alter a type then subtle bugs would happen.
> 
> Perhaps
> 
> 	if (result > 0)
> 		nfs_release_user_pages(...);
> 
> would be cleaner.  Or at least a loud comment in nfs_release_user_pages().

-- 
corporate:	cel at netapp dot com
personal:	chucklever at bigfoot dot com

Thread overview: 19+ messages
2006-05-19 17:56 [PATCH 0/6] Support scatter/gather I/O in NFS direct I/O path Chuck Lever
2006-05-19 18:00 ` [PATCH 1/6] nfs: "open code" the NFS direct write rescheduler Chuck Lever
2006-05-19 18:10   ` Andrew Morton
2006-05-19 18:37     ` Chuck Lever
2006-05-19 18:46       ` Andrew Morton
2006-05-19 18:56         ` Chuck Lever
2006-05-19 18:00 ` [PATCH 2/6] nfs: remove user_addr and user_count from nfs_direct_req Chuck Lever
2006-05-19 18:00 ` [PATCH 3/6] nfs: Eliminate nfs_get_user_pages() Chuck Lever
2006-05-19 18:17   ` Andrew Morton
2006-05-19 19:18     ` Chuck Lever [this message]
2006-05-19 18:00 ` [PATCH 4/6] nfs: alloc nfs_read/write_data as direct I/O is scheduled Chuck Lever
2006-05-19 18:00 ` [PATCH 5/6] nfs: check all iov segments for correct memory access rights Chuck Lever
2006-05-19 18:22   ` Andrew Morton
2006-05-19 18:46     ` Chuck Lever
2006-05-19 19:36     ` Chuck Lever
2006-05-19 20:07       ` Andrew Morton
2006-05-19 18:25   ` Badari Pulavarty
2006-05-22 11:27   ` Andi Kleen
2006-05-19 18:00 ` [PATCH 6/6] nfs: Support vector I/O throughout the NFS direct I/O path Chuck Lever
