[PATCH 0/2] sys_write() should write all valid data

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] sys_write() should write all valid data
@ 2009-05-14 16:18 Vitaly Mayatskikh
  2009-05-14 16:19 ` [PATCH 1/2] Introduce check_readable_bytes() Vitaly Mayatskikh
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Vitaly Mayatskikh @ 2009-05-14 16:18 UTC (permalink / raw)
  To: Josef Bacik; +Cc: sandeen, linux-fsdevel, linux-kernel

There's user-visible misbehavour in sys_write(): when user tries to put
down to disk some data, which crosses boundary of existing memory, sys_write()
either immediately returns with EFAULT or writes first page(s).

Next 2 patches make sys_write()'s behaviour more consistent: it tries now
to write down all what it can.

Vitaly Mayatskikh (2):
  Introduce check_readable_bytes()
  Perform checks in iov_iter_fault_in_readable() with
    check_readable_bytes()

 fs/fuse/file.c          |    6 ++++--
 include/linux/pagemap.h |   35 +++++++++++++++++++++++++++++++++++
 mm/filemap.c            |   13 +++++++++----
 3 files changed, 48 insertions(+), 6 deletions(-)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/2] Introduce check_readable_bytes()
  2009-05-14 16:18 [PATCH 0/2] sys_write() should write all valid data Vitaly Mayatskikh
@ 2009-05-14 16:19 ` Vitaly Mayatskikh
  2009-05-14 17:40   ` Josef Bacik
  2009-05-14 16:19 ` [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes() Vitaly Mayatskikh
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Vitaly Mayatskikh @ 2009-05-14 16:19 UTC (permalink / raw)
  To: Josef Bacik; +Cc: sandeen, linux-fsdevel, linux-kernel

This routine acts almost as fault_in_pages_readable(), but returns
accessible amount of bytes instead of plain OK/FAIL, so callers
may know how many data can be proceeded without GPF.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
---
 include/linux/pagemap.h |   35 +++++++++++++++++++++++++++++++++++
 1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 34da523..f931308 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -439,6 +439,41 @@ static inline int fault_in_pages_readable(const char __user *uaddr, int size)
 	return ret;
 }
 
+/*
+ * check_readable_bytes - check if given amount of bytes can be read
+ * at given address.
+ * @uaddr: address to check
+ * @size: size to check
+ *
+ * Returns 0 when @size is 0, -EFAULT when @uaddr points to
+ * unaccessible region, or count of accessible bytes.
+ */
+static inline int check_readable_bytes(const char __user *uaddr, int size)
+{
+	volatile char c;
+	int ret = 0;
+	long page_begin = (unsigned long)uaddr & PAGE_MASK;
+	long page_end = ((unsigned long)uaddr + size) & PAGE_MASK;
+	long ptr = page_begin;
+
+	if (unlikely(size == 0))
+		goto out;
+
+	while (!ret && ptr <= page_end) {
+		ret = __get_user(c, (const char __user*)ptr);
+		if (!ret)
+			ptr += PAGE_SIZE;
+	}
+
+	if (likely(!ret))
+		ret = size;
+	else
+		ret = (ptr == page_begin) ?
+			-EFAULT : (const char __user *)ptr - uaddr;
+out:
+	return ret;
+}
+
 int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
-- 
1.6.2.2



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-14 16:18 [PATCH 0/2] sys_write() should write all valid data Vitaly Mayatskikh
  2009-05-14 16:19 ` [PATCH 1/2] Introduce check_readable_bytes() Vitaly Mayatskikh
@ 2009-05-14 16:19 ` Vitaly Mayatskikh
  2009-05-15  6:56   ` Andi Kleen
  2009-05-19  8:55   ` Pavel Machek
  2009-05-14 18:02 ` [PATCH 0/2] sys_write() should write all valid data Josef Bacik
  2009-05-15  6:52 ` Andi Kleen
  3 siblings, 2 replies; 22+ messages in thread
From: Vitaly Mayatskikh @ 2009-05-14 16:19 UTC (permalink / raw)
  To: Josef Bacik; +Cc: sandeen, linux-fsdevel, linux-kernel

This patch changes iov_iter_fault_in_readable() to use check_readable_bytes()
instead of fault_in_pages_readable(). It allows iov_iter_fault_in_readable()
callers to know how much data is accessible and proceed with it. It makes
sys_write() more POSIX-friendly.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
---
 fs/fuse/file.c |    6 ++++--
 mm/filemap.c   |   13 +++++++++----
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 06f30e9..c74ef04 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -782,13 +782,15 @@ static ssize_t fuse_fill_write_pages(struct fuse_req *req,
 		pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 		size_t bytes = min_t(size_t, PAGE_CACHE_SIZE - offset,
 				     iov_iter_count(ii));
-
+		long accessible;
 		bytes = min_t(size_t, bytes, fc->max_write - count);
 
  again:
 		err = -EFAULT;
-		if (iov_iter_fault_in_readable(ii, bytes))
+		accessible = iov_iter_fault_in_readable(ii, bytes);
+		if (accessible < 0)
 			break;
+		bytes = accessible;
 
 		err = -ENOMEM;
 		page = grab_cache_page_write_begin(mapping, index, 0);
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..ef598a1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1947,8 +1947,9 @@ EXPORT_SYMBOL(iov_iter_advance);
 
 /*
  * Fault in the first iovec of the given iov_iter, to a maximum length
- * of bytes. Returns 0 on success, or non-zero if the memory could not be
- * accessed (ie. because it is an invalid address).
+ * of bytes. Returns count of accessible bytes on success, or -EFAULT
+ * if the  memory could not be accessed (ie. because it is an invalid
+ * address).
  *
  * writev-intensive code may want this to prefault several iovecs -- that
  * would be possible (callers must not rely on the fact that _only_ the
@@ -1958,7 +1959,7 @@ int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 {
 	char __user *buf = i->iov->iov_base + i->iov_offset;
 	bytes = min(bytes, i->iov->iov_len - i->iov_offset);
-	return fault_in_pages_readable(buf, bytes);
+	return check_readable_bytes(buf, bytes);
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
@@ -2215,6 +2216,7 @@ static ssize_t generic_perform_write(struct file *file,
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */
+		long accessible;
 		void *fsdata;
 
 		offset = (pos & (PAGE_CACHE_SIZE - 1));
@@ -2234,10 +2236,13 @@ again:
 		 * to check that the address is actually valid, when atomic
 		 * usercopies are used, below.
 		 */
-		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+
+		accessible = iov_iter_fault_in_readable(i, bytes);
+		if (accessible < 0) {
 			status = -EFAULT;
 			break;
 		}
+		bytes = accessible;
 
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
-- 
1.6.2.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] Introduce check_readable_bytes()
  2009-05-14 16:19 ` [PATCH 1/2] Introduce check_readable_bytes() Vitaly Mayatskikh
@ 2009-05-14 17:40   ` Josef Bacik
  2009-05-14 17:57     ` Vitaly Mayatskikh
  0 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2009-05-14 17:40 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: Josef Bacik, sandeen, linux-fsdevel, linux-kernel

On Thu, May 14, 2009 at 06:19:00PM +0200, Vitaly Mayatskikh wrote:
> This routine acts almost as fault_in_pages_readable(), but returns
> accessible amount of bytes instead of plain OK/FAIL, so callers
> may know how many data can be proceeded without GPF.
> 
> Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
> ---
>  include/linux/pagemap.h |   35 +++++++++++++++++++++++++++++++++++
>  1 files changed, 35 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 34da523..f931308 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -439,6 +439,41 @@ static inline int fault_in_pages_readable(const char __user *uaddr, int size)
>  	return ret;
>  }
>  
> +/*
> + * check_readable_bytes - check if given amount of bytes can be read
> + * at given address.
> + * @uaddr: address to check
> + * @size: size to check
> + *
> + * Returns 0 when @size is 0, -EFAULT when @uaddr points to
> + * unaccessible region, or count of accessible bytes.
> + */
> +static inline int check_readable_bytes(const char __user *uaddr, int size)
> +{
> +	volatile char c;
> +	int ret = 0;
> +	long page_begin = (unsigned long)uaddr & PAGE_MASK;
> +	long page_end = ((unsigned long)uaddr + size) & PAGE_MASK;
> +	long ptr = page_begin;
> +

Firstly page_begin/page_end/ptr should all be unsigned long.  Secondly page_end
should be uaddr+size-1.

> +	if (unlikely(size == 0))
> +		goto out;
> +
> +	while (!ret && ptr <= page_end) {
> +		ret = __get_user(c, (const char __user*)ptr);
> +		if (!ret)
> +			ptr += PAGE_SIZE;
> +	}
> +
> +	if (likely(!ret))
> +		ret = size;
> +	else
> +		ret = (ptr == page_begin) ?
> +			-EFAULT : (const char __user *)ptr - uaddr;

This isn't quite right either, we do some pointer math, and return that and its
casted as an int?  That seems like a bad idea in general.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] Introduce check_readable_bytes()
  2009-05-14 17:40   ` Josef Bacik
@ 2009-05-14 17:57     ` Vitaly Mayatskikh
  0 siblings, 0 replies; 22+ messages in thread
From: Vitaly Mayatskikh @ 2009-05-14 17:57 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Vitaly Mayatskikh, sandeen, linux-fsdevel, linux-kernel

At Thu, 14 May 2009 13:40:09 -0400, Josef Bacik wrote:

> > +	volatile char c;
> > +	int ret = 0;
> > +	long page_begin = (unsigned long)uaddr & PAGE_MASK;
> > +	long page_end = ((unsigned long)uaddr + size) & PAGE_MASK;
> > +	long ptr = page_begin;
> > +
> 
> Firstly page_begin/page_end/ptr should all be unsigned long.  Secondly page_end
> should be uaddr+size-1.
> 
> > +	if (unlikely(size == 0))
> > +		goto out;
> > +
> > +	while (!ret && ptr <= page_end) {
> > +		ret = __get_user(c, (const char __user*)ptr);
> > +		if (!ret)
> > +			ptr += PAGE_SIZE;
> > +	}
> > +
> > +	if (likely(!ret))
> > +		ret = size;
> > +	else
> > +		ret = (ptr == page_begin) ?
> > +			-EFAULT : (const char __user *)ptr - uaddr;
> 
> This isn't quite right either, we do some pointer math, and return that and its
> casted as an int?  That seems like a bad idea in general.  Thanks,

Did you like the approach in general?

-- 
wbr, Vitaly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] sys_write() should write all valid data
  2009-05-14 16:18 [PATCH 0/2] sys_write() should write all valid data Vitaly Mayatskikh
  2009-05-14 16:19 ` [PATCH 1/2] Introduce check_readable_bytes() Vitaly Mayatskikh
  2009-05-14 16:19 ` [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes() Vitaly Mayatskikh
@ 2009-05-14 18:02 ` Josef Bacik
  2009-05-14 18:48   ` Vitaly Mayatskikh
  2009-05-15  6:52 ` Andi Kleen
  3 siblings, 1 reply; 22+ messages in thread
From: Josef Bacik @ 2009-05-14 18:02 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: Josef Bacik, sandeen, linux-fsdevel, linux-kernel

On Thu, May 14, 2009 at 06:18:59PM +0200, Vitaly Mayatskikh wrote:
> There's user-visible misbehavour in sys_write(): when user tries to put
> down to disk some data, which crosses boundary of existing memory, sys_write()
> either immediately returns with EFAULT or writes first page(s).
> 
> Next 2 patches make sys_write()'s behaviour more consistent: it tries now
> to write down all what it can.
> 
> Vitaly Mayatskikh (2):
>   Introduce check_readable_bytes()
>   Perform checks in iov_iter_fault_in_readable() with
>     check_readable_bytes()
> 
>  fs/fuse/file.c          |    6 ++++--
>  include/linux/pagemap.h |   35 +++++++++++++++++++++++++++++++++++
>  mm/filemap.c            |   13 +++++++++----
>  3 files changed, 48 insertions(+), 6 deletions(-)
>

Ok all in all I don't think this is a good way to handle this problem.
Hopefully somebody smarter than I will speak up, but what you are trying to do
here is have your cake and eat it too.  You want to get the size of what we were
able to fault in and return that, which should be a size_t, but you also want to
throw back an error if something happened, which needs a signed value.  I think
the best way to handle this would be to make check_readable_bytes return size_t,
and then if you get an EFAULT back, have it return 0.  Then the caller can say
"hey I couldn't fault anything in, let me make what I want to fault in smaller",
and then if that fault returns 0 we can exit.  I hope thats helpful/correct :).
Thanks,

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] sys_write() should write all valid data
  2009-05-14 18:02 ` [PATCH 0/2] sys_write() should write all valid data Josef Bacik
@ 2009-05-14 18:48   ` Vitaly Mayatskikh
  2009-05-14 19:05     ` Josef Bacik
  0 siblings, 1 reply; 22+ messages in thread
From: Vitaly Mayatskikh @ 2009-05-14 18:48 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Vitaly Mayatskikh, sandeen, linux-fsdevel, linux-kernel

At Thu, 14 May 2009 14:02:34 -0400, Josef Bacik wrote:

> Ok all in all I don't think this is a good way to handle this problem.
> Hopefully somebody smarter than I will speak up, but what you are trying to do
> here is have your cake and eat it too.  You want to get the size of what we were
> able to fault in and return that, which should be a size_t,

btw, fault_in_pages_readable() has argument `size' of type int...

> but you also want to
> throw back an error if something happened, which needs a signed value.  I think
> the best way to handle this would be to make check_readable_bytes return size_t,
> and then if you get an EFAULT back, have it return 0.  Then the caller can say
> "hey I couldn't fault anything in, let me make what I want to fault in smaller",
> and then if that fault returns 0 we can exit.

Won't it be suboptimal to trap in the same place twice?

> I hope thats helpful/correct :).

-- 
wbr, Vitaly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] sys_write() should write all valid data
  2009-05-14 18:48   ` Vitaly Mayatskikh
@ 2009-05-14 19:05     ` Josef Bacik
  0 siblings, 0 replies; 22+ messages in thread
From: Josef Bacik @ 2009-05-14 19:05 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: Josef Bacik, sandeen, linux-fsdevel, linux-kernel

On Thu, May 14, 2009 at 08:48:18PM +0200, Vitaly Mayatskikh wrote:
> At Thu, 14 May 2009 14:02:34 -0400, Josef Bacik wrote:
> 
> > Ok all in all I don't think this is a good way to handle this problem.
> > Hopefully somebody smarter than I will speak up, but what you are trying to do
> > here is have your cake and eat it too.  You want to get the size of what we were
> > able to fault in and return that, which should be a size_t,
> 
> btw, fault_in_pages_readable() has argument `size' of type int...
>

Yeah I noticed that, it should be fixed to be size_t.

> > but you also want to
> > throw back an error if something happened, which needs a signed value.  I think
> > the best way to handle this would be to make check_readable_bytes return size_t,
> > and then if you get an EFAULT back, have it return 0.  Then the caller can say
> > "hey I couldn't fault anything in, let me make what I want to fault in smaller",
> > and then if that fault returns 0 we can exit.
> 
> Won't it be suboptimal to trap in the same place twice?
>

Yes, but only in the case we fail to fault in the entire range of userspace
pages we wanted, to which I say "meh" :).  Userspace gave us a bad buffer, I
think having optimal performance in this case isn't a high priority.

The fact of the matter is we need to be return'ing a size_t of the number of
bytes we were able to fault in, so 0 is the only sane way to tell us we weren't
successfull in faulting what we wanted in.  Maybe change what you've currently
done to do the above, and then we just return size_t of what we were able to
fault in, and then the next time we loop around when we get a 0 back we know we
couldn't fault anymore in and return -EFAULT then.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/2] sys_write() should write all valid data
  2009-05-14 16:18 [PATCH 0/2] sys_write() should write all valid data Vitaly Mayatskikh
                   ` (2 preceding siblings ...)
  2009-05-14 18:02 ` [PATCH 0/2] sys_write() should write all valid data Josef Bacik
@ 2009-05-15  6:52 ` Andi Kleen
  3 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2009-05-15  6:52 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: Josef Bacik, sandeen, linux-fsdevel, linux-kernel

Vitaly Mayatskikh <v.mayatskih@gmail.com> writes:

> There's user-visible misbehavour in sys_write(): when user tries to
> put down to disk some data, which crosses boundary of existing
> memory, sys_write() either immediately returns with EFAULT or writes
> first page(s).

What's wrong with that? Seems like perfectly fine behaviour to me.
Did it break some program of yours? If yes can you describe the
scenario?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-14 16:19 ` [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes() Vitaly Mayatskikh
@ 2009-05-15  6:56   ` Andi Kleen
  2009-05-15  7:56     ` Vitaly Mayatskikh
  2009-05-19  8:55   ` Pavel Machek
  1 sibling, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-15  6:56 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: Josef Bacik, sandeen, linux-fsdevel, linux-kernel

Vitaly Mayatskikh <v.mayatskih@gmail.com> writes:

> This patch changes iov_iter_fault_in_readable() to use check_readable_bytes()
> instead of fault_in_pages_readable(). It allows iov_iter_fault_in_readable()
> callers to know how much data is accessible and proceed with it. It makes
> sys_write() more POSIX-friendly.

Can you describe how it makes it POSIX friendly?

My understanding was that EFAULT behaviour was undefined in POSIX.

The obvious hole in the patch is that all these checks are not race
free -- they don't pin pages -- so if there's a parallel unmap even
with your change they can still fail in the middle.

-Andi 

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-15  6:56   ` Andi Kleen
@ 2009-05-15  7:56     ` Vitaly Mayatskikh
  2009-05-15  9:38       ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Vitaly Mayatskikh @ 2009-05-15  7:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vitaly Mayatskikh, Josef Bacik, sandeen, linux-fsdevel,
	linux-kernel

At Fri, 15 May 2009 08:56:00 +0200, Andi Kleen wrote:
 
> Can you describe how it makes it POSIX friendly?
> 
> My understanding was that EFAULT behaviour was undefined in POSIX.
> 
> The obvious hole in the patch is that all these checks are not race
> free -- they don't pin pages -- so if there's a parallel unmap even
> with your change they can still fail in the middle.


ptr = mmap(0, page_size, ....);
...
write(fd, ptr + page_size - 256, 512);

Write() will fail here, but it can write first 256 bytes. Previously,
all 512 bytes were proceeded, but last 256 bytes were zeroed, and
sys_write() returned 256. Not very nice too. 

-- 
wbr, Vitaly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-15  7:56     ` Vitaly Mayatskikh
@ 2009-05-15  9:38       ` Andi Kleen
  2009-05-15 11:56         ` Vitaly Mayatskikh
  2009-05-15 13:43         ` Jamie Lokier
  0 siblings, 2 replies; 22+ messages in thread
From: Andi Kleen @ 2009-05-15  9:38 UTC (permalink / raw)
  To: Vitaly Mayatskikh
  Cc: Andi Kleen, Josef Bacik, sandeen, linux-fsdevel, linux-kernel

> ptr = mmap(0, page_size, ....);
> ...
> write(fd, ptr + page_size - 256, 512);
> 
> Write() will fail here, but it can write first 256 bytes. Previously,
> all 512 bytes were proceeded, but last 256 bytes were zeroed, and
> sys_write() returned 256. Not very nice too. 

Is that really something that users rely on? It looks like a seriously
broken user program. Which one is that? (just that I can avoid it :)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-15  9:38       ` Andi Kleen
@ 2009-05-15 11:56         ` Vitaly Mayatskikh
  2009-05-15 12:19           ` Andi Kleen
  2009-05-15 13:43         ` Jamie Lokier
  1 sibling, 1 reply; 22+ messages in thread
From: Vitaly Mayatskikh @ 2009-05-15 11:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vitaly Mayatskikh, Josef Bacik, sandeen, linux-fsdevel,
	linux-kernel

At Fri, 15 May 2009 11:38:38 +0200, Andi Kleen wrote:
 
> Is that really something that users rely on? It looks like a seriously
> broken user program. Which one is that? (just that I can avoid it :)

IIRC, it was first noticed in some sanity test from ltp suite ;)

I'm readiny POSIX spec for write() once again:

http://www.opengroup.org/onlinepubs/9699919799/functions/write.html

"The write() function shall attempt to write nbyte bytes from the
buffer pointed to by buf to the file associated with the open file
descriptor, fildes."

Well, "shall attempt"... It's not clear to me, is it enough just to
validate user's buffer and gave up with error, like we do now?

-- 
wbr, Vitaly

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-15 11:56         ` Vitaly Mayatskikh
@ 2009-05-15 12:19           ` Andi Kleen
  0 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2009-05-15 12:19 UTC (permalink / raw)
  To: Vitaly Mayatskikh
  Cc: Andi Kleen, Josef Bacik, sandeen, linux-fsdevel, linux-kernel

> Well, "shall attempt"... It's not clear to me, is it enough just to
> validate user's buffer and gave up with error, like we do now?

I think it is.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-15  9:38       ` Andi Kleen
  2009-05-15 11:56         ` Vitaly Mayatskikh
@ 2009-05-15 13:43         ` Jamie Lokier
  2009-05-15 14:01           ` Andi Kleen
  2009-05-18  8:31           ` Alan Cox
  1 sibling, 2 replies; 22+ messages in thread
From: Jamie Lokier @ 2009-05-15 13:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vitaly Mayatskikh, Josef Bacik, sandeen, linux-fsdevel,
	linux-kernel

Andi Kleen wrote:
> > ptr = mmap(0, page_size, ....);
> > ...
> > write(fd, ptr + page_size - 256, 512);
> > 
> > Write() will fail here, but it can write first 256 bytes. Previously,
> > all 512 bytes were proceeded, but last 256 bytes were zeroed, and
> > sys_write() returned 256. Not very nice too. 
> 
> Is that really something that users rely on? It looks like a seriously
> broken user program. Which one is that? (just that I can avoid it :)

A few programs set pages read-only, and rely on SIGSEGVs to trigger
mprotect() in the signal handler and thus track dirty pages.

I think the Boehm garbage collector has this option, as do some LISP
interpreters.

System calls don't trigger SIGSEGVs so they can't rely on that when
calling read().  I'm not sure how they handle that.

It would be quite nice if it were safe to call read(), get EFAULT
immediately, or a truncated read() then the next read() gets EFAULT
because it starts at a missing page boundary, and then that's a hint
for the program to consult it's data structures and do it's mprotect()
thing.

Hopefully no programs assume they can do that already, but it would be
nice if they could begin to assume it, instead of checking their data
structure in advance of every read() call.

I don't know of any program which would need the same thing with
write(), but obviously good for symmetry.

-- Jamie

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-15 13:43         ` Jamie Lokier
@ 2009-05-15 14:01           ` Andi Kleen
  2009-05-15 14:37             ` Jamie Lokier
  2009-05-18  8:31           ` Alan Cox
  1 sibling, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-15 14:01 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andi Kleen, Vitaly Mayatskikh, Josef Bacik, sandeen,
	linux-fsdevel, linux-kernel

> It would be quite nice if it were safe to call read(), get EFAULT

Even with that patch it is not safe, not even for write, because
it's racy.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-15 14:01           ` Andi Kleen
@ 2009-05-15 14:37             ` Jamie Lokier
  0 siblings, 0 replies; 22+ messages in thread
From: Jamie Lokier @ 2009-05-15 14:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Vitaly Mayatskikh, Josef Bacik, sandeen, linux-fsdevel,
	linux-kernel

Andi Kleen wrote:
> > It would be quite nice if it were safe to call read(), get EFAULT
> 
> Even with that patch it is not safe, not even for write, because
> it's racy.

That's not necessarily a problem for the applications I described,
because they can be made to only increase access in parallel with
access; decreasing access can be synchronised over the whole program.

But I agree with you, as no application should depend on such subtle
details, which would probably be broken again in a future kernel anyway.

-- Jamie

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-15 13:43         ` Jamie Lokier
  2009-05-15 14:01           ` Andi Kleen
@ 2009-05-18  8:31           ` Alan Cox
  2009-05-18  9:48             ` Jamie Lokier
  1 sibling, 1 reply; 22+ messages in thread
From: Alan Cox @ 2009-05-18  8:31 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andi Kleen, Vitaly Mayatskikh, Josef Bacik, sandeen,
	linux-fsdevel, linux-kernel

> System calls don't trigger SIGSEGVs so they can't rely on that when
> calling read().  I'm not sure how they handle that.

Usually by touching the pages before the syscall. (You can't rely on
-EFAULT either in POSIX, its optional that it bothers to tell you and for
example on MMUless it may well not)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-18  8:31           ` Alan Cox
@ 2009-05-18  9:48             ` Jamie Lokier
  2009-05-18 10:03               ` Alan Cox
  2009-05-18 10:16               ` Andi Kleen
  0 siblings, 2 replies; 22+ messages in thread
From: Jamie Lokier @ 2009-05-18  9:48 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Vitaly Mayatskikh, Josef Bacik, sandeen,
	linux-fsdevel, linux-kernel

Alan Cox wrote:
> > System calls don't trigger SIGSEGVs so they can't rely on that when
> > calling read().  I'm not sure how they handle that.
> 
> Usually by touching the pages before the syscall.

Makes sense.

> (You can't rely on -EFAULT either in POSIX, its optional that it
> bothers to tell you and for example on MMUless it may well not)

On MMUless you wouldn't use page protection as a GC technique :-)
Is EFAULT really optional on systems with page protection?

-- Jamie

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-18  9:48             ` Jamie Lokier
@ 2009-05-18 10:03               ` Alan Cox
  2009-05-18 10:16               ` Andi Kleen
  1 sibling, 0 replies; 22+ messages in thread
From: Alan Cox @ 2009-05-18 10:03 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Andi Kleen, Vitaly Mayatskikh, Josef Bacik, sandeen,
	linux-fsdevel, linux-kernel

> On MMUless you wouldn't use page protection as a GC technique :-)
> Is EFAULT really optional on systems with page protection?

Yes.

It's also not guaranteed under Linux although pretty much all our I/O
paths do bother to report -EFAULT


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-18  9:48             ` Jamie Lokier
  2009-05-18 10:03               ` Alan Cox
@ 2009-05-18 10:16               ` Andi Kleen
  1 sibling, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2009-05-18 10:16 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Alan Cox, Andi Kleen, Vitaly Mayatskikh, Josef Bacik, sandeen,
	linux-fsdevel, linux-kernel

> Is EFAULT really optional on systems with page protection?

It's de facto -- e.g. the x86-64 vsyscalls (getimeofday, time, clock_gettime,
getcpu) never implement it.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes()
  2009-05-14 16:19 ` [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes() Vitaly Mayatskikh
  2009-05-15  6:56   ` Andi Kleen
@ 2009-05-19  8:55   ` Pavel Machek
  1 sibling, 0 replies; 22+ messages in thread
From: Pavel Machek @ 2009-05-19  8:55 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: Josef Bacik, sandeen, linux-fsdevel, linux-kernel

Hi!

> This patch changes iov_iter_fault_in_readable() to use check_readable_bytes()
> instead of fault_in_pages_readable(). It allows iov_iter_fault_in_readable()
> callers to know how much data is accessible and proceed with it. It makes
> sys_write() more POSIX-friendly.

I agree with andi here: yes, your suggested behaviour is 'nicer', but
I don't the code complexity is worth it.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2009-05-19  8:56 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-14 16:18 [PATCH 0/2] sys_write() should write all valid data Vitaly Mayatskikh
2009-05-14 16:19 ` [PATCH 1/2] Introduce check_readable_bytes() Vitaly Mayatskikh
2009-05-14 17:40   ` Josef Bacik
2009-05-14 17:57     ` Vitaly Mayatskikh
2009-05-14 16:19 ` [PATCH 2/2] Perform check in iov_iter_fault_in_readable() by check_readable_bytes() Vitaly Mayatskikh
2009-05-15  6:56   ` Andi Kleen
2009-05-15  7:56     ` Vitaly Mayatskikh
2009-05-15  9:38       ` Andi Kleen
2009-05-15 11:56         ` Vitaly Mayatskikh
2009-05-15 12:19           ` Andi Kleen
2009-05-15 13:43         ` Jamie Lokier
2009-05-15 14:01           ` Andi Kleen
2009-05-15 14:37             ` Jamie Lokier
2009-05-18  8:31           ` Alan Cox
2009-05-18  9:48             ` Jamie Lokier
2009-05-18 10:03               ` Alan Cox
2009-05-18 10:16               ` Andi Kleen
2009-05-19  8:55   ` Pavel Machek
2009-05-14 18:02 ` [PATCH 0/2] sys_write() should write all valid data Josef Bacik
2009-05-14 18:48   ` Vitaly Mayatskikh
2009-05-14 19:05     ` Josef Bacik
2009-05-15  6:52 ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).