public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
@ 2008-06-17  6:00 Bron Gondwana
  2008-06-17  6:02 ` Bron Gondwana
  0 siblings, 1 reply; 38+ messages in thread
From: Bron Gondwana @ 2008-06-17  6:00 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Linus Torvalds
  Cc:  Rob Mueller

[-- Attachment #1: Type: text/plain, Size: 2957 bytes --]

Background: we recently upgraded one of our 64bit kernel
machines to 2.6.25 and discovered that Cyrus skiplist
files were becoming randomly corrupted.  This is a machine
with a 32bit Debian Etch userland but a 64bit kernel with
32bit support.

We run this way so we can actually use the 12Gb memory as
cache without running out of inode space, but don't have
to support two different sets of userland across our
servers.


The symptom - 16 to 24 bytes of zero appearing "randomly"
within the file after a "checkpoint" (file rewrite,
skipping stale copies of records).

Further investigation by retaining the pre-checkpoint files
showed that those bytes were the last ones of a page in the
original file.


Attached is a small C program which recreates the actions
that Cyrus takes, and uses record lengths identical to a
known-broken skiplist file on our systems.

Using this I was able to bisect the kernel to find the
commit which caused the problem:


08291429cfa6258c4cd95d8833beb40f828b194e is first bad commit
commit 08291429cfa6258c4cd95d8833beb40f828b194e
Author: Nick Piggin <npiggin@suse.de>
Date:   Tue Oct 16 01:24:59 2007 -0700

    mm: fix pagecache write deadlocks

    Modify the core write() code so that it won't take a pagefault while holding a
    lock on the pagecache page. There are a number of different deadlocks possible
    if we try to do such a thing:

    [...]

    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


For Cyrus users, this is a really serious bug - occasionally
the zeros will hit a "navigational component" of the file,
causing crashes and being noticeable.  Most of the time
(including this example) it will just cause silent corruption
and data loss.

I suspect this will be visible to users as large swathes of
messages becoming unread, and if it hits the mailboxes.db,
large swathes of mailboxes just disappearing.  Not good.


I apologise for the length of the attached C program.  I tried
to make it shorter, but kept not tickling the bug.  There's also
another advantage of keeping it like this - it closely mirrors the
Cyrus behaviour, to the point where the output is a valid skiplist
file.

It also has a "magic" mode, just pass a second parameter.  It will
read through the mapped memory in order before the checkpoint.
This makes the bug disappear.

Let me know if there's anything else I can do to make this report
clearer.  I've had a quick glance around the code, but especially
since it's 64 bit kernel only bug (I tested by rebooting the test
machine with a 2.6.25.3 32 bit kernel I had lying around and the
bug was not visible there.  I've also repeated the test on my Ubuntu
desktop machine with the shipped kernel vs my hand-compiled
known-bad kernel, and tested with a 64bit userland in a chroot as
well).

Regards,

Bron.

-- 
  Bron Gondwana
  brong@fastmail.fm


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: maptest.c --]
[-- Type: text/x-csrc; name="maptest.c", Size: 9823 bytes --]

/* This program is designed to test mmap file
 * reads and iovec writes as per cyrus imap's
 * skiplist database usage.  It emulates the
 * behaviour of a skiplist database (sans
 * fcntl locking) across a file checkpoint */

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <fcntl.h>

/* use some definitions out of the cyrus skiplist 
 * code to allow identical behaviour */

#define SKIPLIST_HEADER "\241\002\213\015skiplist file\0\0\0"

typedef unsigned int bit32;

#define WRITEV_ADD_TO_IOVEC(iov, num_iov, s, len) \
    do { (iov)[(num_iov)].iov_base = (s); \
         (iov)[(num_iov)++].iov_len = (len); } while (0)
/* bump to the next multiple of 4 bytes */
#define ROUNDUP(num) (((num) + 3) & 0xFFFFFFFC)

bit32 zero;
bit32 inorder;
bit32 add;
bit32 delete;
bit32 commit;
bit32 dummy;
bit32 max;

#define keylen 16
char *key;
#define vallen 6506
char *val;

void setup_vars() {
  int i;

  /* skiplist magic values */
  zero = htonl(0);
  inorder = htonl(1);
  add = htonl(2);
  delete = htonl(4);
  commit = htonl(255);
  dummy = htonl(257);
  max = htonl((bit32)-1);

  /* our key and value */
  key = malloc(keylen * sizeof(char));
  for (i = 0; i < keylen; i++) {
    key[i] = 'k';
  }

  val = malloc(vallen * sizeof(char));
  for (i = 0; i < vallen; i++) {
    val[i] = 'b';
  }
}

void write_dummy(int fd) {
  char buffer[96];
  int i;

  for (i = 0; i < 96; i++) 
    buffer[i] = 0;

  *(bit32 *)(buffer + 0) = dummy; /* type */
  *(bit32 *)(buffer + 92) = max; /* end ptr list */

  lseek(fd, 48, 0);
  write(fd, buffer, 96);
}

void write_header(int fd) {
  char buffer[48];
  int i;

  for (i = 0; i < 48; i++) 
    buffer[i] = 0;

  strncpy(buffer, SKIPLIST_HEADER, 20);
  *(bit32 *)(buffer + 20) = htonl((bit32)  1); /* version */
  *(bit32 *)(buffer + 24) = htonl((bit32)  2); /* minorversion */
  *(bit32 *)(buffer + 28) = htonl((bit32) 20); /* maxlevel */
  *(bit32 *)(buffer + 32) = htonl((bit32)  1); /* curlevel */
  *(bit32 *)(buffer + 36) = htonl((bit32)  0); /* numrecords */
  *(bit32 *)(buffer + 40) = htonl((bit32)144); /* length at last checkpoint */
  *(bit32 *)(buffer + 44) = htonl((bit32)  0); /* last recovery timestamp */

  lseek(fd, 0, 0);
  write(fd, buffer, 48);
}

int main(int argc, char **argv)
{
  /* files */
  char filename[1024];
  int fd;
  char newfilename[1024];
  int newfd;

  /* io vector */
  struct iovec iov[50];
  int num_iov;

  /* mmap stuff */
  char *mapbase;
  int mapsize;
  int maplen;

  /* network order values */
  bit32 netkeylen;
  bit32 netvallen;
  bit32 netoffset;

  /* misc ints */
  int i;
  int offset;
  int newoffset;
  int size;
  int magic;

  setup_vars();

  if (argc <= 1) {
    printf("NEED A FILE\n");
    exit(1);
  }

  strncpy(filename, argv[1], sizeof(filename));
  unlink(filename);
  fd = open(filename, O_RDWR | O_CREAT, 0666);

  /* STAGE 1 - create a "pre-checkpointed" file that looks just like
   * a real seen file before it gets updated */

  write_header(fd);
  write_dummy(fd);
  fdatasync(fd);

  /* mmap the file now */

  maplen = 1024 * 16;
  mapbase = (char *)mmap((caddr_t)0, maplen, PROT_READ, MAP_SHARED, fd, 0L);

  /* create the initial checkpointed record */

  num_iov = 0;
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&inorder, 4); /* header */

  /* key */
  netkeylen = htonl(keylen);
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&netkeylen, 4); /* length */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, key, keylen); /* data */
  if (ROUNDUP(keylen) > keylen) { /* need to pad */
    WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&zero, ROUNDUP(keylen) - keylen);
  }

  /* value */
  netvallen = htonl(vallen);
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&netvallen, 4); /* length */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, val, vallen); /* data */
  if (ROUNDUP(vallen) > vallen) { /* need to pad */
    WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&zero, ROUNDUP(vallen) - vallen);
  }

  /* only one pointer, and then finish the record off */

  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&zero, 4); /* pointer */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&max, 4); /* end record */

  /* write the record */
  lseek(fd, 0, 2);
  writev(fd, iov, num_iov);

  /* update the dummy pointer */
  lseek(fd, 60, 0);
  netoffset = htonl(144);
  write(fd, (char *)&netoffset, 4); 

  /* update the header */
  netoffset = htonl(0x1a20);
  lseek(fd, 40, 0);
  write(fd, (char *)&netoffset, 4);

  /* sync the record to disk */
  fdatasync(fd);


  /* CHECKPOINT 1 - the file on disk is now 100% identical to 
   * the file that would exist after loading the data above 
   * into a cyrus skiplist file and running the checkpoint code 
   * (ok, except the timestamp)
   */

  /* Cyrus would check the MMAP at this point and discover it 
   * still fits all the data within the "slot", so it doesn't
   * actually remap anything */

  /* STAGE 2 - write a new record - starting by finding the
   * pointers and deleting the old record.  Simplified for 
   * a level 1 node here.  This appends to the file.
   */

  /* find the pointer record - Cyrus does this, so do we */
  for (i = 20; i > 0; i--) {
    offset = ntohl(*(bit32 *)(mapbase + 60 + 4*(i-1)));
    if (offset) break;
  }

  /* test that the keys match */
  if (strncmp(mapbase + offset + 4 + 4, key, keylen)) {
    printf("ODD, key wasn't found\n");
  }

  num_iov = 0;

  /* add the delete record */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&delete, 4); /* header */
  netoffset = htonl(offset);
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&netoffset, 4); /* offset to fix */

  /* move offset along - this is "NEXT" of deleted record, will
   * always be zero because there's only one record */
  if (offset) {
    int klen = ROUNDUP(ntohl(*(bit32 *)(mapbase + offset + 4)));
    int vlen = ROUNDUP(ntohl(*(bit32 *)(mapbase + offset + 4 + 4 + klen)));
    offset = ntohl(*(bit32 *)(mapbase + offset + 4 + 4 + klen + 4 + vlen));
  }
  if (offset != 0) {
    printf("ODD, offset isn't 0 after deleting record\n");
  }

  /* mark the deletion */
  lseek(fd, 60, 0);
  write(fd, (char *)&zero, 4);

  /* set up the new record */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&add, 4); /* header */

  /* key */
  netkeylen = htonl(keylen);
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&netkeylen, 4); /* length */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, key, keylen); /* data */
  if (ROUNDUP(keylen) > keylen) { /* need to pad */
    WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&zero, ROUNDUP(keylen) - keylen);
  }

  /* value */
  netvallen = htonl(vallen);
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&netvallen, 4); /* length */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, val, vallen); /* data */
  if (ROUNDUP(vallen) > vallen) { /* need to pad */
    WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&zero, ROUNDUP(vallen) - vallen);
  }

  /* only one pointer, and then finish the record off */

  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&zero, 4); /* pointer */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&max, 4); /* end record */

  /* write the record */
  offset = lseek(fd, 0, 2);
  writev(fd, iov, num_iov);

  /* update the dummy pointer */
  lseek(fd, 60, 0);
  netoffset = htonl(offset + 8); /* skip the delete */
  write(fd, (char *)&netoffset, 4); 
  fdatasync(fd);
  
  /* commit record goes on the end of the file only once all
   * data changes are fsynced to disk */
  lseek(fd, 0, 2);
  write(fd, (char *)&commit, 4);
  fdatasync(fd);

  /* CHECKPOINT 2 - we've finished writing and committing the new record,
   * and the Cyrus skiplist would checkpoint this file now.  
   * We simulate that here */

  /* again - we fit insize the mmap space, so it doesn't get refreshed */

  /* STAGE 3 - perform a cyrus checkpoint */

  if (argc > 2) {
    /* PERFORM MAGIC - access the mmap in order */
    for (i = 0; i < maplen; i++) {
      magic ^= mapbase[i];
    }
  }

  /* stream to a new file */

  snprintf(newfilename, sizeof(newfilename), "%s.NEW", filename);
  newfd = open(newfilename, O_RDWR | O_CREAT, 0666);
  ftruncate(newfd, 0);

  /* dummy goes first, header gets written last */
  write_dummy(newfd);

  lseek(newfd, 0, 2); /* 144 */

  /* again, Cyrus finds the record first */
  for (i = 20; i > 0; i--) {
    offset = ntohl(*(bit32 *)(mapbase + 60 + 4*(i-1)));
    if (offset) break;
  }

  /* and calculates the length by reading the lengths of the
   * various components and summing them up, finally walking
   * the pointers until it hits a "-1"
   */
  if (offset) {
    int klen = ROUNDUP(ntohl(*(bit32 *)(mapbase + offset + 4)));
    int vlen = ROUNDUP(ntohl(*(bit32 *)(mapbase + offset + 4 + 4 + klen)));
    size = 4 + 4 + klen + 4 + vlen;
    while (ntohl(*(bit32 *)(mapbase + offset + size)) != -1) {
      size += 4;
    }
    size += 4; /* final -1 pointer */
  }
  else {
    /* didn't find the record */
    printf("Failed to find the record, dying\n");
    exit(1);
  }

  num_iov = 0;
  /* write the entire record, but change the TYPE field from
   * whatever it was (ADD or INORDER) to INORDER only */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, (char *)&inorder, 4); /* header */
  WRITEV_ADD_TO_IOVEC(iov, num_iov, mapbase + offset + 4, size - 4); /* data */
  writev(newfd, iov, num_iov);

  /* write the forward pointer */
  lseek(newfd, 60, 0);
  netoffset = htonl(144);
  write(newfd, (char *)&netoffset, 4);

  /* write the zeros - first the one in the new record */
  lseek(newfd, mapbase + offset + size - 8, 0);
  write(newfd, (char *)&zero, 4);

  /* then the rest into the dummy */
  for (i = 1; i < 20; i++) {
    lseek(newfd, 60 + 4*i, 0);
    write(newfd, (char *)&zero, 4);
  }

  /* write the header last */
  write_header(fd);

  close(newfd);
  close(fd);

  exit(0);
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17  6:00 BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable) Bron Gondwana
@ 2008-06-17  6:02 ` Bron Gondwana
  2008-06-17 17:08   ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Bron Gondwana @ 2008-06-17  6:02 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Linus Torvalds
  Cc:  Rob Mueller


On Tue, 17 Jun 2008 16:00:10 +1000, "Bron Gondwana" <brong@fastmail.fm> said:
> I apologise for the length of the attached C program.  I tried
> to make it shorter, but kept not tickling the bug.  There's also
> another advantage of keeping it like this - it closely mirrors the
> Cyrus behaviour, to the point where the output is a valid skiplist
> file.

And I appear to have sent the one without the usage comments at the top.
Here they are:

 * USAGE:
 *    $ gcc -g maptest.c
 *    $ ./a.out z
 *    $ hexdump -C z.NEW
 *
 * Notice the block of zero byte records appearing within
 * the output.
 *
 * Run with a second argument:
 *
 *    $ ./a.out z 1
 *    $ hexdump -C z.NEW
 *
 * No more zeros!

Bron.
-- 
  Bron Gondwana
  brong@fastmail.fm


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17  6:02 ` Bron Gondwana
@ 2008-06-17 17:08   ` Linus Torvalds
  2008-06-17 17:45     ` Linus Torvalds
  2008-06-18  2:21     ` Bron Gondwana
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 17:08 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Rob Mueller



On Tue, 17 Jun 2008, Bron Gondwana wrote:
> 
> And I appear to have sent the one without the usage comments at the top.
> Here they are:

Very interesting. There's certainly something there. 

That said, there's a distracting bug which is visible when doing an strace

 lseek(4, 140333890921392, SEEK_SET)     = -1 EINVAL (Invalid argument)
 write(4, "\0\0\0\0", 4)                 = 4

which is from that

	lseek(newfd, mapbase + offset + size - 8, 0);
	write(newfd, (char *) &zero, 4);

where the addition of "mapbase" is insane. So that will write zeroes to 
the wrong part of the file (offset 64, to be exact). And that will get 
overwritten by the next write, making it all look entirely insane.

That said, that bug may be distracting, but it seems to have nothign at 
all to do with the actual problem. The bug seems to happen only when the 
file is not pre-paged in.

Nick?

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 17:08   ` Linus Torvalds
@ 2008-06-17 17:45     ` Linus Torvalds
  2008-06-17 20:16       ` Linus Torvalds
  2008-06-18  2:21     ` Bron Gondwana
  1 sibling, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 17:45 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Rob Mueller



On Tue, 17 Jun 2008, Linus Torvalds wrote:
> 
> That said, that bug may be distracting, but it seems to have nothign at 
> all to do with the actual problem. The bug seems to happen only when the 
> file is not pre-paged in.

Bron, does this untested patch hide the bug?

> Nick?

I don't think this patch is correct, because it doesn't really fix the 
basic issue (the code should do the right thing even if a page isn't 
there), but it might hide it by faulting in the whole "bytes" range rather 
than just the first iov.

So Nick, it's still over to you, but if this does hide it, then that's an 
interesting detail in itself.

		Linus

---
 mm/filemap.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1e6a7d3..0080a27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1808,9 +1808,20 @@ EXPORT_SYMBOL(iov_iter_advance);
  */
 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 {
-	char __user *buf = i->iov->iov_base + i->iov_offset;
-	bytes = min(bytes, i->iov->iov_len - i->iov_offset);
-	return fault_in_pages_readable(buf, bytes);
+	unsigned long offset = i->iov_offset;
+	const struct iovec *iov = i->iov;
+
+	while (bytes) {
+		char __user *buf = iov->iov_base + offset;
+		size_t n = min(bytes, iov->iov_len - offset);
+
+		if (fault_in_pages_readable(buf, n))
+			return -EFAULT;
+		bytes -= n;
+		offset = 0;
+		iov++;
+	}
+	return 0;	
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 17:45     ` Linus Torvalds
@ 2008-06-17 20:16       ` Linus Torvalds
  2008-06-17 20:41         ` Linus Torvalds
  2008-06-17 20:58         ` Andi Kleen
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 20:16 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Rob Mueller, Andi Kleen, Ingo Molnar



On Tue, 17 Jun 2008, Linus Torvalds wrote:
> 
> Bron, does this untested patch hide the bug?

Ok, it does hide the bug.

> I don't think this patch is correct, because it doesn't really fix the 
> basic issue (the code should do the right thing even if a page isn't 
> there), but it might hide it by faulting in the whole "bytes" range rather 
> than just the first iov.
> 
> So Nick, it's still over to you, but if this does hide it, then that's an 
> interesting detail in itself.

I actually am starting to think that the bug is in 
__copy_to_user_inatomic_nocache().

The return value of that thing in the case of a failure seems to be 
totally wrong. In particular, since that function does an unrolled loop of 
accesses like so:

	.Ls1:   movq (%rsi),%r11
	.Ls2:   movq 1*8(%rsi),%r8
	.Ls3:   movq 2*8(%rsi),%r9
	.Ls4:   movq 3*8(%rsi),%r10
	.Ld1:   movnti %r11,(%rdi)
	.Ld2:   movnti %r8,1*8(%rdi)
	.Ld3:   movnti %r9,2*8(%rdi)
	.Ld4:   movnti %r10,3*8(%rdi)

it may have done three of the 64-bit loads, and then trap on the fouth: 
but since it hasn't done a _single_ of the stores, it shouldn't count as 
any different whether it traps on any of .Ls[1-4]. But that code 
definitely seems to think it makes a difference whether the exception 
happened at Ls1 or at Ls4, even though both points have copied _exactly_ 
as many bytes when the exception happens.

So I'm starting to think the bug is all in there, not in the VM itself. 
See arch/x86/lib/copy_user_nocache.S.

In fact, even the comment there implies that the author didn't know or 
care about returning the correct value.

Andi and Ingo added to Cc.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 20:16       ` Linus Torvalds
@ 2008-06-17 20:41         ` Linus Torvalds
  2008-06-17 21:06           ` Linus Torvalds
  2008-06-17 21:15           ` BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable) Andi Kleen
  2008-06-17 20:58         ` Andi Kleen
  1 sibling, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 20:41 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Rob Mueller, Andi Kleen, Ingo Molnar



On Tue, 17 Jun 2008, Linus Torvalds wrote:
> 
> I actually am starting to think that the bug is in 
> __copy_to_user_inatomic_nocache().

Confirmed.

The uncached user copies are totally broken. The number of bytes left 
uncopied is simply wrong, because of how it does that unrolled loop and 
doesn't account for the fact that just doing loads does not actually 
increase the number of bytes copied at all.

So because the "copy_to_user_inatomic()" logic cares _deeply_ about how 
many bytes were actually copied, when the copy count is wrong, the code 
ends up thinking that it copied more bytes than it actually did, resulting 
in the corruption in the page cache.

Nasty.

That whole file is a mess. Sadly, so is the regular "copy_user_64.S" too 
(it has the same totally broken comment, too!), this is not just the 
uncached version.

And the only reason that it only shows up with the uncached version in 
_practice_ is that the routine that uses the x86 string instructions (ie 
the "rep movsq" in copy_user_generic_string) actually gets this all right. 

So the bug is hidden in that case - which is most CPU's out there (all 
CPU's that have X86_FEATURE_REP_GOOD set).

I don't think that code is reasonably salvageable. Damn. 

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 20:16       ` Linus Torvalds
  2008-06-17 20:41         ` Linus Torvalds
@ 2008-06-17 20:58         ` Andi Kleen
  2008-06-17 21:14           ` Linus Torvalds
  1 sibling, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2008-06-17 20:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar

Linus Torvalds <torvalds@linux-foundation.org> writes:

>
> So I'm starting to think the bug is all in there, not in the VM itself. 
> See arch/x86/lib/copy_user_nocache.S.

The x86-64 copy_*_user functions were always designed to return errors
both ways (as in both for load and for store). That's needed because
the loops are shared for copy_to_user and copy_from_user. That's normally
ok because when you do _to_user you shouldn't fault on the loads
and vice versa. If a caller does that it's buggy.

-Andi

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 20:41         ` Linus Torvalds
@ 2008-06-17 21:06           ` Linus Torvalds
  2008-06-17 21:16             ` Andi Kleen
  2008-06-17 21:20             ` Linus Torvalds
  2008-06-17 21:15           ` BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable) Andi Kleen
  1 sibling, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 21:06 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Rob Mueller, Andi Kleen, Ingo Molnar



On Tue, 17 Jun 2008, Linus Torvalds wrote:
> 
> I don't think that code is reasonably salvageable. Damn. 

Hmm. Something like this *may* salvage it.

Untested, so far (I'll reboot and test soon enough), but even if it fixes 
things, it's not really very good. 

Why?

Because of the unrolling, we may be doing 32 bytes worth of reads from 
user space before doing a *single* write, so if we have a user copy from 
the end of a page (say, 16 bytes from the end), we'd take the fault on the 
third 8-byte load, and not copy anything at all.

So instead of copying 16 bytes and then returning "I copied 16 bytes", it 
will copy _zero_ bytes, and now return "I copied 0 bytes" (it used to 
incorrectly return that it copied 16 bytes because it could _read_ 16 
bytes even though it never wrote them! Totally buggy!).

But I think the mm/filemap.c routines handle this case correctly (because 
of how it probes at least the first iovec), so at least copied will 
generally not be zero. But as mentioned, this isn't tested yet.

It would be better to not do the unrolling, or at least do the faulting 
behaviour correctly so that we fall back on a byte-for-byte copy when it 
faults. But not even the "rep movs" version has ever been _that_ careful, 
so I guess that's ok.

Somebody should _really_ double-check this, and it would be wonderful if 
somebody can come up with something better than this patch.

		Linus
---
 arch/x86/lib/copy_user_64.S         |   25 +++++++++++--------------
 arch/x86/lib/copy_user_nocache_64.S |   25 +++++++++++--------------
 2 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 70bebd3..ee1c3f6 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -217,19 +217,19 @@ ENTRY(copy_user_generic_unrolled)
 	/* table sorted by exception address */
 	.section __ex_table,"a"
 	.align 8
-	.quad .Ls1,.Ls1e
-	.quad .Ls2,.Ls2e
-	.quad .Ls3,.Ls3e
-	.quad .Ls4,.Ls4e
-	.quad .Ld1,.Ls1e
+	.quad .Ls1,.Ls1e	/* Ls1-Ls4 have copied zero bytes */
+	.quad .Ls2,.Ls1e
+	.quad .Ls3,.Ls1e
+	.quad .Ls4,.Ls1e
+	.quad .Ld1,.Ls1e	/* Ld1-Ld4 have copied 0-24 bytes */
 	.quad .Ld2,.Ls2e
 	.quad .Ld3,.Ls3e
 	.quad .Ld4,.Ls4e
-	.quad .Ls5,.Ls5e
-	.quad .Ls6,.Ls6e
-	.quad .Ls7,.Ls7e
-	.quad .Ls8,.Ls8e
-	.quad .Ld5,.Ls5e
+	.quad .Ls5,.Ls5e	/* Ls5-Ls8 have copied 32 bytes */
+	.quad .Ls6,.Ls5e
+	.quad .Ls7,.Ls5e
+	.quad .Ls8,.Ls5e
+	.quad .Ld5,.Ls5e	/* Ld5-Ld8 have copied 32-56 bytes */
 	.quad .Ld6,.Ls6e
 	.quad .Ld7,.Ls7e
 	.quad .Ld8,.Ls8e
@@ -244,11 +244,8 @@ ENTRY(copy_user_generic_unrolled)
 	.quad .Le5,.Le_zero
 	.previous
 
-	/* compute 64-offset for main loop. 8 bytes accuracy with error on the
-	   pessimistic side. this is gross. it would be better to fix the
-	   interface. */
 	/* eax: zero, ebx: 64 */
-.Ls1e: 	addl $8,%eax
+.Ls1e: 	addl $8,%eax		/* eax is bytes left uncopied within the loop (Ls1e: 64 .. Ls8e: 8) */
 .Ls2e: 	addl $8,%eax
 .Ls3e: 	addl $8,%eax
 .Ls4e: 	addl $8,%eax
diff --git a/arch/x86/lib/copy_user_nocache_64.S b/arch/x86/lib/copy_user_nocache_64.S
index 5196762..9d3d1ab 100644
--- a/arch/x86/lib/copy_user_nocache_64.S
+++ b/arch/x86/lib/copy_user_nocache_64.S
@@ -145,19 +145,19 @@ ENTRY(__copy_user_nocache)
 	/* table sorted by exception address */
 	.section __ex_table,"a"
 	.align 8
-	.quad .Ls1,.Ls1e
-	.quad .Ls2,.Ls2e
-	.quad .Ls3,.Ls3e
-	.quad .Ls4,.Ls4e
-	.quad .Ld1,.Ls1e
+	.quad .Ls1,.Ls1e	/* .Ls[1-4] - 0 bytes copied */
+	.quad .Ls2,.Ls1e
+	.quad .Ls3,.Ls1e
+	.quad .Ls4,.Ls1e
+	.quad .Ld1,.Ls1e	/* .Ld[1-4] - 0..24 bytes coped */
 	.quad .Ld2,.Ls2e
 	.quad .Ld3,.Ls3e
 	.quad .Ld4,.Ls4e
-	.quad .Ls5,.Ls5e
-	.quad .Ls6,.Ls6e
-	.quad .Ls7,.Ls7e
-	.quad .Ls8,.Ls8e
-	.quad .Ld5,.Ls5e
+	.quad .Ls5,.Ls5e	/* .Ls[5-8] - 32 bytes copied */
+	.quad .Ls6,.Ls5e
+	.quad .Ls7,.Ls5e
+	.quad .Ls8,.Ls5e
+	.quad .Ld5,.Ls5e	/* .Ld[5-8] - 32..56 bytes copied */
 	.quad .Ld6,.Ls6e
 	.quad .Ld7,.Ls7e
 	.quad .Ld8,.Ls8e
@@ -172,11 +172,8 @@ ENTRY(__copy_user_nocache)
 	.quad .Le5,.Le_zero
 	.previous
 
-	/* compute 64-offset for main loop. 8 bytes accuracy with error on the
-	   pessimistic side. this is gross. it would be better to fix the
-	   interface. */
 	/* eax: zero, ebx: 64 */
-.Ls1e: 	addl $8,%eax
+.Ls1e: 	addl $8,%eax	/* eax: bytes left uncopied: Ls1e: 64 .. Ls8e: 8 */
 .Ls2e: 	addl $8,%eax
 .Ls3e: 	addl $8,%eax
 .Ls4e: 	addl $8,%eax





^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 20:58         ` Andi Kleen
@ 2008-06-17 21:14           ` Linus Torvalds
  2008-06-17 21:26             ` Andi Kleen
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 21:14 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar



On Tue, 17 Jun 2008, Andi Kleen wrote:
> 
> The x86-64 copy_*_user functions were always designed to return errors
> both ways (as in both for load and for store).

That's not the problem, Andi.

The problem is that it returns THE WRONG VALUE!

If the fault happened on the second load, but the first load was never 
actually paired up with a store (because of unrolling the loop), then YOU 
MUST NOT CLAIM THAT YOU DID A 8-BYTE COPY! Because you have copied exactly 
_zero_ bytes, even though you _loaded_ 8 bytes successfully!

See?

Claiming that you copied 8 bytes when you didn't do anything at all is 
WRONG. It's so incredibly wrong that it is scary. 

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 20:41         ` Linus Torvalds
  2008-06-17 21:06           ` Linus Torvalds
@ 2008-06-17 21:15           ` Andi Kleen
  1 sibling, 0 replies; 38+ messages in thread
From: Andi Kleen @ 2008-06-17 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar

Linus Torvalds wrote:
> 
> On Tue, 17 Jun 2008, Linus Torvalds wrote:
>> I actually am starting to think that the bug is in 
>> __copy_to_user_inatomic_nocache().
> 
> Confirmed.
> 
> The uncached user copies are totally broken. The number of bytes left 
> uncopied is simply wrong, because of how it does that unrolled loop and 
> doesn't account for the fact that just doing loads does not actually 
> increase the number of bytes copied at all.

How can a load fault legitimately in copy_to_user?

-Andi



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:06           ` Linus Torvalds
@ 2008-06-17 21:16             ` Andi Kleen
  2008-06-17 21:24               ` Linus Torvalds
  2008-06-17 21:20             ` Linus Torvalds
  1 sibling, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2008-06-17 21:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar

Linus Torvalds wrote:
> 
> On Tue, 17 Jun 2008, Linus Torvalds wrote:
>> I don't think that code is reasonably salvageable. Damn. 
> 
> Hmm. Something like this *may* salvage it.
> 
> Untested, so far (I'll reboot and test soon enough), but even if it fixes 
> things, it's not really very good. 

If that fixes anything:
- The caller is broken because it shouldn't pass a faulting source to copy_to_user()
- And you broken copy_from_user error reporting which shares the same code

-Andi



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:06           ` Linus Torvalds
  2008-06-17 21:16             ` Andi Kleen
@ 2008-06-17 21:20             ` Linus Torvalds
  2008-06-18  2:27               ` Bron Gondwana
                                 ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 21:20 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Rob Mueller, Andi Kleen, Ingo Molnar



On Tue, 17 Jun 2008, Linus Torvalds wrote:
> 
> Hmm. Something like this *may* salvage it.
> 
> Untested, so far (I'll reboot and test soon enough), but even if it fixes 
> things, it's not really very good. 

Ok, so I just rebooted with this, and it does indeed fix the bug.

I'd be happier with a more complete fix (ie being byte-accurate and 
actually doing the partial copy when it hits a fault in the middle), but 
this seems to be the minimal fix, and at least fixes the totally bogus 
return values from the x86-64 __copy_user*() functions.

Not that I checked that I got _all_ cases correct (and maybe there are 
other versions of __copy_user that I missed entirely), but Bron's 
test-case at least seems to work properly for me now.

Bron? If you have a more complete test-suite (ie the real-world case that 
made you find this), it would be good to verify the whole thing. 

And again, this is the kind of thing that really wants others looking at 
it. I personally think that this thing should likely be written as C code, 
with just the inner loops done as asm (ie the inner "rep movsq" and the 
inner 64-byte unrolled cached/uncached copies done as inline asms, and 
then the outer layers could be C).

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:16             ` Andi Kleen
@ 2008-06-17 21:24               ` Linus Torvalds
  2008-06-17 21:30                 ` Andi Kleen
  2008-06-17 21:36                 ` Al Viro
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 21:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar



On Tue, 17 Jun 2008, Andi Kleen wrote:
> 
> If that fixes anything:
> - The caller is broken because it shouldn't pass a faulting source to copy_to_user()
> - And you broken copy_from_user error reporting which shares the same code

Andi, I'm sorry I cc'd you. You are the author of that crap, but the bug 
seems to be that you never even understood what copy_from_user() is 
supposed to do.

The whole *and*only* reason for copy_to/from_user() existing AT ALL is 
exactly the fact that the source or destination access can fault. 

I don't really see why you continually start arguing about things that are 
OBVIOUSLY BUGGY, as if they weren't buggy. Once somebody has debugged a 
buggy routine, you shouldn't argue against it. 

So here's a hint: next time I claim some code of yours is buggy, either 
just acknowledge the bug, or stay silent. You'll look smarter that way.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:14           ` Linus Torvalds
@ 2008-06-17 21:26             ` Andi Kleen
  2008-06-17 21:31               ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2008-06-17 21:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar

Linus Torvalds wrote:
> 
> On Tue, 17 Jun 2008, Andi Kleen wrote:
>> The x86-64 copy_*_user functions were always designed to return errors
>> both ways (as in both for load and for store).
> 
> That's not the problem, Andi.
> 
> The problem is that it returns THE WRONG VALUE!
> 
> If the fault happened on the second load, 

Loads are not supposed to fault in copy_to_user(). Only stores are.

The way it works is that it assumes that either loads fault (when used
as copy_from_user) or stores (copy_to_user), but never both.

> but the first load was never 
> actually paired up with a store (because of unrolling the loop), then YOU 
> MUST NOT CLAIM THAT YOU DID A 8-BYTE COPY! Because you have copied exactly 
> _zero_ bytes, even though you _loaded_ 8 bytes successfully!
> 
> See?
> 
> Claiming that you copied 8 bytes when you didn't do anything at all is 
> WRONG. It's so incredibly wrong that it is scary. 

If your patch fixes something then the main wrong thing is the caller
who passes a faulting source address.

And again it always breaks the other case.

-Andi



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:24               ` Linus Torvalds
@ 2008-06-17 21:30                 ` Andi Kleen
  2008-06-17 21:37                   ` Linus Torvalds
  2008-06-17 21:36                 ` Al Viro
  1 sibling, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2008-06-17 21:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar

Linus Torvalds wrote:
> 
> On Tue, 17 Jun 2008, Andi Kleen wrote:
>> If that fixes anything:
>> - The caller is broken because it shouldn't pass a faulting source to copy_to_user()
>> - And you broken copy_from_user error reporting which shares the same code
> 
> Andi, I'm sorry I cc'd you. You are the author of that crap, but the bug 
> seems to be that you never even understood what copy_from_user() is 
> supposed to do.
> 
> The whole *and*only* reason for copy_to/from_user() existing AT ALL is 
> exactly the fact that the source or destination access can fault. 

yes, but only one of them (destination for copy_to_user and source for
copy_from_user)

Or are you're describing copy_in_user()?

> I don't really see why you continually start arguing about things that are 
> OBVIOUSLY BUGGY, as if they weren't buggy. Once somebody has debugged a 
> buggy routine, you shouldn't argue against it. 
> 
> So here's a hint: next time I claim some code of yours is buggy, either 
> just acknowledge the bug, or stay silent. You'll look smarter that way.

Ok if I'm really wrong on this (but frankly I don't see the mistake, sorry)
for my person edification: what's a legitimate case for copy_to_user()
where the source can fault?

-Andi



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:26             ` Andi Kleen
@ 2008-06-17 21:31               ` Linus Torvalds
  2008-06-17 21:34                 ` Linus Torvalds
  2008-06-17 21:34                 ` Andi Kleen
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 21:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar



On Tue, 17 Jun 2008, Andi Kleen wrote:
> 
> Loads are not supposed to fault in copy_to_user(). Only stores are.

Andi, just shut up already.

It is "copy_user". Notice the lack of "from" or "to". That code handles 
*both* copy_to_user and copy_from_user.

> If your patch fixes something then the main wrong thing is the caller
> who passes a faulting source address.

Andi, SHUT THE F*CK UP. Read the code. Read the patch. Sorry for ever 
involving you. I don't want to hear your idiotic whining.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:31               ` Linus Torvalds
@ 2008-06-17 21:34                 ` Linus Torvalds
  2008-06-17 21:34                 ` Andi Kleen
  1 sibling, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 21:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar



On Tue, 17 Jun 2008, Linus Torvalds wrote:
> 
> It is "copy_user". Notice the lack of "from" or "to". That code handles 
> *both* copy_to_user and copy_from_user.

Oh, and yes, I said "copy_to_user_inatomic()" by mistake. The fact is that 
"write()" obviously does copy_from_user, but more importantly that on x86 
it's THE EXACT SAME ROUTINE. 

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:31               ` Linus Torvalds
  2008-06-17 21:34                 ` Linus Torvalds
@ 2008-06-17 21:34                 ` Andi Kleen
  2008-06-17 21:46                   ` Linus Torvalds
  1 sibling, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2008-06-17 21:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar

Linus Torvalds wrote:
> 
> On Tue, 17 Jun 2008, Andi Kleen wrote:
>> Loads are not supposed to fault in copy_to_user(). Only stores are.
> 
> Andi, just shut up already.
> 
> It is "copy_user". Notice the lack of "from" or "to". That code handles 
> *both* copy_to_user and copy_from_user.

Yes, but it assumes only one can fault at a time.

> 
>> If your patch fixes something then the main wrong thing is the caller
>> who passes a faulting source address.
> 
> Andi, SHUT THE F*CK UP. Read the code. Read the patch.

I did exactly the same patch first on your first email and then I realized
it was wrong before posting ;-)

> Sorry for ever 
> involving you. I don't want to hear your idiotic whining.

The patch is wrong because it'll break the other case (in this case
copy_from_user)

-Andi


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:24               ` Linus Torvalds
  2008-06-17 21:30                 ` Andi Kleen
@ 2008-06-17 21:36                 ` Al Viro
  2008-06-17 21:42                   ` Andi Kleen
  1 sibling, 1 reply; 38+ messages in thread
From: Al Viro @ 2008-06-17 21:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar

On Tue, Jun 17, 2008 at 02:24:39PM -0700, Linus Torvalds wrote:

> The whole *and*only* reason for copy_to/from_user() existing AT ALL is 
> exactly the fact that the source or destination access can fault. 

Erm...  The reason for copy_to_user() existing is that dereferencing
a user-supplied address in the kernel does not have to access any
memory reachable in user mode, regardless of any faults.  WTF are you
guys talking about?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:30                 ` Andi Kleen
@ 2008-06-17 21:37                   ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 21:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar



On Tue, 17 Jun 2008, Andi Kleen wrote:
> 
> Ok if I'm really wrong on this (but frankly I don't see the mistake, sorry)
> for my person edification: what's a legitimate case for copy_to_user()
> where the source can fault?

Andi, read the patch already. And then shut up. I'm so tired of you always 
trying to argue against fixes, just because you wrote the code.

The fact is, that code was buggy. I've explained _why_ it was buggy. I've 
sent a patch. I've even talked about how I _tested_ the patch. You've read 
none of that, you just fixated on the fact that I said "to" instead of 
"from" by mistake, even though it's the exact same code.

At no point did you at all bother to read the patch, nor did you try to 
understand the problem, nor did you follow the original report or try it 
out, or think about it.

So tell me why I shouldn't just put you in my "idiots" filter? The 
aggravation just isn't worth it for me.

When you grow up and can admit that your code was buggy, or at least 
*look* at the patches, feel free to start sending me email again.

Until then, just don't bother.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:36                 ` Al Viro
@ 2008-06-17 21:42                   ` Andi Kleen
  2008-06-17 21:49                     ` Linus Torvalds
  2008-06-17 22:11                     ` Al Viro
  0 siblings, 2 replies; 38+ messages in thread
From: Andi Kleen @ 2008-06-17 21:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Bron Gondwana, Linux Kernel Mailing List,
	Nick Piggin, Andrew Morton, Rob Mueller, Ingo Molnar

Al Viro wrote:
> On Tue, Jun 17, 2008 at 02:24:39PM -0700, Linus Torvalds wrote:
> 
>> The whole *and*only* reason for copy_to/from_user() existing AT ALL is 
>> exactly the fact that the source or destination access can fault. 
> 
> Erm...  The reason for copy_to_user() existing is that dereferencing
> a user-supplied address in the kernel does not have to access any
> memory reachable in user mode, regardless of any faults. 

Well on x86 it is reachable, so it only handles faults.

> WTF are you
> guys talking about?

Linus seems to think that copy_to_user() should have
copy_in_user semantics(). It happens to be in some cases (when string instructions
are used), but not for the unrolled case.

What seems also confusing him is that x86-64 copy_from/to_user use a shared
subfunction. The trick that this subfunction uses is to assume that
either the destination faults or the source, but never both. It's legal
because the caller should never pass in a faulting source for copy to
or a faulting destination for copy from.

Actually they handle it, but the return value is not correct.

Now he "fixed" copy_to_user to return a kind of correct return value
for source faults, but it'll of course break copy_from_user()'s return value.

It's still unclear why his patch fixes the test case. The caller should
be using copy_in_user perhaps? Or is it just buggy by passing something
unmapped to copy_to_user?

-Andi


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:34                 ` Andi Kleen
@ 2008-06-17 21:46                   ` Linus Torvalds
  2008-06-18  6:10                     ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 21:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar



On Tue, 17 Jun 2008, Andi Kleen wrote:
> 
> The patch is wrong because it'll break the other case (in this case
> copy_from_user)

Ok, I'm now putting you in my idiots filter.

No, it will not break the other case. You're an idiot. Loading a value 
without using it will not break anything, quite the reverse. What the 
patch does is to _fix_ copy_from_user(), because if the second load traps, 
then the fact that we did the first load IS IMMATERIAL, because its result 
was never stored!

So the patch _fixes_ copy_from_user(), exactly because it says that even 
if you've loaded 24 bytes, but you faulted on the fourth load, you've 
still _copied_ exactly zero bytes, because you didn't actually store the 
24 bytes you loaded.

And it is a no-op for copy_to_user, since copy_to_user will never fault on 
the load (not the first one, not the second one, not _any_ load), so the 
exception table entries for the loads are unusued.

IOW: 
 - the patch fixes a bug
 - you refuse to acknowledge this
 - I'll put you in my "flamers" filter that goes into another mailbox, 
   because it's not worth my time even arguing with you any more.

Sorry for ever adding you to the cc. I thought it might be a good idea, 
since you were the author of the code. But clearly the bug was not because 
you made a mistake, but because you simply don't seem to understand what 
the function is supposed to return, and you're not even interested in 
learning.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:42                   ` Andi Kleen
@ 2008-06-17 21:49                     ` Linus Torvalds
  2008-06-17 22:11                     ` Al Viro
  1 sibling, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-17 21:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Al Viro, Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar



On Tue, 17 Jun 2008, Andi Kleen wrote:
> 
> Linus seems to think that copy_to_user() should have
> copy_in_user semantics().

No. 

I *know* that copy_to/from_user() is supposed to return the number of 
bytes not copied.

And if you loaded a value but didn't store it to the destination, then BY 
DEFINITION you didn't copy it.

Comprende?

Obviously you don't.

Gaah. Why do I keep bothering to even _try_ to educate you? Can't you 
accept that I just fixed a bug, which had a test-case an everything?

Andi, really. Take it from me. If I tell you something, I'm usually right. 
Think about that for a moment, instead of living in your own little 
dream-world where you think you understand everything.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:42                   ` Andi Kleen
  2008-06-17 21:49                     ` Linus Torvalds
@ 2008-06-17 22:11                     ` Al Viro
  2008-06-17 22:21                       ` Andi Kleen
  1 sibling, 1 reply; 38+ messages in thread
From: Al Viro @ 2008-06-17 22:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Bron Gondwana, Linux Kernel Mailing List,
	Nick Piggin, Andrew Morton, Rob Mueller, Ingo Molnar

On Tue, Jun 17, 2008 at 11:42:03PM +0200, Andi Kleen wrote:

> What seems also confusing him is that x86-64 copy_from/to_user use a shared
> subfunction. The trick that this subfunction uses is to assume that
> either the destination faults or the source, but never both. It's legal
> because the caller should never pass in a faulting source for copy to
> or a faulting destination for copy from.
> 
> Actually they handle it, but the return value is not correct.
> 
> Now he "fixed" copy_to_user to return a kind of correct return value
> for source faults, but it'll of course break copy_from_user()'s return value.
> 
> It's still unclear why his patch fixes the test case. The caller should
> be using copy_in_user perhaps? Or is it just buggy by passing something
> unmapped to copy_to_user?

AFAICS, what happened is that b0rken copy_*FROM*_user() had been discussed
with references to copy_*TO*_user().  With proposed patch indeed not affecting
any legitimate calls of the latter.  Does affect the former and that, from
my reading of the code in question, correctly.

IOW, s/copy_to_user/copy_from_user/ in Linus' postings upthread and they
make sense.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 22:11                     ` Al Viro
@ 2008-06-17 22:21                       ` Andi Kleen
  2008-06-18  6:22                         ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Andi Kleen @ 2008-06-17 22:21 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Bron Gondwana, Linux Kernel Mailing List,
	Nick Piggin, Andrew Morton, Rob Mueller, Ingo Molnar


> AFAICS, what happened is that b0rken copy_*FROM*_user() had been discussed
> with references to copy_*TO*_user().  With proposed patch indeed not affecting
> any legitimate calls of the latter.  Does affect the former and that, from
> my reading of the code in question, correctly.
> 
> IOW, s/copy_to_user/copy_from_user/ in Linus' postings upthread and they
> make sense.

Yes, it makes some more sense, but I'm not completely happy with the fix
because it makes the fault point reporting very unreliable (maximum error
will be 63 instead of 7 now). iirc especially mount was sensitive to that.

Unfortunately fixing the accuracy is a little tricky.

-Andi

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 17:08   ` Linus Torvalds
  2008-06-17 17:45     ` Linus Torvalds
@ 2008-06-18  2:21     ` Bron Gondwana
  1 sibling, 0 replies; 38+ messages in thread
From: Bron Gondwana @ 2008-06-18  2:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller

On Tue, Jun 17, 2008 at 10:08:24AM -0700, Linus Torvalds wrote:
> On Tue, 17 Jun 2008, Bron Gondwana wrote:
> > 
> > And I appear to have sent the one without the usage comments at the top.
> > Here they are:
> 
> Very interesting. There's certainly something there. 
> 
> That said, there's a distracting bug which is visible when doing an strace
> 
>  lseek(4, 140333890921392, SEEK_SET)     = -1 EINVAL (Invalid argument)
>  write(4, "\0\0\0\0", 4)                 = 4

Bah - crap.  Sorry about that.  I sent you the wrong version of the
code.  That's what I get for not putting it in revision control.

That bug exists because I did an htonl on a value that I shouldn't
have.

Bron ( now to read the rest of the thread! )

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:20             ` Linus Torvalds
@ 2008-06-18  2:27               ` Bron Gondwana
  2008-06-18  3:14               ` Bron Gondwana
  2008-10-03 11:44               ` BUG: mmapfile/writev spurious zero bytes still in the wild Bron Gondwana
  2 siblings, 0 replies; 38+ messages in thread
From: Bron Gondwana @ 2008-06-18  2:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Andi Kleen, Ingo Molnar

On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 17 Jun 2008, Linus Torvalds wrote:
> > 
> > Hmm. Something like this *may* salvage it.
> > 
> > Untested, so far (I'll reboot and test soon enough), but even if it fixes 
> > things, it's not really very good. 
> 
> Ok, so I just rebooted with this, and it does indeed fix the bug.
> 
> I'd be happier with a more complete fix (ie being byte-accurate and 
> actually doing the partial copy when it hits a fault in the middle), but 
> this seems to be the minimal fix, and at least fixes the totally bogus 
> return values from the x86-64 __copy_user*() functions.
> 
> Not that I checked that I got _all_ cases correct (and maybe there are 
> other versions of __copy_user that I missed entirely), but Bron's 
> test-case at least seems to work properly for me now.
> 
> Bron? If you have a more complete test-suite (ie the real-world case that 
> made you find this), it would be good to verify the whole thing. 

I have a real world test case using "cyr_dbtool" from Cyrus 2.3.12 on a
known-bug-inducing piece of data (the key and value sizes in the example
code were taken from that.  Indeed, the example code started off
building byte-for-byte identical files, then I changed it to write only
a single character across the entire key and value so the hexdump was
shorter)

I'll give it a go.

Bron.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:20             ` Linus Torvalds
  2008-06-18  2:27               ` Bron Gondwana
@ 2008-06-18  3:14               ` Bron Gondwana
  2008-06-18  4:03                 ` Linus Torvalds
  2008-10-03 11:44               ` BUG: mmapfile/writev spurious zero bytes still in the wild Bron Gondwana
  2 siblings, 1 reply; 38+ messages in thread
From: Bron Gondwana @ 2008-06-18  3:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Andi Kleen, Ingo Molnar

On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:
> On Tue, 17 Jun 2008, Linus Torvalds wrote:
> > 
> > Hmm. Something like this *may* salvage it.
> > 
> > Untested, so far (I'll reboot and test soon enough), but even if it fixes 
> > things, it's not really very good. 
> 
> Ok, so I just rebooted with this, and it does indeed fix the bug.
> 
> I'd be happier with a more complete fix (ie being byte-accurate and 
> actually doing the partial copy when it hits a fault in the middle), but 
> this seems to be the minimal fix, and at least fixes the totally bogus 
> return values from the x86-64 __copy_user*() functions.
> 
> Not that I checked that I got _all_ cases correct (and maybe there are 
> other versions of __copy_user that I missed entirely), but Bron's 
> test-case at least seems to work properly for me now.
> 
> Bron? If you have a more complete test-suite (ie the real-world case that 
> made you find this), it would be good to verify the whole thing. 

Ok - I pulled the latest linus-2.6 git, and discovered 
the patch was already in there, so I just built and
rebooted (git 952f4a0a9b27e6dbd5d32e330b3f609ebfa0b061).

Confirmed - fixed in both the test code and the cyr_dbtool 
test case I was using previously (I would have posted that 
instead, but building cyrus is a bit of pain.  You need 
bdb and sasl and all sorts of extraneous crap - and 
cyrusdb_skiplist.c depends on about half of Cyrus' 
infrastructure, so I couldn't just pull it out by itself)

For my sins, I appear to be becoming the world expert on 
that particular file.  I've debugged skiplist bugs many 
times over, and completely rewritten the locking code.  
It really does some pretty evil things - the memory accesses
look something like this:

[file...................]
[mmap^....^.^........^^..................................]
[file...................++++++++++++]
[mmap^....^.^........^^.^^ ^      ^^.....................]

Where (^) is the bits that get accessed.  All reads are via
the mmap, all writes are done with retry_write or 
retry_writev (Cyrus library functions that keep hammering
until all the bytes are written)

I was suspecting as early as Friday night (we've been 
debugging this one for a few days now!) that it was page 
break related, because the bug only seemed to be appearing
on seen databases with really long seen lists (they're in 
ranged integer format like 1:5,7:9,12,14:22,24:...).  

It didn't help that at first we were only finding out about
cases where the corruption hit exactly on the "navigational
components", hence breaking the skiplist logic.  And then 
the backpointer writes would scribble all over the corrupt 
area as well, so that made it even stranger to debug!


OK - so I'll report this issue to the Cyrus mailing list.
Warn people not to run on kernels 2.6.23 -> 2.6.25.7 with
x86_64 kernels.  At least not without the skanky little
patch that I'm planning to post:

int magic = 0;
for (i = 0; i < maplen; i++) magic ^= mapbase[i];

Since I've tested that as a viable workaround!

Bron.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-18  3:14               ` Bron Gondwana
@ 2008-06-18  4:03                 ` Linus Torvalds
  2008-06-18  5:11                   ` Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)) Bron Gondwana
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2008-06-18  4:03 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Rob Mueller, Andi Kleen, Ingo Molnar



On Wed, 18 Jun 2008, Bron Gondwana wrote:
> 
> For my sins, I appear to be becoming the world expert on 
> that particular file.

Heh. Congrats ;)

> I've debugged skiplist bugs many times over, and completely rewritten 
> the locking code.  It really does some pretty evil things - the memory 
> accesses look something like this:
> 
> [file...................]
> [mmap^....^.^........^^..................................]
> [file...................++++++++++++]
> [mmap^....^.^........^^.^^ ^      ^^.....................]
> 
> Where (^) is the bits that get accessed.  All reads are via
> the mmap, all writes are done with retry_write or 
> retry_writev (Cyrus library functions that keep hammering
> until all the bytes are written)

Is there any reason it doesn't use mmap(MAP_SHARED) and make the 
modifications that way too?

Because quite frankly, the mixture of doing mmap() and write() system 
calls is quite fragile - and I'm not saying that just because of this 
particular bug, but because there are all kinds of nasty cache aliasing 
issues with virtually indexed caches etc that just fundamentally mean that 
it's often a mistake to mix mmap with read/write at the same time.

(For the same reason it's not a good idea to mix writing through an mmap() 
and then using read() to read it - again, you can have some nasty aliasing 
going on there).

So this particular issue was definitely a kernel bug (and big thanks for 
making such a good test-case), but in general, it does sound like Cyrus is 
actively trying to dig itself into a nasty hole there.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable))
  2008-06-18  4:03                 ` Linus Torvalds
@ 2008-06-18  5:11                   ` Bron Gondwana
  2008-06-18 16:22                     ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Bron Gondwana @ 2008-06-18  5:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Andi Kleen, Ingo Molnar,
	Ken Murchison

On Tue, Jun 17, 2008 at 09:03:17PM -0700, Linus Torvalds wrote:
> On Wed, 18 Jun 2008, Bron Gondwana wrote:
> > 
> > For my sins, I appear to be becoming the world expert on 
> > that particular file.
> 
> Heh. Congrats ;)
> 
> > I've debugged skiplist bugs many times over, and completely rewritten 
> > the locking code.  It really does some pretty evil things - the memory 
> > accesses look something like this:
> > 
> > [file...................]
> > [mmap^....^.^........^^..................................]
> > [file...................++++++++++++]
> > [mmap^....^.^........^^.^^ ^      ^^.....................]
> > 
> > Where (^) is the bits that get accessed.  All reads are via
> > the mmap, all writes are done with retry_write or 
> > retry_writev (Cyrus library functions that keep hammering
> > until all the bytes are written)
> 
> Is there any reason it doesn't use mmap(MAP_SHARED) and make the 
> modifications that way too?

Portability[tm].

It actually does use MAP_SHARED already, but only for reading.
Writing is all done with seeks and writes, followed by a map
"refresh", which is really just an unmmap/mmap if the file has
extended past the "SLOP" (between 8 and 16 k after the end of
the file length at last mapping).

I can't just right now find the place where the reasoning behind
this was explained to me.  The theory I think was that mmap write
support is unreliable across systems, but read support is pretty 
good (except HPUX for which there is map_stupidshared.c)

> Because quite frankly, the mixture of doing mmap() and write() system 
> calls is quite fragile - and I'm not saying that just because of this 
> particular bug, but because there are all kinds of nasty cache aliasing 
> issues with virtually indexed caches etc that just fundamentally mean that 
> it's often a mistake to mix mmap with read/write at the same time.

I'm not actually a maintainer for Cyrus, I just write patches to
keep our mail servers working.  I'll pass this on.

> (For the same reason it's not a good idea to mix writing through an mmap() 
> and then using read() to read it - again, you can have some nasty aliasing 
> going on there).
>
> So this particular issue was definitely a kernel bug (and big thanks for 
> making such a good test-case), but in general, it does sound like Cyrus is 
> actively trying to dig itself into a nasty hole there.

Yeah, indeed.

I suspect the response from the Cyrus side might include a
small dose of "POSIX says it's valid to do this, and making
it work is the kernel programmers' lookout".

Ahh - I found the explaination in doc/internal/hacking in
the Cyrus source tree.  While 'ack' is a nice tool, it
doesn't check files with no extention by default.  Ho hum:

- map_refresh and map_free

  - In many cases, it is far more effective to read a file via the operating
    system's mmap facility than it is to via the traditional read() and
    lseek system calls.  To this end, Cyrus provides an operating system
    independent wrapper around the mmap() services (or lack thereof) of the
    operating system.

  - Cyrus currently only supports read-only memory maps, all writes back
    to a file need to be done via the more traditional facilities.  This is
    to enable very low-performance support for operating systems which do not
    provide an mmap() facility via a fake userspace mmap.

  - To create a map, simply call map_refresh on the map (details are in
    lib/map.h).  To free it, call map_free on the same map.

  - Despite the fact that the maps are read-only, it is often useful to open
    the file descriptors O_RDWR, especially if the file decriptors could
    possibly be used for writing elsewhere in the code. Some operating
    systems REQUIRE file descriptors that are mmap()ed to be opened
    O_RDWR, so just do it.

If I was God in the Cyrus world (woot) I suspect I'd 
provide some sort of OS independent wrapper around the 
various write functions, using mmap where appropriate,
while still working via lseek/write for those systems 
without mmap support.

(Added CC: Ken Murchison, on the grounds that he actually 
is God in the Cyrus world)

Thanks for the good explaination.  I'll have a look at the
Cyrus code and see just how tricky that would actually be
(even just doing the skiplist, index and cache code would
hit 99% of the cases where files are both mmaped and
written concurrently)

Bron.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 21:46                   ` Linus Torvalds
@ 2008-06-18  6:10                     ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2008-06-18  6:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Ingo Molnar

On Wednesday 18 June 2008 07:46, Linus Torvalds wrote:
> On Tue, 17 Jun 2008, Andi Kleen wrote:

> So the patch _fixes_ copy_from_user(), exactly because it says that even
> if you've loaded 24 bytes, but you faulted on the fourth load, you've
> still _copied_ exactly zero bytes, because you didn't actually store the
> 24 bytes you loaded.

Yes, the new filemap.c code does not require an exact byte count, but
it won't work if there is an under-estimation of the number of bytes
left to copy.

The old filemap.c code actually also relies on the byte count in some
cases, I can't remember off the top of my head, but I *think* it was a
security measure to prevent uninitialized data leak.

Most other cases of course only care about complete success or not, but
there are others. filemap_xip, splice are a couple that pop up.

Thanks for working that all out before I even read my email :)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)
  2008-06-17 22:21                       ` Andi Kleen
@ 2008-06-18  6:22                         ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2008-06-18  6:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Al Viro, Linus Torvalds, Bron Gondwana, Linux Kernel Mailing List,
	Nick Piggin, Andrew Morton, Rob Mueller, Ingo Molnar

On Wednesday 18 June 2008 08:21, Andi Kleen wrote:
> > AFAICS, what happened is that b0rken copy_*FROM*_user() had been
> > discussed with references to copy_*TO*_user().  With proposed patch
> > indeed not affecting any legitimate calls of the latter.  Does affect the
> > former and that, from my reading of the code in question, correctly.
> >
> > IOW, s/copy_to_user/copy_from_user/ in Linus' postings upthread and they
> > make sense.
>
> Yes, it makes some more sense, but I'm not completely happy with the fix
> because it makes the fault point reporting very unreliable (maximum error
> will be 63 instead of 7 now). iirc especially mount was sensitive to that.

It looks like mount does need an exact copy, so they've rolled their own
(exact_copy_from_user). I guess if you need an exact copy, then it doesn't
really matter how inexact an inexact one is, it's still unusable :)

All else being equal, a smaller maximum error is preferable, but surely
that is outweighed by the correctness issue of returning a valid number of
bytes left to operate on.

BTW. we already have lots (although steadily declining number) of corner
case issues around this whole area, but if we want to get really strict,
even an inexact report may be wrong for filemap.

Suppose we copy 10 bytes into the pagecache, but report that 5 were copied.
That means, we'll subsequently re-copy the delta. Between these two copies,
a 2nd writer might come in and write something over those 5 bytes. Then a
reader might see the following sequence of those 10 bytes
"0000000000"
"1111111111"
"2222222222"
"2222211111"

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable))
  2008-06-18  5:11                   ` Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)) Bron Gondwana
@ 2008-06-18 16:22                     ` Linus Torvalds
  2008-06-18 23:45                       ` Robert Mueller
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2008-06-18 16:22 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linux Kernel Mailing List, Nick Piggin, Andrew Morton,
	Rob Mueller, Andi Kleen, Ingo Molnar, Ken Murchison



On Wed, 18 Jun 2008, Bron Gondwana wrote:
> On Tue, Jun 17, 2008 at 09:03:17PM -0700, Linus Torvalds wrote:
> >
> > Is there any reason it doesn't use mmap(MAP_SHARED) and make the 
> > modifications that way too?
> 
> Portability[tm].

Hmm.. I'm pretty sure that using MAP_SHARED for writing is _more_ portable 
than mixing mmap() and "write()" - or at least more _consistent_.

That said, it's probably six one way, and half a dozen the other. The 
shared writable mmap() doesn't work well on unix-lookalikes (ie "not real 
unix"). That does include really really old Linux versions (ie 1.x 
series), but more relevantly probably includes things like QNX etc.

On the other hand, the mmap()+write(), as mentioned, doesn't work well on 
various hardware platforms where theer can be cache aliases, and that 
includes HP-UX (as you apparently have noticed), but I'm pretty certain 
there are other cases too.

The cache alias issue can actually be really thorny, because it's going to 
be very hard to see and essentially random: if your working set is big 
enough (or the cache is small enough) that the cache basically gets 
flushed between the write and the access through the mmap (and vice 
versa), you'll never see any problems.

But then, _occasionally_, you'll have really hard-to-replicate corruption 
due to cache aliases (ie you read something from the mmap() after the 
write, but you don't actually see the newly written data, because it's 
cached at a different virtual address).

Linux tries really hard to be coherent between mmap and read/write even on 
those kinds of platforms, but I would definitely not call it "portable". 
It really is a fundamentally nasty thing, and depends deeply on the CPU 
architecture, not just the OS.

> It actually does use MAP_SHARED already, but only for reading.
> Writing is all done with seeks and writes, followed by a map
> "refresh", which is really just an unmmap/mmap if the file has
> extended past the "SLOP" (between 8 and 16 k after the end of
> the file length at last mapping).

Yeah, I can certainly see that working. That said, I can also see it 
failing, partly because of the CPU virtual indexing cache issues, but 
partly because it's such an unusual thing to do (partly because it simply 
is known not to work on some systems, ie HP-UX). And that will mean that 
it is probably not a well-tested path.. As you found out.

(Side note: I mention HP-UX just because it is known to historically have 
totally and utterly brain-damaged and useless mmap support. It _may_ be 
that they've fixed it in more modern versions. It literally used to be a 
mix of horrible hardware problems - the virtual cache issue - _and_ a VM 
system that was based on some really old BSD code).

So the more traditional way would be to do an all-mmap thing, and extend 
the file with ftruncate(), not write. That's somethign that programs like 
nntpd have been doing for decades, so it's a very "traditional" model and 
thus much more likely to be safe. It also avoids all the aliasing issues, 
if all accesses are done the same way.

That said, you _would_ need to have alternate strategies to access things, 
but apparently Cyrus already has such strategies at least for HP-UX.

> Ahh - I found the explaination in doc/internal/hacking in
> the Cyrus source tree.  While 'ack' is a nice tool, it
> doesn't check files with no extention by default.  Ho hum:
> 
> - map_refresh and map_free
> 
>   - In many cases, it is far more effective to read a file via the operating
>     system's mmap facility than it is to via the traditional read() and
>     lseek system calls.  To this end, Cyrus provides an operating system
>     independent wrapper around the mmap() services (or lack thereof) of the
>     operating system.

One of the issues here is that in order to give coherency for mmap + 
read/write access, the OS may need to map the area uncached or at least 
flush caches when writing. So from a pure performance standpoint, it can 
also cause problems.

Of course, even a uncached mmap() _can_ certainly be faster than using 
just read()/write(), depending on the access patterns. So maybe Cyrus is 
doing the rigth thing, it just sounds rather fragile and prone to 
unexpected and hard-to-debug problems.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable))
  2008-06-18 16:22                     ` Linus Torvalds
@ 2008-06-18 23:45                       ` Robert Mueller
  2008-06-19  0:20                         ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Robert Mueller @ 2008-06-18 23:45 UTC (permalink / raw)
  To: Linus Torvalds, Bron Gondwana
  Cc: Linux Kernel Mailing List, Nick Piggin, Andrew Morton, Andi Kleen,
	Ingo Molnar, Ken Murchison


>> It actually does use MAP_SHARED already, but only for reading.
>> Writing is all done with seeks and writes, followed by a map
>> "refresh", which is really just an unmmap/mmap if the file has
>> extended past the "SLOP" (between 8 and 16 k after the end of the
>> file length at last mapping).
>
> So the more traditional way would be to do an all-mmap thing, and
> extend the file with ftruncate(), not write. That's somethign that
> programs like nntpd have been doing for decades, so it's a very
> "traditional" model and thus much more likely to be safe. It also
> avoids all the aliasing issues, if all accesses are done the same way.

As noted above, one thing cyrus does which does seem to be plain "wrong"
is that it mmaps a region greater the file size (rounds to an 8k
boundary, but 8k-16k past the current end of the file) and then assumes
that when it writes to the end of the file (but less than the end of the
mmap region) that there's no need to remmap and that data is immediately
available within the previous mmaped region.

Apparently that works on most OS's (but is what this bug actually
exposed), but according to the mmap docs:

---
If the size of the mapped file changes after the call to mmap() as a
result of some other operation on the mapped file, the effect of
references to portions of the mapped region that correspond to added or
removed portions of the file is unspecified.
---

The way I read that, even if you mmap a file with a size past the end of
the file currently, if you subsequently write to the end of that file,
you shouldn't assume that written data is available in the region you
previously mmaped, which cyrus definitely does do.

Amazingly (apart from HP/UX) no OS actually seems to have a problem with
this since there would be massive cyrus bug reports otherwise.

Rob

----------
robm@fastmail.fm
Sign up at http://fastmail.fm for fast, ad free, IMAP accessible email


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable))
  2008-06-18 23:45                       ` Robert Mueller
@ 2008-06-19  0:20                         ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2008-06-19  0:20 UTC (permalink / raw)
  To: Robert Mueller
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Andi Kleen, Ingo Molnar, Ken Murchison



On Thu, 19 Jun 2008, Robert Mueller wrote:
> 
> As noted above, one thing cyrus does which does seem to be plain "wrong"
> is that it mmaps a region greater the file size (rounds to an 8k
> boundary, but 8k-16k past the current end of the file) and then assumes
> that when it writes to the end of the file (but less than the end of the
> mmap region) that there's no need to remmap and that data is immediately
> available within the previous mmaped region.

Pretty much any OS that tries to be make mmap() coherent with regular 
read/write accesses will automatically also have to be coherent wrt file 
size updates.

IOW, I don't think that cyrus is real any more "wrong" in this than in 
assuming that you can mix read/write and mmap() accesses. In fact, I 
suspect that Cyrus is probably _more_ conservative than most, in that it 
would not be totally unheard of to just do a much bigger mmap(), and not 
even bother to re-do it until the file grows past that size (ie no 8k/16k 
granularity, but make it arbitrarily non-granular).

> Apparently that works on most OS's (but is what this bug actually
> exposed), but according to the mmap docs:
> 
> ---
> If the size of the mapped file changes after the call to mmap() as a
> result of some other operation on the mapped file, the effect of
> references to portions of the mapped region that correspond to added or
> removed portions of the file is unspecified.

Note that if you really want to be portable, you simply must not mix 
mmap() with *any* other operations without sprinking in a healthy amount 
of "msync()" or unmapping/remapping entirely.

So _in_practice_ - because everybody tries to do a good job - you can 
actually expect to have mmap() be coherent, even though there are no real 
guarantees. 

> Amazingly (apart from HP/UX) no OS actually seems to have a problem with 
> this since there would be massive cyrus bug reports otherwise.

Yeah. Over the years, the pain from having a non-coherent mmap() generally 
has pushed everybody into just making mmap() easy to use. Which means that 
mixing things generally works fine, even if it is not at all _guaranteed_. 

So I'd expect mmap+write to work and be coherent almost always. But it's 
still a fairly unusual combination, and I would personally think that 
using MAP_SHARED and writing through the mmap() would be the less 
surprising model.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* BUG: mmapfile/writev spurious zero bytes still in the wild
  2008-06-17 21:20             ` Linus Torvalds
  2008-06-18  2:27               ` Bron Gondwana
  2008-06-18  3:14               ` Bron Gondwana
@ 2008-10-03 11:44               ` Bron Gondwana
  2008-10-03 13:07                 ` Andrew Morton
  2 siblings, 1 reply; 38+ messages in thread
From: Bron Gondwana @ 2008-10-03 11:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bron Gondwana, Linux Kernel Mailing List, Nick Piggin,
	Andrew Morton, Rob Mueller, Andi Kleen, Ingo Molnar

On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:

[reminder from way back: this bug was caused by writev containing 
mmaped pages that weren't paged in, it's 64 bit only.  It
particularly affects Cyrus Imapd's database formats]

> On Tue, 17 Jun 2008, Linus Torvalds wrote:
> > 
> > Hmm. Something like this *may* salvage it.
> > 
> > Untested, so far (I'll reboot and test soon enough), but even if it fixes 
> > things, it's not really very good. 
> 
> Ok, so I just rebooted with this, and it does indeed fix the bug.
> 
> I'd be happier with a more complete fix (ie being byte-accurate and 
> actually doing the partial copy when it hits a fault in the middle), but 
> this seems to be the minimal fix, and at least fixes the totally bogus 
> return values from the x86-64 __copy_user*() functions.

Has this been revisited since?  I haven't noticed, but I really only
skim LKML - have to save some time in the day for my real job[tm] of
keeping an email service running!

> Not that I checked that I got _all_ cases correct (and maybe there are 
> other versions of __copy_user that I missed entirely), but Bron's 
> test-case at least seems to work properly for me now.
> 
> Bron? If you have a more complete test-suite (ie the real-world case that 
> made you find this), it would be good to verify the whole thing. 

It's been fine for us since, but unfortunately most of the world is
still running distribution "stable" kernels.  I've just been helping a
user who's getting corrupted flat file databases on Ubuntu's stable 64
bit xen kernels, and it looks like it's the same issue.

Is there a standard way to tell backporters "you really need to add this
patch for your users' sanity"?

Bron ( I tried reporting it in Launchpad, but kept getting timeout
       errors, so I figured reposting here might get noticed.  Besides,
			 I can follow up on the "more complete fix" at the same time! )

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes still in the wild
  2008-10-03 11:44               ` BUG: mmapfile/writev spurious zero bytes still in the wild Bron Gondwana
@ 2008-10-03 13:07                 ` Andrew Morton
  2008-10-04  0:13                   ` Bron Gondwana
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2008-10-03 13:07 UTC (permalink / raw)
  To: Bron Gondwana
  Cc: Linus Torvalds, Linux Kernel Mailing List, Nick Piggin,
	Rob Mueller, Andi Kleen, Ingo Molnar, stable

On Fri, 3 Oct 2008 21:44:14 +1000 Bron Gondwana <brong@fastmail.fm> wrote:

> On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:
> 
> [reminder from way back: this bug was caused by writev containing 
> mmaped pages that weren't paged in, it's 64 bit only.  It
> particularly affects Cyrus Imapd's database formats]
> 
> > On Tue, 17 Jun 2008, Linus Torvalds wrote:
> > > 
> > > Hmm. Something like this *may* salvage it.
> > > 
> > > Untested, so far (I'll reboot and test soon enough), but even if it fixes 
> > > things, it's not really very good. 
> > 
> > Ok, so I just rebooted with this, and it does indeed fix the bug.
> > 
> > I'd be happier with a more complete fix (ie being byte-accurate and 
> > actually doing the partial copy when it hits a fault in the middle), but 
> > this seems to be the minimal fix, and at least fixes the totally bogus 
> > return values from the x86-64 __copy_user*() functions.
> 
> Has this been revisited since?  I haven't noticed, but I really only
> skim LKML - have to save some time in the day for my real job[tm] of
> keeping an email service running!
> 
> > Not that I checked that I got _all_ cases correct (and maybe there are 
> > other versions of __copy_user that I missed entirely), but Bron's 
> > test-case at least seems to work properly for me now.
> > 
> > Bron? If you have a more complete test-suite (ie the real-world case that 
> > made you find this), it would be good to verify the whole thing. 
> 
> It's been fine for us since, but unfortunately most of the world is
> still running distribution "stable" kernels.  I've just been helping a
> user who's getting corrupted flat file databases on Ubuntu's stable 64
> bit xen kernels, and it looks like it's the same issue.
> 
> Is there a standard way to tell backporters "you really need to add this
> patch for your users' sanity"?

Yes, there is.  We backport the patch into earlier kernel releases and
that action _should_ wake the distros up to take a look at the fix.

This particular fix (42a886af728c089df8da1b0017b0e7e6c81b5335) was
included in 2.6.26 and also is present in 2.6.25.17, but not 2.6.25. 
So we did backport it into 2.6.25.x.  Maybe distros were slow or errant
in picking up the patch.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: BUG: mmapfile/writev spurious zero bytes still in the wild
  2008-10-03 13:07                 ` Andrew Morton
@ 2008-10-04  0:13                   ` Bron Gondwana
  0 siblings, 0 replies; 38+ messages in thread
From: Bron Gondwana @ 2008-10-04  0:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Linux Kernel Mailing List, Nick Piggin,
	Rob Mueller, Andi Kleen, Ingo Molnar, stable


On Fri, 3 Oct 2008 06:07:23 -0700, "Andrew Morton" <akpm@linux-foundation.org> said:
> On Fri, 3 Oct 2008 21:44:14 +1000 Bron Gondwana <brong@fastmail.fm>
> wrote:
> > Is there a standard way to tell backporters "you really need to add this
> > patch for your users' sanity"?
> 
> Yes, there is.  We backport the patch into earlier kernel releases and
> that action _should_ wake the distros up to take a look at the fix.
> 
> This particular fix (42a886af728c089df8da1b0017b0e7e6c81b5335) was
> included in 2.6.26 and also is present in 2.6.25.17, but not 2.6.25. 
> So we did backport it into 2.6.25.x.  Maybe distros were slow or errant
> in picking up the patch.

It went into 2.6.25.8, which was the stable line at the time.  It
didn't get backported to 2.6.24.  Ubuntu's stable line is called
2.6.24-19.41, and I'm guessing they just didn't realise it was
a fix that actually needs to be applied to everything back to at
least 2.6.23.

I'll go try launchpad again and see if it's unbroken enough to
accept my bug report.

Bron ( really not wanting to keep fielding Cyrus questions on this
       for the 4 1/2 years support remaining on that LTS distro! )
-- 
  Bron Gondwana
  brong@fastmail.fm


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2008-10-04  0:36 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-06-17  6:00 BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable) Bron Gondwana
2008-06-17  6:02 ` Bron Gondwana
2008-06-17 17:08   ` Linus Torvalds
2008-06-17 17:45     ` Linus Torvalds
2008-06-17 20:16       ` Linus Torvalds
2008-06-17 20:41         ` Linus Torvalds
2008-06-17 21:06           ` Linus Torvalds
2008-06-17 21:16             ` Andi Kleen
2008-06-17 21:24               ` Linus Torvalds
2008-06-17 21:30                 ` Andi Kleen
2008-06-17 21:37                   ` Linus Torvalds
2008-06-17 21:36                 ` Al Viro
2008-06-17 21:42                   ` Andi Kleen
2008-06-17 21:49                     ` Linus Torvalds
2008-06-17 22:11                     ` Al Viro
2008-06-17 22:21                       ` Andi Kleen
2008-06-18  6:22                         ` Nick Piggin
2008-06-17 21:20             ` Linus Torvalds
2008-06-18  2:27               ` Bron Gondwana
2008-06-18  3:14               ` Bron Gondwana
2008-06-18  4:03                 ` Linus Torvalds
2008-06-18  5:11                   ` Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)) Bron Gondwana
2008-06-18 16:22                     ` Linus Torvalds
2008-06-18 23:45                       ` Robert Mueller
2008-06-19  0:20                         ` Linus Torvalds
2008-10-03 11:44               ` BUG: mmapfile/writev spurious zero bytes still in the wild Bron Gondwana
2008-10-03 13:07                 ` Andrew Morton
2008-10-04  0:13                   ` Bron Gondwana
2008-06-17 21:15           ` BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable) Andi Kleen
2008-06-17 20:58         ` Andi Kleen
2008-06-17 21:14           ` Linus Torvalds
2008-06-17 21:26             ` Andi Kleen
2008-06-17 21:31               ` Linus Torvalds
2008-06-17 21:34                 ` Linus Torvalds
2008-06-17 21:34                 ` Andi Kleen
2008-06-17 21:46                   ` Linus Torvalds
2008-06-18  6:10                     ` Nick Piggin
2008-06-18  2:21     ` Bron Gondwana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox