* Corruption with O_DIRECT and unaligned user buffers
From: Tim LaBerge @ 2008-11-14 17:04 UTC
To: linux-mm, linux-fsdevel

The man page for open(2) states the following:

    O_DIRECT (Since Linux 2.6.10)
        Try to minimize cache effects of the I/O to and from this file.
        In general this will degrade performance, but it is useful in
        special situations, such as when applications do their own
        caching. File I/O is done directly to/from user space buffers.
        The I/O is synchronous, that is, at the completion of a read(2)
        or write(2), data is guaranteed to have been transferred. Under
        Linux 2.4 transfer sizes, and the alignment of user buffer and
        file offset must all be multiples of the logical block size of
        the file system. Under Linux 2.6 alignment to 512-byte
        boundaries suffices.

However, it appears that data corruption may occur when a multithreaded
process reads into a non-page size aligned user buffer. A test program
which reliably reproduces the problem on ext3 and xfs is attached.

The program creates, patterns, reads, and verifies a series of files.

In the read phase, a file is opened with O_DIRECT n times, where n is the
number of CPUs. A single buffer large enough to contain the file is
allocated and patterned with data not found in any of the files. The
alignment of the buffer is controlled by a command line option.

Each file is read in parallel by n threads, where n is the number of CPUs.
Thread 0 reads the first page of data from the file into the first page
of the buffer, thread 1 reads the second page of data into the second
page of the buffer, and so on. Thread n - 1 reads the remainder of the
file into the remainder of the buffer.

After a thread reads data into the buffer, it immediately verifies that the
contents of the buffer are correct.
If the buffer contains corrupt data, the thread dumps the data surrounding
the corruption and calls abort(). Otherwise, the thread exits.

Crucially, before the reader threads are dispatched, another thread is
started which calls fork()/msleep() in a loop until all reads are completed.
The child created by fork() does nothing but call exit(0).

A command line option controls whether the buffer is aligned. In the case
where the buffer is aligned on a page boundary, all is well. In the case
where the buffer is aligned on a page + 512 byte offset, corruption is seen
frequently.

I believe that what is happening is that in the direct IO path, because the
user's buffer is not aligned, some user pages are being mapped twice. When
a fork() happens in between the calls to map the page, the page will be
marked as COW. When the second map happens (via get_user_pages()), a new
physical page will be allocated and copied.

Thus, there is a race between the completion of the first read from disk
(and write to the user page) and get_user_pages() mapping the page for the
second time. If the write does not complete before the page is copied, the
user will see stale data in the first 512 bytes of this page of their
buffer. Indeed, this is the corruption most frequently seen. (It's also
possible for the race to be lost the other way, so that the last 3584 bytes
of the page are stale.)

The attached program dma_thread.c (which is a heavily modified version of a
program provided by a customer seeing this problem) reliably reproduces the
problem on any multicore Linux machine on both ext3 and xfs, although any
filesystem using the generic blockdev_direct_IO() routine is probably
vulnerable.

I've seen a few threads that mention the potential for this kind of
problem, but no definitive solution or workaround (other than "Don't do
that").

Thanks,
Tim LaBerge

[-- Attachment #2: dma_thread.c --]
[-- Type: text/x-csrc --]

/* compile with 'gcc -g -o dma_thread dma_thread.c -lpthread' */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <pthread.h>
#include <getopt.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

#define FILESIZE (12*1024*1024)
#define READSIZE (1024*1024)

#define FILENAME    "test_%.04d.tmp"
#define FILECOUNT   100
#define MIN_WORKERS 2
#define MAX_WORKERS 256
#define PAGE_SIZE   4096

/* Set by main() once the workers are joined; polled by fork_thread(). */
volatile bool done = false;
int workers = 2;

/* Fill value for the read buffer; never written to any test file. */
#define PATTERN (0xfa)

static void
usage (void)
{
    fprintf(stderr, "\nUsage: dma_thread [-h | -a <alignment> [-w <workers>]]\n"
            "\nWith no arguments, generate test files and exit.\n"
            "-h Display this help and exit.\n"
            "-a align read buffer to offset <alignment>.\n"
            "-w number of worker threads, 2 to 256,\n"
            "   defaults to number of cores.\n\n"
            "Run first with no arguments to generate files.\n"
            "Then run with -a <alignment> = 512 or 0. \n");
}

typedef struct {
    pthread_t tid;
    int worker_number;
    int fd;
    int offset;
    int length;
    int pattern;
    unsigned char *buffer;
} worker_t;

/* Read this worker's slice of the file with O_DIRECT, then verify it. */
void *
worker_thread(void *arg)
{
    int bytes_read;
    int i, k;
    worker_t *worker = (worker_t *) arg;
    int offset = worker->offset;
    int fd = worker->fd;
    unsigned char *buffer = worker->buffer;
    int pattern = worker->pattern;
    int length = worker->length;

    if (lseek(fd, offset, SEEK_SET) < 0) {
        fprintf(stderr, "Failed to lseek to %d on fd %d: %s.\n",
                offset, fd, strerror(errno));
        exit(1);
    }

    bytes_read = read(fd, buffer, length);
    if (bytes_read != length) {
        fprintf(stderr, "read failed on fd %d: bytes_read %d, %s\n",
                fd, bytes_read, strerror(errno));
        exit(1);
    }

    /* Corruption check */
    for (i = 0; i < length; i++) {
        if (buffer[i] != pattern) {
            printf("Bad data at 0x%.06x: %p, \n", i, (void *)(buffer + i));
            printf("Data dump starting at 0x%.06x:\n", i - 8);
            printf("Expect 0x%x followed by 0x%x:\n", pattern, PATTERN);

            /*
             * The dump window starts 8 bytes before the bad byte, so it
             * may reach into the neighbouring worker's part of the buffer.
             */
            for (k = 0; k < 16; k++) {
                printf("%02x ", buffer[i - 8 + k]);
                if (k == 7) {
                    printf("\n");
                }
            }

            printf("\n");
            abort();
        }
    }

    return 0;
}

/* Fork (and reap) a do-nothing child in a loop until the reads finish. */
void *
fork_thread (void *arg)
{
    pid_t pid;

    while (!done) {
        pid = fork();
        if (pid == 0) {
            exit(0);
        } else if (pid < 0) {
            fprintf(stderr, "Failed to fork child.\n");
            exit(1);
        }
        waitpid(pid, NULL, 0 );
        usleep(100);
    }

    return NULL;
}

int
main(int argc, char *argv[])
{
    unsigned char *buffer = NULL;
    char filename[1024];
    int fd;
    bool dowrite = true;
    pthread_t fork_tid;
    int c, n, j;
    worker_t *worker;
    int align = 0;
    int offset, rc;

    workers = sysconf(_SC_NPROCESSORS_ONLN);

    while ((c = getopt(argc, argv, "a:hw:")) != -1) {
        switch (c) {
        case 'a':
            align = atoi(optarg);
            if (align < 0 || align > PAGE_SIZE) {
                printf("Bad alignment %d.\n", align);
                exit(1);
            }
            dowrite = false;
            break;

        case 'h':
            usage();
            exit(0);
            break;

        case 'w':
            workers = atoi(optarg);
            if (workers < MIN_WORKERS || workers > MAX_WORKERS) {
                fprintf(stderr, "Worker count %d not between "
                        "%d and %d, inclusive.\n",
                        workers, MIN_WORKERS, MAX_WORKERS);
                usage();
                exit(1);
            }
            dowrite = false;
            break;

        default:
            usage();
            exit(1);
        }
    }

    if (argc > 1 && (optind < argc)) {
        fprintf(stderr, "Bad command line.\n");
        usage();
        exit(1);
    }

    if (dowrite) {
        /* Write phase: file n is filled entirely with the byte value n. */
        buffer = malloc(FILESIZE);
        if (buffer == NULL) {
            fprintf(stderr, "Failed to malloc write buffer.\n");
            exit(1);
        }

        for (n = 1; n <= FILECOUNT; n++) {
            sprintf(filename, FILENAME, n);
            fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
            if (fd < 0) {
                printf("create failed(%s): %s.\n", filename, strerror(errno));
                exit(1);
            }
            memset(buffer, n, FILESIZE);
            printf("Writing file %s.\n", filename);
            if (write(fd, buffer, FILESIZE) != FILESIZE) {
                printf("write failed (%s)\n", filename);
            }

            close(fd);
            fd = -1;
        }

        free(buffer);
        buffer = NULL;

        printf("done\n");
        exit(0);
    }

    printf("Using %d workers.\n", workers);

    worker = malloc(workers * sizeof(worker_t));
    if (worker == NULL) {
        fprintf(stderr, "Failed to malloc worker array.\n");
        exit(1);
    }

    for (j = 0; j < workers; j++) {
        worker[j].worker_number = j;
    }

    printf("Using alignment %d.\n", align);

    /* Page-aligned allocation; workers read at buffer + align. */
    rc = posix_memalign((void **)&buffer, PAGE_SIZE, READSIZE + align);
    if (rc != 0) {
        fprintf(stderr, "Failed to allocate read buffer: %s.\n", strerror(rc));
        exit(1);
    }
    printf("Read buffer: %p.\n", (void *)buffer);

    for (n = 1; n <= FILECOUNT; n++) {
        sprintf(filename, FILENAME, n);
        for (j = 0; j < workers; j++) {
            if ((worker[j].fd = open(filename, O_RDONLY|O_DIRECT)) < 0) {
                fprintf(stderr, "Failed to open %s: %s.\n",
                        filename, strerror(errno));
                exit(1);
            }

            worker[j].pattern = n;
        }

        printf("Reading file %d.\n", n);

        for (offset = 0; offset < FILESIZE; offset += READSIZE) {
            memset(buffer, PATTERN, READSIZE + align);
            for (j = 0; j < workers; j++) {
                worker[j].offset = offset + j * PAGE_SIZE;
                worker[j].buffer = buffer + align + j * PAGE_SIZE;
                worker[j].length = PAGE_SIZE;
            }
            /* The final worker reads whatever is left over. */
            worker[workers - 1].length = READSIZE - PAGE_SIZE * (workers - 1);

            done = false;

            rc = pthread_create(&fork_tid, NULL, fork_thread, NULL);
            if (rc != 0) {
                fprintf(stderr, "Can't create fork thread: %s.\n",
                        strerror(rc));
                exit(1);
            }

            for (j = 0; j < workers; j++) {
                rc = pthread_create(&worker[j].tid, NULL,
                                    worker_thread, worker + j);
                if (rc != 0) {
                    fprintf(stderr, "Can't create worker thread %d: %s.\n",
                            j, strerror(rc));
                    exit(1);
                }
            }

            for (j = 0; j < workers; j++) {
                rc = pthread_join(worker[j].tid, NULL);
                if (rc != 0) {
                    fprintf(stderr, "Failed to join worker thread %d: %s.\n",
                            j, strerror(rc));
                    exit(1);
                }
            }

            /* Let the fork thread know it's ok to exit */
            done = true;

            rc = pthread_join(fork_tid, NULL);
            if (rc != 0) {
                fprintf(stderr, "Failed to join fork thread: %s.\n",
                        strerror(rc));
                exit(1);
            }
        }

        /* Close the fd's for the next file. */
        for (j = 0; j < workers; j++) {
            close(worker[j].fd);
        }
    }

    return 0;
}
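The "first 512 bytes" and "last 3584 bytes" figures in the report follow from the buffer layout. The sketch below is only an illustration of that arithmetic, assuming 4096-byte pages and the per-worker layout used by dma_thread.c (worker j reads one page to buffer offset align + j * PAGE_SIZE). The helper names worker_pages() and bytes_past_first_page() are made up for this example and do not appear in the test program.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Inclusive range of buffer pages touched by one worker's read. */
struct page_span {
    size_t first_page;
    size_t last_page;
};

/* Worker j reads `length` bytes to buffer offset align + j * PAGE_SIZE. */
static struct page_span worker_pages(size_t align, size_t j, size_t length)
{
    struct page_span span;
    size_t start = align + j * PAGE_SIZE;

    assert(length > 0);
    span.first_page = start / PAGE_SIZE;
    span.last_page = (start + length - 1) / PAGE_SIZE;
    return span;
}

/* Number of bytes of the read that land beyond its first page. */
static size_t bytes_past_first_page(size_t align, size_t j, size_t length)
{
    struct page_span span = worker_pages(align, j, length);
    size_t next_page_start = (span.first_page + 1) * PAGE_SIZE;
    size_t end = align + j * PAGE_SIZE + length;

    return end > next_page_start ? end - next_page_start : 0;
}
```

With align = 0 every worker stays inside its own page. With align = 512, worker 0 covers pages 0 and 1 and writes the first 512 bytes of page 1, while worker 1 owns the remaining 4096 - 512 = 3584 bytes of that page, so page 1 is pinned by two separate direct I/O requests.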
* Re: Corruption with O_DIRECT and unaligned user buffers
From: Nick Piggin @ 2008-11-19 4:25 UTC
To: Tim LaBerge, Arcangeli, Andrea; +Cc: linux-mm, linux-fsdevel

On Saturday 15 November 2008 04:04, Tim LaBerge wrote:

> However, it appears that data corruption may occur when a multithreaded
> process reads into a non-page size aligned user buffer. A test program
> which reliably reproduces the problem on ext3 and xfs is attached.
>
> The program creates, patterns, reads, and verify a series of files.
>
> In the read phase, a file is opened with O_DIRECT n times, where n is the
> number of cpu's. A single buffer large enough to contain the file is
> allocated and patterned with data not found in any of the files. The
> alignment of the buffer is controlled by a command line option.
>
> Each file is read in parallel by n threads, where n is the number of cpu's.
> Thread 0 reads the first page of data from the file into the first page
> of the buffer, thread 1 reads the second page of data in to the second
> page of the buffer, and so on. Thread n - 1 reads the remainder of the
> file into the remainder of the buffer.
>
> After a thread reads data into the buffer, it immediately verifies that the
> contents of the buffer are correct. If the buffer contains corrupt data,
> the thread dumps the data surrounding the corruption and calls abort().
> Otherwise, the thread exits.
>
> Crucially, before the reader threads are dispatched, another thread is
> started which calls fork()/msleep() in a loop until all reads are
> completed. The child created by fork() does nothing but call exit(0).
>
> A command line option controls whether the buffer is aligned. In the
> case where the buffer is aligned on a page boundary, all is well. In the
> case where the buffer is aligned on a page + 512 byte offset, corruption
> is seen frequently.
>
> I believe that what is happening is that in the direct IO path, because the
> user's buffer is not aligned, some user pages are being mapped twice. When
> a fork() happens in between the calls to map the page, the page will be
> marked as COW. When the second map happens (via get_user_pages()), a new
> physical page will be allocated and copied.
>
> Thus, there is a race between the completion of the first read from disk
> (and write to the user page) and get_user_pages() mapping the page for the
> second time. If the write does not complete before the page is copied, the
> user will see stale data in the first 512 bytes of this page of their
> buffer. Indeed, this is corruption most frequently seen. (It's also
> possible for the race to be lost the other way, so that the last 3584
> bytes of the page are stale.)
>
> The attached program dma_thread.c (which is a heavily modified version of a
> program provided by a customer seeing this problem) reliably reproduces the
> problem on any multicore linux machine on both ext3 and xfs, although any
> filesystem using the generic blockdev_direct_IO() routine is probably
> vulnerable.
>
> I've seen a few threads that mention the potential for this kind of
> problem, but no definitive solution or workaround (other than "Don't do
> that").

I think your analysis is correct. It is in the same class of problems
that Andrea identified with fork and COW vs get_user_pages(). (I'm sorry
Andrea for being really slow in participating in that thread, I've just
been spending some time tinkering and thinking, but I'll reply soon...)

The solution either involves synchronising forks and get_user_pages,
or probably better, to do copy on fork rather than COW in the case
that we detect a page is subject to get_user_pages. The trick is in
the details :)

Thanks for the test program though, that's something I hadn't actually
written myself yet so that's really useful.

For the moment (and previous kernels up to now), I guess you have to
be careful about fork and get_user_pages, unfortunately.
* Re: Corruption with O_DIRECT and unaligned user buffers
From: Nick Piggin @ 2008-11-19 6:52 UTC
To: Tim LaBerge; +Cc: Arcangeli, Andrea, linux-mm, linux-fsdevel

On Wednesday 19 November 2008 15:25, Nick Piggin wrote:

> For the moment (and previous kernels up to now), I guess you have to
> be careful about fork and get_user_pages, unfortunately.

I'm reminded by someone wishing to remain anonymous that one of the
ways that we can "be careful" is to use MADV_DONTFORK for ranges that
may be under direct IO.

Not a beautiful solution, but it might work. If you need some sharing
of that region between parent and child, you could alternatively use a
shared mapping (e.g. MAP_ANONYMOUS | MAP_SHARED) and avoid the COW
issue completely.
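A minimal userspace sketch of the two workarounds mentioned above, assuming a Linux system that provides MADV_DONTFORK. The helper names alloc_dio_buffer() and free_dio_buffer() are hypothetical and not part of any library. The buffer comes from mmap() rather than malloc() so that the MADV_DONTFORK range is never handed back to the heap allocator and reused for unrelated data that a child would expect to inherit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate a page-aligned buffer intended for O_DIRECT reads.
 *
 * share_with_child == 0: private anonymous mapping marked MADV_DONTFORK,
 *                        so fork() does not copy or write-protect it; the
 *                        child simply has no mapping at that address.
 * share_with_child != 0: shared anonymous mapping, which parent and child
 *                        both see and which is never subject to COW.
 *
 * Returns NULL on failure.
 */
static void *alloc_dio_buffer(size_t size, int share_with_child)
{
    int flags = MAP_ANONYMOUS | (share_with_child ? MAP_SHARED : MAP_PRIVATE);
    void *buf;

    assert(size > 0);
    buf = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    if (!share_with_child && madvise(buf, size, MADV_DONTFORK) != 0) {
        munmap(buf, size);
        return NULL;
    }
    return buf;
}

/* Release a buffer obtained from alloc_dio_buffer(). */
static void free_dio_buffer(void *buf, size_t size)
{
    if (buf != NULL)
        munmap(buf, size);
}
```

In the reproducer this would stand in for the posix_memalign() call, for example alloc_dio_buffer(READSIZE + align, 0). Whether it is acceptable depends on the application: with MADV_DONTFORK, a child that touches the buffer address will fault.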
* Re: Corruption with O_DIRECT and unaligned user buffers
From: Andrea Arcangeli @ 2008-11-19 16:58 UTC
To: Nick Piggin; +Cc: Tim LaBerge, linux-mm, linux-fsdevel

On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote:
> The solution either involves synchronising forks and get_user_pages,
> or probably better, to do copy on fork rather than COW in the case
> that we detect a page is subject to get_user_pages. The trick is in
> the details :)

We already have a patch that works.

The only trouble here is get_user_pages_fast, it breaks the fix for
fork, the current ksm (that is safe against get_user_pages but can't
be safe against get_user_pages_fast) and even migrate.c memory-corrupts
against O_DIRECT after the introduction of get_user_pages_fast.

So I recommend focusing on how to fix get_user_pages_fast for any of
the 3 broken pieces, then hopefully the same fix will work for the
other two.

fork is special in that it even breaks against get_user_pages but
again we've a fix for that. The only problem without a solution is how
to serialize against get_user_pages_fast. A brlock was my proposal,
not nice but still better than backing out get_user_pages_fast.
* Re: Corruption with O_DIRECT and unaligned user buffers
From: Andrea Arcangeli @ 2008-12-18 15:29 UTC
To: Nick Piggin; +Cc: Tim LaBerge, linux-mm, linux-fsdevel

On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote:
> > The solution either involves synchronising forks and get_user_pages,
> > or probably better, to do copy on fork rather than COW in the case
> > that we detect a page is subject to get_user_pages. The trick is in
> > the details :)
>
> We already have a patch that works.

Here it is below, had to produce it for rhel (so far it was only in
our minds and it didn't float around just yet).

So this fixes the reported bug for me, Tim can you check to be sure?
Very convenient that I didn't need to write the reproducer myself,
this was a very nice testcase thanks a lot, probably worth adding to
ltp ;).

The problem is that this only fixes it for rhel and other kernels that
don't have get_user_pages_fast yet. You really have to think of some
way to serialize get_user_pages_fast for this and ksm.
get_user_pages_fast makes it an unfixable bug to mark any anon pte from
readwrite to readonly when there could be O_DIRECT on it, this has to
be solved sooner or later...

So last detail, I take it as safe not to check if the pte is writeable
after handle_mm_fault returns as the new address space is private and
the page fault couldn't possibly race with anything (i.e. pte_same is
guaranteed to succeed). For the mainline version we can remove the
page lock and replace it with smp_wmb in add_to_swap_cache and smp_rmb
in the page_count/PG_swapcache read to remove that trylockpage. Given
smp_wmb is barrier() it should be worth it.

If you see something wrong during review below let me know, this is a
tricky place to change. Note the ->open done after copy_page_range
returns in fork, do_wp_page will run and copy anon pages before ->open
is run on the child vma, given those are anon pages I think it should
work but that said, I doubt I exercised in practice any device driver
open method there yet.

Thanks!

------
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: fork-o_direct-race

Think of a thread writing constantly to the last 512bytes of a page, while
another thread reads and writes to/from the first 512bytes of the page. We
can lose O_DIRECT reads, the very moment we mark any pte wrprotected because
a third unrelated thread forks off a child.

This fixes it by never wrprotecting anon ptes if there can be any direct I/O
in flight to the page, and by instantiating a readonly pte and triggering a
COW in the child. The only trouble here are O_DIRECT reads (writes to memory,
read from disk). Checking the page_count under the PT lock guarantees no
get_user_pages could be running under us because if somebody wants to write
to the page, it has to break any cow first and that requires taking the PT
lock in follow_page before increasing the page count.

The COW triggered inside fork will run while the parent pte is read-write,
this is not usual but that's ok as it's only a page copy and it doesn't
modify the page contents.

In the long term there should be a smp_wmb() in between page_cache_get and
SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the
PageSwapCache and the page_count() to remove the trylock op.

Fixed version of original patch from Nick Piggin.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
--- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
+++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
@@ -368,7 +368,7 @@
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, mpnt);
+		retval = copy_page_range(mm, oldmm, tmp);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
diff -ur rhel-5.2/mm/memory.c x/mm/memory.c
--- rhel-5.2/mm/memory.c	2008-07-10 17:26:44.000000000 +0200
+++ x/mm/memory.c	2008-12-18 15:51:17.000000000 +0100
@@ -426,7 +426,7 @@
  * covered by this vma.
  */
 
-static inline void
+static inline int
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
 		unsigned long addr, int *rss)
@@ -434,6 +434,7 @@
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
+	int forcecow = 0;
 
 	/* pte contains position in swap or file, so copy. */
 	if (unlikely(!pte_present(pte))) {
@@ -464,15 +465,6 @@
 	}
 
 	/*
-	 * If it's a COW mapping, write protect it both
-	 * in the parent and the child
-	 */
-	if (is_cow_mapping(vm_flags)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = *src_pte;
-	}
-
-	/*
 	 * If it's a shared mapping, mark it clean in
 	 * the child
 	 */
@@ -484,11 +476,34 @@
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page);
+		if (is_cow_mapping(vm_flags) && PageAnon(page)) {
+			if (unlikely(TestSetPageLocked(page)))
+				forcecow = 1;
+			else {
+				if (unlikely(page_count(page) !=
+					     page_mapcount(page)
+					     + !!PageSwapCache(page)))
+					forcecow = 1;
+				unlock_page(page);
+			}
+		}
 		rss[!!PageAnon(page)]++;
 	}
 
+	/*
+	 * If it's a COW mapping, write protect it both
+	 * in the parent and the child
+	 */
+	if (is_cow_mapping(vm_flags)) {
+		if (!forcecow)
+			ptep_set_wrprotect(src_mm, addr, src_pte);
+		pte = pte_wrprotect(pte);
+	}
+
 out_set_pte:
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+
+	return forcecow;
 }
 
 static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -499,8 +514,10 @@
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[2];
+	int forcecow;
 
 again:
+	forcecow = 0;
 	rss[1] = rss[0] = 0;
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte)
@@ -510,6 +527,9 @@
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 
 	do {
+		if (forcecow)
+			break;
+
 		/*
 		 * We are holding two locks at this point - either of them
 		 * could generate latencies in another task on another CPU.
@@ -525,7 +545,7 @@
 			progress++;
 			continue;
 		}
-		copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
+		forcecow = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
@@ -534,6 +554,10 @@
 	add_mm_rss(dst_mm, rss[0], rss[1]);
 	pte_unmap_unlock(dst_pte - 1, dst_ptl);
 	cond_resched();
+	if (forcecow)
+		if (__handle_mm_fault(dst_mm, vma, addr - PAGE_SIZE, 1) &
+		    (VM_FAULT_OOM | VM_FAULT_SIGBUS))
+			return -ENOMEM;
 	if (addr != end)
 		goto again;
 	return 0;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
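The page_count() test in copy_one_pte() can be read as "does anything other than page tables and the swap cache hold a reference to this page?". The following is a toy userspace model of that predicate only, not kernel code: struct toy_page and the helper names are invented for illustration, and the locking and memory-ordering concerns discussed in the patch description are deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the fields the patch reads from struct page. */
struct toy_page {
    int count;       /* total references, as page_count() would report */
    int mapcount;    /* references held by page tables, as page_mapcount() */
    bool swapcache;  /* true if the swap cache holds one reference */
};

/* References not explained by page tables or the swap cache. */
static int extra_refs(const struct toy_page *p)
{
    assert(p != NULL);
    return p->count - p->mapcount - (p->swapcache ? 1 : 0);
}

/*
 * Mirrors the condition that sets forcecow in the patch: any unexplained
 * reference (for example a get_user_pages() pin taken for direct I/O)
 * means the parent pte is left writable and the child gets its own copy
 * during fork.
 */
static bool must_copy_at_fork(const struct toy_page *p)
{
    return extra_refs(p) != 0;
}
```

For example, a page mapped by two ptes with count 2 is left to ordinary COW, while the same page with count 3 and no swap cache entry is treated as possibly under direct I/O and copied at fork time.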
* Re: Corruption with O_DIRECT and unaligned user buffers
From: KAMEZAWA Hiroyuki @ 2008-12-19 2:21 UTC
To: Andrea Arcangeli; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

On Thu, 18 Dec 2008 16:29:52 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> @@ -484,11 +476,34 @@
>  	if (page) {
>  		get_page(page);
>  		page_dup_rmap(page);
> +		if (is_cow_mapping(vm_flags) && PageAnon(page)) {
> +			if (unlikely(TestSetPageLocked(page)))
> +				forcecow = 1;
> +			else {
> +				if (unlikely(page_count(page) !=
> +					     page_mapcount(page)
> +					     + !!PageSwapCache(page)))
> +					forcecow = 1;
> +				unlock_page(page);
> +			}
> +		}
>  		rss[!!PageAnon(page)]++;
>  	}

- Why do you check only Anon rather than all MAP_PRIVATE mappings ?

Thanks,
-Kame
* Re: Corruption with O_DIRECT and unaligned user buffers
From: KAMEZAWA Hiroyuki @ 2008-12-19 5:06 UTC
To: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli, Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

On Fri, 19 Dec 2008 11:21:25 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 18 Dec 2008 16:29:52 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> > @@ -484,11 +476,34 @@
> >  	if (page) {
> >  		get_page(page);
> >  		page_dup_rmap(page);
> > +		if (is_cow_mapping(vm_flags) && PageAnon(page)) {
> > +			if (unlikely(TestSetPageLocked(page)))
> > +				forcecow = 1;
> > +			else {
> > +				if (unlikely(page_count(page) !=
> > +					     page_mapcount(page)
> > +					     + !!PageSwapCache(page)))
> > +					forcecow = 1;
> > +				unlock_page(page);
> > +			}
> > +		}
> >  		rss[!!PageAnon(page)]++;
> >  	}
>
> - Why do you check only Anon rather than all MAP_PRIVATE mappings ?
>
Sorry, ignore this question.

-Kame
* Re: Corruption with O_DIRECT and unaligned user buffers
From: KOSAKI Motohiro @ 2008-12-19 6:34 UTC
To: Andrea Arcangeli
Cc: kosaki.motohiro, Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

Hi

I don't understand your patch yet; just a dumb question.

> Problem this only fixes it for rhel and other kernels that don't have
> get_user_pages_fast yet. You really have to think at some way to
> serialize get_user_pages_fast for this and ksm. get_user_pages_fast
> makes it a unfixable bug to mark any anon pte from readwrite to
> readonly when there could be O_DIRECT on it, this has to be solved
> sooner or later...

I'm confused.

I think gup_pte_range() doesn't change pte attribute.
Could you explain why get_user_pages_fast() is evil?

> So last detail, I take it as safe not to check if the pte is writeable
> after handle_mm_fault returns as the new address space is private and
> the page fault couldn't possibly race with anything (i.e. pte_same is
> guaranteed to succeed). For the mainline version we can remove the
> page lock and replace with smb_wmb in add_to_swap_cache and smp_rmb in
> the page_count/PG_swapcache read to remove that trylockpage. Given
> smp_wmb is barrier() it should worth it.

Why can't rhel use a memory barrier?
* Re: Corruption with O_DIRECT and unaligned user buffers
From: Andrea Arcangeli @ 2008-12-20 16:02 UTC
To: KOSAKI Motohiro; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

Hello!

On Fri, Dec 19, 2008 at 03:34:20PM +0900, KOSAKI Motohiro wrote:
> I think gup_pte_range() doesn't change pte attribute.
> Could you explain why get_user_pages_fast() is evil?

It's evil because it was assumed that by just relying on the
local_irq_disable() to prevent the smp tlb flush IPI to run, it'd be
enough to simulate a 'current' pagetable walk that allowed the current
task to run entirely lockless.

Problem is that by being totally lockless it prevents us from knowing
if a page is under direct-io or not. And if a page is under direct IO
with writing to memory (reading from memory we cannot care less, it's
always ok) we can't merge pages in ksm or we can't mark the pte
readonly in fork etc... If we do, things break.

The entirely lockless (but atomic) pagetable walk done by the cpu is
different from gup_fast because the one done by the cpu will never end
up writing to the page through the pci bus in DMA, so the moment the
IPI runs whatever I/O is interrupted (not the case for gup_fast, when
gup_fast returns and the IPI runs and page is then available for
sharing to ksm or pte marked readonly, the direct DMA is still in
flight).

That's why gup_fast *can't* be 100% lockless as today, otherwise it's
unfixable and broken and it's not just ksm. This very O_DIRECT bug in
fork is 100% unfixable without adding some serialization to gup_fast.

So my patch fixes it fully only for kernels before the introduction of
gup_fast... My suggestion is to reintroduce the big reader lock
(br_lock) of 2.4 and have gup_fast take the read side of it, and
fork/ksm take the write side. It must not be a write-starving lock like
the 2.4 one though or fork would hang forever on large smp. It should
be still faster than get_user_pages.

> Why rhel can't use memory barrier?

Oh it can, just I didn't implement it immediately as I wanted to ship
a simpler patch first, but given the 27% slowdown measured in later
email, I'll definitely have to replace the TestSetPageLocked with
smp_rmb and see if the introduced overhead goes away.
* Re: Corruption with O_DIRECT and unaligned user buffers 2008-12-18 15:29 ` Andrea Arcangeli 2008-12-19 2:21 ` KAMEZAWA Hiroyuki 2008-12-19 6:34 ` KOSAKI Motohiro @ 2008-12-19 7:19 ` KAMEZAWA Hiroyuki 2008-12-19 7:44 ` Li Zefan 2008-12-20 15:55 ` Andrea Arcangeli 2008-12-19 11:51 ` Li Zefan 3 siblings, 2 replies; 18+ messages in thread From: KAMEZAWA Hiroyuki @ 2008-12-19 7:19 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel On Thu, 18 Dec 2008 16:29:52 +0100 Andrea Arcangeli <aarcange@redhat.com> wrote: > On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote: > > On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote: > > > The solution either involves synchronising forks and get_user_pages, > > > or probably better, to do copy on fork rather than COW in the case > > > that we detect a page is subject to get_user_pages. The trick is in > > > the details :) > > > From: Andrea Arcangeli <aarcange@redhat.com> > Subject: fork-o_direct-race > > Think a thread writing constantly to the last 512bytes of a page, while another > thread read and writes to/from the first 512bytes of the page. We can lose > O_DIRECT reads, the very moment we mark any pte wrprotected because a third > unrelated thread forks off a child. > > This fixes it by never wprotecting anon ptes if there can be any direct I/O in > flight to the page, and by instantiating a readonly pte and triggering a COW in > the child. The only trouble here are O_DIRECT reads (writes to memory, read > from disk). Checking the page_count under the PT lock guarantees no > get_user_pages could be running under us because if somebody wants to write to > the page, it has to break any cow first and that requires taking the PT lock in > follow_page before increasing the page count. > > The COW triggered inside fork will run while the parent pte is read-write, this > is not usual but that's ok as it's only a page copy and it doesn't modify the > page contents. 
>
> In the long term there should be a smp_wmb() in between page_cache_get and
> SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the
> PageSwapCache and the page_count() to remove the trylock op.
>
> Fixed version of original patch from Nick Piggin.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Confirmed this fixes the problem.

Hmm, but fork() gets slower.

Result of cost-of-fork() on ia64:
==
size of memory   before   after
Anon=1M        , 0.07ms , 0.08ms
Anon=10M       , 0.17ms , 0.22ms
Anon=100M      , 1.15ms , 1.64ms
Anon=1000M     , 11.5ms , 15.821ms
==

fork() cost is 135% when the process has 1G of Anon.

The test program is below (used "/usr/bin/time" for measurement).
==
#include <stdlib.h>
#include <string.h>	/* memset() */
#include <unistd.h>	/* fork() */
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
	int size, i, status;
	char *c;

	if (argc < 2)
		return 1;	/* usage: prog <anon size in MB> */

	size = atoi(argv[1]) * 1024 * 1024;
	c = malloc(size);
	if (!c)
		return 1;
	memset(c, 0, size);	/* touch every page so fork() has ptes to copy */
	for (i = 0; i < 5000; i++) {
		if (!fork()) {
			exit(0);
		}
		wait(&status);
	}
	return 0;
}
==

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Corruption with O_DIRECT and unaligned user buffers 2008-12-19 7:19 ` KAMEZAWA Hiroyuki @ 2008-12-19 7:44 ` Li Zefan 2008-12-19 8:45 ` Li Zefan 2008-12-19 20:27 ` Andrea Arcangeli 2008-12-20 15:55 ` Andrea Arcangeli 1 sibling, 2 replies; 18+ messages in thread From: Li Zefan @ 2008-12-19 7:44 UTC (permalink / raw) To: KAMEZAWA Hiroyuki, Andrea Arcangeli Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, FNST-Wang Chen [-- Attachment #1: Type: text/plain, Size: 2265 bytes --] KAMEZAWA Hiroyuki wrote: > On Thu, 18 Dec 2008 16:29:52 +0100 > Andrea Arcangeli <aarcange@redhat.com> wrote: > >> On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote: >>> On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote: >>>> The solution either involves synchronising forks and get_user_pages, >>>> or probably better, to do copy on fork rather than COW in the case >>>> that we detect a page is subject to get_user_pages. The trick is in >>>> the details :) > >> From: Andrea Arcangeli <aarcange@redhat.com> >> Subject: fork-o_direct-race >> >> Think a thread writing constantly to the last 512bytes of a page, while another >> thread read and writes to/from the first 512bytes of the page. We can lose >> O_DIRECT reads, the very moment we mark any pte wrprotected because a third >> unrelated thread forks off a child. >> >> This fixes it by never wprotecting anon ptes if there can be any direct I/O in >> flight to the page, and by instantiating a readonly pte and triggering a COW in >> the child. The only trouble here are O_DIRECT reads (writes to memory, read >> from disk). Checking the page_count under the PT lock guarantees no >> get_user_pages could be running under us because if somebody wants to write to >> the page, it has to break any cow first and that requires taking the PT lock in >> follow_page before increasing the page count. 
>> >> The COW triggered inside fork will run while the parent pte is read-write, this >> is not usual but that's ok as it's only a page copy and it doesn't modify the >> page contents. >> >> In the long term there should be a smp_wmb() in between page_cache_get and >> SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the >> PageSwapCache and the page_count() to remove the trylock op. >> >> Fixed version of original patch from Nick Piggin. >> >> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > > Confirmed this fixes the problem. > We tested with RHEL 5.2 + patch on i386 using the test program provided by Tim LaBerge, though the program can pass but sometimes hanged. strace log is attached, and we'll test it again with LOCKDEP enabled to see if we can get some other information. BTW, the patch works fine on IA64. > Hmm, but, fork() gets slower. [-- Attachment #2: strace.log --] [-- Type: text/x-log, Size: 25241 bytes --] xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5193 futex(0xb6a18bd8, FUTEX_WAIT, 5192, NULL) = 0 futex(0xb7419bd8, FUTEX_WAIT, 5193, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5191, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5200 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, 
child_tidptr=0xb7419bd8) = 5201 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5202 futex(0xb7419bd8, FUTEX_WAIT, 5201, NULL) = 0 futex(0xb6a18bd8, FUTEX_WAIT, 5202, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5200, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5207 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5208 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5209 futex(0xb6a18bd8, FUTEX_WAIT, 5208, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5207, NULL) = -1 EAGAIN (Resource temporarily unavailable) clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, 
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5221 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5222 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5223 futex(0xb7419bd8, FUTEX_WAIT, 5222, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5221, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5228 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5229 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, 
seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5230 futex(0xb6a18bd8, FUTEX_WAIT, 5229, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5228, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5234 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5235 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5236 futex(0xb7419bd8, FUTEX_WAIT, 5235, NULL) = 0 futex(0xb6a18bd8, FUTEX_WAIT, 5236, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5234, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5241 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, 
seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5242 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5243 futex(0xb6a18bd8, FUTEX_WAIT, 5242, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5241, NULL) = 0 close(3) = 0 close(4) = 0 open("test_0060.tmp", O_RDONLY|O_DIRECT) = 3 open("test_0060.tmp", O_RDONLY|O_DIRECT) = 4 write(1, "Reading file 60.\n", 17Reading file 60. ) = 17 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5248 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5249 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5250 futex(0xb7419bd8, FUTEX_WAIT, 5249, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5248, NULL) = 0 clone(child_stack=0xb7e1a4b4, 
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5257 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5258 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5259 futex(0xb6a18bd8, FUTEX_WAIT, 5258, NULL) = 0 futex(0xb7419bd8, FUTEX_WAIT, 5259, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5257, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5266 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5267 clone(child_stack=0xb6a184b4, 
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5268 futex(0xb7419bd8, FUTEX_WAIT, 5267, NULL) = 0 futex(0xb6a18bd8, FUTEX_WAIT, 5268, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5266, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5279 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5280 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5281 futex(0xb6a18bd8, FUTEX_WAIT, 5280, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5279, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, 
child_tidptr=0xb7e1abd8) = 5288 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5289 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5290 futex(0xb7419bd8, FUTEX_WAIT, 5289, NULL) = 0 futex(0xb6a18bd8, FUTEX_WAIT, 5290, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5288, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5297 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5298 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5299 
futex(0xb6a18bd8, FUTEX_WAIT, 5298, NULL) = 0 futex(0xb7419bd8, FUTEX_WAIT, 5299, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5297, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5306 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5307 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5308 futex(0xb7419bd8, FUTEX_WAIT, 5307, NULL) = 0 futex(0xb6a18bd8, FUTEX_WAIT, 5308, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5306, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5313 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, 
contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5314 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5315 futex(0xb6a18bd8, FUTEX_WAIT, 5314, NULL) = 0 futex(0xb7419bd8, FUTEX_WAIT, 5315, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5313, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5320 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5321 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5322 futex(0xb7419bd8, FUTEX_WAIT, 5321, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5320, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, 
{entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5328 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5329 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5331 futex(0xb6a18bd8, FUTEX_WAIT, 5329, NULL) = -1 EAGAIN (Resource temporarily unavailable) futex(0xb7419bd8, FUTEX_WAIT, 5331, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5328, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5337 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5338 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, 
parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5339 futex(0xb7419bd8, FUTEX_WAIT, 5338, NULL) = 0 futex(0xb6a18bd8, FUTEX_WAIT, 5339, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5337, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5356 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5357 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5358 futex(0xb6a18bd8, FUTEX_WAIT, 5357, NULL) = 0 futex(0xb7419bd8, FUTEX_WAIT, 5358, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5356, NULL) = 0 close(3) = 0 close(4) = 0 open("test_0061.tmp", O_RDONLY|O_DIRECT) = 3 open("test_0061.tmp", O_RDONLY|O_DIRECT) = 4 write(1, "Reading file 61.\n", 17Reading file 61. 
) = 17 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5366 clone(child_stack=0xb74194b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5367 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5369 futex(0xb7419bd8, FUTEX_WAIT, 5367, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5366, NULL) = 0 clone(child_stack=0xb7e1a4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7e1abd8, {entry_number:6, base_addr:0xb7e1ab90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e1abd8) = 5372 clone(child_stack=0xb6a184b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6a18bd8, {entry_number:6, base_addr:0xb6a18b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6a18bd8) = 5373 clone(child_stack=0xb74194b4, 
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7419bd8, {entry_number:6, base_addr:0xb7419b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7419bd8) = 5375 futex(0xb6a18bd8, FUTEX_WAIT, 5373, NULL) = 0 futex(0xb7419bd8, FUTEX_WAIT, 5375, NULL) = 0 futex(0xb7e1abd8, FUTEX_WAIT, 5372, NULL ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19  7:44 ` Li Zefan
@ 2008-12-19  8:45 ` Li Zefan
  2008-12-19 20:27 ` Andrea Arcangeli
  1 sibling, 0 replies; 18+ messages in thread
From: Li Zefan @ 2008-12-19 8:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, FNST-Wang Chen

Li Zefan wrote:
> KAMEZAWA Hiroyuki wrote:
>> On Thu, 18 Dec 2008 16:29:52 +0100
>> Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>>> On Wed, Nov 19, 2008 at 05:58:19PM +0100, Andrea Arcangeli wrote:
>>>> On Wed, Nov 19, 2008 at 03:25:59PM +1100, Nick Piggin wrote:
>>>>> The solution either involves synchronising forks and get_user_pages,
>>>>> or probably better, to do copy on fork rather than COW in the case
>>>>> that we detect a page is subject to get_user_pages. The trick is in
>>>>> the details :)
>>> From: Andrea Arcangeli <aarcange@redhat.com>
>>> Subject: fork-o_direct-race
>>>
>>> Think a thread writing constantly to the last 512bytes of a page, while another
>>> thread read and writes to/from the first 512bytes of the page. We can lose
>>> O_DIRECT reads, the very moment we mark any pte wrprotected because a third
>>> unrelated thread forks off a child.
>>>
>>> This fixes it by never wprotecting anon ptes if there can be any direct I/O in
>>> flight to the page, and by instantiating a readonly pte and triggering a COW in
>>> the child. The only trouble here are O_DIRECT reads (writes to memory, read
>>> from disk). Checking the page_count under the PT lock guarantees no
>>> get_user_pages could be running under us because if somebody wants to write to
>>> the page, it has to break any cow first and that requires taking the PT lock in
>>> follow_page before increasing the page count.
>>>
>>> The COW triggered inside fork will run while the parent pte is read-write, this
>>> is not usual but that's ok as it's only a page copy and it doesn't modify the
>>> page contents.
>>>
>>> In the long term there should be a smp_wmb() in between page_cache_get and
>>> SetPageSwapCache in __add_to_swap_cache and a smp_rmb in between the
>>> PageSwapCache and the page_count() to remove the trylock op.
>>>
>>> Fixed version of original patch from Nick Piggin.
>>>
>>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>> Confirmed this fixes the problem.
>>
>
> We tested with RHEL 5.2 + patch on i386 using the test program provided by
> Tim LaBerge, though the program can pass but sometimes hanged. strace log is
> attached, and we'll test it again with LOCKDEP enabled to see if we can get
> some other information.
>

# ./dma_thread -a 512
Using 2 workers.
Using alignment 512.
Read buffer: 0xb7e4e000.
Reading file 1.
Reading file 2.
...
Reading file 26.
Reading file 27.
(hang here, Ctrl+C can break the process)

We also modified the test to run as 'dma_thread -a 512 -w 1'; we can still see
the hang, at a very low frequency.

==============

Here is a snapshot of the call trace:

dma_thread    S 00000035  2872 20296   8797         23256       (NOTLB)
       f7018e78 00000046 1f593e7d 00000035 f7018e84 00000002 00000000 00000006
       f4c35530 f71ac030 1f5a03da 00000035 0000c55d 00000001 f4c3563c c1a80044
       f7018f04 f7018f1c b7e4cbd8 00000046 00000000 00000002 00000001 7fffffff
Call Trace:
 [<c061bd10>] schedule_timeout+0x13/0x8c
 [<c043c435>] do_futex+0x1e2/0xb38
 [<c061d316>] _spin_unlock+0x14/0x1c
 [<c0465938>] do_wp_page+0x3fb/0x405
 [<c0466da0>] __handle_mm_fault+0x858/0x8b8
 [<c041e5f3>] default_wake_function+0x0/0xc
 [<c044e32b>] audit_syscall_entry+0x14b/0x17d
 [<c043ce9c>] sys_futex+0x111/0x127
 [<c0408076>] do_syscall_trace+0xab/0xb1
 [<c0404f53>] syscall_call+0x7/0xb
 =======================
dma_thread    S 00000035  3304 23256   8797 23258         20296 (NOTLB)
       f4e24f50 00000046 1ec0e26d 00000035 c073ea10 416db065 00000046 00000003
       f70ac030 c1b7eab0 1ec7d7c9 00000035 0006f55c 00000001 f70ac13c c1a80044
       00005ada f4cc0030 00000000 00000246 ffffffff 00000000 00000000 f53acab0
Call Trace:
 [<c0426b84>] do_wait+0x8b5/0x9a3
 [<c044e32b>] audit_syscall_entry+0x14b/0x17d
 [<c041e5f3>] default_wake_function+0x0/0xc
 [<c0426c99>] sys_wait4+0x27/0x2a
 [<c0426caf>] sys_waitpid+0x13/0x17
 [<c0404f53>] syscall_call+0x7/0xb
 =======================
dma_thread    R running  3412 23258  23256                     (NOTLB)
...
...
Showing all locks held in the system:
4 locks held by kseriod/82:
 #0:  (serio_mutex){--..}, at: [<c059c7f6>] serio_thread+0x13/0x28d
 #1:  (&serio->drv_mutex){--..}, at: [<c059be7e>] serio_connect_driver+0x16/0x2c
 #2:  (psmouse_mutex){--..}, at: [<c05a41ce>] psmouse_connect+0x18/0x211
 #3:  (&ps2_mutex_key){--..}, at: [<c059e4bd>] ps2_command+0x80/0x2dc

=============================================

> BTW, the patch works fine on IA64.
>
>> Hmm, but, fork() gets slower.
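The runs above pass -a 512, which in the original report corresponds to a read buffer placed 512 bytes past a page boundary; the same report saw no corruption when the buffer started on a page boundary. As a rough userspace illustration of that aligned case, the sketch below allocates a page-aligned buffer. alloc_page_aligned is a hypothetical helper and is not taken from dma_thread.c; posix_memalign() and sysconf() are the standard POSIX calls. This only sidesteps the unaligned case from the application side, it does not address the kernel race or the hang discussed in this thread.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of dma_thread.c): allocate `size` bytes
 * starting on a page boundary, suitable as an O_DIRECT read buffer.
 * Returns NULL on failure; release the buffer with free().
 */
static void *alloc_page_aligned(size_t size)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    assert(size > 0);

    if (page_size <= 0)
        return NULL;

    /* posix_memalign() returns 0 on success and an error number otherwise. */
    if (posix_memalign(&buf, (size_t)page_size, size) != 0)
        return NULL;

    return buf;
}
```

A caller would pass the returned pointer to read(2) on a descriptor opened with O_DIRECT, keeping the file offset and transfer size aligned as well.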
* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19  7:44 ` Li Zefan
  2008-12-19  8:45 ` Li Zefan
@ 2008-12-19 20:27 ` Andrea Arcangeli
  1 sibling, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2008-12-19 20:27 UTC (permalink / raw)
  To: Li Zefan
  Cc: KAMEZAWA Hiroyuki, Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, FNST-Wang Chen

Hello,

On Fri, Dec 19, 2008 at 03:44:09PM +0800, Li Zefan wrote:
> Tim LaBerge, though the program can pass but sometimes hanged. strace log is
> attached, and we'll test it again with LOCKDEP enabled to see if we can get
> some other information.

So my current suggestion on this is to understand why __reclaim_stacks
is not starting with a lll_unlock before the list_for_each runs. I'll
look into this next week if nobody has explained it by then ;).

Statistically speaking it's more likely that the kernel patch is buggy,
and I know this is likely a faulty theory, but it's not impossible that
this is an unrelated bug that was hidden because it required userland
list_del/add/splice to race against the kernel's single-instruction
ptep_set_wrprotect.
* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19  7:19 ` KAMEZAWA Hiroyuki
  2008-12-19  7:44 ` Li Zefan
@ 2008-12-20 15:55 ` Andrea Arcangeli
  1 sibling, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2008-12-20 15:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel

On Fri, Dec 19, 2008 at 04:19:11PM +0900, KAMEZAWA Hiroyuki wrote:
> Result of cost-of-fork() on ia64.
> ==
> size of memory  before  after
> Anon=1M   , 0.07ms, 0.08ms
> Anon=10M  , 0.17ms, 0.22ms
> Anon=100M , 1.15ms, 1.64ms
> Anon=1000M , 11.5ms, 15.821ms
> ==
>
> fork() cost is 135% when the process has 1G of Anon.

Not sure where the 135% number comes from. The above numbers show a
performance decrease of 27% or a time increase of 37%, which I hope is
in line with the overhead introduced by the TestSetPageLocked in the
fast path (which I didn't expect to be so bad), but which is almost
trivial to eliminate with a smp_wmb in add_to_swap_cache and a smp_rmb
in fork. So we'll need to repeat this measurement after replacing the
TestSetPageLocked with smp_rmb.
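The two percentages in the reply above come from the same pair of measurements read in two directions. A small sketch of that arithmetic, using the Anon=1000M row (11.5ms before, 15.821ms after); the function names are made up for this illustration:

```c
#include <assert.h>

/* Extra time per fork(), as a percentage of the old time. */
static double time_increase_pct(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (after_ms / before_ms - 1.0) * 100.0;
}

/* Drop in fork() rate (forks per unit of time), as a percentage of the old rate. */
static double rate_decrease_pct(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return (1.0 - before_ms / after_ms) * 100.0;
}
```

For the 1000M row, time_increase_pct(11.5, 15.821) is roughly 37.6 and rate_decrease_pct(11.5, 15.821) is roughly 27.3, matching the 37% and 27% quoted in the reply.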
* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-18 15:29 ` Andrea Arcangeli
  ` (2 preceding siblings ...)
  2008-12-19  7:19 ` KAMEZAWA Hiroyuki
@ 2008-12-19 11:51 ` Li Zefan
  2008-12-19 12:14 ` KOSAKI Motohiro
  ` (2 more replies)
  3 siblings, 3 replies; 18+ messages in thread
From: Li Zefan @ 2008-12-19 11:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, Wang Chen

> diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
> --- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
> +++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
> @@ -368,7 +368,7 @@
>  		rb_parent = &tmp->vm_rb;
>
>  		mm->map_count++;
> -		retval = copy_page_range(mm, oldmm, mpnt);
> +		retval = copy_page_range(mm, oldmm, tmp);
>

Could you explain a bit why this change is needed?

Seems this is a revert of the following commit:

commit 0b0db14c536debd92328819fe6c51a49717e8440
Author: Hugh Dickins <hugh@veritas.com>
Date:   Mon Nov 21 21:32:20 2005 -0800

    [PATCH] unpaged: copy_page_range vma

    For copy_one_pte's print_bad_pte to show the task correctly (instead of
    "???"), dup_mmap must pass down parent vma rather than child vma.

    Signed-off-by: Hugh Dickins <hugh@veritas.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/kernel/fork.c b/kernel/fork.c
index e0d0b77..1c1cf8d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -263,7 +263,7 @@ static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		rb_parent = &tmp->vm_rb;

 		mm->map_count++;
-		retval = copy_page_range(mm, oldmm, tmp);
+		retval = copy_page_range(mm, oldmm, mpnt);

 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);

>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);
* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19 11:51 ` Li Zefan
@ 2008-12-19 12:14 ` KOSAKI Motohiro
  2008-12-19 12:58 ` Hugh Dickins
  2008-12-19 20:34 ` Andrea Arcangeli
  2 siblings, 0 replies; 18+ messages in thread
From: KOSAKI Motohiro @ 2008-12-19 12:14 UTC (permalink / raw)
  To: Li Zefan
  Cc: kosaki.motohiro, Andrea Arcangeli, Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, Wang Chen

> > diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
> > --- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
> > +++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
> > @@ -368,7 +368,7 @@
> >  		rb_parent = &tmp->vm_rb;
> >
> >  		mm->map_count++;
> > -		retval = copy_page_range(mm, oldmm, mpnt);
> > +		retval = copy_page_range(mm, oldmm, tmp);
> >
>
> Could you explain a bit why this change is needed?

Maybe: __handle_mm_fault() changes the rmap of the vma passed to it.
We need the parent process to keep the original page and the child
process to get the new page, so we need to pass the child vma.

> Seems this is a revert of the following commit:
>
> commit 0b0db14c536debd92328819fe6c51a49717e8440
> Author: Hugh Dickins <hugh@veritas.com>
> Date:   Mon Nov 21 21:32:20 2005 -0800
>
>     [PATCH] unpaged: copy_page_range vma
>
>     For copy_one_pte's print_bad_pte to show the task correctly (instead of
>     "???"), dup_mmap must pass down parent vma rather than child vma.

I think you are right. This patch reintroduces the same problem.

In the end, print_bad_pte() needs the parent vma and
__handle_mm_fault() needs the child vma. Correct?
* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19 11:51 ` Li Zefan
  2008-12-19 12:14 ` KOSAKI Motohiro
@ 2008-12-19 12:58 ` Hugh Dickins
  2008-12-19 20:34 ` Andrea Arcangeli
  2 siblings, 0 replies; 18+ messages in thread
From: Hugh Dickins @ 2008-12-19 12:58 UTC (permalink / raw)
  To: Li Zefan
  Cc: Andrea Arcangeli, Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, Wang Chen

On Fri, 19 Dec 2008, Li Zefan wrote:
> > diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
> > --- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
> > +++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
> > @@ -368,7 +368,7 @@
> >  		rb_parent = &tmp->vm_rb;
> >
> >  		mm->map_count++;
> > -		retval = copy_page_range(mm, oldmm, mpnt);
> > +		retval = copy_page_range(mm, oldmm, tmp);
> >
>
> Could you explain a bit why this change is needed?
>
> Seems this is a revert of the following commit:
>
> commit 0b0db14c536debd92328819fe6c51a49717e8440
> Author: Hugh Dickins <hugh@veritas.com>
> Date:   Mon Nov 21 21:32:20 2005 -0800
>
>     [PATCH] unpaged: copy_page_range vma
>
>     For copy_one_pte's print_bad_pte to show the task correctly (instead of
>     "???"), dup_mmap must pass down parent vma rather than child vma.
>
>     Signed-off-by: Hugh Dickins <hugh@veritas.com>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e0d0b77..1c1cf8d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -263,7 +263,7 @@ static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  		rb_parent = &tmp->vm_rb;
>
>  		mm->map_count++;
> -		retval = copy_page_range(mm, oldmm, tmp);
> +		retval = copy_page_range(mm, oldmm, mpnt);
>
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>  			tmp->vm_ops->open(tmp);
>
> >  		if (tmp->vm_ops && tmp->vm_ops->open)
> >  			tmp->vm_ops->open(tmp);

[I'm not finding much time to think about anything at the moment,
so reluctant even to stick my head above the parapet; but this is
easy, and though there might be lots of things I'd dislike about
Andrea's patch if I had time to study it ;-), this certainly isn't
one of them.]

This should be a non-issue: although the patch that this reverts
was valid in itself, it arose from my misunderstanding (of the
likely relevance of current->comm in exit_mmap - much more likely
to be relevant than I was thinking at the time) that I forced upon
Nick in print_bad_pte(). And now I've a rewrite of print_bad_pte()
queued up in -mm, which admits that misunderstanding and removes
the "???" case: so in 2.6.29 it shouldn't matter if we pass parent
or child vma to copy_page_range.

Oh, and it doesn't even matter in 2.6.26 onwards either: they don't
have any calls to print_bad_pte() below copy_page_range(). Though
I've not checked whether we might have added some other dependence
on it being parent vma in the meanwhile - that's possible.

Hugh
* Re: Corruption with O_DIRECT and unaligned user buffers
  2008-12-19 11:51 ` Li Zefan
  2008-12-19 12:14 ` KOSAKI Motohiro
  2008-12-19 12:58 ` Hugh Dickins
@ 2008-12-19 20:34 ` Andrea Arcangeli
  2 siblings, 0 replies; 18+ messages in thread
From: Andrea Arcangeli @ 2008-12-19 20:34 UTC (permalink / raw)
  To: Li Zefan; +Cc: Nick Piggin, Tim LaBerge, linux-mm, linux-fsdevel, Wang Chen

On Fri, Dec 19, 2008 at 07:51:49PM +0800, Li Zefan wrote:
> > diff -ur rhel-5.2/kernel/fork.c x/kernel/fork.c
> > --- rhel-5.2/kernel/fork.c	2008-07-10 17:26:43.000000000 +0200
> > +++ x/kernel/fork.c	2008-12-18 15:57:31.000000000 +0100
> > @@ -368,7 +368,7 @@
> >  		rb_parent = &tmp->vm_rb;
> >
> >  		mm->map_count++;
> > -		retval = copy_page_range(mm, oldmm, mpnt);
> > +		retval = copy_page_range(mm, oldmm, tmp);
> >
>
> Could you explain a bit why this change is needed?

This change is needed to pass the child vma (not the parent vma) to
handle_mm_fault. We run handle_mm_fault on the child not on the
parent, so the vma passed to handle_mm_fault has to be the one of the
child obviously. It won't make a difference for the other users of the
vma because both vma are basically the same. Nick did it btw.
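To make the point about the child vma concrete, here is a toy userspace model of the per-pte decision at fork() time, following the patch description quoted earlier in the thread (compare the page's reference count with its mapping count; if something else may hold the page, copy it for the child right away instead of write-protecting the parent). This is a sketch under stated assumptions and not the kernel code: every toy_* name is hypothetical, the counters are plain ints with no locking, and the real patch also has to handle the swap cache and the page lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page: only the two counters the check uses. */
struct toy_page {
    int count;      /* every reference, including get_user_pages()-style pins */
    int mapcount;   /* references held by page tables only */
    unsigned char data[TOY_PAGE_SIZE];
};

/* Hypothetical stand-in for one pte inside a vma. */
struct toy_pte {
    struct toy_page *page;
    bool writable;
};

/* More references than mappings: someone (e.g. direct I/O) may hold the page. */
static bool toy_page_maybe_pinned(const struct toy_page *page)
{
    return page->count != page->mapcount;
}

/*
 * Toy version of the per-pte step of fork().
 * Returns 0 on success, -1 if the early copy cannot be allocated.
 */
static int toy_copy_one_pte(struct toy_pte *parent, struct toy_pte *child)
{
    assert(parent && child && parent->page);

    if (toy_page_maybe_pinned(parent->page)) {
        /* Early copy: the new page is installed in the CHILD's pte. */
        struct toy_page *copy = malloc(sizeof(*copy));
        if (!copy)
            return -1;
        memcpy(copy->data, parent->page->data, TOY_PAGE_SIZE);
        copy->count = 1;
        copy->mapcount = 1;
        child->page = copy;
        child->writable = true;
        /* The parent pte is left alone, so in-flight I/O still lands in
         * the page the parent keeps using. */
        return 0;
    }

    /* Usual path: share the page and write-protect both sides. */
    parent->page->count++;
    parent->page->mapcount++;
    parent->writable = false;
    child->page = parent->page;
    child->writable = false;
    return 0;
}
```

In this model the early copy is the step that corresponds to running the fault against the child: the copy must end up in the child's mapping while the parent keeps the original page, which is why the child vma is the one handed down.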