public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* can device drivers return non-ram via vm_ops->nopage?
@ 2004-03-20 13:30 Andrea Arcangeli
  2004-03-20 14:40 ` William Lee Irwin III
  2004-03-20 17:39 ` Linus Torvalds
  0 siblings, 2 replies; 67+ messages in thread
From: Andrea Arcangeli @ 2004-03-20 13:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Linus Torvalds

The only bugreport I've got so far for the latest anon_vma code is from
Jens, and it's a device driver bug in my opinion, but I'd like to have a
definitive confirmation from you about the ->nopage API.

I changed ->nopage like this to catch bugs:

retry:
	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);

	/*
	 * non-ram cannot be mapped via ->nopage, it must
	 * be mapped via remap_page_range instead synchronously
	 * in the ->mmap device driver callback.
	 *
	 * PageReserved pages can be mapped as far as they're under
	 * a VM_RESERVED vma.
	 */
	BUG_ON(!pfn_valid(page_to_pfn(new_page)));

	/* ->nopage cannot return swapcache */
	BUG_ON(PageSwapCache(new_page));
	/* ->nopage cannot return anonymous pages */
	BUG_ON(PageAnon(new_page));

	/*
	 * This is the entry point for memory under VM_RESERVED vmas.
	 * That memory will not be tracked by the vm. These aren't
	 * real anonymous pages, they're "device" reserved pages
	 * instead.
	 * These pages under VM_RESERVED vmas are the only pages mapped
	 * by the VM into userspace with page->as.mapping = NULL.
	 */
	reserved = vma->vm_flags & VM_RESERVED;
	BUG_ON(!reserved && (!new_page->mapping || PageReserved(new_page)));


really it would not be mandatory for me to enforce the last BUG_ON,
since we don't do the pagetable walk anymore to unmap stuff, but I think
it's nicer to enforce the model in the drivers so if we'll ever want to
do the pagetable walk again, we could, if we giveup then we'll be unable
to go back to the pagetable walk. I'm not saying that we'll ever want to
go back, but since most drivers are already setting VM_RESREVED
correctly to work with 2.4, I believe it worth to maintain this
abstraction so if we really want we can go back.

Anyways returning to the non-ram returned by ->nopage see the below
email exchange with Jens. the bug triggering of course is the
BUG_ON(!pfn_valid(page_to_pfn(new_page))).

If we want to return non-ram, we could, but I believe we should change
the API to return a pfn not a page_t * if we want to.

----- Forwarded message from Andrea Arcangeli <andrea@suse.de> -----

Date: Sat, 20 Mar 2004 14:21:56 +0100
From: Andrea Arcangeli <andrea@suse.de>
To: Jens Axboe <axboe@suse.de>

On Fri, Mar 19, 2004 at 01:32:13PM +0100, Jens Axboe wrote:
> kernel BUG at mm/memory.c:1412!
> invalid operand: 0000 [#1]
> SMP 
> CPU:    1
> EIP:    0060:[<c01407fe>]    Not tainted
> EFLAGS: 00010216   (2.6.4-0-axboe) 
> EIP is at do_no_page+0x42c/0x4dc
> eax: 01f80000   ebx: 00000000   ecx: 00001000   edx: f0c5be78
> esi: f0860280   edi: c0453880   ebp: f0c5bec0   esp: f0c5be88
> ds: 007b   es: 007b   ss: 0068
> Process mplayer (pid: 1500, threadinfo=f0c5a000 task=f149d300)
> Stack: f7f997a0 f7f997a0 00000282 f06fec80 f0c5bea8 00000000 f7d8c984
> 40b0e000 
>        f0860280 f1267800 00000001 f0c5a000 f0866c38 c0453880 f0c5bf0c
> c01415da 
>        00000000 f0866c38 f08b3408 00000000 000081ff 405ae7a3 1be20e55
> f7de0480 
> Call Trace:
>  [<c01415da>] handle_mm_fault+0xf3/0x694
>  [<c011674d>] do_page_fault+0x16c/0x535
>  [<c0144179>] __do_mmap_pgoff+0x34c/0x643
>  [<c010c44a>] do_mmap2+0x7a/0xa8
>  [<c01165e1>] do_page_fault+0x0/0x535
>  [<c0106aad>] error_code+0x2d/0x38

a device driver is returning a non-ram page via ->nopage.

I don't think this has ever been safe, it's just that my more robust
anon_vma code is trapping this bug, I think non-ram pages should use
remap_file_pages not ->nopage.

Let's assume I'm wrong and you can return non-ram via ->nopage (even
ignoring the API would be totally incorrect since one should return a
'pfn' not a 'page_t *' if really ->nopage can return non-ram), let's
take plain 2.6.5-rc1 (w/o my anon_vma code)

	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
[..]
	if (pte_none(*page_table)) {
		if (!PageReserved(new_page))
			++mm->rss;
		flush_icache_page(vma, new_page);
		entry = mk_pte(new_page, vma->vm_page_prot);
		if (write_access)
			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		set_pte(page_table, entry);
		pte_chain = page_add_rmap(new_page, page_table, pte_chain);
		pte_unmap(page_table);


PageReserved(new_page) is already reading random memory, that could even
genrate a machine exception on amd64 or ia64 and lock the box hard.

then it goes ahead and it even does page_add_rmap(new_page), writing
new_page->pte_chain in non-ram with, again potentially crashing the box.
It's ironic that Andrew removed a pfn_valid check in front of
page_add_rmap just in 2.6.5-rc1 (previous kernels wouldn't overwrite
random non-ram there, but still the unrecoverable machine check was
still there simply due the PageReserved).

So I think my anon_vma code is right forbidding non-ram to be returned
by ->nopage, and the device driver should be fixed.

If you disagree, I can change ->nopage to survive a non-ram page, but
besides the API of returning a page_t * would be misleading, this would
be a new feature, and it wouldn't work stable with mainline kernels
(regardless if page_add_rmap starts with a pfn_valid check or not).

----- End forwarded message -----

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2004-03-23 18:11 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-03-20 13:30 can device drivers return non-ram via vm_ops->nopage? Andrea Arcangeli
2004-03-20 14:40 ` William Lee Irwin III
2004-03-20 15:06   ` Andrea Arcangeli
2004-03-20 15:27     ` William Lee Irwin III
2004-03-20 15:44     ` Russell King
2004-03-20 15:57       ` Andrea Arcangeli
2004-03-20 16:15         ` Russell King
2004-03-20 16:25           ` Andrea Arcangeli
2004-03-20 16:57             ` William Lee Irwin III
2004-03-20 17:48             ` Andrea Arcangeli
2004-03-20 19:03               ` Andrea Arcangeli
2004-03-20 15:58       ` Jaroslav Kysela
2004-03-20 16:09         ` Russell King
2004-03-20 19:44           ` Jaroslav Kysela
2004-03-20 22:23             ` Russell King
2004-03-20 22:45               ` William Lee Irwin III
2004-03-20 23:54                 ` Russell King
2004-03-21  0:22                   ` Zwane Mwaikambo
2004-03-22  4:46                     ` Benjamin Herrenschmidt
2004-03-22 18:23                       ` Richard Curnow
2004-03-21  0:23                   ` William Lee Irwin III
2004-03-21  9:52                   ` Arjan van de Ven
2004-03-21 10:39                   ` Jaroslav Kysela
2004-03-22  4:43             ` Benjamin Herrenschmidt
2004-03-20 20:13     ` Andrew Morton
2004-03-20 20:28       ` Andrea Arcangeli
2004-03-20 20:50       ` William Lee Irwin III
2004-03-20 22:26         ` Russell King
2004-03-20 22:45           ` William Lee Irwin III
2004-03-21 20:45             ` David Woodhouse
2004-03-21 20:49               ` Christoph Hellwig
2004-03-21 20:57                 ` David Woodhouse
2004-03-21 21:53                   ` Linus Torvalds
2004-03-21 22:17                     ` Jeff Garzik
2004-03-21 22:23                     ` David Woodhouse
2004-03-21 22:23                     ` Russell King
2004-03-21 22:34                       ` Jeff Garzik
2004-03-21 22:42                         ` David Woodhouse
2004-03-21 23:06                           ` Jeff Garzik
2004-03-21 22:51                         ` Russell King
2004-03-21 23:09                           ` Jeff Garzik
2004-03-21 23:11                           ` Linus Torvalds
2004-03-21 23:22                             ` Jeff Garzik
2004-03-21 23:51                               ` Linus Torvalds
2004-03-21 23:58                                 ` Russell King
2004-03-22  0:34                                   ` Andrea Arcangeli
2004-03-22  3:05                                     ` Linus Torvalds
2004-03-23 17:59                                   ` Andy Whitcroft
2004-03-23 17:58                                     ` David Woodhouse
2004-03-23 18:11                                     ` William Lee Irwin III
2004-03-22  0:02                                 ` David Woodhouse
2004-03-22  3:28                                   ` Linus Torvalds
2004-03-22  0:10                                 ` Jeff Garzik
2004-03-22  0:20                                   ` Russell King
2004-03-22  0:33                                     ` Jeff Garzik
2004-03-22  4:57                                     ` Benjamin Herrenschmidt
2004-03-21 23:45                             ` Russell King
2004-03-22  0:23                               ` William Lee Irwin III
2004-03-22  0:29                                 ` Jeff Garzik
2004-03-22  1:28                                   ` William Lee Irwin III
2004-03-22  6:36           ` William Lee Irwin III
2004-03-20 17:39 ` Linus Torvalds
2004-03-20 17:56   ` Andrea Arcangeli
2004-03-20 18:22   ` William Lee Irwin III
2004-03-21  3:13   ` Chris Wedgwood
2004-03-21  6:23     ` Christoph Hellwig
2004-03-21  7:00       ` Chris Wedgwood

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox