public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Memory leak in 2.4.27 kernel, using mmap raw packet sockets
@ 2004-10-14 14:50 bgagnon
  2004-10-15 18:23 ` Marcelo Tosatti
  0 siblings, 1 reply; 19+ messages in thread
From: bgagnon @ 2004-10-14 14:50 UTC (permalink / raw)
  To: linux-kernel

(Please cc me directly, since I am not a subscriber to this list)

Hi,

I discovered a memory leak in the mmap raw packet socket implementation, 
specifically when the client application core dumps (elf format). More 
specifically, the entire ring buffer memory leaks under such 
circumstances. I have reproduced the problem with our custom app, which 
links with Phil Wood's mmap version of libpcap (
http://plublic.lanl.gov/cpw) as well as with tcpdump binding against the 
same library. I have asserted the problem presence in kernels 2.4.26 and 
2.4.27. It has been resolved in kernel 2.6.8.

In a nutshell, the reference count of the ring buffer page frame is not 
handled properly by the core dump function elf_core_dump() in 
fs/binfmt_elf.c. After exiting this function, all ring buffer page frames 
reference counts have been increased by one, which impedes their release 
when the socket is closed. 

The reference count mishap is related to the fact that the ring buffer 
page frames have their PG_reserved bit set when they are created by the 
raw packet socket. According to the comments in mm.h, this is to ensure 
the pages don't get swapped out to disk. elf_core_dump() calls 
get_user_pages(), which increments the reference count, irrespective of 
the PG_reserved bit, while the subsequent put_page() call only decrements 
it if the PG_reserved bit is clear.

According to a similar kernel mailing list thread (
http://www.ussg.iu.edu/hypermail/linux/kernel/0311.3/0495.html), the page 
reference count field is not significant if the PG_reserved bit is set. 
Applied to our particular case, the raw packet socket implementation 
should unconditionally reset the reference count when the socket is 
closed, so that the memory gets released. However, I fear this would cause 
user space applications to crash: according to the mmap man page, the 
memory mapping should survive the socket closure:

"The  munmap  system call deletes the mappings for the specified address
range, and causes further references to addresses within the  range  to
generate  invalid  memory references.  The region is also automatically
unmapped when the process is terminated.  On the  other  hand,  closing
the file descriptor does not unmap the region."

So, on 2.4 kernels, we are facing the following situation:

1- The reference count information IS important to control the ring buffer 
lifetime
2- The PG_reserved bit causes the reference count not to be handled 
properly by different kernel subsystems, the core dump functions being one 
of them

>From what I could understand, 2.6 developer fixed the get_user_pages() 
code so it does not increment to reference count if the page has the 
PG_reserved bit set. I don't know if this is readily applicable to 2.4 
kernels, since it may break some other subsystem that rely on its current 
behavior.

So, unless a better solution is found, I would propose a localized fix for 
the leak, in function elf_core_dump(), starting at line 1275  (comments 
are my own) :

if (get_user_pages(current, current->mm, addr, 1, 0, 1,   /*BG: this 
ALWAYS will increment the ref count*/
                   &page, &vma) <= 0) {
       DUMP_SEEK (file->f_pos + PAGE_SIZE);
} else {
       if (page == ZERO_PAGE(addr)) {
              DUMP_SEEK (file->f_pos + PAGE_SIZE);
    } else {
            void *kaddr;
            flush_cache_page(vma, addr);
            kaddr = kmap(page);
            DUMP_WRITE(kaddr, PAGE_SIZE);
            flush_page_to_ram(page);
            kunmap(page);
    }
    /*BG: Start of fix*/ 
    if (PageReserved(page))
            /* BG: just undo the ref count increase done in 
get_user_pages*/
            put_page_testzero(page);
    else
            put_page(page);
    /*BG: end of fix */
}



Thanks,

Bernard

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets
@ 2004-10-21 13:39 O.Sezer
  2004-10-21 14:26 ` Andrea Arcangeli
  0 siblings, 1 reply; 19+ messages in thread
From: O.Sezer @ 2004-10-21 13:39 UTC (permalink / raw)
  To: andrea; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4304 bytes --]

Andrea Arcangeli  wrote:
>> > > That isnt sufficient. Consider anything else taking a reference to the
>> > > page and the refcount going negative. 
>> > 
>> > You mean not going negative? The problem here as I understand here is 
>> > we dont release the count if the PageReserved is set, but we should.
>> 
>> Drivers like the OSS audio drivers move page between Reserved and 
>> unreserved. The count can thus be corrupted.
> 
> the PG_reserved goes away after VM_IO, so forbidding pages with
> PG_reserved of vmas with VM_IO isn't any different as far as I can tell,
> and since PG_reserved is the real offender sure we shouldn't leave a
> check in get_user_pages that explicitly do something if the page is
> reserved, since if the page is reserved at that point we'd need to
> return -EFAULT or BUG_ON.
> 
> Adding the VM_IO patch on top of this is sure a good idea.
> 
> --- sles/mm/memory.c.~1~	2004-10-19 19:34:10.264335488 +0200
> +++ sles/mm/memory.c	2004-10-19 19:58:47.403776160 +0200
> @@ -806,7 +806,7 @@ int get_user_pages(struct task_struct *t
>  			}
>  			if (pages) {
>  				pages[i] = get_page_map(map);
> -				if (!pages[i]) {
> +				if (!pages[i] || PageReserved(pages[i])) {
>  					spin_unlock(&mm->page_table_lock);
>  					while (i--)
>  						page_cache_release(pages[i]);
> @@ -814,8 +814,7 @@ int get_user_pages(struct task_struct *t
>  					goto out;
>  				}
>  				flush_dcache_page(pages[i]);
> -				if (!PageReserved(pages[i]))
> -					page_cache_get(pages[i]);
> +				page_cache_get(pages[i]);
>  			}
>  			if (vmas)
>  				vmas[i] = vma;
> 
> My version of the fix for 2.4 is this, but this fixes as well an issue
> with the zeropage and it's on top of some other fix for COW corruption
> in 2.4 not yet fixed in mainline 2.4. Since 2.4 never checked
> PageReserved like 2.6 does in get_user_pages, 2.4 as worse can suffer a
> memleak.
> 
> --- sles/include/linux/mm.h.~1~	2004-10-18 10:20:53.391823696 +0200
> +++ sles/include/linux/mm.h	2004-10-18 10:47:10.861011928 +0200
> @@ -533,9 +533,8 @@ extern void unpin_pte_page(struct page *
>  
>  static inline void put_user_page_pte_pin(struct page * page)
>  {
> -	if (PagePinned(page))
> -		/* must run before put_page, put_page may free the page */
> -		unpin_pte_page(page);
> +	/* must run before put_page, put_page may free the page */
> +	unpin_pte_page(page);
>  
>  	put_page(page);
>  }
> --- sles/mm/memory.c.~1~	2004-10-18 10:20:54.947587184 +0200
> +++ sles/mm/memory.c	2004-10-18 10:47:49.822088944 +0200
> @@ -530,7 +530,11 @@ void __wait_on_pte_pinned_page(struct pa
>  
>  void unpin_pte_page(struct page *page)
>  {
> -	wait_queue_head_t *waitqueue = page_waitqueue(page);
> +	wait_queue_head_t *waitqueue;
> +
> +	if (!PagePinned(page))
> +		return;
> +	waitqueue = page_waitqueue(page);
>  	if (unlikely(!TestClearPagePinned(page)))
>  		BUG();
>  	smp_mb__after_clear_bit(); 
> @@ -598,17 +602,21 @@ int __get_user_pages(struct task_struct 
>  				 */
>  				if (!map)
>  					goto bad_page;
> -				page_cache_get(map);
> -				if (pte_pin && unlikely(TestSetPagePinned(map))) {
> -					/* fail if this is a duplicate physical page in this kiovec */
> -					int i2 = i;
> -					while (i2--)
> -						if (map == pages[i2]) {
> -							put_page(map);
> -							goto bad_page;
> -						}
> -					/* hold a reference on "map" so we can wait on it */
> -					goto pte_pin_collision;
> +				if (map != ZERO_PAGE(start)) {
> +					if (PageReserved(map))
> +						goto bad_page;
> +					page_cache_get(map);
> +					if (pte_pin && unlikely(TestSetPagePinned(map))) {
> +						/* fail if this is a duplicate physical page in this kiovec */
> +						int i2 = i;
> +						while (i2--)
> +							if (map == pages[i2]) {
> +								put_page(map);
> +								goto bad_page;
> +							}
> +						/* hold a reference on "map" so we can wait on it */
> +						goto pte_pin_collision;
> +					}
>  				}
>  				pages[i] = map;
>  			}

I can't find to which suse kernel these patch(es) apply. I assume
your first one comes down to the attached one-liner for vanilla-2.4,
can you confirm?
For your second: I think it needs your 9999_z-get_user_pages_pte_pin-1
patch applied beforehand?. Without that patch, are there any problems
to be fixed? Can you post patches for vanilla kernels, please?

Regards,
Ozkan Sezer


[-- Attachment #2: 2.4_memory.c-PageReserved.diff --]
[-- Type: text/plain, Size: 360 bytes --]

--- 2.4/mm/memory.c.BAK	2004-10-20 11:49:35.000000000 +0300
+++ 2.4/mm/memory.c	2004-10-21 10:43:01.000000000 +0300
@@ -499,7 +499,7 @@
 				/* FIXME: call the correct function,
 				 * depending on the type of the found page
 				 */
-				if (!pages[i])
+				if (!pages[i] || PageReserved(pages[i]))
 					goto bad_page;
 				page_cache_get(pages[i]);
 			}


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2004-11-30  6:21 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-10-14 14:50 Memory leak in 2.4.27 kernel, using mmap raw packet sockets bgagnon
2004-10-15 18:23 ` Marcelo Tosatti
2004-10-17  2:39   ` Alan Cox
2004-10-19 14:35     ` Marcelo Tosatti
2004-10-20 18:43       ` Alan Cox
2004-10-20 23:24         ` Andrea Arcangeli
2004-10-23 14:17           ` Marcelo Tosatti
2004-11-25 15:02     ` Marcelo Tosatti
2004-11-25 20:32       ` Andrea Arcangeli
2004-11-25 17:12         ` Marcelo Tosatti
2004-11-25 23:13           ` Andrea Arcangeli
2004-11-25 19:45             ` Marcelo Tosatti
2004-11-26  1:04               ` Andrea Arcangeli
2004-11-30  4:03                 ` David S. Miller
2004-11-30  4:16                   ` Andrea Arcangeli
2004-11-30  6:11                     ` David S. Miller
2004-11-30  6:19                     ` David S. Miller
  -- strict thread matches above, loose matches on Subject: below --
2004-10-21 13:39 O.Sezer
2004-10-21 14:26 ` Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox