* BUG root-caused: careless processing of pagevec causes "Bad page states"
@ 2013-02-20 14:16 Valery Podrezov
From: Valery Podrezov @ 2013-02-20 14:16 UTC (permalink / raw)
To: linux-mm; +Cc: Valery Podrezov
SUMMARY: careless processing of pagevec causes "Bad page states"
I am seeing "BUG: Bad page state in process.." messages in SMP mode with two
cpus (kernel 3.3).
I have root-caused the problem; see the description below.
I have prepared a temporary workaround; it eliminates the problem and
additionally demonstrates its essence.
The following sections are provided below:
DESCRIPTION
ENVIRONMENT
OOPS-messages
WORKAROUND
Is this a known issue, and is there already a patch that properly fixes it?
Feel free to ask me any questions.
Best Regards,
Valery Podrezov
DESCRIPTION:
Here is how the problem arises
(PFN0 refers to the problematic physical page;
(1) and (2) are successive points of execution):
1. cpu 0: ...
   cpu 1: is running the user process (PROC0)
          Gets a new page (PFN0) from the free list via alloc_page_vma()
          Runs page_add_new_anon_rmap(), so page PFN0 lands in this cpu's
          pagevec (as the 5-th entry): pvec = &get_cpu_var(lru_add_pvecs)[lru];
          Runs fork() (PROC1 is the resulting child process)
          Page PFN0 is now present in the page tables of the child process
          PROC1 (read-only, to be COWed)
2. cpu 0: is running PROC1
          Writes to a virtual address (VA1) that its page tables translate
          to PFN0
          do_page_fault (data) on VA1 (the physical page is present in the
          process's page tables, but without write permission)

   cpu 1: is running PROC1
          do_page_fault (data) on some other virtual address (no page in
          the page tables)
          Gets a new page from the free list via alloc_page_vma()
          Runs page_add_new_anon_rmap(), then __lru_cache_add()
          This new page is the 14-th in this cpu's pagevec, so it runs
          __pagevec_lru_add(), then pagevec_lru_move_fn() and, finally,
          __pagevec_lru_add_fn()
At this point no lock is held in common by both processes;
the locks held are:
   core 0: PROC0->mm->mmap_sem
           PFN0->flags PG_locked (lock_page)
   core 1: PROC1->mm->mmap_sem (!= PROC0->mm->mmap_sem)
           PFN0->zone->lru_lock
The more detailed timeline of point (2) below, for both cpus,
shows how the PG_locked bit is mistakenly set on PFN0.
Both cpus are processing do_page_fault() (see above);
both cpus are in the same routine, do_wp_page():
a) cpu 0: locks the page via trylock_page(old_page) (this is the page
with PFN0)
b) cpu 1: is processing __pagevec_lru_add_fn()
Reads page->flags of the 5-th element of its pagevec (this is the
PFN0 page; its flags contain PG_locked set to 1, see (a))
c) cpu 0: unlocks the page via unlock_page(old_page) (clears the
PG_locked bit of the PFN0 page)
d) cpu 1: executes SetPageLRU(page) in __pagevec_lru_add_fn() and thus
sets not only the PG_lru bit of the PFN0 page but, mistakenly, the
PG_locked bit too

This leads to "BUG: Bad page state" later, when the PFN0 page is released,
because the PG_locked bit is still present in its flags.
ENVIRONMENT:
Linux kernel-3.3
OOPS-messages:
BUG: Bad page state in process runt_cj.sh pfn:7fcd9
page:c05f9b20 count:0 mapcount:0 mapping: (null) index:0xbfffd
page flags: 0x80080009(locked|uptodate|swapbacked)
Modules linked in:
Call Trace:
[<00000000c1098d78>] dump_page+0x10c/0x120
[<00000000c1098f50>] bad_page+0x1c4/0x1f4
[<00000000c1099060>] free_pages_prepare+0xe0/0x10c
[<00000000c109afd0>] free_hot_cold_page+0x38/0x2c8
[<00000000c109b538>] free_hot_cold_page_list+0x38/0x64
[<00000000c10a12f8>] release_pages+0x1e0/0x2cc
[<00000000c10cdffc>] free_pages_and_swap_cache+0xa4/0x154
[<00000000c10b49a0>] tlb_flush_mmu+0x98/0xcc
[<00000000c10b49e4>] tlb_finish_mmu+0x10/0x54
[<00000000c10c08a0>] exit_mmap+0x11c/0x168
[<00000000c101988c>] mmput+0x5c/0x164
[<00000000c10e85c0>] flush_old_exec+0x7d4/0xacc
[<00000000c114ac24>] load_elf_binary+0x534/0x2514
[<00000000c11c7158>] __up_read+0x20/0x108
[<00000000c11cde48>] __va_probe_existent_region+0x164/0x190
[<00000000c11ce098>] generic_copy_from_user+0xb4/0xd0
[<00000000c10e7c10>] copy_strings+0x4d8/0x66c
[<00000000c10e68ec>] search_binary_handler+0x110/0x488
[<00000000c10e97f0>] do_execve+0x584/0x6a8
[<00000000c10017c4>] sys_execve+0x38/0x104
[<00000000c1013aec>] stub_execve+0x14/0x18
[<00000000c100f1b4>] go_scall+0x30/0x38
Disabling lock debugging due to kernel taint
WORKAROUND:
I do not consider this a candidate patch, if only because it does not
properly handle the "WARNING, pagevec_add: no space in pvec" condition;
it can also impact performance, etc.
It requires further investigation.
Nevertheless, it helped me temporarily avoid the problem.
The per-file changes are below.
linux-3.3/include/linux/pagevec.h:

/* 14 pointers + two long's align the pagevec structure to a power of two */
/* #define PAGEVEC_SIZE	14 */
#define PAGEVEC_SIZE	(14 + 5*16)

static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
{
	if (pvec->nr >= PAGEVEC_SIZE) {
		early_printk("WARNING, pagevec_add: no space in pvec 0x%lx, the page=0x%lx\n",
			     (unsigned long)pvec, (unsigned long)page);
		return 0;
	}
	pvec->pages[pvec->nr++] = page;
	return pagevec_space(pvec);
}
linux-3.3/mm/swap.c:

static void pagevec_lru_move_fn(struct pagevec *pvec,
				int (*move_fn)(struct page *page, void *arg),
				void *arg)
{
	int i;
	struct zone *zone = NULL;
	unsigned long flags = 0;
	int processed;
	struct page *page;
	int slots_available = -1;
	int not_processed_index = 0;
	struct page *not_processed_pages[PAGEVEC_SIZE];
	int processed_index = 0;
	struct page *processed_pages[PAGEVEC_SIZE];

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];
		struct zone *pagezone = page_zone(page);

		if (pagezone != zone) {
			if (zone)
				spin_unlock_irqrestore(&zone->lru_lock, flags);
			zone = pagezone;
			spin_lock_irqsave(&zone->lru_lock, flags);
		}
		/* was: (*move_fn)(page, arg); */
		if (trylock_page(page)) {
			(*move_fn)(page, arg);
			unlock_page(page);
			processed = 1;
		} else {
			processed = 0;
		}
		if (processed)
			processed_pages[processed_index++] = page;
		else
			not_processed_pages[not_processed_index++] = page;
	}
	if (zone)
		spin_unlock_irqrestore(&zone->lru_lock, flags);

	/* was: release_pages(pvec->pages, pvec->nr, pvec->cold); */
	if (processed_index)
		release_pages(processed_pages, processed_index, pvec->cold);

	pagevec_reinit(pvec);

	if (not_processed_index) {
		for (i = 0; i < not_processed_index; i++) {
			page = not_processed_pages[i];
			slots_available = pagevec_add(pvec, page);
		}
	}
}
----<end>
* Re: BUG root-caused: careless processing of pagevec causes "Bad page states"
From: Andrew Morton @ 2013-03-27 21:20 UTC (permalink / raw)
To: Valery Podrezov; +Cc: linux-mm
On Wed, 20 Feb 2013 17:16:43 +0300 Valery Podrezov <pvadop@gmail.com> wrote:
> [...]
> a) cpu 0: locks the page via trylock_page(old_page) (this is the page
> with PFN0)
> b) cpu 1: is processing __pagevec_lru_add_fn()
> Reads page->flags of the 5-th element of its pagevec (this is the
> PFN0 page; its flags contain PG_locked set to 1, see (a))
> c) cpu 0: unlocks the page via unlock_page(old_page) (clears the
> PG_locked bit of the PFN0 page)
> d) cpu 1: executes SetPageLRU(page) in __pagevec_lru_add_fn() and thus
> sets not only the PG_lru bit of the PFN0 page but, mistakenly, the
> PG_locked bit too
Here is where I got lost.
: static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
: void *arg)
: {
: enum lru_list lru = (enum lru_list)arg;
: int file = is_file_lru(lru);
: int active = is_active_lru(lru);
:
: VM_BUG_ON(PageActive(page));
: VM_BUG_ON(PageUnevictable(page));
: VM_BUG_ON(PageLRU(page));
:
: SetPageLRU(page);
: if (active)
: SetPageActive(page);
: add_page_to_lru_list(page, lruvec, lru);
: update_page_reclaim_stat(lruvec, file, active);
: }
__pagevec_lru_add_fn() uses atomic bit operations on page->flags,
so how could it unintentionally retain the old PG_locked state?