SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic
@ 2007-05-08 18:08 Qi, Yanling
  2007-05-08 19:30 ` Mike Christie
  2007-05-09 23:38 ` Herbert Xu
  0 siblings, 2 replies; 6+ messages in thread
From: Qi, Yanling @ 2007-05-08 18:08 UTC (permalink / raw)
  To: netdev, linux-scsi, open-iscsi, linux-iscsi-devel
  Cc: Qi, Yanling, Mike Christie, dougg, James Bottomley

Hi All,

This panic is related to the interactions between scsi/sg.c, iscsi
initiator and tcp on the RHEL 2.6.9-42 kernel. But we may also have the
similar problem with open-iscsi initiator. I will explain why we see the
Bad page panic first. I did a patch to the sg driver to workaround the
problem and seek for ideas where we should fix the problem.

When sg driver accepts a sg_io request from user space, it invokes
kernel API __get_free_pages() to allocate multiple pages for holding
user space data IO request. The allocated pages will consist of one base
page and a number of sub pages (total 8 pages for a big request). The
pages have the following attributes after they are allocated by the sg
driver. 
	0 page:000001007fb89ac0 flags:0x01000000
mapping:0000000000000000 mapcount:0 count:1
	1 page:000001007fb89af8 flags:0x01000004
mapping:0000000000000000 mapcount:0 count:0
	2 page:000001007fb89b30 flags:0x01000004
mapping:0000000000000000 mapcount:0 count:0

Please note that only the base page has count=1 and all subpages have
count=0. 

After the request reaches iscsi-sfnet initiator driver, the iscsi-sfnet
driver will send a buffer with multiple pages one by one through network
interface API.
 
 rc = sock->ops->sendpage(sock, pg, pg_offset, len, flags);

At the network layer (linux/net/ipv4/tcp.c), the sendpage() operation
will perform get_page() first and then put_page() later. The get_page()
will increase the page's count by 1. The put_page() will perform the
following (linux/mm/swap.c)

void put_page(struct page *page)
{
        if (unlikely(PageCompound(page))) {
                page = (struct page *)page->private;
                if (put_page_testzero(page)) {
                        void (*dtor)(struct page *page);
  
                        dtor = (void (*)(struct page *))page[1].mapping;
                        (*dtor)(page);
                }
                return;
        }
        if (!PageReserved(page) && put_page_testzero(page))
               __page_cache_release(page);
} 

Please note that if the count is 0, the page will be released and
recycled to the free-page pool. 

At the time when sg driver is ready to free its allocated pages by
invoking free_pages(), the sub-pages is already re-used by someone else.
We will get "Bad page kernel expeption" such as the following 

Bad page state at __free_pages_ok (in process 'java', page
000001007fb89b30)
flags:0x0100103c mapping:0000010075a4eaf0 mapcount:0 count:2
Backtrace:
Call Trace:<ffffffff8015d37f>{bad_page+112}
<ffffffff8015d713>{__free_pages_ok+154} 
      <ffffffffa01d9fa5>{:sg:sg_remove_scat+276} <ffffffffa01da13e>
{:sg:sg_finish_rem_req+238} 
      <ffffffffa01da56a>{:sg:sg_new_read+1050}
<ffffffffa01dcb48>{:sg:sg_ioctl+929} 
      <ffffffff8030a0f5>{thread_return+0}
<ffffffff801d42e6>{selinux_file_ioctl+711} 
      <ffffffff8030ab88>{schedule_timeout+224}
<ffffffff8016bfb6>{find_extend_vma+22} 
      <ffffffff8014c6b0>{unqueue_me+138}
<ffffffff8014c8ce>{do_futex+441} 
      <ffffffff80135752>{autoremove_wake_function+0}
<ffffffff80135752>{autoremove_wake_function+0} 
      <ffffffff8018ae05>{sys_ioctl+853}
<ffffffff8012a122>{sg_ioctl_trans+832} 
      <ffffffff8019e8ac>{compat_sys_ioctl+235}
<ffffffff80125bbb>{sysenter_do_call+27}
In the above oops, the page with page address 000001007fb89b30 has been
reused with active count 2 and memory mapped. Because the sg driver
tries to free a page that is mapped and active, we got the above bad
page panic.

I did the following patch to the sg.c. The sg driver will set
PG_reserved for all sub-pages at sg_page_malloc() time and clear the
bit/count at sg_page_free() time. I tested it and it worked great. Do
you see any side impacts with this patch? Is this a right place to fix
the panic? We may have similar problem for st driver.

--- linux-2.6.9/drivers/scsi/sg.c       2007-05-07 22:14:33.000000000
-0500
+++ /home/yqi/working_sg_iscsi_sfnet/sg.c       2007-05-07
22:45:26.000000000 -0500
@@ -2551,8 +2551,9 @@ sg_page_malloc(int rqSz, int lowDma, int
 {
        char *resp = NULL;
        int page_mask;
-       int order, a_size;
+       int order, a_size, m;
        int resSz = rqSz;
+       struct page *tmppage;

        if (rqSz <= 0)
                return resp;
@@ -2571,6 +2572,13 @@ sg_page_malloc(int rqSz, int lowDma, int
                resp = (char *) __get_free_pages(page_mask, order);
/* try half */
                resSz = a_size;
        }
+       tmppage = virt_to_page(resp);
+       for( m = PAGE_SIZE; m < resSz; m += PAGE_SIZE )
+       {
+               tmppage++;
+               SetPageReserved(tmppage);
+       }
+
        if (resp) {
                if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
                        memset(resp, 0, resSz);
@@ -2583,12 +2591,20 @@ sg_page_malloc(int rqSz, int lowDma, int
 static void
 sg_page_free(char *buff, int size)
 {
-       int order, a_size;
+       int order, a_size, m;
+       struct page * tmppage;
+       tmppage = virt_to_page(buff);

        if (!buff)
                return;
        for (order = 0, a_size = PAGE_SIZE; a_size < size;
             order++, a_size <<= 1) ;
+       for( m = PAGE_SIZE; m < size; m += PAGE_SIZE )
+       {
+               tmppage++;
+               set_page_count(tmppage,0);
+               ClearPageReserved(tmppage);
+       }
        free_pages((unsigned long) buff, order);
 }

Thanks,

Yanling

Yanling Qi
Engenio Storage Group - LSI Logic
512-794-3713 (Office)
512-794-3702 (Fax)
yanling.qi@lsi.com
 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic
  2007-05-08 18:08 SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic Qi, Yanling
@ 2007-05-08 19:30 ` Mike Christie
  2007-05-09 16:13   ` Qi, Yanling
  2007-05-09 23:38 ` Herbert Xu
  1 sibling, 1 reply; 6+ messages in thread
From: Mike Christie @ 2007-05-08 19:30 UTC (permalink / raw)
  To: open-iscsi
  Cc: netdev, linux-scsi, linux-iscsi-devel, Qi, Yanling, dougg,
	James Bottomley

Qi, Yanling wrote:
> Hi All,
> 
> This panic is related to the interactions between scsi/sg.c, iscsi
> initiator and tcp on the RHEL 2.6.9-42 kernel. But we may also have the
> similar problem with open-iscsi initiator. I will explain why we see the

Yeah, this problem should occur in the upstream open-iscsi iscsi code.
open-iscsi works very similar to linux-scsi where it just sends pages
around with sock->ops-sendpage, and it looks like sg uses
__get_free_pages in RHEL's kernel and upstream it uses alloc_pages so
unless there was a change in those functions or the network layer then
we should have a similar problem.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic
  2007-05-08 19:30 ` Mike Christie
@ 2007-05-09 16:13   ` Qi, Yanling
  0 siblings, 0 replies; 6+ messages in thread
From: Qi, Yanling @ 2007-05-09 16:13 UTC (permalink / raw)
  To: Mike Christie, open-iscsi
  Cc: netdev, linux-scsi, linux-iscsi-devel, dougg, James Bottomley


> -----Original Message-----
> From: Mike Christie [mailto:michaelc@cs.wisc.edu]
> Qi, Yanling wrote:
> Yeah, this problem should occur in the upstream open-iscsi iscsi code.
> open-iscsi works very similar to linux-scsi where it just sends pages
> around with sock->ops-sendpage, and it looks like sg uses
> __get_free_pages in RHEL's kernel and upstream it uses alloc_pages so
> unless there was a change in those functions or the network layer then
> we should have a similar problem.
[Qi, Yanling] 
Mike,

I tried the same test on a SLES10SP1 with open-iscsi driver (lk
2.6.16.37-0.23). It works fine.
What happens is that both "alloc_pages()" and "__get_free_pages()" will
set page_count to 1 for base page and sub-pages. Because page_count =1,
the subpages will not be recycled.

It seems the mm code has changed alloc_pages and __get_free_pages()'s
behavior along the way from 2.6.9 to 2.6.16.

Therefore, we don't have an issue in the upstream kernel and RHEL5.

0 page:ffff81007f8da240 flags:0x0100000000004000
mapping:0000000000000000 mapcount:0 count:1
1 page:ffff81007f8da278 flags:0x0100000000004000
mapping:0000000000000000 mapcount:0 count:1
2 page:ffff81007f8da2b0 flags:0x0100000000004000
mapping:0000000000000000 mapcount:0 count:1
3 page:ffff81007f8da2e8 flags:0x0100000000004000
mapping:0000000000000000 mapcount:0 count:1
4 page:ffff81007f8da320 flags:0x0100000000004000
mapping:0000000000000000 mapcount:0 count:1
5 page:ffff81007f8da358 flags:0x0100000000004000
mapping:0000000000000000 mapcount:0 count:1
6 page:ffff81007f8da390 flags:0x0100000000004000
mapping:0000000000000000 mapcount:0 count:1
7 page:ffff81007f8da3c8 flags:0x0100000000004000
mapping:0000000000000000 mapcount:0 count:1

Thanks,
Yanling

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic
  2007-05-08 18:08 SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic Qi, Yanling
  2007-05-08 19:30 ` Mike Christie
@ 2007-05-09 23:38 ` Herbert Xu
  2007-05-10 18:31   ` Qi, Yanling
  1 sibling, 1 reply; 6+ messages in thread
From: Herbert Xu @ 2007-05-09 23:38 UTC (permalink / raw)
  To: Qi, Yanling
  Cc: netdev, linux-scsi, open-iscsi, linux-iscsi-devel, Yanling.Qi,
	michaelc, dougg, James.Bottomley

Qi, Yanling <Yanling.Qi@lsi.com> wrote:
> @@ -2571,6 +2572,13 @@ sg_page_malloc(int rqSz, int lowDma, int
>                resp = (char *) __get_free_pages(page_mask, order);
> /* try half */
>                resSz = a_size;
>        }
> +       tmppage = virt_to_page(resp);
> +       for( m = PAGE_SIZE; m < resSz; m += PAGE_SIZE )
> +       {
> +               tmppage++;
> +               SetPageReserved(tmppage);
> +       }
> +

Why not just increase the page use count?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic
  2007-05-09 23:38 ` Herbert Xu
@ 2007-05-10 18:31   ` Qi, Yanling
  2007-06-08  6:22     ` Herbert Xu
  0 siblings, 1 reply; 6+ messages in thread
From: Qi, Yanling @ 2007-05-10 18:31 UTC (permalink / raw)
  To: open-iscsi
  Cc: netdev, linux-scsi, linux-iscsi-devel, michaelc, dougg,
	James.Bottomley



> -----Original Message-----
> From: open-iscsi@googlegroups.com [mailto:open-iscsi@googlegroups.com]
On
> Behalf Of Herbert Xu
> Sent: Wednesday, May 09, 2007 6:39 PM
> To: Qi, Yanling
> Cc: netdev@vger.kernel.org; linux-scsi@vger.kernel.org; open-
> iscsi@googlegroups.com; linux-iscsi-devel@lists.sourceforge.net; Qi,
> Yanling; michaelc@cs.wisc.edu; dougg@torque.net;
> James.Bottomley@steeleye.com
> Subject: Re: SG_IO with >4k buffer size to iscsi sg device causes "Bad
> page" panic
> 
> 
> Qi, Yanling <Yanling.Qi@lsi.com> wrote:
> > @@ -2571,6 +2572,13 @@ sg_page_malloc(int rqSz, int lowDma, int
> >                resp = (char *) __get_free_pages(page_mask, order);
> > /* try half */
> >                resSz = a_size;
> >        }
> > +       tmppage = virt_to_page(resp);
> > +       for( m = PAGE_SIZE; m < resSz; m += PAGE_SIZE )
> > +       {
> > +               tmppage++;
> > +               SetPageReserved(tmppage);
> > +       }
> > +
> 
[Qi, Yanling]
If I do a get_page() at sg_page_malloc() time and then do a put_page()
at sg_page_free() time, I worry about a race condition that the page
gets re-used before calling free_pages().

Thanks,
Yanling

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic
  2007-05-10 18:31   ` Qi, Yanling
@ 2007-06-08  6:22     ` Herbert Xu
  0 siblings, 0 replies; 6+ messages in thread
From: Herbert Xu @ 2007-06-08  6:22 UTC (permalink / raw)
  To: Qi, Yanling
  Cc: open-iscsi, netdev, linux-scsi, linux-iscsi-devel, michaelc,
	dougg, James.Bottomley

Please don't drop CCs.

Qi, Yanling <Yanling.Qi@lsi.com> wrote:
>
>> Qi, Yanling <Yanling.Qi@lsi.com> wrote:
>> > @@ -2571,6 +2572,13 @@ sg_page_malloc(int rqSz, int lowDma, int
>> >                resp = (char *) __get_free_pages(page_mask, order);
>> > /* try half */
>> >                resSz = a_size;
>> >        }
>> > +       tmppage = virt_to_page(resp);
>> > +       for( m = PAGE_SIZE; m < resSz; m += PAGE_SIZE )
>> > +       {
>> > +               tmppage++;
>> > +               SetPageReserved(tmppage);
>> > +       }
>> > +
>> 
> [Qi, Yanling]
> If I do a get_page() at sg_page_malloc() time and then do a put_page()
> at sg_page_free() time, I worry about a race condition that the page
> gets re-used before calling free_pages().

Could you explain what is going to cause this page to be reused if it
has a non-zero reference count?

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2007-06-08  6:22 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-08 18:08 SG_IO with >4k buffer size to iscsi sg device causes "Bad page" panic Qi, Yanling
2007-05-08 19:30 ` Mike Christie
2007-05-09 16:13   ` Qi, Yanling
2007-05-09 23:38 ` Herbert Xu
2007-05-10 18:31   ` Qi, Yanling
2007-06-08  6:22     ` Herbert Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).