2.6.9-rc2 bio sickness with large writes

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* 2.6.9-rc2 bio sickness with large writes
@ 2004-09-15 23:39 Jeff V. Merkey
  2004-09-16  6:34 ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff V. Merkey @ 2004-09-15 23:39 UTC (permalink / raw)
  To: linux-kernel, jmerkey, jmerkey

[-- Attachment #1: Type: text/plain, Size: 1131 bytes --]


I am posting bio chains each with a total size of 64KB (16 pages each) 
and of each of
these I am posting chains of these bio requests in 128MB contiguous sectors
(i.e.  bio = 16 pages = 64KB + bio =16 pages = 64KB + bio = 16 pages = 
64KB, etc.)

With the subsequent bio requests posted after the first bio is posted 
the bio requests
merge together, but I never get all the callbacks from the coaslesced 
bio requests which
are posted subsequent to the intial 64KB bio.  The data ends up making 
it onto the drives
and I am not seeing any data loss, the 2nd,3rd, .... etc. bios don't 
seem to callback
correctly.

This bug does not manifest itself every time on every bio chain.  It 
only shows up
part of the time.  about 1 in every 2-3 bio chains behave this way.  
This is severely
BUSTED and I am providing the calling code as an example of what may be 
happening.

Attached is the write case and the callback.

Any ideas?  Also, if I use bio's the way Linus does in his buffer cache 
code for submit_bh
ine onesy - twosy mode the interface works just fine.  It is severely 
broken with multiple pages, however.

Jeff







[-- Attachment #2: example.c --]
[-- Type: text/x-csrc, Size: 3008 bytes --]


static int end_bio_asynch(struct bio *bio, unsigned int bytes_done, int err)
{
    ASYNCH_IO *io = bio->bi_private;

    if (err)
    {
       P_Print("asynch bio error %d\n", (int)err);
       io->ccode = ASIO_IO_ERROR;
       datascout_put_bio(io->bio);
       io->bio = NULL; 
       return 0;
    }

    if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
       io->ccode = ASIO_IO_ERROR;

    atomic_inc((atomic_t *)&io->complete);
    if (io->complete == io->count)
    {
       datascout_put_bio(io->bio);
       io->bio = NULL; 
       insert_callback(io);

#if (PROFILE_AIO)
       profile_complete();
#endif
       return 0;
    }
    return 1;

}

ULONG aWriteDiskSectors(ULONG disk, ULONG StartingLBA, BYTE *Sector,
		       ULONG sectors, ULONG readAhead, ASYNCH_IO *io)
{
    register ULONG i, bytesWritten = 0;
    register ULONG bps, lba;
    register DSDISK *DSDisk;
    register ULONG rsize, blocks, blocksize, spb;

    DSDisk = SystemDisk[disk];
    bps = DSDisk->BytesPerSector;
    blocksize = DSDisk->DeviceBlockSize;

    if ((ULONG)(sectors * bps) % blocksize)
       return 0;

    rsize = sectors * bps;
    blocks = rsize / blocksize;
    if (!blocks)
       return 0;
    spb = blocksize / bps;
    
    lba = StartingLBA / spb;
    if (StartingLBA % spb)
    {
       P_Print("request not %d block aligned (%d) sectors-%d lba-%d (write)\n",
	       (int)blocksize, (int)(StartingLBA % spb), (int)sectors,
	       (int)StartingLBA);
       return 0;
    }

    io->bio = datascout_get_bio();
    if (!io->bio)
       return 0;

    if (io->bio->bi_max_vecs < blocks)
       return 0;

    io->ccode = 0;
    io->count = blocks;
    io->complete = 0;

    io->bio->bi_bdev = DSDisk->PhysicalDiskHandle;
    io->bio->bi_sector = StartingLBA;
    io->bio->bi_idx = 0;
    io->bio->bi_end_io = end_bio_asynch;
    io->bio->bi_private = io;
    io->bio->bi_vcnt = 0;
    io->bio->bi_phys_segments = 0;
    io->bio->bi_hw_segments = 0;
    io->bio->bi_size = 0;

    for (i=0; i < blocks; i++)
    {
#if LINUX_26_BIO_ADDPAGE
         register struct page *page = virt_to_page(&Sector[i * blocksize]);
         register ULONG offset = ((ULONG)(&Sector[i * blocksize])) % PAGE_SIZE;
         register ULONG bytes;

         bytes = bio_add_page(io->bio, page, 
                              PAGE_SIZE - (offset % PAGE_SIZE), 0);

         bytesWritten += bytes;
#else
         register struct page *page = virt_to_page(&Sector[i * blocksize]);
         register ULONG offset = ((ULONG)(&Sector[i * blocksize])) % PAGE_SIZE;

	 io->bio->bi_io_vec[i].bv_page = page;
	 io->bio->bi_io_vec[i].bv_len = blocksize;
	 io->bio->bi_io_vec[i].bv_offset = offset;
	 io->bio->bi_vcnt++;
	 io->bio->bi_phys_segments++;
	 io->bio->bi_hw_segments++;
	 io->bio->bi_size += blocksize;
         bytesWritten += blocksize;
#endif
    }

    // unplug the queue and drain the bathtub
    bio_get(io->bio);
    submit_bio(WRITE | (1 << BIO_RW_SYNC), io->bio);
    bio_put(io->bio);

    return (bytesWritten);
}


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.6.9-rc2 bio sickness with large writes
  2004-09-15 23:39 2.6.9-rc2 bio sickness with large writes Jeff V. Merkey
@ 2004-09-16  6:34 ` Jens Axboe
  2004-09-16 16:38   ` Jeff V. Merkey
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2004-09-16  6:34 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: linux-kernel, jmerkey

On Wed, Sep 15 2004, Jeff V. Merkey wrote:
> 
> I am posting bio chains each with a total size of 64KB (16 pages each) 
> and of each of
> these I am posting chains of these bio requests in 128MB contiguous sectors
> (i.e.  bio = 16 pages = 64KB + bio =16 pages = 64KB + bio = 16 pages = 
> 64KB, etc.)
> 
> With the subsequent bio requests posted after the first bio is posted 
> the bio requests
> merge together, but I never get all the callbacks from the coaslesced 
> bio requests which
> are posted subsequent to the intial 64KB bio.  The data ends up making 
> it onto the drives
> and I am not seeing any data loss, the 2nd,3rd, .... etc. bios don't 
> seem to callback
> correctly.
> 
> This bug does not manifest itself every time on every bio chain.  It 
> only shows up
> part of the time.  about 1 in every 2-3 bio chains behave this way.  
> This is severely
> BUSTED and I am providing the calling code as an example of what may be 
> happening.
> 
> Attached is the write case and the callback.
> 
> Any ideas?  Also, if I use bio's the way Linus does in his buffer cache 
> code for submit_bh
> ine onesy - twosy mode the interface works just fine.  It is severely 
> broken with multiple pages, however.
> 
> Jeff
> 
> 
> 
> 
> 
> 

> 
> static int end_bio_asynch(struct bio *bio, unsigned int bytes_done, int err)
> {
>     ASYNCH_IO *io = bio->bi_private;
> 

      if (bio->bi_size)
		return 1;

>     if (err)
>     {
>        P_Print("asynch bio error %d\n", (int)err);
>        io->ccode = ASIO_IO_ERROR;
>        datascout_put_bio(io->bio);
>        io->bio = NULL; 
>        return 0;
>     }
> 
>     if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
>        io->ccode = ASIO_IO_ERROR;
> 
>     atomic_inc((atomic_t *)&io->complete);
>     if (io->complete == io->count)
>     {
>        datascout_put_bio(io->bio);
>        io->bio = NULL; 
>        insert_callback(io);
> 
> #if (PROFILE_AIO)
>        profile_complete();
> #endif
>        return 0;
>     }
>     return 1;
> 
> }
> 
> ULONG aWriteDiskSectors(ULONG disk, ULONG StartingLBA, BYTE *Sector,
> 		       ULONG sectors, ULONG readAhead, ASYNCH_IO *io)
> {
>     register ULONG i, bytesWritten = 0;
>     register ULONG bps, lba;
>     register DSDISK *DSDisk;
>     register ULONG rsize, blocks, blocksize, spb;
> 
>     DSDisk = SystemDisk[disk];
>     bps = DSDisk->BytesPerSector;
>     blocksize = DSDisk->DeviceBlockSize;
> 
>     if ((ULONG)(sectors * bps) % blocksize)
>        return 0;
> 
>     rsize = sectors * bps;
>     blocks = rsize / blocksize;
>     if (!blocks)
>        return 0;
>     spb = blocksize / bps;
>     
>     lba = StartingLBA / spb;
>     if (StartingLBA % spb)
>     {
>        P_Print("request not %d block aligned (%d) sectors-%d lba-%d (write)\n",
> 	       (int)blocksize, (int)(StartingLBA % spb), (int)sectors,
> 	       (int)StartingLBA);
>        return 0;
>     }
> 
>     io->bio = datascout_get_bio();
>     if (!io->bio)
>        return 0;
> 
>     if (io->bio->bi_max_vecs < blocks)
>        return 0;
> 
>     io->ccode = 0;
>     io->count = blocks;
>     io->complete = 0;
> 
>     io->bio->bi_bdev = DSDisk->PhysicalDiskHandle;
>     io->bio->bi_sector = StartingLBA;
>     io->bio->bi_idx = 0;
>     io->bio->bi_end_io = end_bio_asynch;
>     io->bio->bi_private = io;
>     io->bio->bi_vcnt = 0;
>     io->bio->bi_phys_segments = 0;
>     io->bio->bi_hw_segments = 0;
>     io->bio->bi_size = 0;
> 
>     for (i=0; i < blocks; i++)
>     {
> #if LINUX_26_BIO_ADDPAGE
>          register struct page *page = virt_to_page(&Sector[i * blocksize]);
>          register ULONG offset = ((ULONG)(&Sector[i * blocksize])) % PAGE_SIZE;
>          register ULONG bytes;
> 
>          bytes = bio_add_page(io->bio, page, 
>                               PAGE_SIZE - (offset % PAGE_SIZE), 0);

offset instead of 0?

>          bytesWritten += bytes;
> #else
>          register struct page *page = virt_to_page(&Sector[i * blocksize]);
>          register ULONG offset = ((ULONG)(&Sector[i * blocksize])) % PAGE_SIZE;
> 
> 	 io->bio->bi_io_vec[i].bv_page = page;
> 	 io->bio->bi_io_vec[i].bv_len = blocksize;
> 	 io->bio->bi_io_vec[i].bv_offset = offset;
> 	 io->bio->bi_vcnt++;
> 	 io->bio->bi_phys_segments++;
> 	 io->bio->bi_hw_segments++;
> 	 io->bio->bi_size += blocksize;
>          bytesWritten += blocksize;
> #endif
>     }

Get rid of this if/else, it's not correct. 2.6 always had
bio_add_page(), and you _must_ use it.

>     // unplug the queue and drain the bathtub
>     bio_get(io->bio);
>     submit_bio(WRITE | (1 << BIO_RW_SYNC), io->bio);
>     bio_put(io->bio);

You don't get to get/put the bio here, you aren't touching it after
submit_bio().

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.6.9-rc2 bio sickness with large writes
  2004-09-16  6:34 ` Jens Axboe
@ 2004-09-16 16:38   ` Jeff V. Merkey
  2004-09-17  7:36     ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff V. Merkey @ 2004-09-16 16:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, jmerkey

>
>>static int end_bio_asynch(struct bio *bio, unsigned int bytes_done, int err)
>>{
>>    ASYNCH_IO *io = bio->bi_private;
>>
>>    
>>
>
>      if (bio->bi_size)
>		return 1;
>  
>

I guess the question here is if a group of coalesed bios are submitted, 
shouldn't I get a callback
on each one? This implies if they merge beneath me, I won't always see 
the callback for each one.
Is this an acuurate assumption?

>  
>
>>         register struct page *page = virt_to_page(&Sector[i * blocksize]);
>>         register ULONG offset = ((ULONG)(&Sector[i * blocksize])) % PAGE_SIZE;
>>         register ULONG bytes;
>>
>>         bytes = bio_add_page(io->bio, page, 
>>                              PAGE_SIZE - (offset % PAGE_SIZE), 0);
>>    
>>
>
>offset instead of 0?
>
>  
>
blocksize always = PAGE_SIZE in this code. PAGE_SIZE - (offset % 
PAGE_SIZE) determines the length of the
transfer. You could just as well substitute PAGE_SIZE for blocksize. I 
always set the device (if it supports it) to a
PAGE_SIZE for the blocksize. 1024 blocksize is a little small.

>Get rid of this if/else, it's not correct. 2.6 always had
>bio_add_page(), and you _must_ use it.
>
>  
>

Linus should then do the same in buffer.c since this example is 
misleading in the code. Linus seems to get away with it
so why can't I? :-)

buffer.c line 2776
/*
* from here on down, it's all bio -- do the initial mapping,
* submit_bio -> generic_make_request may further map this bio around
*/
bio = bio_alloc(GFP_NOIO, 1);

bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
bio->bi_io_vec[0].bv_page = bh->b_page;
bio->bi_io_vec[0].bv_len = bh->b_size;
bio->bi_io_vec[0].bv_offset = bh_offset(bh);

bio->bi_vcnt = 1;
bio->bi_idx = 0;
bio->bi_size = bh->b_size;

bio->bi_end_io = end_bio_bh_io_sync;
bio->bi_private = bh;

submit_bio(rw, bio);

>>    // unplug the queue and drain the bathtub
>>    bio_get(io->bio);
>>    submit_bio(WRITE | (1 << BIO_RW_SYNC), io->bio);
>>    bio_put(io->bio);
>>    
>>
>
>You don't get to get/put the bio here, you aren't touching it after
>submit_bio().
>
>  
>

I removed this from my code. You need to correct the example in bio.h 
with something more meaningful. This is what
the source code says to do.

bio.h line 213

/*
* get a reference to a bio, so it won't disappear. the intended use is
* something like: *
* bio_get(bio);
* submit_bio(rw, bio);
* if (bio->bi_flags ...)
* do_something
* bio_put(bio);
*
* without the bio_get(), it could potentially complete I/O before submit_bio
* returns. and then bio would be freed memory when if (bio->bi_flags ...)
* runs
*/

I still see the problem and I suspect it's related to the callback 
mechanism with the bio-bi_size check always returning 1. I guess
the question here is are bios guaranteed to always return a 1:1 ratio of 
submission vs. callbacks?

Jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.6.9-rc2 bio sickness with large writes
  2004-09-16 16:38   ` Jeff V. Merkey
@ 2004-09-17  7:36     ` Jens Axboe
       [not found]       ` <20040917201604.GA12974@galt.devicelogics.com>
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2004-09-17  7:36 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: linux-kernel, jmerkey

On Thu, Sep 16 2004, Jeff V. Merkey wrote:
> 
> >
> >>static int end_bio_asynch(struct bio *bio, unsigned int bytes_done, int 
> >>err)
> >>{
> >>   ASYNCH_IO *io = bio->bi_private;
> >>
> >>   
> >>
> >
> >     if (bio->bi_size)
> >		return 1;
> > 
> >
> 
> I guess the question here is if a group of coalesed bios are submitted, 
> shouldn't I get a callback
> on each one? This implies if they merge beneath me, I won't always see 
> the callback for each one.
> Is this an acuurate assumption?

No, you are reading it wrong. You will get a callback for each bio that
you issued, regardless of merge status. You never need to care about
merge status. But you might get partial completions from drivers, so
multiple completions on the same bio. Your code doesn't look like it's
handling that - you either need to take bytes_done into account, or just
add the above two lines if you don't care about partial completions.
When bi_size reaches 0, all io is done on this bio.

> >>        register struct page *page = virt_to_page(&Sector[i * blocksize]);
> >>        register ULONG offset = ((ULONG)(&Sector[i * blocksize])) % 
> >>        PAGE_SIZE;
> >>        register ULONG bytes;
> >>
> >>        bytes = bio_add_page(io->bio, page, 
> >>                             PAGE_SIZE - (offset % PAGE_SIZE), 0);
> >>   
> >>
> >
> >offset instead of 0?
> >
> > 
> >
> blocksize always = PAGE_SIZE in this code. PAGE_SIZE - (offset % 
> PAGE_SIZE) determines the length of the
> transfer. You could just as well substitute PAGE_SIZE for blocksize. I 
> always set the device (if it supports it) to a
> PAGE_SIZE for the blocksize. 1024 blocksize is a little small.

Ok, that's not visible from the source you sent. You should still change
that to offset in case this changes. Or if not, kill the PAGE_SIZE - xxx
stuff to avoid confusion.

> >Get rid of this if/else, it's not correct. 2.6 always had
> >bio_add_page(), and you _must_ use it.
> >
> > 
> >
> 
> Linus should then do the same in buffer.c since this example is 
> misleading in the code. Linus seems to get away with it
> so why can't I? :-)

Linus didn't write that, I did.
> 
> buffer.c line 2776
> /*
> * from here on down, it's all bio -- do the initial mapping,
> * submit_bio -> generic_make_request may further map this bio around
> */
> bio = bio_alloc(GFP_NOIO, 1);
> 
> bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
> bio->bi_bdev = bh->b_bdev;
> bio->bi_io_vec[0].bv_page = bh->b_page;
> bio->bi_io_vec[0].bv_len = bh->b_size;
> bio->bi_io_vec[0].bv_offset = bh_offset(bh);
> 
> bio->bi_vcnt = 1;
> bio->bi_idx = 0;
> bio->bi_size = bh->b_size;
> 
> bio->bi_end_io = end_bio_bh_io_sync;
> bio->bi_private = bh;
> 
> submit_bio(rw, bio);

And submit_bh() is correct, since it only uses a single page. You are
adding multiple pages to the bio manually, this is highly illegal. If
you aren't using bio_add_page() to do this you are both on your own and
in a world of pain support wise.

> >>   // unplug the queue and drain the bathtub
> >>   bio_get(io->bio);
> >>   submit_bio(WRITE | (1 << BIO_RW_SYNC), io->bio);
> >>   bio_put(io->bio);
> >>   
> >>
> >
> >You don't get to get/put the bio here, you aren't touching it after
> >submit_bio().
> >
> > 
> >
> 
> I removed this from my code. You need to correct the example in bio.h 
> with something more meaningful. This is what
> the source code says to do.
> 
> bio.h line 213
> 
> /*
> * get a reference to a bio, so it won't disappear. the intended use is
> * something like: *
> * bio_get(bio);
> * submit_bio(rw, bio);
> * if (bio->bi_flags ...)
> * do_something
> * bio_put(bio);
> *
> * without the bio_get(), it could potentially complete I/O before submit_bio
> * returns. and then bio would be freed memory when if (bio->bi_flags ...)
> * runs
> */

No, you need to read it properly. See how it looks at the bio after
submitting it in this example? That's why it gets a reference to it.
There's even a comment to that effect. You don't need to hold a
reference across the submit_bio() call if you don't look at the bio
afterwards, why would you care? This is, btw, no different than the bh
semantics since 2.4...

> I still see the problem and I suspect it's related to the callback 
> mechanism with the bio-bi_size check always returning 1. I guess
> the question here is are bios guaranteed to always return a 1:1 ratio of 
> submission vs. callbacks?

Fix your code like I described first, I cannot help you otherwise. Your
question is answered further up (it's yes).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

[parent not found: <20040917201604.GA12974@galt.devicelogics.com>]

* Re: 2.6.9-rc2 bio sickness with large writes
       [not found]       ` <20040917201604.GA12974@galt.devicelogics.com>
@ 2004-09-20 17:12         ` Jeff V. Merkey
  2004-09-20 18:09           ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff V. Merkey @ 2004-09-20 17:12 UTC (permalink / raw)
  To: jmerkey; +Cc: Jens Axboe, linux-kernel

jmerkey@galt.devicelogics.com wrote:

Jens,

Can you explain the circumstances related to the bio->bi_size callback. 
You stated (and I have observed)
that there may be callbacks from bi_end_io with bi_size set and that 
these should be ignored and
return a value of 1.

Can you explain to me under what circumstances I am likely to see this 
behavior? In other words, would
you explain the bio process start to finish with coalesced IO requests, 
at least as designed. Also, the whole
page and offset sematics in the interface are also somewhat burdensome. 
Wouldn't a more reasonable
interface for async IO be:

address
length
address
length

rather than

page structure
offset in page structure
page structure
offset in page structure

The hardware doesn't care about page mapping above it just needs to see 
addresses (and not always 4 byte aligned addresses)
and lengths for building scatter gather lists. Forcing page sematics 
seems a little orthagonal.

I can assume from the interface as designed that if you pass an offset 
for a page that is not page aligned,
and ask for a 4K write, then you will end up dropping the data on the 
floor than spans beyond the end of the page.

No offense, but this is **BUSTED** behavior for an asynch interface. The 
whole page offset thing is a little
difficult to use and pushes needless complexity to someone just needing 
to submit IO to the disk. I think this
is great for mmap for the VFS layer, and it dure makes it a lot easier 
to submit IO for fs with the mmap layer,
but as a general purpose async layer on top of generic_make_request, 
it's a little tough to use, IMHO.

Please advise,

Jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.6.9-rc2 bio sickness with large writes
  2004-09-20 17:12         ` Jeff V. Merkey
@ 2004-09-20 18:09           ` Jens Axboe
  2004-09-20 19:20             ` Jeff V. Merkey
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2004-09-20 18:09 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: jmerkey, linux-kernel

On Mon, Sep 20 2004, Jeff V. Merkey wrote:
> jmerkey@galt.devicelogics.com wrote:
> 
> Jens,
> 
> Can you explain the circumstances related to the bio->bi_size callback. 
> You stated (and I have observed)
> that there may be callbacks from bi_end_io with bi_size set and that 
> these should be ignored and
> return a value of 1.

Driver calls bio_endio() with bytes_done < bio->bi_size, thus the bio
isn't completely ended. Drivers may do this for a variety of reasons.

> Can you explain to me under what circumstances I am likely to see this 
> behavior? In other words, would

On drivers that do partial completions :-)

You should not worry about where you'll see it. Either you handle
partial completions by taking bytes_done into account, or you ignore the
issue and just look at bio->bi_size to determine whether all io in this
bio has completed.

> you explain the bio process start to finish with coalesced IO requests, 
> at least as designed. Also, the whole

Well, you build a bio and submit it. If another bio is built after that
and is contig on disk (either before or after the other one), they are
clustered in the request structure if they don't violate the
restrictions of the hardware (max size of a single request, max number
segments, etc).

> page and offset sematics in the interface are also somewhat burdensome. 
> Wouldn't a more reasonable
> interface for async IO be:
> 
> address
> length
> address
> length
> 
> rather than
> 
> page structure
> offset in page structure
> page structure
> offset in page structure

No, because { address, length } cannot fully describe all memory in any
given machine.

> The hardware doesn't care about page mapping above it just needs to see 
> addresses (and not always 4 byte aligned addresses)
> and lengths for building scatter gather lists. Forcing page sematics 
> seems a little orthagonal.

Any chunk of memory has a page associated with it, but it may not have a
kernel address mapping associated with it. So some identifier was needed
other than a virtual address, a page is as good as any so making one up
would be silly.

Once you understand this, it doesn't seem so odd. You need to pass in a
single page or sg table to map for dma anyways, the sg table holds page
pointers as well.

> I can assume from the interface as designed that if you pass an offset 
> for a page that is not page aligned,
> and ask for a 4K write, then you will end up dropping the data on the 
> floor than spans beyond the end of the page.

What kind of bogus example is that? Asking for a 4K write from a 4K page
but asking to start 1K in that page is just stupid and not even remotely
valid.

> No offense, but this is **BUSTED** behavior for an asynch interface. The 
> whole page offset thing is a little
> difficult to use and pushes needless complexity to someone just needing 
> to submit IO to the disk. I think this

It's not difficult at all. Apparently you don't understand it so you
think it's difficult, that's only natural. But you have access to the
page mapping of any given piece of data always, or if you have the
virtual address only it's trivial to go to the { page, offset } mapping.

> is great for mmap for the VFS layer, and it dure makes it a lot easier 
> to submit IO for fs with the mmap layer,
> but as a general purpose async layer on top of generic_make_request, 
> it's a little tough to use, IMHO.

I can only imagine that you are used to a very different interface on
some other OS so you think it's difficult to use. Most of your
complaints seem to be based on false assumptions or because you don't
understand why certain design decisions were made.

> Please advise,

I'll advise that you read and understand Suparnas excellent description
of bio in Documentation/block/biodoc.txt.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.6.9-rc2 bio sickness with large writes
  2004-09-20 18:09           ` Jens Axboe
@ 2004-09-20 19:20             ` Jeff V. Merkey
  2004-09-21  6:40               ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff V. Merkey @ 2004-09-20 19:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jmerkey, linux-kernel

Jens Axboe wrote:

>>page and offset sematics in the interface are also somewhat burdensome. 
>>Wouldn't a more reasonable
>>interface for async IO be:
>>
>>address
>>length
>>address
>>length
>>
>>rather than
>>
>>page structure
>>offset in page structure
>>page structure
>>offset in page structure
>>    
>>
>
>No, because { address, length } cannot fully describe all memory in any
>given machine.
>
>  
>
This response I don't understand. How memory is described in a machine 
for DMA addressibility
is pretty standard (with the exception of memory on intel machine in 32 
bit systems above 4GBthat need page tables) --
a physical numerical address. But who is going to DMA into memory not in 
the address space.


>Any chunk of memory has a page associated with it, but it may not have a
>kernel address mapping associated with it. So some identifier was needed
>other than a virtual address, a page is as good as any so making one up
>would be silly.
>
>Once you understand this, it doesn't seem so odd. You need to pass in a
>single page or sg table to map for dma anyways, the sg table holds page
>pointers as well.
>
>  
>
>>I can assume from the interface as designed that if you pass an offset 
>>for a page that is not page aligned,
>>and ask for a 4K write, then you will end up dropping the data on the 
>>floor than spans beyond the end of the page.
>>    
>>
>
>What kind of bogus example is that? Asking for a 4K write from a 4K page
>but asking to start 1K in that page is just stupid and not even remotely
>valid.
>
>  
>
Hardware doesn't care about page boundries. It sees hardware addresses 
and lengths, at
least most SG hardware I've worked with does. For ease of submission, an 
interface that
takes <address,length> would suffice. Why on earth would someone need a 
context
pointer into the kernel's page tables to submit an SG into a device, 
apart from performing
virtual-to-physical translation?

>It's not difficult at all. Apparently you don't understand it so you
>think it's difficult, that's only natural. But you have access to the
>page mapping of any given piece of data always, or if you have the
>virtual address only it's trivial to go to the { page, offset } mapping.
>  
>
No, I do understand, and using page/offset at a low level SG interface 
IS burdensome.
I mean, if this is what I have to support I guess I have to use it, but 
it will be just another
section of code where I have another **FAT** layer to waste more CPU 
cycles calculating
offset/page (oh yeah I have to lookup the struct page * structure also) 
when it would be much
simpler to just submit address/len in i386 systems. With this type of 
interface, If I have for instance
an on-disk structure that starts in the middle of a 4K page due to other 
headers, etc. than spans
a page, I cannot just submit the address and length, I have to break it 
into two bio requests instead
of one with a for () loop from hell and calculate the offsets and rumage 
around in memory looking
up struct page * addresses.

>I can only imagine that you are used to a very different interface on
>some other OS so you think it's difficult to use. Most of your
>complaints seem to be based on false assumptions or because you don't
>understand why certain design decisions were made.
>
>  
>

No. I am used to programming to hardware with SG devices that all OS 
use. Is there somewhere a page based
SG device (other than SCI) for disk drive?. I don't think so, I think 
they operate address/len, address/len, etc.

:-)

Jeff



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: 2.6.9-rc2 bio sickness with large writes
  2004-09-20 19:20             ` Jeff V. Merkey
@ 2004-09-21  6:40               ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2004-09-21  6:40 UTC (permalink / raw)
  To: Jeff V. Merkey; +Cc: jmerkey, linux-kernel

On Mon, Sep 20 2004, Jeff V. Merkey wrote:
> Jens Axboe wrote:
> 
> >>page and offset sematics in the interface are also somewhat burdensome. 
> >>Wouldn't a more reasonable
> >>interface for async IO be:
> >>
> >>address
> >>length
> >>address
> >>length
> >>
> >>rather than
> >>
> >>page structure
> >>offset in page structure
> >>page structure
> >>offset in page structure
> >>   
> >>
> >
> >No, because { address, length } cannot fully describe all memory in any
> >given machine.
> >
> > 
> >
> This response I don't understand. How memory is described in a machine 
> for DMA addressibility
> is pretty standard (with the exception of memory on intel machine in 32 
> bit systems above 4GBthat need page tables) --
> a physical numerical address. But who is going to DMA into memory not in 
> the address space.

The mapped address, yes, not the data you are actually mapping for dma.
On 32-bit intel boxes, even with just 1GB of memory, you don't have a
kernel mapping of the memory above 960MB.

> >Any chunk of memory has a page associated with it, but it may not have a
> >kernel address mapping associated with it. So some identifier was needed
> >other than a virtual address, a page is as good as any so making one up
> >would be silly.
> >
> >Once you understand this, it doesn't seem so odd. You need to pass in a
> >single page or sg table to map for dma anyways, the sg table holds page
> >pointers as well.
> >
> > 
> >
> >>I can assume from the interface as designed that if you pass an offset 
> >>for a page that is not page aligned,
> >>and ask for a 4K write, then you will end up dropping the data on the 
> >>floor than spans beyond the end of the page.
> >>   
> >>
> >
> >What kind of bogus example is that? Asking for a 4K write from a 4K page
> >but asking to start 1K in that page is just stupid and not even remotely
> >valid.
> >
> > 
> >
> Hardware doesn't care about page boundries. It sees hardware addresses
> and lengths, at least most SG hardware I've worked with does. For ease
> of submission, an interface that takes <address,length> would suffice.

But what if there's no 'address' that identifies the piece of memory you
want to do io for? It might not be directly addressable by the kernel.

The pci dma mapping will coalesce the segments for you in the sg table
according to the hardware restrictions. Most hardware understands sg
addressing, if it doesn't it's a bad design imho.

> Why on earth would someone need a context pointer into the kernel's
> page tables to submit an SG into a device, apart from performing
> virtual-to-physical translation?

Ehm well exactly to do that virtual-to-physical translation, perhaps?
That's the whole point.

> >It's not difficult at all. Apparently you don't understand it so you
> >think it's difficult, that's only natural. But you have access to the
> >page mapping of any given piece of data always, or if you have the
> >virtual address only it's trivial to go to the { page, offset } mapping.
> > 
> >
> No, I do understand, and using page/offset at a low level SG interface

You pretty much continue to describe throughout this mail that you do
not understand why that is so.

> IS burdensome.  I mean, if this is what I have to support I guess I
> have to use it, but it will be just another section of code where I
> have another **FAT** layer to waste more CPU cycles calculating
> offset/page (oh yeah I have to lookup the struct page * structure
> also) when it would be much simpler to just submit address/len in i386

What a load of crap. Where does this waste of cycles come from?

> systems. With this type of interface, If I have for instance an
> on-disk structure that starts in the middle of a 4K page due to other
> headers, etc. than spans a page, I cannot just submit the address and
> length, I have to break it into two bio requests instead of one with a
> for () loop from hell and calculate the offsets and rumage around in
> memory looking up struct page * addresses.

Calculating an offset is extremely complicated and wasteful, indeed.
It's at least a few cycles.

If you just used the bio_add_page() stuff like you are supposed to, you
would not have to do that either.

> >I can only imagine that you are used to a very different interface on
> >some other OS so you think it's difficult to use. Most of your
> >complaints seem to be based on false assumptions or because you don't
> >understand why certain design decisions were made.
> >
> > 
> >
> 
> No. I am used to programming to hardware with SG devices that all OS 
> use. Is there somewhere a page based
> SG device (other than SCI) for disk drive?. I don't think so, I think 
> they operate address/len, address/len, etc.

Where 'address' is the dma address, not the virtual address. You need
the page to do that translation.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2004-09-21  6:42 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-09-15 23:39 2.6.9-rc2 bio sickness with large writes Jeff V. Merkey
2004-09-16  6:34 ` Jens Axboe
2004-09-16 16:38   ` Jeff V. Merkey
2004-09-17  7:36     ` Jens Axboe
     [not found]       ` <20040917201604.GA12974@galt.devicelogics.com>
2004-09-20 17:12         ` Jeff V. Merkey
2004-09-20 18:09           ` Jens Axboe
2004-09-20 19:20             ` Jeff V. Merkey
2004-09-21  6:40               ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox