public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* block_device usage and incorrect block writes
@ 2007-01-18  1:08 Chris Frost
  2007-01-18  2:13 ` Jens Axboe
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Frost @ 2007-01-18  1:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: KudOS

We are working on a kernel module which uses the linux block device
interface as part of a larger project, are seeing unexpected block write
behavior from our usage of the noop scheduler, and were wondering whether
anyone might have feedback on what the behavior we see?

We would like to send block writes such that they are written to the
drive controller in fifo order, so we are using the noop scheduler.
However, a small percentage (1-5 of ~50,000) of block writes
end up with incorrect data on the disk. We have determined that for each
of these incorrect blocks, the last write for the block was issued while
a previous write to the block was still queued (that is, the bio end
function had not yet been called) and that the next to last issued
write (that is, the generic_make_request function call) for the block contains
the data that ends up on the disk.

Here are more details we have noticed. The behavior appears to be
timing sensitive; multiple runs each may or may not work and
introducing slow downs (i.e. write barriers or many printks) make the
problem disappear with high certainty. About 2% of our block device
writes are writes to a block when there is still a write to that block
in the queue. About 0.3% of these 2% result in incorrect data on
the disk.

A possibly related (and unexpected) behavior we have noticed is that the
bio end function is not always called in the same order as our calls
to generic_make_request(). We are not sure whether this indicates that
the requests are being written to disk in the callback order, but would
like to fix this if so (since we want the writes made in the order of our
requests).

Below is the essence of our write code. We also make read requests, but do
not include the read code below.

struct block_device *dev;

int my_end(struct bio *bio, unsigned int done, int error)
{
	if (bio->bi_size)
		return 1; /* everyone else (in 2.6.12) returns in this case */
	__free_page(bio_iovec_idx(bio, 0)->bv_page);
	bio_iovec_idx(bio, 0)->bv_page = NULL:
	bio_iovec_idx(bio, 0)->bv_len = 0:
	bio_iovec_idx(bio, 0)->bv_offset = 0:
	bio_put(bio);
	return error;
}

void write_block(...)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);

	struct bio_vec *bv = bio_iovec_idx(bio, 0);
	bv->bv_page = alloc_page(GFP_KERNEL | GFP_DMA);
	memcpy(page_address(bv->bv_page), ...);
	bv->bv_len = ...;
	bv->bv_offset = 0;

	bio->bi_idx = 0;
	bio->bi_vcnt = 1;
	bio->bi_sector = ...;
	bio->bi_size = ...;
	bio->bi_bdev = dev;
	bio->bi_rw = WRITE;
	bio->bi_end_io = my_end;

	generic_make_request(bio);
}

void init(const char *path)
{
	path_lookup(path, LOOKUP_FOLLOW, nd);
	dev = open_by_devnum(nd.dentry->d_inode->i_rde, mode);
	bd_claim(dev, claimer);
}

thanks in advance for any feedback!
-- 
Chris Frost  |  <http://www.frostnet.net/chris/>
-------------+----------------------------------
Public PGP Key:
   Email chris@frostnet.net with the subject "retrieve pgp key"
   or visit <http://www.frostnet.net/chris/about/pgp_key.phtml>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2007-01-24  9:21 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-01-18  1:08 block_device usage and incorrect block writes Chris Frost
2007-01-18  2:13 ` Jens Axboe
2007-01-18 10:31   ` Jan Engelhardt
2007-01-24  8:21   ` Chris Frost
2007-01-24  9:23     ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox