* [PATCH] Bio Traversal Changes
@ 2002-08-02 12:35 Suparna Bhattacharya
From: Suparna Bhattacharya @ 2002-08-02 12:35 UTC (permalink / raw)
To: linux-kernel, linux-scsi, axboe; +Cc: B.Zolnierkiewicz, akpm
This has been discussed before in principle on lkml and
mentioned by Jens at the kernel summit. Here's some quick
information about the changes, and the latest patches
against 2.5.30, including bio and request documentation
updates.
Jens, biotr8 is mostly just an update of biotr7 to 2.5.30,
plus minor cleanups like the removal of the commented-out
debug printks I had kept around so far.
The doc patch got updated a bit since last night - have
also tried to modify request.txt in addition to biodoc.txt.
BIO traversal enhancements
--------------------------
- Pre-req for full BIO splitting infrastructure [Splits
in the middle of a segment can also share the same vec]
- Pre-req for attaching a pre-built vector (not owned by block)
to a bio (e.g for aio) [Avoids certain subtle side-effects in
special cases]
- Pre-req for IDE multi-count PIO write fixes [Ability
to traverse a bio partially for i/o submission without
affecting bio completion]
Patches are based on 2.5.30 (and will follow in subsequent
mails)
Many many thanks to Bartlomiej for helping out in review
and testing, fixing several bugs in the generic code, and
using the bio walking routines in this patch to actually
get IDE PIO working (single and multi count)!
The patches must all be applied together to get a
compiling and booting kernel (on some h/w at least), so it
might make sense to roll them up for submission to ensure
that they are committed together.
01biotr8-blk.diff: Contains the core changes for bio traversal,
including (a) avoidance of alteration of bv_offset/bv_len by
block (b) introduction of bi_voffset to help in splitting a
bio via cloning, even when the split has to happen in the
middle of a segment and (c) clean separation of submission and
completion state for a request. The bio always reflects the
state of unfinished i/o, while the request structure also tracks
the state of unsubmitted i/o (these two may be different,
e.g. for ide multi-count pio writes). A helper routine
process_that_request_first() is provided to enable drivers
to traverse a request for i/o submission without affecting
completion state. It also has a few fixes for partial
completion cases in end_that_request_first() (non error
cases).
[Andrew, I have uninlined blk_rq_next_segment() and
process_that_request_first() as you'd suggested and moved
them into ll_rw_blk.c]
02biotr8-blkusers.diff: Contains the modifications needed to
code that uses/accesses bio (e.g. mm, filesystems) above blk.
Most of the changes ensure correct initialisation of the
bi_voffset field when building a bio.
(BTW, looking at the number of places this touches, a
helper for building a bio would be good to have.)
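For illustration only, such a helper might look like the sketch below; bio_build() is a hypothetical name (no such helper exists in the patch), and the structs are cut down to just the fields relevant here:

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures; the real ones carry
 * many more fields. bio_build() is a hypothetical helper name. */
struct bio_vec { void *bv_page; unsigned int bv_len; unsigned int bv_offset; };
struct bio {
	struct bio_vec *bi_io_vec;
	unsigned short bi_vcnt, bi_idx;
	unsigned short bi_voffset, bi_endvoffset;
	unsigned int bi_size;
};

/* Initialize a bio over a prebuilt vector, setting bi_voffset and
 * bi_endvoffset consistently so callers cannot forget them. */
static void bio_build(struct bio *bio, struct bio_vec *vec, unsigned short vcnt)
{
	unsigned int size = 0;
	unsigned short i;

	for (i = 0; i < vcnt; i++)
		size += vec[i].bv_len;

	bio->bi_io_vec = vec;
	bio->bi_vcnt = vcnt;
	bio->bi_idx = 0;
	bio->bi_size = size;
	/* bio starts at the first vec and ends at the last one */
	bio->bi_voffset = vec[0].bv_offset;
	bio->bi_endvoffset = vec[vcnt - 1].bv_offset + vec[vcnt - 1].bv_len;
}
```

A builder like this would centralize the bi_voffset/bi_endvoffset initialisation that 02biotr8-blkusers.diff currently repeats at each bio construction site.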
03biotr8-drivers.diff: Contains minimal modifications to
(some) drivers to compile/work with the changed bio
traversal and mapping assumptions. It would not fix any
other breakages that exist already in the drivers (e.g.
doesn't fix any existing IDE problems, and doesn't
touch LVM).
Bartlomiej's patch for ATA PIO with the IDE driver has
the correct changes to get to a working version including
some state machine fixes, and uses the bio traversal helper
for i/o submission in a clean manner for PIO operations.
Modifications have also been made in drivers like loop and
floppy to either account for bi_voffset or to have a
sanity-check BUG_ON if they get passed a bio which
starts/ends in the middle of a segment, as a reminder to
fix it up later. This would need to be completed, including
extending to other drivers and appropriate testing to ensure
compatibility.
04biotr8-doc.diff: Contains the updates to Documentation/block
corresponding to the changes. I have taken the liberty of
updating a few other parts of the doc that were strikingly
obsolete.
Details:
=========
Introduces some new fields in bio and the request structure
to help maintain traversal state.
Bio fields:
----------
bi_voffset:
Offset relative to the start of the first page, which
indicates where the bio really starts. In general, before
i/o starts this would be the same as bv_offset for the
first vec (at bi_idx), but for cloned bios where the bio
may be split in the middle of a segment it could be
different. As i/o progresses, instead of changing any of
the bvec fields, bi_voffset is moved ahead.
The relative offset w.r.t. the start of the first vec
can be calculated using the macro bio_startoffset(bio).
bi_endvoffset:
Offset relative to the last page, which indicates
where the bio really ends. In general this would be the same
as bv_offset + bv_len for the last vec, but for cloned
bios where a split piece ends in the middle of a segment,
it could be different. This field is used mainly for
segment boundary and merge checks (it is more convenient
than having to walk through the entire bio and use
bi_size to determine the end just to check mergeability).
The macro bio_endoffset(bio) can be used to calculate the
relative offset w.r.t. the bvec end where the bio
was broken up.
The remaining size to be transferred in the current bio
vec should be calculated using the bio_segsize() routine
(instead of accessing bv_len directly any more). This
takes care of adjusting the length for the above offsets.
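The interplay of these fields can be sketched in user-space C with cut-down stand-in structures; the helpers below mirror the definitions of bio_startoffset(), bio_endoffset() and bio_segsize() from the patch, but are illustrative models, not the kernel code:

```c
#include <assert.h>

/* Reduced structures, just enough to model the offset bookkeeping. */
struct bio_vec { unsigned int bv_len, bv_offset; };
struct bio {
	struct bio_vec *bi_io_vec;
	unsigned short bi_vcnt, bi_idx;
	unsigned short bi_voffset, bi_endvoffset;
	unsigned int bi_size;
};

#define bio_iovec(bio)	(&(bio)->bi_io_vec[(bio)->bi_idx])
#define bvec_end(bio)	(&(bio)->bi_io_vec[(bio)->bi_vcnt - 1])

/* bytes already consumed within the current (first unfinished) vec */
static unsigned int bio_startoffset(struct bio *bio)
{
	return bio->bi_voffset - bio_iovec(bio)->bv_offset;
}

/* unused tail within the last vec (nonzero for a split clone) */
static unsigned int bio_endoffset(struct bio *bio)
{
	return bvec_end(bio)->bv_offset + bvec_end(bio)->bv_len
		- bio->bi_endvoffset;
}

/* remaining bytes in the current segment, clipped to bi_size */
static unsigned int bio_segsize(struct bio *bio)
{
	unsigned int len = bio_iovec(bio)->bv_len - bio_startoffset(bio);

	return len > bio->bi_size ? bio->bi_size : len;
}
```

For example, a two-vec bio (1024 + 512 bytes) whose i/o has progressed 256 bytes into the first vec has bi_voffset = 256 and bi_size = 1280, so bio_segsize() returns the 768 bytes left in that first segment.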
Aside:
An alternative to bi_voffset being an absolute
offset w.r.t. the start of the bvec page would be to
make it relative to bio_io_vec(bio)->bv_offset instead
(i.e. the value bio_startoffset() returns in the patch). A
similar change would then apply to bi_endvoffset. Then
the fields would be initialized to zero by default,
though it also would make the mergeability check macros
a little longer, and possibly add a little extra computation
during request mapping and end_that_request_first.
Request structure fields:
------------------------
The basic protocol followed throughout is that the bio
fields always reflect the status w.r.t. how much i/o
remains to be completed. Submission status, on the other
hand, is maintained only in the request structure. In most
cases, of course, both move in sync (the generic
end_that_request_first code tries to handle that
transparently by advancing the submission pointers if they
are behind the completion pointers, as would happen with
drivers that don't modify those themselves), but for
things like IDE multi-count write, the submission
counters/pointers may be ahead of the completion pointers.
The following fields have been added to help maintain
this distinction.
hard_bio
The rq->bio field now reflects the next bio which
is to be submitted for i/o. Hence the need for rq->hard_bio,
which keeps track of the next bio to be completed (this
is the one used by end_that_request_first now, instead
of rq->bio).
nr_bio_segments
this keeps track of how many more vecs remain
to be submitted in the current bio (rq->bio). It is
used to compute the current index into rq->bio which
specifies the segment under submission.
(rq_map_buffer for example uses this field to map
the right buffer)
nr_bio_sectors
this keeps track of the number of sectors to
be submitted in the current bio (rq->bio). It can be
used to compute the remaining sectors in the current
segment in the situation when it is the last segment.
Now a subtle point about hard_cur_sectors. It reflects
the number of sectors left to be completed in the
_current_ segment under submission (i.e. the segment
in rq->bio, and _not_ rq->hard_bio). This makes it
possible to use it in rq_map_buffer to determine the
relative offset in the current segment w.r.t. what
the bio indices might indicate.
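The submission-side position implied by these fields can be expressed as two small pure helpers, mirroring the blk_rq_idx() and blk_rq_offset() macros the patch adds to blkdev.h (the functions below are illustrative stand-ins, not the kernel macros themselves):

```c
#include <assert.h>

/* Index of the segment currently under submission in rq->bio:
 * segments already consumed = bi_vcnt - nr_bio_segments
 * (mirrors the patch's blk_rq_idx() macro). */
static unsigned short rq_idx(unsigned short bi_vcnt,
			     unsigned short nr_bio_segments)
{
	return bi_vcnt - nr_bio_segments;
}

/* Byte offset into that segment: the part already submitted,
 * i.e. hard_cur_sectors - current_nr_sectors, converted to bytes
 * (mirrors the patch's blk_rq_offset() macro). */
static unsigned long rq_offset(unsigned int hard_cur_sectors,
			       unsigned int current_nr_sectors)
{
	return (unsigned long)(hard_cur_sectors - current_nr_sectors) << 9;
}
```

So with a 4-vec bio and 3 segments still to submit, the segment under submission is index 1, and if 3 of its 8 sectors have gone out, the buffer mapping starts 1536 bytes into that segment.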
rq_map_buffer() is a macro that would be used to
obtain a virtual address mapping corresponding to the
current segment of the buffer being submitted for i/o.
(It is expected to replace the use of private driver
versions of the same operation, e.g. ide_map_buffer().)
A new helper, process_that_request_first() has been
introduced for updating submission state of the request
without completing the corresponding bios. It can be used
by code such as multi-count write which needs to traverse
multiple bio segments for each chunk of i/o submitted,
where the chunk does not cover the entire request.
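The traversal logic can be sketched as a self-contained user-space model; the functions below mirror blk_rq_next_segment() and process_that_request_first() for the simplified case of a request containing a single bio (rq->sector updates and multi-bio chaining are omitted, and the structure is a reduced stand-in):

```c
#include <assert.h>

/* Minimal model of the submission-side state of a request whose
 * single bio has vcnt segments of seg_sectors[i] sectors each. */
struct rq_model {
	unsigned long nr_sectors;	/* sectors left to submit in request */
	unsigned int current_nr_sectors;/* sectors left in current segment */
	unsigned long nr_bio_sectors;	/* sectors left in current bio */
	unsigned short nr_bio_segments;	/* segments left in current bio */
	const unsigned int *seg_sectors;/* per-segment sizes, in sectors */
	unsigned short vcnt;
};

/* Mirrors blk_rq_next_segment(): step to the next segment once the
 * current one is fully submitted; a no-op otherwise. */
static void next_segment(struct rq_model *rq)
{
	if (rq->current_nr_sectors > 0)
		return;
	if (rq->nr_bio_sectors > 0) {
		--rq->nr_bio_segments;
		rq->current_nr_sectors =
			rq->seg_sectors[rq->vcnt - rq->nr_bio_segments];
	}
}

/* Mirrors process_that_request_first(): consume nr_sectors of
 * submission state without touching completion state. */
static int submit_sectors(struct rq_model *rq, unsigned long nr_sectors)
{
	if (rq->nr_sectors < nr_sectors)
		return 0;	/* more requested than the request holds */

	rq->nr_sectors -= nr_sectors;
	while (nr_sectors) {
		unsigned long nsect = rq->current_nr_sectors < nr_sectors ?
				      rq->current_nr_sectors : nr_sectors;
		rq->current_nr_sectors -= nsect;
		rq->nr_bio_sectors -= nsect;
		nr_sectors -= nsect;
		next_segment(rq);
	}
	return 1;
}
```

A driver submitting 6 sectors of a two-segment (4 + 4 sector) request would thus cross the segment boundary mid-chunk, leaving 2 sectors pending in the second segment while the completion counters stay untouched.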
Regards
Suparna
* Re: [PATCH] Bio Traversal Changes (Patch 1/4: biotr8-blk.diff)
From: Suparna Bhattacharya @ 2002-08-02 12:43 UTC (permalink / raw)
To: linux-kernel, linux-scsi, axboe
This patch has the core changes for bio traversal (as explained
in the last mail). Adds bi_voffset, bi_endvoffset fields
to the bio, request struct fields for tracking submission state,
and a process_that_request_first() routine for traversing a
request for i/o submission (without completion).
diff -ur linux-2.5.30-pure/drivers/block/elevator.c linux-2.5.30-bio/drivers/block/elevator.c
--- linux-2.5.30-pure/drivers/block/elevator.c Fri Aug 2 10:08:27 2002
+++ linux-2.5.30-bio/drivers/block/elevator.c Fri Aug 2 10:40:41 2002
@@ -236,6 +236,9 @@
if (!insert_here)
insert_here = q->queue_head.prev;
+ if (rq->bio)
+ BIO_BUG_ON(!bio_consistent(rq->bio)); /* debug only */
+
if (!(rq->flags & REQ_BARRIER))
lat = latency[rq_data_dir(rq)];
@@ -309,6 +312,9 @@
void elevator_noop_add_request(request_queue_t *q, struct request *rq,
struct list_head *insert_here)
{
+ if (rq->bio)
+ BIO_BUG_ON(!bio_consistent(rq->bio)); /* debug only */
+
list_add_tail(&rq->queuelist, &q->queue_head);
/*
@@ -389,6 +395,9 @@
void __elv_add_request(request_queue_t *q, struct request *rq,
struct list_head *insert_here)
{
+ if (rq->bio)
+ BIO_BUG_ON(!bio_consistent(rq->bio)); /* debug only */
+
q->elevator.elevator_add_req_fn(q, rq, insert_here);
}
@@ -412,7 +421,19 @@
/*
* all ok, break and return it
+ * after presetting some fields (e.g. the req might
+ * have been restarted)
*/
+ if ((rq->bio = rq->hard_bio)) {
+ rq->nr_bio_segments = bio_segments(rq->bio);
+ rq->nr_bio_sectors = bio_sectors(rq->bio);
+ rq->hard_cur_sectors = bio_segsize(rq->bio) >> 9;
+ rq->buffer = bio_data(rq->bio);
+ }
+ rq->sector = rq->hard_sector;
+ rq->nr_sectors = rq->hard_nr_sectors;
+ rq->current_nr_sectors = rq->hard_cur_sectors;
+
if (!q->prep_rq_fn(q, rq))
break;
@@ -425,6 +446,9 @@
end_that_request_last(rq);
}
+
+ if (rq->bio)
+ BIO_BUG_ON(!bio_consistent(rq->bio)); /* debug only */
return rq;
}
diff -ur linux-2.5.30-pure/drivers/block/ll_rw_blk.c linux-2.5.30-bio/drivers/block/ll_rw_blk.c
--- linux-2.5.30-pure/drivers/block/ll_rw_blk.c Fri Aug 2 10:08:27 2002
+++ linux-2.5.30-bio/drivers/block/ll_rw_blk.c Fri Aug 2 11:57:20 2002
@@ -569,10 +569,21 @@
{
struct bio_vec *bv, *bvprv = NULL;
int i, nr_phys_segs, nr_hw_segs, seg_size, cluster;
+ int offset;
if (unlikely(!bio->bi_io_vec))
return;
+ if (unlikely(bio->bi_idx >= bio->bi_vcnt))
+ return;
+
+ /*
+ * Relative offset into the first bvec where the data for
+ * this bio starts (may be non-zero for cloned bio split off
+ * in the middle of a segment)
+ */
+ offset = bio_startoffset(bio);
+
cluster = q->queue_flags & (1 << QUEUE_FLAG_CLUSTER);
seg_size = nr_phys_segs = nr_hw_segs = 0;
bio_for_each_segment(bv, bio, i) {
@@ -580,8 +591,12 @@
int phys, seg;
if (seg_size + bv->bv_len > q->max_segment_size) {
- nr_phys_segs++;
- goto new_segment;
+ if ((i != bio->bi_vcnt - 1) ||
+ (seg_size + bio->bi_endvoffset - bv->bv_offset
+ > q->max_segment_size)) {
+ nr_phys_segs++;
+ goto new_segment;
+ }
}
phys = BIOVEC_PHYS_MERGEABLE(bvprv, bv);
@@ -599,11 +614,17 @@
continue;
} else {
nr_phys_segs++;
+ nr_hw_segs++;
+ seg_size = bv->bv_len - offset;
+ bvprv = bv;
+ offset = 0; /* for all except the first bv */
+ continue;
}
new_segment:
nr_hw_segs++;
bvprv = bv;
seg_size = bv->bv_len;
+ offset = 0; /* for all except the first bv */
}
bio->bi_phys_segments = nr_phys_segs;
@@ -618,7 +639,8 @@
if (!(q->queue_flags & (1 << QUEUE_FLAG_CLUSTER)))
return 0;
- if (!BIOVEC_PHYS_MERGEABLE(__BVEC_END(bio), __BVEC_START(nxt)))
+ if (!BIOVEC_PHYS_MERGEABLE_PARTIAL(__BVEC_END(bio), bio->bi_endvoffset,
+ __BVEC_START(nxt), nxt->bi_voffset))
return 0;
if (bio->bi_size + nxt->bi_size > q->max_segment_size)
return 0;
@@ -639,7 +661,8 @@
if (!(q->queue_flags & (1 << QUEUE_FLAG_CLUSTER)))
return 0;
- if (!BIOVEC_VIRT_MERGEABLE(__BVEC_END(bio), __BVEC_START(nxt)))
+ if (!BIOVEC_VIRT_MERGEABLE_PARTIAL(__BVEC_END(bio), bio->bi_endvoffset,
+ __BVEC_START(nxt), nxt->bi_voffset))
return 0;
if (bio->bi_size + nxt->bi_size > q->max_segment_size)
return 0;
@@ -663,6 +686,7 @@
struct bio_vec *bvec, *bvprv;
struct bio *bio;
int nsegs, i, cluster;
+ int offset, endoffset;
nsegs = 0;
cluster = q->queue_flags & (1 << QUEUE_FLAG_CLUSTER);
@@ -671,20 +695,27 @@
* for each bio in rq
*/
bvprv = NULL;
+ endoffset = 0;
rq_for_each_bio(bio, rq) {
/*
* for each segment in bio
*/
+ offset = bio_startoffset(bio);
bio_for_each_segment(bvec, bio, i) {
- int nbytes = bvec->bv_len;
+ int nbytes = bvec->bv_len - offset;
+ int start = bvec->bv_offset + offset;
if (bvprv && cluster) {
+ int end = bvprv->bv_offset + bvprv->bv_len -
+ endoffset;
if (sg[nsegs - 1].length + nbytes > q->max_segment_size)
goto new_segment;
- if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
+ if (!BIOVEC_PHYS_MERGEABLE_PARTIAL(bvprv,
+ end, bvec, start))
goto new_segment;
- if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
+ if (!BIOVEC_SEG_BOUNDARY_PARTIAL(q, bvprv,
+ end, bvec, start))
goto new_segment;
sg[nsegs - 1].length += nbytes;
@@ -693,12 +724,16 @@
memset(&sg[nsegs],0,sizeof(struct scatterlist));
sg[nsegs].page = bvec->bv_page;
sg[nsegs].length = nbytes;
- sg[nsegs].offset = bvec->bv_offset;
+ sg[nsegs].offset = bvec->bv_offset + offset;
nsegs++;
}
bvprv = bvec;
+ offset = 0;
+ endoffset = 0;
} /* segments in bio */
+ sg[nsegs - 1].length -= bio_endoffset(bio);
+ endoffset = bio_endoffset(bio);
} /* bios in rq */
return nsegs;
@@ -761,7 +796,9 @@
return 0;
}
- if (BIOVEC_VIRT_MERGEABLE(__BVEC_END(req->biotail), __BVEC_START(bio)))
+ if (BIOVEC_VIRT_MERGEABLE_PARTIAL(__BVEC_END(req->biotail),
+ req->biotail->bi_endvoffset,
+ __BVEC_START(bio), bio->bi_voffset))
return ll_new_mergeable(q, req, bio);
return ll_new_hw_segment(q, req, bio);
@@ -776,7 +813,8 @@
return 0;
}
- if (BIOVEC_VIRT_MERGEABLE(__BVEC_END(bio), __BVEC_START(req->bio)))
+ if (BIOVEC_VIRT_MERGEABLE_PARTIAL(__BVEC_END(bio), bio->bi_endvoffset,
+ __BVEC_START(req->bio), req->bio->bi_voffset))
return ll_new_mergeable(q, req, bio);
return ll_new_hw_segment(q, req, bio);
@@ -1470,7 +1508,8 @@
sector = bio->bi_sector;
nr_sectors = bio_sectors(bio);
- cur_nr_sectors = bio_iovec(bio)->bv_len >> 9;
+ cur_nr_sectors = bio_segsize(bio) >> 9;
+
rw = bio_data_dir(bio);
/*
@@ -1520,8 +1559,12 @@
break;
}
+
bio->bi_next = req->bio;
- req->bio = bio;
+ req->hard_bio = req->bio = bio;
+ req->nr_bio_segments = bio_segments(req->bio);
+ req->nr_bio_sectors = bio_sectors(req->bio);
+
/*
* may not be valid. if the low level driver said
* it didn't need a bounce buffer then it better
@@ -1600,7 +1643,9 @@
req->nr_hw_segments = bio_hw_segments(q, bio);
req->buffer = bio_data(bio); /* see ->buffer comment above */
req->waiting = NULL;
- req->bio = req->biotail = bio;
+ req->hard_bio = req->bio = req->biotail = bio;
+ req->nr_bio_segments = bio_segments(req->bio);
+ req->nr_bio_sectors = bio_sectors(req->bio);
req->rq_dev = to_kdev_t(bio->bi_bdev->bd_dev);
add_request(q, req, insert_here);
out:
@@ -1749,6 +1794,10 @@
BIO_BUG_ON(!bio->bi_size);
BIO_BUG_ON(!bio->bi_io_vec);
+ BIO_BUG_ON(bio->bi_idx >= bio->bi_vcnt);
+
+ /* This is in now for debugging purposes */
+ BIO_BUG_ON(!bio_consistent(bio));
bio->bi_rw = rw;
@@ -1799,6 +1848,8 @@
bio->bi_vcnt = 1;
bio->bi_idx = 0;
bio->bi_size = bh->b_size;
+ bio->bi_voffset = bio->bi_io_vec[0].bv_offset;
+ bio->bi_endvoffset = bio->bi_voffset + bio->bi_size;
bio->bi_end_io = end_bio_bh_io_sync;
bio->bi_private = bh;
@@ -1921,10 +1972,8 @@
struct bio *bio;
int nr_phys_segs, nr_hw_segs;
- rq->buffer = bio_data(rq->bio);
-
nr_phys_segs = nr_hw_segs = 0;
- rq_for_each_bio(bio, rq) {
+ rq_for_each_unfin_bio(bio, rq) {
/* Force bio hw/phys segs to be recalculated. */
bio->bi_flags &= ~(1 << BIO_SEG_VALID);
@@ -1936,15 +1985,27 @@
rq->nr_hw_segments = nr_hw_segs;
}
-inline void blk_recalc_rq_sectors(struct request *rq, int nsect)
+void blk_recalc_rq_sectors(struct request *rq, int nsect)
{
if (rq->flags & REQ_CMD) {
+
rq->hard_sector += nsect;
- rq->nr_sectors = rq->hard_nr_sectors -= nsect;
- rq->sector = rq->hard_sector;
+ rq->hard_nr_sectors -= nsect;
- rq->current_nr_sectors = bio_iovec(rq->bio)->bv_len >> 9;
- rq->hard_cur_sectors = rq->current_nr_sectors;
+ /* Move the i/o submission pointers ahead if required */
+ /* (i.e. if the driver doesn't update them) */
+ if ((rq->nr_sectors >= rq->hard_nr_sectors) &&
+ (rq->sector <= rq->hard_sector)){
+ rq->sector = rq->hard_sector;
+ rq->nr_sectors = rq->hard_nr_sectors;
+ rq->bio = rq->hard_bio;
+ rq->nr_bio_segments = bio_segments(rq->bio);
+ rq->nr_bio_sectors = bio_sectors(rq->bio);
+ rq->hard_cur_sectors = bio_segsize(rq->bio) >> 9;
+ rq->current_nr_sectors = rq->hard_cur_sectors;
+
+ rq->buffer = bio_data(rq->bio);
+ }
/*
* if total number of sectors is less than the first segment
@@ -1977,27 +2038,35 @@
int nsect, total_nsect;
struct bio *bio;
+
req->errors = 0;
if (!uptodate)
printk("end_request: I/O error, dev %s, sector %lu\n",
kdevname(req->rq_dev), req->sector);
total_nsect = 0;
- while ((bio = req->bio)) {
- nsect = bio_iovec(bio)->bv_len >> 9;
- BIO_BUG_ON(bio_iovec(bio)->bv_len > bio->bi_size);
+ /* our starting point may be in the middle of a segment */
+ while ((bio = req->hard_bio)) {
+
+ /* For debugging - Verify consistency */
+ BIO_BUG_ON(!bio_consistent(bio));
+
+ nsect = bio_segsize(bio) >> 9;
/*
* not a complete bvec done
*/
if (unlikely(nsect > nr_sectors)) {
- int residual = (nsect - nr_sectors) << 9;
- bio->bi_size -= residual;
- bio_iovec(bio)->bv_offset += residual;
- bio_iovec(bio)->bv_len -= residual;
+ bio->bi_size -= nr_sectors << 9;
+ bio->bi_voffset += nr_sectors << 9;
blk_recalc_rq_sectors(req, nr_sectors);
+
+ /*
+ * TBD: Could we just do without recalc segments ?
+ * (or a better way to achieve it)
+ */
blk_recalc_rq_segments(req);
return 1;
}
@@ -2005,28 +2074,34 @@
/*
* account transfer
*/
- bio->bi_size -= bio_iovec(bio)->bv_len;
+ bio->bi_size -= nsect << 9;
bio->bi_idx++;
nr_sectors -= nsect;
total_nsect += nsect;
if (!bio->bi_size) {
- req->bio = bio->bi_next;
-
+ req->hard_bio = bio->bi_next;
bio_endio(bio, uptodate);
-
total_nsect = 0;
+ } else {
+ bio->bi_voffset = bio_iovec(bio)->bv_offset;
}
- if ((bio = req->bio)) {
+ if ((bio = req->hard_bio)) {
+ BIO_BUG_ON(bio_segments(bio) <= 0);
blk_recalc_rq_sectors(req, nsect);
/*
* end more in this run, or just return 'not-done'
*/
if (unlikely(nr_sectors <= 0)) {
- blk_recalc_rq_segments(req);
+ /*
+ *TBD:Could we just do without recalc segments ?
+ * (or a better way to achieve it)
+ */
+ blk_recalc_rq_segments(req);
+
return 1;
}
}
@@ -2043,6 +2118,81 @@
blk_put_request(req);
}
+/*
+ * blk_rq_next_segment
+ * @req: the request being processed
+ *
+ * Points to the next segment in the request if the current segment
+ * is complete. Leaves things unchanged if this segment is not over or
+ * if no more segments are left in this request.
+ *
+ * Meant to be used for bio traversal during i/o submission
+ * Does not affect any i/o completions or update completion state in
+ * the request, and does not modify any bio fields
+ *
+ * Decrementing rq->nr_sectors, rq->current_nr_sectors, and
+ * rq->nr_bio_sectors as data is transferred is the caller's
+ * responsibility and should be done before calling this routine.
+ */
+void blk_rq_next_segment(struct request *rq)
+{
+ if (rq->current_nr_sectors > 0)
+ return;
+
+ if (rq->nr_bio_sectors > 0) {
+ --rq->nr_bio_segments;
+ /* a clone bio could end in the middle of a segment */
+ rq->current_nr_sectors = min(
+ (unsigned long)blk_rq_vec(rq)->bv_len >> 9,
+ rq->nr_bio_sectors);
+ } else {
+ if ((rq->bio = rq->bio->bi_next)) {
+ rq->nr_bio_segments = bio_segments(rq->bio);
+ rq->nr_bio_sectors = bio_sectors(rq->bio);
+ rq->current_nr_sectors = bio_segsize(rq->bio) >> 9;
+ }
+ }
+
+ /* remember the size of this segment before we start i/o */
+ rq->hard_cur_sectors = rq->current_nr_sectors;
+}
+
+/*
+ * process_that_request_first: process partial request submission
+ * @req: the request being processed
+ * @nr_sectors: number of sectors i/o has been submitted on
+ *
+ * Description:
+ * May be used for processing bio's while submitting i/o without
+ * signalling completion. Fails if more data is requested than is
+ * available in the request in which case it doesn't advance any
+ * pointers.
+ *
+ * Assumes a request is correctly set up (no sanity checks).
+ *
+ * Return:
+ * 0 - no more data left to submit (not processed)
+ * 1 - data available to submit for this request (processed)
+ */
+int process_that_request_first(struct request *req, unsigned int nr_sectors)
+{
+ int nsect;
+
+ if (req->nr_sectors < nr_sectors)
+ return 0;
+
+ req->nr_sectors -= nr_sectors;
+ req->sector += nr_sectors;
+ while (nr_sectors) {
+ nsect = min(req->current_nr_sectors, nr_sectors);
+ req->current_nr_sectors -= nsect;
+ req->nr_bio_sectors -= nsect;
+ nr_sectors -= nsect;
+ blk_rq_next_segment(req);
+ }
+ return 1;
+}
+
#define MB(kb) ((kb) << 10)
int __init blk_dev_init(void)
@@ -2095,6 +2245,7 @@
EXPORT_SYMBOL(end_that_request_first);
EXPORT_SYMBOL(end_that_request_last);
+EXPORT_SYMBOL(process_that_request_first);
EXPORT_SYMBOL(blk_init_queue);
EXPORT_SYMBOL(bdev_get_queue);
EXPORT_SYMBOL(blk_cleanup_queue);
diff -ur linux-2.5.30-pure/fs/bio.c linux-2.5.30-bio/fs/bio.c
--- linux-2.5.30-pure/fs/bio.c Sat Jul 27 08:28:39 2002
+++ linux-2.5.30-bio/fs/bio.c Fri Aug 2 10:34:38 2002
@@ -178,6 +178,28 @@
}
}
+/* Perform some sanity checks on the bio vectors, size and offsets */
+int bio_consistent(struct bio *bio)
+{
+ struct bio_vec *bvec;
+ int i;
+ int size = 0;
+
+ /* Verify that the size is consistent with sigma vec len */
+ bio_for_each_segment(bvec, bio, i)
+ size += bvec->bv_len;
+
+ /* Adjust for both ends */
+ size -= bio_startoffset(bio) + bio_endoffset(bio);
+
+ if (size != bio->bi_size)
+ return 0; /* size mismatch */
+
+ /* Place any other checks here */
+
+ return 1;
+}
+
inline int bio_phys_segments(request_queue_t *q, struct bio *bio)
{
if (unlikely(!(bio->bi_flags & (1 << BIO_SEG_VALID))))
@@ -225,6 +247,8 @@
}
bio->bi_size = bio_src->bi_size;
bio->bi_max = bio_src->bi_max;
+ bio->bi_voffset = bio_src->bi_voffset;
+ bio->bi_endvoffset = bio_src->bi_endvoffset;
}
/**
@@ -312,6 +336,9 @@
b->bi_vcnt = bio->bi_vcnt;
b->bi_size = bio->bi_size;
+ b->bi_voffset = bio->bi_voffset;
+ b->bi_endvoffset = bio->bi_endvoffset;
+
return b;
oom:
@@ -424,6 +451,9 @@
}
queue_io:
+ bio->bi_voffset = bio_iovec(bio)->bv_offset;
+ bio->bi_endvoffset = __BVEC_END(bio)->bv_offset +
+ __BVEC_END(bio)->bv_len;
submit_bio(rw, bio);
if (total_nr_pages)
diff -ur linux-2.5.30-pure/include/linux/bio.h linux-2.5.30-bio/include/linux/bio.h
--- linux-2.5.30-pure/include/linux/bio.h Sat Jul 27 08:28:41 2002
+++ linux-2.5.30-bio/include/linux/bio.h Fri Aug 2 11:56:12 2002
@@ -67,7 +67,22 @@
*/
unsigned short bi_vcnt; /* how many bio_vec's */
+
+ /*
+ * Residual section - portion on which i/o hasn't finished as yet
+ * (i/o may already have been submitted and in progress for some
+ * of these segments if this is an active bio)
+ */
unsigned short bi_idx; /* current index into bvl_vec */
+ unsigned short bi_voffset; /* current vec offset -
+ * relative to start of bvec
+ * page
+ */
+ unsigned short bi_endvoffset; /* offset into the last bvec
+ * page that marks the end of
+ * this buffer
+ */
+ unsigned int bi_size; /* residual I/O count */
/* Number of segments in this BIO after
* physical address coalescing is performed.
@@ -79,7 +94,6 @@
*/
unsigned short bi_hw_segments;
- unsigned int bi_size; /* residual I/O count */
unsigned int bi_max; /* max bvl_vecs we can hold,
used as index into pool */
@@ -120,11 +134,13 @@
#define bio_iovec_idx(bio, idx) (&((bio)->bi_io_vec[(idx)]))
#define bio_iovec(bio) bio_iovec_idx((bio), (bio)->bi_idx)
#define bio_page(bio) bio_iovec((bio))->bv_page
-#define bio_offset(bio) bio_iovec((bio))->bv_offset
+#define bio_offset(bio) (bio)->bi_voffset
+#define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_idx)
#define bio_sectors(bio) ((bio)->bi_size >> 9)
#define bio_data(bio) (page_address(bio_page((bio))) + bio_offset((bio)))
#define bio_barrier(bio) ((bio)->bi_rw & (1 << BIO_BARRIER))
+
/*
* will die
*/
@@ -150,16 +166,38 @@
#define __BVEC_START(bio) bio_iovec_idx((bio), 0)
#define BIOVEC_PHYS_MERGEABLE(vec1, vec2) \
((bvec_to_phys((vec1)) + (vec1)->bv_len) == bvec_to_phys((vec2)))
+#define BIOVEC_PHYS_MERGEABLE_PARTIAL(vec1, end, vec2, start) \
+ ((page_to_phys((vec1)->bv_page) + (end)) == \
+ (page_to_phys((vec2)->bv_page) + (start)))
#define BIOVEC_VIRT_MERGEABLE(vec1, vec2) \
((((bvec_to_phys((vec1)) + (vec1)->bv_len) | bvec_to_phys((vec2))) & (BIO_VMERGE_BOUNDARY - 1)) == 0)
+#define BIOVEC_VIRT_MERGEABLE_PARTIAL(vec1, end, vec2, start) \
+ ((((page_to_phys((vec1)->bv_page) + (end)) | (page_to_phys((vec2)->bv_page) + (start))) & (BIO_VMERGE_BOUNDARY - 1)) == 0)
#define __BIO_SEG_BOUNDARY(addr1, addr2, mask) \
(((addr1) | (mask)) == (((addr2) - 1) | (mask)))
#define BIOVEC_SEG_BOUNDARY(q, b1, b2) \
__BIO_SEG_BOUNDARY(bvec_to_phys((b1)), bvec_to_phys((b2)) + (b2)->bv_len, (q)->seg_boundary_mask)
+#define BIOVEC_SEG_BOUNDARY_PARTIAL(q, b1, start, b2, end) \
+ __BIO_SEG_BOUNDARY(page_to_phys((b1)->bv_page) + (start), page_to_phys((b2)->bv_page) + (end), (q)->seg_boundary_mask)
#define BIO_SEG_BOUNDARY(q, b1, b2) \
- BIOVEC_SEG_BOUNDARY((q), __BVEC_END((b1)), __BVEC_START((b2)))
+ BIOVEC_SEG_BOUNDARY_PARTIAL((q), __BVEC_END((b1)),(b1)->bi_voffset, \
+ __BVEC_START((b2)), (b2)->bi_endvoffset)
#define bio_io_error(bio) bio_endio((bio), 0)
+#define bio_startoffset(bio) (bio_offset(bio) - bio_iovec(bio)->bv_offset)
+#define bio_endoffset(bio) (__BVEC_END(bio)->bv_offset + \
+ __BVEC_END(bio)->bv_len - (bio)->bi_endvoffset)
+
+/* adjusts for clones which may start or end in the middle of a segment */
+static inline unsigned int bio_segsize(struct bio *bio)
+{
+ unsigned int len = bio_iovec(bio)->bv_len - bio_startoffset(bio);
+
+ if (len > bio->bi_size)
+ len = bio->bi_size;
+
+ return len;
+}
/*
* drivers should not use the __ version unless they _really_ want to
@@ -203,6 +241,8 @@
extern inline void bio_init(struct bio *);
+extern int bio_consistent(struct bio *bio);
+
#ifdef CONFIG_HIGHMEM
/*
* remember to add offset! and never ever reenable interrupts between a
@@ -211,7 +251,7 @@
* This function MUST be inlined - it plays with the CPU interrupt flags.
* Hence the `extern inline'.
*/
-extern inline char *bio_kmap_irq(struct bio *bio, unsigned long *flags)
+extern inline char *bvec_kmap_irq(struct bio_vec *bvec, unsigned long *flags)
{
unsigned long addr;
@@ -220,22 +260,23 @@
/*
* could be low
*/
- if (!PageHighMem(bio_page(bio)))
- return bio_data(bio);
+ if (!PageHighMem(bvec->bv_page))
+ return page_address(bvec->bv_page) + bvec->bv_offset;
/*
* it's a highmem page
*/
local_irq_disable();
- addr = (unsigned long) kmap_atomic(bio_page(bio), KM_BIO_SRC_IRQ);
+ addr = (unsigned long) kmap_atomic(bvec->bv_page, KM_BIO_SRC_IRQ);
if (addr & ~PAGE_MASK)
BUG();
- return (char *) addr + bio_offset(bio);
+ return (char *) addr + bvec->bv_offset;
}
-extern inline void bio_kunmap_irq(char *buffer, unsigned long *flags)
+
+extern inline void bvec_kunmap_irq(char *buffer, unsigned long *flags)
{
unsigned long ptr = (unsigned long) buffer & PAGE_MASK;
@@ -244,8 +285,19 @@
}
#else
-#define bio_kmap_irq(bio, flags) (bio_data(bio))
-#define bio_kunmap_irq(buf, flags) do { *(flags) = 0; } while (0)
+#define bvec_kmap_irq(bvec, flags) (page_address((bvec)->bv_page) + ((bvec)->bv_offset))
+#define bvec_kunmap_irq(buf, flags) do { *(flags) = 0; } while (0)
#endif
+
+
+/* TBD: Could this be made more efficient ? */
+extern inline char *__bio_kmap_irq(struct bio *bio, unsigned short idx,
+ unsigned long *flags)
+{
+ return bvec_kmap_irq(bio_iovec_idx(bio, idx), flags) +
+ ((idx == bio->bi_idx) ? bio_startoffset(bio): 0);
+}
+
+#define __bio_kunmap_irq bvec_kunmap_irq
#endif /* __LINUX_BIO_H */
diff -ur linux-2.5.30-pure/include/linux/blk.h linux-2.5.30-bio/include/linux/blk.h
--- linux-2.5.30-pure/include/linux/blk.h Sat Jul 27 08:28:41 2002
+++ linux-2.5.30-bio/include/linux/blk.h Fri Aug 2 10:34:38 2002
@@ -40,6 +40,7 @@
extern int end_that_request_first(struct request *, int, int);
extern void end_that_request_last(struct request *);
+extern int process_that_request_first(struct request *, unsigned int);
struct request *elv_next_request(request_queue_t *q);
static inline void blkdev_dequeue_request(struct request *req)
@@ -50,11 +51,14 @@
elv_remove_request(req->q, req);
}
+
+
#define _elv_add_request_core(q, rq, where, plug) \
do { \
if ((plug)) \
blk_plug_device((q)); \
(q)->elevator.elevator_add_req_fn((q), (rq), (where)); \
+ if ((rq)->bio) BIO_BUG_ON(!bio_consistent((rq)->bio)); \
} while (0)
#define _elv_add_request(q, rq, back, p) do { \
diff -ur linux-2.5.30-pure/include/linux/blkdev.h linux-2.5.30-bio/include/linux/blkdev.h
--- linux-2.5.30-pure/include/linux/blkdev.h Sat Jul 27 08:28:32 2002
+++ linux-2.5.30-bio/include/linux/blkdev.h Fri Aug 2 10:34:38 2002
@@ -10,6 +10,7 @@
#include <linux/backing-dev.h>
#include <asm/scatterlist.h>
+#include <linux/bio.h>
struct request_queue;
typedef struct request_queue request_queue_t;
@@ -35,13 +36,13 @@
int rq_status; /* should split this into a few status bits */
kdev_t rq_dev;
int errors;
- sector_t sector;
- unsigned long nr_sectors;
- unsigned long hard_sector; /* the hard_* are block layer
- * internals, no driver should
- * touch them
- */
- unsigned long hard_nr_sectors;
+ sector_t sector; /* next sector to submit */
+ unsigned long nr_sectors; /* no of sectors left to submit */
+
+ /* the hard_* are block layer internals, no driver should
+ touch them */
+ unsigned long hard_sector; /* next sector to complete */
+ unsigned long hard_nr_sectors; /* no. of sectors left to complete */
/* Number of scatter-gather DMA addr+len pairs after
* physical address coalescing is performed.
@@ -55,13 +56,26 @@
*/
unsigned short nr_hw_segments;
+ /* Maintain bio traversal state for part by part io submission */
+ /* "current" refers to an element currently being submitted for io */
+
+ /* no. of segments left to submit in the current bio */
+ unsigned short nr_bio_segments;
+ /* no. of sectors left to submit in the current bio */
+ unsigned long nr_bio_sectors;
+ /* no. of sectors left to submit in the current segment */
unsigned int current_nr_sectors;
+ /* no. of sectors left to complete in the current segment */
unsigned int hard_cur_sectors;
+
+ struct bio *bio; /* next bio to submit */
+ struct bio *hard_bio; /* next unfinished bio to complete */
+ struct bio *biotail;
+
int tag;
void *special;
char *buffer;
struct completion *waiting;
- struct bio *bio, *biotail;
request_queue_t *q;
struct request_list *rl;
};
@@ -232,6 +246,33 @@
*/
#define blk_queue_headactive(q, head_active)
+
+/*
+ * temporarily mapping a (possibly) highmem bio, typically for PIO transfer
+ */
+
+/* current offset with respect to start of the segment being submitted */
+#define blk_rq_offset(rq) (((rq)->hard_cur_sectors - (rq)->current_nr_sectors) << 9)
+
+/* current index into bio being processed for submission */
+#define blk_rq_idx(rq) ((rq)->bio->bi_vcnt - (rq)->nr_bio_segments)
+
+/* current vector being processed */
+#define blk_rq_vec(rq) (bio_iovec_idx((rq)->bio, blk_rq_idx(rq)))
+
+/* Assumes rq->bio != NULL */
+static inline char *rq_map_buffer(struct request *rq, unsigned long *flags)
+{
+ return (__bio_kmap_irq(rq->bio, blk_rq_idx(rq), flags)
+ + blk_rq_offset(rq));
+}
+
+static inline void rq_unmap_buffer(char *buffer, unsigned long *flags)
+{
+ __bio_kunmap_irq(buffer, flags);
+}
+
+
extern unsigned long blk_max_low_pfn, blk_max_pfn;
/*
@@ -251,6 +292,10 @@
#define rq_for_each_bio(bio, rq) \
if ((rq->bio)) \
for (bio = (rq)->bio; bio; bio = bio->bi_next)
+
+#define rq_for_each_unfin_bio(bio, rq) \
+ if ((rq->hard_bio)) \
+ for (bio = (rq)->hard_bio; bio; bio = bio->bi_next)
struct blk_dev_struct {
/*
diff -ur linux-2.5.30-pure/mm/highmem.c linux-2.5.30-bio/mm/highmem.c
--- linux-2.5.30-pure/mm/highmem.c Sat Jul 27 08:28:27 2002
+++ linux-2.5.30-bio/mm/highmem.c Fri Aug 2 10:34:38 2002
@@ -432,8 +432,10 @@
bio->bi_rw = (*bio_orig)->bi_rw;
bio->bi_vcnt = (*bio_orig)->bi_vcnt;
- bio->bi_idx = 0;
+ bio->bi_idx = (*bio_orig)->bi_idx;
bio->bi_size = (*bio_orig)->bi_size;
+ bio->bi_voffset = (*bio_orig)->bi_voffset;
+ bio->bi_endvoffset = (*bio_orig)->bi_endvoffset;
if (pool == page_pool) {
if (rw & WRITE)
@@ -446,6 +448,8 @@
else
bio->bi_end_io = bounce_end_io_read_isa;
}
+
+ BIO_BUG_ON(!bio_consistent(bio)); /* debug only */
bio->bi_private = *bio_orig;
*bio_orig = bio;
* Re: [PATCH] Bio Traversal Changes (Patch 2/4: biotr8-blkusers.diff)
2002-08-02 12:35 [PATCH] Bio Traversal Changes Suparna Bhattacharya
2002-08-02 12:43 ` [PATCH] Bio Traversal Changes (Patch 1/4: biotr8-blk.diff) Suparna Bhattacharya
@ 2002-08-02 12:46 ` Suparna Bhattacharya
2002-08-02 13:17 ` [PATCH] Bio Traversal Changes - (Patch 3/4 : biotr8-blkdrivers.diff) Suparna Bhattacharya
` (2 subsequent siblings)
4 siblings, 0 replies; 9+ messages in thread
From: Suparna Bhattacharya @ 2002-08-02 12:46 UTC (permalink / raw)
To: linux-kernel, linux-scsi, axboe
Corresponding modifications needed in the code above the
block layer to account for the bio traversal changes, mainly
ensuring correct bi_voffset/bi_endvoffset initialization when
setting up bios.
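The initialization pattern repeated in each hunk below can be illustrated with a small userspace model. This is a sketch only: the struct layouts are abbreviated stand-ins for the 2.5.30 definitions, and bio_init_voffsets is a hypothetical helper name (the patch open-codes the two assignments at each setup site via __BVEC_START/__BVEC_END).

```c
/* Abbreviated userspace stand-ins, not the kernel layouts */
struct bio_vec {
	unsigned int bv_len;	/* bytes in this segment */
	unsigned int bv_offset;	/* byte offset within the page */
};

struct bio {
	unsigned short bi_vcnt;		/* number of bio_vecs */
	unsigned short bi_voffset;	/* where the bio really starts */
	unsigned short bi_endvoffset;	/* where the bio really ends */
	struct bio_vec *bi_io_vec;
};

/*
 * For a freshly built (unsplit) bio the "real" start is simply the
 * first vec's bv_offset, and the "real" end is the last vec's
 * bv_offset + bv_len -- the two assignments each hunk below adds.
 */
static void bio_init_voffsets(struct bio *bio)
{
	struct bio_vec *first = &bio->bi_io_vec[0];
	struct bio_vec *last = &bio->bi_io_vec[bio->bi_vcnt - 1];

	bio->bi_voffset = first->bv_offset;
	bio->bi_endvoffset = last->bv_offset + last->bv_len;
}

/* Two-segment bio: 512 bytes at page offset 1024, then 4096 at offset 0 */
static struct bio_vec demo_vecs[2] = {
	{ .bv_len = 512, .bv_offset = 1024 },
	{ .bv_len = 4096, .bv_offset = 0 },
};
static struct bio demo_bio = { .bi_vcnt = 2, .bi_io_vec = demo_vecs };

static unsigned int demo_voffset(void)
{
	bio_init_voffsets(&demo_bio);
	return demo_bio.bi_voffset;
}

static unsigned int demo_endvoffset(void)
{
	bio_init_voffsets(&demo_bio);
	return demo_bio.bi_endvoffset;
}
```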
diff -ur linux-2.5.30-pure/fs/direct-io.c linux-2.5.30-bio/fs/direct-io.c
--- linux-2.5.30-pure/fs/direct-io.c Fri Aug 2 10:08:29 2002
+++ linux-2.5.30-bio/fs/direct-io.c Fri Aug 2 10:42:13 2002
@@ -193,6 +193,9 @@
bio->bi_vcnt = bio->bi_idx;
bio->bi_idx = 0;
+ bio->bi_voffset = __BVEC_START(bio)->bv_offset;
+ bio->bi_endvoffset = __BVEC_END(bio)->bv_offset +
+ __BVEC_END(bio)->bv_len;
bio->bi_private = dio;
atomic_inc(&dio->bio_count);
submit_bio(dio->rw, bio);
diff -ur linux-2.5.30-pure/fs/jfs/jfs_logmgr.c linux-2.5.30-bio/fs/jfs/jfs_logmgr.c
--- linux-2.5.30-pure/fs/jfs/jfs_logmgr.c Sat Jul 27 08:28:38 2002
+++ linux-2.5.30-bio/fs/jfs/jfs_logmgr.c Fri Aug 2 10:42:13 2002
@@ -1817,6 +1817,9 @@
bio->bi_vcnt = 1;
bio->bi_idx = 0;
bio->bi_size = LOGPSIZE;
+ bio->bi_voffset = __BVEC_START(bio)->bv_offset;
+ bio->bi_endvoffset = __BVEC_END(bio)->bv_offset +
+ __BVEC_END(bio)->bv_len;
bio->bi_end_io = lbmIODone;
bio->bi_private = bp;
@@ -1959,6 +1962,9 @@
bio->bi_vcnt = 1;
bio->bi_idx = 0;
bio->bi_size = LOGPSIZE;
+ bio->bi_voffset = __BVEC_START(bio)->bv_offset;
+ bio->bi_endvoffset = __BVEC_END(bio)->bv_offset +
+ __BVEC_END(bio)->bv_len;
bio->bi_end_io = lbmIODone;
bio->bi_private = bp;
diff -ur linux-2.5.30-pure/fs/mpage.c linux-2.5.30-bio/fs/mpage.c
--- linux-2.5.30-pure/fs/mpage.c Sat Jul 27 08:28:32 2002
+++ linux-2.5.30-bio/fs/mpage.c Fri Aug 2 10:42:13 2002
@@ -82,6 +82,9 @@
{
bio->bi_vcnt = bio->bi_idx;
bio->bi_idx = 0;
+ bio->bi_voffset = __BVEC_START(bio)->bv_offset;
+ bio->bi_endvoffset = __BVEC_END(bio)->bv_offset +
+ __BVEC_END(bio)->bv_len;
bio->bi_end_io = mpage_end_io_read;
if (rw == WRITE)
bio->bi_end_io = mpage_end_io_write;
diff -ur linux-2.5.30-pure/mm/page_io.c linux-2.5.30-bio/mm/page_io.c
--- linux-2.5.30-pure/mm/page_io.c Fri Aug 2 10:08:31 2002
+++ linux-2.5.30-bio/mm/page_io.c Fri Aug 2 10:42:13 2002
@@ -42,6 +42,9 @@
bio->bi_vcnt = 1;
bio->bi_idx = 0;
bio->bi_size = PAGE_SIZE;
+ bio->bi_voffset = __BVEC_START(bio)->bv_offset;
+ bio->bi_endvoffset = __BVEC_END(bio)->bv_offset +
+ __BVEC_END(bio)->bv_len;
bio->bi_end_io = end_io;
}
return bio;
* Re: [PATCH] Bio Traversal Changes - (Patch 3/4 : biotr8-blkdrivers.diff)
2002-08-02 12:35 [PATCH] Bio Traversal Changes Suparna Bhattacharya
2002-08-02 12:43 ` [PATCH] Bio Traversal Changes (Patch 1/4: biotr8-blk.diff) Suparna Bhattacharya
2002-08-02 12:46 ` [PATCH] Bio Traversal Changes (Patch 2/4: biotr8-blkusers.diff) Suparna Bhattacharya
@ 2002-08-02 13:17 ` Suparna Bhattacharya
2002-08-02 13:20 ` [PATCH] Bio Traversal Changes (Patch 4/4: biotr8-doc.diff) Suparna Bhattacharya
2002-08-02 13:48 ` [PATCH] Bio Traversal Changes James Bottomley
4 siblings, 0 replies; 9+ messages in thread
From: Suparna Bhattacharya @ 2002-08-02 13:17 UTC (permalink / raw)
To: linux-kernel, linux-scsi, axboe
Modifications (in part) to some drivers to account for bio
traversal changes.
A few considerations:
- Drivers which traverse segments directly (rather than
use helpers like blk_rq_map_sg, or handle and complete
one segment at a time using end_that_request_first)
would need to account for bi_voffset and bi_endvoffset.
- Preferably use rq_map_buffer() to map the current
segment to a virtual address if needed, or do something
similar to account for the correct offsets. bio_kmap_irq
is gone now (notice that the start of the bio may no longer
be the start of the next portion to submit, which is
the right one to map during request processing).
- Use bio_segsize() to find out the length of the
first bio segment rather than relying directly on the bv_len
field, since the bio could contain just a part of the vec.
- In general, remember that bio_startoffset() and
bio_endoffset() could be non-zero.
- Drivers which create/setup bios themselves would
need to ensure correct initialization of bi_voffset
and bi_endvoffset.
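To make the first few points concrete, here is a userspace model of how bio_startoffset(), bio_endoffset() and bio_segsize() plausibly relate to bi_voffset/bi_endvoffset, based on the semantics described in this posting; the struct layouts are abbreviated stand-ins, not the kernel definitions.

```c
/* Abbreviated userspace stand-ins, not the kernel layouts */
struct bio_vec {
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct bio {
	unsigned short bi_idx;		/* current index into bi_io_vec */
	unsigned short bi_vcnt;
	unsigned short bi_voffset;
	unsigned short bi_endvoffset;
	struct bio_vec *bi_io_vec;
};

/* how far into the current vec the bio really starts */
static unsigned int bio_startoffset(const struct bio *bio)
{
	return bio->bi_voffset - bio->bi_io_vec[bio->bi_idx].bv_offset;
}

/* how far before the last vec's end the bio really stops */
static unsigned int bio_endoffset(const struct bio *bio)
{
	const struct bio_vec *last = &bio->bi_io_vec[bio->bi_vcnt - 1];

	return last->bv_offset + last->bv_len - bio->bi_endvoffset;
}

/*
 * Usable length of the current segment: bv_len adjusted for the split
 * offsets -- which is exactly why drivers must not read bv_len directly.
 */
static unsigned int bio_segsize(const struct bio *bio)
{
	unsigned int len = bio->bi_io_vec[bio->bi_idx].bv_len
				- bio_startoffset(bio);

	if (bio->bi_idx == bio->bi_vcnt - 1)	/* also the last segment */
		len -= bio_endoffset(bio);
	return len;
}

/* A clone covering bytes 512..3583 of a single 4096-byte segment */
static struct bio_vec split_vec = { .bv_len = 4096, .bv_offset = 0 };
static struct bio split_bio = {
	.bi_idx = 0, .bi_vcnt = 1,
	.bi_voffset = 512, .bi_endvoffset = 3584,
	.bi_io_vec = &split_vec,
};

static unsigned int demo_startoffset(void) { return bio_startoffset(&split_bio); }
static unsigned int demo_endoffset(void) { return bio_endoffset(&split_bio); }
static unsigned int demo_segsize(void) { return bio_segsize(&split_bio); }
```

Note how bio_segsize() yields 3072 here even though bv_len is 4096: the clone owns only the middle of the shared segment.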
General:
Since some of the block layer helper routines depend
on nr_sectors, current_nr_sectors, hard_cur_sectors,
nr_bio_segments and nr_bio_sectors, a little care is
needed to avoid inconsistencies amongst the various counts
and pointers at any point. A good option is to make use
of process_that_request_first where applicable/suitable
and have it take care of ensuring this, rather than trying
to do the same by hand.
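The submission-state arithmetic these helpers rely on is small enough to model directly. The functions below re-express the blk_rq_idx()/blk_rq_offset() macros from patch 1/4 over abbreviated userspace structs (a sketch, not the kernel definitions).

```c
/* Abbreviated userspace stand-ins, not the kernel layouts */
struct bio {
	unsigned short bi_vcnt;
};

struct request {
	unsigned short nr_bio_segments;		/* segments left to submit in bio */
	unsigned int current_nr_sectors;	/* sectors left to submit, curr seg */
	unsigned int hard_cur_sectors;		/* sectors left to complete, curr seg */
	struct bio *bio;			/* next bio to submit */
};

/* index of the segment currently under submission */
static unsigned int blk_rq_idx(const struct request *rq)
{
	return rq->bio->bi_vcnt - rq->nr_bio_segments;
}

/* byte offset into that segment, derived from the completion-side count */
static unsigned long blk_rq_offset(const struct request *rq)
{
	return (unsigned long)(rq->hard_cur_sectors
				- rq->current_nr_sectors) << 9;
}

/* 4-vec bio: two segments fully submitted, and 5 of the current
 * segment's 8 sectors submitted (3 left), none completed yet */
static struct bio demo_bio = { .bi_vcnt = 4 };
static struct request demo_rq = {
	.nr_bio_segments = 2,
	.current_nr_sectors = 3,
	.hard_cur_sectors = 8,
	.bio = &demo_bio,
};

static unsigned int demo_idx(void) { return blk_rq_idx(&demo_rq); }
static unsigned long demo_offset(void) { return blk_rq_offset(&demo_rq); }
```

This is why rq_map_buffer() can locate the right buffer purely from the request: the index and offset fall out of the counts, provided they are kept consistent.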
As mentioned earlier, some temporary BUG_ON sanity checks
have been inserted at places where the drivers didn't
appear to handle arbitrary bios (i.e. with non-zero
bio_startoffset/bio_endoffset).
No changes have been made to LVM for now. Eventually things like
LVM/EVMS would be the generators of clone bios of the type
allowed by this change.
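As a hypothetical illustration of what such an LVM/EVMS-style generator might do with this infrastructure, the sketch below splits a single-segment bio mid-segment using only the new bookkeeping fields. bio_split_mid_segment is an invented name, not part of this patch set, and the structs are abbreviated userspace stand-ins.

```c
/* Abbreviated userspace stand-ins, not the kernel layouts */
struct bio_vec {
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct bio {
	unsigned long bi_sector;
	unsigned int bi_size;
	unsigned short bi_idx, bi_vcnt;
	unsigned short bi_voffset, bi_endvoffset;
	struct bio_vec *bi_io_vec;	/* shared with clones, never modified */
};

/*
 * Split a single-segment bio at 'bytes' into two clones that share the
 * original bio_vec: only the bi_voffset/bi_endvoffset bookkeeping and
 * the size/sector fields differ, so nothing in the vec is copied.
 */
static void bio_split_mid_segment(const struct bio *orig, unsigned int bytes,
				  struct bio *front, struct bio *back)
{
	*front = *orig;
	*back = *orig;

	front->bi_size = bytes;
	front->bi_endvoffset = orig->bi_voffset + bytes;

	back->bi_size = orig->bi_size - bytes;
	back->bi_voffset = orig->bi_voffset + bytes;
	back->bi_sector = orig->bi_sector + (bytes >> 9);
}

/* One 8K segment starting at sector 100, split down the middle */
static struct bio_vec shared_vec = { .bv_len = 8192, .bv_offset = 0 };
static struct bio orig_bio = {
	.bi_sector = 100, .bi_size = 8192,
	.bi_idx = 0, .bi_vcnt = 1,
	.bi_voffset = 0, .bi_endvoffset = 8192,
	.bi_io_vec = &shared_vec,
};

static unsigned int demo_front_end(void)
{
	struct bio front, back;

	bio_split_mid_segment(&orig_bio, 4096, &front, &back);
	return front.bi_endvoffset;
}

static unsigned long demo_back_sector(void)
{
	struct bio front, back;

	bio_split_mid_segment(&orig_bio, 4096, &front, &back);
	return back.bi_sector;
}
```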
diff -ur linux-2.5.30-pure/drivers/block/floppy.c linux-2.5.30-bio/drivers/block/floppy.c
--- linux-2.5.30-pure/drivers/block/floppy.c Fri Aug 2 10:08:27 2002
+++ linux-2.5.30-bio/drivers/block/floppy.c Fri Aug 2 10:43:30 2002
@@ -2472,6 +2472,9 @@
size = 0;
rq_for_each_bio(bio, CURRENT) {
+ /* Can't handle arbitrary split bio pieces */
+ BIO_BUG_ON(bio_startoffset(bio) != 0);
+ BIO_BUG_ON(bio_endoffset(bio) != 0);
bio_for_each_segment(bv, bio, i) {
if (page_address(bv->bv_page) + bv->bv_offset != base + size)
break;
@@ -2539,6 +2542,9 @@
size = CURRENT->current_nr_sectors << 9;
rq_for_each_bio(bio, CURRENT) {
+ /* Can't handle arbitrary bio pieces as yet */
+ BIO_BUG_ON(bio_startoffset(bio) != 0);
+ BIO_BUG_ON(bio_endoffset(bio) != 0);
bio_for_each_segment(bv, bio, i) {
if (!remaining)
break;
@@ -3886,6 +3892,9 @@
bio.bi_vcnt = 1;
bio.bi_idx = 0;
bio.bi_size = size;
+ bio.bi_voffset = __BVEC_START(&bio)->bv_offset;
+ bio.bi_endvoffset = __BVEC_END(&bio)->bv_offset +
+ __BVEC_END(&bio)->bv_len;
bio.bi_bdev = bdev;
bio.bi_sector = 0;
init_completion(&complete);
diff -ur linux-2.5.30-pure/drivers/block/loop.c linux-2.5.30-bio/drivers/block/loop.c
--- linux-2.5.30-pure/drivers/block/loop.c Fri Aug 2 10:08:27 2002
+++ linux-2.5.30-bio/drivers/block/loop.c Fri Aug 2 10:43:30 2002
@@ -179,7 +179,8 @@
}
static int
-do_lo_send(struct loop_device *lo, struct bio_vec *bvec, int bsize, loff_t pos)
+do_lo_send(struct loop_device *lo, struct bio_vec *bvec, int bsize, loff_t pos,
+ int startoff, int endoff)
{
struct file *file = lo->lo_backing_file; /* kudos to NFsckingS */
struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
@@ -194,8 +195,8 @@
down(&mapping->host->i_sem);
index = pos >> PAGE_CACHE_SHIFT;
offset = pos & (PAGE_CACHE_SIZE - 1);
- data = kmap(bvec->bv_page) + bvec->bv_offset;
- len = bvec->bv_len;
+ data = kmap(bvec->bv_page) + bvec->bv_offset + startoff;
+ len = bvec->bv_len - startoff - endoff;
while (len > 0) {
int IV = index * (PAGE_CACHE_SIZE/bsize) + offset/bsize;
int transfer_result;
@@ -251,14 +252,19 @@
{
unsigned vecnr;
int ret = 0;
+ int startoff = bio_startoffset(bio), endoff = 0;
for (vecnr = 0; vecnr < bio->bi_vcnt; vecnr++) {
struct bio_vec *bvec = &bio->bi_io_vec[vecnr];
+ /* FIXME: Could be more efficient */
+ if (vecnr == bio->bi_vcnt - 1)
+ endoff = bio_endoffset(bio);
- ret = do_lo_send(lo, bvec, bsize, pos);
+ ret = do_lo_send(lo, bvec, bsize, pos, startoff, endoff);
if (ret < 0)
break;
- pos += bvec->bv_len;
+ pos += bvec->bv_len - startoff - endoff;
+ startoff = 0;
}
return ret;
}
@@ -296,17 +302,17 @@
static int
do_lo_receive(struct loop_device *lo,
- struct bio_vec *bvec, int bsize, loff_t pos)
+ struct bio_vec *bvec, int bsize, loff_t pos, int startoff, int endoff)
{
struct lo_read_data cookie;
read_descriptor_t desc;
struct file *file;
cookie.lo = lo;
- cookie.data = kmap(bvec->bv_page) + bvec->bv_offset;
+ cookie.data = kmap(bvec->bv_page) + bvec->bv_offset + startoff;
cookie.bsize = bsize;
desc.written = 0;
- desc.count = bvec->bv_len;
+ desc.count = bvec->bv_len - startoff - endoff;
desc.buf = (char*)&cookie;
desc.error = 0;
spin_lock_irq(&lo->lo_lock);
@@ -322,14 +328,20 @@
{
unsigned vecnr;
int ret = 0;
+ int startoff = bio_startoffset(bio), endoff = 0;
for (vecnr = 0; vecnr < bio->bi_vcnt; vecnr++) {
struct bio_vec *bvec = &bio->bi_io_vec[vecnr];
- ret = do_lo_receive(lo, bvec, bsize, pos);
+ /* FIXME: Could be more efficient */
+ if (vecnr == bio->bi_vcnt - 1)
+ endoff = bio_endoffset(bio);
+
+ ret = do_lo_receive(lo, bvec, bsize, pos, startoff, endoff);
if (ret < 0)
break;
- pos += bvec->bv_len;
+ pos += bvec->bv_len - startoff - endoff;
+ startoff = 0;
}
return ret;
}
@@ -477,18 +489,29 @@
struct bio_vec *from_bvec, *to_bvec;
char *vto, *vfrom;
int ret = 0, i;
+ int from_start, from_len, to_start;
+ from_start = bio_offset(from_bio);
+ to_start = bio_offset(to_bio);
__bio_for_each_segment(from_bvec, from_bio, i, 0) {
to_bvec = &to_bio->bi_io_vec[i];
+ from_len = from_bvec->bv_len;
+
+ /* FIXME: Could be more efficient */
+ if (i == from_bio->bi_vcnt - 1)
+ from_len -= bio_endoffset(from_bio);
kmap(from_bvec->bv_page);
kmap(to_bvec->bv_page);
- vfrom = page_address(from_bvec->bv_page) + from_bvec->bv_offset;
- vto = page_address(to_bvec->bv_page) + to_bvec->bv_offset;
+ vfrom = page_address(from_bvec->bv_page) + from_bvec->bv_offset
+ + from_start;
+ vto = page_address(to_bvec->bv_page) + to_bvec->bv_offset +
+ to_start;
ret |= lo_do_transfer(lo, bio_data_dir(to_bio), vto, vfrom,
- from_bvec->bv_len, IV);
+ from_len, IV);
kunmap(from_bvec->bv_page);
kunmap(to_bvec->bv_page);
+ from_start = to_start = 0;
}
return ret;
diff -ur linux-2.5.30-pure/drivers/block/nbd.c linux-2.5.30-bio/drivers/block/nbd.c
--- linux-2.5.30-pure/drivers/block/nbd.c Sat Jul 27 08:28:32 2002
+++ linux-2.5.30-bio/drivers/block/nbd.c Fri Aug 2 10:43:30 2002
@@ -180,6 +180,9 @@
* whether to set MSG_MORE or not...
*/
rq_for_each_bio(bio, req) {
+ /* Can't handle arbitrary bio pieces yet */
+ BIO_BUG_ON(bio_startoffset(bio) != 0);
+ BIO_BUG_ON(bio_endoffset(bio) != 0);
struct bio_vec *bvec;
bio_for_each_segment(bvec, bio, i) {
flags = 0;
diff -ur linux-2.5.30-pure/drivers/block/rd.c linux-2.5.30-bio/drivers/block/rd.c
--- linux-2.5.30-pure/drivers/block/rd.c Fri Aug 2 10:08:27 2002
+++ linux-2.5.30-bio/drivers/block/rd.c Fri Aug 2 10:43:30 2002
@@ -227,6 +227,9 @@
sector = bio->bi_sector;
rw = bio_data_dir(bio);
+ /* Can't handle a bio split in the middle of a segment */
+ BIO_BUG_ON(bio_startoffset(bio) > 0);
+ BIO_BUG_ON(bio_endoffset(bio) > 0);
bio_for_each_segment(bvec, bio, i) {
ret |= rd_blkdev_pagecache_IO(rw, bvec, sector, minor);
sector += bvec->bv_len >> 9;
diff -ur linux-2.5.30-pure/drivers/block/umem.c linux-2.5.30-bio/drivers/block/umem.c
--- linux-2.5.30-pure/drivers/block/umem.c Fri Aug 2 10:08:27 2002
+++ linux-2.5.30-bio/drivers/block/umem.c Fri Aug 2 10:43:30 2002
@@ -423,7 +423,7 @@
if (card->mm_pages[card->Ready].cnt >= DESC_PER_PAGE)
return 0;
- len = bio_iovec(bio)->bv_len;
+ len = bio_segsize(bio);
dma_handle = pci_map_page(card->dev,
bio_page(bio),
bio_offset(bio),
diff -ur linux-2.5.30-pure/drivers/ide/ide-disk.c linux-2.5.30-bio/drivers/ide/ide-disk.c
--- linux-2.5.30-pure/drivers/ide/ide-disk.c Fri Aug 2 10:08:28 2002
+++ linux-2.5.30-bio/drivers/ide/ide-disk.c Fri Aug 2 10:43:30 2002
@@ -45,7 +45,7 @@
static inline char *ide_map_rq(struct request *rq, unsigned long *flags)
{
if (rq->bio)
- return bio_kmap_irq(rq->bio, flags) + ide_rq_offset(rq);
+ return rq_map_buffer(rq, flags);
else
return rq->buffer + ((rq)->nr_sectors - (rq)->current_nr_sectors) * SECTOR_SIZE;
}
@@ -54,7 +54,7 @@
unsigned long *flags)
{
if (rq->bio)
- bio_kunmap_irq(to, flags);
+ rq_unmap_buffer(to, flags);
}
/*
@@ -293,7 +293,7 @@
nsect = mcount;
mcount -= nsect;
- buf = bio_kmap_irq(rq->bio, &flags) + ide_rq_offset(rq);
+ buf = ide_map_rq(rq, &flags);
rq->sector += nsect;
rq->nr_sectors -= nsect;
rq->current_nr_sectors -= nsect;
@@ -318,7 +318,7 @@
* last transfer.
*/
ata_write(drive, buf, nsect * SECTOR_WORDS);
- bio_kunmap_irq(buf, &flags);
+ ide_unmap_rq(rq, buf, &flags);
} while (mcount);
ret = ATA_OP_CONTINUES;
diff -ur linux-2.5.30-pure/drivers/ide/ide.c linux-2.5.30-bio/drivers/ide/ide.c
--- linux-2.5.30-pure/drivers/ide/ide.c Fri Aug 2 10:08:28 2002
+++ linux-2.5.30-bio/drivers/ide/ide.c Fri Aug 2 10:43:30 2002
@@ -795,7 +795,9 @@
rq->errors = 0;
if (rq->bio) {
rq->sector = rq->bio->bi_sector;
- rq->current_nr_sectors = bio_iovec(rq->bio)->bv_len >> 9;
+ rq->current_nr_sectors = bio_segsize(rq->bio)
+ >> 9;
+
rq->buffer = NULL;
}
ret = ATA_OP_FINISHED;
diff -ur linux-2.5.30-pure/drivers/ide/pdc4030.c linux-2.5.30-bio/drivers/ide/pdc4030.c
--- linux-2.5.30-pure/drivers/ide/pdc4030.c Sat Jul 27 08:28:32 2002
+++ linux-2.5.30-bio/drivers/ide/pdc4030.c Fri Aug 2 10:43:30 2002
@@ -400,14 +400,14 @@
if (nsect > sectors_avail)
nsect = sectors_avail;
sectors_avail -= nsect;
- to = bio_kmap_irq(rq->bio, &flags) + ide_rq_offset(rq);
+ to = ide_map_rq(rq, &flags);
promise_read(drive, to, nsect * SECTOR_WORDS);
#ifdef DEBUG_READ
printk(KERN_DEBUG "%s: promise_read: sectors(%ld-%ld), "
"buf=0x%08lx, rem=%ld\n", drive->name, rq->sector,
rq->sector+nsect-1, (unsigned long) to, rq->nr_sectors-nsect);
#endif
- bio_kunmap_irq(to, &flags);
+ ide_unmap_rq(rq, to, &flags);
rq->sector += nsect;
rq->errors = 0;
rq->nr_sectors -= nsect;
diff -ur linux-2.5.30-pure/drivers/md/raid1.c linux-2.5.30-bio/drivers/md/raid1.c
--- linux-2.5.30-pure/drivers/md/raid1.c Sat Jul 27 08:28:24 2002
+++ linux-2.5.30-bio/drivers/md/raid1.c Fri Aug 2 10:43:30 2002
@@ -90,6 +90,9 @@
bio->bi_vcnt = RESYNC_PAGES;
bio->bi_idx = 0;
bio->bi_size = RESYNC_BLOCK_SIZE;
+ bio->bi_voffset = __BVEC_START(bio)->bv_offset;
+ bio->bi_endvoffset = __BVEC_END(bio)->bv_offset +
+ __BVEC_END(bio)->bv_len;
bio->bi_end_io = NULL;
atomic_set(&bio->bi_cnt, 1);
diff -ur linux-2.5.30-pure/drivers/md/raid5.c linux-2.5.30-bio/drivers/md/raid5.c
--- linux-2.5.30-pure/drivers/md/raid5.c Sat Jul 27 08:28:31 2002
+++ linux-2.5.30-bio/drivers/md/raid5.c Fri Aug 2 10:43:30 2002
@@ -429,6 +429,8 @@
dev->vec.bv_page = dev->page;
dev->vec.bv_len = STRIPE_SIZE;
dev->vec.bv_offset = 0;
+ dev->req.bi_voffset = 0;
+ dev->req.bi_endvoffset = STRIPE_SIZE;
dev->req.bi_bdev = conf->disks[i].bdev;
dev->req.bi_sector = sh->sector;
@@ -615,6 +617,11 @@
for (;bio && bio->bi_sector < sector+STRIPE_SECTORS;
bio = bio->bi_next) {
int page_offset;
+
+ /* Can't handle arbitrary bio pieces yet */
+ BIO_BUG_ON(bio_startoffset(bio) != 0);
+ BIO_BUG_ON(bio_endoffset(bio) != 0);
+
if (bio->bi_sector >= sector)
page_offset = (signed)(bio->bi_sector - sector) * 512;
else
diff -ur linux-2.5.30-pure/drivers/scsi/ide-scsi.c linux-2.5.30-bio/drivers/scsi/ide-scsi.c
--- linux-2.5.30-pure/drivers/scsi/ide-scsi.c Fri Aug 2 10:08:28 2002
+++ linux-2.5.30-bio/drivers/scsi/ide-scsi.c Fri Aug 2 10:43:30 2002
@@ -592,8 +592,10 @@
while (segments--) {
bh->bi_io_vec[0].bv_page = sg->page;
bh->bi_io_vec[0].bv_len = sg->length;
- bh->bi_io_vec[0].bv_offset = sg->offset;
+ bh->bi_voffset = bh->bi_io_vec[0].bv_offset = sg->offset;
bh->bi_size = sg->length;
+ bh->bi_endvoffset = bh->bi_io_vec[0].bv_offset +
+ bh->bi_io_vec[0].bv_len;
bh = bh->bi_next;
sg++;
}
@@ -605,8 +607,10 @@
#endif
bh->bi_io_vec[0].bv_page = virt_to_page(pc->s.scsi_cmd->request_buffer);
bh->bi_io_vec[0].bv_len = pc->request_transfer;
- bh->bi_io_vec[0].bv_offset = (unsigned long) pc->s.scsi_cmd->request_buffer & ~PAGE_MASK;
+ bh->bi_voffset = bh->bi_io_vec[0].bv_offset = (unsigned long) pc->s.scsi_cmd->request_buffer & ~PAGE_MASK;
bh->bi_size = pc->request_transfer;
+ bh->bi_endvoffset = bh->bi_io_vec[0].bv_offset +
+ bh->bi_io_vec[0].bv_len;
}
return first_bh;
}
diff -ur linux-2.5.30-pure/drivers/scsi/scsi_lib.c linux-2.5.30-bio/drivers/scsi/scsi_lib.c
--- linux-2.5.30-pure/drivers/scsi/scsi_lib.c Sat Jul 27 08:28:38 2002
+++ linux-2.5.30-bio/drivers/scsi/scsi_lib.c Fri Aug 2 10:43:30 2002
@@ -481,10 +481,11 @@
if (SCpnt->buffer != req->buffer) {
if (rq_data_dir(req) == READ) {
unsigned long flags;
- char *to = bio_kmap_irq(req->bio, &flags);
+ /* Todo: check if this is all we need to do */
+ char *to = rq_map_buffer(req, &flags);
memcpy(to, SCpnt->buffer, SCpnt->bufflen);
- bio_kunmap_irq(to, &flags);
+ rq_unmap_buffer(to, &flags);
}
kfree(SCpnt->buffer);
}
* Re: [PATCH] Bio Traversal Changes (Patch 4/4: biotr8-doc.diff)
2002-08-02 12:35 [PATCH] Bio Traversal Changes Suparna Bhattacharya
` (2 preceding siblings ...)
2002-08-02 13:17 ` [PATCH] Bio Traversal Changes - (Patch 3/4 : biotr8-blkdrivers.diff) Suparna Bhattacharya
@ 2002-08-02 13:20 ` Suparna Bhattacharya
2002-08-02 13:48 ` [PATCH] Bio Traversal Changes James Bottomley
4 siblings, 0 replies; 9+ messages in thread
From: Suparna Bhattacharya @ 2002-08-02 13:20 UTC (permalink / raw)
To: linux-kernel, linux-scsi, axboe
And lastly, a patch to the documentation ...
diff -ur linux-2.5.30-pure/Documentation/block/biodoc.txt linux-2.5.30-bio/Documentation/block/biodoc.txt
--- linux-2.5.30-pure/Documentation/block/biodoc.txt Sat Jul 27 08:28:31 2002
+++ linux-2.5.30-bio/Documentation/block/biodoc.txt Fri Aug 2 16:46:19 2002
@@ -5,7 +5,7 @@
Jens Axboe <axboe@suse.de>
Suparna Bhattacharya <suparna@in.ibm.com>
-Last Updated May 2, 2002
+Last Updated August 2, 2002
Introduction:
@@ -204,8 +204,8 @@
which case a virtual mapping of the page is required. For SCSI it is also
done in some scenarios where the low level driver cannot be trusted to
handle a single sg entry correctly. The driver is expected to perform the
-kmaps as needed on such occasions using the bio_kmap and bio_kmap_irq
-routines as appropriate. A driver could also use the blk_queue_bounce()
+kmaps as needed on such occasions using the rq_map_buffer() routine
+as appropriate. A driver could also use the blk_queue_bounce()
routine on its own to bounce highmem i/o to low memory for specific requests
if so desired.
@@ -399,7 +399,8 @@
directly by hand.
This is because end_that_request_first only iterates over the bio list,
and always returns 0 if there are none associated with the request.
- _last works OK in this case, and is not a problem, as I mentioned earlier
+ end_that_request_last works OK in this case, and is not a problem,
+ as mentioned earlier
>
1.3.1 Pre-built Commands
@@ -508,8 +509,9 @@
unsigned int bi_vcnt; /* how may bio_vec's */
unsigned int bi_idx; /* current index into bio_vec array */
-
- unsigned int bi_size; /* total size in bytes */
+ unsigned short bi_voffset; /* current vec offset */
+ unsigned short bi_endvoffset; /* last vec's end offset */
+ unsigned int bi_size; /* total residual size in bytes */
unsigned short bi_phys_segments; /* segments after physaddr coalesce*/
unsigned short bi_hw_segments; /* segments after DMA remapping */
unsigned int bi_max; /* max bio_vecs we can hold
@@ -554,13 +556,58 @@
way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
the sg list.
-Note: Right now the only user of bios with more than one page is ll_rw_kio,
-which in turn means that only raw I/O uses it (direct i/o may not work
-right now). The intent however is to enable clustering of pages etc to
-become possible. The pagebuf abstraction layer from SGI also uses multi-page
-bios, but that is currently not included in the stock development kernels.
-The same is true of Andrew Morton's work-in-progress multipage bio writeout
-and readahead patches.
+The following fields have been introduced in the bio structure
+to enable setting up a bio which starts in the middle of an entry
+of an existing io_vec without having to make a copy of the iovec
+descriptor. This could for example be used by drivers like lvm/md
+when they have to split a single bio (using the bio cloning function
+described later) for striping i/o across multiple devices.
+
+bi_voffset:
+
+Offset relative to the start of the first page, which
+indicates where the bio really starts. In general, before
+i/o starts this would be the same as bv_offset for the
+first vec (at bi_idx), but in the case of clone bios where
+the bio may be split in the middle of a segment it could be
+different. As i/o progresses, instead of changing any
+of the bvec fields, bi_voffset is moved ahead.
+
+The relative offset w.r.t. the start of the first vec
+can be calculated using the macro bio_startoffset(bio).
+
+bi_endvoffset:
+
+Offset relative to the last page which indicates
+where the bio really ends. In general this would be the same
+as bv_offset + bv_len for the last vec, but in the case of
+clone bios where a split piece ends in the middle of a
+segment, it could be different. This field is really used
+mainly for segment boundary and merge checks (it is more
+convenient than having to walk through the entire bio
+and use bi_size to compute the end just to determine
+mergeability).
+
+The macro bio_endoffset(bio) can be used to calculate the
+relative offset w.r.t. the bvec end where the bio
+was broken up.
+
+The remaining size to be transferred in the current bio
+vec should be calculated using the bio_segsize() routine
+(instead of accessing bv_len directly). This
+takes care of adjusting the length for the above offsets.
+
+Aside:
+An alternative to bi_voffset being an absolute
+offset w.r.t. the start of the bvec page would be to
+make it relative to bio_io_vec(bio)->bv_offset instead
+(i.e. the value bio_startoffset() returns in the patch). A
+similar change would then apply to bi_endvoffset. Then
+the fields would be initialized to zero by default,
+though it also would make the mergeability check macros
+a little longer, and possibly add a little extra computation
+during request mapping and end_that_request_first.
+
2.3 Changes in the Request Structure
@@ -609,11 +656,20 @@
unsigned short nr_hw_segments;
/* Various sector counts */
+ /*
+ * The various block internal copies represent counts/pointers of
+ * unfinished i/o, while the other counts/pointers refer to
+ * i/o to be submitted.
+ */
unsigned long nr_sectors; /* no. of sectors left: driver modifiable */
- unsigned long hard_nr_sectors; /* block internal copy of above */
+ unsigned long hard_nr_sectors; /* block internal copy of the above */
unsigned int current_nr_sectors; /* no. of sectors left in the
current segment:driver modifiable */
unsigned long hard_cur_sectors; /* block internal copy of the above */
+
+ unsigned short nr_bio_segments; /* no of segments left in curr bio */
+ unsigned long nr_bio_sectors; /* no of sectors left in curr bio */
+
.
.
int tag; /* command tag associated with request */
@@ -623,6 +679,7 @@
.
.
struct bio *bio, *biotail; /* bio list instead of bh */
+ struct bio *hard_bio; /* block internal copy */
struct request_list *rl;
}
@@ -641,9 +698,11 @@
transfer and invokes block end*request helpers to mark this. The
driver should not modify these values. The block layer sets up the
nr_sectors and current_nr_sectors fields (based on the corresponding
-hard_xxx values and the number of bytes transferred) and updates it on
-every transfer that invokes end_that_request_first. It does the same for the
-buffer, bio, bio->bi_idx fields too.
+hard_xxx values and the number of bytes transferred) and typically
+updates them on every transfer that invokes end_that_request_first,
+unless the driver has advanced these (submission) counters ahead
+of the sectors being completed. The block layer also advances the
+buffer, bio and bio->bi_idx fields appropriately as i/o completes.
The buffer field is just a virtual address mapping of the current segment
of the i/o buffer in cases where the buffer resides in low-memory. For high
@@ -653,6 +712,61 @@
a driver needs to be careful about interoperation with the block layer helper
functions which the driver uses. (Section 1.3)
+
+2.3.1 The Separation of Submission and Completion State
+
+The basic protocol followed all through is that the bio fields
+always reflect the status w.r.t. how much i/o remains
+to be completed. Submission status, on the other hand, is
+only maintained in the request structure. In most cases
+of course, both move in sync (the generic end_that_request_first
+code tries to handle that transparently by advancing the
+submission pointers if they are behind the completion pointers,
+as would happen in the case of drivers which don't modify
+those themselves), but for things like IDE mult-count write,
+the submission counters/pointers may be ahead of the
+completion pointers.
+
+The following fields have been added to the request structure
+to help maintain this distinction.
+
+rq->hard_bio
+ the rq->bio field now reflects the next bio which
+is to be submitted for i/o. Hence, the need for rq->hard_bio
+which keeps track of the next bio to be completed (this
+is the one used by end_that_request_first now, instead
+of rq->bio)
+
+rq->nr_bio_segments
+ this keeps track of how many more vecs remain
+to be submitted in the current bio (rq->bio). It is
+used to compute the current index into rq->bio which
+specifies the segment under submission.
+(rq_map_buffer for example uses this field to map
+the right buffer)
+
+rq->nr_bio_sectors
+ this keeps track of the number of sectors to
+be submitted in the current bio (rq->bio). It can be
+used to compute the remaining sectors in the current
+segment in the situation when it is the last segment.
+
+Now a subtle point about hard_cur_sectors. It reflects
+the number of sectors left to be completed in the
+_current_ segment under submission (i.e. the segment
+in rq->bio, and _not_ rq->hard_bio). This makes it
+possible to use it in rq_map_buffer to determine the
+relative offset in the current segment w.r.t what
+the bio indices might indicate.
+
+A new helper, process_that_request_first(), has been
+introduced for updating the submission state of the request
+without completing the corresponding bios. It can be used
+by code such as mult-count write, which needs to traverse
+multiple bio segments for each chunk of i/o submitted,
+where multiple such chunk transfers are required to cover
+the entire request.
+
3. Using bios
3.1 Setup/Teardown
@@ -718,7 +832,7 @@
3.2.1 Traversing segments and completion units in a request
-The macros bio_for_each_segment() and rq_for_each_bio() should be used for
+The macros bio_for_each_segment() and rq_for_each_bio() could be used for
traversing the bios in the request list (drivers should avoid directly
trying to do it themselves). Using these helpers should also make it easier
to cope with block changes in the future.
@@ -727,11 +841,28 @@
bio_for_each_segment(bio_vec, bio, i)
/* bio_vec is now current segment */
+Notice that where bi_voffset differs from bv_offset of the first
+bvec, the current segment might start somewhere inside the current
+bio_vec. The macros bio_startoffset() and bio_endoffset() help
+in finding the relative offsets into the start and end of the
+vectors where the bio really starts and ends.
+
I/O completion callbacks are per-bio rather than per-segment, so drivers
that traverse bio chains on completion need to keep that in mind. Drivers
which don't make a distinction between segments and completion units would
need to be reorganized to support multi-segment bios.
+It is recommended that drivers utilize the block layer routines
+process_that_request_first() while traversing bios for i/o submission,
+instead of iterating over the segments directly, and use
+end_that_request_first() for completion as before. Things like
+rq_map_buffer() rely on the submission pointers in the request
+to map the correct buffer.
+
+rq_map_buffer() could be used to get a virtual address mapping
+for the current segment buffer, in drivers which use PIO for
+example.
+
3.2.2 Setting up DMA scatterlists
The blk_rq_map_sg() helper routine would be used for setting up scatter
@@ -751,6 +882,7 @@
memory segments that the driver can handle (phys_segments) and the
number that the underlying hardware can handle at once, accounting for
DMA remapping (hw_segments) (i.e. IOMMU aware limits).
+- Accounts for bi_voffset/bi_endvoffset for arbitrary bios
Routines which the low level driver can use to set up the segment limits:
@@ -905,36 +1037,18 @@
perform the i/o on each of these.
The embedded bh array in the kiobuf structure has been removed and no
-preallocation of bios is done for kiobufs. [The intent is to remove the
-blocks array as well, but it's currently in there to kludge around direct i/o.]
-Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.
-
-Todo/Observation:
-
- A single kiobuf structure is assumed to correspond to a contiguous range
- of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
- So right now it wouldn't work for direct i/o on non-contiguous blocks.
- This is to be resolved. The eventual direction is to replace kiobuf
- by kvec's.
-
- Badari Pulavarty has a patch to implement direct i/o correctly using
- bio and kvec.
+preallocation of bios is done for kiobufs.
+Note: Direct i/o no longer uses kiobufs; instead it
+builds up bios directly and submits them to the block layer.
(c) Page i/o:
-Todo/Under discussion:
- Andrew Morton's multi-page bio patches attempt to issue multi-page
- writeouts (and reads) from the page cache, by directly building up
- large bios for submission completely bypassing the usage of buffer
- heads. This work is still in progress.
-
- Christoph Hellwig had some code that uses bios for page-io (rather than
- bh). This isn't included in bio as yet. Christoph was also working on a
- design for representing virtual/real extents as an entity and modifying
- some of the address space ops interfaces to utilize this abstraction rather
- than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
- abstraction, but intended to be as lightweight as possible).
+There now is generic support for multi-page writeouts (and reads)
+from the page cache by directly building up a sequence of large bios
+and submitting them in a pipelined manner. This does away with
+the use of buffer heads for page i/o.
+
(d) Direct access i/o:
Direct access requests that do not contain bios would be submitted differently
@@ -954,14 +1068,6 @@
cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
array pointer to point to the veclet array in kvecs.
- TBD: In order for this to work, some changes are needed in the way multi-page
- bios are handled today. The values of the tuples in such a vector passed in
- from higher level code should not be modified by the block layer in the course
- of its request processing, since that would make it hard for the higher layer
- to continue to use the vector descriptor (kvec) after i/o completes. Instead,
- all such transient state should either be maintained in the request structure,
- and passed on in some way to the endio completion routine.
-
4. The I/O scheduler
@@ -972,7 +1078,7 @@
ii. improved latency
iii. better utilization of h/w & CPU time
-Characteristics:
+4.1 Characteristics:
i. Linked list for O(n) insert/merge (linear scan) right now
@@ -1046,12 +1152,6 @@
multi-page bios being queued in one shot, we may not need to wait to merge
a big request from the broken up pieces coming by.
- Per-queue granularity unplugging (still a Todo) may help reduce some of the
- concerns with just a single tq_disk flush approach. Something like
- blk_kick_queue() to unplug a specific queue (right away ?)
- or optionally, all queues, is in the plan.
-
-
5. Scalability related changes
5.1 Granular Locking: io_request_lock replaced by a per-queue lock
@@ -1147,9 +1247,8 @@
PIO drivers (or drivers that need to revert to PIO transfer once in a
while (IDE for example)), where the CPU is doing the actual data
transfer a virtual mapping is needed. If the driver supports highmem I/O,
-(Sec 1.1, (ii) ) it needs to use bio_kmap and bio_kmap_irq to temporarily
-map a bio into the virtual address space. See how IDE handles this with
-ide_map_buffer.
+(Sec 1.1, (ii) ) it needs to use rq_map_buffer() to temporarily
+map a bio into the virtual address space. See how IDE handles this.
8. Prior/Related/Impacted patches
diff -ur linux-2.5.30-pure/Documentation/block/request.txt linux-2.5.30-bio/Documentation/block/request.txt
--- linux-2.5.30-pure/Documentation/block/request.txt Sat Jul 27 08:28:41 2002
+++ linux-2.5.30-bio/Documentation/block/request.txt Fri Aug 2 11:54:58 2002
@@ -52,11 +52,15 @@
sector_t sector DBI Target location
-unsigned long hard_nr_sectors B Used to keep sector sane
+unsigned long hard_sector B Used to keep sector sane
+ Tracks the location of the
+ unfinished portion
unsigned long nr_sectors DBI Total number of sectors in request
unsigned long hard_nr_sectors B Used to keep nr_sectors sane
+ Tracks the number of unfinished
+ sectors in the request
unsigned short nr_phys_segments DB Number of physical scatter gather
segments in a request
@@ -68,6 +72,14 @@
of request
unsigned int hard_cur_sectors B Used to keep current_nr_sectors sane
+ Tracks the number of unfinished
+ sectors in the same segment.
+
+unsigned long nr_bio_sectors DB Number of sectors in first bio of
+ request
+
+unsigned short nr_bio_segments DB Number of segments in first bio of
+ request
int tag DB TCQ tag, if assigned
@@ -79,9 +91,11 @@
struct completion *waiting D Can be used by driver to get signalled
on request completion
-struct bio *bio DBI First bio in request
+struct bio *bio DBI First unsubmitted bio in request
struct bio *biotail DBI Last bio in request
+
+struct bio *hard_bio B First unfinished bio in request
request_queue_t *q DB Request queue this request belongs to
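The split between the submission-side fields (bio, sector, nr_sectors) and the completion-side fields (hard_bio, hard_sector, hard_nr_sectors) documented above can be illustrated with a small user-space model. This is a hypothetical sketch, not the kernel structures: struct req_model, submit_chunk(), complete_chunk() and rollback() are invented names that only mimic how the submission frontier runs ahead of the completion frontier and is rolled back to it on error.

```c
#include <assert.h>

/* Hypothetical user-space model of the dual-pointer idea:
 * 'sector' tracks the submission frontier, 'hard_sector' the
 * completion frontier.  Field names mirror struct request but
 * this is not kernel code. */
struct req_model {
	unsigned long sector;          /* next sector to submit */
	unsigned long hard_sector;     /* first unfinished sector */
	unsigned long nr_sectors;      /* unsubmitted sectors left */
	unsigned long hard_nr_sectors; /* unfinished sectors left */
};

static void submit_chunk(struct req_model *rq, unsigned long n)
{
	rq->sector += n;               /* submission advances ... */
	rq->nr_sectors -= n;
}

static void complete_chunk(struct req_model *rq, unsigned long n)
{
	rq->hard_sector += n;          /* ... completion follows behind */
	rq->hard_nr_sectors -= n;
}

/* On error, roll the submission pointer back to the last
 * successfully completed point, as described in the patch mail. */
static void rollback(struct req_model *rq)
{
	rq->nr_sectors += rq->sector - rq->hard_sector;
	rq->sector = rq->hard_sector;
}
```

With an 8-sector request, submitting two 4-sector chunks, completing the first and then rolling back leaves both frontiers at sector 4, so the request is reissued from the completed point.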
* Re: [PATCH] Bio Traversal Changes
2002-08-02 12:35 [PATCH] Bio Traversal Changes Suparna Bhattacharya
` (3 preceding siblings ...)
2002-08-02 13:20 ` [PATCH] Bio Traversal Changes (Patch 4/4: biotr8-doc.diff) Suparna Bhattacharya
@ 2002-08-02 13:48 ` James Bottomley
2002-08-05 12:38 ` Suparna Bhattacharya
4 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2002-08-02 13:48 UTC (permalink / raw)
To: suparna; +Cc: linux-kernel, linux-scsi, axboe, B.Zolnierkiewicz, akpm
The SCSI changes (small as they are) look reasonable.
This does look like it exposes an existing problem in the tag/barrier
approach, though.
The bio can be split by making multiple requests over segments of the bio,
correct? If this is a BIO_RW_BARRIER, then each of these requests will be a
REQ_BARRIER. However, in the SCSI paradigm where we translate REQ_BARRIER to
ordered tag, each of the requests will get a new ordered tag as it comes back
around through end_that_request_first, potentially allowing other tags to be
inserted in between these, which would be incorrect, since other bios would be
inserted in between the segments of this one, thus violating the barrier.
Is the above correct? If it is, I may have finally found a use for linked
scsi tasks (gives you the ability to have one tag cover multiple commands).
James
* Re: [PATCH] Bio Traversal Changes
2002-08-02 13:48 ` [PATCH] Bio Traversal Changes James Bottomley
@ 2002-08-05 12:38 ` Suparna Bhattacharya
2002-08-05 15:47 ` James Bottomley
0 siblings, 1 reply; 9+ messages in thread
From: Suparna Bhattacharya @ 2002-08-05 12:38 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-kernel, linux-scsi, axboe, B.Zolnierkiewicz, akpm
On Fri, Aug 02, 2002 at 08:48:06AM -0500, James Bottomley wrote:
> The SCSI changes (small that they are) look reasonable.
>
> This does look like it exposes an existing problem in the tag/barrier
> approach, though.
>
> The bio can be split by making multiple requests over segments of the bio,
> correct? If this is a BIO_RW_BARRIER, then each of these requests will be a
It doesn't quite go as far as multiple requests in the full sense
of what a struct request represents.
All it allows at the moment is the ability for a driver to set up
a command that involves multiple sequential transfers to complete
a single request where each transfer covers a chunk of data for
a portion of the request. Variations like:
setup command
process_that_request_first - chunk 1
[interrupt, status check]
end_that_request_first - chunk1
process_that_request_first - chunk 2
[interrupt, status check]
end_that_request_first - chunk 2
or
setup command
process_that_request_first - chunk 1
process_that_request_first - chunk 2
process_that_request_first - chunk 3
..
end_that_request_first - chunk 1 + 2 + 3
There is only one call to ->request_fn for the entire request, and
the driver manages things underneath. The chunks are expected to
complete sequentially. In the situation where the request is
restarted in the event of an error (say), the submission pointers
are rolled back to the last (successfully) completed point
before issuing the request again.
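The pattern above, a single ->request_fn invocation with the driver walking chunks underneath and rolling back on error, can be sketched in plain C. Everything here is a hypothetical user-space model: drive_request() and do_transfer() are invented stand-ins for the driver loop and the PIO data phase, with one injected transient failure to exercise the roll-back-and-retry behaviour.

```c
#include <assert.h>

/* Hypothetical model of chunked traversal of one request. */
#define NCHUNKS 4

static int fail_at = 2;   /* inject one transient error at chunk 2 */

/* Stand-in for the actual data transfer of one chunk. */
static int do_transfer(int chunk)
{
	if (chunk == fail_at) {
		fail_at = -1;     /* only fail once */
		return -1;
	}
	return 0;
}

/* One "request_fn" call; returns the number of completed chunks. */
static int drive_request(void)
{
	int submitted = 0, completed = 0, events = 0;

	while (completed < NCHUNKS) {
		/* process_that_request_first analogue: advance submission */
		submitted = completed + 1;
		if (do_transfer(submitted - 1) < 0) {
			submitted = completed;  /* roll back to completed point */
			continue;               /* reissue from there */
		}
		/* end_that_request_first analogue: advance completion */
		completed = submitted;
		events++;
	}
	return events;
}
```

The chunks complete strictly in order; the injected failure at chunk 2 only causes that chunk to be reissued, so all four chunks still complete.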
Right now I do not know what kind of use there could be for this
in the context of SCSI in general (in IDE, as I had mentioned
before, PIO commands, especially multi-count writes, follow such
a pattern). You would probably be in a better position to judge,
and suggestions to improve this in that regard are very welcome
indeed. That is why it is harder for me to decide whether the
situation you suggest could arise with SCSI, since it's not clear
whether such pieces/sub-requests are generated in that case.
</ramble>
I must say that I initially did think that this could be
extended to the more generic case which you probably are
referring to and that such an approach could take away the need
to split bios in certain cases (i.e. when the i/o is destined for
a single queue). Later it appeared that trying to cover
the case where each of these pieces gets queued up and might
complete out of order (requiring a tag to correlate things on
completion), would most likely boil down to trying to maintain
all the state that struct request does today.
You might recall the discussion with Niels at the kernel
summit about the alternate possibility of having two request
structs pointing to the same bio, now that we track submission
state separately. In this case, though, as completion state is
still indicated in the bio, it could get a little inelegant
to handle in the context of remembering partial completion
state for two requests simultaneously. Whether partial
completion at a granularity of less than one bio makes sense,
however, is another question that discussions with
Bartlomiej have brought up.
At the same time, if we can afford to allocate a fresh
request struct, then probably allocating a bio (possibly from
a pool associated with the queue) may not sound all that bad.
</ramble>
> REQ_BARRIER. However, in the SCSI paradigm where we translate REQ_BARRIER to
> ordered tag, each of the requests will get a new ordered tag as it comes back
> around through end_that_request_first, potentially allowing other tags to be
> inserted in between these, which would be incorrect, since other bios would be
> inserted in between the segments of this one, thus violating the barrier.
>
> Is the above correct? If it is, I may have finally found a use for linked
> scsi tasks (gives you the ability to have one tag cover multiple commands).
Would be nice (for me) to understand this in more detail.
There might be some possibilities.
Any pointers that I can look up to get a clearer idea ?
Does completion notification happen only when all the commands
covered by a single tag complete ? Otherwise, what is the ordering
amongst the multiple commands in question (do they complete in
serial order as well) ?
Regards
Suparna
>
> James
>
>
* Re: [PATCH] Bio Traversal Changes
2002-08-05 12:38 ` Suparna Bhattacharya
@ 2002-08-05 15:47 ` James Bottomley
2002-08-06 12:17 ` Suparna Bhattacharya
0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2002-08-05 15:47 UTC (permalink / raw)
To: suparna
Cc: James Bottomley, linux-kernel, linux-scsi, axboe,
B.Zolnierkiewicz, akpm
suparna@in.ibm.com said:
> There is only one call to ->request_fn for the entire request, and the
> driver manages things underneath. The chunks are expected to complete
> sequentially. In the situation where the request is restarted in the
> event of an error (say), the submission pointers are rolled back to
> the last (successfully) completed point before issuing the request
> again.
Yes, that's the way I thought it would operate.
suparna@in.ibm.com said:
> I must say that I initially did think that this could be extended to
> the more generic case which you probably are referring to and that
> such an approach could take away the need to split bios in certain
> cases (i.e. when the i/o is destined for a single queue). Later it
> appeared that trying to cover the case where each of these pieces
> gets queued up and might complete out of order (requiring a tag to
> correlate things on completion), would most likely boil down to
> trying to maintain all the state that struct request does today.
For this more generic case, most of our problems seem to be because the
barrier has width: It actually belongs to an I/O request. If the barrier had
zero width (i.e. it was simply a barrier in the stream with no I/O attached)
then it would be much easier to preserve it correctly across this (or any
other) type of bio splitting. It would also make it much more obvious to the
implementing driver where the barrier was supposed to be in the I/O stream,
and would allow more efficient "wait for completion" barrier implementations
for drivers that couldn't enforce it any other way.
> Would be nice (for me) to understand this in more detail. There might
> be some possibilities. Any pointers that I can look up to get a
> clearer idea ?
The SCSI standards (www.t10.org) are the only real authoritative source (with
even some explanation). However, I'll do my best to summarise.
In SCSI, commands are allowed to disconnect, that is, suspend temporarily
while the device does other things. When the device implements tagged
command queueing, it is allowed to disconnect one command and subsequently
reconnect
(restart) a different one. In theory, this means that we can have multiple
active I/Os at once. The way you signal to the SCSI device that you want a
barrier is to label one or more of the tags as "ordered", which means that
the device must complete all I/O for tags issued prior to the ordered one
before starting it, and may not begin I/O for subsequent tags until the
ordered tag has completed.
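The ordered-tag rule just described can be modeled with a tiny predicate. This is an illustrative sketch, not SCSI code: can_start() is an invented helper that checks the two constraints, namely that an ordered tag waits for everything issued before it, and that nothing may start past an incomplete ordered tag.

```c
#include <assert.h>

/* Hypothetical model of ordered-tag barrier semantics.
 * ordered[i] marks tag i as "ordered"; done[i] marks it complete.
 * Tags are indexed in issue order. */
#define NTAGS 5

static int can_start(const int ordered[], const int done[], int tag)
{
	int i;

	/* An ordered tag needs all earlier tags complete. */
	if (ordered[tag]) {
		for (i = 0; i < tag; i++)
			if (!done[i])
				return 0;
	}
	/* Nothing may start past an incomplete ordered tag. */
	for (i = 0; i < tag; i++)
		if (ordered[i] && !done[i])
			return 0;
	return 1;
}
```

This also shows the problem raised earlier in the thread: if each chunk of a split barrier bio were given a fresh ordered tag, unrelated tags could legally start in the gaps between chunks.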
Looping a single request over a big bio means that the SCSI device sees the
I/O as a discrete stream of tags. However, we lose throughput if we stall the
queue waiting for this single bio to complete and we can't work out what the
next tag is until the prior tag completes. In the non barrier case,
everything will still be OK as long as the queue isn't stalled because we'll
be getting throughput from other bios coming down.
I think basically, I'd like to translate as much of the bio as I can into SCSI
tags to improve throughput and each tag currently requires a struct request.
> Does completion notification happen only when all the commands
> covered by a single tag complete ? Otherwise, what is the ordering
> amongst the multiple commands in question (do they complete in serial
> order as well) ?
Yes and no. You get a special completion code (INTERMEDIATE_TASK_COMPLETE)
which says "I've finished this bit, give me the next part". You don't get a
real SCSI completion until the last part of the linked task set completes.
The task is linked sequentially, so it does complete in serial order.
However, don't worry about the linked task stuff; it's a rather esoteric area
of the SCSI standard (that allows a single tag to be used across multiple I/Os
in very much the same way the bio splitting works) which, on mature
reflection, probably isn't such a good idea to use since I'd be doubtful about
how well it's implemented in the devices we have to deal with.
James
* Re: [PATCH] Bio Traversal Changes
2002-08-05 15:47 ` James Bottomley
@ 2002-08-06 12:17 ` Suparna Bhattacharya
0 siblings, 0 replies; 9+ messages in thread
From: Suparna Bhattacharya @ 2002-08-06 12:17 UTC (permalink / raw)
To: James Bottomley; +Cc: linux-kernel, linux-scsi, axboe, B.Zolnierkiewicz, akpm
On Mon, Aug 05, 2002 at 10:47:39AM -0500, James Bottomley wrote:
> suparna@in.ibm.com said:
> > There is only one call to ->request_fn for the entire request, and the
> > driver manages things underneath. The chunks are expected to complete
> > sequentially. In the situation where the request is restarted in the
> > event of an error (say), the submission pointers are rolled back to
> > the last (successfully) completed point before issuing the request
> > again.
>
> Yes, that's the way I thought it would operate.
>
> suparna@in.ibm.com said:
> > I must say that I initially did think that this could be extended to
> > the more generic case which you probably are referring to and that
> > such an approach could take away the need to split bios in certain
> > cases (i.e. when the i/o is destined for a single queue). Later it
> > appeared that trying to cover the case where each of these pieces
> > gets queued up and might complete out of order (requiring a tag to
> > correlate things on completion), would most likely boil down to
> > trying to maintain all the state that struct request does today.
>
> For this more generic case, most of our problems seem to be because the
> barrier has width: It actually belongs to an I/O request. If the barrier had
> zero width (i.e. it was simply a barrier in the stream with no I/O attached)
> then it would be much easier to preserve it correctly across this (or any
> other) type of bio splitting. It would also make it much more obvious to the
> implementing driver where the barrier was supposed to be in the I/O stream,
> and would allow more efficient "wait for completion" barrier implementations
> for drivers that couldn't enforce it any other way.
>
> > Would be nice (for me) to understand this in more detail. There might
> > be some possibilities. Any pointers that I can look up to get a
> > clearer idea ?
>
> The SCSI standards (www.t10.org) are the only real authoritative source (with
> even some explanation). However, I'll do my best to summarise.
>
> In SCSI, commands are allowed to disconnect, that is suspend temporarily while
> the device does other things. When the device implements tag command
> queueing, it is allowed to disconnect one command and subsequently reconnect
> (restart) a different one. In theory, this means that we can have multiple
> active I/Os at once. The way you signal to the scsi device that you want a
> barrier is to label one or more of the tags as "ordered" which means that the
> device must complete all I/O of tags prior to the ordered one before it and
> may not begin I/O of subsequent tags until the ordered tag has completed.
>
> looping a single request over a big bio means that the SCSI device sees the
> I/O as a discrete stream of tags. However, we lose throughput if we stall the
> queue waiting for this single bio to complete and we can't work out what the
> next tag is until the prior tag completes. In the non barrier case,
> everything will still be OK as long as the queue isn't stalled because we'll
> be getting throughput from other bios coming down.
>
> I think basically, I'd like to translate as much of the bio as I can into SCSI
> tags to improve throughput and each tag currently requires a struct request.
I didn't think of the possibility of serializing the chunks
of a single request, while letting other requests on the queue through
in the non-barrier situation. That's a thought, though it might result
in non-optimal scans ... and in that sense affect the throughput.
But now I see why the barrier case was the one you were mainly worried
about.
>
> > Does completion notification happen only when all the commands
> > covered by a single tag complete ? Otherwise, what is the ordering
> > amongst the multiple commands in question (do they complete in serial
> > order as well) ?
>
> Yes and no. You get a special completion code (INTERMEDIATE_TASK_COMPLETE)
> which says "I've finished this bit, give me the next part". You don't get a
> real SCSI completion until the last part of the linked task set completes.
> The task is linked sequentially, so it does complete in serial order.
Thanks for the explanation. I think I get the gist.
>
> However, Don't worry about the linked task stuff, it's a rather esoteric area
> of the SCSI standard (that allows a single tag to be used across multiple I/Os
> in very much the same way the bio splitting works) which, on mature
> reflection, probably isn't such a good idea to use since I'd be doubtful about
> how well it's implemented in the devices we have to deal with.
OK.
Regards
Suparna
>
> James
>
>
Thread overview: 9+ messages
2002-08-02 12:35 [PATCH] Bio Traversal Changes Suparna Bhattacharya
2002-08-02 12:43 ` [PATCH] Bio Traversal Changes (Patch 1/4: biotr8-blk.diff) Suparna Bhattacharya
2002-08-02 12:46 ` [PATCH] Bio Traversal Changes (Patch 2/4: biotr8-blkusers.diff) Suparna Bhattacharya
2002-08-02 13:17 ` [PATCH] Bio Traversal Changes - (Patch 3/4 : biotr8-blkdrivers.diff) Suparna Bhattacharya
2002-08-02 13:20 ` [PATCH] Bio Traversal Changes (Patch 4/4: biotr8-doc.diff) Suparna Bhattacharya
2002-08-02 13:48 ` [PATCH] Bio Traversal Changes James Bottomley
2002-08-05 12:38 ` Suparna Bhattacharya
2002-08-05 15:47 ` James Bottomley
2002-08-06 12:17 ` Suparna Bhattacharya