* [Drbd-dev] Problems with DRBD merge-bvec function
@ 2008-04-10 18:39 Graham, Simon
2008-04-10 21:21 ` Lars Ellenberg
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Graham, Simon @ 2008-04-10 18:39 UTC (permalink / raw)
To: drbd-dev
I've been doing some performance comparisons using the iometer benchmark
between a system without DRBD and one with it. With a specific setup
that simulates a database workload I am seeing a significant performance
drop with DRBD (I see around 60% of the 'native' perf level when running
with DRBD).
After several days staring at the perf counter data, I've come to the
conclusion that the only difference between the two cases is the size of
requests passed down to the logical volume below DRBD. The iometer
workload is doing a mixture of synchronous reads and writes but all are
exactly 16KB in size and I see that:
1. When running without DRBD, the logical volume is seeing a constant
request size of 32 sectors (16KB)
2. When running with DRBD, the request size is variable but is <= 16
sectors with the vast majority of requests at the 16 sector size.
Enabling the DRBD trace, I can see that we are indeed getting a lot of
8K and smaller requests AND that we never see requests that cross a 32KB
boundary in disk offsets. I think this is causing my problem because
requests are being split above DRBD and then re-merged (sometimes)
between LVM and the physical disk.
Looking at the drbd_merge_bvec function I see that this is indeed
deliberate with the current code being as follows:
#if 1
    limit = DRBD_MAX_SEGMENT_SIZE
          - ((bio_offset & (DRBD_MAX_SEGMENT_SIZE-1)) + bio_size);
#else
    limit = AL_EXTENT_SIZE
          - ((bio_offset & (AL_EXTENT_SIZE-1)) + bio_size);
#endif
Since DRBD_MAX_SEGMENT_SIZE is 32KB that means DRBD will never allow a
single BIO to cross the 32KB boundary. The original purpose of this
routine according to the comments was to ensure requests did not cross
the 4MB AL segment size boundary but it seems this was changed.
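To make the arithmetic concrete, here is a small standalone sketch of the same limit calculation. It is not the driver code: `merge_limit`, `SEGMENT_SIZE` and `AL_EXTENT_BYTES` are names made up for this illustration, and it assumes `bio_offset` is a byte offset and that the boundary is a power of two (32KB and 4MB, the values given in this message).

```c
#include <assert.h>
#include <stdint.h>

#define SEGMENT_SIZE    (32u * 1024u)          /* 32KB, as stated in the thread */
#define AL_EXTENT_BYTES (4u * 1024u * 1024u)   /* 4MB, as stated in the thread */

/*
 * Hypothetical helper: how many more bytes may be added to a bio that
 * starts at byte offset bio_offset and already holds bio_size bytes
 * before it would cross the next boundary. Clamped at 0.
 */
static long long merge_limit(uint64_t bio_offset, unsigned bio_size,
                             unsigned boundary)
{
    /* the mask trick only works for power-of-two boundaries */
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);

    long long limit = (long long)boundary
        - (long long)((bio_offset & (boundary - 1)) + bio_size);
    return limit < 0 ? 0 : limit;
}
```

With these assumptions, a 16KB write that starts 24KB into a 32KB-aligned region cannot grow past 8KB: `merge_limit(24 * 1024, 8 * 1024, SEGMENT_SIZE)` is 0, so the request is issued as two 8KB pieces. With the 4MB boundary the same call leaves plenty of room, which matches the request sizes reported above.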
This seems to be a big problem to me -- even though DRBD advertises a
max rq size of 32KB, it rarely is able to actually achieve this when
synchronous I/O is done. It's certainly causing me grief at the moment!
I see that I can't simply change this code back to the 4MB boundary check
as we then run into code in drbd_make_request_26 that will decide to try
and bio_split the request if it crosses a 32KB boundary... although I
see that the previous code before Lars' checkin cbc66a14 actually did the
check based on the AL segment size.
I didn't quite understand the comments re this being necessary to
support two primaries either.
Any suggestions for relaxing this limitation?
Simon
* Re: [Drbd-dev] Problems with DRBD merge-bvec function
From: Lars Ellenberg @ 2008-04-10 21:21 UTC
To: drbd-dev

On Thu, Apr 10, 2008 at 02:39:11PM -0400, Graham, Simon wrote:
> I didn't quite understand the comments re this being necessary to
> support two primaries either.
>
> Any suggestions for relaxing this limitation?

if your load wants to submit constant 16KB requests,
get your file system/database simulation to use aligned "pages",
and it will be a non-issue.

ok, so that does not work.
hm.

I'll quote the comment from above _req_conflicts here:

/*
 * checks whether there was an overlapping request
 * or ee already registered.
 *
 * if so, return 1, in which case this request is completed on the spot,
 * without ever being submitted or send.
 *
 * return 0 if it is ok to submit this request.
 *
 * NOTE:
 * paranoia: assume something above us is broken, and issues different write
 * requests for the same block simultaneously...
 *
 * To ensure these won't be reordered differently on both nodes, resulting in
 * diverging data sets, we discard the later one(s). Not that this is supposed
 * to happen, but this is the rationale why we also have to check for
 * conflicting requests with local origin, and why we have to do so regardless
 * of whether we allowed multiple primaries.
 *
 * BTW, in case we only have one primary, the ee_hash is empty anyways, and the
 * second hlist_for_each_entry becomes a noop. This is even simpler than to
 * grab a reference on the net_conf, and check for the two_primaries flag...
 */
STATIC int _req_conflicts(drbd_request_t *req)

to make it impossible for two "simultaneous" io requests to the same region
to reach the disks in different order, we need to check for conflicts.
these conflicts are easy to provoke by just doing multiple "dd oflag=direct"
to the same block on an smp box, so the risk is real.
even when not using two primaries.

the conflict detection is based on hash table lookups,
key is (target start sector>>HT_SHIFT), HT_SHIFT is defined to 6.
that is where the 32kB max segment size comes from.

conflict detection works by just checking the collision chain for overlapping
requests. if we allow a request to cross collision chain boundaries, we'd have
to check three collision chains for the single request, which would be not that
bad... but this degenerates when looking at the problem more thoroughly. I
don't have it written down anywhere, but it cascades. you have to check the
slot and its neighbors, and then the neighbouring slots of those neighbors, and
... soon you have to walk most if not all pending requests to be correct, or
limit the number of pending requests, which also hurts performance.
(maybe the cascading effect was only for the two-primary case, though, I don't
remember exactly anymore)

I do remember, however, that I considered one way out of there,
which is basically to have

struct drbd_request {
    ...
    /* corresponding to (start sector >> HT_SHIFT) */
    struct hlist_node colision_s;
    /* corresponding to (end sector >> HT_SHIFT) */
    struct hlist_node colision_e;
    ...
}

register non-crossing requests only on _s, boundary-crossing requests on both.
one has to be careful and write a special-cased

hlist_for_each(n,slot) {
    req_i = hlist_entry(n, (struct drbd_request *), colision_s);
    if (&req_i->colision_s != n) {
        req_i = hlist_entry(n, (struct drbd_request *), colision_e);
        BUG_ON(&req_i->colision_e != n);
    }
}

or something like that, and be more careful when unlinking.
that should do the trick, get a reliable detection, and get rid of the
cascading effect.

much simpler: you could try to increase HT_SHIFT to 7 or 8, and see if and
where the code breaks.

or ignore the risk (any application triggering these sanity checks is seriously
broken and would probably not work anyways, so as long as you have an
established file system/data base, arguably you can assume that this check is
just too paranoid, at least in the one-primary case).
if you choose this option, just revert it to the 4MB boundary check we
used to have. this one has to stay, though, the activity log depends on
it, one al-extent covers 4MB.

    Lars
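The relation between HT_SHIFT and the 32kB limit can be checked with a few lines of standalone C. This is only a sketch of the bucket arithmetic described in the message above, not DRBD code: `struct io_range`, `slot_of`, `crosses_slot` and `overlaps` are hypothetical names, and 512-byte sectors are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HT_SHIFT    6       /* from the thread: one slot spans 2^6 sectors */
#define SECTOR_SIZE 512u    /* assumption: 512-byte sectors */

/* Hypothetical stand-in for a pending request: start sector, size in bytes. */
struct io_range {
    uint64_t sector;
    unsigned size;
};

/* Hash key as described in the thread: start sector >> HT_SHIFT. */
static uint64_t slot_of(uint64_t sector)
{
    return sector >> HT_SHIFT;
}

static uint64_t last_sector(const struct io_range *r)
{
    assert(r->size > 0 && r->size % SECTOR_SIZE == 0);
    return r->sector + r->size / SECTOR_SIZE - 1;
}

/* True if the request's first and last sector hash to different slots. */
static bool crosses_slot(const struct io_range *r)
{
    return slot_of(r->sector) != slot_of(last_sector(r));
}

/* Two requests conflict if their sector ranges intersect. */
static bool overlaps(const struct io_range *a, const struct io_range *b)
{
    return a->sector <= last_sector(b) && b->sector <= last_sector(a);
}
```

One slot covers 64 sectors of 512 bytes, that is 32768 bytes. As long as no request crosses a slot boundary, any two overlapping requests share a start slot, so walking a single collision chain and calling something like `overlaps` on each entry is enough.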
* RE: [Drbd-dev] Problems with DRBD merge-bvec function
From: Graham, Simon @ 2008-04-10 22:13 UTC
To: Lars Ellenberg, drbd-dev

Lars,

Thanks for the swift and comprehensive response - gonna have to digest
this for a bit but I do have some questions:

> to make it impossible for two "simultaneous" io requests to the same
> region to reach the disks in different order, we need to check for
> conflicts. these conflicts are easy to provoke by just doing multiple
> "dd oflag=direct" to the same block on an smp box, so the risk is real.
> even when not using two primaries.

Hmm.. the result of such badness is undefined, but I guess we should try
and make DRBD have the same result on both sides in this case... can't say
I'm entirely convinced though -- if you do bad things, you get bad
results!

> conflict detection works by just checking the collision chain for
> overlapping requests. if we allow a request to cross collision chain
> boundaries, we'd have to check three collision chains for the single
> request, which would be not that bad... but this degenerates when
> looking at the problem more thoroughly. I

Hmm.. this one escapes me - I can see how you have to potentially search
three chains for collisions (the one before and the two that the request
spans) but if the max rq size is 32KB and the bucket size is 32KB, how
can it expand beyond the three?

I'm sure you are right, just trying to understand...

> or ignore the risk (any application triggering these sanity checks is
> seriously broken and would probably not work anyways, so as long as you
> have an established file system/data base, arguably you can assume that
> this check is just too paranoid, at least in the one-primary case).
> if you chose this option, just revert it to the 4MB boundary check we
> used to have. this one has to stay, though, the activity log depends on
> it, one al-extent covers 4MB.

That's what I'm testing at the moment -- I reverted the checks in both
drbd_merge_bvec and drbd_make_request_26.

Thanks again for the invaluable information
Simon
* Re: [Drbd-dev] Problems with DRBD merge-bvec function
From: Lars Ellenberg @ 2008-04-11 6:45 UTC
To: drbd-dev

On Thu, Apr 10, 2008 at 06:13:46PM -0400, Graham, Simon wrote:
> Hmm.. the result of such badness is undefined, but I guess we should try
> and make DRBD have the same result on both sides in this case...

that is exactly the point, yes.

> cant say I'm entirely convinced though -- if you do bad things, you
> get bad results!

which is also true.

> Hmm.. this one escapes me - I can see how you have to potentially search
> three chains for collisions (the one before and the two that the request
> spans) but if the max rq size is 32KB and the bucket size is 32KB, how
> can it expand beyond the three?
>
> I'm sure you are right, just trying to understand...

I'm sure I'm right, too, just cannot quite remember ;)

thinking about it once more, for the local-only conflict detection,
it would be just ok.

for the various classifications of the two-thousand-and-odd possibilities
in two-primary conflict detection, there have been cases where it would not
be correct anymore, needing cascading collision chain traversal in the
"wake-up" path (telling queued conflicts that the pending conflicting
request is done now).

> That's what I'm testing at the moment -- I reverted the checks in both
> drbd_merge_bvec and drbd_make_request_26.

let us know what the impact on performance is.

but maybe this had not been your problem at all?
if any of the lower level devices has a merge_bvec function itself,
drbd falls back to "PAGE_SIZE" max-segments, unless you have "use-bmbv"
enabled, because we currently cannot cope with bios that need not be
split on the Primary, but would suddenly be split on the Secondary due
to different lower level constraints.

if you are sure your lower level devices have the very same constraints,
retry with the 32kB boundary settings, but turn on "use-bmbv".

--
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
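For the local-only case, the "three chains" argument from the exchange above can be written down as a short sketch. It assumes requests are registered only under their start slot, are at most 32KB long, and that sectors are 512 bytes; `slots_to_scan` is a hypothetical helper, not a DRBD function, and it says nothing about the two-primary wake-up path, which is where the cascading was said to appear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HT_SHIFT      6               /* one slot = 64 sectors = 32KB */
#define SECTOR_SIZE   512u            /* assumption: 512-byte sectors */
#define MAX_REQ_BYTES (32u * 1024u)   /* advertised max request size */

/*
 * Hypothetical helper: list the slots whose collision chains must be
 * walked to find every pending request that could overlap the request
 * [sector, sector + size). Returns the number of slots written to out.
 */
static size_t slots_to_scan(uint64_t sector, unsigned size, uint64_t out[3])
{
    assert(size > 0 && size <= MAX_REQ_BYTES && size % SECTOR_SIZE == 0);

    uint64_t first = sector >> HT_SHIFT;
    uint64_t last = (sector + size / SECTOR_SIZE - 1) >> HT_SHIFT;
    size_t n = 0;

    /* a request that starts in the previous slot may reach into ours */
    if (first > 0)
        out[n++] = first - 1;
    out[n++] = first;
    /* only present when the request itself crosses a slot boundary */
    if (last != first)
        out[n++] = last;
    return n;
}
```

Because no request is longer than one slot, an overlapping request cannot start earlier than one slot before `first` or later than `last`, so the list never grows beyond three entries under these assumptions.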
[parent not found: <342BAC0A5467384983B586A6B0B3767108F02F3F@EXNA.corp.stratus.com>]
* RE: [Drbd-dev] Problems with DRBD merge-bvec function
From: Graham, Simon @ 2008-04-13 21:38 UTC
To: Lars Ellenberg, drbd-dev

> > That's what I'm testing at the moment -- I reverted the checks in both
> > drbd_merge_bvec and drbd_make_request_26.
>
> let us know what the impact on performance is.

It makes things a little better but not much -- after staring at this
for a while, I realized that I've been looking at the disk stats for the
LVM device underneath DRBD (because DRBD currently doesn't implement the
counters exposed in /proc/diskstats) -- at this level, the average size
of a transfer is reduced because of the meta data updates that are going
on; with the specific workload I am testing, I see about 50 AL cache
misses per second - obviously not good (and yes I am experimenting with
increasing the size, but this test is vicious and does random writes all
over the disk).

I've actually been working on adding support for the standard disk
counters - will probably submit a patch for that shortly on the
assumption that it's generally interesting.

> but maybe this had not been your problem at all?
> if any of the lower level devices has a merge_bvec function itself,
> drbd falls back to "PAGE_SIZE" max-segments, unless you have "use-bmbv"
> enabled, because we currently cannot cope with bios that need not be
> split on the Primary, but would suddenly be split on the Secondary due
> to different lower level constraints.

They don't. However, I don't think the code actually behaves the way you
describe, unless I'm missing something -- in the merge-bvec routine (in
8.0) it has:

    limit = DRBD_MAX_SEGMENT_SIZE
          - ((bio_offset & (DRBD_MAX_SEGMENT_SIZE-1)) + bio_size);

    if (limit < 0) limit = 0;
    if (bio_size == 0) {
        if (limit <= bvec->bv_len) limit = bvec->bv_len;
    } else if (limit && inc_local(mdev)) {
        struct request_queue * const b =
            mdev->bc->backing_bdev->bd_disk->queue;
        if(b->merge_bvec_fn && mdev->bc->dc.use_bmbv) {
            backing_limit = b->merge_bvec_fn(b,bio,bvec);
            limit = min(limit,backing_limit);
        }
        dec_local(mdev);
    }

To me, this says it will use the normal 32KB boundary unless use_bmbv is
set in which case it uses the minimum of ours and the lower devices
value... I don't see anything here that would limit the size to 4K.

Simon
* Re: [Drbd-dev] Problems with DRBD merge-bvec function
From: Lars Ellenberg @ 2008-04-14 8:55 UTC
To: drbd-dev

On Sun, Apr 13, 2008 at 05:38:10PM -0400, Graham, Simon wrote:
> I've actually been working on adding support for the standard disk
> counters - will probably submit a patch for that shortly on the
> assumption that it's generally interesting.

great.

> To me, this says it will use the normal 32KB boundary unless use_bmbv is
> set in which case it uses the minimum of ours and the lower devices
> value... I don't see anything here that would limit the size to 4K.

right. only, that code will not be used.
if the lower level device has a bio merge bvec fn,
drbd announces a fixed maximum segment size of PAGE_SIZE, since that
is the common denominator and all block devices are required to handle
that. there just will not be any merge_bvec fn announced then.

--
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
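Read together with the earlier remark about "use-bmbv", the behaviour described here amounts to a small decision made when the queue limits are set up. The sketch below only restates that decision; `announced_max_segment` and `DRBD_MAX_SEGMENT_BYTES` are hypothetical names, and the page size is passed in rather than assumed.

```c
#include <assert.h>
#include <stdbool.h>

#define DRBD_MAX_SEGMENT_BYTES (32u * 1024u)   /* 32KB, as stated in the thread */

/*
 * Hypothetical restatement of the rule from the thread: if the backing
 * device has its own merge_bvec function and use-bmbv is off, only
 * page-sized segments are announced. Otherwise the 32KB limit applies
 * (with use-bmbv on, the per-bio limit is further capped by the backing
 * device's merge_bvec function, which is not modelled here).
 */
static unsigned announced_max_segment(bool backing_has_merge_bvec,
                                      bool use_bmbv, unsigned page_size)
{
    assert(page_size > 0 && page_size <= DRBD_MAX_SEGMENT_BYTES);

    if (backing_has_merge_bvec && !use_bmbv)
        return page_size;
    return DRBD_MAX_SEGMENT_BYTES;
}
```

For a backing device without a merge_bvec function, which is the situation Simon reports, this gives 32KB regardless of the use-bmbv setting.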
* [Drbd-dev] Perf issues with DRBD when doing a lot of random I/O
From: Graham, Simon @ 2008-04-14 19:21 UTC
To: Lars Ellenberg, drbd-dev

This is a follow on to the earlier conversation on the issues with the
drbd_merge_bvec function - having modified this, I am still seeing
performance with DRBD of about 66% of what I see with no DRBD.

The specific workload is quite vicious and does a lot of random I/O
across the entire disk, so I experimented with bumping the AL cache size
up to the max; this got my performance up to 72% of 'native' - better
but still not great.

Then I started thinking about the change I submitted a while back to
make meta data updates be barrier requests - given that this random
workload causes a lot of AL cache turns, it's also causing a lot of
meta-data activity, so a barrier request is likely to cause a lot of
stalls.

Now, thinking more about this, I'm not so sure that a barrier is
appropriate here -- when we update the on-disk AL, we are actually
throwing away information that a given block is modified, so we need to
be sure THAT block has been committed to the disk, however, it has
nothing to do with the current set of outstanding I/O to the disk (at
least, it seems so to me).

I then tried a little test of simply commenting out the barrier in the
meta data update path and voila I was up to 88% of native perf - finally
within striking range of acceptable!

So... the big question is whether or not having a barrier set on
meta-data updates to the on disk AL is required for correctness?

Simon
* Re: [Drbd-dev] Perf issues with DRBD when doing a lot of random I/O
From: Lars Ellenberg @ 2008-04-14 21:59 UTC
To: drbd-dev

On Mon, Apr 14, 2008 at 03:21:14PM -0400, Graham, Simon wrote:
> Now, thinking more about this, I'm not so sure that a barrier is
> appropriate here -- when we update the on-disk AL, we are actually
> throwing away information that a given block is modified,

we are throwing away information that a given block _may_ be modified,
and we are _adding_ information that another given block may be modified.

> so we need to
> be sure THAT block has been committed to the disk, however, it has
> nothing to do with the current set of outstanding I/O to the disk (at
> least, it seems so to me).
>
> I then tried a little test of simply commenting out the barrier in the
> meta data update path and voila I was up to 88% of native perf - finally
> within striking range of acceptable!
>
> So... the big question is whether or not having a barrier set on
> meta-data updates to the on disk AL is required for correctness

there is certainly room for improvement, we may be able to reduce the
number of single meta data requests.
but for volatile write cache, is there another way than barrier requests
for us to get FUA?

I think the semantics we need for the typical al transaction, i.e.
expiring one al-extent, and reusing its slot for another one, are:
 - we have to be sure that all io to the region we expire has not only
   been "completed" (reached the volatile cache) but also "reached stable
   storage" (the disk itself)
 - we need to be sure that the al transaction reached stable storage
   before we start the real io to the corresponding new region
which is perfectly expressed by a barrier request.

note that in 8.0.12, we made the use of barriers/cache flushes
configurable, you can switch it off, if you know and trust your hardware
(non-volatile cache).

--
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
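The two ordering requirements listed above can be modelled as a pair of predicates. This is a toy model for illustration only, not DRBD code: `struct al_change`, `al_extent_of`, `may_write_transaction` and `may_submit_io` are hypothetical names, and it assumes 512-byte sectors so that one 4MB al-extent spans 2^13 sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AL_EXTENT_SHIFT 13   /* 4MB extent / 512-byte sectors = 2^13 sectors */

/* Hypothetical model of one AL transaction: expire old_extent, activate new_extent. */
struct al_change {
    uint64_t old_extent;
    uint64_t new_extent;
    bool old_io_on_stable_storage;        /* io to the expired region is on disk */
    bool transaction_on_stable_storage;   /* the AL update itself is on disk */
};

static uint64_t al_extent_of(uint64_t sector)
{
    return sector >> AL_EXTENT_SHIFT;
}

/* First requirement: the old region must be stable before the AL forgets it. */
static bool may_write_transaction(const struct al_change *c)
{
    return c->old_io_on_stable_storage;
}

/* Second requirement: io to the new region waits for the stable transaction. */
static bool may_submit_io(const struct al_change *c, uint64_t sector)
{
    if (al_extent_of(sector) != c->new_extent)
        return true;   /* other extents are not governed by this change */
    return c->old_io_on_stable_storage && c->transaction_on_stable_storage;
}
```

A barrier request on the AL write provides both orderings at once, which is the point made above; removing the barrier on a device with a volatile write cache leaves neither predicate guaranteed.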