* [Drbd-dev] Behaviour of verify: false positives -> true positives
@ 2008-09-09 14:02 schoebel
  2008-09-11  9:25 ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread

From: schoebel @ 2008-09-09 14:02 UTC (permalink / raw)
To: drbd-dev

[-- Attachment #1: Type: text/plain, Size: 8237 bytes --]

Hi,

my company is considering drbd for building failover clusters in shared hosting.

During our preliminary tests, we noticed that a "drbdadm verify /dev/drbdx" detects differences on a heavily loaded test server (several thousand customers).

We noticed two kinds of verify differences: one is clearly temporary (not repeatable), but the other is persistent, even after unmounting the filesystem.

According to the manpage on drbd.conf (section "notes on data integrity"), these should be "false positives". Indeed, we found no real corruption (all differing blocks were associated with deleted files).

However, this means that verify is (from _our_ point of view) not a _reliable_ check for data integrity. Since the integrity of our customers' data is of great concern to us, we are looking for ways to change the behaviour so that no false positives are reported any more, i.e. any difference reported by verify should be _guaranteed_ to be a "true positive". In my humble opinion, so-called "mission critical" applications demand that in general.

In my understanding of the kernel architecture, I believe the block differences are caused by an _intended_ race in the kernel at the buffer cache level. Whenever a block gets dirty, there is (deliberately) no lock which would prevent a consumer of the buffer cache (such as ext3) from (re-)modifying the block while it is being written to an ordinary disk (not even necessarily a drbd device). IMHO, this deliberate race is _crucial_ for kernel performance (and thus I don't want to dispute it).
Normally, this race should be no problem at all, even if an inconsistent block (one half of a 512-byte block old, the other half new) is written to disk: the dirty bit is simply set again at the buffer cache level, leading to another writeout which eventually fixes the problem.

I believe (but am not 100% sure; please comment) that this model can explain the temporary differences seen by a drbd verify: when the data _content_ is mirrored across the network, the local disk version may have a different "timestamp" (in _real_ time) than the version transmitted over ethernet. Thus _any_ IO of dirty blocks might be inconsistent due to the deliberate kernel race on the data block _content_ (dereferencing of buffer_head->b_data in parallel with disk IO). With ordinary load patterns, the "chance" of seeing _temporary_ false positives caused by that race is probably extremely low (perhaps one in a few billion). But on a heavily loaded system we have observed it from time to time.

Attached below are some tiny perl scripts which can reproduce temporary false positives with a fairly good chance, at least on our test systems (I'd be glad to receive reproductions and experiences from other users too). Just start io_rewrite_sector2.pl on a (preferably empty) ext3 filesystem on top of drbd, _in parallel_ to a "drbdadm verify /dev/corresponding_device" (other filesystems not yet tested). My test filesystem was about 2GB in size. You may have to adjust some constants to reproduce the test.

Up to now, I was reasoning about the _temporary_ false positives. Persistent false positives could be explained by the following theory:

When a file is eventually deleted or truncated, bforget() is called at the buffer cache interface. After that, dirty blocks are no longer transferred to disk, in order to save IO load (IMHO this is _crucial_ for typical access patterns on /tmp/, where typical lifetimes are often less than 1 second).
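[Editorial sketch, not part of the original mail.] The race described above can be illustrated deterministically in userspace. The following Python toy model (all names are hypothetical; this is not kernel code) interleaves a "writeout" that copies a buffer in two halves with an application write landing between the two copies, producing a torn block; because the buffer is still dirty, a second writeout of the now-stable buffer converges:

```python
def writeout(buf, interleave=None):
    """Copy buf in two halves; optionally run interleave() in between,
    simulating an application modifying the buffer while IO is in flight."""
    half = len(buf) // 2
    out = bytearray(len(buf))
    out[:half] = buf[:half]          # first half reaches the "disk"
    if interleave:
        interleave()                 # application rewrites the buffer mid-IO
    out[half:] = buf[half:]          # second half reaches the "disk"
    return bytes(out)

buf = bytearray(b"A" * 512)          # one 512-byte sector, old version

def modify():
    buf[:] = b"B" * 512              # new version, written during the IO

torn = writeout(buf, interleave=modify)
assert torn == b"A" * 256 + b"B" * 256   # half old, half new: a torn block

# The block is dirty again, so another writeout runs and converges:
fixed = writeout(buf)
assert fixed == b"B" * 512
```

This is exactly the "dirty bit set again, another writeout fixes it" behaviour described above; the persistent case arises when that second writeout never happens (bforget()).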
As a consequence, the above-mentioned "fixing" of inconsistent blocks is no longer carried out, and long-term differences can remain on the mirrored device, belonging to deleted files only. Again, the chance of observing this is very low, but I have written another tiny perl script to reproduce it. Just start test_orphan.pl on an _empty_ drbd-mounted filesystem, and _afterwards_ check it with verify. Since the filesystem is empty again after the test, you can be sure that the differences belong to empty or orphan files (if they belonged to filesystem metadata, you would notice that by inspecting the contents with dd). By investigating the differing counter values with dd on the underlying devices (primary vs secondary), you can even tell which version was written first. Interestingly, I found that most of the time the local version was the older one, but sometimes it was vice versa.

As a side note: test_orphan.pl also produces orphan files. Sometimes the counter values observed in differing blocks are lower than the point of unlink() [currently set to 500], but most of the time the counter value is greater. It seems to me that applications producing orphan files raise the chance of observing false positives, but I am not sure. I don't yet have a theory for that; please comment. It might be influenced by the timing of the block IO daemon and/or by the load patterns.

-------

Now I am reasoning about different solutions. Please comment.

Here are some brainstorming ideas, without judging their quality (that will come later):

1) Try to avoid the kernel race on buffer content, specifically for drbd devices, as an _option_ (which is _off_ by default).
There are at least two sub-variants of that:

1a) use locking

1b) use an idea published by Herlihy for conflict-free resolution of different _versions_ of blocks, either on the fly or optionally residing in the buffer cache _in parallel_ [nb: the latter could probably result in a major rewrite of large portions of the kernel, not to be disputed here on this list]

2) Whenever drbd verify sees a difference, retry the comparison a few times after a short delay (possibly with exponential backoff), until _temporary_ differences have been filtered out. Persistent differences will not be tackled by that.

3) In addition to 2), add an _option_ (_off_ by default) to the buffer cache code to submit bforgotten() blocks specifically to drbd devices.

4) Add an option to drbd (as usual _off_ by default) which calculates a checksum on _every_ arriving IO request _first_ (before starting any sub-request). After finishing both the local and the remote sub-IO, calculate the checksum again and compare. If a difference is found, restart both sub-transmissions, until no mismatches are found any more.

5) As a refinement of 4), first filter out the temporary false positives by means of 2). Additionally, try to identify bforgotten() blocks at the buffer cache level and submit them only _once_ after a bforget(), and only to drbd devices where the corresponding option is set. Then drbd uses method 4) _only_ on those blocks, thereby reducing the performance impact of 4) to a rare special case.

6) Try to establish a complete solution solely at the buffer cache level, in the presence of races, without affecting drbd in any way. I am not sure whether this is possible. The raw idea is to _identify_ all races _reliably_(!) when they _actually_ occur (as opposed to when they _could_ occur).
Theoretically, b_count, b_state and/or other similar means should be available to detect actually occurring races during IO, but I am extremely unsure whether this is possible _reliably_ without additional means such as checksumming. Probably this is the wrong list for discussing that topic, but I would be thankful for hints and ideas before going elsewhere with an incomplete idea.

Now what I personally think of it: 1a) has too strong a performance impact, 1b) would probably complicate the kernel by orders of magnitude, 2) is easy but does not solve all problems, 3) is feasible for non-general applications (outside of /tmp/ etc.) but could lead to fundamental discussions with the kernel developers (forcing us into an in-house patch), 4) solves it all very easily but hurts performance, 5) is very complicated, and 6) is not a mature idea yet.

Currently, I would prefer 4) as an option for testing and for gaining experience at the first try, but depending on performance results, other mechanisms should be evaluated.

Of course, other people have other ideas and opinions, so please feel free to comment.

Thanks for your patience,

Thomas

[-- Attachment #2: io_rewrite_sector2.pl --]
[-- Type: application/x-perl, Size: 303 bytes --]

[-- Attachment #3: test_orphan.pl --]
[-- Type: application/x-perl, Size: 492 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread
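[Editorial sketch, not part of the original mail.] Option 2) above — retry the comparison with exponential backoff until transient differences disappear — can be sketched in userspace (Python; `verify_block` and its parameters are hypothetical names, this is not the drbd verify code):

```python
import time

def verify_block(read_local, read_peer, retries=3, delay=0.01):
    """Return True only for a persistent difference.

    Transient mismatches caused by in-flight rewrites vanish on a re-read,
    so a difference is reported only if it survives every retry."""
    for attempt in range(retries + 1):
        if read_local() == read_peer():
            return False                 # (re-)read matched: false positive
        if attempt < retries:
            time.sleep(delay)
            delay *= 2                   # exponential backoff
    return True                          # mismatch survived all retries

# A transient difference that settles after one re-read is filtered out:
local_versions = iter([b"old", b"new", b"new"])
assert verify_block(lambda: next(local_versions), lambda: b"new") is False

# A persistent difference (e.g. a bforgotten block) is still reported:
assert verify_block(lambda: b"old", lambda: b"new", delay=0.001) is True
```

As the mail notes, this filters the temporary false positives only; the persistent ones from bforget() pass straight through.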
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-09-09 14:02 [Drbd-dev] Behaviour of verify: false positives -> true positives schoebel
@ 2008-09-11  9:25 ` Lars Ellenberg
  2008-09-11  9:37   ` Lars Ellenberg
  2008-09-30 20:10   ` Lars Ellenberg
  0 siblings, 2 replies; 11+ messages in thread

From: Lars Ellenberg @ 2008-09-11 9:25 UTC (permalink / raw)
To: drbd-dev

On Tue, Sep 09, 2008 at 04:02:30PM +0200, schoebel wrote:
> Hi,
>
> my company is considering drbd for building failover clusters in shared hosting.
>
> During our preliminary tests, we noticed that a "drbdadm verify /dev/drbdx" detects differences on a heavily loaded test server (several thousand customers).
>
> We noticed two kinds of verify differences: one is clearly temporary (not repeatable), but the other is persistent, even after unmounting the filesystem.
>
> According to the manpage on drbd.conf (section "notes on data integrity"), these should be "false positives". Indeed, we found no real corruption (all differing blocks were associated with deleted files).
>
> However, this means that verify is (from _our_ point of view) not a _reliable_ check for data integrity. Since the integrity of our customers' data is of great concern to us, we are looking for ways to change the behaviour so that no false positives are reported any more, i.e. any difference reported by verify should be _guaranteed_ to be a "true positive". In my humble opinion, so-called "mission critical" applications demand that in general.
>
> In my understanding of the kernel architecture, I believe the block differences are caused by an _intended_ race in the kernel at the buffer cache level. Whenever a block gets dirty, there is (deliberately) no lock which would prevent a consumer of the buffer cache (such as ext3) from (re-)modifying the block while it is being written to an ordinary disk (not even necessarily a drbd device).
> IMHO, this deliberate race is _crucial_ for kernel performance (and thus I don't want to dispute it).
>
> Normally, this race should be no problem at all, even if an inconsistent block (one half of a 512-byte block old, the other half new) is written to disk: the dirty bit is simply set again at the buffer cache level, leading to another writeout which eventually fixes the problem.
>
> I believe (but am not 100% sure; please comment) that this model can explain the temporary differences seen by a drbd verify: when the data _content_ is mirrored across the network, the local disk version may have a different "timestamp" (in _real_ time) than the version transmitted over ethernet. Thus _any_ IO of dirty blocks might be inconsistent due to the deliberate kernel race on the data block _content_ (dereferencing of buffer_head->b_data in parallel with disk IO). With ordinary load patterns, the "chance" of seeing _temporary_ false positives caused by that race is probably extremely low (perhaps one in a few billion). But on a heavily loaded system we have observed it from time to time.
>
> Attached below are some tiny perl scripts which can reproduce temporary false positives with a fairly good chance, at least on our test systems (I'd be glad to receive reproductions and experiences from other users too). Just start io_rewrite_sector2.pl on a (preferably empty) ext3 filesystem on top of drbd, _in parallel_ to a "drbdadm verify /dev/corresponding_device" (other filesystems not yet tested). My test filesystem was about 2GB in size. You may have to adjust some constants to reproduce the test.
>
> Up to now, I was reasoning about the _temporary_ false positives. Persistent false positives could be explained by the following theory:
>
> When a file is eventually deleted or truncated, bforget() is called at the buffer cache interface.
> After that, dirty blocks are no longer transferred to disk, in order to save IO load (IMHO this is _crucial_ for typical access patterns on /tmp/, where typical lifetimes are often less than 1 second). As a consequence, the above-mentioned "fixing" of inconsistent blocks is no longer carried out, and long-term differences can remain on the mirrored device, belonging to deleted files only. Again, the chance of observing this is very low, but I have written another tiny perl script to reproduce it. Just start test_orphan.pl on an _empty_ drbd-mounted filesystem, and _afterwards_ check it with verify. Since the filesystem is empty again after the test, you can be sure that the differences belong to empty or orphan files (if they belonged to filesystem metadata, you would notice that by inspecting the contents with dd). By investigating the differing counter values with dd on the underlying devices (primary vs secondary), you can even tell which version was written first. Interestingly, I found that most of the time the local version was the older one, but sometimes it was vice versa.
>
> As a side note: test_orphan.pl also produces orphan files. Sometimes the counter values observed in differing blocks are lower than the point of unlink() [currently set to 500], but most of the time the counter value is greater. It seems to me that applications producing orphan files raise the chance of observing false positives, but I am not sure. I don't yet have a theory for that; please comment. It might be influenced by the timing of the block IO daemon and/or by the load patterns.

interesting test setup and solid analysis, I'd say.

> -------
>
> Now I am reasoning about different solutions. Please comment.
>
> Here are some brainstorming ideas, without judging their quality (that will come later):
>
> 1) Try to avoid the kernel race on buffer content, specifically for drbd devices, as an _option_ (which is _off_ by default).
> There are at least two sub-variants of that:
>
> 1a) use locking
>
> 1b) use an idea published by Herlihy for conflict-free resolution of different _versions_ of blocks, either on the fly or optionally residing in the buffer cache _in parallel_ [nb: the latter could probably result in a major rewrite of large portions of the kernel, not to be disputed here on this list]
>
> 2) Whenever drbd verify sees a difference, retry the comparison a few times after a short delay (possibly with exponential backoff), until _temporary_ differences have been filtered out. Persistent differences will not be tackled by that.
>
> 3) In addition to 2), add an _option_ (_off_ by default) to the buffer cache code to submit bforgotten() blocks specifically to drbd devices.
>
> 4) Add an option to drbd (as usual _off_ by default) which calculates a checksum on _every_ arriving IO request _first_ (before starting any sub-request). After finishing both the local and the remote sub-IO, calculate the checksum again and compare. If a difference is found, restart both sub-transmissions, until no mismatches are found any more.
>
> 5) As a refinement of 4), first filter out the temporary false positives by means of 2). Additionally, try to identify bforgotten() blocks at the buffer cache level and submit them only _once_ after a bforget(), and only to drbd devices where the corresponding option is set. Then drbd uses method 4) _only_ on those blocks, thereby reducing the performance impact of 4) to a rare special case.
>
> 6) Try to establish a complete solution solely at the buffer cache level, in the presence of races, without affecting drbd in any way. I am not sure whether this is possible. The raw idea is to _identify_ all races _reliably_(!) when they _actually_ occur (as opposed to when they _could_ occur).
> Theoretically, b_count, b_state and/or other similar means should be available to detect actually occurring races during IO, but I am extremely unsure whether this is possible _reliably_ without additional means such as checksumming. Probably this is the wrong list for discussing that topic, but I would be thankful for hints and ideas before going elsewhere with an incomplete idea.
>
> Now what I personally think of it: 1a) has too strong a performance impact, 1b) would probably complicate the kernel by orders of magnitude, 2) is easy but does not solve all problems, 3) is feasible for non-general applications (outside of /tmp/ etc.) but could lead to fundamental discussions with the kernel developers (forcing us into an in-house patch), 4) solves it all very easily but hurts performance, 5) is very complicated, and 6) is not a mature idea yet.
>
> Currently, I would prefer 4) as an option for testing and for gaining experience at the first try, but depending on performance results, other mechanisms should be evaluated.
>
> Of course, other people have other ideas and opinions, so please feel free to comment.

how about holding a ring buffer of pinned pages,
say "drbd max-buffers" pages plus some,
and in drbd_make_request,
memcpy(ring buffer, submitted data),
then checksum that,
and submit it to the local disk as well as to the tcp stack.

as these are now "private" data buffers, the "application" cannot possibly modify them in flight.

yes, not necessarily performant.

> Thanks for your patience,

thanks for sharing your findings.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting   http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

^ permalink raw reply	[flat|nested] 11+ messages in thread
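[Editorial sketch, not part of the original mail.] The ring-buffer proposal above can be modelled in userspace (Python; `submit_private_copy` is a hypothetical name). The key property: once the submitted data has been memcpy'd into a buffer the application cannot reach, a later in-flight modification of the original no longer affects what is checksummed and sent to the disk and the peer:

```python
import hashlib

def submit_private_copy(app_buffer):
    """Snapshot the submitted data into a private buffer, checksum the
    snapshot, and hand the snapshot (not the live buffer) to the IO paths."""
    private = bytes(app_buffer)                # memcpy into a private page
    digest = hashlib.md5(private).hexdigest()  # checksum the stable copy
    return private, digest

app_buffer = bytearray(b"A" * 4096)
private, digest = submit_private_copy(app_buffer)

# The application rewrites its buffer while the request is "in flight":
app_buffer[:] = b"B" * 4096

# The private copy — what the local disk and the tcp stack would actually
# see — is unaffected, so both replicas receive identical data:
assert private == b"A" * 4096
assert hashlib.md5(private).hexdigest() == digest
```

The cost is one extra copy per write request, which is the "not necessarily performant" caveat above.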
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-09-11  9:25 ` Lars Ellenberg
@ 2008-09-11  9:37   ` Lars Ellenberg
  2008-09-30 20:10   ` Lars Ellenberg
  1 sibling, 0 replies; 11+ messages in thread

From: Lars Ellenberg @ 2008-09-11 9:37 UTC (permalink / raw)
To: drbd-dev

On Thu, Sep 11, 2008 at 11:25:12AM +0200, Lars Ellenberg wrote:
> On Tue, Sep 09, 2008 at 04:02:30PM +0200, schoebel wrote:
> > Hi,
> >
> > my company is considering drbd for building failover clusters in shared hosting.
> >
> > During our preliminary tests, we noticed that a "drbdadm verify /dev/drbdx" detects differences on a heavily loaded test server (several thousand customers).
> >
> > We noticed two kinds of verify differences: one is clearly temporary (not repeatable), but the other is persistent, even after unmounting the filesystem.
> >
> > According to the manpage on drbd.conf (section "notes on data integrity"), these should be "false positives". Indeed, we found no real corruption (all differing blocks were associated with deleted files).
> >
> > However, this means that verify is (from _our_ point of view) not a _reliable_ check for data integrity. Since the integrity of our customers' data is of great concern to us, we are looking for ways to change the behaviour so that no false positives are reported any more, i.e. any difference reported by verify should be _guaranteed_ to be a "true positive". In my humble opinion, so-called "mission critical" applications demand that in general.

now, another thought here: any application doing what your test perl scripts do is risking its data, and is not crash-safe. apparently there are many of those.
if an application starts to rewrite a portion of a file it has written to before, it should fsync() at some point before it starts the overwrite, so the buffers will be on disk (ok, "on device", not necessarily on rotating rust yet, but, while related, that is another topic).

so as long as you run a postfix mailq or a database tablespace on DRBD, or similar applications that know how to behave, I'd not expect any such "modify of in-flight buffers" to happen. any application that does "not behave" is risking data integrity (in the event of [application, file system, device or node] crash / power outage / failover).

for any "behaving" application, the "drbd verify" integrity check is expected to be as _reliable_ as you ask for above.

Am I missing something?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting   http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

^ permalink raw reply	[flat|nested] 11+ messages in thread
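[Editorial sketch, not part of the original mail.] The "behaving application" pattern described above, in Python (the file name is illustrative): before rewriting a region it has written before, the application forces the earlier version down to the device with fsync(), so no dirty buffer of that region can still be in flight when the overwrite starts:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "record.dat")  # illustrative file

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.write(fd, b"version-1" + b"\0" * 503)   # initial 512-byte record

# Before overwriting, flush the old version all the way to the device,
# so no dirty buffer of this region is still being written out:
os.fsync(fd)

os.lseek(fd, 0, os.SEEK_SET)
os.write(fd, b"version-2" + b"\0" * 503)   # the overwrite
os.fsync(fd)                               # persist the new version too
os.close(fd)

with open(path, "rb") as f:
    assert f.read(9) == b"version-2"
```

Applications following this discipline never rewrite a buffer that is concurrently in flight, so they cannot trigger the torn-write false positives.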
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-09-11  9:25 ` Lars Ellenberg
  2008-09-11  9:37   ` Lars Ellenberg
@ 2008-09-30 20:10   ` Lars Ellenberg
  2008-10-01  9:16     ` Thomas Schoebel-Theuer
  1 sibling, 1 reply; 11+ messages in thread

From: Lars Ellenberg @ 2008-09-30 20:10 UTC (permalink / raw)
To: drbd-dev

On Thu, Sep 11, 2008 at 11:25:12AM +0200, Lars Ellenberg wrote:
> On Tue, Sep 09, 2008 at 04:02:30PM +0200, schoebel wrote:
...
> > Now I am reasoning about different solutions. Please comment.
> >
> > Here are some brainstorming ideas, without judging their quality (that will come later):
...
> how about holding a ring buffer of pinned pages,
> say "drbd max-buffers" pages plus some,
> and in drbd_make_request,
> memcpy(ring buffer, submitted data),
> then checksum that,
> and submit it to the local disk as well as to the tcp stack.
> as these are now "private" data buffers,
> the "application" cannot possibly modify them in flight.

have you made any progress on that matter?

shall we further discuss whether or not it should be "fixed" (in drbd), and if so, how?

or have you decided to just move along?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting   http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-09-30 20:10 ` Lars Ellenberg
@ 2008-10-01  9:16   ` Thomas Schoebel-Theuer
  2008-10-01 10:46     ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread

From: Thomas Schoebel-Theuer @ 2008-10-01 9:16 UTC (permalink / raw)
To: drbd-dev; +Cc: Michael Ziegel, Peter Scheitterer

[-- Attachment #1: Type: text/plain, Size: 3527 bytes --]

Hi Lars,

I have two different patches, both based on your git version from 2008-09-11. The first implements the checksumming method, while the second implements the copy method you proposed. I found your suggestion much easier to implement than the checksumming method. However, the copy method works completely silently (you get no indication whether a race on the data is actually taking place), while the checksumming method writes reports to the kernel log (so you can observe the behaviour).

I have also run some performance tests with iozone, but could not find any significant difference between the patched and unpatched versions (and, of course, no difference between the two alternatives). iozone's measurements seem to produce "noise" in the range of about 5%. Theoretical considerations tell me that both patches should slow down overall system performance by about 1% at most, so I could not establish anything from my single runs (I had no time to repeat them 100 or 1000 times).

Enclosed you will find both patches. They are intended only for _testing_ (e.g. performance benchmarks), not for production use. I would be glad if someone could report experiences with them, e.g. performance tests (preferably with better testing tools). Currently both methods are enabled by a preprocessor constant (just for testing). For production use, we should do it properly.
Since both methods are in fact "paranoia options" (similar to the CRCs already present), I would ask whether we should enable/disable them via drbdadm through the kernel interface, leading to a new interface revision. [Oh, as a side note: I found that the git revision from 2008-09-11 crashed with userspace tools 8.2.5, although the interface revision claimed to be compatible -- or did I miss something there?] Should I do the work on the interfaces as well, or do you want to integrate it into your systematics yourself?

Another issue with the checksumming method: I had problems interfacing with the crypto API (probably a wrongly braced kmap/kunmap bug), and it seems that different kernel versions do it differently, so I wrote my own small XOR checksum algorithm (almost trivial). I'm not sure whether this quick hack is really good. And there is some dead code around the crypto API, which I should either remove or clean up (e.g. by moving global_tfm to struct drbd_conf). Currently this is all a little hackish, not for production use. Do you have any suggestions on how to clean this up to match your style?

Thanks,

Thomas

On Tuesday, 30 September 2008 22:10:16, Lars Ellenberg wrote:
> On Thu, Sep 11, 2008 at 11:25:12AM +0200, Lars Ellenberg wrote:
> > On Tue, Sep 09, 2008 at 04:02:30PM +0200, schoebel wrote:
>
> ...
>
> > > Now I am reasoning about different solutions. Please comment.
> > >
> > > Here are some brainstorming ideas, without judging their quality (that will come later):
>
> ...
>
> > how about holding a ring buffer of pinned pages,
> > say "drbd max-buffers" pages plus some,
> > and in drbd_make_request,
> > memcpy(ring buffer, submitted data),
> > then checksum that,
> > and submit it to the local disk as well as to the tcp stack.
> > as these are now "private" data buffers,
> > the "application" cannot possibly modify them in flight.
>
> have you made any progress on that matter?
>
> shall we further discuss whether or not it should be "fixed" (in drbd), and if so, how?
>
> or have you decided to just move along?

[-- Attachment #2: drbd-8.2-checksum.patch --]
[-- Type: text/x-diff, Size: 7944 bytes --]

diff --git a/drbd/drbd_int.h b/drbd/drbd_int.h
index 0281a46..77dc26b 100644
--- a/drbd/drbd_int.h
+++ b/drbd/drbd_int.h
@@ -99,6 +99,17 @@ extern char usermode_helper[];
 #define FALSE 0
 #endif
 
+#define VERIFY_HACK
+
+#ifdef VERIFY_HACK
+#define CRC_SIZE 16
+int drbd_checksum_bio(struct bio *bio, u8 * res);
+int drbd_checksum_loadalg(char * algname);
+void drbd_checksum_unloadalg(void);
+struct drbd_conf;
+void _restart_requests(struct drbd_conf * mdev);
+#endif
+
 /* I don't remember why XCPU ...
  * This is used to wake the asender,
  * and to interrupt sending the sending task
@@ -678,6 +689,9 @@ struct drbd_request {
 	unsigned long rq_state; /* see comments above _req_mod() */
 	int seq_num;
 	unsigned long start_time;
+#ifdef VERIFY_HACK
+	u8 checksum[CRC_SIZE];
+#endif
 };
 
 struct drbd_barrier {
@@ -1000,6 +1014,11 @@ struct drbd_conf {
 	struct bm_io_work bm_io_work;
 	u64 ed_uuid; /* UUID of the exposed data */
 	struct mutex state_mutex;
+#ifdef VERIFY_HACK
+#define RESTART_MAX 8
+	int restart_head, restart_tail, restart_count;
+	struct bio * restart[RESTART_MAX];
+#endif
 };
 
 static inline struct drbd_conf *minor_to_mdev(unsigned int minor)
diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index dde171e..b7bde61 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -2838,6 +2838,10 @@ STATIC void drbd_cleanup(void)
 
 	drbd_unregister_blkdev(DRBD_MAJOR, "drbd");
 
+#ifdef VERIFY_HACK
+	drbd_checksum_unloadalg();
+#endif
+
 	printk(KERN_INFO "drbd: module cleanup done.\n");
 }
@@ -2997,6 +3001,12 @@ int __init drbd_init(void)
 		DRBD_MAJOR);
 	printk(KERN_INFO "drbd: minor_table @ 0x%p\n", minor_table);
 
+#ifdef VERIFY_HACK
+	if(drbd_checksum_loadalg("md5")) {
+		printk(KERN_INFO "drbd: loading checksum algorithm module failed\n");
+	}
+#endif
+
 	return 0; /* Success! */
 
 Enomem:
diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c
index 8ca228c..831dec9 100644
--- a/drbd/drbd_receiver.c
+++ b/drbd/drbd_receiver.c
@@ -3685,6 +3685,9 @@ STATIC int got_BlockAck(struct drbd_conf *mdev, struct Drbd_Header *h)
 			D_ASSERT(0);
 		}
 		spin_unlock_irq(&mdev->req_lock);
+#ifdef VERIFY_HACK
+		_restart_requests(mdev);
+#endif
 	}
 
 	/* dec_ap_pending is handled within _req_mod */
diff --git a/drbd/drbd_req.c b/drbd/drbd_req.c
index e198a03..53ab136 100644
--- a/drbd/drbd_req.c
+++ b/drbd/drbd_req.c
@@ -93,6 +93,142 @@ STATIC void _print_req_mod(struct drbd_request *req, enum drbd_req_event what)
 	INFO("_req_mod(%p %c ,%s)\n", req, rw, rq_event_names[what]);
 }
 
+#ifdef VERIFY_HACK
+
+#include <linux/crypto.h>
+
+STATIC int drbd_make_request_common(struct drbd_conf *mdev, struct bio *bio);
+
+static struct crypto_hash * global_tfm = NULL;
+
+/* Separation between allocation and usage of crypto descriptors.
+ * This is necessary because crypto_alloc_*() can call schedule().
+ */
+int drbd_checksum_loadalg(char * algname)
+{
+	int digestsize;
+	global_tfm = crypto_alloc_hash(algname, 0, CRYPTO_ALG_ASYNC);
+	if(IS_ERR(global_tfm)) {
+		printk("Aha algorithm=%s error\n", algname);
+		return -1;
+	}
+	digestsize = crypto_hash_digestsize(global_tfm);
+	if(digestsize > CRC_SIZE || CRC_SIZE != 2*(sizeof(long long))) {
+		printk("Aha digestsize=%d reslen=%d\n", digestsize, CRC_SIZE);
+		return -1;
+	}
+	printk(KERN_INFO "drbd_checksum_loadalg OK\n");
+	return 0;
+}
+
+void drbd_checksum_unloadalg(void)
+{
+	crypto_free_hash(global_tfm);
+	global_tfm = NULL;
+}
+
+/* A backwards iterator is missing in <linux/bio.h>.
+ * We need it for correct nesting of bvec_bmap_irq / unmap pairs.
+ */
+#define __bio_for_each_segment_back(bvl, bio, i, start_idx) \
+	for (bvl = bio_iovec_idx((bio), (start_idx)+(bio)->bi_vcnt-1), i = (start_idx+(bio)->bi_vcnt-1); \
+	     i >= (start_idx); \
+	     bvl--, i--)
+
+#define bio_for_each_segment_back(bvl, bio, i) \
+	__bio_for_each_segment_back(bvl, bio, i, (bio)->bi_idx)
+
+static void dummy_checksum(void * data, int len, long long res[])
+{
+	register long long val0 = res[0];
+	register long long val1 = res[1];
+	long long * _data = data;
+	while(len > 0) {
+		val0 ^= *_data++;
+		val1 ^= *_data++;
+		len -= 2*sizeof(val0);
+	}
+	res[0] = val0;
+	res[1] = val1;
+}
+
+int drbd_checksum_bio(struct bio *bio, u8 * res)
+{
+	int err = -1;
+	struct hash_desc desc;
+	int count;
+
+	if(!global_tfm)
+		return -1;
+
+	desc.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+	desc.tfm = global_tfm;
+
+	memset(res, 0, CRC_SIZE);
+
+	count = bio->bi_vcnt;
+	{
+		void *data[count];
+		unsigned long flags[count];
+		struct scatterlist sg[count];
+		struct bio_vec * bv;
+		int nbytes = 0;
+		int i;
+		int j;
+		sg_init_table(sg, count);
+
+		j = 0;
+		bio_for_each_segment(bv, bio, i) {
+			data[j] = bvec_kmap_irq(bv, &flags[j]);
+			nbytes += bv->bv_len;
+			dummy_checksum(data[j], bv->bv_len, (long long*)res);
+			sg_set_buf(&sg[j], data[j], bv->bv_len);
+			bvec_kunmap_irq(data[j], &flags[j]);
+			j++;
+		}
+#if 0
+		if(crypto_hash_init(&desc))
+			goto out;
+		if(crypto_hash_update(&desc, sg, nbytes))
+			goto out;
+		if(crypto_hash_final(&desc, res))
+			goto out;
+		bio_for_each_segment_back(bv, bio, i) {
+			j--;
+		}
+#endif
+	}
+	err = 0;
+#if 0
+out:
+#endif
+	return err;
+}
+
+void _restart_requests(struct drbd_conf * mdev)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	while(mdev->restart_count > 0) {
+		struct bio * master_bio;
+		int restart_count = mdev->restart_count;
+		master_bio = mdev->restart[mdev->restart_head];
+		mdev->restart_head = (mdev->restart_head+1) % RESTART_MAX;
+		mdev->restart_count--;
+		if(master_bio) {
+			spin_unlock_irqrestore(&mdev->req_lock, flags);
+			INFO("restarting block=%ld (count=%d)\n", master_bio->bi_sector, restart_count);
+			drbd_make_request_common(mdev, master_bio);
+			spin_lock_irqsave(&mdev->req_lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+}
+
+#endif
+
 # ifdef ENABLE_DYNAMIC_TRACE
 # define print_rq_state(R, T) \
 	MTRACE(TraceTypeRq, TraceLvlMetrics, _print_rq_state(R, T);)
@@ -381,6 +517,35 @@ void _req_may_be_done(struct drbd_request *req, int error)
 
 		/* Update disk stats */
 		_drbd_end_io_acct(mdev, req);
 
+#ifdef VERIFY_HACK
+		int restart = 0;
+		if(ok && rw == WRITE) {
+			u8 checksum[CRC_SIZE];
+			int err;
+			//INFO("checksumming %ld\n", req->master_bio->bi_sector);
+			err = drbd_checksum_bio(req->master_bio, checksum);
+			if(!err) {
+				err = memcmp(req->checksum, checksum, sizeof(checksum));
+			}
+			if(err) {
+				INFO("bio checksum mismatch block=%ld\n", req->master_bio->bi_sector);
+				restart = 1;
+			}
+		}
+		if(restart) {
+			if(mdev->restart_count <= RESTART_MAX) {
+				INFO("queuing block=%ld for restart (count=%d)\n", req->master_bio->bi_sector, mdev->restart_count);
+				mdev->restart[mdev->restart_tail] = req->master_bio;
+				mdev->restart_tail = (mdev->restart_tail+1) % RESTART_MAX;
+				mdev->restart_count++;
+				req->master_bio = NULL;
+			} else {
+				INFO("Aieee, queue buffer is full (block=%ld)\n", req->master_bio->bi_sector);
+				restart = 0;
+			}
+		}
+		if(!restart)
+#endif
 		_complete_master_bio(mdev, req, ok ? 0 : (error ? error : -EIO));
 	} else {
@@ -887,6 +1052,11 @@ STATIC int drbd_make_request_common(struct drbd_conf *mdev, struct bio *bio)
 	}
 
 	if (rw == WRITE) {
 		remote = 1;
+#ifdef VERIFY_HACK
+		if(drbd_checksum_bio(bio, req->checksum)) {
+			ERR("error in checksumming algorithm\n");
+		}
+#endif
 	} else { /* READ || READA */
 		if (local) {
diff --git a/drbd/drbd_worker.c b/drbd/drbd_worker.c
index aff97cc..fe69dd2 100644
--- a/drbd/drbd_worker.c
+++ b/drbd/drbd_worker.c
@@ -247,6 +247,9 @@ BIO_ENDIO_TYPE drbd_endio_pri BIO_ENDIO_ARGS(struct bio *bio, int error)
 	spin_lock_irqsave(&mdev->req_lock, flags);
 	_req_mod(req, what, error);
 	spin_unlock_irqrestore(&mdev->req_lock, flags);
+#ifdef VERIFY_HACK
+	_restart_requests(mdev);
+#endif
 
 	BIO_ENDIO_FN_RETURN;
 }

[-- Attachment #3: drbd-8.2-copy.patch --]
[-- Type: text/x-diff, Size: 4635 bytes --]

diff --git a/drbd/drbd_int.h b/drbd/drbd_int.h
index 0281a46..c61856e 100644
--- a/drbd/drbd_int.h
+++ b/drbd/drbd_int.h
@@ -40,6 +40,17 @@
 #include <net/tcp.h>
 #include "lru_cache.h"
 
+#define COPY_HACK
+
+#ifdef COPY_HACK
+struct drbd_conf;
+struct page *drbd_pp_alloc(struct drbd_conf *mdev, gfp_t gfp_mask);
+void drbd_pp_free(struct drbd_conf *mdev, struct page *page);
+struct drbd_request;
+void drbd_page_copy(struct drbd_conf *mdev, struct drbd_request *req, struct bio *bio);
+void drbd_page_free(struct drbd_conf *mdev, struct drbd_request *req);
+#endif
+
 #ifdef __CHECKER__
 # define __protected_by(x) __attribute__((require_context(x,1,999,"rdwr")))
 # define __protected_read_by(x) __attribute__((require_context(x,1,999,"read")))
@@ -678,6 +689,9 @@ struct drbd_request {
 	unsigned long rq_state; /* see comments above _req_mod() */
 	int seq_num;
 	unsigned long start_time;
+#ifdef COPY_HACK
+	struct page * copy_page;
+#endif
 };
 
 struct drbd_barrier {
diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index dde171e..bd8e2dc 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -2665,7 +2665,6 @@ STATIC void drbd_destroy_mempools(void)
drbd_request_mempool = NULL; drbd_ee_cache = NULL; drbd_request_cache = NULL; - return; } diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c index 8ca228c..88d1103 100644 --- a/drbd/drbd_receiver.c +++ b/drbd/drbd_receiver.c @@ -92,7 +92,7 @@ void drbd_assert_breakpoint(struct drbd_conf *mdev, char *exp, /** * drbd_bp_alloc: Returns a page. Fails only if a signal comes in. */ -STATIC struct page *drbd_pp_alloc(struct drbd_conf *mdev, gfp_t gfp_mask) +struct page *drbd_pp_alloc(struct drbd_conf *mdev, gfp_t gfp_mask) { unsigned long flags = 0; struct page *page; @@ -167,7 +167,7 @@ STATIC struct page *drbd_pp_alloc(struct drbd_conf *mdev, gfp_t gfp_mask) return page; } -STATIC void drbd_pp_free(struct drbd_conf *mdev, struct page *page) +void drbd_pp_free(struct drbd_conf *mdev, struct page *page) { unsigned long flags = 0; int free_it; diff --git a/drbd/drbd_req.c b/drbd/drbd_req.c index e198a03..2d25450 100644 --- a/drbd/drbd_req.c +++ b/drbd/drbd_req.c @@ -839,6 +839,39 @@ STATIC int drbd_may_do_local_read(struct drbd_conf *mdev, sector_t sector, int s return (0 == drbd_bm_count_bits(mdev, sbnr, ebnr)); } +#ifdef COPY_HACK +void drbd_page_copy(struct drbd_conf *mdev, struct drbd_request *req, struct bio *bio) +{ + struct bio_vec *bvec; + int i; + bio_for_each_segment(bvec, bio, i) { + struct page *dst_page = drbd_pp_alloc(mdev, GFP_KERNEL); + struct page *src_page = bvec->bv_page; + void *dst = kmap(dst_page); + void *src = kmap(src_page); + set_page_private(dst_page, (unsigned long)req->copy_page); + req->copy_page = dst_page; + memcpy(dst, src, PAGE_SIZE); +#if 1 + bvec->bv_page = dst_page; +#endif + kunmap(src_page); + kunmap(dst_page); + } +} + +void drbd_page_free(struct drbd_conf *mdev, struct drbd_request *req) +{ + struct page * free; + while((free = req->copy_page)) { + req->copy_page = (struct page *)page_private(free); + set_page_private(free, (unsigned long)NULL); + drbd_pp_free(mdev, free); + } +} + +#endif + /* * general note: * looking at the 
state (conn, disk, susp, pdsk) outside of the spinlock that @@ -887,6 +920,11 @@ STATIC int drbd_make_request_common(struct drbd_conf *mdev, struct bio *bio) } if (rw == WRITE) { remote = 1; +#ifdef COPY_HACK + if(req->private_bio) { + drbd_page_copy(mdev, req, req->private_bio); + } +#endif } else { /* READ || READA */ if (local) { diff --git a/drbd/drbd_req.h b/drbd/drbd_req.h index 2c67f09..04c8556 100644 --- a/drbd/drbd_req.h +++ b/drbd/drbd_req.h @@ -294,6 +294,9 @@ static inline struct drbd_request *drbd_req_new(struct drbd_conf *mdev, req->sector = bio->bi_sector; req->size = bio->bi_size; req->start_time = jiffies; +#ifdef COPY_HACK + req->copy_page = NULL; +#endif INIT_HLIST_NODE(&req->colision); INIT_LIST_HEAD(&req->tl_requests); @@ -306,6 +309,10 @@ static inline struct drbd_request *drbd_req_new(struct drbd_conf *mdev, static inline void drbd_req_free(struct drbd_request *req) { +#ifdef COPY_HACK + if(req->mdev) + drbd_page_free(req->mdev, req); +#endif mempool_free(req, drbd_request_mempool); } ^ permalink raw reply related [flat|nested] 11+ messages in thread
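The verify logic of the checksumming patch above (take a checksum at submit time, recompute it on completion, and queue the write for restart on mismatch) can be sketched in plain userspace C. All names below are hypothetical stand-ins, not the kernel code; the XOR folding mirrors dummy_checksum() in spirit only:

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for dummy_checksum(): XOR all 64-bit words of the
 * buffer into res[0]/res[1]. len must be a multiple of 2*sizeof(uint64_t). */
static void xor_checksum(const void *data, size_t len, uint64_t res[2])
{
    const uint64_t *p = data;
    while (len >= 2 * sizeof(uint64_t)) {
        res[0] ^= *p++;
        res[1] ^= *p++;
        len -= 2 * sizeof(uint64_t);
    }
}

/* The rule applied on completion: recompute and compare against the
 * checksum taken at submit time. A mismatch means the upper layer
 * modified the buffer while the write was in flight, so the request
 * must be queued for restart. */
static int needs_restart(const void *data, size_t len,
                         const uint64_t submit_sum[2])
{
    uint64_t now[2] = { 0, 0 };
    xor_checksum(data, len, now);
    return memcmp(now, submit_sum, sizeof(now)) != 0;
}
```

Note that this only detects a modification whose XOR does not cancel out; the real patch keeps a (disabled) crypto_hash_* path for a stronger digest.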
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01  9:16         ` Thomas Schoebel-Theuer
@ 2008-10-01 10:46           ` Lars Ellenberg
  2008-10-01 11:28             ` Thomas Schoebel-Theuer
  0 siblings, 1 reply; 11+ messages in thread
From: Lars Ellenberg @ 2008-10-01 10:46 UTC (permalink / raw)
To: Thomas Schoebel-Theuer; +Cc: Michael Ziegel, Peter Scheitterer, drbd-dev

On Wed, Oct 01, 2008 at 11:16:41AM +0200, Thomas Schoebel-Theuer wrote:
> Hi Lars,
>
> I have two different patches, both based on your git version from 2008-09-11.

great.

> The first one implements the checksumming method while the second one
> implements the copy method proposed by you.
>
> I found that your suggestion was much easier to implement than the
> checksumming method. However, the copy method works completely silently (you
> get no response whether a race on the data is actually taking place or not),
> while the checksumming method writes reports to the kernel log (so you can
> observe the behavior).

I briefly had a look at the copy method.

as private_bio is a clone of master_bio,
it shares its bvec, thus the pages.
you assign the copy pages to the bvec->bv_page
(you have to do that, master_bio->bvec->bv_page is used in sendpage),
but you don't restore the original pages before completion;
you even free the pages already.

you
 * potentially leak the original pages
 * potentially double free the copy pages
as the submitter (file system, page cache, whatever) in its completion
callback may reference or free those.

did you try this on a debug kernel, memory poisoning and page allocation
debugging switched on? (of course, that is not the kernel you want to use
for performance benchmarks).

you need to remember the original pages, and you need to restore them
before completion of the master bio.
for diagnostics, you can "memcmp" on that occasion as well.

and you probably need to increase the "emergency reserve" amount of
pre-allocated pages, to avoid resource starvation deadlocks.

you should use copy_page or copy_highpage.

for performance impact, assuming that your memory bandwidth is much higher
than your IO bandwidth, this will "only" increase CPU cycles.
which is probably ok on a generic file server, but may become critical
when running a data base directly on drbd.
so, when doing benchmarks, have a look at the cpu usage as well.

I'll have a look at the checksumming method later.

thanks again,

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

^ permalink raw reply	[flat|nested] 11+ messages in thread
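The remember-and-restore rule described in this mail can be sketched in userspace C. The struct and function names are hypothetical stand-ins, not DRBD or kernel API; the idea is that real code would keep the original struct page pointers and swap them back before completing the master bio, with a memcmp for diagnostics:

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096

/* Hypothetical stand-in for one bio segment: one data page per entry. */
struct seg {
    unsigned char *pg;       /* page currently attached to the segment */
    unsigned char *orig_pg;  /* original page, remembered for restore  */
};

/* Swap a private copy into each segment, remembering the original. */
static void swap_in_copies(struct seg *v, int cnt)
{
    for (int i = 0; i < cnt; i++) {
        unsigned char *copy = malloc(PAGE_SZ);
        memcpy(copy, v[i].pg, PAGE_SZ);   /* snapshot of submit-time data */
        v[i].orig_pg = v[i].pg;           /* remember for restore          */
        v[i].pg = copy;
    }
}

/* On completion: diagnose races, restore originals, free the copies.
 * Returns the number of segments whose content changed "in flight". */
static int restore_and_diagnose(struct seg *v, int cnt)
{
    int races = 0;
    for (int i = 0; i < cnt; i++) {
        if (memcmp(v[i].pg, v[i].orig_pg, PAGE_SZ) != 0)
            races++;                /* upper layer re-dirtied the page     */
        unsigned char *copy = v[i].pg;
        v[i].pg = v[i].orig_pg;     /* restore before completing master bio */
        v[i].orig_pg = NULL;
        free(copy);
    }
    return races;
}
```

In the kernel the copy would come from the pre-allocated pool (drbd_pp_alloc) rather than malloc, precisely because of the starvation concern raised above.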
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 10:46           ` Lars Ellenberg
@ 2008-10-01 11:28             ` Thomas Schoebel-Theuer
  2008-10-01 11:46               ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Schoebel-Theuer @ 2008-10-01 11:28 UTC (permalink / raw)
To: drbd-dev

> as private_bio is a clone of master_bio,
> it shares its bvec, thus the pages.

Oops, I didn't realize that (and I tested the copy version only briefly
because I was more interested in the checksumming because of the kernel log
messages). The implications are clear to me.

IMHO the best solution would be the existence of a bio_copy() in addition to
bio_clone() in bio.h, but even if we got it into the upstream kernel it
would not help for older kernels.

> you need to remember the original pages, and you need to restore them
> before completion of the master bio.

Well, one problem is that the length of the bvec could be nearly arbitrary
(in theory), so pre-claiming "enough" space in struct drbd_request is
probably no good idea. kmalloc() would be a possibility to keep the
allocation dynamic without wasting space.

But what about simply generating a completely new bio and copying over all
the stuff by hand? This would mean implementing some sort of bio_copy() in
the local code, which could later be lifted upstream if other people liked
it too. What do you think is better?

Thanks for feedback,

Thomas

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 11:28             ` Thomas Schoebel-Theuer
@ 2008-10-01 11:46               ` Lars Ellenberg
  2008-10-01 12:49                 ` Thomas Schoebel-Theuer
  0 siblings, 1 reply; 11+ messages in thread
From: Lars Ellenberg @ 2008-10-01 11:46 UTC (permalink / raw)
To: drbd-dev

On Wed, Oct 01, 2008 at 01:28:02PM +0200, Thomas Schoebel-Theuer wrote:
> > as private_bio is a clone of master_bio,
> > it shares its bvec, thus the pages.
>
> Oops, I didn't realize that (and I tested the copy version only briefly
> because I was more interested in the checksumming because of the kernel log
> messages). The implications are clear to me.
>
> IMHO the best solution would be the existence of a bio_copy() in addition to
> bio_clone() in bio.h, but even if we got it into the upstream kernel it
> would not help for older kernels.
>
> > you need to remember the original pages, and you need to restore them
> > before completion of the master bio.
>
> Well, one problem is that the length of the bvec could be nearly arbitrary
> (in theory),

BIO_MAX_PAGES
DRBD_MAX_SEGMENT_SIZE

> so pre-claiming "enough" space in struct drbd_request is probably no
> good idea. kmalloc() would be a possibility to keep the allocation dynamic
> without wasting space.

I meant the pre-allocation in drbd_pp_alloc: the number of pages.

> But what about simply generating a completely new bio and copying over all
> the stuff by hand? This would mean implementing some sort of bio_copy() in
> the local code, which could later be lifted upstream if other people liked
> it too. What do you think is better?

no, I think it would be enough to just set (pseudo code)
	copy_page->private = orig_page;
	bvec->bv_page = copy_page;
then, later before completion,
	bio_for_each_segment
		memcmp, restore orig
		drbd_pp_free copy

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 11:46               ` Lars Ellenberg
@ 2008-10-01 12:49                 ` Thomas Schoebel-Theuer
  2008-10-01 14:59                   ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Schoebel-Theuer @ 2008-10-01 12:49 UTC (permalink / raw)
To: drbd-dev

> > Well, one problem is that the length of the bvec could be nearly arbitrary
> > (in theory),
>
> BIO_MAX_PAGES
> DRBD_MAX_SEGMENT_SIZE

Ok.

> > But what about simply generating a completely new bio and copying over
> > all the stuff by hand? This would mean implementing some sort of
> > bio_copy() in the local code, which could later be lifted upstream
> > if other people liked it too. What do you think is better?
>
> no, I think it would be enough to just set (pseudo code)
> 	copy_page->private = orig_page;

Hmm, what if the caller used orig_page->private already and is now accessing
the page _in parallel_ to us? IMHO in combination with the next step it
_could_ lead to an observable difference:

> 	bvec->bv_page = copy_page;

IMHO this could cause a side effect for any observer accessing "his" page
via "his" orig_bio and finally arriving at our copy_page->private. I am not
sure whether this really happens anywhere in the kernel, but even if it
currently does not, it could happen sometime in the future. The idea behind
bio_copy() was to _never_ touch the original and all its transitively
reachable descendants in any way, just to be sure nothing can ever go wrong
with it (similar to COW style). The performance might be slightly worse,
but we are reasoning about correctness at a rather tricky high level.

I am not sure whether it really does any harm; I'm just curious and cautious.

Well, before implementing it I will reason about it for some time, just to
be sure and convinced. Maybe I should implement both alternatives and do
some stress-testing on a debug kernel?

Thanks for your feedback,

Thomas

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 12:49                 ` Thomas Schoebel-Theuer
@ 2008-10-01 14:59                   ` Lars Ellenberg
  2009-01-26 11:03                     ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread
From: Lars Ellenberg @ 2008-10-01 14:59 UTC (permalink / raw)
To: Thomas Schoebel-Theuer; +Cc: drbd-dev

On Wed, Oct 01, 2008 at 02:49:22PM +0200, Thomas Schoebel-Theuer wrote:
> > > Well, one problem is that the length of the bvec could be nearly
> > > arbitrary (in theory),
> >
> > BIO_MAX_PAGES
> > DRBD_MAX_SEGMENT_SIZE
>
> Ok.
>
> > > But what about simply generating a completely new bio and copying over
> > > all the stuff by hand? This would mean implementing some sort of
> > > bio_copy() in the local code, which could later be lifted upstream
> > > if other people liked it too. What do you think is better?
> >
> > no, I think it would be enough to just set (pseudo code)
> > 	copy_page->private = orig_page;
>
> Hmm, what if the caller used orig_page->private already and is now accessing
> the page _in parallel_ to us? IMHO in combination with the next step it
> _could_ lead to an observable difference:
>
> > 	bvec->bv_page = copy_page;
>
> IMHO this could cause a side effect for any observer accessing "his" page
> via "his" orig_bio and finally arriving at our copy_page->private. I am not
> sure whether this really happens anywhere in the kernel, but even if it
> currently does not, it could happen sometime in the future. The idea behind
> bio_copy() was to _never_ touch the original and all its transitively
> reachable descendants in any way, just to be sure nothing can ever go wrong
> with it (similar to COW style). The performance might be slightly worse,
> but we are reasoning about correctness at a rather tricky high level.
>
> I am not sure whether it really does any harm; I'm just curious and cautious.

I think that is a reasonable attitude ;)

so we just don't bio_clone in drbd_req_new then,
but bio_alloc a fresh bio with our own bvec and own pages attached,
which will be submitted, _and_ be used to sendpage it over.
it has to be properly initialized, obviously.

we have to make sure to get an extra reference (bio_get) on that private
bio then, so it cannot vanish while being handed over to tcp; it has to
stay around until master bio completion (where an extra bio_put will
free it).

_maybe_ we could add a full struct bio private_bio + reasonably sized bvec
array member to struct drbd_request, so we can skip the extra bio_alloc,
_get and _put (unless there is some paranoia code in the generic block
layer, which may be unhappy with that).

> Well, before implementing it I will reason about it for some time, just to
> be sure and convinced. Maybe I should implement both alternatives

implement the "most correct" alternative.

> and do some stress-testing on a debug kernel?

absolutely.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

^ permalink raw reply	[flat|nested] 11+ messages in thread
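The bio_get/bio_put lifetime rule described in this mail (one reference for the in-flight local I/O, one extra while tcp sendpage may still touch the pages) reduces to plain reference counting. A minimal userspace sketch with hypothetical names, not the kernel's struct bio:

```c
#include <assert.h>

/* Minimal stand-in for a refcounted bio; freed == 1 models bio_free(). */
struct fake_bio {
    int refcnt;
    int freed;
};

static void fbio_get(struct fake_bio *b)
{
    b->refcnt++;
}

static void fbio_put(struct fake_bio *b)
{
    if (--b->refcnt == 0)
        b->freed = 1;   /* last reference gone: the bio may be freed */
}
```

The point of the extra reference is ordering: neither the disk completion nor the tcp path alone may free the bio; only the final put, after master bio completion, does.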
* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 14:59                   ` Lars Ellenberg
@ 2009-01-26 11:03                     ` Lars Ellenberg
  0 siblings, 0 replies; 11+ messages in thread
From: Lars Ellenberg @ 2009-01-26 11:03 UTC (permalink / raw)
To: drbd-dev

hi there...

any news on this?

also, please consider testing with linux 2.6.28.2; there are a few data
integrity relevant fixes for the generic dirty page write-out path in
there...

On Wed, Oct 01, 2008 at 04:59:10PM +0200, Lars Ellenberg wrote:
> On Wed, Oct 01, 2008 at 02:49:22PM +0200, Thomas Schoebel-Theuer wrote:
> > > > Well, one problem is that the length of the bvec could be nearly
> > > > arbitrary (in theory),
> > >
> > > BIO_MAX_PAGES
> > > DRBD_MAX_SEGMENT_SIZE
> >
> > Ok.
> >
> > > > But what about simply generating a completely new bio and copying over
> > > > all the stuff by hand? This would mean implementing some sort of
> > > > bio_copy() in the local code, which could later be lifted upstream
> > > > if other people liked it too. What do you think is better?
> > >
> > > no, I think it would be enough to just set (pseudo code)
> > > 	copy_page->private = orig_page;
> >
> > Hmm, what if the caller used orig_page->private already and is now
> > accessing the page _in parallel_ to us? IMHO in combination with the next
> > step it _could_ lead to an observable difference:
> >
> > > 	bvec->bv_page = copy_page;
> >
> > IMHO this could cause a side effect for any observer accessing "his" page
> > via "his" orig_bio and finally arriving at our copy_page->private. I am
> > not sure whether this really happens anywhere in the kernel, but even if
> > it currently does not, it could happen sometime in the future. The idea
> > behind bio_copy() was to _never_ touch the original and all its
> > transitively reachable descendants in any way, just to be sure nothing
> > can ever go wrong with it (similar to COW style). The performance might
> > be slightly worse, but we are reasoning about correctness at a rather
> > tricky high level.
> >
> > I am not sure whether it really does any harm; I'm just curious and
> > cautious.
>
> I think that is a reasonable attitude ;)
>
> so we just don't bio_clone in drbd_req_new then,
> but bio_alloc a fresh bio with our own bvec and own pages attached,
> which will be submitted, _and_ be used to sendpage it over.
> it has to be properly initialized, obviously.
>
> we have to make sure to get an extra reference (bio_get) on that private
> bio then, so it cannot vanish while being handed over to tcp; it has to
> stay around until master bio completion (where an extra bio_put will
> free it).
>
> _maybe_ we could add a full struct bio private_bio + reasonably sized bvec
> array member to struct drbd_request, so we can skip the extra bio_alloc,
> _get and _put (unless there is some paranoia code in the generic block
> layer, which may be unhappy with that).
>
> > Well, before implementing it I will reason about it for some time, just
> > to be sure and convinced. Maybe I should implement both alternatives
>
> implement the "most correct" alternative.
>
> > and do some stress-testing on a debug kernel?
>
> absolutely.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads: [~2009-01-26 11:03 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-09-09 14:02 [Drbd-dev] Behaviour of verify: false positives -> true positives schoebel
2008-09-11  9:25 ` Lars Ellenberg
2008-09-11  9:37   ` Lars Ellenberg
2008-09-30 20:10     ` Lars Ellenberg
2008-10-01  9:16       ` Thomas Schoebel-Theuer
2008-10-01 10:46         ` Lars Ellenberg
2008-10-01 11:28           ` Thomas Schoebel-Theuer
2008-10-01 11:46             ` Lars Ellenberg
2008-10-01 12:49               ` Thomas Schoebel-Theuer
2008-10-01 14:59                 ` Lars Ellenberg
2009-01-26 11:03                   ` Lars Ellenberg