* [Drbd-dev] Behaviour of verify: false positives -> true positives
@ 2008-09-09 14:02 schoebel
  2008-09-11  9:25 ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread
From: schoebel @ 2008-09-09 14:02 UTC (permalink / raw)
  To: drbd-dev

[-- Attachment #1: Type: text/plain, Size: 8237 bytes --]

Hi,

my company is considering drbd for building failover clusters in shared 
hosting.

During our preliminary tests, we noticed that a "drbdadm verify /dev/drbdx" 
detects differences on a heavily loaded test server (several thousand 
customers).

We noticed two kinds of verify differences: one is clearly temporary (not 
repeatable), but the other is persistent, even after unmounting the 
filesystem.

According to the manpage on drbd.conf (section "notes on data integrity"), 
these should be "false positives". Indeed, we found no real corruption (all 
differing blocks were associated with deleted files).

However, this means that verify is (from _our_ point of view) no _reliable_ 
check for data integrity. Since the integrity of our valuable customer data 
is of great concern to us, we are looking for ways to change the behavior 
so that no false positives are reported any more, i.e. any difference 
reported by verify should be _guaranteed_ to be a "true positive". In my 
humble opinion, so-called "mission critical" applications demand that in 
general.

From my understanding of the kernel architecture, I believe the block 
differences are caused by an _intended_ race in the kernel at the buffer 
cache level. Whenever a block gets dirty, there is (deliberately) no lock 
for any consumer of the buffer cache (such as ext3) that would prevent it 
from (re-)modifying the block while it is being written to an ordinary disk 
(not even necessarily a drbd device). IMHO, this deliberate race is 
_crucial_ for kernel performance (and thus I don't want to dispute it).

Normally, this race is no problem at all, even if an inconsistent block 
(half of the 512-byte block old, the other half the new version) is written 
to disk: the dirty bit is simply set again at the buffer cache level, 
leading to another writeout which eventually fixes the problem.

I believe (but am not 100% sure; please comment) that this model can 
explain the temporary differences seen by a drbd verify: when the data 
_content_ is mirrored across the network, the local disk version may carry 
a different "timestamp" (in _real_ time) than the version transmitted over 
the network. Thus, _any_ IO of dirty blocks can be inconsistent due to the 
deliberate kernel race on data block _content_ (dereferencing of 
buffer_head->b_data in parallel to disk IO). With ordinary load patterns, 
the "chance" of seeing _temporary_ false positives caused by that race is 
probably extremely low (perhaps one in some billions). But on a heavily 
loaded system we have observed it from time to time. Attached below are 
some tiny perl scripts which can reproduce temporary false positives with a 
fairly good chance, at least on our test systems (I'd be glad to receive 
reproductions and experience from other users too). Just start 
io_rewrite_sector2.pl on a (preferably empty) ext3 filesystem on top of 
drbd, _in parallel_ to a "drbdadm verify" of the corresponding device 
(other filesystems not yet tested). My test filesystem was about 2 GB in 
size. You might have to adjust some constants to reproduce the test.

So far I have been reasoning about the _temporary_ false positives. 
Persistent false positives could be explained by the following theory:

When a file is eventually deleted or truncated, bforget() is called at the 
buffer cache interface. After that, dirty blocks are no longer transferred 
to disk, in order to save IO load (IMHO this is _crucial_ for typical 
access patterns on /tmp/, where typical lifetimes are often less than 1 
second). As a consequence, the above-mentioned "fixing" of inconsistent 
blocks is no longer carried out, and long-term differences can remain on 
the mirrored device, but belonging to deleted files only. Again, the chance 
of observing that is very low, but I have written another tiny perl script 
to reproduce it. Just start test_orphan.pl on an _empty_ drbd-mounted 
filesystem, and _afterwards_ check it with verify. Since the filesystem is 
empty again after the test, you can be sure that the differences belong to 
empty or orphan files (if they belonged to filesystem metadata, you would 
notice that by inspecting the contents with dd). By investigating the 
different counter values with dd on the underlying devices (primary vs 
secondary), you can even tell which version was written first. 
Interestingly, I found that most of the time the local version was the 
older one, but sometimes it was vice versa.

As a side note: test_orphan.pl also produces orphan files. Sometimes the 
counter values observed in the differing blocks are lower than the point of 
unlink() [currently set to 500], but most of the time the counter value is 
greater. It seems to me that applications producing orphan files raise the 
chance of observing false positives, but I am not sure. I don't have a 
theory for that yet; please comment. It might be influenced by the timings 
of the block IO daemon and/or by the load patterns.

-------

Now I am reasoning about different solutions. Please comment.

Here are just some brainstorming ideas, without judging their quality (this 
will come later):

1) Try to avoid the kernel race on buffer content, specifically for drbd 
devices, as an _option_ (which is _off_ by default). There are at least two 
sub-variants of that:
1a) use locking
1b) use an idea published by Herlihy for conflict-free resolution of 
different _versions_ of blocks, either on the fly or optionally residing in 
the buffer cache _in parallel_ [nb: the latter would probably result in a 
major rewrite of large portions of the kernel, not to be disputed here on 
this list]

2) Whenever drbd-verify sees a difference, retry the comparison a few times 
after a short delay (possibly with exponential backoff), until _temporary_ 
differences have been filtered out. Persistent differences will not be 
tackled by that.

3) As an addition to 2), add an _option_ (which is _off_ by default) to the 
buffer cache code to submit bforgotten() blocks specifically to drbd 
devices.

4) Add an option to drbd (as usual, _off_ by default) which calculates a 
checksum on _every_ arriving IO request _first_ (before starting any 
sub-request). After finishing both the local and the remote sub-IO, 
calculate the checksum again and compare. If a difference is found, restart 
both sub-transmissions until no mismatch is found any more.
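A hypothetical userspace sketch of 4) (submit_until_stable, cksum and 
simulate_racy_submit are invented names, not drbd code): checksum the data 
before submitting, recompute after both sub-IOs have finished, and resubmit 
while the two values disagree.

```c
#include <assert.h>
#include <stddef.h>

/* submit() stands in for starting the local and remote sub-requests;
 * during a real submit the application may scribble on the buffer. */
typedef void (*submit_fn)(unsigned char *data, size_t len, void *ctx);

static unsigned long cksum(const unsigned char *data, size_t len)
{
    unsigned long sum = 5381;
    size_t i;
    for (i = 0; i < len; i++)
        sum = sum * 131 + data[i];      /* simple polynomial hash */
    return sum;
}

/* Returns the number of (re)submissions needed until the data stayed
 * stable across one full submit cycle. */
int submit_until_stable(unsigned char *data, size_t len,
                        submit_fn submit, void *ctx)
{
    int tries = 0;
    unsigned long before, after;
    do {
        before = cksum(data, len);      /* checksum on arrival */
        submit(data, len, ctx);         /* local + remote sub-IO */
        after = cksum(data, len);       /* checksum after completion */
        tries++;
    } while (before != after);          /* modified in flight: redo both */
    return tries;
}

/* simulated "racy" submit: scribbles on the buffer while *(int *)ctx
 * racy rounds remain, mimicking an application rewriting the block */
static void simulate_racy_submit(unsigned char *data, size_t len, void *ctx)
{
    int *races_left = ctx;
    (void)len;
    if (*races_left > 0) {
        data[0] ^= 0xff;
        (*races_left)--;
    }
}
```

With one simulated race the request is submitted twice; with none it goes 
out once, which is where the claimed low overhead for common workloads 
would come from.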

5) As a refinement of 4), first filter out the temporary false positives by 
means of 2). Additionally try to identify bforgotten() blocks at the buffer 
cache level and submit them only _once_ after a bforget(), and only to drbd 
devices where the corresponding option is set. Then drbd uses method 4) 
_only_ on those blocks, thereby minimizing the performance impact of 4) to a 
rare special case.

6) Try to establish a complete solution in the presence of races solely at 
the buffer cache level, without affecting drbd in any way. I am not sure 
whether this is possible. The raw idea is to _identify_ all races 
_reliably_(!) when they _actually_ occur (in contrast to their _possible_ 
occurrence). Theoretically, b_count, b_state and/or other/similar means 
should be available to detect actually occurring races during IO, but I am 
extremely unsure whether this is possible _reliably_ without additional 
means such as checksumming. This is probably the wrong list for discussing 
that topic, but I would be thankful for hints and ideas before going 
elsewhere with an incomplete idea.

Now what I personally think of it: 1a) has too strong a performance impact, 
1b) would probably complicate the kernel by orders of magnitude, 2) is easy 
but does not solve all problems, 3) is feasible for non-general 
applications (outside of /tmp/ etc.) but could lead to fundamental 
discussions with kernel developers (forcing us into an in-house patch), 4) 
solves it all very easily but hurts performance, 5) is very complicated, 
and 6) is not a mature idea yet.

Currently, I would prefer 4) as an option for testing and for gaining 
experience at the first try, but depending on performance results other 
mechanisms should be evaluated.

Of course, other people have other ideas and opinions, so please feel free to 
comment.

Thanks for your patience,

Thomas

[-- Attachment #2: io_rewrite_sector2.pl --]
[-- Type: application/x-perl, Size: 303 bytes --]

[-- Attachment #3: test_orphan.pl --]
[-- Type: application/x-perl, Size: 492 bytes --]


* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-09-09 14:02 [Drbd-dev] Behaviour of verify: false positives -> true positives schoebel
@ 2008-09-11  9:25 ` Lars Ellenberg
  2008-09-11  9:37   ` Lars Ellenberg
  2008-09-30 20:10   ` Lars Ellenberg
  0 siblings, 2 replies; 11+ messages in thread
From: Lars Ellenberg @ 2008-09-11  9:25 UTC (permalink / raw)
  To: drbd-dev

On Tue, Sep 09, 2008 at 04:02:30PM +0200, schoebel wrote:
> Hi,
> 
> my company is considering drbd for building up failover clusters in
> shared hosting.
> 
> During our preliminary tests, we noticed that a "drbdadm verify
> /dev/drbdx" detects differences on a heavily loaded test server
> (several thousand customers).
> 
> We noticed two kind of verify differences: one is surely temporary
> (not repeatable), but the other is persistent, even after umounting
> the filesystem.
> 
> According to the manpage on drbd.conf (section "notes on data integrity"), 
> these should be "false positives".  Indeed, we found no real corruptions (all 
> different blocks were associated with deleted files).
> 
> However, this means that verify is (in _our_ point of view) no _reliable_ 
> check for data integrity. Since data integrity of our valuable customer data 
> is of great concern for us, we look for possibilities to change the behavior 
> such that no false positives are reported any more, i.e. any difference 
> reported by verify should be _guaranteed_ to be a "true positive". In my 
> humble opinion, so-called "mission critical" applications demand for that in 
> general.
> 
> In my understanding of kernel architecture, I believe the block differences 
> are caused by an _intended_ race in the kernel at buffer cache level. 
> Whenever a block gets dirty, there is (deliberately) no lock for any consumer 
> of the buffer cache (such as ext3) which would prevent it from 
> (re-)modification while the block is being written to an ordinary disk (even 
> not necessarily a drbd device). IMHO, this deliberate race is _crucial_ for 
> kernel performance (and thus I don't want to dispute on it).
> 
> Normally, this race should be no problem at all, even if an inconsistent block 
> (half of the 512-byte block old, other half new version) is written to disk: 
> the dirty-bit is just set again by the buffer cache level again, leading to 
> another writeout which eventually fixes the problem.
> 
> I believe (but not 100% sure; please comment) that this model can explain the 
> temporary differences seen by a drbd verify: when the data _content_ is 
> mirrored across the network, the local disk version might have 
> another "timestamp" (in _real_ time) than the version transmitted to 
> ethernet. Thus, _any_ IO of dirty blocks might be inconsistent due to the 
> deliberate kernel race on data block _content_ (dereferencing of 
> buffer_head->b_data in parallel to disk IO). With ordinary load patterns, 
> the "chance" to see _temporary_ false positives caused by that race is 
> probably extremely low (perhaps one to some billion). But on a heavily loaded 
> system we have observed it from time to time. Attached below are some tiny 
> perl scripts which can reproduce temporary false positives with a fairly good 
> chance at least on our test systems (I'd be glad to receive reproductions and 
> experience from other users too). Just start io_rewrite_sector2.pl on an 
> (preferably empty) ext3 filesystem on top of drbd, _in parallel_ to a "drbd 
> verify /dev/corresponding_device" (other filesystems not yet tested). My test 
> filesystem for this test was about 2GB size. Possibly you might have to 
> adjust some constants to reproduce the test.
> 
> Up to now, I was reasoning on the _temporary_ false positives. Persistent 
> false positives could be explained by the following theory:
> 
> When a file is eventually deleted or truncated, bforget() is called at the 
> buffer cache interface. After that, dirty blocks are no longer transferred to 
> disk, in order to save IO load (IMHO this is _crucial_ for typical access 
> patterns on /tmp/ where typical lifetimes are often less than 1 second). As a 
> consequence, the above-mentioned "fixing" of inconsistent blocks is no longer 
> carried out and long-term differences can remain on the mirrored device, but 
> belonging to deleted files only. Again, the chance to observe that is very 
> low, but I have written another tiny perl script to reproduce that. Just 
> start test_orphan.pl on an _empty_ drbd-mounted filesystem, and _afterwards_ 
> check it with verify. Since the filesystem is empty again after the test, you 
> can be sure that the differences belong to empty or orphan files (if it would 
> belong to filesystem metadata, you would notice that by inspection of the 
> contents with dd). By investigating the different counter values with dd on 
> the underlying devices (primary vs secondary), you can even tell which 
> version was written first. Interestingly, I found that most of the time the 
> local version was the older one, but sometimes it was vice versa.
> 
> As a side note: test_orphan.pl also produces orphan files. Sometimes the 
> counter values observed in different blocks are lower than the point of 
> unlink() [currently set to 500], but most of the time the counter value is 
> greater. It seems to me that applications producing orphan files raise the 
> chance to observe false positives, but I am not sure. I have not yet a theory 
> for that; please comment. It might be influenced by the timings of the block 
> IO demon and/or by the load patterns.

interesting test setup and solid analysis, I'd say.

> -------
> 
> Now I am reasoning on different solutions. Please comment.
> 
> Here are just some brainstorming ideas, without judging on their quality (this 
> will come later):
> 
> 1) Try to avoid the kernel race on buffer content, specifically for drbd 
> devices as an _option_ (which is _off_ by default). There are at least two 
> sub-variants of that:
> 1a) use locking
> 1b) use an idea published by Herlihy for conflict-free resolution of different 
> _versions_ of blocks, either on the fly or optionally residing in the buffer 
> cache _in parallel_ [nb probably the latter could result in a major rewrite 
> of large portions of the kernel, not to be disputed here on this list]
> 
> 2) Whenever drbd-verify sees a difference, retry the comparison a few times 
> after a short delay (possibly with exponential backoff), until _temporary_ 
> differences have been filtered out. Persistent differences will not be 
> tackled by that.
> 
> 3) As an addition to 2), add an _option_ (which is _off_ by default) to the 
> buffer cache code  to submit bforgotten() blocks specifically to drbd 
> devices.
> 
> 4) Add an option to drbd (as usual _off_ by default), which calculates a 
> checksum on _every_ arriving IO request _first_ (before starting any 
> sub-request). After finishing both the local and remote sub-IO, calculate the 
> checksum again and compare. If a difference is found, restart both 
> sub-transmissions again, until no mismatches are found any more.
> 
> 5) As a refinement of 4), first filter out the temporary false positives by 
> means of 2). Additionally try to identify bforgotten() blocks at the buffer 
> cache level and submit them only _once_ after a bforget(), and only to drbd 
> devices where the corresponding option is set. Then drbd uses method 4) 
> _only_ on those blocks, thereby minimizing the performance impact of 4) to a 
> rare special case.
> 
> 6) Try to establish a complete solution in presence of races solely at the 
> buffer cache level, without affecting drbd in any way. I am not sure whether 
> this is possible. The raw idea is to _identify_ all races _reliably_(!) when 
> they _actually_ occur (as in contrast to _possible_ occurrence). 
> Theoretically b_count, b_state and/or other/similar means should be available 
> to detect actually occurring races during IO, but I am extremely unsure 
> whether this is possible _reliably_ without additional means such as 
> checksumming. Probably this is the wrong list for discussing this topic, but 
> I would be thankful for hints and ideas before going elsewhere with an 
> uncomplete idea.
> 
> Now what I personally think of it: 1a) has too strong performance impact, 1b) 
> would probably complicate the kernel by magnitudes, 2) is easy but does not 
> solve all problems, 3) is feasible for non-general applications (outside 
> of /tmp/ etc) but probably could lead to fundamental discussions with kernel 
> developer (forcing us into an inhouse patch), 4) solves it all very easy but 
> hurts performance, 5) is very complicated, and 6) is no mature idea yet.
> 
> Currently, I would prefer 4) as an option for testing and for gaining 
> experience at the first try, but depending on performance results other 
> mechanisms should be evaluated.
> 
> Of course, other people have other ideas and opinions, so please feel free to 
> comment.


how about holding a ring buffer of pinned pages,
say "drbd max-buffer" pages plus some,
and in drbd_make_request,
memcpy(ring buffer, submitted data)
then checksum that,
and submit it to localdisk as well as to tcp stack.
as these are now "private" data buffers,
the "application" cannot possibly modify them in-flight.

yes, not necessarily performant.
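roughly, in a userspace sketch (stabilize_for_submit is a made-up name,
just to illustrate the copy semantics, not actual drbd code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* before submitting, copy the request data into a private buffer the
 * application cannot reach; the in-flight copy then stays stable no
 * matter what the application does to the original pages.
 * malloc() stands in for a pinned ring-buffer page here. */
unsigned char *stabilize_for_submit(const unsigned char *data, size_t len)
{
    unsigned char *priv = malloc(len);
    if (priv)
        memcpy(priv, data, len);     /* snapshot taken at submit time */
    return priv;                     /* submit *this* to disk and tcp */
}
```

the checksum can then be taken over the private copy, and cannot change
under the submitter's feet.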

> Thanks for your patience,

thanks for sharing your findings.

-- 
: Lars Ellenberg                
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-09-11  9:25 ` Lars Ellenberg
@ 2008-09-11  9:37   ` Lars Ellenberg
  2008-09-30 20:10   ` Lars Ellenberg
  1 sibling, 0 replies; 11+ messages in thread
From: Lars Ellenberg @ 2008-09-11  9:37 UTC (permalink / raw)
  To: drbd-dev

On Thu, Sep 11, 2008 at 11:25:12AM +0200, Lars Ellenberg wrote:
> On Tue, Sep 09, 2008 at 04:02:30PM +0200, schoebel wrote:
> > Hi,
> > 
> > my company is considering drbd for building up failover clusters in
> > shared hosting.
> > 
> > During our preliminary tests, we noticed that a "drbdadm verify
> > /dev/drbdx" detects differences on a heavily loaded test server
> > (several thousand customers).
> > 
> > We noticed two kind of verify differences: one is surely temporary
> > (not repeatable), but the other is persistent, even after umounting
> > the filesystem.
> > 
> > According to the manpage on drbd.conf (section "notes on data integrity"), 
> > these should be "false positives".  Indeed, we found no real corruptions (all 
> > different blocks were associated with deleted files).
> > 
> > However, this means that verify is (in _our_ point of view) no _reliable_ 
> > check for data integrity. Since data integrity of our valuable customer data 
> > is of great concern for us, we look for possibilities to change the behavior 
> > such that no false positives are reported any more, i.e. any difference 
> > reported by verify should be _guaranteed_ to be a "true positive". In my 
> > humble opinion, so-called "mission critical" applications demand for that in 
> > general.

now, another thought here.

any application doing what your test perl scripts do
is risking its data, and is not crash safe.

apparently there are many of those.

if an application starts to rewrite a portion of a file it has written
to before, it should fsync() at some point, before it starts the
overwrite, so the buffers will be on disk
(ok, "on device", not necessarily on rotating rust yet,
 but, while related, that is another topic).
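sketched in userspace C (illustrative only; rewrite_safely and the path
used in the test are made up):

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* the "behaving application" pattern: fsync() before rewriting a
 * region written earlier, so the old buffers reach the device before
 * the overwrite can dirty them again */
int rewrite_safely(const char *path, const char *old_data,
                   const char *new_data, size_t len)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, old_data, len) != (ssize_t)len)
        goto fail;
    if (fsync(fd) != 0)                 /* old version is on the device */
        goto fail;
    if (lseek(fd, 0, SEEK_SET) < 0)
        goto fail;
    if (write(fd, new_data, len) != (ssize_t)len)
        goto fail;
    if (fsync(fd) != 0)                 /* so is the new one */
        goto fail;
    return close(fd);
fail:
    close(fd);
    return -1;
}
```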

so as long as you run a postfix mailq
or a database tablespace on DRBD,
or similar applications that know how to behave,
I'd not expect any such "modify of in-flight buffers" to happen.

any application that does "not behave" is risking data integrity
(in the event of [application, file system, device or node] crash /
 power outage / failover).

For any "behaving" application, the "drbd verify integrity check" is
expected to be your "_reliable_" from above.

Am I missing something?

-- 
: Lars Ellenberg                
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-09-11  9:25 ` Lars Ellenberg
  2008-09-11  9:37   ` Lars Ellenberg
@ 2008-09-30 20:10   ` Lars Ellenberg
  2008-10-01  9:16     ` Thomas Schoebel-Theuer
  1 sibling, 1 reply; 11+ messages in thread
From: Lars Ellenberg @ 2008-09-30 20:10 UTC (permalink / raw)
  To: drbd-dev

On Thu, Sep 11, 2008 at 11:25:12AM +0200, Lars Ellenberg wrote:
> On Tue, Sep 09, 2008 at 04:02:30PM +0200, schoebel wrote:

...

> > Now I am reasoning on different solutions. Please comment.
> > 
> > Here are just some brainstorming ideas, without judging on their quality (this 
> > will come later):

...

> how about holding a ring buffer of pinned pages,
> say "drbd max-buffer" pages plus some,
> and in drbd_make_request,
> memcpy(ring buffer, submitted data)
> then checksum that,
> and submit it to localdisk as well as to tcp stack.
> as these are now "private" data buffers,
> the "application" cannot possibly modify them in-flight.

have you made any progress on that matter?

shall we further discuss whether or not it should be "fixed" (in drbd),
and if so, how?

or have you decided to just move along?

-- 
: Lars Ellenberg                
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-09-30 20:10   ` Lars Ellenberg
@ 2008-10-01  9:16     ` Thomas Schoebel-Theuer
  2008-10-01 10:46       ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Schoebel-Theuer @ 2008-10-01  9:16 UTC (permalink / raw)
  To: drbd-dev; +Cc: Michael Ziegel, Peter Scheitterer

[-- Attachment #1: Type: text/plain, Size: 3527 bytes --]

Hi Lars,

I have two different patches, both based on your git version from 2008-09-11.

The first one implements the checksumming method while the second one 
implements the copy method proposed by you.

I found that your suggestion was much easier to implement than the 
checksumming method. However, the copy method works completely silently 
(you get no feedback on whether a race on the data is actually taking place 
or not), while the checksumming method writes reports to the kernel log (so 
you can observe the behavior).

I have also run some performance tests with iozone, but could not find any 
significant difference between any of the patched and unpatched versions 
(and, of course, no difference between the two alternatives). iozone's 
measurements seem to produce "noise" in the range of about 5%. Theoretical 
considerations tell me that both patches should slow down overall system 
performance by about 1% at most, so I could not determine anything from my 
single test runs (I had no time to run them repeatedly, 100 or 1000 times).

Enclosed you will find both patches. They are intended only for _testing_ 
(e.g. performance benchmarks), not for production use. I would be glad if 
someone could report any experience with them, e.g. performance tests 
(probably with better testing tools).

Currently both methods are enabled by a preprocessor constant (just for 
testing). For production use, we should do it right. Since both methods are 
in fact some sort of "paranoia options" (similar to the CRCs already 
present), I would ask whether we should enable/disable them by drbdadm via 
the kernel interface, leading to a new interface revision [oh, as a side 
note, I found that git revision 2008-09-11 crashed with userspace tools 8.2.5 
although the interface revision claimed to be compatible, or did I miss 
something there?].

Should I also do the work on the interfaces, or do you want to integrate it 
yourself according to your own conventions?

Another issue with the checksumming method: I had problems interfacing with 
the crypto API (probably a wrongly nested kmap/kunmap bug), and it seems 
that different kernel versions do it differently, so I wrote my own small 
XOR checksum algorithm (almost trivial). I'm not sure whether this quick 
hack is really good. And there is some dead code around the crypto API, 
which I should either remove or clean up (e.g. by moving global_tfm to 
struct drbd_conf). Currently this is all a little hackish, not for 
production use. Do you have any suggestions on how to clean this up to 
match your style?
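For illustration, the XOR quick hack boils down to something like this in 
plain userspace C (a simplified restatement, not the exact patch code; 
xor_checksum is an invented name):

```c
#include <assert.h>
#include <stddef.h>

/* fold the block into two 64-bit accumulators by XOR.  Much weaker
 * than md5, but enough to notice a buffer that changed while the IO
 * was in flight.  The caller must zero res[] first, and len must be a
 * multiple of 2*sizeof(long long), which holds for 512-byte sectors. */
void xor_checksum(const void *data, size_t len, unsigned long long res[2])
{
    const unsigned long long *p = data;
    while (len >= 2 * sizeof(*p)) {
        res[0] ^= *p++;
        res[1] ^= *p++;
        len -= 2 * sizeof(*p);
    }
}
```

Note that XOR is order-insensitive and self-cancelling, so it can miss some 
modifications a real hash would catch; that is part of why I call it a 
quick hack.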

Thanks,

Thomas


Am Dienstag, 30. September 2008 22:10:16 schrieb Lars Ellenberg:
> On Thu, Sep 11, 2008 at 11:25:12AM +0200, Lars Ellenberg wrote:
> > On Tue, Sep 09, 2008 at 04:02:30PM +0200, schoebel wrote:
>
> ...
>
> > > Now I am reasoning on different solutions. Please comment.
> > >
> > > Here are just some brainstorming ideas, without judging on their
> > > quality (this will come later):
>
> ...
>
> > how about holding a ring buffer of pinned pages,
> > say "drbd max-buffer" pages plus some,
> > and in drbd_make_request,
> > memcpy(ring buffer, submitted data)
> > then checksum that,
> > and submit it to localdisk as well as to tcp stack.
> > as these are now "private" data buffers,
> > the "application" cannot possibly modify them in-flight.
>
> have you made any progress on that matter?
>
> shall we further discuss whether or not it should be "fixed" (in drbd),
> and if so, how?
>
> or have you decided to just move a long?



[-- Attachment #2: drbd-8.2-checksum.patch --]
[-- Type: text/x-diff, Size: 7944 bytes --]

diff --git a/drbd/drbd_int.h b/drbd/drbd_int.h
index 0281a46..77dc26b 100644
--- a/drbd/drbd_int.h
+++ b/drbd/drbd_int.h
@@ -99,6 +99,17 @@ extern char usermode_helper[];
 #define FALSE 0
 #endif
 
+#define VERIFY_HACK
+
+#ifdef VERIFY_HACK
+#define CRC_SIZE 16
+int drbd_checksum_bio(struct bio *bio, u8 * res);
+int drbd_checksum_loadalg(char * algname);
+void drbd_checksum_unloadalg(void);
+struct drbd_conf;
+void _restart_requests(struct drbd_conf * mdev);
+#endif
+
 /* I don't remember why XCPU ...
  * This is used to wake the asender,
  * and to interrupt sending the sending task
@@ -678,6 +689,9 @@ struct drbd_request {
 	unsigned long rq_state; /* see comments above _req_mod() */
 	int seq_num;
 	unsigned long start_time;
+#ifdef VERIFY_HACK
+	u8 checksum[CRC_SIZE];
+#endif
 };
 
 struct drbd_barrier {
@@ -1000,6 +1014,11 @@ struct drbd_conf {
 	struct bm_io_work bm_io_work;
 	u64 ed_uuid; /* UUID of the exposed data */
 	struct mutex state_mutex;
+#ifdef VERIFY_HACK
+#define RESTART_MAX 8
+	int restart_head, restart_tail, restart_count;
+	struct bio * restart[RESTART_MAX];
+#endif
 };
 
 static inline struct drbd_conf *minor_to_mdev(unsigned int minor)
diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index dde171e..b7bde61 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -2838,6 +2838,10 @@ STATIC void drbd_cleanup(void)
 
 	drbd_unregister_blkdev(DRBD_MAJOR, "drbd");
 
+#ifdef VERIFY_HACK
+	drbd_checksum_unloadalg();
+#endif
+
 	printk(KERN_INFO "drbd: module cleanup done.\n");
 }
 
@@ -2997,6 +3001,12 @@ int __init drbd_init(void)
 		DRBD_MAJOR);
 	printk(KERN_INFO "drbd: minor_table @ 0x%p\n", minor_table);
 
+#ifdef VERIFY_HACK
+	if(drbd_checksum_loadalg("md5")) {
+	    printk(KERN_INFO "drbd: loading checksum algorithm module failed\n");
+	}
+#endif
+
 	return 0; /* Success! */
 
 Enomem:
diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c
index 8ca228c..831dec9 100644
--- a/drbd/drbd_receiver.c
+++ b/drbd/drbd_receiver.c
@@ -3685,6 +3685,9 @@ STATIC int got_BlockAck(struct drbd_conf *mdev, struct Drbd_Header *h)
 			D_ASSERT(0);
 		}
 		spin_unlock_irq(&mdev->req_lock);
+#ifdef VERIFY_HACK
+		_restart_requests(mdev);
+#endif
 	}
 	/* dec_ap_pending is handled within _req_mod */
 
diff --git a/drbd/drbd_req.c b/drbd/drbd_req.c
index e198a03..53ab136 100644
--- a/drbd/drbd_req.c
+++ b/drbd/drbd_req.c
@@ -93,6 +93,142 @@ STATIC void _print_req_mod(struct drbd_request *req, enum drbd_req_event what)
 	INFO("_req_mod(%p %c ,%s)\n", req, rw, rq_event_names[what]);
 }
 
+#ifdef VERIFY_HACK
+
+#include <linux/crypto.h>
+
+STATIC int drbd_make_request_common(struct drbd_conf *mdev, struct bio *bio);
+
+static struct crypto_hash * global_tfm = NULL;
+
+/* Separation between allocation and usage of crypto descriptors.
+ * This is necessary because crypto_alloc_*() can call schedule().
+ */
+int drbd_checksum_loadalg(char * algname)
+{
+	int digestsize;
+	global_tfm = crypto_alloc_hash(algname, 0, CRYPTO_ALG_ASYNC);
+	if(IS_ERR(global_tfm)) {
+		printk("Aha algorithm=%s error\n", algname);
+		return -1;
+	}
+	digestsize = crypto_hash_digestsize(global_tfm);
+	if(digestsize > CRC_SIZE || CRC_SIZE != 2*(sizeof(long long))) {
+		printk("Aha digestsize=%d reslen=%d\n", digestsize, CRC_SIZE);
+		return -1;
+	}
+	printk(KERN_INFO "drbd_checksum_loadalg OK\n");
+	return 0;
+}
+
+void drbd_checksum_unloadalg(void)
+{
+	crypto_free_hash(global_tfm);
+	global_tfm = NULL;
+}
+
+/* A backwards iterator is missing in <linux/bio.h>.
+ * We need it for correct nesting of bvec_bmap_irq / unmap pairs.
+ */
+#define __bio_for_each_segment_back(bvl, bio, i, start_idx)		\
+	for (bvl = bio_iovec_idx((bio), (start_idx)+(bio)->bi_vcnt-1), i = (start_idx+(bio)->bi_vcnt-1);	\
+	     i >= (start_idx);						\
+	     bvl--, i--)
+
+#define bio_for_each_segment_back(bvl, bio, i)				\
+	__bio_for_each_segment_back(bvl, bio, i, (bio)->bi_idx)
+
+static void dummy_checksum(void * data, int len, long long res[])
+{
+	register long long val0 = res[0];
+	register long long val1 = res[1];
+	long long * _data = data;
+	while(len > 0) {
+		val0 ^= *_data++;
+		val1 ^= *_data++;
+		len -= 2*sizeof(val0);
+	}
+	res[0] = val0;
+	res[1] = val1;
+}
+
+int drbd_checksum_bio(struct bio *bio, u8 * res)
+{
+	int err = -1;
+	struct hash_desc desc;
+	int count;
+	
+	if(!global_tfm)
+		return -1;
+
+	desc.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+	desc.tfm = global_tfm;
+
+	memset(res, 0, CRC_SIZE);
+
+	count = bio->bi_vcnt;
+	{
+		void *data[count];
+		unsigned long flags[count];
+		struct scatterlist sg[count];
+		struct bio_vec * bv;
+		int nbytes = 0;
+		int i;
+		int j;
+		sg_init_table(sg, count);
+		
+		j = 0;
+		bio_for_each_segment(bv, bio, i) {
+			data[j] = bvec_kmap_irq(bv, &flags[j]);
+			nbytes += bv->bv_len;
+			dummy_checksum(data[j], bv->bv_len, (long long*)res);
+			sg_set_buf(&sg[j], data[j], bv->bv_len);
+			bvec_kunmap_irq(data[j], &flags[j]);
+			j++;
+		}
+#if 0		
+		if(crypto_hash_init(&desc))
+			goto out;
+		if(crypto_hash_update(&desc, sg, nbytes))
+			goto out;
+		if(crypto_hash_final(&desc, res))
+			goto out;
+		bio_for_each_segment_back(bv, bio, i) {
+			j--;
+		}
+#endif		
+	}
+	err = 0;
+#if 0	
+out:
+#endif
+	return err;
+}
+
+void _restart_requests(struct drbd_conf * mdev)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&mdev->req_lock, flags);
+	while(mdev->restart_count > 0) {
+		struct bio * master_bio;
+		int restart_count = mdev->restart_count;
+		master_bio = mdev->restart[mdev->restart_head];
+		mdev->restart_head = (mdev->restart_head+1) % RESTART_MAX;
+		mdev->restart_count--;
+		if(master_bio) {
+			spin_unlock_irqrestore(&mdev->req_lock, flags);
+			INFO("restarting block=%ld (count=%d)\n", master_bio->bi_sector, restart_count);
+			drbd_make_request_common(mdev, master_bio);
+			spin_lock_irqsave(&mdev->req_lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&mdev->req_lock, flags);
+}
+
+
+#endif
+
+
 # ifdef ENABLE_DYNAMIC_TRACE
 #  define print_rq_state(R, T) \
 	MTRACE(TraceTypeRq, TraceLvlMetrics, _print_rq_state(R, T);)
@@ -381,6 +517,35 @@ void _req_may_be_done(struct drbd_request *req, int error)
 		/* Update disk stats */
 		_drbd_end_io_acct(mdev, req);
 
+#ifdef VERIFY_HACK
+		int restart = 0;
+		if(ok && rw == WRITE) {
+			u8 checksum[CRC_SIZE];
+			int err;
+			//INFO("checksumming %ld\n", req->master_bio->bi_sector);
+			err = drbd_checksum_bio(req->master_bio, checksum);
+			if(!err) {
+				err = memcmp(req->checksum, checksum, sizeof(checksum));
+			}
+			if(err) {
+				INFO("bio checksum mismatch block=%ld\n", req->master_bio->bi_sector);
+				restart = 1;
+			}
+		}
+		if(restart) {
+			if(mdev->restart_count <= RESTART_MAX) {
+				INFO("queuing block=%ld for restart (count=%d)\n", req->master_bio->bi_sector, mdev->restart_count);
+				mdev->restart[mdev->restart_tail] = req->master_bio;
+				mdev->restart_tail = (mdev->restart_tail+1) % RESTART_MAX;
+				mdev->restart_count++;
+				req->master_bio = NULL;
+			} else {
+				INFO("Aieee, queue buffer is full (block=%ld)\n", req->master_bio->bi_sector);
+				restart = 0;
+			}
+		}
+		if(!restart)
+#endif
 		_complete_master_bio(mdev,req,
 				     ok ? 0 : ( error ? error : -EIO ) );
 	} else {
@@ -887,6 +1052,11 @@ STATIC int drbd_make_request_common(struct drbd_conf *mdev, struct bio *bio)
 	}
 	if (rw == WRITE) {
 		remote = 1;
+#ifdef VERIFY_HACK
+		if(drbd_checksum_bio(bio, req->checksum)) {
+			ERR("error in checksumming algorithm\n");
+		}
+#endif
 	} else {
 		/* READ || READA */
 		if (local) {
diff --git a/drbd/drbd_worker.c b/drbd/drbd_worker.c
index aff97cc..fe69dd2 100644
--- a/drbd/drbd_worker.c
+++ b/drbd/drbd_worker.c
@@ -247,6 +247,9 @@ BIO_ENDIO_TYPE drbd_endio_pri BIO_ENDIO_ARGS(struct bio *bio, int error)
 	spin_lock_irqsave(&mdev->req_lock, flags);
 	_req_mod(req, what, error);
 	spin_unlock_irqrestore(&mdev->req_lock, flags);
+#ifdef VERIFY_HACK
+	_restart_requests(mdev);
+#endif
 
 	BIO_ENDIO_FN_RETURN;
 }

[-- Attachment #3: drbd-8.2-copy.patch --]
[-- Type: text/x-diff, Size: 4635 bytes --]

diff --git a/drbd/drbd_int.h b/drbd/drbd_int.h
index 0281a46..c61856e 100644
--- a/drbd/drbd_int.h
+++ b/drbd/drbd_int.h
@@ -40,6 +40,17 @@
 #include <net/tcp.h>
 #include "lru_cache.h"
 
+#define COPY_HACK
+
+#ifdef COPY_HACK
+struct drbd_conf;
+struct page *drbd_pp_alloc(struct drbd_conf *mdev, gfp_t gfp_mask);
+void drbd_pp_free(struct drbd_conf *mdev, struct page *page);
+struct drbd_request;
+void drbd_page_copy(struct drbd_conf *mdev, struct drbd_request *req, struct bio *bio);
+void drbd_page_free(struct drbd_conf *mdev, struct drbd_request *req);
+#endif
+
 #ifdef __CHECKER__
 # define __protected_by(x)       __attribute__((require_context(x,1,999,"rdwr")))
 # define __protected_read_by(x)  __attribute__((require_context(x,1,999,"read")))
@@ -678,6 +689,9 @@ struct drbd_request {
 	unsigned long rq_state; /* see comments above _req_mod() */
 	int seq_num;
 	unsigned long start_time;
+#ifdef COPY_HACK
+        struct page * copy_page;
+#endif
 };
 
 struct drbd_barrier {
diff --git a/drbd/drbd_main.c b/drbd/drbd_main.c
index dde171e..bd8e2dc 100644
--- a/drbd/drbd_main.c
+++ b/drbd/drbd_main.c
@@ -2665,7 +2665,6 @@ STATIC void drbd_destroy_mempools(void)
 	drbd_request_mempool = NULL;
 	drbd_ee_cache        = NULL;
 	drbd_request_cache   = NULL;
-
 	return;
 }
 
diff --git a/drbd/drbd_receiver.c b/drbd/drbd_receiver.c
index 8ca228c..88d1103 100644
--- a/drbd/drbd_receiver.c
+++ b/drbd/drbd_receiver.c
@@ -92,7 +92,7 @@ void drbd_assert_breakpoint(struct drbd_conf *mdev, char *exp,
 /**
  * drbd_bp_alloc: Returns a page. Fails only if a signal comes in.
  */
-STATIC struct page *drbd_pp_alloc(struct drbd_conf *mdev, gfp_t gfp_mask)
+struct page *drbd_pp_alloc(struct drbd_conf *mdev, gfp_t gfp_mask)
 {
 	unsigned long flags = 0;
 	struct page *page;
@@ -167,7 +167,7 @@ STATIC struct page *drbd_pp_alloc(struct drbd_conf *mdev, gfp_t gfp_mask)
 	return page;
 }
 
-STATIC void drbd_pp_free(struct drbd_conf *mdev, struct page *page)
+void drbd_pp_free(struct drbd_conf *mdev, struct page *page)
 {
 	unsigned long flags = 0;
 	int free_it;
diff --git a/drbd/drbd_req.c b/drbd/drbd_req.c
index e198a03..2d25450 100644
--- a/drbd/drbd_req.c
+++ b/drbd/drbd_req.c
@@ -839,6 +839,39 @@ STATIC int drbd_may_do_local_read(struct drbd_conf *mdev, sector_t sector, int s
 	return (0 == drbd_bm_count_bits(mdev, sbnr, ebnr));
 }
 
+#ifdef COPY_HACK
+void drbd_page_copy(struct drbd_conf *mdev, struct drbd_request *req, struct bio *bio)
+{
+	struct bio_vec *bvec;
+        int i;
+	bio_for_each_segment(bvec, bio, i) {
+                struct page *dst_page = drbd_pp_alloc(mdev, GFP_KERNEL);
+                struct page *src_page = bvec->bv_page;
+                void *dst = kmap(dst_page);
+                void *src = kmap(src_page);
+                set_page_private(dst_page, (unsigned long)req->copy_page);
+                req->copy_page = dst_page;
+                memcpy(dst, src, PAGE_SIZE);
+#if 1
+                bvec->bv_page = dst_page;
+#endif
+                kunmap(src_page);
+                kunmap(dst_page);
+        }        
+}
+
+void drbd_page_free(struct drbd_conf *mdev, struct drbd_request *req)
+{
+        struct page * free;
+        while((free = req->copy_page)) {
+                req->copy_page = (struct page *)page_private(free);
+                set_page_private(free, (unsigned long)NULL);
+                drbd_pp_free(mdev, free);
+        }
+}
+
+#endif
+
 /*
  * general note:
  * looking at the state (conn, disk, susp, pdsk) outside of the spinlock that
@@ -887,6 +920,11 @@ STATIC int drbd_make_request_common(struct drbd_conf *mdev, struct bio *bio)
 	}
 	if (rw == WRITE) {
 		remote = 1;
+#ifdef COPY_HACK
+                if(req->private_bio) {
+                        drbd_page_copy(mdev, req, req->private_bio);
+                }
+#endif
 	} else {
 		/* READ || READA */
 		if (local) {
diff --git a/drbd/drbd_req.h b/drbd/drbd_req.h
index 2c67f09..04c8556 100644
--- a/drbd/drbd_req.h
+++ b/drbd/drbd_req.h
@@ -294,6 +294,9 @@ static inline struct drbd_request *drbd_req_new(struct drbd_conf *mdev,
 		req->sector      = bio->bi_sector;
 		req->size        = bio->bi_size;
 		req->start_time  = jiffies;
+#ifdef COPY_HACK
+                req->copy_page = NULL;
+#endif
 		INIT_HLIST_NODE(&req->colision);
 		INIT_LIST_HEAD(&req->tl_requests);
 
@@ -306,6 +309,10 @@ static inline struct drbd_request *drbd_req_new(struct drbd_conf *mdev,
 
 static inline void drbd_req_free(struct drbd_request *req)
 {
+#ifdef COPY_HACK
+        if(req->mdev)
+                drbd_page_free(req->mdev, req);
+#endif
 	mempool_free(req, drbd_request_mempool);
 }
 

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01  9:16     ` Thomas Schoebel-Theuer
@ 2008-10-01 10:46       ` Lars Ellenberg
  2008-10-01 11:28         ` Thomas Schoebel-Theuer
  0 siblings, 1 reply; 11+ messages in thread
From: Lars Ellenberg @ 2008-10-01 10:46 UTC (permalink / raw)
  To: Thomas Schoebel-Theuer; +Cc: Michael Ziegel, Peter Scheitterer, drbd-dev

On Wed, Oct 01, 2008 at 11:16:41AM +0200, Thomas Schoebel-Theuer wrote:
> Hi Lars,
> 
> I have two different patches, both based on your git version from 2008-09-11.

great.

> The first one implements the checksumming method while the second one 
> implements the copy method proposed by you.
> 
> I found that your suggestion was much easier to implement than the 
> checksumming method. However, the copy method works completely silently (you 
> get no indication of whether a race on the data is actually taking place), 
> while the checksumming method writes reports to the kernel log (so you can 
> observe the behavior).

I briefly had a look at the copy method.

as private_bio is a clone of master_bio,
it shares its bvec, thus the pages.

you assign the copy pages to the bvec->bv_page
(you have to do that, master_bio->bvec->bv_page is used in sendpage)
but you don't restore the original pages before completion,
and even free the copy pages already.
you
 * potentially leak the original pages
 * potentially double free the copy pages
as the submitter (file system, page cache, whatever)
in its completion callback may reference or free those.

did you try this on a debug kernel,
memory poisoning and page allocation debugging switched on?
(of course, that is not the kernel you want to use for
 performance benchmarks).

you need to remember the original pages, and you need to restore them
before completion of the master bio.

for diagnostics, you can "memcmp" on that occasion as well.

and you probably need to increase the "emergency reserve" amount of
pre-allocated pages, to avoid resource starvation deadlocks.

you should use copy_page or copy_highpage.

for performance impact, assuming that your memory bandwidth is much
higher than your IO bandwidth, this will "only" increase CPU cycles.
which is probably ok on a generic file server, but may become critical
when running a data base directly on drbd.
so, when doing benchmarks, have a look at the cpu usage as well.

I'll have a look at the checksumming method later.

thanks again,

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 10:46       ` Lars Ellenberg
@ 2008-10-01 11:28         ` Thomas Schoebel-Theuer
  2008-10-01 11:46           ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Schoebel-Theuer @ 2008-10-01 11:28 UTC (permalink / raw)
  To: drbd-dev

> as private_bio is a clone of master_bio,
> it shares its bvec, thus the pages.

Oops, I didn't realize that (I tested the copy version only briefly, since I 
was more interested in the checksumming method because of its kernel log 
messages). The implications are clear to me.

IMHO the best solution would be a bio_copy() in addition to bio_clone() in 
bio.h, but even if we got it upstream, it would not help on older kernels.

> you need to remember the original pages, and you need to restore them
> before completion of the master bio.

Well, one problem is that the length of the bvec could be nearly arbitrary 
(in theory), so pre-claiming "enough" space in struct drbd_request is 
probably not a good idea. kmalloc() would keep the allocation dynamic 
without wasting space.

But what about simply generating a completely new bio and copying over all 
the stuff by hand? This would mean implementing some sort of bio_copy() in 
the local code, which could later be lifted upstream if other people liked 
it too. Which do you think is better?

Thanks for feedback,

Thomas


* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 11:28         ` Thomas Schoebel-Theuer
@ 2008-10-01 11:46           ` Lars Ellenberg
  2008-10-01 12:49             ` Thomas Schoebel-Theuer
  0 siblings, 1 reply; 11+ messages in thread
From: Lars Ellenberg @ 2008-10-01 11:46 UTC (permalink / raw)
  To: drbd-dev

On Wed, Oct 01, 2008 at 01:28:02PM +0200, Thomas Schoebel-Theuer wrote:
> > as private_bio is a clone of master_bio,
> > it shares its bvec, thus the pages.
> 
> Oops, I didn't realize that (and I tested the copy version only briefly 
> because I was more interested in the checksumming because of the kernel log 
> messages). The implications are clear to me.
> 
> IMHO best solution would be the existence of a bio_copy() in addition to 
> bio_clone() in bio.h, but even if we would get it into the upstream it would 
> not help for elder kernels.
> 
> > you need to remember the original pages, and you need to restore them
> > before completion of the master bio.
> 
> Well, one problem is that the length of bvec could be nearly arbitrary (in 
> theory),

BIO_MAX_PAGES
DRBD_MAX_SEGMENT_SIZE

> so preclaiming "enough" space in struct drbd_request is probably no 
> good idea. Well, kmalloc() would be possible to keep the allocation dynamic 
> without wasting space.

I meant the pre-allocation in drbd_pp_alloc: the number of pages.

> But what about simply generating a completely new bio and copying over all the 
> stuff by hand? This would mean to implement some sort of bio_copy() in the 
> local code which could then later be lifted upstreams if other people liked 
> it too. What do you think is better?

no, I think it would be enough to just set (pseudo code)
copy_page->private = orig_page;
bvec->bv_page = copy_page;

then, later before completion,
bio_for_each_segment
  memcmp,
  restore orig
  drbd_pp_free copy
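A userspace model may help make that bookkeeping concrete. This is only a sketch: struct page's private pointer stands in for the kernel's page_private()/set_page_private(), and all other names are illustrative, not actual DRBD or kernel symbols.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NVECS 4

/* Toy stand-ins for the kernel structures involved.  The private
 * field mimics page_private(): it remembers the original page while
 * the copy is in flight. */
struct page {
	unsigned char data[PAGE_SIZE];
	struct page *private;
};

struct bio_vec {
	struct page *bv_page;
};

/* On submission: swap each original page for a private copy,
 * remembering the original in copy->private. */
static void swap_in_copies(struct bio_vec *bvec, int n)
{
	for (int i = 0; i < n; i++) {
		struct page *copy = calloc(1, sizeof(*copy));
		memcpy(copy->data, bvec[i].bv_page->data, PAGE_SIZE);
		copy->private = bvec[i].bv_page;  /* remember original */
		bvec[i].bv_page = copy;
	}
}

/* Before completing the master bio: memcmp for diagnostics, restore
 * the original pages, and free the copies.  Returns how many pages
 * the upper layer re-dirtied while the write was in flight. */
static int restore_originals(struct bio_vec *bvec, int n)
{
	int races = 0;
	for (int i = 0; i < n; i++) {
		struct page *copy = bvec[i].bv_page;
		struct page *orig = copy->private;
		if (memcmp(copy->data, orig->data, PAGE_SIZE))
			races++;               /* page changed under us */
		bvec[i].bv_page = orig;        /* restore before completion */
		free(copy);
	}
	return races;
}
```

The point the model makes visible is that the master bio completes with its original pages back in place, so the submitter never sees the copies.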

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 11:46           ` Lars Ellenberg
@ 2008-10-01 12:49             ` Thomas Schoebel-Theuer
  2008-10-01 14:59               ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread
From: Thomas Schoebel-Theuer @ 2008-10-01 12:49 UTC (permalink / raw)
  To: drbd-dev

> > Well, one problem is that the length of bvec could be nearly arbitrary
> > (in theory),
>
> BIO_MAX_PAGES
> DRBD_MAX_SEGMENT_SIZE

Ok.

> > But what about simply generating a completely new bio and copying over
> > all the stuff by hand? This would mean to implement some sort of
> > bio_copy() in the local code which could then later be lifted upstreams
> > if other people liked it too. What do you think is better?
>
> no, I think it would be enough to just set (pseudo code)
> copy_page->private = orig_page;

Hmm, what if the caller already uses orig_page->private and is now accessing 
the page _in parallel_ to us? IMHO, in combination with the next step this 
_could_ lead to an observable difference:

> bvec->bv_page = copy_page;

IMHO this could cause a side effect for any observer accessing "his" page 
via "his" orig_bio and finally arriving at our copy_page->private. I am not 
sure whether this really happens anywhere in the kernel, but even if it 
currently does not, it could happen sometime in the future. The idea behind 
bio_copy() was to _never_ touch the original or any of its transitively 
reachable descendants, just to be sure nothing can ever go wrong with it 
(similar to a COW style). The performance might be slightly worse, but we 
are reasoning about correctness at a rather tricky level.

I am not sure whether it really does any harm, I'm just curious and cautious.

Well, before implementing it I will reason about it for some time, just to 
be sure and convinced. Maybe I should implement both alternatives and do 
some stress-testing on a debug kernel?

Thanks for your feedback,

Thomas


* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 12:49             ` Thomas Schoebel-Theuer
@ 2008-10-01 14:59               ` Lars Ellenberg
  2009-01-26 11:03                 ` Lars Ellenberg
  0 siblings, 1 reply; 11+ messages in thread
From: Lars Ellenberg @ 2008-10-01 14:59 UTC (permalink / raw)
  To: Thomas Schoebel-Theuer; +Cc: drbd-dev

On Wed, Oct 01, 2008 at 02:49:22PM +0200, Thomas Schoebel-Theuer wrote:
> > > Well, one problem is that the length of bvec could be nearly arbitrary
> > > (in theory),
> >
> > BIO_MAX_PAGES
> > DRBD_MAX_SEGMENT_SIZE
> 
> Ok.
> 
> > > But what about simply generating a completely new bio and copying over
> > > all the stuff by hand? This would mean to implement some sort of
> > > bio_copy() in the local code which could then later be lifted upstreams
> > > if other people liked it too. What do you think is better?
> >
> > no, I think it would be enough to just set (pseudo code)
> > copy_page->private = orig_page;
> 
> Hmm. what if the caller used orig_page->private already and is now accessing 
> the page _in parallel_ to us? IMHO in combination with the next step it 
> _could_ lead to an observable difference:
> 
> > bvec->bv_page = copy_page;
> 
> IMHO this could cause a side effect for any observer accessing "his" page 
> via "his" orig_bio and finally arriving at our copy_page->private. I am not 
> sure whether this really happens in the kernel anywhere, but even if 
> currently not it probably could happen sometime in future. The idea behind 
> bio_copy() was to _never_ touch the original and all its transitively 
> reachable descendants in any way, just to be sure nothing can ever go wrong 
> with it (similar to a COW style). Well, the performance might be slightly 
> worse, but we are reasoning on correctness at a rather tricky high level.
>
> I am not sure whether it really does any harm, I'm just curious and cautious.

I think that is a reasonable attitude ;)

so we just don't bio_clone in drbd_req_new then,
but bio_alloc a fresh bio with our own bvec and own pages attached,
which will be submitted, _and_ be used to sendpage it over.
it has to be properly initialized, obviously.

we have to make sure to get an extra reference (bio_get) on that private
bio then, so it cannot vanish while being handed over to tcp, it has to
stay around until master bio completion (where an extra bio_put will
free it).

_maybe_ we could add a full struct bio private_bio plus a reasonably sized
bvec array member to struct drbd_request, so we can skip the extra
bio_alloc, _get and _put (unless there is some paranoia code in the generic
block layer, which may be unhappy with that).
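the ownership model described above can be sketched in userspace too: the
private copy owns its pages outright, the master pages are never touched,
and a reference count in the spirit of bio_get()/bio_put() keeps the copy
alive while both the local submission and the TCP send path still use it.
all names below are illustrative, not actual kernel or DRBD symbols.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* A "fresh private bio": deep-copied pages plus a reference count. */
struct priv_bio {
	int refs;
	int nvecs;
	unsigned char (*pages)[PAGE_SIZE];
};

/* COW-style snapshot: the original pages are read once and never
 * modified or re-pointed afterwards. */
static struct priv_bio *priv_bio_copy(unsigned char (*orig)[PAGE_SIZE], int n)
{
	struct priv_bio *b = malloc(sizeof(*b));
	b->refs = 1;                            /* submitter's reference */
	b->nvecs = n;
	b->pages = malloc(n * sizeof(*b->pages));
	memcpy(b->pages, orig, n * PAGE_SIZE);
	return b;
}

static void priv_bio_get(struct priv_bio *b)
{
	b->refs++;                              /* e.g. for the send path */
}

/* Drop a reference; returns 1 when the last reference freed it. */
static int priv_bio_put(struct priv_bio *b)
{
	if (--b->refs)
		return 0;
	free(b->pages);
	free(b);
	return 1;
}
```

the copy only disappears after both the send path and the completion path
have dropped their references, which is exactly the lifetime problem
bio_get()/bio_put() solve for a real bio.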

> Well, before implementing it I will reason on it for some time just to be sure 
> and convinced. Maybe I should implement both alternatives

implement the "most correct" alternative.

> and do some stress-testing on a debug kernel?

absolutely.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


* Re: [Drbd-dev] Behaviour of verify: false positives -> true positives
  2008-10-01 14:59               ` Lars Ellenberg
@ 2009-01-26 11:03                 ` Lars Ellenberg
  0 siblings, 0 replies; 11+ messages in thread
From: Lars Ellenberg @ 2009-01-26 11:03 UTC (permalink / raw)
  To: drbd-dev

hi there...
any news on this?

also, please consider testing with linux 2.6.28.2,
there are a few data integrity relevant fixes for the
generic dirty page write out path in there...

On Wed, Oct 01, 2008 at 04:59:10PM +0200, Lars Ellenberg wrote:
> On Wed, Oct 01, 2008 at 02:49:22PM +0200, Thomas Schoebel-Theuer wrote:
> > > > Well, one problem is that the length of bvec could be nearly arbitrary
> > > > (in theory),
> > >
> > > BIO_MAX_PAGES
> > > DRBD_MAX_SEGMENT_SIZE
> > 
> > Ok.
> > 
> > > > But what about simply generating a completely new bio and copying over
> > > > all the stuff by hand? This would mean to implement some sort of
> > > > bio_copy() in the local code which could then later be lifted upstreams
> > > > if other people liked it too. What do you think is better?
> > >
> > > no, I think it would be enough to just set (pseudo code)
> > > copy_page->private = orig_page;
> > 
> > Hmm. what if the caller used orig_page->private already and is now accessing 
> > the page _in parallel_ to us? IMHO in combination with the next step it 
> > _could_ lead to an observable difference:
> > 
> > > bvec->bv_page = copy_page;
> > 
> > IMHO this could cause a side effect for any observer accessing "his" page 
> > via "his" orig_bio and finally arriving at our copy_page->private. I am not 
> > sure whether this really happens in the kernel anywhere, but even if 
> > currently not it probably could happen sometime in future. The idea behind 
> > bio_copy() was to _never_ touch the original and all its transitively 
> > reachable descendants in any way, just to be sure nothing can ever go wrong 
> > with it (similar to a COW style). Well, the performance might be slightly 
> > worse, but we are reasoning on correctness at a rather tricky high level.
> >
> > I am not sure whether it really does any harm, I'm just curious and cautious.
> 
> I think that is a reasonable attitude ;)
> 
> so we just don't bio_clone in drbd_req_new then,
> but bio_alloc a fresh bio with our own bvec and own pages attached,
> which will be submitted, _and_ be used to sendpage it over.
> it has to be properly initialized, obviously.
> 
> we have to make sure to get an extra reference (bio_get) on that private
> bio then, so it cannot vanish while being handed over to tcp, it has to
> stay around until master bio completion (where an extra bio_put will
> free it).
> 
> _maybe_ we could add full struct bio private_bio + reasonably sized bvec
> array member to struct drbd_request, so we can skip the extra bio_alloc,
> _get and _put (unless there is some paranoia code in the generic block
> layer, which may be unhappy with that).
> 
> > Well, before implementing it I will reason on it for some time just to be sure 
> > and convinced. Maybe I should implement both alternatives
> 
> implement the "most correct" alternative.
> 
> > and do some stress-testing on a debug kernel?
> 
> absolutely.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


end of thread, other threads:[~2009-01-26 11:03 UTC | newest]

Thread overview: 11+ messages
2008-09-09 14:02 [Drbd-dev] Behaviour of verify: false positives -> true positives schoebel
2008-09-11  9:25 ` Lars Ellenberg
2008-09-11  9:37   ` Lars Ellenberg
2008-09-30 20:10   ` Lars Ellenberg
2008-10-01  9:16     ` Thomas Schoebel-Theuer
2008-10-01 10:46       ` Lars Ellenberg
2008-10-01 11:28         ` Thomas Schoebel-Theuer
2008-10-01 11:46           ` Lars Ellenberg
2008-10-01 12:49             ` Thomas Schoebel-Theuer
2008-10-01 14:59               ` Lars Ellenberg
2009-01-26 11:03                 ` Lars Ellenberg
