Distributed Replicated Block Device (DRBD) development
* RE: [Drbd-dev] DRBD-8: recent regression causing corruption and crashes
@ 2006-08-11 19:11 Graham, Simon
  2006-08-11 19:57 ` Lars Ellenberg
From: Graham, Simon @ 2006-08-11 19:11 UTC
  To: Lars Ellenberg, drbd-dev

> / 2006-08-11 12:01:23 -0400
> \ Graham, Simon:
> > Quick update:
> >
> 
> How exactly do you "test"?
> Kernel and hardware?
> (sorry, if you posted that earlier, just point me to it)

In this case, it happens only when I install a pair of systems from
scratch and DRBD is doing the initial synchronization of one specific
partition while our applications are writing to it at the same time. I
posted the sequence at the end of a previous message, but it's
basically:

1. On both systems, use drbdmeta to wipe the meta data with no network
   connection established.
2. On one system, mount the drbd disk, make a file system and untar some
   stuff on to it (still with no network connection).
3. Reboot both systems - when they come up, resync starts. On one
   system, mount the file system (which causes reads/writes at the same
   time as the resync).

Once I'm in this state (and have had the crash, which happens every
time), I'm not able to manually resync the disks -- I suspect I don't
understand enough about this yet, but it always reports a split-brain
and is not able to resolve it even if I set the after-sb-xpri options.

The hardware is a pair of Dell servers; the software is kernel 2.6.16.13
with the Xen 3.0.2 patches. This all worked fine until about a week ago,
when I upgraded to the latest trunk version of drbd 8.

Simon

BTW: I have also checked carefully that I'm running the latest trunk
version (as of last night).



* RE: [Drbd-dev] DRBD-8: recent regression causing corruption and crashes
@ 2006-08-11 21:55 Graham, Simon
From: Graham, Simon @ 2006-08-11 21:55 UTC
  To: Lars Ellenberg, drbd-dev

After a lot of looking at the disassembly of the send-ack routines, I
think I've found it -- the new routines Philipp added do this:

static int _drbd_send_ack(drbd_dev *mdev, Drbd_Packet_Cmd cmd, 
			  sector_t sector,
			  unsigned int blksize,
			  u64 block_id)
{...}

int drbd_send_ack(drbd_dev *mdev, Drbd_Packet_Cmd cmd, struct Tl_epoch_entry *e)
{
	return _drbd_send_ack(mdev,cmd,
			      cpu_to_be64(drbd_ee_get_sector(e)),
			      cpu_to_be32(drbd_ee_get_size(e)),
			      e->block_id);
}

Now, if you build on a system that does NOT have CONFIG_LBD defined,
then sector_t is defined as 'unsigned long' - i.e. 32 bits on a 32-bit
build - so the code above byte-swaps the sector number as a u64, then
truncates it to 32 bits at the call, leaving JUST the byte-swapped upper
half of the original value, i.e. zero _ALWAYS_.

I just checked my config and CONFIG_LBD is off -- I'm guessing it's
probably on for the tests you run?

I also think the fix is simply a matter of changing the 'sector'
parameter of _drbd_send_ack to 'u64 sector' - I'm going to try this
right now!

Simon


* RE: [Drbd-dev] DRBD-8: recent regression causing corruption and crashes
@ 2006-08-11 22:31 Graham, Simon
  2006-08-14  6:53 ` Philipp Reisner
From: Graham, Simon @ 2006-08-11 22:31 UTC
  To: Graham, Simon, Lars Ellenberg, drbd-dev

That was it -- things are going MUCH better now - trivial patch
attached.

Simon


[-- Attachment #2: drbd-sector.patch --]
[-- Type: application/octet-stream, Size: 354 bytes --]

Index: drbd_main.c
===================================================================
--- drbd_main.c	(revision 3504)
+++ drbd_main.c	(working copy)
@@ -1506,7 +1506,7 @@
  * in big endian!
  */ 
 static int _drbd_send_ack(drbd_dev *mdev, Drbd_Packet_Cmd cmd, 
-			  sector_t sector,
+			  u64 sector,
 			  unsigned int blksize,
 			  u64 block_id)
 {

