* RE: [Drbd-dev] DRBD-8: recent regression causing corruption andcrashes
@ 2006-08-11 22:31 Graham, Simon
2006-08-14 6:53 ` Philipp Reisner
0 siblings, 1 reply; 5+ messages in thread
From: Graham, Simon @ 2006-08-11 22:31 UTC (permalink / raw)
To: Graham, Simon, Lars Ellenberg, drbd-dev
[-- Attachment #1: Type: text/plain, Size: 1693 bytes --]
That was it -- things are going MUCH better now - trivial patch
attached.
Simon
> -----Original Message-----
> From: drbd-dev-bounces@linbit.com [mailto:drbd-dev-bounces@linbit.com]
> On Behalf Of Graham, Simon
> Sent: Friday, August 11, 2006 5:56 PM
> To: Lars Ellenberg; drbd-dev@linbit.com
> Subject: RE: [Drbd-dev] DRBD-8: recent regression causing corruption
> andcrashes
>
> After a lot of looking at the disassembly of the send-ack routines, I
> think I've found it -- the new routines Philipp added do this:
>
> static int _drbd_send_ack(drbd_dev *mdev, Drbd_Packet_Cmd cmd,
>                           sector_t sector,
>                           unsigned int blksize,
>                           u64 block_id)
> {...}
>
> int drbd_send_ack(drbd_dev *mdev, Drbd_Packet_Cmd cmd,
>                   struct Tl_epoch_entry *e)
> {
>         return _drbd_send_ack(mdev,cmd,
>                               cpu_to_be64(drbd_ee_get_sector(e)),
>                               cpu_to_be32(drbd_ee_get_size(e)),
>                               e->block_id);
> }
>
> Now, if you build on a system that does NOT have CONFIG_LBD defined,
> then the definition of sector_t is 'unsigned long' - i.e. 32 bits on a
> 32-bit kernel - so the code above byte swaps the sector number as a
> u64, then truncates it to 32 bits, leaving JUST the byte-swapped upper
> portion, i.e. zero _ALWAYS_.
>
> I just checked my config and CONFIG_LBD is off -- I'm guessing it's
> probably on for the tests you run?
>
> I also think the fix is simply a matter of changing the definition of
> _drbd_send_ack to be 'u64 sector' - I'm going to try this right now!
>
> Simon
[-- Attachment #2: drbd-sector.patch --]
[-- Type: application/octet-stream, Size: 354 bytes --]
Index: drbd_main.c
===================================================================
--- drbd_main.c (revision 3504)
+++ drbd_main.c (working copy)
@@ -1506,7 +1506,7 @@
  * in big endian!
  */
 static int _drbd_send_ack(drbd_dev *mdev, Drbd_Packet_Cmd cmd,
-                          sector_t sector,
+                          u64 sector,
                           unsigned int blksize,
                           u64 block_id)
 {
^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: [Drbd-dev] DRBD-8: recent regression causing corruption andcrashes
@ 2006-08-11 21:55 Graham, Simon
0 siblings, 0 replies; 5+ messages in thread
From: Graham, Simon @ 2006-08-11 21:55 UTC (permalink / raw)
To: Lars Ellenberg, drbd-dev
After a lot of looking at the disassembly of the send-ack routines, I
think I've found it -- the new routines Philipp added do this:

static int _drbd_send_ack(drbd_dev *mdev, Drbd_Packet_Cmd cmd,
                          sector_t sector,
                          unsigned int blksize,
                          u64 block_id)
{...}

int drbd_send_ack(drbd_dev *mdev, Drbd_Packet_Cmd cmd,
                  struct Tl_epoch_entry *e)
{
        return _drbd_send_ack(mdev,cmd,
                              cpu_to_be64(drbd_ee_get_sector(e)),
                              cpu_to_be32(drbd_ee_get_size(e)),
                              e->block_id);
}

Now, if you build on a system that does NOT have CONFIG_LBD defined,
then the definition of sector_t is 'unsigned long' - i.e. 32 bits on a
32-bit kernel - so the code above byte swaps the sector number as a u64,
then truncates it to 32 bits, leaving JUST the byte-swapped upper
portion, i.e. zero _ALWAYS_.
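
The truncation described above can be reproduced in user space. This is
only a minimal sketch, not DRBD code: it assumes a little-endian host
(where cpu_to_be64() amounts to a byte swap) and a 32-bit sector_t. The
names narrow_sector_t, swap64, wire_sector_narrow, peer_sector_narrow
and check_truncation are hypothetical stand-ins made up for the
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stand-in for sector_t when it is only 32 bits wide
 * (32-bit build without CONFIG_LBD). Hypothetical name. */
typedef uint32_t narrow_sector_t;

/* Assumption: little-endian host, where cpu_to_be64() is a byte swap.
 * swap64() is a portable stand-in, not the kernel macro. */
static uint64_t swap64(uint64_t v)
{
    uint64_t r = 0;
    int i;

    for (i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

/* Models the old call path: the swapped 64-bit value is squeezed into
 * a 32-bit parameter, then stored in the 64-bit packet field. The cast
 * spells out the conversion that happens implicitly at the call. */
static uint64_t wire_sector_narrow(uint64_t sector)
{
    narrow_sector_t param = (narrow_sector_t)swap64(sector);

    return (uint64_t)param;
}

/* What the peer decodes from that packet field. */
static uint64_t peer_sector_narrow(uint64_t sector)
{
    return swap64(wire_sector_narrow(sector));
}

/* Self-check: sectors below 2^32 all arrive as sector 0. */
static void check_truncation(void)
{
    assert(peer_sector_narrow(8) == 0);
    assert(peer_sector_narrow(0x12345678ULL) == 0);
}
```

Calling check_truncation() from a test program passes silently, which
matches the symptom of acks that always carry sector == 0.
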
I just checked my config and CONFIG_LBD is off -- I'm guessing it's
probably on for the tests you run?
I also think the fix is simply a matter of changing the definition of
_drbd_send_ack to be 'u64 sector' - I'm going to try this right now!
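
The proposed change can be sketched in user space as well. Again this
is only an illustration, assuming a little-endian host where
cpu_to_be64() amounts to a byte swap; swap64, wire_sector_wide,
peer_sector_wide and check_round_trip are hypothetical names, not DRBD
functions. With a 64-bit parameter nothing is cut off, so the peer
decodes the sector that was sent.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: little-endian host, where cpu_to_be64() is a byte swap.
 * swap64() is a portable stand-in, not the kernel macro. */
static uint64_t swap64(uint64_t v)
{
    uint64_t r = 0;
    int i;

    for (i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

/* Models the patched call path: the parameter is 64 bits wide, so the
 * byte-swapped value reaches the packet field unchanged. */
static uint64_t wire_sector_wide(uint64_t sector)
{
    uint64_t param = swap64(sector);

    return param;
}

/* What the peer decodes from that packet field. */
static uint64_t peer_sector_wide(uint64_t sector)
{
    return swap64(wire_sector_wide(sector));
}

/* Self-check: sectors survive the round trip, including values that
 * need more than 32 bits. */
static void check_round_trip(void)
{
    assert(peer_sector_wide(8) == 8);
    assert(peer_sector_wide(0x12345678ULL) == 0x12345678ULL);
    assert(peer_sector_wide(0x123456789ULL) == 0x123456789ULL);
}
```
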
Simon
^ permalink raw reply [flat|nested] 5+ messages in thread
* RE: [Drbd-dev] DRBD-8: recent regression causing corruption andcrashes
@ 2006-08-11 19:11 Graham, Simon
2006-08-11 19:57 ` Lars Ellenberg
0 siblings, 1 reply; 5+ messages in thread
From: Graham, Simon @ 2006-08-11 19:11 UTC (permalink / raw)
To: Lars Ellenberg, drbd-dev
> / 2006-08-11 12:01:23 -0400
> \ Graham, Simon:
> > Quick update:
> >
>
> How exactly do you "test"?
> Kernel and hardware?
> (sorry, if you posted that earlier, just point me to it)
In this case, this happens only when I install a pair of systems from
scratch and it is doing initial synchronization of one specific DRBD
partition which is also being written to by our applications at the same
time. I did post the sequence at the end of a previous message, but it's
basically:
1. on both systems use drbdmeta to wipe the meta data with no network
connection established
2. on one system, mount the drbd disk, make a file system and untar some
stuff on to it (still with no network connection)
3. reboot both systems - when they come up, resync starts. On one
system, mount the file system (which causes reads/writes
at the same time as the resync)
Once I'm in this state (and have had the crash, which happens every
time), I'm not able to manually resync the disks -- I suspect I don't
understand enough about this yet, but it always reports a split-brain
that it is not able to fix, even if I set the after-sb-xpri options.
The hardware is a pair of Dell servers, software is 2.6.16.13 with Xen
3.0.2 patches; this all worked fine until about 1 week ago when I
upgraded to the latest trunk version of drbd 8.
Simon
BTW: I have also checked carefully that I'm running the latest trunk
version (as of last night).
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [Drbd-dev] DRBD-8: recent regression causing corruption andcrashes
2006-08-11 19:11 Graham, Simon
@ 2006-08-11 19:57 ` Lars Ellenberg
0 siblings, 0 replies; 5+ messages in thread
From: Lars Ellenberg @ 2006-08-11 19:57 UTC (permalink / raw)
To: drbd-dev
/ 2006-08-11 15:11:38 -0400
\ Graham, Simon:
> > / 2006-08-11 12:01:23 -0400
> > \ Graham, Simon:
> > > Quick update:
> > >
> >
> > How exactly do you "test"?
> > Kernel and hardware?
> > (sorry, if you posted that earlier, just point me to it)
>
> In this case, this happens only when I install a pair of systems from
> scratch and it is doing initial synchronization of one specific DRBD
> partition which is also being written to by our applications at the same
> time. I did post the sequence at the end of a previous message, but it's
> basically:
>
> 1. on both systems use drbdmeta to wipe the meta data with no network
> connection established
> 2. on one system, mount the drbd disk, make a file system and untar some
> stuff on to it (still with no network connection)
> 3. reboot both systems - when they come up, resync starts. On one
> system, mount the file system (which causes reads/writes
> at the same time as the resync)
>
> Once I'm in this state (and have had the crash, which happens every
> time), I'm not able to manually resync the disks -- I suspect I don't
> understand enough about this yet, but it always reports a split-brain
> that it is not able to fix, even if I set the after-sb-xpri options.
>
> The hardware is a pair of Dell servers, software is 2.6.16.13 with Xen
> 3.0.2 patches; this all worked fine until about 1 week ago when I
> upgraded to the latest trunk version of drbd 8.
ok...
so there is badness somewhere in our recent commits?
do you remember (or can you look up in the kernel logs) which
revision last worked?
did you change the file system?
> Simon
>
> BTW: I have also checked carefully that I'm running the latest trunk
> version (as of last night).
Also, I'm onto a serious bug in the alloc_ee function:
we do not handle bio_add_page "errors" yet, but they _do_ occur.
this may or may not be related to those strange WriteAcks with sector == 0.
--
: Lars Ellenberg Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe http://www.linbit.com :
^ permalink raw reply [flat|nested] 5+ messages in thread