Problems with software RAID + iSCSI or GNBD

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Problems with software RAID + iSCSI or GNBD
@ 2005-06-27  2:42 Christopher Smith
  2005-06-28 16:05 ` Michael Stumpf
  2005-06-29  1:11 ` Paul Clements
  0 siblings, 2 replies; 9+ messages in thread
From: Christopher Smith @ 2005-06-27  2:42 UTC (permalink / raw)
  To: linux-raid

I'm not sure if this is the correct list to be posting this to, but it 
is software RAID related, so if nothing else hopefully someone here can 
point me in the right direction.

I'm trying to roll my own SAN, but I've had mixed results thus far.  In 
my basic, initial setup I've created a configuration with two "disk 
nodes" and a single "concentrator node".  My objective is to have the 
"concentrator" take the physical disk exported from the "disk nodes" and 
  stitch it together into a RAID1.  So, it looks like this:

              "Concentrator"
                 /dev/md0
                  /     \
              GigE       GigE
                /         \
     "Disk node 1"       "Disk node 2"

So far I've tried using iSCSI and GNBD as the "back end" to make the 
disk space in the nodes visible to the concentrator.  I've had two 
problems, one unique to using iSCSI and the other common to both.

Problem 1: (Re)Sync performance is atrocious with iSCSI

If I use iSCSI as the back end, the RAID only builds at about 6 - 
7M/sec.  Once that is complete, however, performance is much better - 
reads around 100M/sec and writes around 50M/sec.  It's only during the 
sync the performance is awful.  It's not related to 
/proc/sys/dev/raid/speed_limit_max either, which I have set to 50M/sec. 
  Nor is it related to the sheer volume of traffic flying around, as if 
I use disktest to simultaneously read and write to both disk nodes, 
performance on all benchmarks only drops down to about 40 - 50M/sec.

If I switch the back end to GNBD, the resync speed is around 40 - 50M/sec.

Problem 2: The system doesn't deal with failure very well.

Once I got the RAID1 up and running, I tried to simulate a node failure 
by pulling the network cable from the node while disk activity was 
taking place.  I was hoping the concentrator would detect the "disk" had 
failed and simply drop it from the array (so it could later be simply 
re-added).  Unfortunately that doesn't appear to happen.  What does 
happen is that all IO to the md device "hangs" (eg: disktest throughput 
drops to 0M/sec), I am unable to either 'cat /prod/mdstat' to see the md 
device's status or use mdadm to manually fail the device - both simply 
result in the command "hanging".

Does anyone have any insight as to what might be causing these problems ?

Cheers,
CS

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Problems with software RAID + iSCSI or GNBD
  2005-06-27  2:42 Problems with software RAID + iSCSI or GNBD Christopher Smith
@ 2005-06-28 16:05 ` Michael Stumpf
  2005-06-29  2:09   ` Christopher Smith
  2005-06-29  1:11 ` Paul Clements
  1 sibling, 1 reply; 9+ messages in thread
From: Michael Stumpf @ 2005-06-28 16:05 UTC (permalink / raw)
  To: Christopher Smith; +Cc: linux-raid

>
> Problem 2: The system doesn't deal with failure very well.
>
> Once I got the RAID1 up and running, I tried to simulate a node 
> failure by pulling the network cable from the node while disk activity 
> was taking place.  I was hoping the concentrator would detect the 
> "disk" had failed and simply drop it from the array (so it could later 
> be simply re-added).  Unfortunately that doesn't appear to happen.  
> What does happen is that all IO to the md device "hangs" (eg: disktest 
> throughput drops to 0M/sec), I am unable to either 'cat /prod/mdstat' 
> to see the md device's status or use mdadm to manually fail the device 
> - both simply result in the command "hanging".

I probably can't help much at the moment, but...

I didn't realize that NBD type things had advanced to even this level of 
stability.  This is good news.  I've been wanting to do something like 
what you're trying for some time to overcome the bounds of  
power/busses/heat/space that limit you to a single machine when building 
a large md or LVM.  Looking at the GNBD project page, it still seems 
pretty raw, although a related project DDRAID seems to hold some promise.

I'd be pretty timid about putting anything close to production on these 
drivers, though.

What distro / kernel version / level of GNBD are you using? 

Regards-
Michael Stumpf

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Problems with software RAID + iSCSI or GNBD
  2005-06-28 16:05 ` Michael Stumpf
@ 2005-06-29  2:09   ` Christopher Smith
  0 siblings, 0 replies; 9+ messages in thread
From: Christopher Smith @ 2005-06-29  2:09 UTC (permalink / raw)
  To: mjstumpf; +Cc: linux-raid

Michael Stumpf wrote:
> I probably can't help much at the moment, but...
> 
> I didn't realize that NBD type things had advanced to even this level of 
> stability.  This is good news.  I've been wanting to do something like 
> what you're trying for some time to overcome the bounds of  
> power/busses/heat/space that limit you to a single machine when building 
> a large md or LVM.  Looking at the GNBD project page, it still seems 
> pretty raw, although a related project DDRAID seems to hold some promise.
> 
> I'd be pretty timid about putting anything close to production on these 
> drivers, though.
> 
> What distro / kernel version / level of GNBD are you using?

Well, I don't know if they have yet - the main reason I'm fiddling with 
this is to see if it's feasible :).

However, I have belted tens of gigabytes of data at the mirror-over-GNBD 
and mirror-over-iSCSI using various benchmarking tools without any 
kernel panics or (apparent) data corruption, so I'm gaining confidence 
that it's a workable solution.  I haven't yet started the same level of 
testing with Windows and Linux clients sitting above the 
initiator/bridge level yet, however, as I want to make sure the back end 
is pretty stable before moving on (as it will become a - relatively - 
single point of failure for most of the important machines in our 
network, and hence the entire company).

I'm just using a stock Fedora Core 4 and the GNBD it includes.  A bit 
bleeding edge, I know, but I figured since it had just been released 
when I started on this project, why not ;).

With regards to the problem I was having with node failures, at least 
with iSCSI the solution was setting a timeout so that a "disk failed" 
error was actually returned - by default the iSCSI initiator assumes any 
disconnection errors are network-related and transient, so it simply 
stops any IO to the iSCSI target until it reappears.  Now that I've 
specified a timeout, node "failures" behave as expected and the mirror 
goes into degraded mode.

I assume I need to do something similar with GNBD so that it really does 
"fail", rather than "hang", but I've been too busy over the last few 
days to actually look into it.

CS

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Problems with software RAID + iSCSI or GNBD
  2005-06-27  2:42 Problems with software RAID + iSCSI or GNBD Christopher Smith
  2005-06-28 16:05 ` Michael Stumpf
@ 2005-06-29  1:11 ` Paul Clements
  2005-06-29  4:42   ` Christopher Smith
  1 sibling, 1 reply; 9+ messages in thread
From: Paul Clements @ 2005-06-29  1:11 UTC (permalink / raw)
  To: Christopher Smith; +Cc: linux-raid

Christopher Smith wrote:

>  stitch it together into a RAID1.  So, it looks like this:
> 
>              "Concentrator"
>                 /dev/md0
>                  /     \
>              GigE       GigE
>                /         \
>     "Disk node 1"       "Disk node 2"
> 
> So far I've tried using iSCSI and GNBD as the "back end" to make the 
> disk space in the nodes visible to the concentrator.  I've had two 
> problems, one unique to using iSCSI and the other common to both.

Personally, I wouldn't mess with iSCSI or GNBD. You don't need GNBD in 
this scenario anyway; simple nbd (which is in the mainline kernel...get 
the userland tools at sourceforge.net/projects/nbd) will do just fine, 
and I'd be willing to bet that it is more stable and faster...


> Problem 2: The system doesn't deal with failure very well.
> 
> Once I got the RAID1 up and running, I tried to simulate a node failure 
> by pulling the network cable from the node while disk activity was 
> taking place.  I was hoping the concentrator would detect the "disk" had 
> failed and simply drop it from the array (so it could later be simply 
> re-added).  Unfortunately that doesn't appear to happen.  What does 
> happen is that all IO to the md device "hangs" (eg: disktest throughput 
> drops to 0M/sec), I am unable to either 'cat /prod/mdstat' to see the md 
> device's status or use mdadm to manually fail the device - both simply 
> result in the command "hanging".

Well, that's the fault of either iSCSI or GNBD. md/raid1 over nbd works 
flawlessly in this scenario on 2.6 kernels (for 2.4, you'll need a 
special patch -- ask me for it, if you want).

--
Paul

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Problems with software RAID + iSCSI or GNBD
  2005-06-29  1:11 ` Paul Clements
@ 2005-06-29  4:42   ` Christopher Smith
  2005-06-29 14:40     ` Bill Davidsen
  0 siblings, 1 reply; 9+ messages in thread
From: Christopher Smith @ 2005-06-29  4:42 UTC (permalink / raw)
  To: Paul Clements, linux-raid

Paul Clements wrote:
> Christopher Smith wrote:
> 
>>  stitch it together into a RAID1.  So, it looks like this:
>>
>>              "Concentrator"
>>                 /dev/md0
>>                  /     \
>>              GigE       GigE
>>                /         \
>>     "Disk node 1"       "Disk node 2"
>>
>> So far I've tried using iSCSI and GNBD as the "back end" to make the 
>> disk space in the nodes visible to the concentrator.  I've had two 
>> problems, one unique to using iSCSI and the other common to both.
> 
> 
> Personally, I wouldn't mess with iSCSI or GNBD. You don't need GNBD in 
> this scenario anyway; simple nbd (which is in the mainline kernel...get 
> the userland tools at sourceforge.net/projects/nbd) will do just fine, 
> and I'd be willing to bet that it is more stable and faster...

I'd briefly tried nbd, but decided to look elsewhere since it needed to 
much manual configuration (no included rc script, /dev nodes appear to 
have to be manually created - yes, I'm lazy).

I've just finished trying NBD now and it seems to solve both my problems 
- rebuild speed is a healthy 40MB/sec + and the failures are dealt with 
"properly" (ie: the md device goes into degraded mode if a component nbd 
suddenly disappears).  This looks like it might be a goer for the disk 
node/RAID-over-network back-end.

On the "front end", however, we have to use iSCSI because we're planning 
on doling the aggregate disk space out to a mix of platforms (some of 
them potentially clustered in the future) so they can re-share the space 
to "client" machines.  However, the IETD iSCSI Target seems pretty 
solid, so I'm not really concerned about that.

> Well, that's the fault of either iSCSI or GNBD. md/raid1 over nbd works 
> flawlessly in this scenario on 2.6 kernels (for 2.4, you'll need a 
> special patch -- ask me for it, if you want).

Yeah, I fixed that problem (at least with iSCSI - haven't tried with 
GNBD).  It was a PEBKAC issue :).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Problems with software RAID + iSCSI or GNBD
  2005-06-29  4:42   ` Christopher Smith
@ 2005-06-29 14:40     ` Bill Davidsen
  2005-06-29 15:24       ` David Dougall
  0 siblings, 1 reply; 9+ messages in thread
From: Bill Davidsen @ 2005-06-29 14:40 UTC (permalink / raw)
  To: Christopher Smith; +Cc: Paul Clements, linux-raid

On Wed, 29 Jun 2005, Christopher Smith wrote:

> > Personally, I wouldn't mess with iSCSI or GNBD. You don't need GNBD in 
> > this scenario anyway; simple nbd (which is in the mainline kernel...get 
> > the userland tools at sourceforge.net/projects/nbd) will do just fine, 
> > and I'd be willing to bet that it is more stable and faster...
> 
> I'd briefly tried nbd, but decided to look elsewhere since it needed to 
> much manual configuration (no included rc script, /dev nodes appear to 
> have to be manually created - yes, I'm lazy).

Based on one test of nbd, it seems to be stable. I did about what you are 
trying, a RAID1 to create an md device, then did an encrypted filesystem 
on it. My test was minor, throw a lot of data at it, check that it is 
there (md5sum), reboot and verify everything still works, drop the local 
drive and rebuild. I did NOT try a rebuild on the nbd drive.

> 
> I've just finished trying NBD now and it seems to solve both my problems 
> - rebuild speed is a healthy 40MB/sec + and the failures are dealt with 
> "properly" (ie: the md device goes into degraded mode if a component nbd 
> suddenly disappears).  This looks like it might be a goer for the disk 
> node/RAID-over-network back-end.

I failed it manually, can't say what pulling the plug will do. Glad it's 
working, I may be doing this on a WAN in the fall.

-- 
bill davidsen <davidsen@tmr.com>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Problems with software RAID + iSCSI or GNBD
  2005-06-29 14:40     ` Bill Davidsen
@ 2005-06-29 15:24       ` David Dougall
  2005-06-29 15:59         ` Bill Davidsen
  2005-06-30 18:05         ` J. Ryan Earl
  0 siblings, 2 replies; 9+ messages in thread
From: David Dougall @ 2005-06-29 15:24 UTC (permalink / raw)
  To: linux-raid

This discussion intrigues me.  I think there is a lot of merit to running
a raid in this manner.  However, if I understand correctly, under normal
circumstances reads from a Raid1 md device will always round-robin between
the devices to increase performance.  Is there a way or what would need to
be done to set a single component in the device to be the primary.  ie.
don't read from the other device unless the first one fails.
I think there was some discussion about this a month or so ago concerning
ramdisk(which I don't know would be quite as useful), but the theory can
apply to any block devices with significantly different speed/latencies,
etc.
Please advise.
--David Dougall


On Wed, 29 Jun 2005, Bill Davidsen wrote:

> On Wed, 29 Jun 2005, Christopher Smith wrote:
>
> > > Personally, I wouldn't mess with iSCSI or GNBD. You don't need GNBD in
> > > this scenario anyway; simple nbd (which is in the mainline kernel...get
> > > the userland tools at sourceforge.net/projects/nbd) will do just fine,
> > > and I'd be willing to bet that it is more stable and faster...
> >
> > I'd briefly tried nbd, but decided to look elsewhere since it needed to
> > much manual configuration (no included rc script, /dev nodes appear to
> > have to be manually created - yes, I'm lazy).
>
> Based on one test of nbd, it seems to be stable. I did about what you are
> trying, a RAID1 to create an md device, then did an encrypted filesystem
> on it. My test was minor, throw a lot of data at it, check that it is
> there (md5sum), reboot and verify everything still works, drop the local
> drive and rebuild. I did NOT try a rebuild on the nbd drive.
>
> >
> > I've just finished trying NBD now and it seems to solve both my problems
> > - rebuild speed is a healthy 40MB/sec + and the failures are dealt with
> > "properly" (ie: the md device goes into degraded mode if a component nbd
> > suddenly disappears).  This looks like it might be a goer for the disk
> > node/RAID-over-network back-end.
>
> I failed it manually, can't say what pulling the plug will do. Glad it's
> working, I may be doing this on a WAN in the fall.
>
> --
> bill davidsen <davidsen@tmr.com>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Problems with software RAID + iSCSI or GNBD
  2005-06-29 15:24       ` David Dougall
@ 2005-06-29 15:59         ` Bill Davidsen
  2005-06-30 18:05         ` J. Ryan Earl
  1 sibling, 0 replies; 9+ messages in thread
From: Bill Davidsen @ 2005-06-29 15:59 UTC (permalink / raw)
  To: David Dougall; +Cc: linux-raid

On Wed, 29 Jun 2005, David Dougall wrote:

> This discussion intrigues me.  I think there is a lot of merit to running
> a raid in this manner.  However, if I understand correctly, under normal
> circumstances reads from a Raid1 md device will always round-robin between
> the devices to increase performance.  Is there a way or what would need to
> be done to set a single component in the device to be the primary.  ie.
> don't read from the other device unless the first one fails.
> I think there was some discussion about this a month or so ago concerning
> ramdisk(which I don't know would be quite as useful), but the theory can
> apply to any block devices with significantly different speed/latencies,
> etc.
> Please advise.
> --David Dougall

It would be useful to prioritize device selection in mirrored arrays, such
that for reads one device would be used until the i/o queue had N entries,
then start using the next. It would even seem that the prioritization
could be set by the kernel, "by observation of seek and transfer times."
In the case of a connection via GigE I doubt the network time is an issue
compared to the physical processes. If the link is T1 between continents
it's an issue for big data, and if one end is connected via PPP over
dialup (looking at the worst case) you probably don't want to read over
the net unless the local device has flames coming out.

Other common cases like slow disk on a fast bus and a fast disk on a
slower bus might really mess up any auto tuning. The common case is a
laptop with a 5400 rpm PATA drive internal and a 7200 rpm fast seek drive
plugged into USB2.

I don't see a huge benefit unless there is a really slow network involved.  
Perhaps some developer could speak to the possibility of preferential
reads in general. I peeked at the code, and while it looks actually fairly
simple, it also feels as if it might have a lot of "unintended
consequences," and violations of Plauger's "Law of Least Astonishment."

-- 
bill davidsen <davidsen@tmr.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Problems with software RAID + iSCSI or GNBD
  2005-06-29 15:24       ` David Dougall
  2005-06-29 15:59         ` Bill Davidsen
@ 2005-06-30 18:05         ` J. Ryan Earl
  1 sibling, 0 replies; 9+ messages in thread
From: J. Ryan Earl @ 2005-06-30 18:05 UTC (permalink / raw)
  To: David Dougall; +Cc: linux-raid

David Dougall wrote:

>reads from a Raid1 md device will always round-robin between
>the devices to increase performance.
>
If it was round-robin that would actually decrease the performance, it's 
locality based.  It reads from the RAID member that last operated in the 
physical vicinity of the requested data to be read.  ie it minimize the 
distance disk heads have to move.  This is one way RAID1 lowers average 
seek time.

-ryan

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2005-06-30 18:05 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-06-27  2:42 Problems with software RAID + iSCSI or GNBD Christopher Smith
2005-06-28 16:05 ` Michael Stumpf
2005-06-29  2:09   ` Christopher Smith
2005-06-29  1:11 ` Paul Clements
2005-06-29  4:42   ` Christopher Smith
2005-06-29 14:40     ` Bill Davidsen
2005-06-29 15:24       ` David Dougall
2005-06-29 15:59         ` Bill Davidsen
2005-06-30 18:05         ` J. Ryan Earl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).