* Is there a grand plan for FC failover? @ 2004-01-26 14:18 Simon Kelley 2004-01-26 15:37 ` James Bottomley 0 siblings, 1 reply; 19+ messages in thread From: Simon Kelley @ 2004-01-26 14:18 UTC (permalink / raw) To: linux-scsi I see that 2.6.x kernels now have the qla2xxx driver in the mainline, but without the failover code. What is the reason for that? Is there a plan to provide failover facilities at a higher level which will be usable with all suitable low-level drivers and hardware? I'm very much in favour of using drivers which are developed in the kernel mainline but I have an application which needs failover so I might be forced back to the qlogic-distributed code. Cheers, Simon. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Is there a grand plan for FC failover? 2004-01-26 14:18 Is there a grand plan for FC failover? Simon Kelley @ 2004-01-26 15:37 ` James Bottomley 2004-01-28 15:02 ` Philip R. Auld 0 siblings, 1 reply; 19+ messages in thread From: James Bottomley @ 2004-01-26 15:37 UTC (permalink / raw) To: Simon Kelley; +Cc: SCSI Mailing List On Mon, 2004-01-26 at 08:18, Simon Kelley wrote: > I see that 2.6.x kernels now have the qla2xxx driver in the mainline, > but without the failover code. > > What is the reason for that? Is there a plan provide failover facilities > at a higher level which will be usable with all suitable low-level > drivers and hardware? > > I'm very much in favour of using drivers which are developed in the > kernel mainline but I have an application which needs failover so I > might be forced back to the qlogic-distributed code. Yes, the direction coming out of KS/OLS last year was to use the dm or md multi-path code to sit the failover driver on top of sd (or any other block driver). The idea being that the Volume Manager layer is the most stack generic place to do this type of thing. The thread on this is here: http://marc.theaimsgroup.com/?t=106005575400003 James ^ permalink raw reply [flat|nested] 19+ messages in thread
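James's suggestion — failover handled by a generic layer stacked above sd, so the Volume Manager level presents one logical device over several paths — can be illustrated with a toy model. This is a hand-written Python sketch of the concept, not the dm/md code; the path names and failure counts are invented:

```python
class Path:
    """One route to the same physical storage; fails after `fail_after` I/Os if set."""
    def __init__(self, name, fail_after=None):
        self.name = name
        self.fail_after = fail_after
        self.ios = 0

    def submit(self, block):
        self.ios += 1
        if self.fail_after is not None and self.ios > self.fail_after:
            raise IOError(f"path {self.name} down")
        return f"data@{block} via {self.name}"


class FailoverDevice:
    """Present one logical device; on error, mark the path failed and retry the next."""
    def __init__(self, paths):
        self.paths = list(paths)

    def read(self, block):
        while self.paths:
            try:
                return self.paths[0].submit(block)
            except IOError:
                self.paths.pop(0)   # fail over: drop the dead path, retry on the next
        raise IOError("all paths failed")


dev = FailoverDevice([Path("sda", fail_after=2), Path("sdb")])
results = [dev.read(b) for b in range(4)]
print(results[-1])   # later I/O is transparently served by the surviving path
```

The key property is the one James describes: the consumer above never sees the path failure, only the logical device, and no userspace intervention is needed for the retry itself.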
* Re: Is there a grand plan for FC failover? 2004-01-26 15:37 ` James Bottomley @ 2004-01-28 15:02 ` Philip R. Auld 2004-01-28 16:57 ` James Bottomley 0 siblings, 1 reply; 19+ messages in thread From: Philip R. Auld @ 2004-01-28 15:02 UTC (permalink / raw) To: James Bottomley; +Cc: Simon Kelley, SCSI Mailing List Hi, Rumor has it that on Mon, Jan 26, 2004 at 09:37:25AM -0600 James Bottomley said: > On Mon, 2004-01-26 at 08:18, Simon Kelley wrote: > > > I'm very much in favour of using drivers which are developed in the > > kernel mainline but I have an application which needs failover so I > > might be forced back to the qlogic-distributed code. > > Yes, the direction coming out of KS/OLS last year was to use the dm or > md multi-path code to sit the failover driver on top of sd (or any other > block driver). > > The idea being that the Volume Manager layer is the most stack generic > place to do this type of thing. The thread on this is here: > > http://marc.theaimsgroup.com/?t=106005575400003 > There are some issues left unresolved with this approach. The ones that come to mind are: 1) load balancing when possible: it's not enough to be just a failover mechanism. 2) requiring a userspace program to execute for failover is problematic when it could be the root disk that needs failing over. 3) Handling partitions is a problem. I see multipath and md/RAID as two different animals. Multipathing is multiple ways to reach the same physical block. That is, it's under the logical layer, while RAID is multiple ways to reach the same logical block. It's basically many-to-one vs one-to-many. Having multiple physical paths to a logical partition is a little counter-intuitive. That said, I'm all for getting Linux to have a decent multipath implementation, at whatever layer is agreed upon. I'd love to be able to use the native Linux multipathing and stop supporting the ones I've written. 
FWIW, Phil > James > > > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Philip R. Auld, Ph.D. Egenera, Inc. Principal Software Engineer 165 Forest St. (508) 858-2628 Marlboro, MA 01752 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Is there a grand plan for FC failover? 2004-01-28 15:02 ` Philip R. Auld @ 2004-01-28 16:57 ` James Bottomley 2004-01-28 18:00 ` Philip R. Auld 0 siblings, 1 reply; 19+ messages in thread From: James Bottomley @ 2004-01-28 16:57 UTC (permalink / raw) To: Philip R. Auld; +Cc: Simon Kelley, SCSI Mailing List On Wed, 2004-01-28 at 09:02, Philip R. Auld wrote: > 1) load balancing when possible: it's not enough to be just a > failover mechanism. For first out, a failover target supplies most of the needs. Nothing prevents an aggregation target being added later, but failover is essential. > 2) requiring a userspace program to execute for failover is > problematic when it could be the root disk that needs > failing over. Userspace is required for *configuration*, not failover. The path targets failover automatically as they see I/O down particular paths timing out or failing. That being said, nothing prevents async notifications via hotplug also triggering a failover (rather than having to wait for timeout etc) but it's not required. > 3) Handling partitions is a problem. It is? How? The block approach can sit either above or below partitions (assuming a slightly more flexible handling of partitions than dm provides). > I see multipath and md/RAID as two different animals. Multipathing is > multiple ways to reach the same physical block. That is it's under > the logical layer. While RAID is multiple ways to reach the same > logical block. It's basically many-to-one vs one-to-many. Having > multiple physical paths to a logical partition is a little > counter-intuitive. Nothing in the discussion assumed them to be similar. The notes were that md already had a multi-path target, and that translated error indications would be useful to software raid. Thus designing the fastfail to cater to both looks like a good idea, but that's just good design. Robust multi-pathing and raid will be built on top of scsi/block fastfail. 
James ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Is there a grand plan for FC failover? 2004-01-28 16:57 ` James Bottomley @ 2004-01-28 18:00 ` Philip R. Auld 2004-01-28 20:47 ` Patrick Mansfield 2004-01-28 22:37 ` Mike Christie 0 siblings, 2 replies; 19+ messages in thread From: Philip R. Auld @ 2004-01-28 18:00 UTC (permalink / raw) To: James Bottomley; +Cc: Simon Kelley, SCSI Mailing List Hi James, Thanks for the reply. Rumor has it that on Wed, Jan 28, 2004 at 10:57:29AM -0600 James Bottomley said: > On Wed, 2004-01-28 at 09:02, Philip R. Auld wrote: > > 1) load balancing when possible: it's not enough to be just a > > failover mechanism. > > For first out, a failover target supplies most of the needs. Nothing > prevents an aggregation target being added later, but failover is > essential. Yes, failover is necessary, but not sufficient to make a decent multipath driver. I'm a little concerned by the "added later" part. Shouldn't it be designed in? > > > 2) requiring a userspace program to execute for failover is > > problematic when it could be the root disk that needs > > failing over. > > Userspace is required for *configuration* not failover. The path > targets failover automatically as they see I/O down particular paths > timing out or failing. That being said, nothing prevents async > notifications via hotplug also triggering a failover (rather than having > to wait for timeout etc) but it's not required. > There was some discussion about vendor plugins to do proprietary failover when needed. Say to make a passive path active. My concern is that this be memory resident and that we not have to go to the disk for such things. If there is no need for that or it's done in such a way that it doesn't need the disk this won't be a problem then. > > 3) Handling partitions is a problem. > > It is? How? The block approach can sit either above or below partitions > (assuming a slightly more flexible handling of partitions that dm > provides). > Great, that works for me. 
> > I see multipath and md/RAID as two different animals. Multipathing is > > multiple ways to reach the same physical block. That is it's under > > the logical layer. While RAID is multiple ways to reach the same > > logical block. It's basically many-to-one vs one-to-many. Having > > multiple physical paths to a logical partition is a little > > counter-intuitive. > > Nothing in the discussion assumed them to be similar. The notes were > that md already had a multi-path target, and that translated error > indications would be useful to software raid. Thus designing the > fastfail to cater to both looks like a good idea, but that's just good > design. Robust multi-pathing and raid will be built on top of > scsi/block fastfail. Aside from the implicit one made by having a multi-path target in the md driver, I guess that's true. As I said, I think this all sounds great. I just hadn't seen these (non?)issues addressed on the list after they were raised and want to make sure there was some thought given to them. Cheers, Phil > > James > -- Philip R. Auld, Ph.D. Egenera, Inc. Principal Software Engineer 165 Forest St. (508) 858-2628 Marlboro, MA 01752 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Is there a grand plan for FC failover? 2004-01-28 18:00 ` Philip R. Auld @ 2004-01-28 20:47 ` Patrick Mansfield 2004-01-28 22:14 ` James Bottomley 2004-01-28 22:37 ` Mike Christie 1 sibling, 1 reply; 19+ messages in thread From: Patrick Mansfield @ 2004-01-28 20:47 UTC (permalink / raw) To: Philip R. Auld; +Cc: James Bottomley, Simon Kelley, SCSI Mailing List, dm-devel [cc-ing dm-devel] My two main issues with dm multipath versus scsi core multipath are: 1) It does not handle character devices. 2) It does not have the information available about the state of the scsi_device or scsi_host (for path selection), or about the elevator. If we end up passing all the scsi information up to dm, and it does the same things that we already do in scsi (or in block), what is the point of putting the code into a separate layer? More scsi fastfail-like code is still needed - probably for all the cases where scsi_dev_queue_ready and scsi_host_queue_ready return 0 - and more. For example, should we somehow make sdev->queue_depth available to dm? There are still issues with a per-path elevator (i.e. we have an elevator for each path rather than the entire device) that probably won't be fixed cleanly in 2.6. AFAIUI this requires moving dm from a bio based approach to a request based one. -- Patrick Mansfield ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Is there a grand plan for FC failover? 2004-01-28 20:47 ` Patrick Mansfield @ 2004-01-28 22:14 ` James Bottomley 2004-01-29 0:55 ` Patrick Mansfield 0 siblings, 1 reply; 19+ messages in thread From: James Bottomley @ 2004-01-28 22:14 UTC (permalink / raw) To: Patrick Mansfield Cc: Philip R. Auld, Simon Kelley, SCSI Mailing List, dm-devel On Wed, 2004-01-28 at 15:47, Patrick Mansfield wrote: > [cc-ing dm-devel] > > My two main issues with dm multipath versus scsi core multipath are: > > 1) It does not handle character devices. Multi-path character devices are pretty much corner cases. It's not clear to me that you need to handle them in kernel at all. Things like multi-path tape often come with an application that's perfectly happy to take the presentation of two or more tape devices. Do we have a character device example we need to support as a single device? > 2) It does not have the information available about the state of the > scsi_device or scsi_host (for path selection), or about the elevator. Well, this is one of those abstraction case things. Can we make the information generic enough that the pathing layer makes the right decisions without worrying about what the underlying internals are? That's where enhancements to the fastfail layer come in. I believe we can get the fastfail information to the point where we can use it to make good decisions regardless of underlying transport (or even subsystem). > If we end up passing all the scsi information up to dm, and it does the > same things that we already do in scsi (or in block), what is the point of > putting the code into a separate layer? It's for interpretation by those modular add-ons that are allowed to cater to specific devices. > More scsi fastfail like code is still needed - probably for all the cases > where scsi_dev_queue_ready and scsi_host_queue_ready return 0 - and more. > For example, should we somehow make sdev->queue_depth available to dm? I agree. We only have the basics at the moment. 
Expanding the error indications is a necessary next step. > There are still issues with with a per path elevator (i.e. we have an > elevator for each path rather than the entire device) that probably won't > be fixed cleanly in 2.6. AFAIUI this requires moving dm from a bio based > approach to a request based one. We had the "where does the elevator go" discussion at the OLS bof. I think I heard agreement that the current situation of the elevator between dm and block is suboptimal and that we'd like a true coalescing elevator above dm with a vestigial one for the mid-layer to use for queueing below. I think this is a requirement for dm multipath to work well, but it's not a requirement for it actually to work. James ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Is there a grand plan for FC failover? 2004-01-28 22:14 ` James Bottomley @ 2004-01-29 0:55 ` Patrick Mansfield 2004-01-30 19:48 ` [dm-devel] " Joe Thornber 0 siblings, 1 reply; 19+ messages in thread From: Patrick Mansfield @ 2004-01-29 0:55 UTC (permalink / raw) To: James Bottomley; +Cc: Philip R. Auld, Simon Kelley, SCSI Mailing List, dm-devel On Wed, Jan 28, 2004 at 05:14:26PM -0500, James Bottomley wrote: > On Wed, 2004-01-28 at 15:47, Patrick Mansfield wrote: > > [cc-ing dm-devel] > > > > My two main issues with dm multipath versus scsi core multipath are: > > > > 1) It does not handle character devices. > > Multi-path character devices are pretty much corner cases. It's not > clear to me that you need to handle them in kernel at all. Things like > multi-path tape often come with an application that's perfectly happy to > take the presentation of two or more tape devices. I have not seen such applications. Standard applications like tar and cpio are not going to work well. If you plug a single ported tape drive or other scsi device into a fibre channel SAN, it will show up multiple times, the hardware itself need not be multiported. > Do we have a > character device example we need to support as a single device? Not that I know of, but I have not worked in this area recently. I assume there are also fibre attached media changers. BTW, we need some sort of udev rules so we can have a multi-path device (sd part, not dm part) actually show up multiple times. > > 2) It does not have the information available about the state of the > > scsi_device or scsi_host (for path selection), or about the elevator. > > Well, this is one of those abstraction case things. Can we make the > information generic enough that the pathing layer makes the right > decisions without worrying about what the underlying internals are? 
I don't think current interfaces and passing up error codes will be enough, for example: a queue full on a given path (aka scsi_device) when there is no other IO on that device could lead to starvation, similar to one node in a cluster starving other nodes out. Limiting IO via some sort of queue_depth in dm would help solve this particular problem, but there is nothing in place today for dm to have its own request queue or be request based; limiting the number of bio's to an arbitrary value would suck, and the sdev->queue_depth is not visible to dm today. > That's where enhancements to the fastfail layer come in. I believe we > can get the fastfail information to the point where we can use it to > make good decisions regardless of underlying transport (or even > subsystem). > > If we end up passing all the scsi information up to dm, and it does the > > same things that we already do in scsi (or in block), what is the point of > > putting the code into a separate layer? > > It's for interpretation by those modular add-ons that are allowed to > cater to specific devices. I'm not sure what you mean - adding code or data that is only ever used by dm is wasted if you're not using dm. > > More scsi fastfail like code is still needed - probably for all the cases > > where scsi_dev_queue_ready and scsi_host_queue_ready return 0 - and more. > > For example, should we somehow make sdev->queue_depth available to dm? > > I agree. We only have the basics at the moment. Expanding the error > indications is a necessary next step. Yes, I was looking into this for use with changes Mike C is working on - pass up an error via end_that_request_first or such. > We had the "where does the elevator go" discussion at the OLS bof. I > think I heard agreement that the current situation of between dm and > block is suboptimal and that we'd like a true coalescing elevator above > dm with a vestigial one for the mid-layer to use for queueing below. 
I > think this is a requirement for dm multipath to work well, but it's not > a requirement for it actually to work. If the performance is bad enough, it doesn't matter if it works. -- Patrick Mansfield ^ permalink raw reply [flat|nested] 19+ messages in thread
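Patrick's queue-full starvation worry can be made concrete with a toy dispatcher: if the multipath layer tracks a per-path depth limit (an analogue of the sdev->queue_depth he notes dm cannot see today), it can hold I/O rather than keep piling it onto a throttled path. This is a hypothetical Python model of that idea, not dm code; the path names and depth values are invented:

```python
from collections import deque

class ThrottledPath:
    def __init__(self, name, queue_depth):
        self.name = name
        self.queue_depth = queue_depth   # device-imposed limit (sdev->queue_depth analogue)
        self.inflight = 0

    def can_queue(self):
        return self.inflight < self.queue_depth

def dispatch(paths, ios):
    """Send each I/O to the least-loaded path that still has queue slots."""
    sent, requeued = [], deque()
    for io in ios:
        ready = [p for p in paths if p.can_queue()]
        if not ready:
            requeued.append(io)   # all paths full: hold the I/O instead of forcing a queue-full
            continue
        best = min(ready, key=lambda p: p.inflight)
        best.inflight += 1
        sent.append((io, best.name))
    return sent, requeued

a, b = ThrottledPath("sda", 2), ThrottledPath("sdb", 1)
sent, held = dispatch([a, b], range(5))
```

Without visibility into the depth limits, the dispatcher would have sent all five I/Os down and discovered the overload only from queue-full responses — which is exactly the information gap being discussed.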
* Re: [dm-devel] Re: Is there a grand plan for FC failover? 2004-01-29 0:55 ` Patrick Mansfield @ 2004-01-30 19:48 ` Joe Thornber 2004-01-31 9:30 ` Jens Axboe 0 siblings, 1 reply; 19+ messages in thread From: Joe Thornber @ 2004-01-30 19:48 UTC (permalink / raw) To: dm-devel; +Cc: James Bottomley, Philip R. Auld, Simon Kelley, SCSI Mailing List On Wed, Jan 28, 2004 at 04:55:34PM -0800, Patrick Mansfield wrote: > > We had the "where does the elevator go" discussion at the OLS bof. I > > think I heard agreement that the current situation of between dm and > > block is suboptimal and that we'd like a true coalescing elevator above > > dm with a vestigial one for the mid-layer to use for queueing below. I > > think this is a requirement for dm multipath to work well, but it's not > > a requirement for it actually to work. > > If the performance is bad enough, it doesn't matter if it works. It would be great to get some benchmarks to back up these arguments. e.g., performance of dm mpath with a simple round robin selector, compared to a scsi layer implementation. Lifting the elevator (or lowering dm) is a big piece of work that I won't even consider unless there is very good reason; the reason probably needs to be broader than just multipath too. Even if we did decide to do this, it won't happen in 2.6. - Joe ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [dm-devel] Re: Is there a grand plan for FC failover? 2004-01-30 19:48 ` [dm-devel] " Joe Thornber @ 2004-01-31 9:30 ` Jens Axboe 2004-01-31 16:59 ` Philip R. Auld 0 siblings, 1 reply; 19+ messages in thread From: Jens Axboe @ 2004-01-31 9:30 UTC (permalink / raw) To: Joe Thornber Cc: dm-devel, James Bottomley, Philip R. Auld, Simon Kelley, SCSI Mailing List On Fri, Jan 30 2004, Joe Thornber wrote: > On Wed, Jan 28, 2004 at 04:55:34PM -0800, Patrick Mansfield wrote: > > > We had the "where does the elevator go" discussion at the OLS bof. I > > > think I heard agreement that the current situation of between dm and > > > block is suboptimal and that we'd like a true coalescing elevator above > > > dm with a vestigial one for the mid-layer to use for queueing below. I > > > think this is a requirement for dm multipath to work well, but it's not > > > a requirement for it actually to work. > > > > If the performance is bad enough, it doesn't matter if it works. > > It would be great to get some benchmarks to back up these arguments. > eg, performance of dm mpath with a simple round robin selector, > compared to a scsi layer implementation. Lifting the elevator (or > lowering dm) is a big piece of work that I wont even consider unless > there is very good reason; the reason probably needs to be broader > than just multipath too. Even if we did decide to do this, it won't > happen in 2.6. I suspect the problem really isn't that huge in 2.6, since most performance file systems are using mpage or building their own big bio's. So in a sense, some of the merging already does happen above dm (and the io scheduler). -- Jens Axboe ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [dm-devel] Re: Is there a grand plan for FC failover? 2004-01-31 9:30 ` Jens Axboe @ 2004-01-31 16:59 ` Philip R. Auld 2004-01-31 17:42 ` Jens Axboe 0 siblings, 1 reply; 19+ messages in thread From: Philip R. Auld @ 2004-01-31 16:59 UTC (permalink / raw) To: Jens Axboe Cc: Joe Thornber, dm-devel, James Bottomley, Simon Kelley, SCSI Mailing List Rumor has it that on Sat, Jan 31, 2004 at 10:30:37AM +0100 Jens Axboe said: > On Fri, Jan 30 2004, Joe Thornber wrote: > > It would be great to get some benchmarks to back up these arguments. > > eg, performance of dm mpath with a simple round robin selector, > > compared to a scsi layer implementation. Lifting the elevator (or > > lowering dm) is a big piece of work that I wont even consider unless > > there is very good reason; the reason probably needs to be broader > > than just multipath too. Even if we did decide to do this, it won't > > happen in 2.6. > > I suspect the problem really isn't that huge in 2.6, since most > performance file systems are using mpage or building their own big > bio's. So in a sense, some of the merging already does happen above dm > (and the io scheduler). Out of curiosity, where does raw io fit into that in 2.6? Thanks, Phil -- Philip R. Auld, Ph.D. Egenera, Inc. Principal Software Engineer 165 Forest St. (508) 858-2628 Marlboro, MA 01752 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [dm-devel] Re: Is there a grand plan for FC failover? 2004-01-31 16:59 ` Philip R. Auld @ 2004-01-31 17:42 ` Jens Axboe 2004-02-12 15:17 ` Philip R. Auld 0 siblings, 1 reply; 19+ messages in thread From: Jens Axboe @ 2004-01-31 17:42 UTC (permalink / raw) To: Philip R. Auld Cc: Joe Thornber, dm-devel, James Bottomley, Simon Kelley, SCSI Mailing List On Sat, Jan 31 2004, Philip R. Auld wrote: > Rumor has it that on Sat, Jan 31, 2004 at 10:30:37AM +0100 Jens Axboe said: > > On Fri, Jan 30 2004, Joe Thornber wrote: > > > It would be great to get some benchmarks to back up these arguments. > > > eg, performance of dm mpath with a simple round robin selector, > > > compared to a scsi layer implementation. Lifting the elevator (or > > > lowering dm) is a big piece of work that I wont even consider unless > > > there is very good reason; the reason probably needs to be broader > > > than just multipath too. Even if we did decide to do this, it won't > > > happen in 2.6. > > > > I suspect the problem really isn't that huge in 2.6, since most > > performance file systems are using mpage or building their own big > > bio's. So in a sense, some of the merging already does happen above dm > > (and the io scheduler). > > Out of curiosity, where does raw io fit into that in 2.6? raw io (or O_DIRECT io, same path) should work even better, always send out bio's as big as the underlying device can support. -- Jens Axboe ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [dm-devel] Re: Is there a grand plan for FC failover? 2004-01-31 17:42 ` Jens Axboe @ 2004-02-12 15:17 ` Philip R. Auld 2004-02-12 15:28 ` Arjan van de Ven 0 siblings, 1 reply; 19+ messages in thread From: Philip R. Auld @ 2004-02-12 15:17 UTC (permalink / raw) To: Jens Axboe Cc: Joe Thornber, dm-devel, James Bottomley, Simon Kelley, SCSI Mailing List Rumor has it that on Sat, Jan 31, 2004 at 06:42:01PM +0100 Jens Axboe said: > On Sat, Jan 31 2004, Philip R. Auld wrote: > > Rumor has it that on Sat, Jan 31, 2004 at 10:30:37AM +0100 Jens Axboe said: > > > On Fri, Jan 30 2004, Joe Thornber wrote: > > > > It would be great to get some benchmarks to back up these arguments. > > > > eg, performance of dm mpath with a simple round robin selector, > > > > compared to a scsi layer implementation. Lifting the elevator (or > > > > lowering dm) is a big piece of work that I wont even consider unless > > > > there is very good reason; the reason probably needs to be broader > > > > than just multipath too. Even if we did decide to do this, it won't > > > > happen in 2.6. > > > > > > I suspect the problem really isn't that huge in 2.6, since most > > > performance file systems are using mpage or building their own big > > > bio's. So in a sense, some of the merging already does happen above dm > > > (and the io scheduler). > > > > Out of curiosity, where does raw io fit into that in 2.6? > > raw io (or O_DIRECT io, same path) should work even better, always send > out bio's as big as the underlying device can support. > That size is based on the blocksize? Is there a way to set the block size higher than 512 w/o mounting it? I've gotten really bad rawio performance on 2.4, since I have a limit of 32 sg entries. When rawio uses a 512-byte blocksize, IOs are limited to 16K. Will this still be a problem in 2.6? 
Thanks again, Phil > -- > Jens Axboe > > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Philip R. Auld, Ph.D. Egenera, Inc. Principal Software Engineer 165 Forest St. (508) 858-2628 Marlboro, MA 01752 ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [dm-devel] Re: Is there a grand plan for FC failover? 2004-02-12 15:17 ` Philip R. Auld @ 2004-02-12 15:28 ` Arjan van de Ven 2004-02-12 16:03 ` Philip R. Auld 0 siblings, 1 reply; 19+ messages in thread From: Arjan van de Ven @ 2004-02-12 15:28 UTC (permalink / raw) To: Philip R. Auld; +Cc: SCSI Mailing List > That size is based on the blocksize? Is there a way to set the block size higher than > 512 w/o mounting it? I've gotten really bad rawio performance on 2.4. since I have a > limit of 32 sg entries. When Rawio uses a 512 byte blocksize IOs are limited to 16K. > > Will this still be a problem in 2.6? If you code the driver right, it's not even a problem in 2.4 if you use an Enterprise Linux distro kernel..... ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [dm-devel] Re: Is there a grand plan for FC failover? 2004-02-12 15:28 ` Arjan van de Ven @ 2004-02-12 16:03 ` Philip R. Auld 0 siblings, 0 replies; 19+ messages in thread From: Philip R. Auld @ 2004-02-12 16:03 UTC (permalink / raw) To: Arjan van de Ven; +Cc: SCSI Mailing List Rumor has it that on Thu, Feb 12, 2004 at 04:28:41PM +0100 Arjan van de Ven said: > > > That size is based on the blocksize? Is there a way to set the block size higher than > > 512 w/o mounting it? I've gotten really bad rawio performance on 2.4. since I have a > > limit of 32 sg entries. When Rawio uses a 512 byte blocksize IOs are limited to 16K. > > > > Will this still be a problem in 2.6? > > If you code the driver right, it's not even a problem in 2.4 if you use > a Enterprise Linux distro kernel..... > Hi Arjan, I think you know that I do use your Enterprise kernel :) Can you please explain what "right" means in this case? I've a hardware limit of 32 sg entries. When the blocksize is the default hardware sector size of 512, I get commands of 32 separate 512-byte scatter-gather entries. This is determined at a higher level than my LLDD. What is there that I can do about it in the driver? Cheers, Phil -- Philip R. Auld, Ph.D. Egenera, Inc. Principal Software Engineer 165 Forest St. (508) 858-2628 Marlboro, MA 01752 ^ permalink raw reply [flat|nested] 19+ messages in thread
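The 16K ceiling Phil describes follows from simple arithmetic: with a hard limit of 32 scatter-gather entries and one 512-byte buffer per entry, no single command can exceed 32 × 512 bytes. Raising the per-segment size is what lifts the limit. A quick illustration (the function is just for exposition):

```python
def max_io_bytes(sg_entries, segment_size):
    """Largest single command when each sg entry maps one buffer of segment_size bytes."""
    return sg_entries * segment_size

print(max_io_bytes(32, 512))    # 16384 — the 16K limit with 512-byte buffers
print(max_io_bytes(32, 4096))   # 131072 — 128K once each entry maps a full 4K page
```

This is why the blocksize question matters: larger (or physically contiguous) segments multiply the per-command I/O size without changing the hardware's sg-entry limit.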
* Re: Is there a grand plan for FC failover? 2004-01-28 18:00 ` Philip R. Auld 2004-01-28 20:47 ` Patrick Mansfield @ 2004-01-28 22:37 ` Mike Christie 2004-01-29 15:24 ` Philip R. Auld 1 sibling, 1 reply; 19+ messages in thread From: Mike Christie @ 2004-01-28 22:37 UTC (permalink / raw) To: Philip R. Auld; +Cc: James Bottomley, Simon Kelley, SCSI Mailing List Philip R. Auld wrote: > Hi James, > Thanks for the reply. > > Rumor has it that on Wed, Jan 28, 2004 at 10:57:29AM -0600 James Bottomley said: > >>On Wed, 2004-01-28 at 09:02, Philip R. Auld wrote: >> >>> 1) load balancing when possible: it's not enough to be just a >>> failover mechanism. >> >>For first out, a failover target supplies most of the needs. Nothing >>prevents an aggregation target being added later, but failover is >>essential. > > > Yes, failover is necessary, but not sufficient to make a decent multipath driver. > I'm a little concerned by the "added later" part. Shouldn't it be designed in? The DM multipath can do both. Unfortunately, some work still needs to be done. For example, when doing load balancing how much IO to send down each path is an issue that the DM maintainer had asked for feedback on. Suggestions? > >>> 2) requiring a userspace program to execute for failover is >>> problematic when it could be the root disk that needs >>> failing over. >> >>Userspace is required for *configuration* not failover. The path >>targets failover automatically as they see I/O down particular paths >>timing out or failing. That being said, nothing prevents async >>notifications via hotplug also triggering a failover (rather than having >>to wait for timeout etc) but it's not required. >> > > > There was some discussion about vendor plugins to do proprietary failover > when needed. Say to make a passive path active. My concern is that this be > memory resident and that we not have to go to the disk for such things. 
> > If there is no need for that or it's done in such a way that it doesn't need > the disk this won't be a problem then. > This is something both MD and DM multipath are not yet fully prepared for. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Is there a grand plan for FC failover? 2004-01-28 22:37 ` Mike Christie @ 2004-01-29 15:24 ` Philip R. Auld 2004-01-29 16:00 ` James Bottomley 0 siblings, 1 reply; 19+ messages in thread From: Philip R. Auld @ 2004-01-29 15:24 UTC (permalink / raw) To: Mike Christie; +Cc: James Bottomley, Simon Kelley, SCSI Mailing List Rumor has it that on Wed, Jan 28, 2004 at 02:37:28PM -0800 Mike Christie said: > Philip R. Auld wrote: > > Hi James, > > Thanks for the reply. > > > > Rumor has it that on Wed, Jan 28, 2004 at 10:57:29AM -0600 James Bottomley said: > > > >>On Wed, 2004-01-28 at 09:02, Philip R. Auld wrote: > >> > >>> 1) load balancing when possible: it's not enough to be just a > >>> failover mechanism. > >> > >>For first out, a failover target supplies most of the needs. Nothing > >>prevents an aggregation target being added later, but failover is > >>essential. > > > > > > Yes, failover is necessary, but not sufficient to make a decent multipath driver. > > I'm a little concerned by the "added later" part. Shouldn't it be designed in? > > The DM multipath can do both. Unfortunately, some work still needs to be > done. For example, when doing load balancing how much IO to send down > each path is an issue that the DM maintainer had asked for feedback on. > Suggestions? > That leads back to where to put the request merging and elevator code... The way I currently do load-balancing is on a scsi_cmnd basis. At that point the IO is coalesced already. A shortest queue_depth falling back to round-robin algorithm balances really well on individual commands. From a logical block level, I'm not sure how. I'm a little unfamiliar (as my posts have shown) with how/where the DM code fits in. (I've also not gotten a good look into 2.6 BIO... busy on 2.4 still.) I think the place to do load balancing would be below the block queue so that the IOs are coalesced. 
IMO, until you've merged the individual block requests you can't make a good decision about how to balance the load. -- Philip R. Auld, Ph.D. Egenera, Inc. Principal Software Engineer 165 Forest St. (508) 858-2628 Marlboro, MA 01752 ^ permalink raw reply [flat|nested] 19+ messages in thread
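[Editor's note: the shortest-queue-depth selection with round-robin fallback that Philip describes above can be sketched roughly as follows. Every name and structure here is a hypothetical stand-in for illustration, not taken from any real driver or kernel API.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical path descriptor: the driver would track how many
 * commands are currently outstanding on each path. */
struct path {
    int queue_depth;   /* commands outstanding on this path */
    int usable;        /* nonzero if the path has not failed */
};

/* Pick the usable path with the shortest queue depth; when several
 * paths tie, rotate among them round-robin.  Returns the chosen
 * path index, or -1 if no path is usable.  *rr_cursor carries the
 * round-robin position across calls. */
int select_path(struct path *paths, int npaths, int *rr_cursor)
{
    int best = -1, i;

    /* Find the minimum queue depth among usable paths. */
    for (i = 0; i < npaths; i++) {
        if (!paths[i].usable)
            continue;
        if (best < 0 || paths[i].queue_depth < paths[best].queue_depth)
            best = i;
    }
    if (best < 0)
        return -1;

    /* Round-robin among paths that tie at the minimum depth:
     * scan forward from the last choice and take the first tie. */
    for (i = 1; i <= npaths; i++) {
        int cand = (*rr_cursor + i) % npaths;
        if (paths[cand].usable &&
            paths[cand].queue_depth == paths[best].queue_depth) {
            best = cand;
            break;
        }
    }
    *rr_cursor = best;
    return best;
}
```

When all paths are idle the selector degenerates to plain round-robin, and under uneven load it steers commands to the least-busy path, which matches the behavior Philip reports balancing "really well on individual commands."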
* Re: Is there a grand plan for FC failover? 2004-01-29 15:24 ` Philip R. Auld @ 2004-01-29 16:00 ` James Bottomley 2004-01-29 23:25 ` Mike Christie 0 siblings, 1 reply; 19+ messages in thread From: James Bottomley @ 2004-01-29 16:00 UTC (permalink / raw) To: Philip R. Auld; +Cc: Mike Christie, Simon Kelley, SCSI Mailing List On Thu, 2004-01-29 at 10:24, Philip R. Auld wrote: > I think the place to do load balancing would be below the block queue so that > the IOs are coalesced. IMO, until you've merged the individual block requests > you can't make a good decision about how to balance the load. Yes, that's what I think too, so the elevator should go above dm, with a vestigial elevator between dm and sd simply for SCSI to use in queuing. Unfortunately, the way it works today is that the elevator is between dm and sd. I know people are thinking about how to change this, but I'm not sure how far anyone's got. James ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Is there a grand plan for FC failover? 2004-01-29 16:00 ` James Bottomley @ 2004-01-29 23:25 ` Mike Christie 0 siblings, 0 replies; 19+ messages in thread From: Mike Christie @ 2004-01-29 23:25 UTC (permalink / raw) To: James Bottomley Cc: Philip R. Auld, Simon Kelley, SCSI Mailing List, Jens Axboe James Bottomley wrote: > On Thu, 2004-01-29 at 10:24, Philip R. Auld wrote: > >>I think the place to do load balancing would be below the block queue so that >>the IOs are coalesced. IMO, until you've merged the individual block requests >>you can't make a good decision about how to balance the load. > > > Yes, that's what I think too, so the elevator should go above dm, with a > vestigial elevator between dm and sd simply for SCSI to use in queuing. I do not think there is anyone that will disagree that there should be one elevator that does the sorting, merging, aging etc. and that it should be at a different layer than it is today for DM multipath. The modifications to DM to allow a multipath device to have a queue and to use it are minimal. The problem is its request_fn_proc. In the DM request_fn_proc we could just dequeue a request from the DM device's queue, clone it so the IO sched state does not get overwritten, map it to the correct lower level device queue, then insert it in the lower level device queue. Bad design, right? We could do something similar where the request_fn_proc receives a request instead of a queue like Jens is planning for 2.7 (http://www.kernel.org/pub/linux/kernel/people/axboe/experimental/), and then build a framework similar to bio remapping where we just clone and remap requests to the correct driver. But still a bad design, right? It doesn't seem like there are a lot of options for 2.6 with the current framework. We can add hints so we can try to make a decent decision, but it isn't going to be perfect. But as James said the elevator problem is not a requirement for dm to actually work. 
I guess people will continue rolling their own solution if they feel the performance is that bad. I cc'd Jens as he is best qualified to describe why there are problems with queueing requests between different queues. I apologize for bugging him about this again. Mike ^ permalink raw reply [flat|nested] 19+ messages in thread
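[Editor's note: the dequeue/clone/remap/reinsert flow Mike describes can be modeled in miniature as below. Every type and function here is a toy stand-in invented for illustration; real struct request handling in the kernel is far more involved, and the real scheme would complete the original request when its clone finishes rather than discarding it.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy request and queue, standing in for struct request and
 * request_queue.  LIFO insertion keeps the model minimal. */
struct toy_request {
    long sector;               /* where the I/O is aimed */
    int  target;               /* index of lower device after mapping */
    struct toy_request *next;
};

struct toy_queue {
    struct toy_request *head;
};

static void toy_insert(struct toy_queue *q, struct toy_request *rq)
{
    rq->next = q->head;
    q->head = rq;
}

static struct toy_request *toy_dequeue(struct toy_queue *q)
{
    struct toy_request *rq = q->head;
    if (rq)
        q->head = rq->next;
    return rq;
}

/* The scheme Mike calls "bad design": the DM request_fn_proc drains
 * the DM device's queue, clones each request so the I/O scheduler
 * state on the original is not overwritten, maps it to a lower-level
 * device, and inserts the clone on that device's queue. */
void dm_request_fn(struct toy_queue *dm_q, struct toy_queue *lower_qs,
                   int (*map)(long sector))
{
    struct toy_request *rq, *clone;

    while ((rq = toy_dequeue(dm_q)) != NULL) {
        clone = malloc(sizeof(*clone));
        memcpy(clone, rq, sizeof(*clone));   /* preserve scheduler state */
        clone->target = map(clone->sector);  /* pick the lower device */
        toy_insert(&lower_qs[clone->target], clone);
        free(rq);   /* toy only: real code completes the original later */
    }
}
```

Even in miniature the structural objection is visible: the request passes through two queues, so any sorting or merging the elevator did on the DM queue is done before the path decision, while a second (vestigial) elevator still sits on each lower queue.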
end of thread, other threads:[~2004-02-12 16:05 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2004-01-26 14:18 Is there a grand plan for FC failover? Simon Kelley 2004-01-26 15:37 ` James Bottomley 2004-01-28 15:02 ` Philip R. Auld 2004-01-28 16:57 ` James Bottomley 2004-01-28 18:00 ` Philip R. Auld 2004-01-28 20:47 ` Patrick Mansfield 2004-01-28 22:14 ` James Bottomley 2004-01-29 0:55 ` Patrick Mansfield 2004-01-30 19:48 ` [dm-devel] " Joe Thornber 2004-01-31 9:30 ` Jens Axboe 2004-01-31 16:59 ` Philip R. Auld 2004-01-31 17:42 ` Jens Axboe 2004-02-12 15:17 ` Philip R. Auld 2004-02-12 15:28 ` Arjan van de Ven 2004-02-12 16:03 ` Philip R. Auld 2004-01-28 22:37 ` Mike Christie 2004-01-29 15:24 ` Philip R. Auld 2004-01-29 16:00 ` James Bottomley 2004-01-29 23:25 ` Mike Christie