* Designing a new prio_callout @ 2007-06-21 21:42 Ethan John 2007-07-25 11:37 ` Hannes Reinecke 0 siblings, 1 reply; 14+ messages in thread
From: Ethan John @ 2007-06-21 21:42 UTC (permalink / raw)
To: dm-devel

I hope I found the right list for this.

My company is developing an iSCSI solution, and in looking into Linux compatibility and performance, we're concerned that our architecture doesn't play well with the existing multipath configurations that are available.

We cannot support what Linux calls multibus, or what the fiber world seems to call active/active. We will need folks to configure failover exclusively.

It appears that the current multipathing failover configuration assigns a priority to each path (by default, just assigns the same priority to all paths). This is bad for us, as we're presenting iSCSI targets across multiple machines in a cluster; ideally, a user will have multiple devices, each with a separate path to a separate machine, with failover to the other machines in the cluster. The default setup of multipath will map all connections to a single machine, which is no load balancing at all.

I've fooled around with various other values for default_prio_callout (besides the /bin/true), and the one that seems to work best is actually mpath_prio_random.

In fact, mpath_prio_random would actually work perfectly, except that it seems to swap path priorities extremely often -- several times a minute. So my company needs to develop a new script, probably much like the mpath_prio_emc or mpath_prio_netapp ones, so that we can hint at load balancing across devices with failover as the multipathing policy.

I've been completely unable to find documentation on this. Where might I look? Is this even the right direction in which to be looking for a solution to this problem?

Let me know if anything above needs to be more clear.
--
Ethan John
* Re: Designing a new prio_callout 2007-06-21 21:42 Designing a new prio_callout Ethan John @ 2007-07-25 11:37 ` Hannes Reinecke 2007-07-29 16:03 ` Ethan John 0 siblings, 1 reply; 14+ messages in thread From: Hannes Reinecke @ 2007-07-25 11:37 UTC (permalink / raw) To: device-mapper development Ethan John wrote: > I hope I found the right list for this. > > My company is developing an iSCSI solution, and in looking into Linux > compatability and performance, we're concerned that our architecture > doesn't > play well with the existing multipath configurations that are available. > > We cannot support what Linux calls multibus, or what the fiber world seems > to call active/active. We will need folks to configure failover > exclusively. > > It appears that the current multipathing failover configuration assigns a > priority to each path (by default, just assigns the same priority to all > paths). This is bad for us, as we're presenting iSCSI targets across > multiple machines in a cluster; ideally, a user will have multiple devices, > each with a separate path to a separate machine, with failover to the other > machines in the cluster. The default setup of multipath will map all > connections to a single machine, which is no load balancing at all. > > I've fooled around with various other values for default_prio_callot > (besides the /bin/true), and the one that seems to work best is actually > mpath_prio_random. > Argl. > In fact, mpath_prio_random would actually work perfectly, except that it > seems to swap path priorities extremely often -- several times a minute. So > my company needs to develop a new script, probably much like the > mpath_prio_emc or mpath_prio_netapp ones, so that we can hint at load > balancing across devices with failover as the multipathing policy. > > I've been completely unable to find documentation on this. Where might I > look? Is this even the right direction in which to be looking for a > solution to this problem? 
>
Have you looked at ALUA?
It's in SPC-3 section 5.8: 'Target port group access states'.

This is the preferred way of handling these things, and it is supported by multipath-tools.

Mind you, only implicit ALUA is supported. Explicit ALUA support will be implemented, too, once someone actually uses it :-)

Please do not design your own way of handling failover. This has shown too many difficulties in the past.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
* Re: Designing a new prio_callout 2007-07-25 11:37 ` Hannes Reinecke @ 2007-07-29 16:03 ` Ethan John 2007-07-30 6:31 ` Hannes Reinecke 0 siblings, 1 reply; 14+ messages in thread From: Ethan John @ 2007-07-29 16:03 UTC (permalink / raw) To: device-mapper development Thanks for the heads-up about ALUA. We're looking into it. What is the purpose of the custom mpio_prio_* applications that ship with open-iscsi if not to handle multipathing? On 7/25/07, Hannes Reinecke <hare@suse.de> wrote: > Ethan John wrote: > > I hope I found the right list for this. > > > > My company is developing an iSCSI solution, and in looking into Linux > > compatability and performance, we're concerned that our architecture > > doesn't > > play well with the existing multipath configurations that are available. > > > > We cannot support what Linux calls multibus, or what the fiber world seems > > to call active/active. We will need folks to configure failover > > exclusively. > > > > It appears that the current multipathing failover configuration assigns a > > priority to each path (by default, just assigns the same priority to all > > paths). This is bad for us, as we're presenting iSCSI targets across > > multiple machines in a cluster; ideally, a user will have multiple devices, > > each with a separate path to a separate machine, with failover to the other > > machines in the cluster. The default setup of multipath will map all > > connections to a single machine, which is no load balancing at all. > > > > I've fooled around with various other values for default_prio_callot > > (besides the /bin/true), and the one that seems to work best is actually > > mpath_prio_random. > > > Argl. > > > In fact, mpath_prio_random would actually work perfectly, except that it > > seems to swap path priorities extremely often -- several times a minute. 
So > > my company needs to develop a new script, probably much like the > > mpath_prio_emc or mpath_prio_netapp ones, so that we can hint at load > > balancing across devices with failover as the multipathing policy. > > > > I've been completely unable to find documentation on this. Where might I > > look? Is this even the right direction in which to be looking for a > > solution to this problem? > > > Have you looked a ALUA ? > It's in SPC-3 section 5.8: 'Target port group access states'. > > This is the preferred way of handling these things. > And is supported by multipath-tools. > > Mind you, only implicit ALUA is supported. Explicit ALUA > support will be implemented, too, once someone actually > uses it :-) > > Please do not design your own way of handling failover. This > has shown too many difficulties in the past. > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke zSeries & Storage > hare@suse.de +49 911 74053 688 > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg > GF: Markus Rex, HRB 16746 (AG Nürnberg) > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel > -- Ethan John http://www.flickr.com/photos/thaen/ (206) 841.4157 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Designing a new prio_callout 2007-07-29 16:03 ` Ethan John @ 2007-07-30 6:31 ` Hannes Reinecke 2007-08-09 17:55 ` Ethan John 0 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2007-07-30 6:31 UTC (permalink / raw)
To: device-mapper development

Ethan John wrote:
> Thanks for the heads-up about ALUA. We're looking into it.
>
> What is the purpose of the custom mpio_prio_* applications that ship
> with open-iscsi if not to handle multipathing?
>
It is. mpath_prio_* are the priority callouts for multipathing.
They determine the layout of the multipath map.

The theory is that mpath_prio_* will return the priority for a given path in relation to the entire multipath layout.

If used with 'group_by_prio', all paths with the same priority will be grouped into one multipath group, and the group with the highest priority will become active.

When all paths in a group fail, the group with the next highest priority will become active. Additionally, some failover command (as determined by the hardware handler) may be sent to the target.

Cheers,

Hannes
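The grouping rule described in this message can be modelled in a few lines. What follows is only an illustrative sketch in Python, not multipath-tools code: the function names, device names and priority values are invented for the example, and it assumes (as the message states) that the group with the highest priority value is the one that becomes active.

```python
from collections import defaultdict


def group_by_prio(path_prios):
    """Group paths that share a priority value.

    path_prios maps a path name to the number its priority callout printed.
    Returns a list of (priority, [paths]) tuples, highest priority first.
    """
    groups = defaultdict(list)
    for path, prio in path_prios.items():
        groups[prio].append(path)
    return [(prio, sorted(groups[prio])) for prio in sorted(groups, reverse=True)]


def active_group(groups, failed):
    """Return the highest-priority group that still has a usable path.

    groups is the output of group_by_prio(); failed is a set of path names
    that are currently down. Returns None when every path has failed.
    """
    for prio, paths in groups:
        usable = [p for p in paths if p not in failed]
        if usable:
            return prio, usable
    return None


# Hypothetical two-path device: sdc is preferred, sde is the standby.
groups = group_by_prio({"sdc": 2, "sde": 1})
```

With these made-up values, `active_group(groups, set())` selects the group containing `sdc`, and `active_group(groups, {"sdc"})` falls through to the group containing `sde`. Giving a second device the opposite priorities would spread two devices over two portals while keeping failover semantics.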
* Re: Designing a new prio_callout 2007-07-30 6:31 ` Hannes Reinecke @ 2007-08-09 17:55 ` Ethan John 2007-08-10 15:07 ` Hannes Reinecke 0 siblings, 1 reply; 14+ messages in thread From: Ethan John @ 2007-08-09 17:55 UTC (permalink / raw) To: device-mapper development [-- Attachment #1.1: Type: text/plain, Size: 2636 bytes --] Is it possible to manually set the priority of a path after a connection has been established? Since we're doing failover-only (only 1 active path at a time), it would be nice to tell users that they can manually reset priority after a failure. For example, in a configuration with two paths, where one is active and the other is passive for two differeent volumes, a failure of one path will result in all traffic going through the one remaining path. After the second path comes back up, all traffic will still be written to the first path (paths are not rebalanced after a failure). At this point, we're looking for a decent solution for customers that doesn't involve ALUA, since we won't have resources to implement that for this first version of our target. Ideally, we'd like to be able to set the priority for paths automatically through one of the mpath_prio_* scripts. Even allowing a user to set these priorities manually would be better than advising them to use mpath_prio_random as the "easy configuration" solution. We're not looking to develop our own method of load balancing or failover. We want to work within the MPIO world, but it's a little difficult given that we don't support active/active configurations. Thanks so much for you help so far! On 7/29/07, Hannes Reinecke <hare@suse.de> wrote: > > Ethan John wrote: > > Thanks for the heads-up about ALUA. We're looking into it. > > > > What is the purpose of the custom mpio_prio_* applications that ship > > with open-iscsi if not to handle multipathing? > > > It is. mpath_prio_* are the priority callouts for multipathing. > They determine the layout of the multipath map. 
> Theory is that mpath_prio_* will return the priority for a > given path in relation to the entire multipath layout. > > If used with 'group_by_prio' all paths with the same > priority will be grouped into one multipath group, and > the group with the highest priority will become active. > > When all paths in a group fail, the group with the next > highest priority will become active. Additionally some > failover command (as determined by the hardware handler) > may be send to the target. > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke zSeries & Storage > hare@suse.de +49 911 74053 688 > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg > GF: Markus Rex, HRB 16746 (AG Nürnberg) > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel > -- Ethan John http://www.flickr.com/photos/thaen/ (206) 841.4157 [-- Attachment #1.2: Type: text/html, Size: 3422 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Designing a new prio_callout 2007-08-09 17:55 ` Ethan John @ 2007-08-10 15:07 ` Hannes Reinecke 2007-08-10 15:40 ` Ethan John 0 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2007-08-10 15:07 UTC (permalink / raw)
To: device-mapper development

Ethan John wrote:
> Is it possible to manually set the priority of a path after a connection has
> been established?
>
> Since we're doing failover-only (only 1 active path at a time), it would be
> nice to tell users that they can manually reset priority after a failure.
> For example, in a configuration with two paths, where one is active and the
> other is passive for two different volumes, a failure of one path will
> result in all traffic going through the one remaining path. After the second
> path comes back up, all traffic will still be written to the first path
> (paths are not rebalanced after a failure).
>
Not necessarily. There is the keyword 'failback', which can be set to 'immediate', causing all paths to fail back to the original path once it comes back.

And as you don't actually need to send any commands to facilitate the failover, I doubt you'd need to develop your own hardware handler.

The existing tweaks should be enough, I think.

Cheers,

Hannes
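To make the difference concrete, here is a toy model of the two behaviours discussed in this exchange: staying on the surviving path after a recovery versus returning to the preferred path at once. It is a sketch under stated assumptions, not how multipathd is implemented. The function name `simulate`, the event tuples and the priority values are invented, each path is assumed to sit in its own path group, and a higher priority value is assumed to be preferred.

```python
def simulate(prios, events, failback="manual"):
    """Return the path carrying I/O after each event.

    prios    -- path name -> priority (higher is preferred), one path per group
    events   -- sequence of ("fail", path) or ("restore", path) tuples
    failback -- "manual": stay on the current path until it fails
                "immediate": always return to the best usable path
    """
    failed = set()
    current = max(prios, key=prios.get)  # start on the preferred path
    history = []
    for kind, path in events:
        if kind == "fail":
            failed.add(path)
        else:
            failed.discard(path)
        usable = [p for p in prios if p not in failed]
        if not usable:
            current = None  # every path is down
        elif current is None or current in failed or failback == "immediate":
            current = max(usable, key=prios.get)
        history.append(current)
    return history


# Hypothetical device: sdc preferred (2), sde standby (1).
PRIOS = {"sdc": 2, "sde": 1}
EVENTS = [("fail", "sdc"), ("restore", "sdc")]
```

In this model, `simulate(PRIOS, EVENTS, "manual")` ends on `sde` even though `sdc` has recovered, which matches the "paths are not rebalanced" observation, while `simulate(PRIOS, EVENTS, "immediate")` ends back on `sdc`.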
* Re: Designing a new prio_callout 2007-08-10 15:07 ` Hannes Reinecke @ 2007-08-10 15:40 ` Ethan John 2007-08-14 17:05 ` Ethan John 0 siblings, 1 reply; 14+ messages in thread
From: Ethan John @ 2007-08-10 15:40 UTC (permalink / raw)
To: device-mapper development

Hannes, thanks again for your help with this.

I haven't noticed that failback does the right thing, but I'll try it out again. Could be something we're doing wrong. In any case, there's very little documentation on all this, and I'm trying to develop some kind of strategy for our Linux customers to use until we get ALUA implemented.

Being able to set path priorities manually would be ideal, but it seems like this is impossible, right?

Here's the situation we have right now. I initiate two connections to one target, across two sessions with two different IPs, with two LUs. Multipath looks like this:

mpath45 (20002c9020020001a00151b6b46bb57b0) dm-1 company,iSCSI target
[size=15G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 22:0:0:1 sdc 8:32 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 23:0:0:1 sde 8:64 [active][ready]
mpath44 (20002c9020020001200151b6b46bb57ae) dm-0 company,iSCSI target
[size=15G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
 \_ 22:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 23:0:0:0 sdd 8:48 [active][ready]

Note that there are only two active sessions:

# iscsiadm -m session
tcp: [20] 10.53.152.22:3260,1 iqn.2001-07.com.company:qaiscsi2:blah1
tcp: [21] 10.53.152.23:3260,2 iqn.2001-07.com.company:qaiscsi2:blah1

So the result is that all activity is routed to the first session that was initiated. I want to change the priorities of the paths to allow for traffic to go to the first IP for mpath45 and the second IP for mpath44.

Obviously ALUA is the way to go for this in the future, but we won't have the resources to implement that, so I'm looking for an interim solution that will scale to thousands of clients. Right now, the only thing I can tell people is to manually initiate connections to certain targets through certain IP addresses -- basically, doing the load balancing themselves. Is there a better way? On 8/10/07, Hannes Reinecke <hare@suse.de> wrote: > > Ethan John wrote: > > Is it possible to manually set the priority of a path after a connection > has > > been established? > > > > Since we're doing failover-only (only 1 active path at a time), it would > be > > nice to tell users that they can manually reset priority after a > failure. > > For example, in a configuration with two paths, where one is active and > the > > other is passive for two differeent volumes, a failure of one path will > > result in all traffic going through the one remaining path. After the > second > > path comes back up, all traffic will still be written to the first path > > (paths are not rebalanced after a failure). > > > Not necessarily. There is the keyword 'failback', which can be set to > IMMEDIATE, causing all paths to fail back to the original path once it > comes back. > > And as you don't actually need to send any commands for facilitate > the failover I doubt you'd need to develop your own hardware handler. > > The existing tweaks should be enough, I think. > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke zSeries & Storage > hare@suse.de +49 911 74053 688 > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg > GF: Markus Rex, HRB 16746 (AG Nürnberg) > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel > -- Ethan John http://www.flickr.com/photos/thaen/ (206) 841.4157 [-- Attachment #1.2: Type: text/html, Size: 4467 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Designing a new prio_callout 2007-08-10 15:40 ` Ethan John @ 2007-08-14 17:05 ` Ethan John 2007-08-15 8:45 ` Stefan Bader 2007-08-27 15:09 ` Hannes Reinecke 0 siblings, 2 replies; 14+ messages in thread From: Ethan John @ 2007-08-14 17:05 UTC (permalink / raw) To: device-mapper development [-- Attachment #1.1: Type: text/plain, Size: 4224 bytes --] For the record, setting rr_min_io to something extremely large (we're using 2 billion now, since I'm assuming it's a C integer) solves the immediate problem that we're having (overhead in path switching causing poor performance). Telling people to use mpath_prio_random is still less than ideal for any small number of iSCSI targets, but it a better short-term solution for us than nothing. On 8/10/07, Ethan John <ethan.john@gmail.com> wrote: > > Hannes, thanks again for your help with this. > > I haven't noticed that failback does the right thing, but I'll try it out > again. Could be something we're doing wrong. In any case, there's very > little documentation on all this, and I'm trying to develop some kind of > strategy for our Linux customers to use until we get ALUA implemented. > > Being able to set path priorities manually would be ideal, but it seems > like this is impossible, right? > > Here's the situation we have right now. I initiate two connections to one > target, across two sessions with two different IPs, with two LUs. 
Multipath > looks like this: > mpath45 (20002c9020020001a00151b6b46bb57b0) dm-1 company,iSCSI target > [size=15G][features=0][hwhandler=0] > \_ round-robin 0 [prio=1][active] > \_ 22:0:0:1 sdc 8:32 [active][ready] > \_ round-robin 0 [prio=1][enabled] > \_ 23:0:0:1 sde 8:64 [active][ready] > mpath44 (20002c9020020001200151b6b46bb57ae) dm-0 company,iSCSI target > [size=15G][features=0][hwhandler=0] > \_ round-robin 0 [prio=1][enabled] > \_ 22:0:0:0 sdb 8:16 [active][ready] > \_ round-robin 0 [prio=1][enabled] > \_ 23:0:0:0 sdd 8:48 [active][ready] > > Note that there are only two active sessions: > # iscsiadm -m session > tcp: [20] 10.53.152.22:3260 ,1 iqn.2001-07.com.company:qaiscsi2:blah1 > tcp: [21] 10.53.152.23:3260,2 iqn.2001-07.com.company:qaiscsi2:blah1 > > So the result is that all activity is routed to the first session that was > initiated. I want to change the priorities of the paths to allow for traffic > to go to the first IP for mpath45 and the second IP for mpath46. > > Obviously ALUA is the way to go for this in the future, but we won't have > the resources to implement that, so I'm looking for an interim solution that > will scale to thousands of clients. Right now, the only thing I can tell > people is to manually initiate connections to certain targets through > certain IP addresses -- basically, doing the load balancing themselves. Is > there a better way? > > On 8/10/07, Hannes Reinecke <hare@suse.de> wrote: > > > > Ethan John wrote: > > > Is it possible to manually set the priority of a path after a > > connection has > > > been established? > > > > > > Since we're doing failover-only (only 1 active path at a time), it > > would be > > > nice to tell users that they can manually reset priority after a > > failure. > > > For example, in a configuration with two paths, where one is active > > and the > > > other is passive for two differeent volumes, a failure of one path > > will > > > result in all traffic going through the one remaining path. 
After the > > second > > > path comes back up, all traffic will still be written to the first > > path > > > (paths are not rebalanced after a failure). > > > > > Not necessarily. There is the keyword 'failback', which can be set to > > IMMEDIATE, causing all paths to fail back to the original path once it > > comes back. > > > > And as you don't actually need to send any commands for facilitate > > the failover I doubt you'd need to develop your own hardware handler. > > > > The existing tweaks should be enough, I think. > > > > Cheers, > > > > Hannes > > -- > > Dr. Hannes Reinecke zSeries & Storage > > hare@suse.de +49 911 74053 688 > > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg > > GF: Markus Rex, HRB 16746 (AG Nürnberg) > > > > -- > > dm-devel mailing list > > dm-devel@redhat.com > > https://www.redhat.com/mailman/listinfo/dm-devel > > > > > > -- > Ethan John > http://www.flickr.com/photos/thaen/ > (206) 841.4157 > -- Ethan John http://www.flickr.com/photos/thaen/ (206) 841.4157 [-- Attachment #1.2: Type: text/html, Size: 5984 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Designing a new prio_callout 2007-08-14 17:05 ` Ethan John @ 2007-08-15 8:45 ` Stefan Bader 2007-08-15 15:57 ` Ethan John 2007-08-27 15:09 ` Hannes Reinecke 1 sibling, 1 reply; 14+ messages in thread
From: Stefan Bader @ 2007-08-15 8:45 UTC (permalink / raw)
To: device-mapper development

Hi Ethan,

I might not understand the problem completely but I do not understand the benefit of changing rr_min_io. As far as I can see from your multipath output, both of the devices consist of two path groups with one path. This means, as long as there is no path failure I/O will never be sent to the inactive group.

I guess the only thing you need is a script that might find out from a given scsi device (like sdc) whether this would be the preferred path and then print a number that represents the priority (the lower, the higher). Then use this as priority callout and group by priority with failback set to immediate.

Regards,
Stefan

2007/8/14, Ethan John <ethan.john@gmail.com>:
> For the record, setting rr_min_io to something extremely large (we're using
> 2 billion now, since I'm assuming it's a C integer) solves the immediate
> problem that we're having (overhead in path switching causing poor
> > mpath45 (20002c9020020001a00151b6b46bb57b0) dm-1 company,iSCSI target
> > [size=15G][features=0][hwhandler=0]
> > \_ round-robin 0 [prio=1][active]
> >  \_ 22:0:0:1 sdc 8:32 [active][ready]
> > \_ round-robin 0 [prio=1][enabled]
> >  \_ 23:0:0:1 sde 8:64 [active][ready]
> > mpath44 (20002c9020020001200151b6b46bb57ae) dm-0 company,iSCSI target
> > [size=15G][features=0][hwhandler=0]
> > \_ round-robin 0 [prio=1][enabled]
> >  \_ 22:0:0:0 sdb 8:16 [active][ready]
> > \_ round-robin 0 [prio=1][enabled]
> >  \_ 23:0:0:0 sdd 8:48 [active][ready]
* Re: Designing a new prio_callout 2007-08-15 8:45 ` Stefan Bader @ 2007-08-15 15:57 ` Ethan John 2007-08-16 10:58 ` Stefan Bader 0 siblings, 1 reply; 14+ messages in thread
From: Ethan John @ 2007-08-15 15:57 UTC (permalink / raw)
To: device-mapper development

Definitely, that would be ideal -- having code on our end that tracked who was writing to which path. It's a matter of development effort at this point, and so I'm exploring other options.

You can imagine our system as a number of different machines -- each with separate network addresses -- that all provide access to the same LUs. Let's say you have a single target on our system called "mytarget." Users could log into that target via any one of a number of network addresses (even via DNS name, I suppose). So the response from SendTargets is along the lines of:

10.53.152.22:3260,1 iqn.2001-07.com.company:qaiscsi2:mytarget
10.53.152.23:3260,2 iqn.2001-07.com.company:qaiscsi2:mytarget

So they might initiate two logins to two separate IPs:

iscsiadm -m node --portal 10.53.152.22 --target iqn.2001-07.com.company:qaiscsi2:mytarget --login;
iscsiadm -m node --portal 10.53.152.23 --target iqn.2001-07.com.company:qaiscsi2:mytarget --login;

Now what happens? If mytarget has multiple LUs associated with it, the multipath output will look like it did below if failover is being used -- two paths for each of two devices. The problem for us is that by default, multipath just uses the first path that it sees. Which means that for every device in mytarget, all data will be read and written across just the first path -- 10.53.152.22, in this case.

We need a way to load balance connections across all available connections.

There are several ways that I can see to do this. Ideally, we would implement ALUA on our end and advise people to use mpath_prio_alua as their callout. But this has a development cost.
We could also implement a custom system as your suggest, but this also has a development cost. If we could advise users to manually set priorities on the client side, that would be acceptable, but this is impossible with the current version of multipath. As such, the best we can do is to set path priorities randomly, using mpath_prio_random. This is fine, but there is a significant cost in terms of resource usage on our system when the active path changes frequently, especially in cases where users have thousands of clients connected to our system, and paths are switching constantly. Thus we need to limit the number of times the active path switches, which the rr_min_io settings seems to do quite nicely. Not sure if that makes any more sense? I'm trying to be thorough for the sake of the next guy. The information on the web is pretty minimal about all this, and it's been a painful experience getting up to speed. On a related note, I've read the reports of people experiencing higher levels of performance with lower settings of rr_min_io, but it seems to me that as rr_min_io gets smaller, the system becomes less like active/passive MPIO and more like active/active MPIO, so users experiencing this performance improvement would be better off using group_by_serial, so that all paths are excitable simultaneously. On 8/15/07, Stefan Bader <Stefan.Bader@de.ibm.com> wrote: > > Hi Ethan, > > I might not understand the problem completely but I do not understand > the benefit of changing rr_min_io. As far as I can see from your > multipath output, both of the devices consist of two path groups with > one path. This means, as long as there is no path failure I/O will > never be sent to the inactive group. > I guess the only thing you need is a script that might find out from a > given scsi device (like sdc) whether this would be the preferred path > and then print a number that represents the priority (the lower, the > higher). 
Then use this as priority callout and group by priority with > failback set to immediate. > > Regards, > Stefan > > 2007/8/14, Ethan John <ethan.john@gmail.com>: > > For the record, setting rr_min_io to something extremely large (we're > using > > 2 billion now, since I'm assuming it's a C integer) solves the immediate > > problem that we're having (overhead in path switching causing poor > > > > mpath45 (20002c9020020001a00151b6b46bb57b0) dm-1 > > company,iSCSI target > > > [size=15G][features=0][hwhandler=0] > > > \_ round-robin 0 [prio=1][active] > > > \_ 22:0:0:1 sdc 8:32 [active][ready] > > > \_ round-robin 0 [prio=1][enabled] > > > \_ 23:0:0:1 sde 8:64 [active][ready] > > > mpath44 (20002c9020020001200151b6b46bb57ae) dm-0 > > company,iSCSI target > > > [size=15G][features=0][hwhandler=0] > > > \_ round-robin 0 [prio=1][enabled] > > > \_ 22:0:0:0 sdb 8:16 [active][ready] > > > \_ round-robin 0 [prio=1][enabled] > > > \_ 23:0:0:0 sdd 8:48 [active][ready] > > > > > -- > dm-devel mailing list > dm-devel@redhat.com > https://www.redhat.com/mailman/listinfo/dm-devel > -- Ethan John http://www.flickr.com/photos/thaen/ (206) 841.4157 [-- Attachment #1.2: Type: text/html, Size: 5782 bytes --] [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Designing a new prio_callout 2007-08-15 15:57 ` Ethan John @ 2007-08-16 10:58 ` Stefan Bader 2007-08-16 17:30 ` Ethan John 0 siblings, 1 reply; 14+ messages in thread
From: Stefan Bader @ 2007-08-16 10:58 UTC (permalink / raw)
To: device-mapper development

> Now what happens? If mytarget has multiple LUs associated with it, the
> multipath output will look like it did below if failover is being used --
> two paths for each of two devices. The problem for us is that by default,
> multipath just uses the first path that it sees. Which means that for every
> device in mytarget, all data will be read and written across just the first
> path -- 10.53.152.22, in this case.
>
> We need a way to load balance connections across all available connections.
>
> There are several ways that I can see to do this. Ideally, we would
> implement ALUA on our end and advise people to use mpath_prio_alua as their
> callout. But this has a development cost. We could also implement a custom
> system as you suggest, but this also has a development cost.
>
> If we could advise users to manually set priorities on the client side, that
> would be acceptable, but this is impossible with the current version of
> multipath.
>

Can you find the IP address and UID of a device with the node name? For example, you get /dev/sdc and then look for the UID (can be retrieved with scsi_id) and the IP address of the connection (not sure this is possible). Then manually create a file containing mappings:

<uid>:<ip>:<priority>
...

Create a script that is used as the callout, which takes a node name, looks into the file and prints out the priority. This way the priority of a path does not change like it does with random priorities. The other path will only be used on failure and switched back as soon as the other one is back again (with failback immediate).
> On a related note, I've read the reports of people experiencing higher
> levels of performance with lower settings of rr_min_io, but it seems to me
> that as rr_min_io gets smaller, the system becomes less like active/passive
> MPIO and more like active/active MPIO, so users experiencing this
> performance improvement would be better off using group_by_serial, so that
> all paths are excitable simultaneously.
>

The setting of rr_min_io only matters if you have more than one path per path group. Otherwise you can only use one path at a time and there is no round-robin. If you have more than one path in a group, then lower values help, since paths are more likely to be used concurrently. The default of 1000 is too high. Kiyoshi Ueda and Jun'ichi Nomura have done some measurements while looking for a way to improve performance more generally (https://ols2006.108.redhat.com/2007/Reprints/ueda-Reprint.pdf).

But again, rr_min_io is only relevant to load-balance paths within the same path group (multibus or, as you mentioned, group_by_serial). The reason for your path changes (except for real failures) might rather be that mpath_prio_random results in different priorities whenever any priority value is checked again.

Stefan
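The lookup half of the callout suggested in this message could look roughly like the sketch below. It assumes the `<uid>:<ip>:<priority>` file format proposed above. Everything else is hypothetical: the names `parse_mapping`, `lookup_priority` and `DEFAULT_PRIO`, the `#` comment convention, and the priority values in the sample data (the UIDs and portal addresses are the ones shown earlier in the thread). It also assumes higher numbers are preferred, as Hannes described for group_by_prio. Resolving a device node to its UID and portal IP is deliberately left out, since the thread does not settle whether the IP can be obtained.

```python
DEFAULT_PRIO = 1  # assumed fallback for paths that are not listed in the file


def parse_mapping(text):
    """Parse '<uid>:<ip>:<priority>' lines into {(uid, ip): priority}.

    Blank lines and lines starting with '#' are skipped (an assumption of
    this sketch). A line with fewer than two colons raises ValueError.
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        uid, rest = line.split(":", 1)   # the UID contains no colon
        ip, prio = rest.rsplit(":", 1)   # the priority is the last field
        mapping[(uid, ip)] = int(prio)
    return mapping


def lookup_priority(mapping, uid, ip, default=DEFAULT_PRIO):
    """Return the configured priority for this (uid, ip) pair, or the default."""
    return mapping.get((uid, ip), default)


# Sample file contents: each device prefers a different portal.
EXAMPLE = """\
# uid:ip:priority (priority values invented for illustration)
20002c9020020001a00151b6b46bb57b0:10.53.152.22:2
20002c9020020001a00151b6b46bb57b0:10.53.152.23:1
20002c9020020001200151b6b46bb57ae:10.53.152.22:1
20002c9020020001200151b6b46bb57ae:10.53.152.23:2
"""

MAPPING = parse_mapping(EXAMPLE)
```

A real callout would print the result of `lookup_priority` on standard output for the path it was asked about. Because the value comes from a static file, it stays the same every time the priority is re-checked, which is the property that the random callout lacks.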
* Re: Designing a new prio_callout
  2007-08-16 10:58 ` Stefan Bader
@ 2007-08-16 17:30 ` Ethan John
  0 siblings, 0 replies; 14+ messages in thread
From: Ethan John @ 2007-08-16 17:30 UTC (permalink / raw)
  To: device-mapper development

I don't think it's possible to get the IP of the iSCSI session from within
multipath. If anyone knows a way, I could easily write a dumb version of a
callout like the one you describe.

I'm not convinced, though, that I could do much better than
mpath_prio_random with rr_min_io > 2 billion without some extensive work.
You'd have to re-check path priorities for each call to the script, which
would involve walking through all existing paths and priorities and
deciding on path priorities in some intelligent way. It would get hairy
pretty quickly.

Thanks for the tidbit on how round robin works. Great to know!

--
Ethan John
http://www.flickr.com/photos/thaen/
(206) 841.4157
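The effect of rr_min_io discussed above can be illustrated with a toy model. This is not the kernel's round-robin path selector, only a simplified sketch of the behaviour described in the thread: a path receives rr_min_io I/Os before the next path in the same group is chosen. The function name and the device names are illustrative assumptions.

```python
from collections import Counter


def simulate_round_robin(paths, rr_min_io, total_ios):
    """Toy model: send rr_min_io I/Os down a path, then move to the next one.

    Returns (per-path I/O counts, number of path switches).
    """
    if not paths or rr_min_io < 1:
        raise ValueError("need at least one path and rr_min_io >= 1")
    counts = Counter()
    switches = 0
    current = 0          # index of the path currently in use
    sent_on_current = 0  # I/Os sent since the last switch
    for _ in range(total_ios):
        if sent_on_current == rr_min_io:
            following = (current + 1) % len(paths)
            if following != current:
                switches += 1  # a real switch only happens with 2+ paths
            current = following
            sent_on_current = 0
        counts[paths[current]] += 1
        sent_on_current += 1
    return dict(counts), switches


# Two paths in one group, rr_min_io at the default of 1000.
print(simulate_round_robin(["sdc", "sde"], 1000, 4000))
# A huge rr_min_io: every I/O stays on the first path.
print(simulate_round_robin(["sdc", "sde"], 2_000_000_000, 4000))
# One path per group (failover): rr_min_io makes no difference.
print(simulate_round_robin(["sdc"], 1, 4000))
```

In this model a very large rr_min_io and a single-path group behave the same way: no switching ever happens, which matches the observation that rr_min_io only matters when a path group holds more than one path.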
* Re: Designing a new prio_callout
  2007-08-14 17:05 ` Ethan John
  2007-08-15  8:45 ` Stefan Bader
@ 2007-08-27 15:09 ` Hannes Reinecke
  2007-08-27 15:50 ` Ethan John
  1 sibling, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2007-08-27 15:09 UTC (permalink / raw)
  To: device-mapper development

Ethan John wrote:
> For the record, setting rr_min_io to something extremely large (we're using
> 2 billion now, since I'm assuming it's a C integer) solves the immediate
> problem that we're having (overhead in path switching causing poor
> performance). Telling people to use mpath_prio_random is still less than
> ideal for any small number of iSCSI targets, but it is a better short-term
> solution for us than nothing.
>
By setting rr_min_io to something extremely large you effectively
disable the round-robin scheduler in multipathing.
That's okay for the failover scenario you have (as you only have
one path per group), but whenever you have more than one path
in a group, that wouldn't work anymore.

> On 8/10/07, Ethan John <ethan.john@gmail.com> wrote:
>> Hannes, thanks again for your help with this.
>>
>> I haven't noticed that failback does the right thing, but I'll try it out
>> again. Could be something we're doing wrong. In any case, there's very
>> little documentation on all this, and I'm trying to develop some kind of
>> strategy for our Linux customers to use until we get ALUA implemented.
>>
>> Being able to set path priorities manually would be ideal, but it seems
>> like this is impossible, right?
>>
>> Here's the situation we have right now. I initiate two connections to one
>> target, across two sessions with two different IPs, with two LUs. Multipath
>> looks like this:
>> mpath45 (20002c9020020001a00151b6b46bb57b0) dm-1 company,iSCSI target
>> [size=15G][features=0][hwhandler=0]
>> \_ round-robin 0 [prio=1][active]
>>  \_ 22:0:0:1 sdc 8:32 [active][ready]
>> \_ round-robin 0 [prio=1][enabled]
>>  \_ 23:0:0:1 sde 8:64 [active][ready]
>> mpath44 (20002c9020020001200151b6b46bb57ae) dm-0 company,iSCSI target
>> [size=15G][features=0][hwhandler=0]
>> \_ round-robin 0 [prio=1][enabled]
>>  \_ 22:0:0:0 sdb 8:16 [active][ready]
>> \_ round-robin 0 [prio=1][enabled]
>>  \_ 23:0:0:0 sdd 8:48 [active][ready]
>>
>> Note that there are only two active sessions:
>> # iscsiadm -m session
>> tcp: [20] 10.53.152.22:3260,1 iqn.2001-07.com.company:qaiscsi2:blah1
>> tcp: [21] 10.53.152.23:3260,2 iqn.2001-07.com.company:qaiscsi2:blah1
>>
>> So the result is that all activity is routed to the first session that was
>> initiated. I want to change the priorities of the paths to allow for traffic
>> to go to the first IP for mpath45 and the second IP for mpath44.
>>
That's a matter of the IP routing. Having both target ports on the same
(sub)net doesn't work very well with multipathing. Please set up your
system with each iSCSI target port in a different subnet, e.g.

10.53.152.22:3260,1 iqn.2001-07.com.company:qaiscsi2:blah1
10.53.153.22:3260,2 iqn.2001-07.com.company:qaiscsi2:blah1

Then you'll have one iSCSI target port per subnet and you can actually
do failover etc.

>> Obviously ALUA is the way to go for this in the future, but we won't have
>> the resources to implement that, so I'm looking for an interim solution that
>> will scale to thousands of clients. Right now, the only thing I can tell
>> people is to manually initiate connections to certain targets through
>> certain IP addresses -- basically, doing the load balancing themselves. Is
>> there a better way?
>>
No, not really. But I'm not a network guru. You may want to ask on
the open-iscsi mailing list.

And you can get all the information you need via sysfs, so it should
be possible to create a script like the one Stefan Bader suggested.

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
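The sysfs lookup Hannes mentions might look something like the sketch below. It rests on assumptions about the sysfs layout that should be checked on the target kernel: that the resolved `/sys/block/<dev>/device` path contains a `sessionN` component for iSCSI disks, and that the portal address is exposed as `persistent_address` under `/sys/class/iscsi_connection/connectionN:0` (connection 0 of the session). The function names are hypothetical.

```python
import os
import re

# Matches the "sessionN" directory in a resolved sysfs device path.
SESSION_RE = re.compile(r"/session(\d+)(?:/|$)")


def session_id_from_device_path(resolved_path):
    """Extract the iSCSI session number from a resolved sysfs path.

    Returns None when the path has no sessionN component (not an iSCSI disk).
    """
    match = SESSION_RE.search(resolved_path)
    return int(match.group(1)) if match else None


def portal_for_block_device(name, sysfs_root="/sys"):
    """Return the portal IP for a block device such as 'sdc', or None.

    Assumes a single connection per session, numbered 0.
    """
    device_link = os.path.join(sysfs_root, "block", name, "device")
    session_id = session_id_from_device_path(os.path.realpath(device_link))
    if session_id is None:
        return None
    attribute = os.path.join(
        sysfs_root, "class", "iscsi_connection",
        "connection%d:0" % session_id, "persistent_address",
    )
    try:
        with open(attribute) as handle:
            return handle.read().strip()
    except OSError:
        return None  # attribute missing on this kernel or device
```

Combined with the UID from scsi_id, the returned address would give a callout the `<uid>` and `<ip>` keys needed to look up a fixed priority in a mapping file.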
* Re: Designing a new prio_callout
  2007-08-27 15:09 ` Hannes Reinecke
@ 2007-08-27 15:50 ` Ethan John
  0 siblings, 0 replies; 14+ messages in thread
From: Ethan John @ 2007-08-27 15:50 UTC (permalink / raw)
  To: device-mapper development

Thanks again, Hannes. We really appreciate your time on this.

Stefan's suggestion will be a great option for round-robin failover for our
first release. We'll try to figure out a way to do that. It also sounds
like using the ALUA callout is going to be the best long-term solution,
which answers the original question that I posed.

As for putting failover paths on a different subnet, that will be up to the
user to manage. We won't prevent that sort of thing, and network
configurations are extremely flexible on our systems. Subnets also don't
necessarily affect physical network paths on our system.

Again, thanks so much for your help!

--
Ethan John
http://www.flickr.com/photos/thaen/
(206) 841.4157
end of thread, other threads:[~2007-08-27 15:50 UTC | newest]

Thread overview: 14+ messages
2007-06-21 21:42 Designing a new prio_callout Ethan John
2007-07-25 11:37 ` Hannes Reinecke
2007-07-29 16:03 ` Ethan John
2007-07-30  6:31 ` Hannes Reinecke
2007-08-09 17:55 ` Ethan John
2007-08-10 15:07 ` Hannes Reinecke
2007-08-10 15:40 ` Ethan John
2007-08-14 17:05 ` Ethan John
2007-08-15  8:45 ` Stefan Bader
2007-08-15 15:57 ` Ethan John
2007-08-16 10:58 ` Stefan Bader
2007-08-16 17:30 ` Ethan John
2007-08-27 15:09 ` Hannes Reinecke
2007-08-27 15:50 ` Ethan John