* makefs alignment issue
From: Stan Hoeppner @ 2014-10-24 20:11 UTC (permalink / raw)
To: xfs
Just remade a couple of filesystems and received an alignment msg I
don't recall receiving previously:
# mkfs.xfs -f -d su=64k,sw=12 /dev/s2d_a1l003
mkfs.xfs: Specified data stripe width 1536 is not the same as the volume
stripe width 2048
meta-data=/dev/s2d_a1l003        isize=256    agcount=44, agsize=268435440 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=11709285376, imaxpct=5
         =                       sunit=16     swidth=192 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
/dev/s2d_a1l003 is an alias to dm-0 which is a dm-multipath device to a
LUN on hardware RAID. The hardware geometry is 64k x 12, 768k. AFAIK no
geometry information has been specified for these device mapper devices
(someone else's responsibility). Though I assume dm geometry data is
the reason for mkfs.xfs throwing this alert. I don't find anything in
/sys/devices/virtual/block/dm-0/ indicating geometry.
Any ideas how to verify what's going on here and fix it?
Thanks,
Stan
* Re: makefs alignment issue
From: Eric Sandeen @ 2014-10-24 20:14 UTC (permalink / raw)
To: Stan Hoeppner, xfs
On 10/24/14 3:11 PM, Stan Hoeppner wrote:
> Just remade a couple of filesystems and received an alignment msg I
> don't recall receiving previously:
>
> # mkfs.xfs -f -d su=64k,sw=12 /dev/s2d_a1l003
> mkfs.xfs: Specified data stripe width 1536 is not the same as the volume
> stripe width 2048
So you've specified geometry that differs from what the underlying device advertises: 1536 sectors (768k) vs. the volume's 2048 sectors (1MB)...
> meta-data=/dev/s2d_a1l003        isize=256    agcount=44, agsize=268435440 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=11709285376, imaxpct=5
>          =                       sunit=16     swidth=192 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=16 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
>
> /dev/s2d_a1l003 is an alias to dm-0 which is a dm-multipath device to a
> LUN on hardware RAID. The hardware geometry is 64k x 12, 768k. AFAIK no
> geometry information has been specified for these device mapper devices
> (someone else's responsibility). Though I assume dm geometry data is
> the reason for mkfs.xfs throwing this alert. I don't find anything in
> /sys/devices/virtual/block/dm-0/ indicating geometry.
>
> Any ideas how to verify what's going on here and fix it?
# blockdev --getiomin --getioopt /dev/s2d_a1l003
The first number, minimum io size, is what is used for sunit
The 2nd number, optimal io size, is what is used for swidth
Where dm got the geometry, I'm not sure - you'd have to look into
how you set up the dm device, and what its defaults are I think.
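For reference, those numbers come straight from the kernel's block device
topology via ioctls; here's a minimal C sketch of the same queries, using
the BLKIOMIN/BLKIOOPT/BLKPBSZGET ioctls (which is what util-linux blockdev
issues for --getiomin/--getioopt/--getpbsz):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        unsigned int iomin = 0, ioopt = 0, pbsz = 0;
        int fd;

        if (argc < 2)
                return 1;
        fd = open(argv[1], O_RDONLY);   /* e.g. /dev/dm-0 */
        if (fd < 0)
                return 1;
        ioctl(fd, BLKIOMIN, &iomin);    /* minimum I/O size -> sunit  */
        ioctl(fd, BLKIOOPT, &ioopt);    /* optimal I/O size -> swidth */
        ioctl(fd, BLKPBSZGET, &pbsz);   /* physical sector size       */
        printf("iomin=%u ioopt=%u pbsz=%u\n", iomin, ioopt, pbsz);
        close(fd);
        return 0;
}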
-Eric
* Re: makefs alignment issue
From: Stan Hoeppner @ 2014-10-24 22:08 UTC (permalink / raw)
To: Eric Sandeen, xfs
On 10/24/2014 03:14 PM, Eric Sandeen wrote:
> On 10/24/14 3:11 PM, Stan Hoeppner wrote:
>> Just remade a couple of filesystems and received an alignment msg I
>> don't recall receiving previously:
>>
>> # mkfs.xfs -f -d su=64k,sw=12 /dev/s2d_a1l003
>> mkfs.xfs: Specified data stripe width 1536 is not the same as the volume
>> stripe width 2048
>
> So you've specified geometry that differs from what the underlying device advertises: 1536 sectors (768k) vs. the volume's 2048 sectors (1MB)...
>
>> meta-data=/dev/s2d_a1l003        isize=256    agcount=44, agsize=268435440 blks
>>          =                       sectsz=512   attr=2, projid32bit=0
>> data     =                       bsize=4096   blocks=11709285376, imaxpct=5
>>          =                       sunit=16     swidth=192 blks
>> naming   =version 2              bsize=4096   ascii-ci=0
>> log      =internal log           bsize=4096   blocks=521728, version=2
>>          =                       sectsz=512   sunit=16 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
>>
>> /dev/s2d_a1l003 is an alias to dm-0 which is a dm-multipath device to a
>> LUN on hardware RAID. The hardware geometry is 64k x 12, 768k. AFAIK no
>> geometry information has been specified for these device mapper devices
>> (someone else's responsibility). Though I assume dm geometry data is
>> the reason for mkfs.xfs throwing this alert. I don't find anything in
>> /sys/devices/virtual/block/dm-0/ indicating geometry.
>>
>> Any ideas how to verify what's going on here and fix it?
>
> # blockdev --getiomin --getioopt /dev/s2d_a1l003
>
> The first number, minimum io size, is what is used for sunit
> The 2nd number, optimal io size, is what is used for swidth
>
> Where dm got the geometry, I'm not sure - you'd have to look into
> how you set up the dm device, and what its defaults are I think.
Looks like they're being passed up the stack:
# blockdev --getiomin --getioopt /dev/dm-0
512
1048576
# multipath -ll
3600c0ff0003630917954075401000000 dm-0 Tek,DH6554
size=44T features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 9:0:0:3 sdj 8:144 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
`- 1:0:0:3 sdf 8:80 active ready running
/sys/block/sdj/queue# cat minimum_io_size
512
/sys/block/sdj/queue# cat optimal_io_size
1048576
root@Anguish-ssu-1:/sys/block/sdf/queue# cat minimum_io_size
512
root@Anguish-ssu-1:/sys/block/sdf/queue# cat optimal_io_size
1048576
So it's the presence of a value in optimal_io_size that causes the
problem here. My single disk workstation has min but not optimal, as I
assume most do. And I don't get this msg when formatting it.
It's interesting that mkfs.xfs would use these values given they are
static firmware values in most controllers. Thus they don't change when
one uses different geometry for different arrays/LUNs...
So it seems I can safely ignore this mkfs msg.
Thanks,
Stan
* Re: makefs alignment issue
From: Eric Sandeen @ 2014-10-24 22:19 UTC (permalink / raw)
To: Stan Hoeppner, xfs
On 10/24/14 5:08 PM, Stan Hoeppner wrote:
>
> On 10/24/2014 03:14 PM, Eric Sandeen wrote:
...
>>> Any ideas how to verify what's going on here and fix it?
>>
>> # blockdev --getiomin --getioopt /dev/s2d_a1l003
>>
>> The first number, minimum io size, is what is used for sunit
>> The 2nd number, optimal io size, is what is used for swidth
>>
>> Where dm got the geometry, I'm not sure - you'd have to look into
>> how you set up the dm device, and what its defaults are I think.
>
> Looks like they're being passed up the stack:
> # blockdev --getiomin --getioopt /dev/dm-0
> 512
> 1048576
>
> # multipath -ll
> 3600c0ff0003630917954075401000000 dm-0 Tek,DH6554
> size=44T features='0' hwhandler='0' wp=rw
> |-+- policy='round-robin 0' prio=50 status=active
> | `- 9:0:0:3 sdj 8:144 active ready running
> `-+- policy='round-robin 0' prio=10 status=enabled
> `- 1:0:0:3 sdf 8:80 active ready running
>
> /sys/block/sdj/queue# cat minimum_io_size
> 512
> /sys/block/sdj/queue# cat optimal_io_size
> 1048576
so supposedly a 512 byte stripe unit, and a 1MB width. Hrmph.
> root@Anguish-ssu-1:/sys/block/sdf/queue# cat minimum_io_size
> 512
> root@Anguish-ssu-1:/sys/block/sdf/queue# cat optimal_io_size
> 1048576
>
> So it's the presence of a value in optimal_io_size that causes the
> problem here. My single disk workstation has min but not optimal, as I
> assume most do. And I don't get this msg when formatting it.
>
> It's interesting that mkfs.xfs would use these values given they are
> static firmware values in most controllers. Thus they don't change when
> one uses different geometry for different arrays/LUNs...
Well, they should change with geometry, but many hardware raids
don't advertise anything meaningful.
There are some heuristics in mkfs to ignore things that just
obviously don't make sense. Perhaps we should ignore anything
with a sector-sized "stripe unit".
(I'll also ask the dm folks why they set it this way).
Dave, any thoughts?
> So it seems I can safely ignore this mkfs msg.
Probably so, but it'd be nice to get rid of it if we could, either
by ignoring sector-sized "stripe units" or changing what dm reports;
not sure.
-Eric
> Thanks,
> Stan
>
* Re: makefs alignment issue
From: Eric Sandeen @ 2014-10-24 22:27 UTC (permalink / raw)
To: Stan Hoeppner, xfs
On 10/24/14 5:19 PM, Eric Sandeen wrote:
> On 10/24/14 5:08 PM, Stan Hoeppner wrote:
>>
>> On 10/24/2014 03:14 PM, Eric Sandeen wrote:
>
> ...
>
>>>> Any ideas how to verify what's going on here and fix it?
>>>
>>> # blockdev --getiomin --getioopt /dev/s2d_a1l003
Also, what does it show for the underlying non-multipath device(s)?
-Eric
* Re: makefs alignment issue
From: Stan Hoeppner @ 2014-10-25 3:08 UTC (permalink / raw)
To: Eric Sandeen, xfs
On 10/24/2014 05:27 PM, Eric Sandeen wrote:
> On 10/24/14 5:19 PM, Eric Sandeen wrote:
>> On 10/24/14 5:08 PM, Stan Hoeppner wrote:
>>>
>>> On 10/24/2014 03:14 PM, Eric Sandeen wrote:
>>
>> ...
>>
>>>>> Any ideas how to verify what's going on here and fix it?
>>>>
>>>> # blockdev --getiomin --getioopt /dev/s2d_a1l003
>
> Also, what does it show for the underlying non-multipath device(s)?
# blockdev --getiomin --getioopt /dev/sdj
512
1048576
# blockdev --getiomin --getioopt /dev/sdf
512
1048576
Cheers,
Stan
* Re: makefs alignment issue
From: Eric Sandeen @ 2014-10-25 15:51 UTC (permalink / raw)
To: Stan Hoeppner, xfs
On 10/24/14 10:08 PM, Stan Hoeppner wrote:
> On 10/24/2014 05:27 PM, Eric Sandeen wrote:
>> On 10/24/14 5:19 PM, Eric Sandeen wrote:
>>> On 10/24/14 5:08 PM, Stan Hoeppner wrote:
>>>>
>>>> On 10/24/2014 03:14 PM, Eric Sandeen wrote:
>>>
>>> ...
>>>
>>>>>> Any ideas how to verify what's going on here and fix it?
>>>>>
>>>>> # blockdev --getiomin --getioopt /dev/s2d_a1l003
>>
>> Also, what does it show for the underlying non-multipath device(s)?
>
> # blockdev --getiomin --getioopt /dev/sdj
> 512
> 1048576
> # blockdev --getiomin --getioopt /dev/sdf
> 512
> 1048576
Ok, so dm multipath is just bubbling up what the device itself
is claiming; not dm's doing.
I forgot to ask (and you forgot to report...!) what version
of xfsprogs you're using....
Currently, blkid_get_topology() in xfsprogs does:
	/*
	 * Blkid reports the information in terms of bytes, but we want it in
	 * terms of 512 bytes blocks (just to convert it to bytes later..)
	 *
	 * If the reported values are the same as the physical sector size
	 * do not bother to report anything. It will just cause warnings
	 * if people specify larger stripe units or widths manually.
	 */
	val = blkid_topology_get_minimum_io_size(tp);
	if (val > *psectorsize)
		*sunit = val >> 9;
	val = blkid_topology_get_optimal_io_size(tp);
	if (val > *psectorsize)
		*swidth = val >> 9;
so in your case sunit probably wouldn't get set (can you confirm with
# blockdev --getpbsz that the physical sector size is also 512?)
But the optimal size is > physical sector so swidth gets set.
Bleah... can you just collect all of:
# blockdev --getpbsz --getss --getiomin --getioopt
for your underlying devices, and I'll dig into how xfsprogs is behaving for
those values. I have a hunch that we should be ignoring stripe units of 512
even if the "width" claims to be something larger.
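In code terms, that hunch might look something like this (an untested
sketch against the snippet above, not a patch):

	/*
	 * Sketch: treat a sector-sized minimum_io_size as "no real
	 * geometry", and then distrust optimal_io_size too; a stripe
	 * "width" with no stripe unit is almost certainly bogus.
	 * (Assumes *sunit/*swidth start out zero, as in the snippet.)
	 */
	val = blkid_topology_get_minimum_io_size(tp);
	if (val > *psectorsize)
		*sunit = val >> 9;
	val = blkid_topology_get_optimal_io_size(tp);
	if (val > *psectorsize && *sunit)
		*swidth = val >> 9;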
-Eric
* Re: makefs alignment issue
From: Stan Hoeppner @ 2014-10-25 17:35 UTC (permalink / raw)
To: Eric Sandeen, xfs
On 10/25/2014 10:51 AM, Eric Sandeen wrote:
> On 10/24/14 10:08 PM, Stan Hoeppner wrote:
>> On 10/24/2014 05:27 PM, Eric Sandeen wrote:
>>> On 10/24/14 5:19 PM, Eric Sandeen wrote:
>>>> On 10/24/14 5:08 PM, Stan Hoeppner wrote:
>>>>>
>>>>> On 10/24/2014 03:14 PM, Eric Sandeen wrote:
>>>>
>>>> ...
>>>>
>>>>>>> Any ideas how to verify what's going on here and fix it?
>>>>>>
>>>>>> # blockdev --getiomin --getioopt /dev/s2d_a1l003
>>>
>>> Also, what does it show for the underlying non-multipath device(s)?
>>
>> # blockdev --getiomin --getioopt /dev/sdj
>> 512
>> 1048576
>> # blockdev --getiomin --getioopt /dev/sdf
>> 512
>> 1048576
>
> Ok, so dm multipath is just bubbling up what the device itself
> is claiming; not dm's doing.
>
> I forgot to ask (and you forgot to report...!) what version
> of xfsprogs you're using....
Sorry Eric, my bad. I should know better after all these years. :(
It's old Debian 6.0 IIRC, let's see...
# xfs_repair -V
xfs_repair version 3.1.4
> Currently, blkid_get_topology() in xfsprogs does:
>
> 	/*
> 	 * Blkid reports the information in terms of bytes, but we want it in
> 	 * terms of 512 bytes blocks (just to convert it to bytes later..)
> 	 *
> 	 * If the reported values are the same as the physical sector size
> 	 * do not bother to report anything. It will just cause warnings
> 	 * if people specify larger stripe units or widths manually.
> 	 */
> 	val = blkid_topology_get_minimum_io_size(tp);
> 	if (val > *psectorsize)
> 		*sunit = val >> 9;
> 	val = blkid_topology_get_optimal_io_size(tp);
> 	if (val > *psectorsize)
> 		*swidth = val >> 9;
>
> so in your case sunit probably wouldn't get set (can you confirm with
> # blockdev --getpbsz that the physical sector size is also 512?)
# blockdev --getpbsz /dev/dm-0
512
> But the optimal size is > physical sector so swidth gets set.
>
> Bleah... can you just collect all of:
>
> # blockdev --getpbsz --getss --getiomin --getioopt
# blockdev --getpbsz --getss --getiomin --getioopt /dev/sdj
512
512
512
1048576
# blockdev --getpbsz --getss --getiomin --getioopt /dev/sdh
512
512
512
1048576
> for your underlying devices, and I'll dig into how xfsprogs is behaving for
> those values. I have a hunch that we should be ignoring stripe units of 512
> even if the "width" claims to be something larger.
Just a hunch? :)
If the same interface is used for Linux logical block devices (md, dm,
lvm, etc) and hardware RAID, I have a hunch it may be better to
determine that, if possible, before doing anything with these values.
As you said previously, and I agree 100%, a lot of RAID vendors don't
export meaningful information here. In this specific case, I think the
RAID engineers are exporting a value, 1 MB, that works best for their
cache management, or some other path in their firmware. They're
concerned with host interface xfer into the controller, not the IOs on
the back end to the disks. They don't see this as an end-to-end deal.
In fact, I'd guess most of these folks see their device as performing
magic, and it doesn't matter what comes in or goes out either end.
"We'll take care of it."
I don't know what underlying SCSI command is used for populating
optimal_io_size. I'm guessing this has different meaning for different
folks. You say optimal_io_size is the same as RAID width. Apply that
to this case:
hardware RAID 60 LUN, 4 arrays
16+2 RAID6, 256 KB stripe unit, 4096 KB stripe width
16 MB LUN stripe width
optimal_io_size = 16 MB
Is that an appropriate value for optimal_io_size even if this is the
RAID width? I'm not saying it isn't. I don't know. I don't know what
other layers of the Linux and RAID firmware stacks are affected by this,
nor how they're affected.
Thanks,
Stan
* Re: makefs alignment issue
From: Dave Chinner @ 2014-10-26 23:43 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Eric Sandeen, xfs
On Sat, Oct 25, 2014 at 12:35:17PM -0500, Stan Hoeppner wrote:
> If the same interface is used for Linux logical block devices (md, dm,
> lvm, etc) and hardware RAID, I have a hunch it may be better to
> determine that, if possible, before doing anything with these values.
> As you said previously, and I agree 100%, a lot of RAID vendors don't
> export meaningful information here. In this specific case, I think the
> RAID engineers are exporting a value, 1 MB, that works best for their
> cache management, or some other path in their firmware. They're
> concerned with host interface xfer into the controller, not the IOs on
> the back end to the disks. They don't see this as an end-to-end deal.
> In fact, I'd guess most of these folks see their device as performing
> magic, and it doesn't matter what comes in or goes out either end.
> "We'll take care of it."
Deja vu. This is an isochronous RAID array you are having trouble
with, isn't it?
FWIW, do your problems go away when you make your hardware LUN width
a multiple of the cache segment size?
> optimal_io_size. I'm guessing this has different meaning for different
> folks. You say optimal_io_size is the same as RAID width. Apply that
> to this case:
>
> hardware RAID 60 LUN, 4 arrays
> 16+2 RAID6, 256 KB stripe unit, 4096 KB stripe width
> 16 MB LUN stripe width
> optimal_io_size = 16 MB
>
> Is that an appropriate value for optimal_io_size even if this is the
> RAID width? I'm not saying it isn't. I don't know. I don't know what
> other layers of the Linux and RAID firmware stacks are affected by this,
> nor how they're affected.
yup, i'd expect minimum = 4MB (i.e. stripe unit 4MB so we align to
the underlying RAID6 luns) and optimal = 16MB for the stripe width
(and so with swalloc we align to the first lun in the RAID0).
This should be passed up unchanged through the stack if none of the
software layers are doing other geometry modifications (e.g. more
raid, thinp, etc).
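To spell out the arithmetic for that hypothetical layout (a sketch;
the numbers are from the example above, and the >> 9 is the same
byte-to-sector conversion the xfsprogs snippet does):

#include <stdio.h>

int main(void)
{
        unsigned long chunk = 256UL * 1024;     /* per-disk chunk           */
        unsigned long iomin = 16 * chunk;       /* one RAID6 stripe = 4MB   */
        unsigned long ioopt = 4 * iomin;        /* 4-LUN RAID0 width = 16MB */

        printf("sunit = %lu sectors, swidth = %lu sectors\n",
               iomin >> 9, ioopt >> 9);         /* 8192 and 32768 */
        return 0;
}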
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: makefs alignment issue
From: Stan Hoeppner @ 2014-10-27 23:04 UTC (permalink / raw)
To: Dave Chinner; +Cc: Eric Sandeen, xfs
On 10/26/2014 06:43 PM, Dave Chinner wrote:
> On Sat, Oct 25, 2014 at 12:35:17PM -0500, Stan Hoeppner wrote:
>> If the same interface is used for Linux logical block devices (md, dm,
>> lvm, etc) and hardware RAID, I have a hunch it may be better to
>> determine that, if possible, before doing anything with these values.
>> As you said previously, and I agree 100%, a lot of RAID vendors don't
>> export meaningful information here. In this specific case, I think the
>> RAID engineers are exporting a value, 1 MB, that works best for their
>> cache management, or some other path in their firmware. They're
>> concerned with host interface xfer into the controller, not the IOs on
>> the back end to the disks. They don't see this as an end-to-end deal.
>> In fact, I'd guess most of these folks see their device as performing
>> magic, and it doesn't matter what comes in or goes out either end.
>> "We'll take care of it."
>
> Deja vu. This is an isochronous RAID array you are having trouble
> with, isn't it?
I don't believe so. I'm pretty sure the parity rotates; i.e. standard
RAID5/6.
> > FWIW, do your problems go away when you make your hardware LUN width
> a multiple of the cache segment size?
Hadn't tried it. And I don't have the opportunity now as my contract
has ended. However the problems we were having weren't related to
controller issues but excessive seeking. I mentioned this in that
(rather lengthy) previous reply.
>> optimal_io_size. I'm guessing this has different meaning for different
>> folks. You say optimal_io_size is the same as RAID width. Apply that
>> to this case:
>>
>> hardware RAID 60 LUN, 4 arrays
>> 16+2 RAID6, 256 KB stripe unit, 4096 KB stripe width
>> 16 MB LUN stripe width
>> optimal_io_size = 16 MB
>>
>> Is that an appropriate value for optimal_io_size even if this is the
>> RAID width? I'm not saying it isn't. I don't know. I don't know what
>> other layers of the Linux and RAID firmware stacks are affected by this,
>> nor how they're affected.
>
> yup, i'd expect minimum = 4MB (i.e. stripe unit 4MB so we align to
> the underlying RAID6 luns) and optimal = 16MB for the stripe width
> (and so with swalloc we align to the first lun in the RAID0).
At minimum 4MB how does that affect journal writes which will be much
smaller, especially with a large file streaming workload, for which this
setup is appropriate? Isn't the minimum a hard setting? I.e. we can
never do an IO less than 4MB? Do other layers of the stack use this
variable? Are they expecting values this large?
> This should be passed up unchanged through the stack if none of the
> software layers are doing other geometry modifications (e.g. more
> raid, thinp, etc).
I agree, if RAID vendors all did the right thing...
Stan
* Re: makefs alignment issue
From: Dave Chinner @ 2014-10-28 0:32 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Eric Sandeen, xfs
On Mon, Oct 27, 2014 at 06:04:05PM -0500, Stan Hoeppner wrote:
> On 10/26/2014 06:43 PM, Dave Chinner wrote:
> > On Sat, Oct 25, 2014 at 12:35:17PM -0500, Stan Hoeppner wrote:
> >> If the same interface is used for Linux logical block devices (md, dm,
> >> lvm, etc) and hardware RAID, I have a hunch it may be better to
> >> determine that, if possible, before doing anything with these values.
> >> As you said previously, and I agree 100%, a lot of RAID vendors don't
> >> export meaningful information here. In this specific case, I think the
> >> RAID engineers are exporting a value, 1 MB, that works best for their
> >> cache management, or some other path in their firmware. They're
> >> concerned with host interface xfer into the controller, not the IOs on
> >> the back end to the disks. They don't see this as an end-to-end deal.
> >> In fact, I'd guess most of these folks see their device as performing
> >> magic, and it doesn't matter what comes in or goes out either end.
> >> "We'll take care of it."
> >
> > Deja vu. This is an isochronous RAID array you are having trouble
> > with, isn't it?
>
> I don't believe so. I'm pretty sure the parity rotates; i.e. standard
> RAID5/6.
The location of parity doesn't determine whether it is isochronous in
behaviour or not. Often RAID5/6 is marketing speak for "single/dual
parity", not the type of redundancy that is implemented in the
hardware ;)
> > FWIW, do your problems go away when you make your hardware LUN width
> > a multiple of the cache segment size?
>
> Hadn't tried it. And I don't have the opportunity now as my contract
> has ended. However the problems we were having weren't related to
> controller issues but excessive seeking. I mentioned this in that
> (rather lengthy) previous reply.
Right, but if you had a 768k stripe width and a 1MB cache segment
size, a cache segment operation would require two stripe widths to
be operated on, and only one would be a whole stripe width. Hence
the possibility of doing more IOs than are necessary to populate
or write back cache segments. i.e. it's a potential reason for
why the back end disks didn't have anywhere near the expected seek
capability they were supposed to have....
> >> optimal_io_size. I'm guessing this has different meaning for different
> >> folks. You say optimal_io_size is the same as RAID width. Apply that
> >> to this case:
> >>
> >> hardware RAID 60 LUN, 4 arrays
> >> 16+2 RAID6, 256 KB stripe unit, 4096 KB stripe width
> >> 16 MB LUN stripe width
> >> optimal_io_size = 16 MB
> >>
> >> Is that an appropriate value for optimal_io_size even if this is the
> >> RAID width? I'm not saying it isn't. I don't know. I don't know what
> >> other layers of the Linux and RAID firmware stacks are affected by this,
> >> nor how they're affected.
> >
> > yup, i'd expect minimum = 4MB (i.e. stripe unit 4MB so we align to
> > the underlying RAID6 luns) and optimal = 16MB for the stripe width
> > (and so with swalloc we align to the first lun in the RAID0).
>
> At minimum 4MB how does that affect journal writes which will be much
> smaller, especially with a large file streaming workload, for which this
> setup is appropriate? Isn't the minimum a hard setting? I.e. we can
> never do an IO less than 4MB? Do other layers of the stack use this
> variable? Are they expecting values this large?
No, "minimum_io_size" is for "minimum *efficient* IO size" not the
smallest supported IO size. The smallest supported IO sizes and
atomic IO sizes are defined by hw_sector_size,
physical_block_size and logical_block_size.
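All of those are exported side by side under the device's queue
directory in sysfs. A quick sketch that dumps the hard limits next to
the hints (the sdj path is just the example device from earlier in the
thread):

#include <stdio.h>

static unsigned long read_ul(const char *path)
{
        unsigned long v = 0;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%lu", &v) != 1)
                        v = 0;
                fclose(f);
        }
        return v;
}

int main(void)
{
        static const char *names[] = {
                "logical_block_size",   /* smallest legal I/O             */
                "physical_block_size",  /* smallest atomic I/O            */
                "minimum_io_size",      /* smallest *efficient* I/O: hint */
                "optimal_io_size",      /* preferred I/O multiple: hint   */
        };
        char path[128];

        for (int i = 0; i < 4; i++) {
                snprintf(path, sizeof(path),
                         "/sys/block/sdj/queue/%s", names[i]);
                printf("%-20s %lu\n", names[i], read_ul(path));
        }
        return 0;
}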
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: makefs alignment issue
From: Stan Hoeppner @ 2014-10-28 16:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: Eric Sandeen, xfs
On 10/27/2014 07:32 PM, Dave Chinner wrote:
> On Mon, Oct 27, 2014 at 06:04:05PM -0500, Stan Hoeppner wrote:
>> On 10/26/2014 06:43 PM, Dave Chinner wrote:
>>> On Sat, Oct 25, 2014 at 12:35:17PM -0500, Stan Hoeppner wrote:
>>>> If the same interface is used for Linux logical block devices (md, dm,
>>>> lvm, etc) and hardware RAID, I have a hunch it may be better to
>>>> determine that, if possible, before doing anything with these values.
>>>> As you said previously, and I agree 100%, a lot of RAID vendors don't
>>>> export meaningful information here. In this specific case, I think the
>>>> RAID engineers are exporting a value, 1 MB, that works best for their
>>>> cache management, or some other path in their firmware. They're
>>>> concerned with host interface xfer into the controller, not the IOs on
>>>> the back end to the disks. They don't see this as an end-to-end deal.
>>>> In fact, I'd guess most of these folks see their device as performing
>>>> magic, and it doesn't matter what comes in or goes out either end.
>>>> "We'll take care of it."
>>>
>>> Deja vu. This is an isochronous RAID array you are having trouble
>>> with, isn't it?
>>
>> I don't believe so. I'm pretty sure the parity rotates; i.e. standard
>> RAID5/6.
>
> The location of parity doesn't determine whether it is isochronous in
> behaviour or not. Often RAID5/6 is marketing speak for "single/dual
> parity", not the type of redundancy that is implemented in the
> hardware ;)
Yea, I know. It's the lack of buffering/blocking that makes it
isochronous. Above I was referring to something you said last year:
http://oss.sgi.com/archives/xfs/2013-06/msg00981.html
"And at the other end of the scale, isochronous RAID arrays tend to
have dedicated parity disks so that data read and write behaviour is
deterministic and therefore predictable from a high level...."
>>> FWIW, do your problems go away when you make your hardware LUN width
>>> a multiple of the cache segment size?
>>
>> Hadn't tried it. And I don't have the opportunity now as my contract
>> has ended. However the problems we were having weren't related to
>> controller issues but excessive seeking. I mentioned this in that
>> (rather lengthy) previous reply.
>
> Right, but if you had a 768k stripe width and a 1MB cache segment
> size, a cache segment operation would require two stripe widths to
> be operated on, and only one would be a whole stripe width. Hence
> the possibility of doing more IOs than are necessary to populate
> or write back cache segments. i.e. it's a potential reason for
> why the back end disks didn't have anywhere near the expected seek
> capability they were supposed to have....
That's a very good point. And it would be performance suicide for a vendor
pushing 224-drive arrays packed w/7.2k drives. I don't think that's
what's happening here though, as testing with 132 parallel dd threads
shows 1375 MB/s to the outermost AG. Peak single thread buffered dd
write is ~1500 - 1800 MB/s depending on file size, etc. So with the
right parallel workload we can achieve pretty close to peak LUN throughput.
>>>> optimal_io_size. I'm guessing this has different meaning for different
>>>> folks. You say optimal_io_size is the same as RAID width. Apply that
>>>> to this case:
>>>>
>>>> hardware RAID 60 LUN, 4 arrays
>>>> 16+2 RAID6, 256 KB stripe unit, 4096 KB stripe width
>>>> 16 MB LUN stripe width
>>>> optimal_io_size = 16 MB
>>>>
>>>> Is that an appropriate value for optimal_io_size even if this is the
>>>> RAID width? I'm not saying it isn't. I don't know. I don't know what
>>>> other layers of the Linux and RAID firmware stacks are affected by this,
>>>> nor how they're affected.
>>>
>>> yup, i'd expect minimum = 4MB (i.e. stripe unit 4MB so we align to
>>> the underlying RAID6 luns) and optimal = 16MB for the stripe width
>>> (and so with swalloc we align to the first lun in the RAID0).
>>
>> At minimum 4MB how does that affect journal writes which will be much
>> smaller, especially with a large file streaming workload, for which this
>> setup is appropriate? Isn't the minimum a hard setting? I.e. we can
>> never do an IO less than 4MB? Do other layers of the stack use this
>> variable? Are they expecting values this large?
>
> No, "minimum_io_size" is for "minimum *efficient* IO size" not the
> smallest supported IO size. The smallest supported IO sizes and
> atomic IO sizes are defined by hw_sector_size,
> physical_block_size and logical_block_size.
Ok got it. So this value is a performance hint. So would it be better
for RAID vendors to simply populate these values with zeros instead of
values that don't match the LUN geometry, as is the case with the arrays
I've been working with?
BTW, where/how are these values obtained? Are they returned in response
to a SCSI inquiry? If so, which SCSI command?
Thanks,
Stan