* Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
@ 2012-09-18 14:37 Ferry
2012-09-18 16:48 ` Chris Murphy
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Ferry @ 2012-09-18 14:37 UTC (permalink / raw)
To: linux-raid, target-devel
Hi there,
we're having serious performance issues with the LIO iSCSI target on a 7
disk RAID-5 set + hotspare (mdadm). As I'm not sure where to go, I've
sent this to both linux-raid and target-devel lists.
We're seeing write performance in the order of, don't fall off your
chair, 3MB/s. This is once the buffers are full. Before the buffers are
full we're near wirespeed (gigabit). We're running blockio in buffered
mode with LIO. The machine is running Ubuntu 12.04 LTS Server (64 bit).
Besides the (Ubuntu) stock kernels I have tried several 3.5 versions
from Ubuntu's mainline repository, which seem somewhat faster (up to
6-15MiB/s); however, at least 3.5.2 and 3.5.3 were unstable and made the
machine crash after ~1 day.
As the machine is running production for a backup solution I'm severely
limited in my maintenance windows for testing.
Whilst writing (copying a DVD from the Windows 2008 R2 initiator to the
target - no other I/O was active), I noticed something in iostat that I
personally find very weird. All the disks in the RAID set (minus the
spare) seem to read 6-7 times as much as they write. Since there is no
other I/O (so there aren't really any reads issued besides some very
occasional NTFS overhead) I find this really weird. Note also that
iostat doesn't show the reads on the md device (which is the case if
the initiator issues reads) but only on the active disks in the RAID
set, which to me (unknowing as I am :)) indicates mdadm in the kernel
is issuing those reads.
So for example I see disk <sdX> do 600-700kB/s of reading in iostat
whilst it's writing about 100kB/s.
I think the majority of the issue comes from that.
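(For reference, this is roughly how I watch it - device names are
specific to this box and purely illustrative:)

  # per-device throughput, refreshed every 5 seconds; md4 is the RAID-5
  # set, sdb..sdh are its members (adjust names to your setup)
  iostat -dk sdb sdc sdd sde sdf sdg sdh md4 5
  # kB_read/s on the members far exceeding kB_wrtn/s, while md4 itself
  # shows no reads, is the pattern described above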
I've switched back to IETD now. With IETD I can copy at 55MiB/s to the
device *whilst* reading from the same device (copy an ISO onto it, then
copy the ISO from the disk back to the disk, then copy all copies a
couple of times - so both read/write). Iostat with IETD whilst writing
shows say 110-120% read per write; however, in this case we were also
actually reading. So to keep it simple, it read 110-120kB/s whilst
writing 100kB/s per disk. This is a very serious difference. IETD is
running in fileio mode (write-back), so it buffers too. So if we
subtract the actual reading it's IETD 10-20% read on 100% write, vs LIO
600-700% read on 100% write. That's quite upsetting.
It seems to me the issue exists between LIO's buffers and mdadm. Why it
writes so horribly inefficiently is beyond me though. I've invested
quite some time in this already - however, due to the way I've tested
(huge intervals / different kernels, some disks have been swapped, etc.)
and my lack of in-depth kernel knowledge, I don't think much of it is
accurate enough to post here.
Can someone advise me how to proceed? I was hoping to switch to LIO and
see a slight improvement in performance (besides more/better
functionality such as error correction and hopefully better stability).
This has turned out quite differently, unfortunately.
Do note - my setup is somewhat unorthodox. I've created a RAID-5 of 7
disks + hotspare (it was originally a RAID-6 w/o hotspare but I
converted it to RAID-5 in hopes of improving performance). This disk is
about 12TB. It's partitioned with GPT into ~9TB and ~2.5TB (there are
huge rounding differences at these sizes, 1000 vs 1024 et al :)). The
2.5TB currently isn't used. I've thus exported /dev/md4p1. This in turn
is partitioned (GPT - msdos isn't usable) in Windows and used as a disk.
In order to do this I had to modify rtslib as it didn't recognize
md4p1 as a block device. I've added the major device numbers to the list
there and could export it just 'fine' then. The issues might be related
to this.
If anyone is willing to help me modify the partition table so I can just
export /dev/md4, I can test it. I'm not sure about the offsets (0
included or not, for example) and I don't really want to mess up ~7TiB
of data ;). I don't really have another set of disks this large, nor the
time to copy it back and forth. Just adjusting the partition table
should work - if proper values are used. The partition on /dev/md4
should then point at the current /dev/md4p1p2, if you will (/dev/md4p1
is exported and is thus seen as a disk - this in turn is partitioned by
Windows; for some odd reason it created a system partition too (afaik it
usually only does this on the install disk, and this is just a data
disk), so there's a ~128MiB partition and then the data partition).
With msdos partitions I could easily mess with it myself. GPT however
also has mirror headers, and those might actually overwrite my data if
done incorrectly. At least - that's what I'm worried about; not sure if
that theory is solid. Also, with msdos I could easily make backups of
the partition table with dd or sfdisk. I'm not aware of such a tool for
GPT. Parted doesn't seem to be able to dump them.
Kind regards,
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-18 14:37 Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI) Ferry
@ 2012-09-18 16:48 ` Chris Murphy
2012-09-18 19:49 ` Nicholas A. Bellinger
2012-09-18 20:06 ` Peter Grandi
2 siblings, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2012-09-18 16:48 UTC (permalink / raw)
To: Linux RAID
On Sep 18, 2012, at 8:37 AM, Ferry wrote:
> As I'm not sure where to go, I've
> sent this to both linux-raid and target-devel lists.
I think you're better off taking a chance with one, waiting a few days, then going with another, rather than crossposting.
> So for example I see disk <sdX> do 600-700kB/s reading in I/O stat
> whilst it's writing about 100kB/s.
You've provided no information on the hardware being used, version of kernel and mdadm, or info on the array such as mdadm -Evv <array>.
> With msdos partitions I could easily mess with it myself. GPT however
> also has mirror headers and those might actually overwrite my data if
> done incorrectly. At least - that's what I'm worried about, not sure if
> that theory is solid.
GPT contains checksums for the header and table, so you can't just make manual table edits. And there's no need, just use gdisk (a.k.a. GPT fdisk) or sgdisk.
> Also, with msdos I could easily make backups with
> dd or sfdisk (of the partition table). Not aware of such a tool for gpt.
> Parted doesn't seem to be able to dump them.
gdisk (interactive) or sgdisk (command line) can do this.
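For example (device name illustrative):

  # save both GPT headers and the partition table to a file
  sgdisk --backup=/root/md4-gpt.bak /dev/md4
  # and restore it later if an edit goes wrong
  sgdisk --load-backup=/root/md4-gpt.bak /dev/md4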
Chris Murphy
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-18 14:37 Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI) Ferry
2012-09-18 16:48 ` Chris Murphy
@ 2012-09-18 19:49 ` Nicholas A. Bellinger
2012-09-18 21:18 ` Peter Grandi
2012-09-19 6:44 ` Arne Redlich
2012-09-18 20:06 ` Peter Grandi
2 siblings, 2 replies; 13+ messages in thread
From: Nicholas A. Bellinger @ 2012-09-18 19:49 UTC (permalink / raw)
To: Ferry; +Cc: linux-raid, target-devel
On Tue, 2012-09-18 at 16:37 +0200, Ferry wrote:
> Hi there,
>
Hi Ferry,
> we're having serious performance issues with the LIO iSCSI target on a 7
> disk RAID-5 set + hotspare (mdadm). As I'm not sure where to go, I've
> sent this to both linux-raid and target-devel lists.
>
> We're seeing write performance in the order of, don't fall of your
> chair, 3MB/s. This is once the buffers are full. Before the buffers are
> full we're near wirespeed (gigabit). We're running blockio in buffered
> mode with LIO. The machine is running Ubuntu 12.04 LTS Server (64 bit).
> Next to the (ubuntu) stock kernels I have tried several 3.5 versions
> from Ubuntu's mainline repository, which seem somewhat faster (up to
> 6-15MiB/s), however, at least 3.5.2 and 3.5.3 were unstable and made the
> machine crash after ~1 day.
>
> As the machine is running production for a backup solution I'm severely
> limited in my windows for testing.
>
> Whilst writing, copying a DVD from the Windows 2008 R2 initiator to the
> target - no other I/O was active, I noticed in iostat something I
> personally find very weird. All the disks in the RAID set (minus the
> spare) seem to read 6-7 times as much as they write. Since there is no
> other I/O (so there aren't really any reads issued besides some very
> occasional overhead for NTFS perhaps once in a while) I find this really
> weird. Note also that iostat doesn't show the reads in iostat on the md
> device (which is the case if the initiator issues reads) but only on the
> active disks in the RAID set, which to me (unknowingly as I am :))
> indicates mdadm in the kernel is issuing those reads.
>
> So for example I see disk <sdX> do 600-700kB/s reading in I/O stat
> whilst it's writing about 100kB/s.
>
> I think the majority of the issue comes from that.
>
> I've switched back to IETD now. With IETD I can copy with 55MiB/s to the
> device *whilst* reading from the same device (copy an ISO onto it, then
> copy the ISO from the disk back to the disk, then copy all copies couple
> of times - so both read/write). Iostat with IETD whilst writing shows
> say 110-120% read per write, however, in this case we were also actually
> reading. So to keep it simple, it read 110-120kB/s whilst writing
> 100kB/s per disk. This is a very serious difference. IETD is running in
> fileio mode (write-back), so it buffers too. So if we substract the
> actual reading it's IETD 10-20% read on 100% write, vs LIO 600-700% read
> on 100% write. That's quite upsetting.
>
Are you enabling emulate_write_cache=1 with your iblock backends..? This
can have a gigantic effect on initiator performance for both MSFT +
Linux SCSI clients.
Also, you'll want to double check
your /sys/block/sdd/queue/max*sectors_kb for the MD RAID to make sure
the WRITEs are stripe aligned to get best performance with software MD
raid.
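Something like the following, assuming an iblock backstore created as,
say, iblock_0/md_backup (name and device paths purely illustrative):

  # report WCE=1 for the exported LUN
  echo 1 > /sys/kernel/config/target/core/iblock_0/md_backup/attrib/emulate_write_cache
  # check the request size limits of the md device and its members
  cat /sys/block/md4/queue/max_sectors_kb /sys/block/md4/queue/max_hw_sectors_kb
  grep . /sys/block/sd[b-h]/queue/max_sectors_kb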
> It seems to me the issue exists between LIO's buffers and mdadm. Why it
> writes so horribly inefficiently is beyond me though. I've invested
> quite some time in this already - however due to the way I've tested
> (huge intervals / different kernels, some disks have been swapped, etc)
> and my lack of in-depth kernel knowledge I don't think much of it is
> accurate enough to post here.
>
> Can someone advise me how to proceed? I was hoping to switch to LIO and
> see a slight improvement in performance (besides more/better
> functionality as error correction and hopefully better stability). This
> has turned out quite differently unfortunately.
>
> Do note - I'm running somewhat unorthodox. I've created a RAID-5 of 7
> disks + hotspare (it was originally a RAID-6 w/o hotspare but converted
> it to RAID-5 in hopes of improving performance). This disk is about
> 12TB. It's partitioned with GPT in ~9TB and ~2.5TB (there's huge
> rounding differences at these sizes 1000 vs 1024 et al :)). The 2.5TB
> currently isn't used. I've exported /dev/md4p1 thus. This in turn is
> partitioned (GPT - msdos isn't usable) in windows and used as a disk.
>
> In order to do this I had to modify rtslib as it didn't recognize the
> md4p1 as a block device. I've added the major device numbers to the list
> there and could export it just 'fine' then. The issues might be related
> to this.
>
Therein lies the problem causing your oops. IBLOCK is *not* intended
to export partitions from a block device. There is a reason why rtslib
is preventing that from occurring. ;)
Please use FILEIO with this reporting emulate_write_cache=1 (WCE=1) to
the SCSI clients. Note that by default in recent kernel releases
we've changed FILEIO backends to always use O_SYNC to ensure data
consistency during a hard power failure, regardless of the
emulate_write_cache=1 setting.
Also note that by default it's my understanding that IETD uses buffered
FILEIO for performance, so in your particular type of setup you'd still
see better performance with buffered FILEIO, but would still have the
potential risk of silent data corruption with buffered FILEIO.
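As a quick sanity check from a Linux initiator (sdX standing in for the
iSCSI-attached disk - purely illustrative), the reported caching page
can be read back with sdparm:

  # should show WCE set to 1 once emulate_write_cache=1 is active
  sdparm --get=WCE /dev/sdX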
> If anyone is willing to help me modify the partition table so I can just
> export /dev/md4 I can test it.
Please use FILEIO for exporting partitions from block devices. If you
still need the extra performance of buffered FILEIO for your setup, +
understand the possible data integrity risks associated with using
buffered FILEIO during a hard power failure, I'm fine with re-adding
this to target_core_file for v3.7 code for people who really know
what they are doing.
--nab
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-18 19:49 ` Nicholas A. Bellinger
@ 2012-09-18 21:18 ` Peter Grandi
2012-09-18 22:20 ` Nicholas A. Bellinger
2012-09-19 14:19 ` freaky
2012-09-19 6:44 ` Arne Redlich
1 sibling, 2 replies; 13+ messages in thread
From: Peter Grandi @ 2012-09-18 21:18 UTC (permalink / raw)
To: Linux RAID, target-devel
>> [ ... ] Before the buffers are full we're near wirespeed
>> (gigabit). We're running blockio in buffered mode with LIO. [
>> ... ] Whilst writing, copying a DVD from the Windows 2008 R2
>> initiator to the target - no other I/O was active, I noticed
>> in iostat something I personally find very weird. All the
>> disks in the RAID set (minus the spare) seem to read 6-7
>> times as much as they write. [ ... ] iostat doesn't show the
>> reads in iostat on the md device (which is the case if the
>> initiator issues reads) but only on the active disks in the
>> RAID set, [ ... ]
This seems to indicate, as I mentioned in a previous comment, that
there are RAID setup issues...
>> I've switched back to IETD now. With IETD I can copy with
>> 55MiB/s to the device *whilst* reading from the same device
>> (copy an ISO onto it, then copy the ISO from the disk back to
>> the disk, then copy all copies couple of times - so both
>> read/write).
For a RAID set of 6+1 2TB drives each capable of 60-120MB/s that
is still pretty terrible speed (even if the performance seems
not too bad).
>> Iostat with IETD whilst writing shows say 110-120% read per
>> write, however, in this case we were also actually reading.
>> [ ... ] IETD is running in fileio mode (write-back), so it
>> buffers too. [ ... ]
That probably gives MD a bit of help with aligned writes, or perhaps
by that point the array had been resynced, who knows...
> Are you enabling emulate_write_cache=1 with your iblock
> backends..? This can have a gigantic effect on initiator
> performance for both MSFT + Linux SCSI clients.
That sounds interesting, but also potentially rather dangerous,
unless there is a very reliable implementation of IO barriers.
Just like with enabling write caches on real disks...
> [ ... ] check your [ ... ]/queue/max*sectors_kb for the MD
> RAID to make sure the WRITEs are striped aligned to get best
> performance with software MD raid.
That does not quite ensure that the writes are stripe aligned,
but perhaps a larger stripe cache would help.
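For what it is worth, the md stripe cache is tunable at runtime (md4
used as an illustrative device name):

  cat /sys/block/md4/md/stripe_cache_size      # default is 256
  echo 8192 > /sys/block/md4/md/stripe_cache_size
  # the value is in 4KiB pages per member device, so raising it costs
  # memory but gives RAID-5/6 more room to gather full stripes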
> Please use FILEIO with this reporting emulate_write_cache=1
> (WCE=1) to the SCSI clients. Note that by default in the last
> kernel releases we've change FILEIO backends to only always
> use O_SYNC to ensure data consistency during a hard power
> failure, regardless of the emulate_write_cache=1 setting.
Ahh interesting too. That's also the right choice unless there
is IO barrier support at all levels.
> Also note that by default it's my understanding that IETD uses
> buffered FILEIO for performance, so in your particular type of
> setup you'd still see better performance with buffered FILEIO,
> but would still have the potential risk of silent data
> corruption with buffered FILEIO.
Not silent data corruption, but data loss. Silent data
corruption is usually meant for the case where an IO completes
and reports success, but the data recorded is not the data
submitted.
> [ ... ] understand the possible data integrity risks
> associated with using buffered FILEIO during a hard power
> failure, I'm fine with re-adding this back into
> target_core_file for v3.7 code for people who really know what
> they are doing.
That "people who really know what they are doing" is generally a
bit optimistic :-).
Do the various modes support IO barriers? That usually is what
is critical, at least for the better informed people.
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-18 21:18 ` Peter Grandi
@ 2012-09-18 22:20 ` Nicholas A. Bellinger
2012-09-19 10:49 ` joystick
2012-09-19 14:19 ` freaky
1 sibling, 1 reply; 13+ messages in thread
From: Nicholas A. Bellinger @ 2012-09-18 22:20 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID, target-devel
On Tue, 2012-09-18 at 22:18 +0100, Peter Grandi wrote:
> >> [ ... ] Before the buffers are full we're near wirespeed
> >> (gigabit). We're running blockio in buffered mode with LIO. [
> >> ... ] Whilst writing, copying a DVD from the Windows 2008 R2
> >> initiator to the target - no other I/O was active, I noticed
> >> in iostat something I personally find very weird. All the
> >> disks in the RAID set (minus the spare) seem to read 6-7
> >> times as much as they write. [ ... ] iostat doesn't show the
> >> reads in iostat on the md device (which is the case if the
> >> initiator issues reads) but only on the active disks in the
> >> RAID set, [ ... ]
>
> This seems to indicate as I mentioned in a previous comment that
> there are RAID setup issues...
>
<SNIP>
> > Are you enabling emulate_write_cache=1 with your iblock
> > backends..? This can have a gigantic effect on initiator
> > performance for both MSFT + Linux SCSI clients.
>
> That sounds interesting, but also potentially rather dangerous,
> unless there is a very reliable implementation of IO barriers.
> Just like with enabling write caches on real disks...
>
Not exactly. The name of the 'emulate_write_cache' device attribute is
a bit misleading here. This bit simply reports (to the SCSI client)
that the WCE=1 bit is set when the SCSI mode sense (caching page) is
read during the initial LUN scan.
For IBLOCK backends using submit_bio(), the I/O operations are already
bypassing the buffer cache altogether + are fully asynchronous. So for
IBLOCK we just want to tell the SCSI client to be more aggressive with
its I/O submission (SCSI clients have historically been extremely
sensitive when WCE=0 is reported), but this attribute is actually
separate from whatever WCE=1 setting may be active on the drives making
up the MD RAID block device that's being exported as a SCSI target LUN.
For FILEIO this can be different. We originally had an extra parameter
passed into rtslib -> /sys/kernel/config/target/core/$HBA/$DEV/control
to optionally disable O_*SYNC -> enable buffered FILEIO operation. In
buffered FILEIO operation we expect the initiator to be smart enough to
use FUA (forced unit access) WRITEs + SYNCHRONIZE_CACHE to force
write-out of FILEIO blocks still in the buffer cache.
> > [ ... ] check your [ ... ]/queue/max*sectors_kb for the MD
> > RAID to make sure the WRITEs are striped aligned to get best
> > performance with software MD raid.
>
> That does not quite ensure that the writes are stripe aligned,
> but perhaps a larger stripe cache would help.
>
I'm talking about what MD raid has chosen as its underlying
max_sectors_kb to issue I/O to the underlying raid member devices. This
depends on what backend storage hardware is in use; this may end up as
'127', which will result in ugly misaligned writes that end up killing
performance.
We've (RTS) changed this with a one-liner patch to raid456.c code on .32
based distro kernels in the past to get proper stripe aligned writes,
and it obviously makes a huge difference with fast storage hardware.
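To make the arithmetic concrete for this particular set (64KiB chunk, 6
data members - numbers taken from this thread; the sysfs paths are the
usual queue attributes):

  # a full stripe is 64KiB * 6 = 384KiB, so requests reaching md should
  # be allowed to be at least that large
  cat /sys/block/md4/queue/max_sectors_kb      # soft limit (writable)
  cat /sys/block/md4/queue/max_hw_sectors_kb   # hard limit (read-only)
  # a 127KiB limit can never carry a whole 384KiB stripe, hence the RMWs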
> > Please use FILEIO with this reporting emulate_write_cache=1
> > (WCE=1) to the SCSI clients. Note that by default in the last
> > kernel releases we've change FILEIO backends to only always
> > use O_SYNC to ensure data consistency during a hard power
> > failure, regardless of the emulate_write_cache=1 setting.
>
> Ahh interesting too. That's also the right choice unless there
> is IO barrier support at all levels.
>
> > Also note that by default it's my understanding that IETD uses
> > buffered FILEIO for performance, so in your particular type of
> > setup you'd still see better performance with buffered FILEIO,
> > but would still have the potential risk of silent data
> > corruption with buffered FILEIO.
>
> Not silent data corruption, but data loss. Silent data
> corruption is usually meant for the case where an IO completes
> and reports success, but the data recorded is not the data
> submitted.
>
That's exactly what I'm talking about.
With buffered FILEIO enabled, an incoming WRITE payload will already
have been ACKed back to the SCSI fabric and up the storage -> filesystem
stack, but if a power loss were to occur before that data has been
written out (absent a battery back-up unit, for example), then the FS on
the client will have (silently) lost data.
This is why we removed the buffered FILEIO from mainline in the first
place, but in retrospect if people understand the consequences and still
want to use buffered FILEIO for performance reasons they should be able
to do so.
--nab
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-18 22:20 ` Nicholas A. Bellinger
@ 2012-09-19 10:49 ` joystick
2012-09-23 1:01 ` Nicholas A. Bellinger
0 siblings, 1 reply; 13+ messages in thread
From: joystick @ 2012-09-19 10:49 UTC (permalink / raw)
To: nab; +Cc: Peter Grandi, Linux RAID, target-devel
On 09/19/12 00:20, Nicholas A. Bellinger wrote:
>
>>> Are you enabling emulate_write_cache=1 with your iblock
>>> backends..? This can have a gigantic effect on initiator
>>> performance for both MSFT + Linux SCSI clients.
>> That sounds interesting, but also potentially rather dangerous,
>> unless there is a very reliable implementation of IO barriers.
>> Just like with enabling write caches on real disks...
>>
> Not exactly. The name of the 'emulate_write_cache' device attribute is
> a bit mis-leading here. This bit simply reports (to the SCSI client)
> that the WCE=1 bit is set during SCSI mode sense (caching page) is read
> during the initial LUN scan.
Then can I say that the default is wrong?
You are declaring as writethrough a device that is almost certainly
writeback (because at least the HDDs will have caches).
If power is lost at the iscsi target, there WILL be data loss. People do
not expect that. Change the default!
Besides this, I don't understand how declaring an iscsi target as
writethrough could make initiators voluntarily slow down their
operations. That would be a bug in the initiators, because writethrough
is "better" than writeback for all purposes: initiators should just skip
the queue drain / flush / FUA, and all the rest should be the same.
>>> [ ... ] check your [ ... ]/queue/max*sectors_kb for the MD
>>> RAID to make sure the WRITEs are striped aligned to get best
>>> performance with software MD raid.
>> That does not quite ensure that the writes are stripe aligned,
>> but perhaps a larger stripe cache would help.
>>
> I'm talking about what MD raid has chosen as it's underlying
> max_sectors_kb to issue I/O to the underlying raid member devices. This
> depends on what backend storage hardware is in use, this may end up as
> '127', which will result in ugly mis-aligned writes that ends up killing
> performance.
Interesting observation.
For local processes writing, MD probably waits long enough for other
requests to arrive and fill a stripe before initiating an rmw; but maybe
iscsi is too slow for that and MD initiates an rmw for each request,
which would be a zillion RMWs.
Can that be? Does anyone know MD well enough to say whether MD waits a
little bit for more data in an attempt to fill an entire stripe before
proceeding with the rmw? If yes, can such a timeout be set?
> We've (RTS) changed this with a one-liner patch to raid456.c code on .32
> basded distro kernels in the past to get proper stripe aligned writes,
> and it obviously makes a huge difference with fast storage hardware.
This value is writable via sysfs, why do you need a patch?
> That's exactly what I'm talking about.
>
> With buffered FILEIO enabled a incoming WRITE payload will have already
> been ACKs back to the SCSI fabric and up the storage -> filesystem
> stack, but if a power loss was to occur before that data has been
> written out (using a battery back-up unit for example), then the FS on
> the client will have (silently) lost data.
>
> This is why we removed the buffered FILEIO from mainline in the first
> place, but in retrospect if people understand the consequences and still
> want to use buffered FILEIO for performance reasons they should be able
> to do so.
If you declare the target as writeback and implement flush+FUA, no data
loss should occur AFAIU, isn't that so?
AFAIR, hard disks normally declare all operations to be complete
immediately after you submit them (while in reality they are still in
the cache), but if you issue a flush+FUA they make an exception to this
rule and make sure that this operation and all previously submitted
operations are indeed on the platter before returning. Do I remember
correctly?
Can you do the same for buffered FILEIO?
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-19 10:49 ` joystick
@ 2012-09-23 1:01 ` Nicholas A. Bellinger
0 siblings, 0 replies; 13+ messages in thread
From: Nicholas A. Bellinger @ 2012-09-23 1:01 UTC (permalink / raw)
To: joystick; +Cc: Peter Grandi, Linux RAID, target-devel
On Wed, 2012-09-19 at 12:49 +0200, joystick wrote:
> On 09/19/12 00:20, Nicholas A. Bellinger wrote:
> >
> >>> Are you enabling emulate_write_cache=1 with your iblock
> >>> backends..? This can have a gigantic effect on initiator
> >>> performance for both MSFT + Linux SCSI clients.
> >> That sounds interesting, but also potentially rather dangerous,
> >> unless there is a very reliable implementation of IO barriers.
> >> Just like with enabling write caches on real disks...
> >>
> > Not exactly. The name of the 'emulate_write_cache' device attribute is
> > a bit mis-leading here. This bit simply reports (to the SCSI client)
> > that the WCE=1 bit is set during SCSI mode sense (caching page) is read
> > during the initial LUN scan.
>
> Then can I say that the default is wrong?
No, spinning media drives never enable WRITE cache by default from the
factory.
The SSDs that enable WCE=1 typically aren't going to have a traditional
cache, but from what I understand can still disable WCE=0 in-band.
But you are correct that in this case the user would still currently be
expected to set WCE=1 for IBLOCK when the backends have enabled their
own write caching policy.
The reason being that both of these virtual drivers can't peek at the
lower SCSI layer (at the kernel code level) to figure out what the
underlying block device (or virtual device) is doing for caching.
> You are declaring writethrough a device that is almost certainly a
> writeback (because at least HDDs will have caches).
>
That is controlled by the scsi caching mode page on the underlying
drive, which can be changed with sg_raw or sdparm --set=WCE.
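For example (illustrative member device):

  hdparm -W /dev/sdb                  # show the drive's write-caching state
  sdparm --get=WCE /dev/sdb           # same, via the SCSI caching mode page
  sdparm --set=WCE --save /dev/sdb    # enable it persistently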
> If power is lost at the iscsi target, there WILL be data loss. People do
> not expect that. Change the default!
>
Just blindly enabling WCE=1 is not the correct solution for all cases.
>
> Besides this, I don't understand how declaring an iscsi target as
> writethrough could slow down operations volountarily by initiators. That
> would be a bug of the initiators because writethrough is "better" than
> writeback for all purposes: initiators should just skip the queue drain
> / flush / FUA, and all the rest should be the same.
>
Depends on the client. For example, .32 distro based SCSI initiators are
still using legacy barriers instead of the modern WRITE FUA used since
>= .38, and that ends up having a huge effect when going 20 Gb/sec with
lots of 15K SAS disks.
>
> >>> [ ... ] check your [ ... ]/queue/max*sectors_kb for the MD
> >>> RAID to make sure the WRITEs are striped aligned to get best
> >>> performance with software MD raid.
> >> That does not quite ensure that the writes are stripe aligned,
> >> but perhaps a larger stripe cache would help.
> >>
> > I'm talking about what MD raid has chosen as it's underlying
> > max_sectors_kb to issue I/O to the underlying raid member devices. This
> > depends on what backend storage hardware is in use, this may end up as
> > '127', which will result in ugly mis-aligned writes that ends up killing
> > performance.
>
> Interesting observation.
> For local processes writing, probably MD waits enough time for other
> requests to come and fill a stripe before initiating a rmw; but maybe
> iscsi is too slow for that and MD initiates an rmw for each request
> which would be a zillion of RMWs.
> Can that be? Anyone knows MD enough to say if MD waits a little bit for
> more data in the attempt of filling an entire stripe before proceeding
> with rmw? If yes, can such timeout be set?
>
> > We've (RTS) changed this with a one-liner patch to raid456.c code on .32
> > basded distro kernels in the past to get proper stripe aligned writes,
> > and it obviously makes a huge difference with fast storage hardware.
>
> This value is writable via sysfs, why do you need a patch?
>
Actually I meant max_hw_sectors_kb here, and no, it's not changeable via
sysfs either when the default is set to a value (like 127) that would
cause unaligned WRITEs to occur.
This can also have a devastating effect on MD raid performance.
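In sysfs terms (md4 as an illustrative device):

  # the hard limit is a read-only ceiling:
  cat /sys/block/md4/queue/max_hw_sectors_kb
  # the soft limit is writable, but only up to that ceiling:
  echo 384 > /sys/block/md4/queue/max_sectors_kb   # rejected if 384 exceeds the ceiling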
> > That's exactly what I'm talking about.
> >
> > With buffered FILEIO enabled a incoming WRITE payload will have already
> > been ACKs back to the SCSI fabric and up the storage -> filesystem
> > stack, but if a power loss was to occur before that data has been
> > written out (using a battery back-up unit for example), then the FS on
> > the client will have (silently) lost data.
> >
> > This is why we removed the buffered FILEIO from mainline in the first
> > place, but in retrospect if people understand the consequences and still
> > want to use buffered FILEIO for performance reasons they should be able
> > to do so.
>
>
> If you declare the target as writeback and implement flush+FUA, no data
> loss should occur AFAIU, isn't that so?
>
> AFAIR, hard disks do normally declare all operations to be complete
> immediately after you submit (while they are still in the cache in
> reality), but if you issue a flush+FUA they make an exception to this
> rule and make sure that this operation and all previously submitted
> operations are indeed on the platter before returning. Do I remember
> correctly?
>
> Can you do the same for buffered FILEIO?
>
So for v3.7 we'll be re-allowing buffered FILEIO to be optionally
enabled + force WCE=1 for people who really know what they are
doing.
For the other case you've mentioned, I'd much rather do this in
userspace via rtslib based upon existing sysfs values to automatically
set emulate_write_cache=1 for IBLOCK backend export of struct
block_device, rather than enable WCE=1 for all cases with IBLOCK.
--nab
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-18 21:18 ` Peter Grandi
2012-09-18 22:20 ` Nicholas A. Bellinger
@ 2012-09-19 14:19 ` freaky
2012-09-19 17:20 ` Chris Murphy
1 sibling, 1 reply; 13+ messages in thread
From: freaky @ 2012-09-19 14:19 UTC (permalink / raw)
To: target-devel, linux-raid
> For a RAID set of 6+1 2TB drives each capable of 60-120MB/s that
> is still pretty terrible speed (even if the performance seems
> not too bad).
>
Yes, but do note the 3.2 kernel has issues with the queue limits:
max_sectors_kb and max_hw_sectors_kb are set to 127. I've seen this on
some machines with late 2.6 kernels, and on 3.0 and 3.1 too iirc. It
seems fixed in 3.5. However, I had issues compiling the
iscsitarget-dkms modules against the 3.5 kernel (from the package
manager) and haven't taken the time to build a newer version myself, so
I haven't tested IET with 3.5.
Also, since it's reading and writing at the same time now, it's no
longer purely sequential (it nearly was before, apart from fs overhead).
>>> Iostat with IETD whilst writing shows say 110-120% read per
>>> write, however, in this case we were also actually reading.
>>> [ ... ] IETD is running in fileio mode (write-back), so it
>>> buffers too. [ ... ]
> That probably helps the MD get a bit of help with aligned
> writes, or perhaps at that point the array had been resynced,
> who knows...
The results I've submitted now have all been taken whilst the array was
healthy.
>> Are you enabling emulate_write_cache=1 with your iblock
>> backends..? This can have a gigantic effect on initiator
>> performance for both MSFT + Linux SCSI clients.
> That sounds interesting, but also potentially rather dangerous,
> unless there is a very reliable implementation of IO barriers.
> Just like with enabling write caches on real disks...
>
>> [ ... ] check your [ ... ]/queue/max*sectors_kb for the MD
>> RAID to make sure the WRITEs are striped aligned to get best
>> performance with software MD raid.
> That does not quite ensure that the writes are stripe aligned,
> but perhaps a larger stripe cache would help.
Does this help?
root@datavault:~# parted /dev/md4
GNU Parted 2.3
Using /dev/md4
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) u b
(parted) pr
Model: Linux Software RAID Array (md)
Disk /dev/md4: 12002393063424B
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number  Start            End              Size             File system  Name           Flags
 1      1966080B         10115507159039B  10115505192960B               ReplayStorage
 2      10115507159040B  12002393046527B  1886885887488B                VDR-Storage
(parted) sel /dev/md4p1
Using /dev/md4p1
(parted) pr
Model: Unknown (unknown)
Disk /dev/md4p1: 10115505192960B
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number  Start       End              Size             File system  Name                          Flags
 1      17408B      134235135B       134217728B                    Microsoft reserved partition  msftres
 2      135266304B  10115504668671B  10115369402368B  ntfs         Basic data partition
(parted) quit
1966080/(1024*64*6)=5 (not rounded)
135266304/(1024*64*6)=344 (not rounded)
If my calculations are correct it should thus be not only chunk aligned
but even stripe aligned. I did pay a lot of attention to this during
setup. It's not my daily thing tho', so I do hope I did it correctly:
1024 to get from B to kiB, 64 kiB per chunk, 6 data chunks in a 7 disk
RAID-5 set (or well, originally an 8 disk RAID-6, but that shouldn't differ).
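The same check as shell arithmetic (64 kiB chunk * 6 data chunks =
393216 bytes per stripe; offsets taken from the parted output above):

  echo $(( 1966080 % (64*1024*6) ))                # 0 -> md4p1 starts on a stripe boundary
  echo $(( 135266304 % (64*1024*6) ))              # 0 -> the NTFS data partition does too
  echo $(( (1966080 + 135266304) % (64*1024*6) ))  # 0 -> and so does its absolute offset on md4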
NTFS is formatted with 64kiB block/cluster size. I've just verified this
again, in 3 ways :).
>
>> Please use FILEIO with this reporting emulate_write_cache=1
>> (WCE=1) to the SCSI clients. Note that by default in the last
>> kernel releases we've change FILEIO backends to only always
>> use O_SYNC to ensure data consistency during a hard power
>> failure, regardless of the emulate_write_cache=1 setting.
> Ahh interesting too. That's also the right choice unless there
> is IO barrier support at all levels.
This is too low level for me currently. I'll have to look it up. I also
take from this that *emulating* write cache != write cache :). I've only
consciously set the buffered mode, but as stated the targetcli utility,
at least the version that comes with Ubuntu 12.04, doesn't show this is
set. Then again, I'm not running in fileio mode either, and the
functionality has been disabled in 3.5 if I understood correctly.
>> Also note that by default it's my understanding that IETD uses
>> buffered FILEIO for performance, so in your particular type of
>> setup you'd still see better performance with buffered FILEIO,
>> but would still have the potential risk of silent data
>> corruption with buffered FILEIO.
> Not silent data corruption, but data loss. Silent data
> corruption is usually meant for the case where an IO completes
> and reports success, but the data recorded is not the data
> submitted.
Ok, then we have the same concepts. The loss might cause corruption
obviously, but I've never seen it happen silently :).
>
>> [ ... ] understand the possible data integrity risks
>> associated with using buffered FILEIO during a hard power
>> failure, I'm fine with re-adding this back into
>> target_core_file for v3.7 code for people who really know what
>> they are doing.
> That "people who really know what they are doing" is generally a
> bit optimistic :-).
I like to be free to choose. I might not always choose the smart thing -
but at least it's been my choice, not some spoon-fed thing :). Others
like to be nurtured tho'. If you're that concerned about the safety of
users (or rather admins - I've never seen a regular user set up a RAID +
iSCSI target), I'd take the middle ground - just throw a big fat red
warning. Targetcli already uses fancy colors :). If people choose to
ignore that, it's *most definitely* their responsibility (not that it's
anyone else's otherwise; the license clearly states no warranty
whatsoever). There are other ways to make things safe tho', and
sometimes speed is more important than integrity. There are probably
still other reasons people might want to enable it.
>
> Do the various modes support IO barriers? That usually is what
> is critical, at least for the better informed people.
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-19 14:19 ` freaky
@ 2012-09-19 17:20 ` Chris Murphy
0 siblings, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2012-09-19 17:20 UTC (permalink / raw)
To: Linux RAID; +Cc: target-devel
On Sep 19, 2012, at 8:19 AM, freaky wrote:
>
> Sector size (logical/physical): 512B/4096B
Are all of the underlying drives 512e AF disks? At least one of the embedded GPTs (I'm uncertain how three layers of partitioning can be anything but complicated) defines an unaligned partition, but it's a tiny reserved partition. Everything else appears to be aligned to 4K sectors, but the two start sectors you provided are different, and it's also confusing why they don't start at 2048s, which is the parted default.
On all of the physical devices are all of the partitions 4K aligned?
> Disk /dev/md4
> 1 1966080B
That's a start sector of 3840.
>
> Disk /dev/md4p1
> 1 17408B
That's a start sector of 34, and not 4K divisible.
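For reference, the conversions (assuming 512-byte logical sectors, as
reported above):

  echo $(( 1966080 / 512 ))   # 3840, a multiple of 8, so 4KiB aligned
  echo $(( 17408 / 512 ))     # 34; 34*512 = 17408 is not a multiple of 4096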
Chris Murphy
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-18 19:49 ` Nicholas A. Bellinger
2012-09-18 21:18 ` Peter Grandi
@ 2012-09-19 6:44 ` Arne Redlich
2012-09-19 14:27 ` Ferry
1 sibling, 1 reply; 13+ messages in thread
From: Arne Redlich @ 2012-09-19 6:44 UTC (permalink / raw)
To: Nicholas A. Bellinger; +Cc: Ferry, linux-raid, target-devel
[Resending without HTML]
2012/9/18 Nicholas A. Bellinger <nab@linux-iscsi.org>:
> Also note that by default it's my understanding that IETD uses buffered
> FILEIO for performance, so in your particular type of setup you'd still
> see better performance with buffered FILEIO, but would still have the
> potential risk of silent data corruption with buffered FILEIO.
Nicholas,
IET's fileio defaults to writethrough caching (by issuing a sync after
writing and before returning a response to the client). Writeback
behaviour as employed by the OP needs to be switched on explicitly.
Also, the failure scenario for writeback caching you're referring to
is neither silent data corruption (as pointed out by Peter already)
nor silent data loss, as the WCE bit makes it pretty clear to the
client side that data is not guaranteed to be on persistent storage
unless explicitly flushed.
Arne
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-19 6:44 ` Arne Redlich
@ 2012-09-19 14:27 ` Ferry
0 siblings, 0 replies; 13+ messages in thread
From: Ferry @ 2012-09-19 14:27 UTC (permalink / raw)
Cc: linux-raid, target-devel
On 19-09-12 08:44, Arne Redlich wrote:
> [Resending without HTML]
>
> 2012/9/18 Nicholas A. Bellinger <nab@linux-iscsi.org>:
>
>> Also note that by default it's my understanding that IETD uses buffered
>> FILEIO for performance, so in your particular type of setup you'd still
>> see better performance with buffered FILEIO, but would still have the
>> potential risk of silent data corruption with buffered FILEIO.
> Nicholas,
>
> IETs fileio defaults to writethrough caching (by issueing a sync after
> writing and before returning a response to the client). Writeback
> behaviour as employed by the OP needs to be switched on explicitly.
>
> Also, the failure scenario for writeback caching you're referring to
> is neither silent data corruption (as pointed out by Peter already)
> nor silent data loss, as the WCE bit makes it pretty clear to the
> client side that data is not guaranteed to be on persistent storage
> unless explicitly flushed.
>
> Arne
What's the definition of explicitly flushed here? For example, vmware
does nothing but sync I/O. Just start an NFS server on non-cached
storage and run a VM off it, then force the export to async and it'll be
lots and lots faster. I know from test labs that setting IET to fileio
with writeback is notably (much) faster than fileio with writethrough
(on non-cached storage of course). Is this the same as explicitly
flushing? Because then it wouldn't make sense to me that it's faster
with writeback caching, since all I/O is sync from vmware (or more
accurately ESX(i)).
That also wouldn't explain why the buffers fill up, as seen with the
free tool for example :).
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-18 14:37 Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI) Ferry
2012-09-18 16:48 ` Chris Murphy
2012-09-18 19:49 ` Nicholas A. Bellinger
@ 2012-09-18 20:06 ` Peter Grandi
2012-09-19 13:08 ` freaky
2 siblings, 1 reply; 13+ messages in thread
From: Peter Grandi @ 2012-09-18 20:06 UTC (permalink / raw)
To: Linux RAID
[ ... ]
> I noticed in iostat something I personally find very weird.
> All the disks in the RAID set (minus the spare) seem to read
> 6-7 times as much as they write. Since there is no other I/O
> (so there aren't really any reads issued besides some very
> occasional overhead for NTFS perhaps once in a while) I find
> this really weird. Note also that iostat doesn't show the
> reads in iostat on the md device (which is the case if the
> initiator issues reads) but only on the active disks in the
> RAID set, which to me (unknowingly as I am :)) indicates mdadm
> in the kernel is issuing those reads. [ ... ]
It is not at all weird. The performance of MD ('mdadm' is just
the user level tool to configure it) is pretty good in this case
even if the speed is pretty low. MD is working as expected when
read-modify-write (or some kind of resync or degraded operation)
is occurring. BTW I like your use of the term "RAID set" because
that's what I use myself (because "RAID array" is redundant
:->).
Apparently awareness of the effects of RMW (or resyncing or
degraded operation) is sort of (euphemism) unspecial RAID
knowledge, but only the very elite of sysadmins seem to be aware
of it :-). A recent similar enquiry was the (euphemism) strange
concern about dire speed by someone who had (euphemism) bravely
set up RAID6 running deliberately in degraded mode.
My usual refrain is: if you don't know better, never use parity
RAID, only use RAID1 or RAID10 (if you want redundancy).
But while the performance of MD you report is good, the speed is
bad even for a mere RMW/resync/degraded issue, so this detail
matters:
> Do note - I'm running somewhat unorthodox. I've created a
> RAID-5 of 7 disks + hotspare
One could (euphemism) wonder how well a 6x stripe/stripelet size
is going to play with 4KiB aligned NTFS operations...
> (it was originally a RAID-6 w/o hotspare but converted it to
> RAID-5 in hopes of improving performance).
A rather (euphemism) audacious operation, especially because of
the expectation that reshaping a RAID set leaves the content in
an optimal stripe layout. I am guessing that you reshaped rather
than recreated because you did not want to dump/reload the
content, rather (euphemism) optimistically.
There are likely to be other (euphemism) peculiarities in your
setup, probably to do with network flow control, but the above
seems enough...
Sometimes it is difficult for me to find sufficiently mild yet
suggestive euphemisms to describe some of the stuff that gets
reported here. This is one of those cases.
Unless you are absolutely sure you know better:
* Never grow or reshape a RAID set or a filetree.
* Just use RAID1 or RAID10 (or a 3 member RAID5 in some cases
where writes are rare).
* Don't partition the member or array devices, or use GPT for
both if you must.
If you are absolutely sure you know better then you will not
need to ask for help here :-).
> This disk is about 12TB. It's partitioned with GPT in ~9TB
At least you used GPT partitioning, which is commendable, even
if you regret it below...
> and ~2.5TB (there's huge rounding differences at these sizes
> 1000 vs 1024et al :)).
It is very nearly 5%/7% depending which way.
> With msdos partitions I could easily mess with it myself. [
> ... ]
MSDOS style labels are fraught with subtle problems that require
careful handling.
[ ... ]
* Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
2012-09-18 20:06 ` Peter Grandi
@ 2012-09-19 13:08 ` freaky
0 siblings, 0 replies; 13+ messages in thread
From: freaky @ 2012-09-19 13:08 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux RAID
On 18-09-12 22:06, Peter Grandi wrote:
> [ ... ]
>
>> I noticed in iostat something I personally find very weird.
>> All the disks in the RAID set (minus the spare) seem to read
>> 6-7 times as much as they write. Since there is no other I/O
>> (so there aren't really any reads issued besides some very
>> occasional overhead for NTFS perhaps once in a while) I find
>> this really weird. Note also that iostat doesn't show the
>> reads in iostat on the md device (which is the case if the
>> initiator issues reads) but only on the active disks in the
>> RAID set, which to me (unknowingly as I am :)) indicates mdadm
>> in the kernel is issuing those reads. [ ... ]
> It is not at all weird. The performance of MD ('mdadm' is just
> the user level tool to configure it) is pretty good in this case
> even if the speed is pretty low. MD is working as expected when
> read-modify-write (or some kind of resync or degraded operation)
> is occurring. BTW I like your use of the term "RAID set" because
> that's what I use myself (because "RAID array" is redundant
> :->).
Yes, with read-modify-write this is to be expected. However, I'm copying
a large file (thus largely sequential, especially since there was no
other I/O), and it is buffered, in which case most of the writes should
be entire stripes.
I wouldn't have mentioned it if IETD didn't perform this much better.
Since I'm not aware of kernel internals, I figured it might have
something to do with the way the buffers are committed. As in: with IETD
it recognizes it's a large contiguous chunk of data and consolidates it
into a single stripe write, whereas with LIO it writes 64k chunks
synchronously, causing it to write, say, chunk 1 out of 6, read chunks
2-6, calculate parity and write chunk 1 plus the parity block, and thus
needs a lot more operations to write the same stripe. It is also why,
for example, vmware, which does nothing but sync I/O, has extremely
lousy performance on non-caching RAID controllers. By far the easiest
way to show people, in my experience.
Anyway, to me it looks like that's what's going on - but I didn't want
to jump to conclusions w/o in-depth kernel knowledge. As far as I know
there aren't any separate buffers though, so LIO and IETD would be using
the same buffers / infrastructure, in which case my assumption will be
embarrassingly wrong.
Concerning the RAID set terminology, I didn't even realise that.
> Apparently awareness of the effects of RMW )or resyncing or
> degraded operation) is sort of (euphemism) unspecial RAID
> knowledge, but only the very elite of sysadms seem to be aware
> of it :-). A recent similar enquiry was the (euphemism) strange
> concern about dire speed by someone who had (euphemism) bravely
> setup RAID6 running deliberately in degraded mode.
Even less seem to be aware of the write-hole :D
> My usual refrain is: if you don't know better, never use parity
> RAID, only use RAID1 or RAID10 (if you want redundancy).
>
> But while the performance of MD you report is good, the speed is
> bad even for a mere RMW/resync/degraded issue, so this detail
> matters:
>
>> Do note - I'm running somewhat unorthodox. I've created a
>> RAID-5 of 7 disks + hotspare
> One could (euphemism) wonder how well a 6x stripe/stripelet size
> is going to play with 4KiB aligned NTFS operations...
It's formatted with a 64kiB blocksize (or cluster size in NTFS
terminology). This is also the RAID's chunk size.
>> (it was originally a RAID-6 w/o hotspare but converted it to
>> RAID-5 in hopes of improving performance).
> A rather (euphemism) audacious operation, especially because of
> the expectation that reshaping a RAID set leaves the content in
> an optimal stripe layout. I am guessing that you reshaped rather
> than recreated because you did not want to dump/reload the
> content, rather (euphemism) optimistically.
Correct. Also, with the exception of 3 machines, it actually backs
itself up: other jobs replicate to it. If it really were to go wrong I'd
have to drive by a customer or 6 to seed-load the other ~50 servers on
it again. Given the risk (never had issues with reshaping), we took it.
Losing the data is thus not life threatening, just very very annoying /
time consuming. So I was either guaranteed to lose a lot of time moving
the data (plus an investment in something to store it on), or take the
risk of losing a bit more time. It turned out well :). I did use a log
file for the reshaping btw, so it could survive reboots (would have had
to manually restart/bring it up).
> There are likely to be other (euphemism) peculiarities in your
> setup, probably to do with network flow control, but the above
> seems enough...
>
> Sometimes it is difficult for me to find sufficiently mild yet
> suggestive euphemisms to describe some of the stuff that gets
> reported here. This is one of those cases.
>
> Unless you are absolutely sure you know better:
>
> * Never grow or reshape a RAID set or a filetree.
> * Just use RAID1 or RAID10 (or a 3 member RAID5 in some cases
> where writes are rare).
> * Don't partition the member or array devices or use GPT for
> both if you must.
Using msdos partitioning isn't really an option (way larger than 2TiB)
:). I did look seriously at the offsets being a multiple of 64 so it
doesn't start somewhere in the middle of a chunk. Parted also reports
the alignment is proper (I do seriously despise these kinds of tools
that work with block sizes yet use decimal instead of binary k/M/G's
etc).
> If you are absolutely sure you know better then you will not
> need to ask for help here :-).
Do note I'm specifically asking about the interaction with LIO :). I
don't have the benchmarks any more, but several local tests (on the
machine itself, w/o the iSCSI layer in between) showed very acceptable
numbers. I just need some decent performance (~50MB/s sequential and in
the order of 10MB/s random - the local benches were way faster than
that) and a lot of storage in this case, as it's just to store back-ups
(image based - very large files) and their incrementals. On a daily
basis some incrementals are merged, but a lot of the I/O should be
sequential and large (often causing entire stripe writes when
buffered/cached - or well, they should, and it seems to do this just
fine with IET, but not with LIO).
I'll never put anything (random) I/O intensive on anything but RAID-10
:). Unless there are some really fancy new developments (I haven't dug
into ZFS and the likes deeply enough yet).
>> This disk is about 12TB. It's partitioned with GPT in ~9TB
> At least you used GPT partitioning, which is commendable, even
> if you regret it below...
Yea, not that I had a choice :P.
>> and ~2.5TB (there's huge rounding differences at these sizes
>> 1000 vs 1024et al :)).
> It is very nearly 5%/7% depending which way.
>
>> With msdos partitions I could easily mess with it myself. [
>> ... ]
> MSDOS style labels are fraught with subtle problem that require
> careful handling.
But they're very easy to backup/restore with dd :).
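(For completeness, the dd variant for msdos, and the GPT equivalent
mentioned earlier in the thread - device names illustrative:)

  dd if=/dev/sdX of=mbr.bak bs=512 count=1   # MBR including the msdos table
  sgdisk --backup=gpt.bak /dev/md4           # GPT: both headers plus the table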
Thanks for the re' :).