linux-raid.vger.kernel.org archive mirror
* raid1 read balance with write mostly
@ 2013-02-02 17:38 tomas.hodek
  2013-02-02 21:52 ` Stan Hoeppner
  0 siblings, 1 reply; 5+ messages in thread
From: tomas.hodek @ 2013-02-02 17:38 UTC (permalink / raw)
  To: linux-raid; +Cc: tomas.hodek

[-- Attachment #1: Type: text/plain, Size: 2113 bytes --]

Hi

I have started to test md RAID1 with one SSD and one HDD device on a 3.7.1 kernel (which has TRIM/discard support for RAID1). The array has the write-behind option enabled, and the HDD device has the write-mostly option set.
The original idea of the write-mostly option was: "Read requests will only be sent if there is no other option."

My first simple test workload was building the latest stable kernel (3.7.1) using 16 threads.
But I saw some reads from the HDD regardless of the write workload, and I also saw read await times of more than 1000 ms while the SSD's await was about 1 ms. (I only used iostat -x.)

I wanted to know why, so I searched the source code and found the read_balance() function in raid1.c.

If I read and understand this code correctly, it does the following:

If a device has the "write mostly" flag and we still have not selected a device for reading, the code selects this device directly (unless is_badblock() reports a bad block starting before the requested sector, in which case the device is skipped). This direct selection may be a mistake, because it can only be overridden in special cases: when another candidate device (one without the write-mostly flag) is idle, or when the request is part of a sequential read. The normal path of read_balance() searches for the nearest and/or least-loaded device, but that result is used only when no device has been selected directly (including via the write-mostly code path).
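
To make the effect concrete, here is a small userspace model of this selection logic as I understand it - a simplified sketch only (it ignores bad-block handling, sequential-read detection, and the rotational/non-rotational distinction, which I fold into one fallback), not the kernel code itself. The write-mostly device wins whenever it is visited while best_disk is still -1 and no override case fires:

#include <stdio.h>
#include <stdbool.h>
#include <limits.h>

struct mirror {
	const char *name;
	bool write_mostly;
	int pending;	/* outstanding I/Os, like rdev->nr_pending */
	long dist;	/* |this_sector - head_position| */
};

/* Model of read_balance() on 3.7: returns index of the disk to read. */
static int read_balance_model(const struct mirror *m, int n)
{
	int best_disk = -1, best_dist_disk = -1, best_pending_disk = -1;
	long best_dist = LONG_MAX;
	int min_pending = INT_MAX;

	for (int disk = 0; disk < n; disk++) {
		if (m[disk].write_mostly) {
			/* 3.7 behaviour: direct selection, no balancing */
			if (best_disk < 0)
				best_disk = disk;
			continue;
		}
		/* the only cases that override a direct selection */
		if (m[disk].dist == 0 || m[disk].pending == 0) {
			best_disk = disk;
			break;
		}
		if (m[disk].pending < min_pending) {
			min_pending = m[disk].pending;
			best_pending_disk = disk;
		}
		if (m[disk].dist < best_dist) {
			best_dist = m[disk].dist;
			best_dist_disk = disk;
		}
	}
	/* simplified fallback, used only when nothing was selected
	 * directly; the real code prefers best_pending_disk only when a
	 * non-rotational disk is present */
	if (best_disk == -1)
		best_disk = best_pending_disk >= 0 ? best_pending_disk
						   : best_dist_disk;
	return best_disk;
}

int main(void)
{
	/* write-mostly HDD listed first; SSD busy but much faster */
	struct mirror m[] = {
		{ "hdd", true,  0, 100 },
		{ "ssd", false, 4,  50 },
	};
	printf("read goes to: %s\n", m[read_balance_model(m, 2)].name);
	return 0;
}

This prints "read goes to: hdd": the SSD never gets a chance because best_disk was already set directly. With the change I propose below, the write-mostly branch would only set best_dist_disk and best_pending_disk, so the fallback at the end would pick the SSD through normal balancing.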


I think the code sequence

best_disk = disk;
continue;

in the main for loop is not the best way, and that setting

best_pending_disk = disk;
best_dist_disk = disk;

is better, because it still gives the loop a chance to find a better alternative. In other words: turn the direct selection into the worst-ranked possible candidate.
But I am not sure this holds in all cases.


I made two versions of a small patch that change the direct selection so the write-mostly device is only recorded as the most distant and most pending candidate. The "safe" version is robust against future code changes; the "now" version is the minimal change for the current code (up to 3.7.5).

The patch works well for me. I can mark the SSD as failed, remove it from the array, and re-add it under workload without any trouble or extra kernel log messages.

I attach my patches to this email.

Best regards
Tomas Hodek 


[-- Attachment #2: raid1-balance-write-mostly_safe.patch --]
[-- Type: text/x-patch, Size: 871 bytes --]

diff -ur linux-3.7.1-old/drivers/md/raid1.c linux-3.7.1-new/drivers/md/raid1.c
--- linux-3.7.1-old/drivers/md/raid1.c	2012-12-17 20:14:54.000000000 +0100
+++ linux-3.7.1-new/drivers/md/raid1.c	2013-01-09 20:57:47.924610501 +0100
@@ -548,7 +548,7 @@
 		if (test_bit(WriteMostly, &rdev->flags)) {
 			/* Don't balance among write-mostly, just
 			 * use the first as a last resort */
-			if (best_disk < 0) {
+			if (best_dist_disk < 0 || best_pending_disk < 0) {
 				if (is_badblock(rdev, this_sector, sectors,
 						&first_bad, &bad_sectors)) {
 					if (first_bad < this_sector)
@@ -557,7 +557,10 @@
 					best_good_sectors = first_bad - this_sector;
 				} else
 					best_good_sectors = sectors;
-				best_disk = disk;
+				if (best_dist_disk < 0)
+					best_dist_disk = disk;
+				if (best_pending_disk < 0)
+					best_pending_disk = disk;
 			}
 			continue;
 		}

[-- Attachment #3: raid1-balance-write-mostly_now.patch --]
[-- Type: text/x-patch, Size: 782 bytes --]

diff -ur linux-3.7.1-old/drivers/md/raid1.c linux-3.7.1-new/drivers/md/raid1.c
--- linux-3.7.1-old/drivers/md/raid1.c	2012-12-17 20:14:54.000000000 +0100
+++ linux-3.7.1-new/drivers/md/raid1.c	2013-01-09 20:57:47.924610501 +0100
@@ -548,7 +548,7 @@
 		if (test_bit(WriteMostly, &rdev->flags)) {
 			/* Don't balance among write-mostly, just
 			 * use the first as a last resort */
-			if (best_disk < 0) {
+			if (best_dist_disk < 0) {
 				if (is_badblock(rdev, this_sector, sectors,
 						&first_bad, &bad_sectors)) {
 					if (first_bad < this_sector)
@@ -557,7 +557,8 @@
 					best_good_sectors = first_bad - this_sector;
 				} else
 					best_good_sectors = sectors;
-				best_disk = disk;
+				best_dist_disk = disk;
+				best_pending_disk = disk;
 			}
 			continue;
 		}


* Re: raid1 read balance with write mostly
  2013-02-02 17:38 raid1 read balance with write mostly tomas.hodek
@ 2013-02-02 21:52 ` Stan Hoeppner
  2013-02-02 22:07   ` Roman Mamedov
  0 siblings, 1 reply; 5+ messages in thread
From: Stan Hoeppner @ 2013-02-02 21:52 UTC (permalink / raw)
  To: tomas.hodek; +Cc: linux-raid

On 2/2/2013 11:38 AM, tomas.hodek@volny.cz wrote:
> I have started to test md RAID1 with one SSD and one HDD device on a
> 3.7.1 kernel (which has TRIM/discard support for RAID1). [...]
> In other words: turn the direct selection into the worst-ranked
> possible candidate.
> But I am not sure this holds in all cases.

Did you test with a RAID1 of two mechanical drives?  I can envision a
scenario of, say, a 300GB WD Raptor 10K mirrored to a 300GB partition on a
3TB 5K 'green' drive.  The contents being the root filesystem, mirrored
strictly for safety.  This is probably the same scenario you have in
mind.  But in this case the IOPS performance difference is only 2:1,
whereas with the SSD it's more than 50:1.  So under heavy read load, in
this case we'd probably want the slow 3TB drive to contribute to the
workload.  With your patch, will it still do so?

-- 
Stan



* Re: raid1 read balance with write mostly
  2013-02-02 21:52 ` Stan Hoeppner
@ 2013-02-02 22:07   ` Roman Mamedov
  2013-02-02 22:16     ` Tommy Apel Hansen
  0 siblings, 1 reply; 5+ messages in thread
From: Roman Mamedov @ 2013-02-02 22:07 UTC (permalink / raw)
  To: stan; +Cc: tomas.hodek, linux-raid

[-- Attachment #1: Type: text/plain, Size: 863 bytes --]

On Sat, 02 Feb 2013 15:52:12 -0600
Stan Hoeppner <stan@hardwarefreak.com> wrote:

> So under heavy read load, in this case we'd probably want the slow 3TB drive
> to contribute to the workload.

No, it should behave as documented.

       -W, --write-mostly
              subsequent devices listed in a --build, --create, or --add
              command will be flagged as 'write-mostly'.  This is valid
              for RAID1 only and means that the 'md' driver will avoid
              reading from these devices if at all possible.

if at all possible = if there is any other mirror still alive, do not read from
this one.

BTW I also run an SSD+HDD array, and after brief testing today I did
notice some reads from the HDD showing up in iostat while reading from
the array. So I can confirm the bug.

-- 
With respect,
Roman


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: raid1 read balance with write mostly
  2013-02-02 22:07   ` Roman Mamedov
@ 2013-02-02 22:16     ` Tommy Apel Hansen
  2013-02-03  9:02       ` tomas.hodek
  0 siblings, 1 reply; 5+ messages in thread
From: Tommy Apel Hansen @ 2013-02-02 22:16 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: stan, tomas.hodek, linux-raid

On Sun, 2013-02-03 at 04:07 +0600, Roman Mamedov wrote:
> No, it should behave as documented.
>
>        -W, --write-mostly
>               [...] the 'md' driver will avoid reading from
>               these devices if at all possible.
                              ^^^^^^^^^^^^^^^^^^
I would assume that if the kernel thinks the SSD is starved for I/O
(roughly what the %util column in iostat reflects), it would consider
sending I/O to the HDD. At least that is what I would expect to happen;
so if the SSD reports wrong utilization numbers back, or anything else
involved in the I/O processing does, then the behavior is as expected.

/Tommy



* Re: raid1 read balance with write mostly
  2013-02-02 22:16     ` Tommy Apel Hansen
@ 2013-02-03  9:02       ` tomas.hodek
  0 siblings, 0 replies; 5+ messages in thread
From: tomas.hodek @ 2013-02-03  9:02 UTC (permalink / raw)
  To: linux-raid

Yes, the main question is what write-mostly means. According to

http://marc.info/?l=linux-raid&m=112374499705545&w=4

it means: "Read requests will only be sent if there is no other option."

Maybe we want to change this behavior, but direct selection is the wrong way to do it. With direct selection it does not matter whether the other drives have a light or a heavy workload; the only thing that can override it is a drive being completely idle.

If we want an uneven distribution of load, some adjustable advantage/disadvantage (a weighting) is a better solution than direct selection.
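
As a sketch of what I mean by an adjustable disadvantage (a hypothetical illustration only, not one of my attached patches, and the numbers are made up): let the write-mostly device compete in normal balancing, but with a configurable penalty added to its score, so it is chosen only when the other mirrors are much worse off.

#include <stdio.h>
#include <stdbool.h>
#include <limits.h>

struct mirror {
	const char *name;
	bool write_mostly;
	int pending;	/* outstanding I/Os */
	long dist;	/* seek distance from last head position */
};

/* Tunables - real values would need measurement; these are made up. */
#define WM_DIST_PENALTY		1024	/* extra "virtual" distance */
#define WM_PENDING_PENALTY	8	/* extra "virtual" pending I/Os */
#define PENDING_WEIGHT		64	/* cost of one pending I/O */

static int pick_read_disk(const struct mirror *m, int n)
{
	int best = -1;
	long best_score = LONG_MAX;

	for (int i = 0; i < n; i++) {
		/* crude combined score: distance plus weighted pending I/O */
		long score = m[i].dist + (long)m[i].pending * PENDING_WEIGHT;

		if (m[i].write_mostly)
			score += WM_DIST_PENALTY
				 + (long)WM_PENDING_PENALTY * PENDING_WEIGHT;
		if (score < best_score) {
			best_score = score;
			best = i;
		}
	}
	return best;
}

int main(void)
{
	struct mirror m[] = {
		{ "hdd", true,  0,  10 },	/* idle, but write-mostly */
		{ "ssd", false, 3, 500 },	/* busy and farther away */
	};
	/* The ssd still wins here; the hdd takes over only when the
	 * ssd's load grows past the penalty, not merely when the ssd
	 * is non-idle. */
	printf("read goes to: %s\n", m[pick_read_disk(m, 2)].name);
	return 0;
}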

Yes, my SSD+HDD situation is the extreme case, but my experience with old 15k SCSI HDDs is similar: in sequential reads/writes they are slow compared to a modern 'green' HDD, but with a random workload it all inverts. Direct selection of the slowest drive may stall the system for some time while the fastest drive sits idle.

Hodek



