Performance question

* Performance question
@ 2002-05-05 14:20 Philipp Gühring
  2002-05-05 15:07 ` Oleg Drokin
  0 siblings, 1 reply; 35+ messages in thread
From: Philipp Gühring @ 2002-05-05 14:20 UTC (permalink / raw)
  To: reiserfs-list

Hi,

Let's say I have a directory with 100,000 files in it.
The filenames look like

name1_name2_name3_id

So I have

001_41052_50125_1
001_63216_1212_1
...

I have to build a search engine that serves, for example, the 4th block of 10
files matching the query "001_*_1212_1". The whole query would return about 100
files, which are spread across the directory.

Now my question:

Is it faster with ReiserFS to do a bsd_glob("001_*_1212_1") first, which
should return about 100 entries, and then take entries 40 to 49 from
the resulting array?
(Is ReiserFS able to directly return 100 files out of 100000 with the
globbing function, or does it iterate over all files in the directory?)

Or should I do 2 opendir-readdir loops, one to skip over the first 40
results (0 to 39), which I do not need, and a second one to get results 40 to 49?
The problem here is that I have to readdir about 50000 files (40000 to get
through the unneeded results, and 10000 to get the 10 results I need).
But on the other hand, I do not have to remember 100 files, of which I only
need 10.

If ReiserFS has to iterate over 100000 files (the whole directory) to do a 
"001_*_1212_1" glob, because the binary tree only speeds up known files, but 
not patterns, then opendir-readdir should be faster, I guess.

Another option would be to use subdirectories like
name1/name2/name3/id

So the glob would be "001/*/1212/1", which should be faster, anyway.
But on the other hand, I would have to do a lot more directory management, 
creating and deleting directories ...
And implementing an opendir-readdir search through "001/*/1212/1" will be 
more work too.

Thanks for all feedback in advance and many greetings,
-- 
~ Philipp Gühring              p.guehring@futureware.at
~ http://www.livingxml.net/       ICQ UIN: 6588261
~ <xsl:value-of select="file:/home/philipp/.sig"/>

* Re: Performance question
  2002-05-05 14:20 Performance question Philipp Gühring
@ 2002-05-05 15:07 ` Oleg Drokin
  2002-05-05 16:43   ` Philipp Gühring
  2002-05-06 11:06   ` Hans Reiser
  0 siblings, 2 replies; 35+ messages in thread
From: Oleg Drokin @ 2002-05-05 15:07 UTC (permalink / raw)
  To: Philipp Gühring; +Cc: reiserfs-list

Hello!

On Sun, May 05, 2002 at 04:20:13PM +0200, Philipp Gühring wrote:

> Let's say I have a directory with 100,000 files in it.
> The filenames look like
> name1_name2_name3_id
> So I have
> 001_41052_50125_1
> 001_63216_1212_1
> I have to build a search engine that serves, for example, the 4th block of 10
> files matching the query "001_*_1212_1". The whole query would return about 100
> files, which are spread across the directory.
> Now my question:
> Is it faster with ReiserFS to do a bsd_glob("001_*_1212_1") first, which
> should return about 100 entries, and then take entries 40 to 49 from
> the resulting array?
> (Is ReiserFS able to directly return 100 files out of 100000 with the
> globbing function, or does it iterate over all files in the directory?)

*glob functions are implemented by various library functions, that do full
readdir scans at least once, I believe.

> Or should I do 2 opendir-readdir loops, one to skip over the first 40
> results (0 to 39), which I do not need, and a second one to get results 40 to 49?

In fact I do not see why you need 2 opendir-readdir loops.
One loop should be enough.
You just compare each filename returned against your query, and if it matches,
remember it in a separate list. So at the end of the readdir loop you have a list
of all names in the directory that match your query. And you can apply any
additional check in place so as not to remember unnecessary files.
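
A minimal sketch of that single-loop approach, written in Python because the
thread does not name an implementation language. The function name
page_of_matches and its parameters are hypothetical; os.scandir plays the role
of opendir/readdir and fnmatch stands in for the glob comparison. It assumes
the directory is not modified between page requests, since readdir order is
not sorted and is only stable on an unchanged directory.

```python
import fnmatch
import os


def page_of_matches(directory, pattern, first, count):
    """Return matches number first .. first+count-1 (0-based) in readdir order.

    One pass over the directory: matches before the requested page are
    counted but not stored, and the scan stops once the page is full.
    """
    page = []
    seen = 0  # number of matching names encountered so far
    with os.scandir(directory) as entries:
        for entry in entries:
            if not fnmatch.fnmatchcase(entry.name, pattern):
                continue
            if seen >= first:
                page.append(entry.name)
                if len(page) == count:
                    break
            seen += 1
    return page
```

For the example in this thread the call would be something like
page_of_matches(".", "001_*_1212_1", 40, 10). Only the 10 requested names are
kept in memory, which addresses the concern about remembering all 100 matches.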

> The problem here is that I have to readdir about 50000 files (40000 to get
> through the unneeded results, and 10000 to get the 10 results I need).
> But on the other hand, I do not have to remember 100 files, of which I only
> need 10.

I completely fail to see where these numbers come from. Can you
explain in more detail?

> If ReiserFS has to iterate over 100000 files (the whole directory) to do a 
> "001_*_1212_1" glob, because the binary tree only speeds up known files, but 
> not patterns, then opendir-readdir should be faster, I guess.

The binary tree only helps when you know the filename, I believe. You calculate
a hash, and from that hash you can quickly find the desired location.
If you come up with a hash that places all filenames like yours near one
another, this will help, then.

> Another option would be to use subdirectories like
> name1/name2/name3/id
> So the glob would be "001/*/1212/1", which should be faster, anyway.
> But on the other hand, I would have to do a lot more directory management, 
> creating and deleting directories ...
> And implementing an opendir-readdir search through "001/*/1212/1" will be 
> more work too.

Readdir would require fewer iterations through 001/*, because the number of
entries will be only 100, as you described above.
You get all these 100 entries and then loop 100 times, trying to open
001/${next_name}/1212/1 and deciding whether you need this file or not.
(If it exists, of course; otherwise you get -ENOENT and proceed to the next
directory.)
Also, deleting directories would be overkill.
I think this might be faster in many circumstances.
Also, what you've described looks very much like what squid does. And the squid
people went to the reiserfs-raw interface and are quite happy with it.
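
The subdirectory probe described above could look roughly like this in Python.
This is only a sketch under the name1/name2/name3/id layout proposed earlier in
the thread; probe_subdirs and its parameter names are made up for illustration.

```python
import os


def probe_subdirs(root, name1, name3, file_id):
    """Yield paths root/name1/<name2>/name3/file_id that actually exist.

    Only the name1 directory is listed; for each name2 subdirectory a single
    fixed path is probed instead of scanning every file.
    """
    top = os.path.join(root, name1)
    with os.scandir(top) as entries:
        for entry in entries:
            if not entry.is_dir():
                continue
            candidate = os.path.join(entry.path, name3, file_id)
            try:
                os.stat(candidate)
            except (FileNotFoundError, NotADirectoryError):
                continue  # ENOENT / ENOTDIR: nothing under this name2
            yield candidate
```

A call such as list(probe_subdirs("/data", "001", "1212", "1")) corresponds to
the glob "001/*/1212/1"; the "/data" root is a placeholder.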

Bye,
    Oleg


* Re: Performance question
  2002-05-05 15:07 ` Oleg Drokin
@ 2002-05-05 16:43   ` Philipp Gühring
  2002-05-06 13:01     ` Oleg Drokin
  2002-05-06 11:06   ` Hans Reiser
  1 sibling, 1 reply; 35+ messages in thread
From: Philipp Gühring @ 2002-05-05 16:43 UTC (permalink / raw)
  To: Oleg Drokin, reiserfs-list

Hello!

Thank you Oleg for your answers.

> *glob functions are implemented by various library functions, that do full
> readdir scans at least once, I believe.

I thought I heard about a syscall, that makes it possible to pass the glob to 
the filesystem, so that the filesystem can optimize globbings as it likes, 
and pass the result back to the application, but ok.

> > Or should I do 2 opendir-readdir loops, one to skip over the first 40
> > results (0 to 39), which I do not need, and a second one to get results
> > 40 to 49?
>
> In fact I do not see why you need 2 opendir-readdir loops.
> One loop should be enough.

Yeah. Sure. My mistake. One opendir, and 2 readdir loops. The first one skips
over unneeded results and the second one serves the data.

> You just compare each filename returned against your query, and if it
> matches, remember it in a separate list. So at the end of the readdir loop
> you have a list of all names in the directory that match your query. And you
> can apply any additional check in place so as not to remember unnecessary
> files.
>
> > The problem here is that I have to readdir about 50000 files (40000 to
> > get through the unneeded results, and 10000 to get the 10 results I need).
> > But on the other hand, I do not have to remember 100 files, of which I
> > only need 10.
>
> I completely fail to see where these numbers come from. Can you
> explain in more detail?

I will try.
I have a directory with 100000 files. A complete search would return, for
example, 100 files, which are spread across the whole directory.
About one file in every thousand matches the query.
Since the client does not want to get 100 files at once, I first return
only 10 results for the first page, and the user can navigate page by page.

So I built up the scenario where the user now wants to see results 40-49
from the query "001_*_1212_1",
which I assume to be normal behaviour for my application.
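
The 40000 + 10000 estimate can be reproduced with a few lines of Python. This
is only a back-of-the-envelope sketch that assumes the matches are spread
evenly through the directory; entries_to_scan is a hypothetical helper name.

```python
def entries_to_scan(total_files, total_matches, first, count):
    """Expected readdir calls to skip `first` matches and then collect `count`."""
    files_per_match = total_files / total_matches  # one match per this many files
    skipped = first * files_per_match   # entries read before the page starts
    served = count * files_per_match    # entries read while filling the page
    return skipped, served, skipped + served


skipped, served, total = entries_to_scan(100_000, 100, first=40, count=10)
print(skipped, served, total)
```

With 100000 files and 100 matches this gives 40000 entries skipped, 10000
entries read for the page, and 50000 in total.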

> The binary tree only helps when you know the filename, I believe.

Ok.

> Readdir would require fewer iterations through 001/*, because the number of
> entries will be only 100, as you described above.
> You get all these 100 entries and then loop 100 times, trying to open
> 001/${next_name}/1212/1 and deciding whether you need this file or not.
> (If it exists, of course; otherwise you get -ENOENT and proceed to the next
> directory.)
> Also, deleting directories would be overkill.

So the question is, how big that overkill is.
Is there perhaps a benchmark that tested it already?

> I think this might be faster in many circumstances.
> Also, what you've described looks very much like what squid does. And the
> squid people went to the reiserfs-raw interface and are quite happy with it.

I think the difference from squid is that they only need one result, not one
page of a search with more than one result.
But I am thinking about using reiserfs-raw too ...
(At the moment, flexibility still has higher priority for me than raw
performance.)

Many greetings,
-- 
~ Philipp Gühring              p.guehring@futureware.at
~ http://www.livingxml.net/       ICQ UIN: 6588261
~ <xsl:value-of select="file:/home/philipp/.sig"/>

* Re: Performance question
  2002-05-05 15:07 ` Oleg Drokin
  2002-05-05 16:43   ` Philipp Gühring
@ 2002-05-06 11:06   ` Hans Reiser
  1 sibling, 0 replies; 35+ messages in thread
From: Hans Reiser @ 2002-05-06 11:06 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Philipp Gühring, reiserfs-list

glob is implemented by the shell not the filesystem.  This is not for 
good reason, it just is.  We could write something for you to do it in 
the filesystem and it would be faster.  Is your need for speed critical 
enough to justify writing something special for it?

Hans

* Re: Performance question
  2002-05-05 16:43   ` Philipp Gühring
@ 2002-05-06 13:01     ` Oleg Drokin
  0 siblings, 0 replies; 35+ messages in thread
From: Oleg Drokin @ 2002-05-06 13:01 UTC (permalink / raw)
  To: Philipp Gühring; +Cc: reiserfs-list

Hello!

On Sun, May 05, 2002 at 06:43:45PM +0200, Philipp Gühring wrote:

> > *glob functions are implemented by various library functions, that do full
> > readdir scans at least once, I believe.
> I thought I heard about a syscall, that makes it possible to pass the glob to 
> the filesystem, so that the filesystem can optimize globbings as it likes, 
> and pass the result back to the application, but ok.

I do not think anything like that exists in Linux. But if you
can come up with a man page from section 2...

> > > Or should I do 2 opendir-readdir loops, one to skip over the first 40
> > > results (0 to 39), which I do not need, and a second one to get results
> > > 40 to 49?
> > In fact I do not see why you need 2 opendir-readdir loops.
> > One loop should be enough.
> Yeah. Sure. My mistake. One opendir, and 2 readdir loops. The first one skips
> over unneeded results and the second one serves the data.

No. Still I think you need only one loop anyway, like this:
<pseudocode>
dir = opendir(name);
/* single pass over the directory: remember only the names that match */
while ((result = readdir(dir)) != NULL) {
	if ( check_filename_criteria(result->d_name) ) {
		add_to_list_of_files_to_process(result->d_name);
	}
}
closedir(dir);
/* this loop only serves the data, it does not touch the directory */
for i in list_of_files_to_process {
	process_file(i);
}
</pseudocode>

So there is only one readdir loop; the second loop does not count because it
serves the actual data.

> > > The problem here is that I have to readdir about 50000 files (40000 to
> > > get through the unneeded results, and 10000 to get the 10 results I need).
> > > But on the other hand, I do not have to remember 100 files, of which I
> > > only need 10.
> > I completely fail to see where these numbers come from. Can you
> > explain in more detail?
> I will try.
> I have a directory with 100000 files. A complete search would return, for
> example, 100 files, which are spread across the whole directory.
> About one file in every thousand matches the query.
> Since the client does not want to get 100 files at once, I first return
> only 10 results for the first page, and the user can navigate page by page.
> So I built up the scenario where the user now wants to see results 40-49
> from the query "001_*_1212_1",
> which I assume to be normal behaviour for my application.

Ah, I see what you mean. If you have a lot of resources, you can set up a session
and store all the search results for that session on the server side.
So when the second request comes in, you just read the search results from the
session. Also, you kill the session after 5 minutes or so of inactivity.
Hm... This requires cookies to be enabled, though. ;)
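
A rough sketch of such a server-side session store, in Python. The class name
SearchSessionCache, the in-memory dict and the injectable clock are all
assumptions made for illustration; how the session id reaches the server
(cookie or otherwise) is left out.

```python
import time


class SearchSessionCache:
    """Keep the full result list per session; drop it after a period of inactivity."""

    def __init__(self, idle_timeout=300.0, clock=time.monotonic):
        self.idle_timeout = idle_timeout  # seconds; 300 s = the 5 minutes above
        self.clock = clock
        self._sessions = {}  # session_id -> (last_access, results)

    def store(self, session_id, results):
        self._sessions[session_id] = (self.clock(), list(results))

    def page(self, session_id, first, count):
        """Return results[first:first+count], or None if unknown or expired."""
        self.expire()
        item = self._sessions.get(session_id)
        if item is None:
            return None
        _, results = item
        self._sessions[session_id] = (self.clock(), results)  # refresh idle timer
        return results[first:first + count]

    def expire(self):
        """Forget every session that has been idle for idle_timeout or longer."""
        now = self.clock()
        stale = [sid for sid, (last, _) in self._sessions.items()
                 if now - last >= self.idle_timeout]
        for sid in stale:
            del self._sessions[sid]
```

The first request would run the directory scan once and call store(); later
page requests call page(session_id, 40, 10) and fall back to a fresh scan when
it returns None.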

> > Readdir would require fewer iterations through 001/*, because the number of
> > entries will be only 100, as you described above.
> > You get all these 100 entries and then loop 100 times, trying to open
> > 001/${next_name}/1212/1 and deciding whether you need this file or not.
> > (If it exists, of course; otherwise you get -ENOENT and proceed to the next
> > directory.)
> > Also, deleting directories would be overkill.
> So the question is, how big that overkill is.

I mean that you do not need to delete directories, when they are empty.
You only need to create the directory structure once.

> Is there perhaps a benchmark that tested it already?

No, I do not think so, but feel free to compose and run your own benchmark.

> > I think this might be faster in many circumstances.
> > Also, what you've described looks very much like what squid does. And the
> > squid people went to the reiserfs-raw interface and are quite happy with it.
> I think the difference from squid is that they only need one result, not one
> page of a search with more than one result.

Hm. This is true.

Bye,
    Oleg


* performance question
@ 2003-03-31 21:37 jp
  2003-04-01  5:40 ` Trond Myklebust
  0 siblings, 1 reply; 35+ messages in thread
From: jp @ 2003-03-31 21:37 UTC (permalink / raw)
  To: nfs

I have looked through the last couple of months of mailing list archives and
reviewed the material at nfs.sourceforge.net and the link to NetApp's NFS
suggestions.

I am trying to get really good performance out of NFS. So far the best I've
got is about 1/10 of the local speed with dedicated 100 Mbps Ethernet
between fairly speedy computers.

Here's the setup.

Server (coffeepot) - Athlon XP2000, Suse 8.1, 2.4.21-pre6 kernel from
kernel.org, boots to an ata100 drive, promise rm8000 external hardware
raid5 array on an Adaptec AHA-2940U/UW/D controller, 3com 3c905C
forced to 100-FD with "/sbin/mii-tool -F 100baseTx-FD eth1".

coffeepot:~ # mount |grep sda
/dev/sda1 on /shared/home type ext2 (rw,noatime)
/dev/sda2 on /shared/backup type ext2 (rw,noatime)
/dev/sda3 on /shared/logs type ext2 (rw,noatime)

coffeepot:~ # cat /etc/exports
/shared/home    10.0.34.0/24(rw,no_root_squash,async)
/shared/backup/ 10.0.34.0/24(ro,root_squash,async)
/shared/logs    10.0.34.0/24(rw,root_squash,async)

coffeepot:~ # bonnie++ -d /shared/home/jp -s 1600 -r 512 -u jp
Version 1.01d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
coffeepot     1600M 40206  42 39934  13  9989   3 19782  22 21765   5 317.2   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2675  99 +++++ +++ +++++ +++  2759  99 +++++ +++  4360 100

Performance is fairly kickin' here locally.

The client is connected through an HP4000M switch set for full-duplex 100baseT,
with both ports on the same switch linecard.

http://midcoast.com/~jp/10.0.15.2_15-day.png is the network traffic between the
two computers showing two bonnie++ tests on the right of the graph. There is 
no packet loss between the computers when tested with flood pings or regular pings.

Client info (froth) - Athlon XP2200, Suse 8.1, 2.4.21-pre6 kernel from
kernel.org, boots to an ata-100 drive. 3com 3c905C forced to 100-FD with
"/sbin/mii-tool -F 100baseTx-FD eth1".

froth:~ # cat /etc/mtab
10.0.34.1:/shared/backup /shared/backup nfs rw,tcp,hard,intr,rsize=1024,wsize=1024,addr=10.0.34.1 0 0
10.0.34.1:/shared/logs /shared/logs nfs rw,tcp,hard,intr,rsize=1024,wsize=1024,addr=10.0.34.1 0 0
10.0.34.1:/shared/home /shared/home nfs rw,udp,hard,intr,rsize=1400,wsize=1400,addr=10.0.34.1 0 0

same bonnie++ command:
Version 1.01d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
froth         1600M  2724   5  2764   4  1395   3  2778   5  2848   3  33.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1175   3  5061  12  2840  10  1208   5  5723  11  1684   4
froth,1600M,2724,5,2764,4,1395,3,2778,5,2848,3,33.5,0,16,1175,3,5061,12,2840,10,1208,5,5723,11,1684,4

I get about 2700 K/sec and seeks go from 317 to 33/sec. The transfer speed
matches the network traffic graph. I would like to do better than 2700ish.
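
To put the "about 1/10" figure in numbers, the ratios can be computed directly
from the two bonnie++ tables above. The snippet below is just that arithmetic
in Python; the dictionary keys are labels chosen here, not bonnie++ names.

```python
# bonnie++ figures copied from the two tables above (K/sec, and seeks per second)
local = {"write_block": 39934, "rewrite": 9989, "read_block": 21765, "seeks": 317.2}
nfs = {"write_block": 2764, "rewrite": 1395, "read_block": 2848, "seeks": 33.5}


def nfs_fraction(local_figures, nfs_figures):
    """NFS result as a fraction of the local result, per test."""
    return {name: nfs_figures[name] / local_figures[name] for name in local_figures}


fractions = nfs_fraction(local, nfs)
for name, frac in fractions.items():
    print(f"{name:12s} {frac:6.1%} of local")
```

Block writes come out at roughly 7% of local speed, block reads and rewrites at
13 to 14%, and seeks at about 11%.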

What is possible for me to improve without moving to Gig-Ethernet?

I've tried both TCP and UDP NFS, with rsize & wsize of 1024, 1400, 4096 and 8192.
The larger two have horrid performance due to packet fragmentation. Like orders
of magnitude worse. 1024, 1400, UDP and TCP all have similar performance for me.
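
For NFS over UDP, each read or write of rsize/wsize bytes travels as one UDP
datagram, which IP has to split once it exceeds the MTU. A small Python sketch
of the fragment count follows, assuming a standard 1500-byte Ethernet MTU and
20-byte IPv4 header and ignoring the RPC/NFS headers, so the results are lower
bounds. The function names and the 1% loss rate in the usage note are
hypothetical.

```python
import math


def min_ip_fragments(nfs_block, mtu=1500, ip_header=20, udp_header=8):
    """Lower bound on IP fragments needed for one UDP datagram of nfs_block bytes."""
    # every fragment except the last must carry a multiple of 8 data bytes
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((nfs_block + udp_header) / per_fragment)


def datagram_loss(fragments, packet_loss):
    """Chance that a datagram is lost when any one of its fragments is lost."""
    return 1.0 - (1.0 - packet_loss) ** fragments


for size in (1024, 1400, 4096, 8192):
    print(size, min_ip_fragments(size))
```

Under these assumptions 1024 and 1400 fit in a single packet, 4096 needs at
least 3 fragments and 8192 at least 6. If one fragment is dropped, the whole
datagram has to be resent: datagram_loss(6, 0.01) is about 5.9% for a
hypothetical 1% per-packet loss.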

Also, is it possible to clear the counters in nfsstat?

MUCH TIA,
Jason

-- 
/*
Jason Philbrook   |   Midcoast Internet Solutions - Internet Access,
    KB1IOJ        |  Hosting, and TCP-IP Networks for Midcoast Maine
 http://f64.nu/   |             http://www.midcoast.com/
*/

-------------------------------------------------------
This SF.net email is sponsored by: ValueWeb: 
Dedicated Hosting for just $79/mo with 500 GB of bandwidth! 
No other company gives more support or power for your dedicated server
http://click.atdmt.com/AFF/go/sdnxxaff00300020aff/direct/01/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs


* RE: performance question
@ 2003-03-31 21:45 Lever, Charles
  0 siblings, 0 replies; 35+ messages in thread
From: Lever, Charles @ 2003-03-31 21:45 UTC (permalink / raw)
  To: jp; +Cc: nfs

hi jp-

> What is possible for me to improve without moving to Gig-Ethernet?
>
> I've tried both TCP and UDP NFS, with rsize & wsize of
> 1024, 1400, 4096 and 8192. The larger
> two have horrid performance due to packet fragmentation. Like
> orders of magnitude worse.
> 1024, 1400, UDP and TCP all have similar performance for me.

this sounds like a network issue.  you should use a network
performance tool (like iPerf) to measure performance between
your client and server, and try to rectify any problems you
find there, before you work on NFS performance.

> Also, is it possible to clear the counters in nfsstat?

only via a client reboot.



* Re: performance question
  2003-03-31 21:37 jp
@ 2003-04-01  5:40 ` Trond Myklebust
  0 siblings, 0 replies; 35+ messages in thread
From: Trond Myklebust @ 2003-04-01  5:40 UTC (permalink / raw)
  To: jp; +Cc: nfs

>>>>> " " == jp  <jp@pour.midcoast.com> writes:

     > Server (coffeepot) - Athlon XP2000, Suse 8.1, 2.4.21-pre6
     > kernel from kernel.org, boots to an ata100 drive, promise
     > rm8000 external hardware raid5 array on an Adaptec
     > AHA-2940U/UW/D controller, 3com 3c905C forced to 100-FD with
     > "/sbin/mii-tool -F 100baseTx-FD eth1".

Why do you have to force it to 100-FD?

Cheers,
  Trond



* Re: performance question
       [not found] <1049188686.19334.20.camel@deskpro02>
@ 2003-04-01 15:39 ` jp
  2003-04-01 16:06   ` Philippe Gramoullé
  2003-04-01 18:45   ` Bogdan Costescu
  0 siblings, 2 replies; 35+ messages in thread
From: jp @ 2003-04-01 15:39 UTC (permalink / raw)
  To: Kåre Hviid; +Cc: nfs

Thanks to the several people who responded!

> > Server (coffeepot) - Athlon XP2000, Suse 8.1, 2.4.21-pre6 kernel from
> > kernel.org, boots to an ata100 drive, promise rm8000 external hardware
> > raid5 array on an Adaptec AHA-2940U/UW/D controller, 3com 3c905C
> > forced to 100-FD with "/sbin/mii-tool -F 100baseTx-FD eth1".
> 
> Fast question: Are you sure the _switch_ is setup to do
> 100FD as well?  In my experience, forcing FD on newer
> cards and switches is something that must be done
> carefully.  Also, what about link flow control?  I'm not
> sure the 3c905c can be forced to do flow control by
> simple means if your switch happens to support it.  Try
> the same using N-Way auto-negotiation and check what the
> 3c905c thinks about it.

Flow control on the switch is disabled - the default, I checked. It's also 
set for 100-FD, like my ethernet cards. I always hard-set ethernet 
settings because I don't trust autonegotiation under all circumstances.

I installed iperf on both machines and there is not a problem sending 
large amounts of data between machines.

coffeepot:~ # /usr/local/bin/iperf -s -u     

froth:/tmp/iperf-1.7.0 # /usr/local/bin/iperf -c 10.0.34.1 -b 100m
WARNING: option -b implies udp testing
------------------------------------------------------------
Client connecting to 10.0.34.1, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  5] local 10.0.34.2 port 32876 connected with 10.0.34.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec   114 MBytes  95.6 Mbits/sec
[  5] Server Report:
[  5]  0.0-10.0 sec   114 MBytes  95.6 Mbits/sec  0.246 ms    0/81337 (0%)
[  5] Sent 81337 datagrams

froth:/proc # /usr/local/bin/iperf -c 10.0.34.1 -b 90m
WARNING: option -b implies udp testing
------------------------------------------------------------
Client connecting to 10.0.34.1, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 64.0 KByte (default)
------------------------------------------------------------
[  5] local 10.0.34.2 port 32876 connected with 10.0.34.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec   108 MBytes  90.5 Mbits/sec
[  5] Server Report:
[  5]  0.0-10.0 sec   108 MBytes  90.5 Mbits/sec  0.000 ms    0/76925 (0%)
[  5] Sent 76925 datagrams
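
As a cross-check, the iperf bandwidth figures follow from the datagram counts,
and they can be set against the NFS block-read rate from the earlier bonnie++
run. The Python below is only that arithmetic; it assumes bonnie++'s "K" means
1024 bytes and takes the 10-second interval from the report at face value.

```python
def iperf_mbits(datagrams, datagram_bytes=1470, seconds=10.0):
    """Payload throughput in Mbit/s (10^6 bits) for a UDP iperf run."""
    return datagrams * datagram_bytes * 8 / seconds / 1e6


def kps_to_mbits(kps, bytes_per_k=1024):
    """Convert a bonnie++ K/sec figure to Mbit/s."""
    return kps * bytes_per_k * 8 / 1e6


run_100m = iperf_mbits(81337)   # the "-b 100m" run
run_90m = iperf_mbits(76925)    # the "-b 90m" run
nfs_read = kps_to_mbits(2848)   # bonnie++ block read over NFS

print(f"{run_100m:.2f} {run_90m:.2f} {nfs_read:.2f} {run_100m / nfs_read:.1f}")
```

The two runs work out to about 95.65 and 90.46 Mbit/s, in line with the
reported 95.6 and 90.5, while the NFS block read corresponds to roughly
23.3 Mbit/s, so the raw link carries about four times what NFS achieved.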


> 
> Cheers,
> -- 
> Kåre Hviid   Sys Admin     ukh@id.cbs.dk    +45 3815 3075
> Institut for Datalingvistik, Handelshøjskolen i København
> 


-- 
/*
Jason Philbrook   |   Midcoast Internet Solutions - Internet Access,
    KB1IOJ        |  Hosting, and TCP-IP Networks for Midcoast Maine
 http://f64.nu/   |             http://www.midcoast.com/
*/



* Re: performance question
  2003-04-01 15:39 ` jp
@ 2003-04-01 16:06   ` Philippe Gramoullé
  2003-04-01 16:22     ` Matt Heaton
  2003-04-01 18:45   ` Bogdan Costescu
  1 sibling, 1 reply; 35+ messages in thread
From: Philippe Gramoullé @ 2003-04-01 16:06 UTC (permalink / raw)
  To: nfs

Hi,

Unless you're using an old exotic Cisco switch, I don't think you should do
this, IMHO.

We had the worst problems doing that, and since we started using autoneg (with
Intel EEpro100 cards) we have not had a single problem.

Thanks,

Philippe

--

Philippe Gramoullé
philippe.gramoulle@mmania.com
Lycos Europe - NOC France




On Tue, 1 Apr 2003 10:39:50 -0500 (EST)
jp@pour.midcoast.com wrote:

  | I always hard-set ethernet
  |  settings because I don't trust autonegotiation under all circumstances.



* Re: performance question
  2003-04-01 16:06   ` Philippe Gramoullé
@ 2003-04-01 16:22     ` Matt Heaton
  2003-04-01 17:08       ` Philippe Gramoullé
  0 siblings, 1 reply; 35+ messages in thread
From: Matt Heaton @ 2003-04-01 16:22 UTC (permalink / raw)
  To: Philippe Gramoullé, nfs

I just have to respond to this.  I must respectfully disagree.
Autonegotiation is tolerable at best.
With certain equipment it works flawlessly, but MANY brands autonegotiate
correct speeds and duplex, but still exhibit 2-3% packet loss or intermittent
latency (high ping times etc).  A perfect example is my Cisco 2940 Catalyst
switch and my Alteon/Nortel 180e (layer 2-7 switch).  Both switches are high
quality and work well, but if you link them up with autonegotiation you will
have problems.  It will detect proper speeds and duplex, but has speed
problems and packet loss.  When I contacted BOTH Cisco and Nortel support,
they both said autonegotiation is bad news and should be used only to get
things up and going.  Cisco said that if all the products were Cisco there
would be no problem, and Nortel said the same about its own products.  Just
my 2 cents worth, but I have seen this problem on more than 5 devices on my
own network alone.

L8r...

Matt

----- Original Message -----
From: "Philippe Gramoullé" <philippe.gramoulle@mmania.com>
To: <nfs@lists.sourceforge.net>
Sent: Tuesday, April 01, 2003 9:06 AM
Subject: Re: [NFS] performance question

Hi,

Unless you're using an old exotic Cisco switch, i don't think you should do
this, IMHO.

We've had the worst problems doing that and since we use autoneg ( with
intel EEpro100 card)
we never had a single problem ever since.

Thanks,

Philippe

--

Philippe Gramoullé
philippe.gramoulle@mmania.com
Lycos Europe - NOC France

On Tue, 1 Apr 2003 10:39:50 -0500 (EST)
jp@pour.midcoast.com wrote:

  | I always hard-set ethernet
  |  settings because I don't trust autonegotiation under all circumstances.

-------------------------------------------------------
This SF.net email is sponsored by: ValueWeb:
Dedicated Hosting for just $79/mo with 500 GB of bandwidth!
No other company gives more support or power for your dedicated server
http://click.atdmt.com/AFF/go/sdnxxaff00300020aff/direct/01/
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: performance question
  2003-04-01 16:22     ` Matt Heaton
@ 2003-04-01 17:08       ` Philippe Gramoullé
  0 siblings, 0 replies; 35+ messages in thread
From: Philippe Gramoullé @ 2003-04-01 17:08 UTC (permalink / raw)
  To: Matt Heaton; +Cc: nfs

Hi,

Ok, I should have been more precise :)

My recommendations were only for NIC <-> switch.

In case of switch <-> switch then you could indeed force things without
problems.

I was referring to some Linux NFS servers here having big trouble talking to
a switch (sorry, I don't remember the brand) on which settings were forced.

Thanks,

Philippe

--

Philippe Gramoullé
philippe.gramoulle@mmania.com
Lycos Europe - NOC France

On Tue, 1 Apr 2003 09:22:13 -0700
"Matt Heaton" <admin@0catch.com> wrote:

  | I just have to respond to this.  I must respectfully disagree.
  | Autonegotiation is tolerable at best.
  | With certain equipment it works flawlessly, but MANY brands autonegotiate
  | correct speeds and duplex, but still exhibit 2-3% packet loss or intermittent
  | latency (high ping times etc).  A perfect example is my Cisco 2940 Catalyst
  | switch and my Alteon/Nortel 180e (layer 2-7 switch).  Both switches are high
  | quality and work well, but if you link them up with autonegotiation you will
  | have problems.  It will detect proper speeds and duplex, but has speed
  | problems and packet loss.  When contacting BOTH Cisco and Nortel support
  | they both said autonegotiation is bad news and should be used only to get
  | things up and going.  Cisco said if all the products were Cisco then no
  | problem, and Nortel said the same thing.  Just my 2 cents worth, but I
  | have seen this problem on more than 5 devices on my own network alone.
  |
  | L8r...
  |
  | Matt


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: performance question
  2003-04-01 15:39 ` jp
  2003-04-01 16:06   ` Philippe Gramoullé
@ 2003-04-01 18:45   ` Bogdan Costescu
  1 sibling, 0 replies; 35+ messages in thread
From: Bogdan Costescu @ 2003-04-01 18:45 UTC (permalink / raw)
  To: jp; +Cc: Kåre Hviid, nfs

On Tue, 1 Apr 2003 jp@pour.midcoast.com wrote:

> Flow control on the switch is disabled - the default, I checked. It's also 
> set for 100-FD, like my ethernet cards. I always hard-set ethernet 
> settings because I don't trust autonegotiation under all circumstances.

People that don't want to be helped should not ask for help any more!
I've already warned about forcing speed, check the net driver mailing 
lists and scyld.com to see why and also for some words from Donald Becker 
about why the forcing of speed and duplex ever came into discussion.

> WARNING: option -b implies udp testing

Oh yes, you want to test network quality with UDP... Have you ever thought 
that NFS needs communication both ways? If you think that your network 
with forced full-duplex is perfect, try two UDP streams in opposite 
directions - you should not lose a single packet and should still achieve 
high bandwidth; and if you want to stress it even more, try UDP packets 
that do not fit in an Ethernet frame.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De


^ permalink raw reply	[flat|nested] 35+ messages in thread

* performance question
@ 2005-09-12 19:06 Moritz Gartenmeister
  0 siblings, 0 replies; 35+ messages in thread
From: Moritz Gartenmeister @ 2005-09-12 19:06 UTC (permalink / raw)
  To: netfilter lists

hi

i'm just wondering, if my experienced performance in my network is usual.

setup:

debian linux
kernel 2.6.8.1 (patched with pom, especially l7-filter and ipp2p)
linux-bridge

everything is working so far (that's the good part).

but i measure different downloadrates:
on my machine (behind the bridge) ~70Kbyte/s
on the bridge ~200Kbyte/s

the linux-bridge has to forward ~500 clients and has to shape 
transparently the traffic.

is this difference in downloadrates normal?

my assumption so far:
i have 4 interfaces on the linux bridge.
eth1 and eth2 form the bridge, so they are heavily used.
eth0 is rarely used, so this may be an explanation.

even if i stop iptables, there is no increase.

i would just appreciate it if someone could confirm this as usual behavior.

greets
moritz

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Performance question
@ 2008-02-14 15:40 Font Bella
       [not found] ` <90d010000802140740y3ff2706ybc169728fbafbfb4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Font Bella @ 2008-02-14 15:40 UTC (permalink / raw)
  To: linux-nfs

Hi,

some of our apps are experiencing slow nfs performance in our new cluster, in
comparison with the old one. The nfs setups for both clusters are very
similar, and we are wondering what's going on. The details of both setups are
given below for reference.

The problem seems to occur with apps that do heavy i/o, creating, writing,
reading, and deleting many files. However, writing or reading a large file
(as measured with `time dd if=/dev/zero of=2gbfile bs=1024 count=2000`) is not
slow.
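
A side check on the dd invocation above, offered as a sketch rather than as part of the original measurements: dd writes bs times count bytes, so bs=1024 count=2000 produces roughly 2 MB rather than 2 GB, which is small enough to be absorbed by caches. The helper name dd_bytes is made up for this illustration.

```python
def dd_bytes(bs, count):
    """Total bytes written by `dd bs=<bs> count=<count>`, assuming full blocks."""
    return bs * count

total = dd_bytes(bs=1024, count=2000)
print(total, "bytes")
print(round(total / 1024**2, 2), "MiB")

# Block count that a 2 GiB file would need at bs=1024.
# Illustrative only; the post does not say such a run was made.
blocks_for_2gib = (2 * 1024**3) // 1024
print(blocks_for_2gib, "blocks of 1024 bytes")
```

If the large-file case is meant to exercise the disks and the network, a size well above the RAM of client and server is probably the safer comparison.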

We have performed some tests with the disk benchmark 'dbench', which reports
i/o performance dropping from 60 Mb/sec in the old cluster to about 6 Mb/sec
in the new one.

After noticing this problem, we tried the user-mode nfs server instead of the
kernel-mode server, and just installing the user-mode server improved
throughput to 12 Mb/sec, but that is still far from the good old 60 Mb/sec.

After going through the "Optimizing NFS performance" section of the
NFS-Howto and tweaking the rsize,wsize parameters (the optimum seems to be
2048, which seems kind of weird to me, especially compared to the 8192 used in
the old cluster), throughput increased to 21 Mb/sec, but is still too far
from the old 60 Mb/sec.

We are stuck at this point. Any help/comment/suggestion will be greatly
appreciated.
/P

**************************** OLD CLUSTER *****************************

SATA disks.

Filesystem: ext3.

* the version of nfs-utils you are using: I don't know. It's the most
  recent version in debian sarge (oldstable).

user-mode nfs server.

nfs version 2, as reported with rpcinfo.

* the version of the kernel and any non-stock applied kernels: 2.6.12
* the distribution of linux you are using: Debian sarge i386 on Intel Xeon
  processors.
* the version(s) of other operating systems involved: no other OS.

It is also useful to know the networking configuration connecting the hosts:
Typical beowulf setup, with all servers connected to a switch, 1Gb network.

/etc/exports:

/srv/homes      192.168.1.0/255.255.255.0 (rw,no_root_squash)

/etc/fstab:

server:/srv/homes/user /mnt/user nfs rw,hard,intr,rsize=8192,wsize=8192 0 0

**************************** NEW CLUSTER *****************************

SAS 10k disks.

Filesystem: ext3 over LVM.

* the version of nfs-utils you are using: I don't know. It's the most
  recent version in debian etch (stable).

kernel-mode nfs server.

nfs version 2, as reported with rpcinfo.

* the version of the kernel and any non-stock applied kernels: 2.6.18-5-amd64
* the distribution of linux you are using: Debian etch AMD64 on Intel Xeon
  processors.
* the version(s) of other operating systems involved: no other OS.

It is also useful to know the networking configuration connecting the hosts:
Typical beowulf setup, with all servers connected to a switch, 1Gb network.

/etc/exports:

/srv/homes      192.168.1.0/255.255.255.0 (no_root_squash)

mount options:

rsize=8192,wsize=8192

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
       [not found] ` <90d010000802140740y3ff2706ybc169728fbafbfb4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-02-14 16:27   ` Marcelo Leal
       [not found]     ` <42996ba90802140827p533779c6o8ab404400be51fdc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Marcelo Leal @ 2008-02-14 16:27 UTC (permalink / raw)
  To: Font Bella; +Cc: linux-nfs

 Hello all,
There is a big difference between accessing the raw disks and going
through LVM, some kind of RAID, etc. I think you should use NFS v3, and
it is hard to believe that v2 is in use unless you explicitly
configured it.
A big difference between v2 and v3 is that v2 is always "async", which
is a performance boost. Are you sure the new environment is not using v3?
In the new stable version (nfs-utils), Debian is sync by default. I'm
used to "8192" transfer sizes, which gave the best performance in my
tests.
 It would be nice if you could test another network service writing to
that server, like ftp or iscsi.
 Another question: are the disks "local" or on a SAN? Is there no concurrency?

ps.: v2 has a 2GB file size limit AFAIK.

 Leal.


-- 
pOSix rules

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
       [not found]     ` <42996ba90802140827p533779c6o8ab404400be51fdc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-02-14 16:56       ` Chuck Lever
  2008-02-15 15:37         ` Font Bella
  0 siblings, 1 reply; 35+ messages in thread
From: Chuck Lever @ 2008-02-14 16:56 UTC (permalink / raw)
  To: Font Bella; +Cc: NFS list, Marcelo Leal

On Feb 14, 2008, at 11:27 AM, Marcelo Leal wrote:
>  Hello all,
> There is a great diff between access the raw discs and through LVM,
> with some kind of RAID, and etc. I think you should use NFS v3, and
> it's hard to think that without you explicitally configure it to use
> v2, it using...
> A great diff between v2 and v3 is that v2 is always "async", what is a
> performance burst. Are you sure that in the new environment is not v3?
> In the new stable version (nfs-utils), debian is sync by default. I'm
> used to "8192" transfer sizes, and was the best perfomance in my
> tests.

As Marcelo suggested, this could be nothing more than the change in
default export options (see exports(5) -- the description of the
sync/async option) between sarge and etch.  This was a change in the
nfs-utils package done a while back to improve data integrity guarantees
during server instability.

You can test this easily by explicitly specifying sync or async in
your /etc/exports and trying your test.

It especially affects NFSv2, as all NFSv2 writes are FILE_SYNC (i.e.
they must be committed to permanent storage before the server
replies) -- the async export option breaks that guarantee to improve
performance.  There is some further description in the NFS FAQ at
http://nfs.sourceforge.net/ .

The preferred way to get "async" write performance is to use NFSv3.
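
The cost of committing every write versus committing once at the end can be illustrated locally. This is only an analogy under stated assumptions: os.fsync on a local file stands in for the server committing data to stable storage, the first function mimics FILE_SYNC-style writes, and the second mimics NFSv3 unstable writes followed by a single COMMIT. The function names and chunk sizes are invented for the sketch, and no particular timings are claimed.

```python
import os
import tempfile
import time

def write_sync_each(path, chunks):
    """Commit every chunk before continuing (FILE_SYNC-like behaviour)."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
    return time.perf_counter() - start

def write_commit_once(path, chunks):
    """Buffer all chunks, then commit once (unstable writes + COMMIT-like)."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start

# 64 writes of 8 KiB each; the sizes are chosen arbitrarily for the demo.
chunks = [bytes(8192)] * 64

with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "sync_each.bin")
    path_b = os.path.join(tmp, "commit_once.bin")
    t_each = write_sync_each(path_a, chunks)
    t_once = write_commit_once(path_b, chunks)
    size_a = os.path.getsize(path_a)
    size_b = os.path.getsize(path_b)

print(f"fsync per write: {t_each:.4f}s, single fsync: {t_once:.4f}s")
```

Both variants end with all data on stable storage; the difference is only how often the writer has to wait for it, which is the guarantee the async export option gives up entirely.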


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2008-02-14 16:56       ` Chuck Lever
@ 2008-02-15 15:37         ` Font Bella
       [not found]           ` <90d010000802150737x2ad0739dmeaaa24dc2845e81a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Font Bella @ 2008-02-15 15:37 UTC (permalink / raw)
  To: Chuck Lever; +Cc: NFS list, Marcelo Leal

Dear all,

I finally got it to work, after much pain/testing. Here are my config
notes (just for the record).
Thanks Marcelo and Chuck!

NFS setup
=========

Documentation
-------------

* http://billharlan.com/pub/papers/NFS_for_clusters.html
* http://nfs.sourceforge.net/nfs-howto/ar01s05.html#nfsd_daemon_instances

Setting
-------

We use package nfs-kernel-server, i.e. we use the kernel-space nfs server,
which is faster than nfs-user-server.

We use NFS version 3.

Configuration
-------------

Make sure we are using nfs version 3. This seems to be the default with
package nfs-kernel-server. Check from client side with::

        cat /proc/mounts

Use UDP for packet transmission, i.e. use option 'proto=udp' in your
/etc/fstab, /etc/auto.home (if using automounts), or in general, in any mount
command. Check from client side also with 'cat /proc/mounts'.
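
The two /proc/mounts checks above can be scripted. The following is a minimal sketch: the function name nfs_mounts and the sample text are hypothetical, and the sample uses the vers= and proto= spellings, whereas some kernels may print these options differently (for example as bare flags), so treat the key names as assumptions.

```python
def nfs_mounts(text):
    """Yield (mountpoint, options) for NFS entries in /proc/mounts-style text."""
    for line in text.splitlines():
        fields = line.split()
        # Expected fields: device, mountpoint, fstype, options, dump, pass
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        opts = {}
        for item in fields[3].split(","):
            key, _, value = item.partition("=")
            opts[key] = value if value else True   # bare flags become True
        yield fields[1], opts

# Hypothetical sample; on a real client read the text from /proc/mounts.
SAMPLE = (
    "/dev/sda1 / ext3 rw 0 0\n"
    "server:/srv/homes/user /mnt/user nfs "
    "rw,vers=3,rsize=8192,wsize=8192,hard,intr,proto=udp 0 0\n"
)

for mountpoint, opts in nfs_mounts(SAMPLE):
    print(mountpoint, opts.get("vers"), opts.get("proto"),
          opts.get("rsize"), opts.get("wsize"))
```

On a live system the same function could be fed open("/proc/mounts").read() instead of SAMPLE.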

Make sure you have enough nfsd server threads. See if your server is receiving
too many overlapping requests with

  $ grep th /proc/net/rpc/nfsd

Ours isn't, so we increase the number of threads used by the server to
32 by changing
RPCNFSDCOUNT=32 in /etc/default/nfs-kernel-server (Debian configuration file
for startup scripts). Remember to restart nfs-kernel-server for changes to
take effect.
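
The 'th' line can also be read programmatically. This sketch assumes the layout described in the NFS-Howto for kernels of that era: the thread count, then the number of times all threads were busy at once, then a histogram of busy time. The sample values below are invented, and other kernel versions may report these fields differently or as zeros.

```python
def parse_th_line(line):
    """Split an nfsd 'th' line into (threads, all_busy_count, histogram)."""
    fields = line.split()
    if not fields or fields[0] != "th":
        raise ValueError("not a 'th' line")
    threads = int(fields[1])
    all_busy = int(fields[2])
    histogram = [float(x) for x in fields[3:]]
    return threads, all_busy, histogram

# Hypothetical sample; on a real server take the line from /proc/net/rpc/nfsd.
SAMPLE_TH = "th 8 594 3733.140 83.850 96.660 0.000 0.000 0.000 0.000 0.000 0.000 0.000"

threads, all_busy, histogram = parse_th_line(SAMPLE_TH)
print("threads:", threads)
print("times all threads were busy:", all_busy)
print("busy-time histogram:", histogram)
```

Under that reading, a second number that keeps growing suggests requests are queueing for a free thread, which is the case where raising RPCNFSDCOUNT is worth trying.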

On the server side, use the 'async' option in /etc/exports. This was a crucial
step to get good performance.

Finally, try different values of rsize and wsize in your
/etc/fstab, /etc/auto.home (if using automounts), or in general, in any mount
command. Check from client side also with 'cat /proc/mounts'.
Test your favourite benchmark with different rsize,wsize and look for an
optimal value.

ALL the steps above were necessary for me to get good performance, but
the last step was crucial, since I got very different performance
depending on the value of rsize/wsize.




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
       [not found]           ` <90d010000802150737x2ad0739dmeaaa24dc2845e81a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-02-15 16:13             ` Trond Myklebust
       [not found]               ` <1203092030.11333.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
  2008-02-15 16:18             ` Chuck Lever
  1 sibling, 1 reply; 35+ messages in thread
From: Trond Myklebust @ 2008-02-15 16:13 UTC (permalink / raw)
  To: Font Bella; +Cc: Chuck Lever, NFS list, Marcelo Leal


On Fri, 2008-02-15 at 16:37 +0100, Font Bella wrote:

> Finally, try different values of rsize and wsize in your
> /etc/fstab, /etc/auto.home (if using automounts), or in general, in any mount
> command. Check from client side also with 'cat /proc/mounts'.
> Test your favourite benchmark with different rsize,wsize and look for an
> optimal value.
> 
> ALL the steps above were necessary for me to get good performance, but
> the last step was
> crucial, since I got very different performances depending on the
> value of rsize/wsize.

That very likely implies that you have problems with UDP packet loss.
Switch to TCP.
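
One way to see why rsize/wsize sensitivity points at UDP loss is to count IP fragments. A rough sketch under stated assumptions: a 1500-byte Ethernet MTU, a 20-byte IP header, an 8-byte UDP header, independent frame loss, and a guessed 128 bytes of RPC/NFS header per request (RPC_OVERHEAD is an assumption, not a measured value). If any fragment is lost, the whole datagram is lost and the RPC has to be retransmitted after a timeout.

```python
import math

def fragments(udp_payload, mtu=1500, ip_header=20, udp_header=8):
    """Number of IP fragments needed to carry one UDP datagram."""
    # Fragment payloads must be a multiple of 8 bytes (except the last one).
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((udp_payload + udp_header) / per_fragment)

def datagram_loss(frame_loss, n_fragments):
    """Probability that at least one fragment of a datagram is lost."""
    return 1.0 - (1.0 - frame_loss) ** n_fragments

RPC_OVERHEAD = 128   # assumed header bytes per request, for illustration only
FRAME_LOSS = 0.01    # assumed 1% frame loss, for illustration only

for size in (1024, 2048, 8192, 32768):
    n = fragments(size + RPC_OVERHEAD)
    print(f"rsize/wsize={size}: {n} fragment(s), "
          f"datagram loss ~{datagram_loss(FRAME_LOSS, n):.3f}")
```

With TCP a lost segment is retransmitted on its own, so the transfer size has far less influence on how much work is repeated.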

Trond


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
       [not found]           ` <90d010000802150737x2ad0739dmeaaa24dc2845e81a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2008-02-15 16:13             ` Trond Myklebust
@ 2008-02-15 16:18             ` Chuck Lever
  1 sibling, 0 replies; 35+ messages in thread
From: Chuck Lever @ 2008-02-15 16:18 UTC (permalink / raw)
  To: Font Bella; +Cc: NFS list, Marcelo Leal

On Feb 15, 2008, at 10:37 AM, Font Bella wrote:
> Dear all,
>
> I finally got it to work, after much pain/testing. Here are my config
> notes (just for the record).
> Thanks Marcelo and Chuck!
>
> NFS setup
> =========
>
> Documentation
> -------------
>
> * http://billharlan.com/pub/papers/NFS_for_clusters.html
> * http://nfs.sourceforge.net/nfs-howto/ 
> ar01s05.html#nfsd_daemon_instances
>
> Setting
> -------
>
> We use package nfs-kernel-server, i.e. we use the kernel-space nfs  
> server,
> which is faster than nfs-user-server.
>
> We use NFS version 3.
>
> Configuration
> -------------
>
> Make sure we are using nfs version 3. This seems to be the default  
> with
> package nfs-kernel-server. Check from client side with::
>
>         cat /proc/mounts
>
> Use UDP for packet transmission, i.e. use option 'proto=udp' in your
> /etc/fstab, /etc/auto.home (if using automounts), or in general, in  
> any mount
> command. Check from client side also with 'cat /proc/mounts'.
>
> Make sure you have enough nfsd server threads. See if your server  
> is receiving
> too many overlapping requests with
>
>   $ grep th /proc/net/rpc/nfsd
>
> Ours isn't, so we increase the number of threads used by the server to
> 32 by changing
> RPCNFSDCOUNT=32 in /etc/default/nfs-kernel-server (Debian  
> configuration file
> for startup scripts). Remember to restart nfs-kernel-server for  
> changes to
> take effect.
>
> In the server side, use 'async' option in /etc/exports. This was a  
> crucial
> step to get good performance.
>
> Finally, try different values of rsize and wsize in your
> /etc/fstab, /etc/auto.home (if using automounts), or in general, in  
> any mount
> command. Check from client side also with 'cat /proc/mounts'.
> Test your favourite benchmark with different rsize,wsize and look  
> for an
> optimal value.
>
> ALL the steps above were necessary for me to get good performance, but
> the last step was
> crucial, since I got very different performances depending on the
> value of rsize/wsize.

I'm glad you were able to make progress.  32 server threads is  
actually fairly conservative; you might consider 128 or more if you  
have more than a few clients.

I want to make sure you understand the limitations and risks of using  
UDP and the "async" export option, however.

1.  "async" is no longer the default because it introduces a silent  
data corruption risk.  With NFSv3, data write operations are already  
asynchronous, with a subsequent COMMIT, so that they are safe.  The  
client now knows when data has hit stable storage and can thus delete  
its cached copy safely.

I urge you to read the NFS FAQ discussion on the "async" export  
option and reconsider its use in production.

2.  UDP is no longer the default because it also introduces a silent  
data corruption risk, since the IP ID field (which UDP depends on for  
reassembling datagrams larger than a single link-layer frame) is only  
16 bits wide.  If this field should wrap, datagram reassembly is  
compromised.  The UDP datagram checksum is weak enough that the  
receiving end probably won't detect the reassembly errors.

In addition, UDP will likely perform poorly in situations involving  
more than a few clients.  Its congestion control algorithm is unable
to handle large amounts of concurrent network traffic since it  
doesn't have a packet ACK mechanism like TCP does.  The fact that  
your performance was best at such a small r/wsize (you mentioned 2048  
in your earlier e-mail) suggests you have a network environment that  
would benefit enormously from using TCP.
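
To put a rough number on the IP ID wrap in point 2, here is a small
Python sketch. The assumptions are mine and not from this thread: one
IP ID is consumed per UDP datagram, a single counter is shared by all
traffic, the link is fully saturated, and header overhead is ignored.
Real stacks differ, so treat the result as an order of magnitude only.

```python
# Rough estimate of how quickly a 16-bit IP ID counter can wrap.
# Assumptions (not from the thread): one ID per UDP datagram, a single
# counter, a fully saturated link, header overhead ignored.
IP_ID_VALUES = 2 ** 16

def seconds_until_wrap(link_bits_per_s: float, datagram_bytes: int) -> float:
    """Time needed to send 65536 datagrams of the given size."""
    datagrams_per_s = link_bits_per_s / 8 / datagram_bytes
    return IP_ID_VALUES / datagrams_per_s

for size in (2048, 8192, 32768):
    print(f"{size:>6}-byte datagrams at 1 Gb/s: "
          f"wrap after ~{seconds_until_wrap(1e9, size):.1f} s")
```

Under these assumptions the counter wraps within seconds on gigabit
hardware, which is shorter than typical fragment reassembly timeouts
(tens of seconds). That is the window in which fragments of two
different datagrams can end up being reassembled together.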


So, our recommendation these days is to use the default "sync" export  
setting, and use NFSv3 over TCP if at all possible.  (The HOWTO may  
be out of date in this regard).  If you are not able to achieve good  
performance results with these settings, you can e-mail the list  
again and we can do further analysis.
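
To illustrate point 1 above, here is a toy model in Python of how an
NFSv3 client keeps unstable writes cached until a COMMIT confirms
them. This is not real NFS code: ToyServer, ToyClient and crash() are
names invented for this sketch. Only the idea of an UNSTABLE write
followed by COMMIT, with a write verifier that changes when the server
restarts, comes from the NFSv3 protocol.

```python
# Toy model of NFSv3 unstable WRITE + COMMIT. Not real NFS code:
# ToyServer, ToyClient and crash() are hypothetical names for illustration.
import itertools

class ToyServer:
    _boot_ids = itertools.count(1)

    def __init__(self):
        self.verifier = next(self._boot_ids)  # changes on every "reboot"
        self.page_cache = {}                  # data not yet on disk
        self.disk = {}                        # stable storage

    def write_unstable(self, offset, data):
        self.page_cache[offset] = data        # reply before data is on disk
        return self.verifier

    def commit(self):
        self.disk.update(self.page_cache)     # flush to stable storage
        self.page_cache.clear()
        return self.verifier

    def crash(self):
        self.page_cache.clear()               # uncommitted data is lost
        self.verifier = next(self._boot_ids)  # new verifier after restart

class ToyClient:
    def __init__(self, server):
        self.server = server
        self.pending = {}                     # offset -> (data, verifier)

    def write(self, offset, data):
        verf = self.server.write_unstable(offset, data)
        self.pending[offset] = (data, verf)   # keep a copy until COMMIT

    def commit(self):
        while self.pending:
            verf = self.server.commit()
            # Writes sent under an older verifier may have been lost.
            stale = {o: d for o, (d, v) in self.pending.items() if v != verf}
            self.pending.clear()
            for offset, data in stale.items():
                self.write(offset, data)      # resend, then commit again
```

In this model a crash between write() and commit() is harmless,
because the client notices the changed verifier and resends its cached
data. As I understand the "async" export option, the server answers
before the data is on disk, so the client drops its cached copy too
early and has nothing left to resend after a crash.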



> On Thu, Feb 14, 2008 at 5:56 PM, Chuck Lever  
> <chuck.lever@oracle.com> wrote:
>> On Feb 14, 2008, at 11:27 AM, Marcelo Leal wrote:
>>>  Hello all,
>>> There is a great diff between access the raw discs and through LVM,
>>> with some kind of RAID, and etc. I think you should use NFS v3, and
>>> it's hard to think that without you explicitally configure it to use
>>> v2, it using...
>>> A great diff between v2 and v3 is that v2 is always "async", what  
>>> is a
>>> performance burst. Are you sure that in the new environment is  
>>> not v3?
>>> In the new stable version (nfs-utils), debian is sync by default.  
>>> I'm
>>> used to "8192" transfer sizes, and was the best perfomance in my
>>> tests.
>>
>>  As Marcelo suggested, this could be nothing more than the change in
>>  default export options (see exports(8) -- the description of the  
>> sync/
>>  async option) between sarge and etch.  This was a change in the nfs-
>>  utils package done a while back to improve data integrity guarantees
>>  during server instability.
>>
>>  You can test this easily by explicitly specifying sync or async in
>>  your /etc/exports and trying your test.
>>
>>  It especially affects NFSv2, as all NFSv2 writes are FILE_SYNC (i.e.
>>  they must be committed to permanent storage before the server
>>  replies) -- the async export option breaks that guarantee to improve
>>  performance.  There is some further description in the NFS FAQ at
>>  http://nfs.sourceforge.net/ .
>>
>>  The preferred way to get "async" write performance is to use NFSv3.
>>
>>
>>
>>>  Would be nice if you could test another network service writing in
>>> that server.. like ftp, or iscsi.
>>>  Another question, the discs are "local" or SAN? There is no
>>> concurrency?
>>>
>>> ps.: v2 has a 2GB file size limit AFAIK.
>>>
>>>  Leal.
>>>
>>> 2008/2/14, Font Bella <fontbella@gmail.com>:
>>>> Hi,
>>>>
>>>>  some of our apps are experiencing slow nfs performance in our new
>>>> cluster, in
>>>>  comparison with the old one. The nfs setups for both clusters are
>>>> very
>>>>  similar, and we are wondering what's going on. The details of
>>>> both setups are
>>>>  given below for reference.
>>>>
>>>>  The problem seems to occur with apps that do heavy i/o, creating,
>>>> writing,
>>>>  reading, and deleting many files. However, writing or reading a
>>>> large file
>>>>  (as measure with `time dd if=/dev/zero of=2gbfile bs=1024
>>>> count=2000`) is not
>>>>  slow.
>>>>
>>>>  We have performed some tests with the disk benchmark 'dbench',
>>>> which reports
>>>>  i/o performance of 60 Mb/sec in the old cluster down to about 6Mb/
>>>> sec in the
>>>>  new one.
>>>>
>>>>  After noticing this problem, we tried the user-mode nfs server
>>>> instead of the
>>>>  kernel-mode server, and just installing the user-mode server
>>>> helped improving
>>>>  throughput up to 12 Mb/sec, but still far away from the good old
>>>> 60 Mb/sec.
>>>>
>>>>  After going through the "Optimizing NFS performance" section of  
>>>> the
>>>>  NFS-Howto and tweaking the rsize,wsize parameters (the optimal
>>>> seems to be
>>>>  2048, which seems kind of weird to me, specially compared to the
>>>> 8192 used in
>>>>  the old cluster), throughput increased to 21 Mb/sec, but is still
>>>> too far
>>>>  from the old 60Mb/sec.
>>>>
>>>>  We are stuck at this point. Any help/comment/suggestion will be
>>>> greatly
>>>>  appreciated.
>>>>  /P
>>>>
>>>>  **************************** OLD CLUSTER
>>>> *****************************
>>>>
>>>>  SATA disks.
>>>>
>>>>  Filesystem: ext3.
>>>>
>>>>  * the version of nfs-utils you are using: I don't know. It's the
>>>> most
>>>>   recent version in debian sarge (oldstable).
>>>>
>>>>  user-mode nfs server.
>>>>
>>>>  nfs version 2, as reported with rpcinfo.
>>>>
>>>>  * the version of the kernel and any non-stock applied kernels:
>>>> 2.6.12
>>>>  * the distribution of linux you are using: Debian sarge x386 on
>>>> Intel Xeon
>>>>   processors.
>>>>  * the version(s) of other operating systems involved: no other OS.
>>>>
>>>>  It is also useful to know the networking configuration connecting
>>>> the hosts:
>>>>  Typical beowulf setup, with all servers connected to a switch,
>>>> 1Gb network.
>>>>
>>>>  /etc/exports:
>>>>
>>>>  /srv/homes      192.168.1.0/255.255.255.0 (rw,no_root_squash)
>>>>
>>>>  /etc/fstab:
>>>>
>>>>  server:/srv/homes/user /mnt/user nfs
>>>> rw,hard,intr,rsize=8192,wsize=8192 0 0
>>>>
>>>>  **************************** NEW CLUSTER
>>>> *****************************
>>>>
>>>>  SAS 10k disks.
>>>>
>>>>  Filesystem: ext3 over LVM.
>>>>
>>>>  * the version of nfs-utils you are using: I don't know. It's the
>>>> most
>>>>   recent version in debian etch (stable).
>>>>
>>>>  kernel-mode nfs server.
>>>>
>>>>  nfs version 2, as reported with rpcinfo.
>>>>
>>>>  * the version of the kernel and any non-stock applied kernels:
>>>> 2.6.18-5-amd64
>>>>  * the distribution of linux you are using: Debian etch AMD64 on
>>>> Intel Xeon
>>>>   processors.
>>>>  * the version(s) of other operating systems involved: no other OS.
>>>>
>>>>  It is also useful to know the networking configuration connecting
>>>> the hosts:
>>>>  Typical beowulf setup, with all servers connected to a switch,
>>>> 1Gb network.
>>>>
>>>>  /etc/exports:
>>>>
>>>>  /srv/homes      192.168.1.0/255.255.255.0 (no_root_squash)
>>>>
>>>>  mount options:
>>>>
>>>>  rsize=8192,wsize=8192
>>>>
>>>>
>>>
>>>
>>> --
>>> pOSix rules
>>
>>  --
>>  Chuck Lever
>>  chuck[dot]lever[at]oracle[dot]com
>>
>>
>>
>>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
       [not found]               ` <1203092030.11333.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
@ 2008-02-18  9:39                 ` Font Bella
       [not found]                   ` <90d010000802180139x49ac1f49x976f11cec0e01fdf-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 35+ messages in thread
From: Font Bella @ 2008-02-18  9:39 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Chuck Lever, NFS list, Marcelo Leal

I tried TCP and async options, but I get poor performance in my
benchmarks (a dbench run with 10 clients). Below I tabulated the
outcome of my tests, which show that in my setting there is a huge
difference between sync and async, and udp/tcp. Any
comments/suggestions are warmly welcome.

I also tried setting 128 server threads as Chuck suggested, but this
doesn't seem to affect performance. This makes sense, since we only
have a dozen clients.

About sync/async, I am not very concerned about corrupt data if the
cluster goes down; we mostly do computing, no crucial database
transactions or anything like that. Our users wouldn't mind some
degree of data corruption in case of power failure, but speed is
crucial.

Our network setup is just a dozen servers connected to a switch.
Everything (adapters/cables/switch) is 1 gigabit. We use Ethernet
bonding to double networking speed.

Here are the test results. I didn't measure SYNC+UDP, since SYNC+TCP
already gives me very poor performance. Admittedly, my test is very
simple, and I should probably try something more complete, like
IOzone. But the dbench run seems to reproduce the bottleneck we've
been observing in our cluster.

Thanks,
/P

********************** ASYNC option in server ******************************

rsize,wsize (bytes)   TCP (MB/s)         UDP (MB/s)

1024                  24                 34
2048                  35                 49
4096                  37                 75
8192                  40.4               35
16386                 40.2               19

********************** SYNC option in server ******************************

rsize,wsize (bytes)   TCP (MB/s)         UDP (MB/s)

1024                  6                  ??
2048                  7.44               ??
4096                  7.33               ??
8192                  7                  ??
16386                 7                  ??

On Feb 15, 2008 5:13 PM, Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
> That very likely implies that you have problems with UDP packet loss.
> Switch to TCP.
>
> Trond
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
       [not found]                   ` <90d010000802180139x49ac1f49x976f11cec0e01fdf-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-02-18 16:59                     ` Chuck Lever
  0 siblings, 0 replies; 35+ messages in thread
From: Chuck Lever @ 2008-02-18 16:59 UTC (permalink / raw)
  To: Font Bella; +Cc: Trond Myklebust, NFS list, Marcelo Leal

On Feb 18, 2008, at 4:39 AM, Font Bella wrote:
> I tried TCP and async options, but I get poor performance in my
> benchmarks (a dbench run with 10 clients). Below I tabulated the
> outcome of my tests, which show that in my setting there is a huge
> difference between sync and async, and udp/tcp. Any
> comments/suggestions are warmly welcome.
>
> I also tried setting 128 server threads as Chuck suggested, but this
> doesn't seem to affect performance. This makes sense, since we only
> have a dozen of clients.

Each Linux client mount point can generate up to 16 server requests  
by default.  A dozen clients each with a single mount point can  
generate 192 concurrent requests.  So 128 server threads is not as  
outlandish as you might think.
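
The arithmetic above as a tiny Python sketch. The limit of 16 requests
per mount point is the default mentioned in the paragraph; the function
name is made up for illustration.

```python
# Worked version of the arithmetic above. The limit of 16 requests per
# mount point is the default mentioned in the text.
SLOTS_PER_MOUNT = 16

def max_concurrent_requests(clients: int, mounts_per_client: int = 1) -> int:
    """Upper bound on requests the clients can have in flight at once."""
    return clients * mounts_per_client * SLOTS_PER_MOUNT

print(max_concurrent_requests(12))   # 192
```

With a dozen clients the bound is already above 128, and every
additional mount point per client raises it further.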

In this case, you are likely hitting some other bottleneck before the  
clients can utilize all the server threads.

> About sync/async, I am not very concerned about corrupt data if the
> cluster goes down, we do mostly computing, no crucial database
> transactions or anything like that. Our users wouldn't mind some
> degree of data corruption in case of power failure, but speed is
> crucial.

The data corruption is silent.  If it weren't, you could simply  
restore from a backup as soon as you recover from a server crash.   
Silent corruption spreads into your backed up data, and starts  
causing strange application errors, sometimes a long time after the  
corruption first occurred.

> Our network setting is just a dozen of servers connected to a switch.
> Everything (adapters/cables/switch) is 1gigabit. We use ethernet
> bonding to double networking speed.
>
> Here are the test results. I didn't measure SYNC+UDP, since SYNC+TCP
> already gives me very poor performance. Admittedly, my test is very
> simple, and I should probably try something more complete, like
> IOzone. But the dbench run seems to reproduce the bottleneck we've
> been observing in our cluster.

I assume the dbench test is read and write only (little or no  
metadata activity like file creation and deletion).  How closely does  
dbench reflect your production workload?

I see from your initial e-mail that your server file system is:

 > SAS 10k disks.
 >
 > Filesystem: ext3 over LVM.

Have you tried testing over NFS with a file system that resides on a  
single physical disk?  If you have done a read-only test versus a  
write-only test, how do the numbers compare?  Have you tested a range
of write sizes, from small file writes to writing files larger than
the server's memory?

> ********************** ASYNC option in server  
> ******************************
>
> rsize,wsize          TCP                 UDP
>
> 1024                  24 MB/s            34 MB/s
> 2048                  35                 49
> 4096                  37                 75
> 8192                  40.4               35
> 16386                 40.2               19

As the size of the read and write requests increases past 4096, your UDP
throughput decreases markedly.  This does indicate some packet loss,
so TCP is going to provide consistent performance and much lower risk  
to data integrity as your network and client workloads increase.
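
For reference, the same observation as a short Python sketch over the
ASYNC numbers quoted above. The figures are copied from the table; the
helper names are invented for this example.

```python
# dbench throughput (MB/s) from the ASYNC table quoted above.
results = {
    1024:  {"tcp": 24.0, "udp": 34.0},
    2048:  {"tcp": 35.0, "udp": 49.0},
    4096:  {"tcp": 37.0, "udp": 75.0},
    8192:  {"tcp": 40.4, "udp": 35.0},
    16386: {"tcp": 40.2, "udp": 19.0},
}

def best_size(proto):
    """rsize/wsize with the highest throughput for this transport."""
    return max(results, key=lambda size: results[size][proto])

def drop_after_peak(proto):
    """Fraction of peak throughput lost at the largest size tested."""
    peak = results[best_size(proto)][proto]
    last = results[max(results)][proto]
    return (peak - last) / peak

for proto in ("tcp", "udp"):
    print(proto, best_size(proto), f"{drop_after_peak(proto):.0%}")
```

TCP stays essentially flat once it reaches its peak, while UDP loses
roughly three quarters of its peak throughput at the largest setting.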

You might try this test again and watch your clients' ethernet  
bandwidth and RPC retransmit rate to see what I mean.  At the 16386  
setting, the UDP test may be pumping significantly more packets onto  
the network, but is getting only about 20MB/s through.  This will  
certainly have some effect on other traffic on the network.
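
One way to watch the retransmit rate is sketched below in Python. It
assumes that /proc/net/rpc/nfs on the client contains a line of the
form "rpc <calls> <retrans> ...", which I believe is where nfsstat
takes its "calls" and "retrans" figures from; please check this on
your kernel before relying on it.

```python
# Sketch: compute the client-side RPC retransmit rate.
# Assumption: /proc/net/rpc/nfs contains a line "rpc <calls> <retrans> ...".
def retransmit_rate(stats_text: str) -> float:
    """Return retransmits divided by calls from the 'rpc' line."""
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "rpc":
            calls, retrans = int(fields[1]), int(fields[2])
            return retrans / calls if calls else 0.0
    raise ValueError("no 'rpc' line found")

if __name__ == "__main__":
    try:
        with open("/proc/net/rpc/nfs") as f:
            print(f"retransmit rate: {retransmit_rate(f.read()):.4%}")
    except (OSError, ValueError) as err:
        print(f"could not read NFS client stats: {err}")
```

Sampling this before and after a dbench run at each rsize/wsize
setting should show whether the retransmit count climbs as the UDP
throughput falls.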

The first thing I check in these instances is that gigabit ethernet  
flow control is enabled in both directions on all interfaces (both  
host and switch).

In addition, using larger r/wsize settings on your clients means the  
server can perform disk reads and writes more efficiently, which will  
help your server scale with increasing client workloads.

By examining your current network carefully, you might be able to  
boost the performance of NFS over both UDP and TCP.  With bonded  
gigabit, you should be able to push network throughput past 200 MB/s  
using a test like iPerf which doesn't touch disks.  Thus, at least  
NFS reads from files already in the server's page cache ought to fly  
in this configuration.
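
A back-of-the-envelope check of that figure in Python. It assumes
ideal load balancing across the two bonded links and ignores protocol
overhead, so the real ceiling is somewhat lower.

```python
# Back-of-the-envelope ceiling for two bonded gigabit links.
# Assumes ideal load balancing across the bond and ignores protocol overhead.
def link_ceiling_mb_per_s(links: int, gbit_per_link: float = 1.0) -> float:
    return links * gbit_per_link * 1000 / 8   # decimal megabytes per second

ceiling = link_ceiling_mb_per_s(2)            # 250.0
best_measured = 75.0                          # best dbench result in the table (UDP, 4096)
print(f"ceiling {ceiling:.0f} MB/s, "
      f"best measured is {best_measured / ceiling:.0%} of it")
```

Even the best result in the table is well under a third of what the
wire could carry, which suggests the bottleneck is not raw bandwidth.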

> ********************** SYNC option in server  
> ******************************
>
> rsize,wsize          TCP                 UDP
>
> 1024                  6 MB/s             ?? MB/s
> 2048                  7.44               ??
> 4096                  7.33               ??
> 8192                  7                  ??
> 16386                 7                  ??

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* performance question
@ 2008-03-20 18:01 david ahern
  0 siblings, 0 replies; 35+ messages in thread
From: david ahern @ 2008-03-20 18:01 UTC (permalink / raw)
  To: kvm-devel

I am trying to understand spikes in system time that I am seeing in a VM. The
guest OS is RHEL4, with 2 vcpus and 2.5 GB RAM; the host is running a 2.6.24.2 kernel.
The kvm version is kvm-63.

Using the stat scripts Christian Ehrhardt posted a few days ago (thanks,
Christian, very handy tool) I collected kvm_stat data as a function of time (I
added time to the output). Comparing plots of guest system time to plots of
kvm_stat, the spikes in system time correlate most strongly with the following
kvm_stat variables:

mmu_cache_miss
mmu_flooded
mmu_pte_updated
mmu_pte_write
mmu_shadow_zapped
pf_fixed
pf_guest
remote_tlb_flush
tlb_flush
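
For anyone who wants to repeat this kind of comparison without
eyeballing plots, here is a rough Python sketch that ranks counters by
their correlation with guest system time. The sample numbers are
invented placeholders, not my measurements, and "some_flat_counter" is
a made-up name.

```python
# Sketch: rank kvm_stat counters by correlation with guest system time.
# The sample numbers are invented placeholders, not measurements.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0          # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)

guest_sys_time = [2, 3, 40, 3, 2, 35]          # percent, per sample interval
counters = {
    "mmu_pte_write": [10, 12, 900, 11, 9, 800],
    "some_flat_counter": [5, 5, 5, 5, 5, 5],   # hypothetical counter name
}
ranked = sorted(counters,
                key=lambda k: pearson(counters[k], guest_sys_time),
                reverse=True)
print(ranked)
```

Correlation only says which counters move together with the spikes;
it does not say which of them is the cause.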

Can someone provide some guidance/hints on what would cause spikes in the above
and if there is anything I can do to improve it?

The load on the VM is fairly constant (network traffic of ~48kB/sec received and
 ~189kB/sec transmit) with some moderate disk IO as well.

thanks,
david


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Performance question
@ 2009-01-17 17:18 Piergiorgio Sartor
  2009-01-17 18:37 ` Bill Davidsen
  2009-01-17 22:08 ` Keld Jørn Simonsen
  0 siblings, 2 replies; 35+ messages in thread
From: Piergiorgio Sartor @ 2009-01-17 17:18 UTC (permalink / raw)
  To: linux-raid

Hi all,

I'll have to set up some machines with two HDs (each)
in order to get some redundancy.

Reading the MD features I noticed there are several
possibilities to create a mirror.
I was wondering which one offers the best performance
and/or what trade-offs have to be accepted between
the different solutions.

One possibility is a classic RAID-1 mirror.
Another is a RAID-10 far.
There would also be the RAID-10 near, but I guess
this is equivalent to RAID-1.

Any suggestion on which method offers higher "speed"?
Or are there other possibilities with 2 HDs (keeping
the redundancy, of course)?

Thanks a lot in advance,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
@ 2009-01-17 18:11 David Lethe
  2009-01-17 18:20 ` Piergiorgio Sartor
  0 siblings, 1 reply; 35+ messages in thread
From: David Lethe @ 2009-01-17 18:11 UTC (permalink / raw)
  To: Piergiorgio Sartor, linux-raid

All we know is that you use 2 disks and md.  This is like posting to a TCP/IP
architecture group and saying you have a network connection and want
performance advice.  Read up, supply full config info, run benchmarks, then
ask specific questions.  Garbage in, garbage out.
-----Original Message-----

From:  "Piergiorgio Sartor" <piergiorgio.sartor@nexgo.de>
Subj:  Performance question
Date:  Sat Jan 17, 2009 11:18 am
Size:  874 bytes
To:  "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>

Hi all, 

I'll have to setup some machines with two HDs (each) 
in order to get some redundancy. 

Reading the MD features I noticed there are several 
possibilities to create a mirror. 
I was wondering which one offer the best perfomances 
and/or what are the compromises to accept between 
the different solutions. 

One possibility is a classic RAID-1 mirror. 
Another is a RAID-10 far. 
There would also be the RAID-10 near, but I guess 
this is equivalent to RAID-1. 

Any suggestion on which method offers higher "speed"? 
Or there are other possibilities with 2 HDs (keeping 
the redundancy, of course)? 

Thanks a lot in advance, 

bye, 

--  

piergiorgio 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2009-01-17 18:11 Performance question David Lethe
@ 2009-01-17 18:20 ` Piergiorgio Sartor
  0 siblings, 0 replies; 35+ messages in thread
From: Piergiorgio Sartor @ 2009-01-17 18:20 UTC (permalink / raw)
  To: linux-raid

Hi,

thanks for the answer.

Well, what I would like to have is exactly a configuration
hint, possibly with benchmarks and the like.

The requirements are: two disks, redundancy.
The question is: what configuration is recommended
in view of performance (or "what can be achieved").

Is that specific enough?

Thanks again,

bye,

pg

On Sat, Jan 17, 2009 at 12:11:00PM -0600, David Lethe wrote:
> All we know is that you use 2 disks and md.  This is like posting to a TCP/IP architecture group and saying you have a network connection and want performance advice.   Read up, supply full config info, run benchmarks, then ask specific questions.  GI=GO.
> -----Original Message-----
> 
> From:  "Piergiorgio Sartor" <piergiorgio.sartor@nexgo.de>
> Subj:  Performance question
> Date:  Sat Jan 17, 2009 11:18 am
> Size:  874 bytes
> To:  "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
> 
> Hi all, 
>  
> I'll have to setup some machines with two HDs (each) 
> in order to get some redundancy. 
>  
> Reading the MD features I noticed there are several 
> possibilities to create a mirror. 
> I was wondering which one offer the best perfomances 
> and/or what are the compromises to accept between 
> the different solutions. 
>  
> One possibility is a classic RAID-1 mirror. 
> Another is a RAID-10 far. 
> There would also be the RAID-10 near, but I guess 
> this is equivalent to RAID-1. 
>  
> Any suggestion on which method offers higher "speed"? 
> Or there are other possibilities with 2 HDs (keeping 
> the redundancy, of course)? 
>  
> Thanks a lot in advance, 
>  
> bye, 
>  
> --  
>  
> piergiorgio 

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2009-01-17 17:18 Performance question Piergiorgio Sartor
@ 2009-01-17 18:37 ` Bill Davidsen
  2009-01-17 22:08 ` Keld Jørn Simonsen
  1 sibling, 0 replies; 35+ messages in thread
From: Bill Davidsen @ 2009-01-17 18:37 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

Piergiorgio Sartor wrote:
> Hi all,
>
> I'll have to setup some machines with two HDs (each)
> in order to get some redundancy.
>
> Reading the MD features I noticed there are several
> possibilities to create a mirror.
> I was wondering which one offer the best perfomances
> and/or what are the compromises to accept between
> the different solutions.
>
> One possibility is a classic RAID-1 mirror.
> Another is a RAID-10 far.
> There would also be the RAID-10 near, but I guess
> this is equivalent to RAID-1.
>
> Any suggestion on which method offers higher "speed"?
> Or there are other possibilities with 2 HDs (keeping
> the redundancy, of course)?
>   

A mirrored array will offer slower write speed no matter how you do it,
usually about the speed of a single drive. With raid10 far you should
get about N times faster reads than a single drive, where N is the number
of drives in the array. Clearly, using three or more drives will help a LOT
in typical performance.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2009-01-17 17:18 Performance question Piergiorgio Sartor
  2009-01-17 18:37 ` Bill Davidsen
@ 2009-01-17 22:08 ` Keld Jørn Simonsen
  2009-01-19 18:12   ` Piergiorgio Sartor
  1 sibling, 1 reply; 35+ messages in thread
From: Keld Jørn Simonsen @ 2009-01-17 22:08 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Sat, Jan 17, 2009 at 06:18:06PM +0100, Piergiorgio Sartor wrote:
> Hi all,
> 
> I'll have to setup some machines with two HDs (each)
> in order to get some redundancy.
> 
> Reading the MD features I noticed there are several
> possibilities to create a mirror.
> I was wondering which one offer the best perfomances
> and/or what are the compromises to accept between
> the different solutions.
> 
> One possibility is a classic RAID-1 mirror.
> Another is a RAID-10 far.
> There would also be the RAID-10 near, but I guess
> this is equivalent to RAID-1.

Yes, raid10,n2 is much the same as raid1 for 2 drives;
that is, the disk layout is the same. There may be some
differences due to the use of different drivers, though. It was reported at
some point that there were some errors that one of the drivers handled
better than the other. I am not sure which one was the better.
Also syncing and rebuilding etc. may have different performance.

> Any suggestion on which method offers higher "speed"?
> Or there are other possibilities with 2 HDs (keeping
> the redundancy, of course)?

raid10,f2 offers something like double the speed for sequential reads,
while probably being a little faster on random reads, and with a file
system it is about equal in performance on writes. Degraded performance (in
the case that one disk is failing) could be worse for raid10,f2, but in
real life, with the fs elevator in operation, the penalty may be
minimal. IMHO you could normally replace raid1, raid10,n2, and
raid1+0 with raid10,f2, except for boot devices.
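
To show why the far layout helps sequential reads, here is a
simplified Python model of where chunks land on 2 drives. It is an
illustration based on my understanding of the layouts, not the md
driver's actual code, and the function names are invented.

```python
# Simplified model of where md places chunks on 2 drives.
# near2: every chunk is mirrored at the same position on both drives.
# far2:  the first half of each drive is striped RAID0-style; the second
#        half holds the copies, rotated by one drive.
def near2(chunks):
    return [list(range(chunks)), list(range(chunks))]

def far2(chunks):
    disk0 = [c for c in range(chunks) if c % 2 == 0]
    disk1 = [c for c in range(chunks) if c % 2 == 1]
    # Second half: each drive stores the other drive's chunks.
    return [disk0 + disk1, disk1 + disk0]

print(near2(4))   # [[0, 1, 2, 3], [0, 1, 2, 3]]
print(far2(4))    # [[0, 2, 1, 3], [1, 3, 0, 2]]
```

In the far layout a sequential read of chunks 0, 1, 2, 3 can alternate
between the two drives within their first halves, much like RAID0.
Each write still has to go to both drives, and to two different
regions of the disks.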

Theoretically there is another possibility in raid5 with 2 drives,
but I am not sure it even works out in practice, and there is imho no
gain in it, except that you can expand the array with more disks.
Furthermore there is raid10,o2 which is viable, but does not
perform as well as raid10,f2.

For linux raid performance have a look at
http://linux-raid.osdl.org/index.php/Performance

For setting up a system with 2 disks so that you can survive a disk
failure, see
http://linux-raid.osdl.org/index.php/Preventing_against_a_failing_disk

I am the main author of both wiki pages, so I am interested in feedback.

Best regards
Keld

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2009-01-17 22:08 ` Keld Jørn Simonsen
@ 2009-01-19 18:12   ` Piergiorgio Sartor
  2009-01-21  0:15     ` Keld Jørn Simonsen
  0 siblings, 1 reply; 35+ messages in thread
From: Piergiorgio Sartor @ 2009-01-19 18:12 UTC (permalink / raw)
  To: linux-raid

Hi,

thanks for the answer, that was exactly what I
was looking for.

Some feedback for you.
About the performance & benchmarking I've nothing
special to say.
About the setup of two disks, I've some questions,
in no particular order.

The creation of "mdadm.conf" is done by:

mdadm --detail --scan

Somewhere else I found:

mdadm --examine --scan

The two produce different results and the Fedora
installer seems to use the second one.

Which one is really correct? Can we use one or the
other interchangeably?

Second question.
The wiki page does not mention anything about
metadata types.
While it is clear that /boot must have the RAID
header at the end, it is not clear if the RAID-10,f2
could or should have the metadata at the beginning.
In this respect, it would also be nice to have some
clarification about the recommended metadata version,
i.e. is 0.90 or 1.x better? Why?

One note. Maybe it would be worth mentioning that
further "partitioning" could be done with LVM on top
of the RAID, so only 3 md devices will be needed.

Hope this helps.

Thanks again,

bye,

pg

On Sat, Jan 17, 2009 at 11:08:49PM +0100, Keld Jørn Simonsen wrote:
> On Sat, Jan 17, 2009 at 06:18:06PM +0100, Piergiorgio Sartor wrote:
> > Hi all,
> > 
> > I'll have to setup some machines with two HDs (each)
> > in order to get some redundancy.
> > 
> > Reading the MD features I noticed there are several
> > possibilities to create a mirror.
> > I was wondering which one offer the best perfomances
> > and/or what are the compromises to accept between
> > the different solutions.
> > 
> > One possibility is a classic RAID-1 mirror.
> > Another is a RAID-10 far.
> > There would also be the RAID-10 near, but I guess
> > this is equivalent to RAID-1.
> 
> Yes, raid10,n2 is quite the same as raid1 for 2 drives,
> That is the disk layout is the same. There may be some 
> differences due to the use of different drivers, tho. It was reported at
> some time that there were some errors that one of the drivers handled
> better than the other. I am not sure which one was the better.
> Also syncing and rebuilding etc. may have different performance.
> 
> > Any suggestion on which method offers higher "speed"?
> > Or there are other possibilities with 2 HDs (keeping
> > the redundancy, of course)?
> 
> raid10,f2 offers something like double the speed for sequential read,
> while probably being a little faster on random read, and with a file
> system about equal in performance on writes. Degraded performance (in
> tha case that one disk is failing) could be worse for raid10,f2, but in
> real life, with the fs elevator in operation, the penalty may be
> minimal. IMHO you could normally replace raid1 and raid10,n2, and
> raid1+0 with raid10,f2, except for boot devices.
> 
> Theoretically there is another possibility in raid5 with 2 drives,
> but I am not sure it even works out in practice, and there is imho no
> gain in it, except that you can expand the array with more disks.
> Furthermore there is raid10,o2 which is viable, but does not
> perform as well as raid10,f2.
> 
> For linux raid performance have a look at
> http://linux-raid.osdl.org/index.php/Performance
> 
> For setting up a system with 2 disks so you can survive that one disk
> fails, see
> http://linux-raid.osdl.org/index.php/Preventing_against_a_failing_disk
> 
> I am the main author of both wiki pages, so I am interested in feedback.
> 
> Best regards
> Keld

-- 

piergiorgio
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2009-01-19 18:12   ` Piergiorgio Sartor
@ 2009-01-21  0:15     ` Keld Jørn Simonsen
  2009-01-21  1:05       ` Richard Scobie
  2009-01-21 19:14       ` Piergiorgio Sartor
  0 siblings, 2 replies; 35+ messages in thread
From: Keld Jørn Simonsen @ 2009-01-21  0:15 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Mon, Jan 19, 2009 at 07:12:53PM +0100, Piergiorgio Sartor wrote:
> Hi,
> 
> thanks for the answer, that was exactly what I
> was looking for.

Good!

> Some feedback for you.
> About the performance & benchmarking I've nothing
> special to say.
> About the setup of two disks, I've some questions,
> in no particular order.
> 
> The creation of "mdadm.conf" is done by:
> 
> mdadm --detail --scan
> 
> Somewhere else I found:
> 
> mdadm --examine --scan
> 
> The two produce different results and the Fedora
> installer seems to use the second one.
> 
> Which one is really correct? Can we use one or the
> other interchangeably?

--detail looks at the running arrays, while --examine most
likely (depending on mdadm.conf) looks at all partitions
on the system. 

Given that the arrays have just been created in the installation process,
and the active running arrays are most likely the ones you want your
system to know about, I think --detail is the better choice. On two of my
systems, --examine generates entries that conflict with each other and
are not suitable for an mdadm.conf file, such as two /dev/md1 entries
with different UUIDs.
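
As a rough illustration of the kind of conflict described above, the
following sketch scans ARRAY lines (in the general shape printed by
"mdadm --detail --scan" or "mdadm --examine --scan") and reports device
names that appear with more than one UUID. The function name and the
sample UUIDs are made up for illustration, and the sketch assumes that
the UUID= field sits on the same line as the ARRAY keyword, which may
not hold for every mdadm version.

```python
from collections import defaultdict


def find_conflicting_arrays(scan_output):
    """Return {device: sorted UUIDs} for devices listed with more than one UUID."""
    uuids_by_device = defaultdict(set)
    for line in scan_output.splitlines():
        fields = line.split()
        # Only ARRAY lines are of interest; skip blanks and continuation lines.
        if len(fields) < 2 or fields[0] != "ARRAY":
            continue
        device = fields[1]
        for field in fields[2:]:
            if field.startswith("UUID="):
                uuids_by_device[device].add(field[len("UUID="):])
    return {dev: sorted(ids) for dev, ids in uuids_by_device.items() if len(ids) > 1}


# Hypothetical sample input, not taken from a real system.
SAMPLE = """\
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=aaaaaaaa:aaaaaaaa:aaaaaaaa:aaaaaaaa
ARRAY /dev/md1 level=raid10 num-devices=2 UUID=bbbbbbbb:bbbbbbbb:bbbbbbbb:bbbbbbbb
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=cccccccc:cccccccc:cccccccc:cccccccc
"""

if __name__ == "__main__":
    for device, ids in find_conflicting_arrays(SAMPLE).items():
        print(device, "appears with UUIDs:", ", ".join(ids))
```

A check like this could be run on the scan output before appending it to
mdadm.conf, so that duplicate device names are noticed early.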

> Second question.
> The wiki page does not mention anything about
> metadata types.
> While it is clear that /boot must have the RAID
> header at the end, it is not clear if the RAID-10,f2
> could or should have the metadata at the beginning.
> In this respect, it would be nice also to have some
> clarification about the recommended metadata version,
> i.e. which is better, 0.90 or 1.x? Why?

To me it does not matter that much, except for the boot device.
Each partition in the boot device must look like a normal (ext3)
partition, as grub and lilo do not know about RAID and just treat
a boot partition as a standalone partition. So here you should use
0.90 metadata, which is put at the end of the array.

For other arrays, I think one important choice is to avoid 0.90
metadata if the array is larger than 2 TiB, as this format has a
limit of 2 TiB.

> One note. Maybe it could be worth to mention that
> further "partitioning" could be done with LVM on top
> of the RAID, so only 3 md devices will be needed.

yes, I have been looking into that. Maybe I will add some words on this.

> Hope this helps.

yes, thanks for your feedback!

best regards
keld

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2009-01-21  0:15     ` Keld Jørn Simonsen
@ 2009-01-21  1:05       ` Richard Scobie
  2009-01-21 19:14       ` Piergiorgio Sartor
  1 sibling, 0 replies; 35+ messages in thread
From: Richard Scobie @ 2009-01-21  1:05 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: Piergiorgio Sartor, linux-raid

Keld Jørn Simonsen wrote:

> For other arrays I think one important choice is if you have an array
> greater than 2 TiB to not use 0.90 metadata, as this has a limit of 2
> TiB.

This restriction only applies if the individual members of the array are 
larger than 2TB each.

Regards,

Richard

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2009-01-21  0:15     ` Keld Jørn Simonsen
  2009-01-21  1:05       ` Richard Scobie
@ 2009-01-21 19:14       ` Piergiorgio Sartor
  2009-01-21 20:15         ` Keld Jørn Simonsen
  1 sibling, 1 reply; 35+ messages in thread
From: Piergiorgio Sartor @ 2009-01-21 19:14 UTC (permalink / raw)
  To: linux-raid

Hi again,

[--detail vs. --examine]
> --detail looks at the running arrays, while --examine most
> likely (depending on mdadm.conf) looks at all partitions
> on the system. 
> 
> Given that the arrays are just created in the installation process, and
> the active running arrays are most likely the ones you want your system
> to know of, I think --detail is the better. --examine does on two of my
> systems generate info that are in conflict and not suitable for a
> mdadm.conf file, such as two /dev/md1 with different UUIDs.

Yes, but I noticed that when an array (RAID-1) is resyncing,
"--detail" reports "spares=1" too, while when the array
is in sync it prints the correct geometry.
I was also wondering about "--examine": I noticed that it
lists the arrays as /dev/md/"name", so if two arrays have
the same name, they end up with the same device.
Is this maybe a bug in mdadm?

[metadata position]
> To me it does not matter that much, except for the booting device.
> Each partition in the booting device must look like a normal (ext3)
> partition, as grub and lilo does not know of raids, and just treats
> a booting partition as a standalone partition. So here you should use
> 0.90 metadata, which is put at the end of the array.

Well, I was mixing things up a bit with this question.
In the back of my head the question was:

What about performance with RAID-10 f2, bitmap (important)
and metadata 1.0 vs. 1.1?

This could be a further performance test. It would
be interesting to know whether it is better to have the
metadata at the beginning or at the end of a RAID-10 f2
with two HDs and the bitmap enabled.
Or whether it does not matter at all.

Reading around I found different "opinions" about bitmaps
and performance, but I did not find a "convincing" test.

Thanks again.

A different item on the wiki, which I ran into today:
maybe the "initrd" description could be updated, since
it uses "mdassemble", while the "initrd" I have calls
"mdadm -As --auto=yes ..." directly (I do not remember
the full line).

Hope this helps,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2009-01-21 19:14       ` Piergiorgio Sartor
@ 2009-01-21 20:15         ` Keld Jørn Simonsen
  2009-01-21 20:26           ` Piergiorgio Sartor
  0 siblings, 1 reply; 35+ messages in thread
From: Keld Jørn Simonsen @ 2009-01-21 20:15 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid

On Wed, Jan 21, 2009 at 08:14:52PM +0100, Piergiorgio Sartor wrote:
> Hi again,
> 
> [--detail vs. --examine]
> > --detail looks at the running arrays, while --examine most
> > likely (depending on mdadm.conf) looks at all partitions
> > on the system. 
> > 
> > Given that the arrays are just created in the installation process, and
> > the active running arrays are most likely the ones you want your system
> > to know of, I think --detail is the better. --examine does on two of my
> > systems generate info that are in conflict and not suitable for a
> > mdadm.conf file, such as two /dev/md1 with different UUIDs.
> 
> yes, but I noticed that with "--detail" and an
> array (RAID-1) resyncing, it reports "spares=1"
> too, while when the array is in sync, it prints
> the correct geometry.
> So, I was wondering, since I also noticed that
> "--examine" produces the arrays with /dev/md/"name",
> so if two arrays have same name, it ends up with
> the same device.
> Is this maybe a bug of mdadm?

I will leave this one for others to answer.
I do think it is strange for --detail to report "spares=1"
while the array is syncing.

> [metadata position]
> > To me it does not matter that much, except for the booting device.
> > Each partition in the booting device must look like a normal (ext3)
> > partition, as grub and lilo does not know of raids, and just treats
> > a booting partition as a standalone partition. So here you should use
> > 0.90 metadata, which is put at the end of the array.
> 
> Well, I was a bit mixing up things with this question.
> In the back of my head the question was:
> 
> What about performances, RAID-10 f2, bitmap (important)
> and metadata 1.0 vs. 1.1?
> 
> This could be a further test for performances. It would
> be interesting to know if it is better to have the
> metadata at the beginning or at the end of a RAID-10 f2,
> with two HDs, having the bitmap enabled.
> Or if it does not matter at all.
> 
> Reading around I found different "opinions" about bitmap
> and performances, but I did not find a "convincing" test.

I have not tested it, so yes, I think this is worth a performance test.
I think it should not matter much whether the metadata is at the beginning
or at the end. However, if you run a test, you will most likely do it on
a newly created raid, and files would then tend to be allocated at the
beginning of the file system, thus favouring a metadata block at the
beginning of the raid. In real operation this will tend to even out.
Another issue is that the sectors at the beginning of a disk are much
faster, by a factor of two perhaps, than the sectors at the end of the
drive.

> Thanks again.
> 
> Different item of the wiki, I run into it today.
> Maybe the "initrd" description could be updated, since
> it uses "mdassemble", while the "initrd" I have uses
> directly "mdadm -As --auto=yes ..." (I do not remember
> the full line).

mdassemble is specifically made for initrd, so why not use it here?

Best regards
keld

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Performance question
  2009-01-21 20:15         ` Keld Jørn Simonsen
@ 2009-01-21 20:26           ` Piergiorgio Sartor
  0 siblings, 0 replies; 35+ messages in thread
From: Piergiorgio Sartor @ 2009-01-21 20:26 UTC (permalink / raw)
  To: linux-raid

Hi,

thanks for the explanation about metadata.

> mdassemble is specifically made for initrd, so why not use it here?

I do not know, I just noticed that, on Fedora,
the initrd with RAID has /etc/mdadm.conf and it
calls "mdadm -As ...".
I find this annoying, since I do not know
what will happen if an array is changed
(UUID change, /etc/mdadm.conf no longer consistent).

Anyway, if you say "mdassemble" is OK, no problem.

Thanks,

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Performance Question
@ 2011-09-15 19:43 --[ UxBoD ]--
  0 siblings, 0 replies; 35+ messages in thread
From: --[ UxBoD ]-- @ 2011-09-15 19:43 UTC (permalink / raw)
  To: dm-devel

Hello all,

We are about to configure a new storage system that uses the Nexenta OS
with sparsely allocated ZVOLs. We wish to present 4TB of storage to a
Linux system that has four NICs available to it. We are unsure whether
to present one large ZVOL or four smaller ones to make the best use of
those NICs.

We have set rr_min_io to 100, which we have found offers a good level
of performance. This raises an interesting question, though: the
multipath.conf man page says that the rr_min_io parameter is the number
of IOs across the whole path group before a switch is made to the next
path. What constitutes a single IO operation? When a user opens a file
for read access, is that one IO to open the file, X IOs to read the
contents, and another to close it? Does each of those SCSI operations
happen on the same path, i.e. on the same block device? If a second user
comes along and requests data from the same block device, do those IOs
happen on the same path or on the next one in the path group? We imagine
that they will all happen on the same path until rr_min_io is reached,
at which point it switches over to the next path.

We are trying to squeeze the maximum performance out of our system, but
we are unable to max out our 4 x 1GbE interfaces. Any thoughts on how we
can improve our performance?
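
The behaviour imagined in the question above can be sketched as a toy
model: a round-robin selector keeps sending requests down the current
path until rr_min_io of them have been issued, then moves to the next
path in the group, regardless of which user or file a request belongs
to. This is only a sketch of that assumption, not a description of how
dm-multipath actually dispatches IO; the function name and the path
names are hypothetical.

```python
from itertools import cycle


def assign_paths(num_requests, paths, rr_min_io):
    """Assign each request to a path, switching after rr_min_io requests."""
    if rr_min_io < 1:
        raise ValueError("rr_min_io must be at least 1")
    if not paths:
        raise ValueError("at least one path is required")
    path_iter = cycle(paths)
    current = next(path_iter)
    sent_on_current = 0
    assignment = []
    for _ in range(num_requests):
        # Move to the next path once the current one has had its share.
        if sent_on_current == rr_min_io:
            current = next(path_iter)
            sent_on_current = 0
        assignment.append(current)
        sent_on_current += 1
    return assignment


if __name__ == "__main__":
    # Four hypothetical paths, one per NIC; rr_min_io=100 as in the question.
    paths = ["sda", "sdb", "sdc", "sdd"]
    result = assign_paths(250, paths, rr_min_io=100)
    for path in paths:
        print(path, result.count(path))
```

Under this model, a burst of 250 requests with rr_min_io=100 puts 100
requests on the first path, 100 on the second, 50 on the third and none
on the fourth, so short bursts would not spread evenly over four NICs.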
--
Thanks, Phil

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2011-09-15 19:43 UTC | newest]

Thread overview: 35+ messages
2009-01-17 17:18 Performance question Piergiorgio Sartor
2009-01-17 18:37 ` Bill Davidsen
2009-01-17 22:08 ` Keld Jørn Simonsen
2009-01-19 18:12   ` Piergiorgio Sartor
2009-01-21  0:15     ` Keld Jørn Simonsen
2009-01-21  1:05       ` Richard Scobie
2009-01-21 19:14       ` Piergiorgio Sartor
2009-01-21 20:15         ` Keld Jørn Simonsen
2009-01-21 20:26           ` Piergiorgio Sartor
  -- strict thread matches above, loose matches on Subject: below --
2011-09-15 19:43 Performance Question --[ UxBoD ]--
2009-01-17 18:11 Performance question David Lethe
2009-01-17 18:20 ` Piergiorgio Sartor
2008-03-20 18:01 performance question david ahern
2008-02-14 15:40 Performance question Font Bella
     [not found] ` <90d010000802140740y3ff2706ybc169728fbafbfb4-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-14 16:27   ` Marcelo Leal
     [not found]     ` <42996ba90802140827p533779c6o8ab404400be51fdc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-14 16:56       ` Chuck Lever
2008-02-15 15:37         ` Font Bella
     [not found]           ` <90d010000802150737x2ad0739dmeaaa24dc2845e81a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-15 16:13             ` Trond Myklebust
     [not found]               ` <1203092030.11333.4.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2008-02-18  9:39                 ` Font Bella
     [not found]                   ` <90d010000802180139x49ac1f49x976f11cec0e01fdf-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-02-18 16:59                     ` Chuck Lever
2008-02-15 16:18             ` Chuck Lever
2005-09-12 19:06 performance question Moritz Gartenmeister
     [not found] <1049188686.19334.20.camel@deskpro02>
2003-04-01 15:39 ` jp
2003-04-01 16:06   ` Philippe Gramoullé
2003-04-01 16:22     ` Matt Heaton
2003-04-01 17:08       ` Philippe Gramoullé
2003-04-01 18:45   ` Bogdan Costescu
2003-03-31 21:45 Lever, Charles
2003-03-31 21:37 jp
2003-04-01  5:40 ` Trond Myklebust
2002-05-05 14:20 Performance question Philipp Gühring
2002-05-05 15:07 ` Oleg Drokin
2002-05-05 16:43   ` Philipp Gühring
2002-05-06 13:01     ` Oleg Drokin
2002-05-06 11:06   ` Hans Reiser
