Re: ext4, barrier, md/RAID1 and write cache

From: Martin Steigerwald <ms@teamix.de>
To: Daniel Pocock <daniel@pocock.com.au>
Cc: Martin Steigerwald <Martin@lichtvoll.de>,
	Andreas Dilger <adilger@dilger.ca>,
	linux-ext4@vger.kernel.org
Subject: Re: ext4, barrier, md/RAID1 and write cache
Date: Wed, 9 May 2012 09:30:02 +0200
Message-ID: <201205090930.02731.ms@teamix.de>
In-Reply-To: <4FA93BB2.9050509@pocock.com.au>

On Tuesday, 8 May 2012, Daniel Pocock wrote:
> On 08/05/12 14:55, Martin Steigerwald wrote:
> > On Tuesday, 8 May 2012, Daniel Pocock wrote:
> >> On 08/05/12 00:24, Martin Steigerwald wrote:
> >>> On Monday, 7 May 2012, Daniel Pocock wrote:
> >>>> On 07/05/12 20:59, Martin Steigerwald wrote:
> >>>>> On Monday, 7 May 2012, Daniel Pocock wrote:
> >>>>>>> Possibly the older disk is lying about doing cache flushes.  The
> >>>>>>> wonderful disk manufacturers do that with commodity drives to make
> >>>>>>> their benchmark numbers look better.  If you run some random IOPS
> >>>>>>> test against this disk, and it has performance much over 100 IOPS
> >>>>>>> then it is definitely not doing real cache flushes.
> >>>>> 
> >>>>> […]
> >>>>> 
> >>>>> I think an IOPS benchmark would be better. I.e. something like:
> >>>>> 
> >>>>> /usr/share/doc/fio/examples/ssd-test
> >>>>> 
> >>>>> (from flexible I/O tester debian package, also included in upstream
> >>>>> tarball of course)
> >>>>> 
> >>>>> adapted to your needs.
> >>>>> 
> >>>>> Maybe with different iodepth or numjobs (to simulate several threads
> >>>>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
> >>>>> Hitachi 5400 rpm harddisk connected via eSATA.
> >>>>> 
> >>>>> Important is direct=1 to bypass the pagecache.
> >>>> 
> >>>> Thanks for suggesting this tool, I've run it against the USB disk and
> >>>> an LV on my AHCI/SATA/md array
> >>>> 
> >>>> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
> >>>> to CC49) and one of the disks went offline shortly after I brought the
> >>>> system back up.  To avoid the risk that a bad drive might interfere
> >>>> with the SATA performance, I completely removed it before running any
> >>>> tests. Tomorrow I'm out to buy some enterprise grade drives, I'm
> >>>> thinking about Seagate Constellation SATA or even SAS.
> >>>> 
> >>>> Anyway, onto the test results:
> >>>> 
> >>>> USB disk (Seagate  9SD2A3-500 320GB):
> >>>> 
> >>>> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
> >>>> 
> >>>>   write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
[…]
> >>> Please repeat the test with iodepth=1.
> >> 
> >> For the USB device:
> >> 
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=11855
> >> 
> >>   write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
[…]
> >> and for the SATA disk:
> >> 
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=12256
> >> 
> >>   write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
[…]
> > […]
> > 
> >>      issued r/w: total=0/7005, short=0/0
> >>      
> >>      lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
> >>      lat (msec): 250=0.09%
> >>> 
> > >>> 194 IOPS appears to be highly unrealistic unless NCQ or something like
> > >>> that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't
> > >>> check vendor information).
> >> 
> >> The SATA disk does have NCQ
> >> 
> >> USB disk is supposed to be 5400RPM, USB2, but reporting iops=205
> >> 
> >> SATA disk is 7200 RPM, 3 Gigabit SATA, but reporting iops=116
> >> 
> >> Does this suggest that the USB disk is caching data but telling Linux
> >> the data is on disk?
> > 
> > Looks like it.
> > 
> > Some older values for a 1.5 TB WD Green Disk:
> > 
> > mango:~# fio -readonly -name iops -rw=randread -bs=512  -runtime=100
> > -iodepth 1 -filename /dev/sda -ioengine  libaio -direct=1
> > [...] iops: (groupid=0, jobs=1): err= 0: pid=9939
> > 
> > >   read : io=1,859KB, bw=19,031B/s, iops=37, runt=100024msec [...]
> > 
> > mango:~# fio -readonly -name iops -rw=randread -bs=512  -runtime=100
> > -iodepth 32 -filename /dev/sda -ioengine  libaio -direct=1
> > iops: (groupid=0, jobs=1): err= 0: pid=10304
> > 
> >   read : io=2,726KB, bw=27,842B/s, iops=54, runt=100257msec
> > 
> > mango:~# hdparm -I /dev/sda | grep -i queue
> > 
> >         Queue depth: 32
> >         
> >            *    Native Command Queueing (NCQ)
> > 
> > > - 1.5 TB Western Digital, WDC WD15EADS-00P8B0
> > > - Pentium 4 at 2.80 GHz
> > - 4 GB RAM, 32-Bit Linux
> > - Linux Kernel 2.6.36
> > - fio 1.38-1
[…]
> >> It is a gigabit network and I think that the performance of the dd
> >> command proves it is not something silly like a cable fault (I have come
> >> across such faults elsewhere though)
> > 
> > What is the latency?
> 
> $ ping -s 1000 192.168.1.2
> PING 192.168.1.2 (192.168.1.2) 1000(1028) bytes of data.
> 1008 bytes from 192.168.1.2: icmp_req=1 ttl=64 time=0.307 ms
> 1008 bytes from 192.168.1.2: icmp_req=2 ttl=64 time=0.341 ms
> 1008 bytes from 192.168.1.2: icmp_req=3 ttl=64 time=0.336 ms

Seems to be fine.

> >>> Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM
> >>> SATA drives, but SATA drives are cheaper and thus you could -
> >>> depending on RAID level - increase IOPS by just using more drives.
> >> 
> >> I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
> >> in the Seagate `Constellation' enterprise drive range.  I need more
> >> space anyway, and I need to replace the drive that failed, so I have to
> >> spend some money anyway - I just want to throw it in the right direction
> >> (e.g. buying a drive, or if the cheap on-board SATA controller is a
> >> bottleneck or just extremely unsophisticated, I don't mind getting a
> >> dedicated controller)
> >> 
> >> For example, if I knew that the controller is simply not suitable with
> >> barriers, NFS, etc and that a $200 RAID card or even a $500 RAID card
> >> will guarantee better performance with my current kernel, I would buy
> >> that.  (However, I do want to use md RAID rather than a proprietary
> >> format, so any RAID card would be in JBOD mode)
> > 
> > > The point is: How much of the performance will arrive at NFS? I can't
> > > say yet.
> 
> My impression is that the faster performance of the USB disk was a red
> herring, and the problem really is just the nature of the NFS protocol
> and the way it is stricter about server-side caching (when sync is
> enabled) and consequently it needs more iops.

Yes, that seems to be the case here. It seems to be a small blocksize random 
I/O workload with heavy fsync() usage.

You could adapt /usr/share/doc/fio/examples/iometer-file-access-server to 
benchmark such a scenario. fsmark also simulates such a heavy fsync()-based 
workload quite well. I have packaged it for Debian, but it's still in the NEW 
queue. You can grab it from

http://people.teamix.net/~ms/debian/sid/

(32-Bit build, but easily buildable for amd64 as well)
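
For a quick first impression before setting up fio or fsmark, a small-block
random write workload with an fsync() after every write can be approximated
in a few lines of Python. This is only a minimal sketch and no replacement
for a real benchmark: the function name fsync_write_iops and its parameters
(200 writes, 4 KiB blocks, 4 MiB file) are made up for illustration.

```python
import os
import random
import tempfile
import time


def fsync_write_iops(directory, writes=200, block_size=4096, file_blocks=1024):
    """Write blocks at random offsets, calling fsync() after each write.

    Returns the achieved number of fsync'd writes per second.
    All default values are illustrative assumptions.
    """
    payload = os.urandom(block_size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Set the file size up front (sparse) so every random offset is valid.
        os.ftruncate(fd, block_size * file_blocks)
        os.fsync(fd)
        start = time.perf_counter()
        for _ in range(writes):
            offset = random.randrange(file_blocks) * block_size
            os.pwrite(fd, payload, offset)
            # Ask the kernel to push the write to stable storage before
            # continuing, similar to what a "sync" NFS export demands.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / elapsed


if __name__ == "__main__":
    rate = fsync_write_iops(tempfile.gettempdir())
    print("%.0f fsync'd writes per second" % rate)
```

Point the directory argument at the filesystem under test. The default
temporary directory is often a tmpfs, where fsync() costs next to nothing
and the number says little about the disks.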

> I've turned two more machines (a HP Z800 with SATA disk and a Lenovo
> X220 with SSD disk) into NFSv3 servers, repeated the same tests, and
> found similar performance on the Z800, but 20x faster on the SSD (which
> can support more IOPS)

Okay, then you want more IOPS.
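
As a rough guide to what a single rotating disk can deliver at queue depth 1
without caching, one can add the average seek time to the time for half a
rotation. The following is a back-of-the-envelope sketch: the function name
rotational_iops is made up, and the seek times are assumed typical values,
not measurements of any drive discussed in this thread.

```python
def rotational_iops(rpm, avg_seek_ms):
    """Rough ceiling for uncached random IOPS of one disk at queue depth 1.

    Each request costs one average seek plus, on average, half a rotation.
    """
    half_rotation_ms = 0.5 * 60000.0 / rpm
    return 1000.0 / (avg_seek_ms + half_rotation_ms)


if __name__ == "__main__":
    # (label, RPM, assumed average seek time in ms) - illustrative values only
    drives = [
        ("5400 RPM", 5400, 12.0),
        ("7200 RPM", 7200, 8.5),
        ("15000 RPM", 15000, 3.5),
    ]
    for label, rpm, seek_ms in drives:
        print("%-10s ~%.0f IOPS" % (label, rotational_iops(rpm, seek_ms)))
```

With these assumed seek times a 5400 RPM disk ends up well below 100 IOPS,
which is why around 200 IOPS from the USB disk points to caching rather than
real flushes. Short-stroking, NCQ and drive write caches all move the real
numbers away from this estimate.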

> > And wait I/O is quite high.
> > 
> > Thus it seems this workload can be faster with faster / more disks or a
> > RAID controller with battery (and disabling barriers / cache flushes).
> 
> You mean barrier=0,data=writeback?  Or just barrier=0,data=ordered?

I meant data=ordered. As mentioned by Andreas, data=journal could yield an 
improvement. I'd suggest trying to put the journal onto a different disk then, 
in order to avoid head seeks during writeout of journal data to its final 
location.

> In theory that sounds good, but in practice I understand it creates some
> different problems, eg:
> 
> - monitoring the battery, replacing it periodically
> 
> - batteries only hold the charge for a few hours, so if there is a power
> outage on a Sunday, someone tries to turn on the server on  Monday
> morning and the battery has died, cache is empty and disk is corrupt

Hmmm, from what I know there are NVRAM based controllers that can hold the 
cached data for several days.

> - some RAID controllers (e.g. HP SmartArray) insist on writing their
> metadata to all volumes - so you become locked in to the RAID vendor.  I
> prefer to just use RAID1 or RAID10 with Linux md onto the raw disks.  On
> some Adaptec controllers, `JBOD' mode allows md to access the disks
> directly, although I haven't verified that yet.

I see no reason why software RAID cannot be used with an NVRAM-based controller.
 
> I'm tempted to just put a UPS on the server and enable NFS `async' mode,
> and avoid running anything on the server that may cause a crash.

A UPS on the server won't make "async" safe. If the server crashes you can 
still lose data.

Ciao,
-- 
Martin Steigerwald - teamix GmbH - http://www.teamix.de
gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
