* External log size limitations
@ 2011-02-16 18:54 Andrew Klaassen
2011-02-17 0:32 ` Dave Chinner
0 siblings, 1 reply; 10+ messages in thread
From: Andrew Klaassen @ 2011-02-16 18:54 UTC (permalink / raw)
To: xfs
Hi all,
I found this document:
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&fname=/SGI_Admin/LX_XFS_AG/ch02.html
...which says that an XFS external log is limited to 128MB.
Is there any way to make that larger?
Goal: I'd love to try putting the external log on an SSD that could
sustain two or three minutes of steady full-throttle writing. 128MB
gives me less than a second worth of writes before my write speed slows
down to the underlying storage speed.
I'll do lots of benchmarks before rolling it out to make sure it
actually does help, of course. I just want to know if it's possible,
and how to do it if it is.
Thanks.
Andrew
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: External log size limitations
2011-02-16 18:54 External log size limitations Andrew Klaassen
@ 2011-02-17 0:32 ` Dave Chinner
2011-02-18 15:26 ` Andrew Klaassen
0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2011-02-17 0:32 UTC (permalink / raw)
To: Andrew Klaassen; +Cc: xfs
On Wed, Feb 16, 2011 at 01:54:47PM -0500, Andrew Klaassen wrote:
> Hi all,
>
> I found this document:
>
> http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=bks&fname=/SGI_Admin/LX_XFS_AG/ch02.html
>
> ...which says that an XFS external log is limited to 128MB.
>
> Is there any way to make that larger?
The limit is just under 2GB now - that document is a couple of years
out of date - so if you are running on anything more recent than a
~2.6.27 kernel, 2GB logs should work fine.
> Goal: I'd love to try putting the external log on an SSD that could
> sustain two or three minutes of steady full-throttle writing. 128MB
> gives me less than a second worth of writes before my write speed
> slows down to the underlying storage speed.
Data write speed or metadata write speed? What sort of write
patterns? Also, don't forget that data is not logged so increasing
the log size won't change the speed of data writeback.
As it is, 2GB is still not enough for preventing metadata writeback
for minutes if that is what you are trying to do. Even if you use
the new delaylog mount option - which reduces log traffic by an
order of magnitude for most non-synchronous workloads - log write
rates can be upwards of 30MB/s under concurrent metadata intensive
workloads....
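(For reference, delaylog is just a mount option; a typical invocation,
with hypothetical device names, would look like:

```shell
# External log device plus delaylog and a larger in-memory log buffer.
mount -t xfs -o logdev=/dev/ssd1,delaylog,logbsize=262144 /dev/md0 /data
```
)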
> I'll do lots of benchmarks before rolling it out to make sure it
> actually does help, of course. I just want to know if it's
> possible, and how to do it if it is.
If you want a log larger than 2GB, then there are a lot of code
changes needed in both kernel and userspace, as the log arithmetic is
all done via 32-bit integers and a lot of it is byte based.
As it is, there are significant scaling issues with logs of even 2GB
in size - log replay can take tens of minutes when a log full of
inode changes has to be replayed, and filling a 2GB log means you'll
probably have tens of gigabytes of dirty metadata in memory, so
response to memory shortages can cause IO storms and severe
interactivity problems, etc.
In general, I'm finding that a log size of around 512MB w/ delaylog
gives the best tradeoff between scalability, performance, memory
usage and relatively sane recovery times...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: External log size limitations
2011-02-17 0:32 ` Dave Chinner
@ 2011-02-18 15:26 ` Andrew Klaassen
2011-02-18 19:55 ` Stan Hoeppner
2011-02-20 21:14 ` Dave Chinner
0 siblings, 2 replies; 10+ messages in thread
From: Andrew Klaassen @ 2011-02-18 15:26 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Dave Chinner wrote:
> The limit is just under 2GB now - that document is a couple of years
> out of date - so if you are running on anything more recent than a
> ~2.6.27 kernel, 2GB logs should work fine.
Ah, good to know.
> Data write speed or metadata write speed? What sort of write
> patterns?
A couple of hundred nodes on a renderfarm doing mostly compositing with
some 3D. It's about 80/20 read/write. On the current system that we're
thinking of converting - an Exastore version 3 system - browsing the
filesystem becomes ridiculously slow when write loads become moderate,
which is why snappier metadata operations are attractive to us.
One thing I'm worried about, though, is moving from the Exastore's 64K
block size to the 4K Linux blocksize limitation. My quick calculation
says that that's going to reduce our throughput under random load (which
is what a renderfarm becomes with a couple of hundred nodes) from about
200MB/s to about 13MB/s with our 56x7200rpm disks. It's too bad those
large blocksize patches from a couple of years back didn't go through to
make this worry moot.
> Also, don't forget that data is not logged so increasing
> the log size won't change the speed of data writeback.
Yes, of course... that momentarily slipped my mind.
> As it is, 2GB is still not enough for preventing metadata writeback
> for minutes if that is what you are trying to do. Even if you use
> the new delaylog mount option - which reduces log traffic by an
> order of magnitude for most non-synchronous workloads - log write
> rates can be upwards of 30MB/s under concurrent metadata intensive
> workloads....
Is there a rule-of-thumb to convert number of files being written to log
write rates? We push a lot of data through, but most of the files are a
few megabytes in size instead of a few kilobytes.
> If you want a log larger than 2GB, then there are a lot of code
> changes needed in both kernel and userspace, as the log arithmetic is
> all done via 32-bit integers and a lot of it is byte based.
Good to know.
> As it is, there are significant scaling issues with logs of even 2GB
> in size - log replay can take tens of minutes when a log full of
> inode changes has to be replayed,
We've got a decent UPS, so unless we get kernel panics, those tens of
minutes for an occasional unexpected hard shutdown should mean less lost
production time than the drag of slower metadata operations all the time.
> filling a 2GB log means you'll
> probably have tens of gigabytes of dirty metadata in memory, so
> response to memory shortages can cause IO storms and severe
> interactivity problems, etc.
I assume that if we packed the server with 128GB of RAM we wouldn't have
to worry about that as much. But... short of that, would you have a
rule of thumb for log size to memory size? Could I expect reasonable
performance with a 2GB log and 32GB in the server? With 12GB in the server?
I know you'd have to mostly guess to make up a rule of thumb, but your
guesses would be a lot better than mine. :-)
> In general, I'm finding that a log size of around 512MB w/ delaylog
> gives the best tradeoff between scalability, performance, memory
> usage and relatively sane recovery times...
I'm excited about the delaylog and other improvements I'm seeing
entering the kernel, but I'm worried about stability. There seem to
have been a lot of bugfix patches and panic reports since 2.6.35 for XFS
to go along with the performance improvements, which makes me tempted to
stick to 2.6.34 until the dust settles and the kinks are worked out. If
I put the new XFS code on the server, will it stay up for a year or more
without any panics or crashes?
Thanks for your great feedback. This is one of the things that makes
open source awesome.
Andrew
* Re: External log size limitations
2011-02-18 15:26 ` Andrew Klaassen
@ 2011-02-18 19:55 ` Stan Hoeppner
2011-02-18 20:31 ` Andrew Klaassen
2011-02-20 21:14 ` Dave Chinner
1 sibling, 1 reply; 10+ messages in thread
From: Stan Hoeppner @ 2011-02-18 19:55 UTC (permalink / raw)
To: Andrew Klaassen; +Cc: xfs
Andrew Klaassen put forth on 2/18/2011 9:26 AM:
> A couple of hundred nodes on a renderfarm doing mostly compositing with
> some 3D. It's about 80/20 read/write. On the current system that we're
> thinking of converting - an Exastore version 3 system - browsing the
> filesystem becomes ridiculously slow when write loads become moderate,
> which is why snappier metadata operations are attractive to us.
I'm not familiar with Exanet, only that it was an Israeli company that
went belly up in late '09 early '10. Was the hardware provided by them?
Is it totally proprietary, or are you able to wipe the OS and install a
fresh copy of your preferred Linux distro and a recent kernel?
> One thing I'm worried about, though, is moving from the Exastore's 64K
> block size to the 4K Linux blocksize limitation. My quick calculation
> says that that's going to reduce our throughput under random load (which
> is what a renderfarm becomes with a couple of hundred nodes) from about
> 200MB/s to about 13MB/s with our 56x7200rpm disks. It's too bad those
> large blocksize patches from a couple of years back didn't go through to
> make this worry moot.
I'm not sure which block size you're referring to here. Are you
referring to the kernel page size or the filesystem block size? AFAIK,
the default Linux kernel page size is still 8 KiB although there has
been talk for some time WRT changing it to 4 KiB, but IIRC some are
hesitant due to stack overruns with the 4 KiB page size. Regardless,
the kernel page size isn't a factor WRT throughput to disk.
If you're referring to the latter, XFS has a block size configurable per
filesystem from 512 bytes to 64 KiB, with 4 KiB being the default. Make
your XFS filesystems with "-b size=65536" and you should be good to go.
Are those 56 drives configured as a single large RAID stripe? RAID 10
or RAID 6? Or are they split up into multiple smaller arrays? Hardware
or software RAID? I ask as it will allow us to give you the exact
mkfs.xfs command line you need to make your XFS filesystem(s) for
optimum performance.
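As a sketch, that geometry ends up in mkfs.xfs stripe options like these
(the stripe unit and width below are placeholders until we know your
actual layout):

```shell
# Placeholder geometry: 64 KiB hardware stripe unit across 6 data
# spindles per array; adjust su/sw to the real RAID configuration.
mkfs.xfs -d su=64k,sw=6 /dev/sdX
```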
> Is there a rule-of-thumb to convert number of files being written to log
> write rates? We push a lot of data through, but most of the files are a
> few megabytes in size instead of a few kilobytes.
They're actually kind of independent of one another. For instance, 'rm
-rf' on a 50k file directory structure won't touch a single file, only
metadata. So you have zero files being written but 50k log write
transactions (which delaylog will coalesce into fewer larger actual disk
writes). Typically, the data being written into the log is only a
fraction of the size of the files themselves, especially in your case
where most files are > 1MB in size, so the log bandwidth required for
"normal" file write operations is pretty low. If you're nervous about
it, simply install a small (40 GB) fast SSD in the server and put one
external journal log on it for each filesystem. That'll give you about
40-50k random 4k IOPS throughput for the journal logs. Combined with
delaylog I think this would thoroughly eliminate any metadata
performance issues.
> I assume that if we packed the server with 128GB of RAM we wouldn't have
> to worry about that as much. But... short of that, would you have a
> rule of thumb for log size to memory size? Could I expect reasonable
> performance with a 2GB log and 32GB in the server? With 12GB in the
> server?
The key to metadata performance isn't as much the size of log device but
the throughput. If you have huge write cache on your hardware RAID
controllers and are using internal logs, or if you use a local SSD for
external logs, I would think you don't need the logs to be really huge,
as you're able to push the tail very fast, especially in the case of a
locally attached (SATA) SSD. Write cache on a big SAN array may be very
fast, but you typically have an FC switch hop or two to traverse,
increasing latency. Latency with a locally attached SSD is about as low
as you can get, barring use of a ramdisk, which no sane person would
ever use for a filesystem journal.
> I'm excited about the delaylog and other improvements I'm seeing
> entering the kernel, but I'm worried about stability. There seem to
> have been a lot of bugfix patches and panic reports since 2.6.35 for XFS
> to go along with the performance improvements, which makes me tempted to
> stick to 2.6.34 until the dust settles and the kinks are worked out. If
> I put the new XFS code on the server, will it stay up for a year or more
> without any panics or crashes?
You're asking for a guarantee that no one can give you, or would dare
to. And this would have little to do with confidence in XFS, but the
sheer complexity of the Linux kernel, and not knowing exactly what
hardware you have. There could be a device driver bug in a newer kernel
that might panic your system. There's no way for us to know that kind
of thing, so, no guarantees. :(
WRT XFS, there were a number of patches up to 2.6.35.11 which address
the problems you mention above, but none in 2.6.36.4 or 2.6.37.1, all of
which are the currently available kernels at kernel.org. So, given that
the patches have slowed down dramatically recently and the bugs have
been squashed, WRT XFS, I think you should feel confident installing
2.6.37.1.
And, as always, install it on a test rig first and pound the daylights
out of it first with a test based on your actual real workload.
--
Stan
* Re: External log size limitations
2011-02-18 19:55 ` Stan Hoeppner
@ 2011-02-18 20:31 ` Andrew Klaassen
2011-02-19 3:53 ` Stan Hoeppner
0 siblings, 1 reply; 10+ messages in thread
From: Andrew Klaassen @ 2011-02-18 20:31 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: xfs
Stan Hoeppner wrote:
> I'm not familiar with Exanet, only that it was an Israeli company that
> went belly up in late '09 early '10. Was the hardware provided by them?
> Is it totally proprietary, or are you able to wipe the OS and install a
> fresh copy of your preferred Linux distro and a recent kernel?
It's IBM and LSI gear, so I'm crossing my fingers that a Linux install
will be relatively painless.
> I'm not sure which block size you're referring to here. Are you
> referring to the kernel page size or the filesystem block size? AFAIK,
> the default Linux kernel page size is still 8 KiB although there has
> been talk for some time WRT changing it to 4 KiB, but IIRC some are
> hesitant due to stack overruns with the 4 KiB page size. Regardless,
> the kernel page size isn't a factor WRT throughput to disk.
> If you're referring to the latter, XFS has a block size configurable per
> filesystem from 512 bytes to 64 KiB, with 4 KiB being the default. Make
> your XFS filesystems with "-b size=65536" and you should be good to go.
I thought that the filesystem block size was still limited to the kernel
page size, which is 4K on x86 systems.
http://oss.sgi.com/projects/xfs/
"The maximum filesystem block size is the page size of the kernel, which
is 4K on x86 architecture."
Is this no longer true? It would be awesome news if it wasn't.
My quick calculations were based on worst-case random read, which is
what we were seeing with the Exastore. They had a 64K blocksize * 48
disks * 70 seeks per second = 215 MB/s, which is exactly what we were
seeing under load. Under heavy random load, I'm worried that XFS has to
either thrash the disks with 4K reads and writes ~or~ introduce
unnecessary latency by doing read-combining and write-combining and/or
predictive elevator hanky-panky.
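Spelling out my arithmetic (a rough worst-case model that assumes every
seek returns exactly one filesystem block):

```python
# Worst-case random-read throughput: block_size * disks * seeks/sec.
# 48 disks at ~70 seeks/sec each, as observed on the Exastore.
def random_read_mbps(block_bytes, disks, seeks_per_sec):
    return block_bytes * disks * seeks_per_sec / 1_000_000

print(random_read_mbps(64_000, 48, 70))  # 64K blocks: ~215 MB/s
print(random_read_mbps(4_000, 48, 70))   # 4K blocks:  ~13 MB/s
```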
> Are those 56 drives configured as a single large RAID stripe? RAID 10
> or RAID 6? Or are they split up into multiple smaller arrays? Hardware
> or software RAID? I ask as it will allow us to give you the exact
> mkfs.xfs command line you need to make your XFS filesystem(s) for
> optimum performance.
I miscounted; it's 48 drives split into 6 hardware RAID-5 arrays.
> They're actually kind of independent of one another. For instance, 'rm
> -rf' on a 50k file directory structure won't touch a single file, only
> metadata. So you have zero files being written but 50k log write
> transactions (which delaylog will coalesce into fewer larger actual disk
> writes). Typically, the data being written into the log is only a
> fraction of the size of the files themselves, especially in your case
> where most files are > 1MB in size, so the log bandwidth required for
> "normal" file write operations is pretty low. If you're nervous about
> it, simply install a small (40 GB) fast SSD in the server and put one
> external journal log on it for each filesystem. That'll give you about
> 40-50k random 4k IOPS throughput for the journal logs. Combined with
> delaylog I think this would thoroughly eliminate any metadata
> performance issues.
Is there a way to monitor log operations to find out how much is being
used at a given time?
> The key to metadata performance isn't as much the size of log device but
> the throughput. If you have huge write cache on your hardware RAID
> controllers and are using internal logs, or if you use a local SSD for
> external logs, I would think you don't need the logs to be really huge,
> as you're able to push the tail very fast, especially in the case of a
> locally attached (SATA) SSD. Write cache on a big SAN array may be very
> fast, but you typically have an FC switch hop or two to traverse,
> increasing latency. Latency with a locally attached SSD is about as low
> as you can get, barring use of a ramdisk, which no sane person would
> ever use for a filesystem journal.
All the metadata eventually has to be written to the main array, so
doesn't that ultimately become the limiting factor on metadata
throughput under sustained load?
> You're asking for a guarantee that no one can give you, or would dare
> to. And this would have little to do with confidence in XFS, but the
> sheer complexity of the Linux kernel, and not knowing exactly what
> hardware you have. There could be a device driver bug in a newer kernel
> that might panic your system. There's no way for us to know that kind
> of thing, so, no guarantees. :(
Fair enough. Since I get yelled at if the server goes down, I'm drawn
to proven-track-record kernels as much as possible. (Well, okay... not
quite back to 2.4-proven-track-record kernels...)
> WRT XFS, there were a number of patches up to 2.6.35.11 which address
> the problems you mention above, but none in 2.6.36.4 or 2.6.37.1, all of
> which are the currently available kernels at kernel.org. So, given that
> the patches have slowed down dramatically recently and the bugs have
> been squashed, WRT XFS, I think you should feel confident installing
> 2.6.37.1.
>
> And, as always, install it on a test rig first and pound the daylights
> out of it first with a test based on your actual real workload.
Ayup...
Andrew
* Re: External log size limitations
2011-02-18 20:31 ` Andrew Klaassen
@ 2011-02-19 3:53 ` Stan Hoeppner
2011-02-19 10:02 ` Matthias Schniedermeyer
0 siblings, 1 reply; 10+ messages in thread
From: Stan Hoeppner @ 2011-02-19 3:53 UTC (permalink / raw)
To: xfs
First, sorry for the length. I tend to get windy talking shop. :)
Andrew Klaassen put forth on 2/18/2011 2:31 PM:
> It's IBM and LSI gear, so I'm crossing my fingers that a Linux install
> will be relatively painless.
Ahh, good. At least, so far it seems so. ;)
> I thought that the filesystem block size was still limited to the kernel
> page size, which is 4K on x86 systems.
>
> http://oss.sgi.com/projects/xfs/
>
> "The maximum filesystem block size is the page size of the kernel, which
> is 4K on x86 architecture."
>
> Is this no longer true? It would be awesome news if it wasn't.
My mistake. It would appear you are limited to the page size, which, as
I mentioned, is still 8 KiB for most distros. If you roll your own
kernel you can obviously tweak this, but to what end? The kernel team's
trend is toward smaller page sizes for greater memory usage efficiency.
> My quick calculations were based on worst-case random read, which is
> what we were seeing with the Exastore. They had a 64K blocksize * 48
> disks * 70 seeks per second = 215 MB/s, which is exactly what we were
> seeing under load. Under heavy random load, I'm worried that XFS has to
> either thrash the disks with 4K reads and writes ~or~ introduce
> unnecessary latency by doing read-combining and write-combining and/or
> predictive elevator hanky-panky.
I think you're giving too much weight to the filesystem block size WRT
random read IO throughput. Once you seek to the start of the file
location on disk, there is no more effort involved in reading the next
128 disk sectors whether the XFS block size is 8 sectors or 128 sectors.
And for files smaller than 64 KiB you're actually _decreasing_ your
seek performance when using 64 KiB blocks. For instance, if you have a
file that is 16 KiB and you have a 4 KiB block size, you'll head seek to
the start of the file, read 4 blocks (32 sectors), and then the head is
free to seek to the next request. With a 64 KiB block size, you seek to
the start of the 16 KiB file, then read 128 sectors, the last 96 sectors
being empty, or contents of another file, and you just wasted time
reading 96 sectors instead of allowing the head to seek to the next request.
So, using a smaller block size doesn't give you decreased performance
for large files, but it gives you better performance for small files.
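The arithmetic behind the 16 KiB file example, if it helps:

```python
import math

SECTOR = 512  # bytes per disk sector

def sectors_read(file_bytes, block_bytes):
    # A read touches whole filesystem blocks, so round the file size
    # up to a block multiple and convert to 512-byte sectors.
    blocks = math.ceil(file_bytes / block_bytes)
    return blocks * block_bytes // SECTOR

print(sectors_read(16 * 1024, 4 * 1024))   # 4 KiB blocks: 32 sectors
print(sectors_read(16 * 1024, 64 * 1024))  # 64 KiB blocks: 128 sectors, 96 wasted
```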
Also, 215 MB/s random IO seems absolutely horrible for 48 drives. Are
these 15k FC/SAS drives or 7.2k SATA drives? A single 15k drive should
sustain ~250-300 seeks/sec, a 7.2K drive about 100-150. 70 seeks/sec is
below 5.4K laptop drive territory.
Additionally, tweaking things like
/sys/block/[dev]/queue/max_hw_sectors_kb
/sys/block/[dev]/queue/nr_requests
and the elevator
/sys/block/[dev]/queue/scheduler
Will affect this performance more than the FS block size.
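For example (sdX is a placeholder, and the values are illustrative
starting points, not recommendations):

```shell
# Inspect and tune the block-layer queue for a hypothetical device sdX.
cat /sys/block/sdX/queue/max_hw_sectors_kb    # per-request hardware ceiling
echo 512      > /sys/block/sdX/queue/nr_requests
echo deadline > /sys/block/sdX/queue/scheduler
```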
> I miscounted; it's 48 drives split into 6 hardware RAID-5 arrays.
Eeuuww. RAID 5 is not known for stellar random read performance (nor
stellar anything performance, especially horrible for random writes).
Quite the opposite.
A suggestion. You'd lose about 38% of your current space if my math is
correct, but reconfiguring each of those as hardware RAID 10 instead of
5, and concatenating them with mdraid or LVM, should give you at
_minimum_ a 2:1 boost in sustained random read IOPS and bandwidth,
probably much, much more. Random writes would be much higher still.
If you can get by with that much less space I'd go with six 8 disk HW
RAID 10s in the new setup assuming you have 6 LSI HBAs. Whatever the
number of HBAs, create a RAID 10 on each with an equal number of drives
on each HBA. It doesn't make sense to have more than one RAID pack on a
single HBA--just slows it down considerably. If they did that with
these RAID 5s that could explain some of the performance issue. I'd set
the LSI RAID 10 stripe size to between 64KB - 256KB depending on your
average file size. I'd then concatenate the resulting 6 devices (or
however many there be) with mdadm or LVM (mdadm is probably a little
faster, LVM more flexible).
Then when creating your XFS filesystem, specify agcount=48 or
(agcount=#HBAs*8) in this case, which will get you 8 allocation groups
per HW array, in essence 2 AGs per stripe count spindle--8 disks, 4
striped mirror pairs, 4 stripe spindles. This should get you the
parallelism you need for high performance multiuser random IO workloads.
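A sketch of that layout, with hypothetical device names (six HW RAID 10
arrays exported as sdb through sdg):

```shell
# Concatenate the six hardware RAID 10 arrays (no software striping),
# then let XFS drive the parallelism with 48 allocation groups.
mdadm --create /dev/md0 --level=linear --raid-devices=6 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
mkfs.xfs -d agcount=48 /dev/md0
```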
This all assumes a highly loaded server with a lot of access to multiple
different files. If your access pattern is one heavy hitter app against
only a few big files, getting parallelism via lots of allocation groups
on concatenated storage may not be the way to go. In that case we'd
need to go with multiple layer striping, with HW RAID 10 and software
RAID 0 across them.
I didn't recommend this because trying to get average size files broken
up into chunks that fit neatly across a layered stripe is almost
impossible, and you end up with weird allocation patterns on disk,
wasted space issues, etc. I think it's better to use smallish HW
stripes, no SW stripes, in an instance like this, and allow XFS to drive
the parallelism via allocation groups. This yields better file layout
on disk and better space utilization.
In addition, using concatenation, as we recently learned from an OP who
went through it (unfortunately), with this setup you can lose an entire
hardware array and the FS can keep chugging along after a repair. You
simply lose any files on the dead array.
> Is there a way to monitor log operations to find out how much is being
> used at a given time?
Point in time? Probably not. I'm sure there's a counter somewhere but
I'm not familiar with it. What you should be concerned with isn't
necessarily how much of the journal log is being used at any point in
time, but how fast the data is moving through the log. This is why the
speed of the log device is critical, and the size is not. Recall that
the max log size is 2GB.
> All the metadata eventually has to be written to the main array, so
> doesn't that ultimately become the limiting factor on metadata
> throughput under sustained load?
The answer is: it depends on the workload. Add another "depends" when
using delaylog. For example, a given directory inode may be modified
multiple times during a very short period of time. 'rm -rf' on a huge
directory is a good example of this. A huge number of modifications to
the directory are performed, but with delaylog they will be consolidated
and coalesced into a single or a few actual writes into the journal and
filesystem instead of many thousands of writes. These types of
operations are historically where the metadata bottleneck lurked. If
you simply have 1000 users hitting a fileserver and each user writes a
file every minute or so, you'll never see a metadata bottleneck. But if
one user decides to delete a directory with 100k files in it,
then you have a metadata bottleneck - at least, if you're not using
delaylog. If you are using it you won't see much of a bottleneck.
Although you'll see some pretty high CPU usage for a 100k file delete
operation. But the load on the on disk journal log will be relatively
light.
Please keep us posted. I'm really interested to see what you end up
doing with this and how it performs afterward.
--
Stan
* Re: External log size limitations
2011-02-19 3:53 ` Stan Hoeppner
@ 2011-02-19 10:02 ` Matthias Schniedermeyer
2011-02-19 20:33 ` Stan Hoeppner
0 siblings, 1 reply; 10+ messages in thread
From: Matthias Schniedermeyer @ 2011-02-19 10:02 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: xfs
On 18.02.2011 21:53, Stan Hoeppner wrote:
> First, sorry for the length. I tend to get windy talking shop. :)
>
> Andrew Klaassen put forth on 2/18/2011 2:31 PM:
>
> > It's IBM and LSI gear, so I'm crossing my fingers that a Linux install
> > will be relatively painless.
>
> Ahh, good. At least, so far it seems so. ;)
>
> > I thought that the filesystem block size was still limited to the kernel
> > page size, which is 4K on x86 systems.
> >
> > http://oss.sgi.com/projects/xfs/
> >
> > "The maximum filesystem block size is the page size of the kernel, which
> > is 4K on x86 architecture."
> >
> > Is this no longer true? It would be awesome news if it wasn't.
>
> My mistake. It would appear you are limited to the page size, which, as
> I mentioned, is still 8 KiB for most distros.
You're confusing that with the STACK size.
The page size is, and has always been, 4 KiB (on x86).
The only exception is huge pages, and while I grepped for
huge pages I found this nice little paragraph in:
Documentation/vm/hugetlbpage.txt (current git version on its way to 2.6.38)
- snip -
The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel. This support is built on top of multiple page size support
that is provided by most modern architectures. For example, i386
architecture supports 4K and 4M (2M in PAE mode) page sizes, ia64
architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
translations. Typically this is a very scarce resource on processor.
Operating systems try to make best use of limited number of TLB resources.
This optimization is more critical now as bigger and bigger physical memories
(several GBs) are more readily available.
- snip -
See you
--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.
* Re: External log size limitations
2011-02-19 10:02 ` Matthias Schniedermeyer
@ 2011-02-19 20:33 ` Stan Hoeppner
2011-02-19 21:47 ` Emmanuel Florac
0 siblings, 1 reply; 10+ messages in thread
From: Stan Hoeppner @ 2011-02-19 20:33 UTC (permalink / raw)
To: xfs
Matthias Schniedermeyer put forth on 2/19/2011 4:02 AM:
> On 18.02.2011 21:53, Stan Hoeppner wrote:
>> First, sorry for the length. I tend to get windy talking shop. :)
>>
>> Andrew Klaassen put forth on 2/18/2011 2:31 PM:
>>
>>> It's IBM and LSI gear, so I'm crossing my fingers that a Linux install
>>> will be relatively painless.
>>
>> Ahh, good. At least, so far it seems so. ;)
>>
>>> I thought that the filesystem block size was still limited to the kernel
>>> page size, which is 4K on x86 systems.
>>>
>>> http://oss.sgi.com/projects/xfs/
>>>
>>> "The maximum filesystem block size is the page size of the kernel, which
>>> is 4K on x86 architecture."
>>>
>>> Is this no longer true? It would be awesome news if it wasn't.
>>
>> My mistake. It would appear you are limited to the page size, which, as
>> I mentioned, is still 8 KiB for most distros.
>
> You're confusing that with the STACK size.
Yes, I did. However...
> The page-size is, and has always been, 4 KiB (on X86).
To bring this back around to the OP's original question, do you agree or
disagree with my assertion that a 64 KiB XFS block size will yield
little if any advantage over a 4 KiB block size, and may in fact have
some disadvantages, specifically with small file random IO?
--
Stan
* Re: External log size limitations
2011-02-19 20:33 ` Stan Hoeppner
@ 2011-02-19 21:47 ` Emmanuel Florac
0 siblings, 0 replies; 10+ messages in thread
From: Emmanuel Florac @ 2011-02-19 21:47 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: xfs
Le Sat, 19 Feb 2011 14:33:58 -0600 vous écriviez:
> To bring this back around to the OP's original question, do you agree
> or disagree with my assertion that a 64 KiB XFS block size will yield
> little if any advantage over a 4 KiB block size, and may in fact have
> some disadvantages, specifically with small file random IO?
Undoubtedly. The very big block size of Exastore is probably due to its
parallel cluster configuration; all the parallel cluster filesystems I
know of (Lustre, PVFS2, CEPH, Isilon, etc.) use 64K or bigger blocks.
The Exastore's big block size is a constraint imposed by its
architecture, not a desirable improvement. In fact, Exanet suffered from
many performance problems, because general-purpose parallel clusters are hard.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: External log size limitations
2011-02-18 15:26 ` Andrew Klaassen
2011-02-18 19:55 ` Stan Hoeppner
@ 2011-02-20 21:14 ` Dave Chinner
1 sibling, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2011-02-20 21:14 UTC (permalink / raw)
To: Andrew Klaassen; +Cc: xfs
On Fri, Feb 18, 2011 at 10:26:37AM -0500, Andrew Klaassen wrote:
> Dave Chinner wrote:
> >The limit is just under 2GB now - that document is a couple of years
> >out of date - so if you are running on anything more recent than a
> >~2.6.27 kernel 2GB logs should work fine.
>
> Ah, good to know.
>
> >Data write speed or metadata write speed? What sort of write
> >patterns?
>
> A couple of hundred nodes on a renderfarm doing mostly compositing
> with some 3D. It's about 80/20 read/write. On the current system
> that we're thinking of converting - an Exastore version 3 system -
> browsing the filesystem becomes ridiculously slow when write loads
> become moderate, which is why snappier metadata operations are
> attractive to us.
OK, but I don't think that the metadata operations are becoming slow
because you are doing write operations - they are likely to be slow
due to doing _lots of IO_. That won't change with XFS....
> One thing I'm worried about, though, is moving from the Exastore's
> 64K block size to the 4K Linux blocksize limitation. My quick
> calculation says that that's going to reduce our throughput under
> random load (which is what a renderfarm becomes with a couple of
> hundred nodes) from about 200MB/s to about 13MB/s with our
> 56x7200rpm disks. It's too bad those large blocksize patches from a
> couple of years back didn't go through to make this worry moot.
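
[Editorial note: the figures quoted above follow from simple IOPS
arithmetic. A sketch, assuming roughly 55 random IOPS per 7200rpm
spindle (an assumed figure, not a measurement) and ignoring RAID and
caching effects:]

```python
# Back-of-envelope: under fully random load each IO moves one filesystem
# block, so aggregate throughput ~= spindles * IOPS * block size.
# 55 random IOPS per 7200rpm spindle is an assumption for illustration.
SPINDLES = 56
IOPS_PER_SPINDLE = 55

def random_throughput_mb_s(block_size_bytes):
    """Aggregate random-IO throughput of the array in MB/s."""
    return SPINDLES * IOPS_PER_SPINDLE * block_size_bytes / 1e6

print(round(random_throughput_mb_s(64 * 1024)))  # 64 KiB blocks: ~202 MB/s
print(round(random_throughput_mb_s(4 * 1024)))   # 4 KiB blocks: ~13 MB/s
```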
How much data is actually being changed out of each of those 64k
blocks? Last time I analysed a compositing application, it was
reading full frames and textures, then writing only the modified
portions of the frames back to the server. Because these were small
sections of the frames, it was typically writing only a few KB at a
time per IO, with several write IOs and seeks for each region it was
working on. It was completely small random write bound, and while
XFS does OK at that sort of workload, it's not optimised for it like
WAFL, ZFS or BTRFS...
IOWs the write bandwidth of XFS will be determined by how big these
IOs are, not the block size. It may be faster doing smaller IOs
because the 64k block size would probably require read-modify-write
cycles for this workload. XFS will still max out the disk IOPS under
this workload, so don't expect cold-cache metadata operations to be
miraculously faster than on your current system...
> >As it is, 2GB is still not enough for preventing metadata writeback
> >for minutes if that is what you are trying to do. Even if you use
> >the new delaylog mount option - which reduces log traffic by an
> >order of magnitude for most non-synchronous workloads - log write
> >rates can be upwards of 30MB/s under concurrent metadata intensive
> >workloads....
>
> Is there a rule-of-thumb to convert number of files being written to
> log write rates? We push a lot of data through, but most of the
> files are a few megabytes in size instead of a few kilobytes.
Not really. Run your workload and measure it - XFS exports stats
that include the amount written to the journal. See:
http://xfs.org/index.php/Runtime_Stats
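
[Editorial note: a minimal sketch of sampling the journal write rate
from /proc/fs/xfs/stat. The field layout of the "log" line and the
512-byte log-block unit are assumptions to verify against the stats
page above:]

```python
# Sketch: sample the XFS journal write rate from /proc/fs/xfs/stat.
# Assumed: the "log" line's second numeric field is the cumulative count
# of blocks written to the log, and a log block is 512 bytes.
import time

def log_blocks(stat_text):
    """Return the cumulative log-blocks counter from xfs stat output."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "log":
            return int(fields[2])
    raise ValueError("no 'log' line found")

def journal_write_rate(path="/proc/fs/xfs/stat", interval=60):
    """Average journal write rate in bytes/sec over `interval` seconds."""
    with open(path) as f:
        before = log_blocks(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = log_blocks(f.read())
    return (after - before) * 512 / interval
```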
> > filling a 2GB log means you'll
> >probably have tens of gigabytes of dirty metadata in memory, so
> >response to memory shortages can cause IO storms and severe
> >interactivity problems, etc.
>
> I assume that if we packed the server with 128GB of RAM we wouldn't
> have to worry about that as much. But... short of that, would you
> have a rule of thumb for log size to memory size? Could I expect
> reasonable performance with a 2GB log and 32GB in the server? With
> 12GB in the server?
<shrug>
It's all dependent on your workload. Test it and see...
> I know you'd have to mostly guess to make up a rule of thumb, but
> your guesses would be a lot better than mine. :-)
>
> >In general, I'm finding that a log size of around 512MB w/ delaylog
> >gives the best tradeoff between scalability, performance, memory
> >usage and relatively sane recovery times...
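
[Editorial note: a command sketch putting that recommendation together.
Device names and mount point are placeholders, and the delaylog mount
option needs a 2.6.35+ kernel:]

```shell
# Create an XFS filesystem with a 512MB external log on an SSD partition
# (/dev/ssd1 and /dev/data1 are placeholder device names), then mount
# with delayed logging enabled.
mkfs.xfs -l logdev=/dev/ssd1,size=512m /dev/data1
mount -t xfs -o logdev=/dev/ssd1,delaylog /dev/data1 /mnt/data
```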
>
> I'm excited about the delaylog and other improvements I'm seeing
> entering the kernel, but I'm worried about stability. There seem to
> have been a lot of bugfix patches and panic reports since 2.6.35 for
> XFS to go along with the performance improvements, which makes me
> tempted to stick to 2.6.34 until the dust settles and the kinks are
> worked out. If I put the new XFS code on the server, will it stay
> up for a year or more without any panics or crashes?
If you are concerned about stability under heavy load in production
environments, then you should be running a well-tested environment
such as RHEL or SLES. The latest and greatest mainline kernel is not
for you....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
Thread overview: 10+ messages
-- links below jump to the message on this page --
2011-02-16 18:54 External log size limitations Andrew Klaassen
2011-02-17 0:32 ` Dave Chinner
2011-02-18 15:26 ` Andrew Klaassen
2011-02-18 19:55 ` Stan Hoeppner
2011-02-18 20:31 ` Andrew Klaassen
2011-02-19 3:53 ` Stan Hoeppner
2011-02-19 10:02 ` Matthias Schniedermeyer
2011-02-19 20:33 ` Stan Hoeppner
2011-02-19 21:47 ` Emmanuel Florac
2011-02-20 21:14 ` Dave Chinner