* Suggested XFS setup/options for 10TB file system w/ 18-20M files.
@ 2017-10-02 13:14 R. Jason Adams
2017-10-02 13:36 ` Eric Sandeen
2017-10-02 23:14 ` Dave Chinner
From: R. Jason Adams @ 2017-10-02 13:14 UTC (permalink / raw)
To: linux-xfs
Hello,
I have a use case where I'm writing ~500KB (avg size) files to a 10TB XFS file system. Each system has 36 of these 10TB drives.
The application opens the file, writes the data (single call), and closes the file. In addition there are a few lines added to the extended attributes. The filesystem ends up with 18 to 20 million files when the drive is full. The files are currently spread over 128x128 directories using a hash of the filename.
The format command I'm using:
mkfs.xfs -f -i size=1024 ${DRIVE}
Mount options:
rw,noatime,attr2,inode64,allocsize=2048k,logbufs=8,logbsize=256k,noquota
As the drive is filling, the first few % of the drive seems fine. Using iostat the avgrq-sz is close to the average file size. What I'm noticing is as the drive starts to fill (say around 5-10%) the reads start increasing (r/s in iostat). In addition, the avgrq-sz starts to decrease. Pretty soon the r/s can be 1/3 to 1/2 as many as our w/s. At first we thought this was related to using extended attributes, but disabling that didn’t make a difference at all.
Considering I know the app isn’t making any read requests, I’m guessing this is related to updating metadata etc. Any guidance on how to resolve/reduce this? For example, would a different directory structure help (more files in fewer directories)?
Thanks,
R. Jason Adams
* Re: Suggested XFS setup/options for 10TB file system w/ 18-20M files.
From: Eric Sandeen @ 2017-10-02 13:36 UTC (permalink / raw)
To: R. Jason Adams, linux-xfs
On 10/2/17 8:14 AM, R. Jason Adams wrote:
> Hello,
>
> I have a use case where I'm writing ~500KB (avg size) files to a 10TB XFS file system. Each system has 36 of these 10TB drives.
On what version of kernel & what version of xfsprogs?
> The application opens the file, writes the data (single call), and closes the file. In addition there are a few lines added to the extended attributes. The filesystem ends up with 18 to 20 million files when the drive is full. The files are currently spread over 128x128 directories using a hash of the filename.
It's not uncommon for application filename hashing like this to be less efficient than the internal xfs directory algorithms, FWIW.
> The format command I'm using:
>
> mkfs.xfs -f -i size=1024 ${DRIVE}
Why 1k inodes?
> Mount options:
>
> rw,noatime,attr2,inode64,allocsize=2048k,logbufs=8,logbsize=256k,noquota
Why all these options?
> As the drive is filling, the first few % of the drive seems fine. Using iostat the avgrq-sz is close to the average file size. What I'm noticing is as the drive starts to fill (say around 5-10%) the reads start increasing (r/s in iostat). In addition, the avgrq-sz starts to decrease. Pretty soon the r/s can be 1/3 to 1/2 as many as our w/s. At first we thought this was related to using extended attributes, but disabling that didn’t make a difference at all.
>
> Considering I know the app isn’t making any read requests, I’m guessing this is related to updating metadata etc. Any guidance on how to resolve/reduce this? For example, would a different directory structure help (more files in fewer directories)?
Perhaps it's taking time reading through a large custom-hashed directory tree? I don't know what that custom directory layout might look like.
http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
Have you tried starting with defaults, and working your way up from there (if needed)?
Thanks,
-Eric
> Thanks,
> R. Jason Adams
* Re: Suggested XFS setup/options for 10TB file system w/ 18-20M files.
From: R. Jason Adams @ 2017-10-02 13:49 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-xfs
> On what version of kernel & what version of xfsprogs?
We’re on CentOS 7:
3.10.0-693.2.2.el7.x86_64
xfsprogs.x86_64 4.5.0-12.el7
>> The application opens the file, writes the data (single call), and closes the file. In addition there are a few lines added to the extended attributes. The filesystem ends up with 18 to 20 million files when the drive is full. The files are currently spread over 128x128 directories using a hash of the filename.
>
> It's not uncommon for application filename hashing like this to be less efficient than the internal xfs directory algorithms, FWIW.
Good to know. We originally had 256x256 and changed to 128x128 to see if it would help. I figured 18M files in 1 directory wasn’t ideal though.
>> The format command I'm using:
>>
>> mkfs.xfs -f -i size=1024 ${DRIVE}
>
> Why 1k inodes?
Our extended attributes average ~200B .. figured a little extra room wouldn’t hurt?
Example:
getfattr -d 3c75666a3279623367406b79633479346c777a2e6c706f7471696e3e
# file: 3c75666a3279623367406b79633479346c777a2e6c706f7471696e3e
user.offset="682"
user.crc="1911595230"
user.date="1506918540"
user.id="f97800a5-66cd-4cb1-9a95-796ae0e8871e"
user.inserted="1506918595"
user.size="793169"
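(For reference, a rough byte count of the raw names and values in that listing can be done with a short awk one-liner; this is a sketch and ignores XFS's per-entry attr headers, so the true on-disk cost is somewhat higher:)

```shell
# Sum name and value lengths from getfattr -d style output; the "- 2"
# drops the surrounding quotes on each value.
xattr_bytes() {
    awk -F= '/^user\./ { t += length($1) + length($2) - 2 } END { print t }'
}
xattr_bytes <<'EOF'
user.offset="682"
user.crc="1911595230"
user.date="1506918540"
user.id="f97800a5-66cd-4cb1-9a95-796ae0e8871e"
user.inserted="1506918595"
user.size="793169"
EOF
```

That counts ~132 raw bytes here, in line with the ~200B average mentioned above once attr headers are added.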
>> Mount options:
>>
>> rw,noatime,attr2,inode64,allocsize=2048k,logbufs=8,logbsize=256k,noquota
>
> Why all these options?
Started with defaults, then began adding options trying to resolve the issue ;)
> Perhaps it's taking time reading through a large custom-hashed directory tree? I don't know what that custom directory layout might look like.
It’s currently 128 dirs, each with 128 in them.
> Have you tried starting with defaults, and working your way up from there (if needed?)
Yep. I’m so used to XFS “just working” that I started trying a lot of options after searching for solutions. Lots of suggestions out there. ;)
-R. Jason Adams
* Re: Suggested XFS setup/options for 10TB file system w/ 18-20M files.
From: R. Jason Adams @ 2017-10-02 14:10 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-xfs
> On Oct 2, 2017, at 9:49 AM, R. Jason Adams <rjasonadams@gmail.com> wrote:
>
>> Why 1k inodes?
>
> Our extended attributes average ~200B .. figured a little extra room wouldn’t hurt?
Hrm.. this has me wondering: is a downside of the larger inode that fewer of them can be cached in memory?
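(To put that worry in numbers, a back-of-envelope sketch; the 20M inode count is the figure from the original mail, and the helper name is just for illustration:)

```shell
# Total on-disk inode space at each inode size; the same amount of RAM
# caches half as many 1KiB inodes as 512B ones.
inode_space_gib() {  # args: inode_count inode_size_bytes
    echo $(( $1 * $2 / 1024 / 1024 / 1024 ))
}
inode_space_gib 20000000 1024   # ~19 GiB of inode buffers at 1KiB inodes
inode_space_gib 20000000 512    # ~9 GiB at the 512B default
```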
Thanks,
R. Jason Adams
* Re: Suggested XFS setup/options for 10TB file system w/ 18-20M files.
From: Eric Sandeen @ 2017-10-02 14:12 UTC (permalink / raw)
To: R. Jason Adams; +Cc: linux-xfs
On 10/2/17 8:49 AM, R. Jason Adams wrote:
>> On what version of kernel & what version of xfsprogs?
>
> We’re on CentOS 7:
>
> 3.10.0-693.2.2.el7.x86_64
> xfsprogs.x86_64 4.5.0-12.el7
Ok. At least we know inode64 is on by default.
>>> The application opens the file, writes the data (single call), and closes the file. In addition there are a few lines added to the extended attributes. The filesystem ends up with 18 to 20 million files when the drive is full. The files are currently spread over 128x128 directories using a hash of the filename.
>>
>> It's not uncommon for application filename hashing like this to be less efficient than the internal xfs directory algorithms, FWIW.
>
> Good to know. We originally had 256x256 and changed to 128x128 to see if it would help. I figured 18M files in 1 directory wasn’t ideal though.
I'd suggest figuring less, testing more. Maybe, maybe not.
>
>>> The format command I'm using:
>>>
>>> mkfs.xfs -f -i size=1024 ${DRIVE}
>>
>> Why 1k inodes?
>
> Our extended attributes average ~200B .. figured a little extra room wouldn’t hurt?
Don't figure, measure... ;)
>
> Example:
>
> getfattr -d 3c75666a3279623367406b79633479346c777a2e6c706f7471696e3e
>
> # file: 3c75666a3279623367406b79633479346c777a2e6c706f7471696e3e
> user.offset="682"
> user.crc="1911595230"
> user.date="1506918540"
> user.id="f97800a5-66cd-4cb1-9a95-796ae0e8871e"
> user.inserted="1506918595"
> user.size="793169"
That example fits in a default 512 byte inode
without trouble, FWIW:
$ getfattr -d file
# file: file
user.crc="1911595230"
user.date="1506918540"
user.id="f97800a5-66cd-4cb1-9a95-796ae0e8871e"
user.offset="682"
user.size="793169"
$ xfs_bmap -a file
file: no extents
If you have larger than needed inodes, that's some
amount of unnecessary overhead right there.
>
>>> Mount options:
>>>
>>> rw,noatime,attr2,inode64,allocsize=2048k,logbufs=8,logbsize=256k,noquota
>>
>> Why all these options?
>
> Started with defaults.. started adding options trying to resolve the issue ;)
Ok, too much shooting from the hip. Most of those are defaults already...
unless you have a specific reason to change defaults, generally best
not to try random things without understanding.
>
>> Perhaps it's taking time reading through a large custom-hashed directory tree? I don't know what that custom directory layout might look like.
>
> It’s currently 128 dirs, each with 128 in them.
>
>
>> Have you tried starting with defaults, and working your way up from there (if needed?)
>
> Yep. I’m so used to XFS “just working” that I started trying a lot of
> options after searching for solutions. Lots of suggestions out there.
> ;)
Yeah, everybody's an expert. ;)
If you want to figure out where your reads are coming from, you could use
blktrace to get the read locations, and use xfs_db to work your way back
to what metadata is being read. Or, possibly using tracepoints would be
enlightening.
...now I'm going to break my own rule and make a suggestion without
sufficient analysis;
if you're allocating a lot of inodes, turning on the free inode btree at
mkfs time might help (mkfs.xfs -m finobt=1)
-Eric
>
> -R. Jason Adams
* Re: Suggested XFS setup/options for 10TB file system w/ 18-20M files.
From: Dave Chinner @ 2017-10-02 23:14 UTC (permalink / raw)
To: R. Jason Adams; +Cc: linux-xfs
On Mon, Oct 02, 2017 at 09:14:07AM -0400, R. Jason Adams wrote:
> Hello,
>
> I have a use case where I'm writing ~500KB (avg size) files to a
> 10TB XFS file system. Each system has 36 of these 10TB drives.
>
> The application opens the file, writes the data (single call), and
> closes the file. In addition there are a few lines added to the
> extended attributes. The filesystem ends up with 18 to 20 million
> files when the drive is full. The files are currently spread over
> 128x128 directories using a hash of the filename.
Eric already mentioned it, but hashing directories in userspace is
only necessary to generate sufficient parallelism for the
application's file create/unlink needs.
You're using 10TB drives, so they'll have 10 AGs, so each filesystem
can be running 10 concurrent file creates/unlinks. Hence having
128x128 = 16384 directories and so ~1000 files per directory is
splitting things way too fine.
Read the "Directory block size" section here:
https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/tree/admin/XFS_Performance_Tuning/filesystem_tunables.asciidoc
Summary:
.Recommended maximum number of directory entries for directory block
sizes
[header]
|=====
| Directory block size | Max. entries (read-heavy) | Max. entries (write-heavy)
| 4 KB | 100000-200000 | 1000000-2000000
| 16 KB | 100000-1000000 | 1000000-10000000
| 64 KB | >1000000 | >10000000
|=====
With 4k directory block size and your write-heavy workload, you
could get away with just 10 directories. However, it'd probably be
better to use a single-level, 100-directory-wide hash to bring it
down to less than 200k files per directory....
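(As a sketch of that single-level layout — the `bucket` helper and the use of `cksum` as the hash are illustrative choices, not from the thread:)

```shell
# Map a filename to one of 100 top-level directories with a stable hash.
# Assumes POSIX cksum; any stable hash modulo 100 works the same way.
bucket() {
    printf '%02d' $(( $(printf '%s' "$1" | cksum | cut -d' ' -f1) % 100 ))
}
# e.g. store "$f" under "$(bucket "$f")/$f"
```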
> The format command I'm using:
>
> mkfs.xfs -f -i size=1024 ${DRIVE}
Small files should be a single extent, so there's heaps of room for
a 200 byte xattr in the inode. Using 512 byte inodes will halve
memory demand for caching inode buffers....
> Mount options:
>
> rw,noatime,attr2,inode64,allocsize=2048k,logbufs=8,logbsize=256k,noquota
You probably don't need the allocsize mount option. It turns off the
delalloc autosizing code and prevents tight packing of small
write-once files.
In general, use the defaults and don't add anything extra unless you
know it solves a specific problem you've witnessed in testing...
> As the drive is filling, the first few % of the drive seems fine.
> Using iostat the avgrq-sz is close to the average file size. What
> I'm noticing is as the drive starts to fill (say around 5-10%) the
> reads start increasing (r/s in iostat). In addition, the avgrq-sz
> starts to decrease. Pretty soon the r/s can be 1/3 to 1/2 as many
> as our w/s.
Most likely going to be metadata writeback of inode buffers
requiring RMW based on experience with gluster and ceph having
exactly the same problems. Use blktrace to identify what the reads
are, and see if those same blocks are written later on. An IO marked
with "M" is a metadata IO. Post the blktrace output of the bits you
find relevant.
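(One possible capture-and-summarize pipeline for this — the device name is a placeholder and the helper is hypothetical; field positions assume blkparse's default output format:)

```shell
# Capture first (replace /dev/sdb with the busy drive), e.g.:
#   blktrace -d /dev/sdb -a read -a write -o - | blkparse -i - > trace.txt
# Then list the most-read sectors from completed read IOs; in the raw
# trace, an 'M' in the RWBS column flags metadata IO.
top_read_sectors() {
    awk '$6 == "C" && $7 ~ /R/ { print $8 }' "$1" \
        | sort | uniq -c | sort -rn | head
}
```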
FWIW, how much RAM do you have in the system, and what does 'echo
200 > /proc/sys/fs/xfs/xfssyncd_centisecs' do to the behaviour?
> At first we thought this was related to using extended
> attributes, but disabling that didn’t make a difference at
> all.
>
> Considering I know the app isn’t making any read request,
> I’m guessing this is related to updating metadata etc.
Not necessarily. The page cache could be doing RMW cycles if the
write sizes are not page aligned...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Suggested XFS setup/options for 10TB file system w/ 18-20M files.
From: R. Jason Adams @ 2017-10-03 18:10 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
>
> With 4k directory block size and your write heavy workload, you
> could get away with just 10 directories. However, it'd probably be
> better to use a single level 100-directory wide hash to bring to
> down to less than 200k files per directory….
Moved over to single level with 100 directories.
> Small files should be a single extent, so there's heaps of room for
> a 200 byte xattr in the inode. using 512 byte inodes will half
> memory demand for caching inode buffers….
Moved to 512 byte inodes.
> In general, use the defaults and don't add anything extra unless you
> know it solves a specific problem you've witnessed in testing…
Moved to the defaults.
>
> Most likely going to be metadata writeback of inode buffers
> requiring RMW based on experience with gluster and ceph having
> exactly the same problems. Use blktrace to identify what the reads
> are, and see if those same blocks are written later on. An io marked
> a "M" is a metadata IO. Post the blktrace output of the bits you
> find relevant.
Reformatted the drive and it's refilling. With the changes suggested (100 dirs, 512-byte inodes, default options) it already seems better. We’re currently at 6% full and the reads are quite a bit lower than they were before at similar fullness. One thing I’m noticing in Grafana: the read requests/s keep increasing (up to ~8/s) for around 15 minutes, then drop down to ~1/s for 10-15 minutes.. then over the next 15 minutes they build back up.. etc
> FWIW, how much RAM do you have in the system, and what does 'echo
> 200 > /proc/sys/fs/xfs/xfssyncd_centisecs' do to the behaviour?
System has 24G of RAM. I’m guessing a move to 96 or 192G would help a lot.. in the end the system will have 36 of these 10TB drives.
I want to thank you and Eric for the time you’ve taken to help. Feels good to make some progress on this issue.
-R. Jason Adams
* Re: Suggested XFS setup/options for 10TB file system w/ 18-20M files.
From: Dave Chinner @ 2017-10-03 20:32 UTC (permalink / raw)
To: R. Jason Adams; +Cc: linux-xfs
On Tue, Oct 03, 2017 at 02:10:57PM -0400, R. Jason Adams wrote:
> Reformatted the drive and it's refilling. With the changes
> suggested (100 dirs, 512-byte inodes, defaults) it already seems better.
> We’re currently at 6% full and the reads are quite a bit less
> than they were before at similar fullness.
Ok, that's good - we're making progress :)
> One thing I’m
> noticing in Grafana, the read request/s seem to keep increasing
> (up to ~8/s) for around 15 minutes, then they drop down to ~1/s for
> 10-15 minutes.. then over the next 15 minutes they build back up..
> etc
I'm still curious as to what the reads are - can you run a blktrace
to try to capture some of them? That will give us a better idea of
what might be done to further reduce this.
It might be worth checking what is happening with the xfs inode
cache and overall memory usage and see if this change in read
behaviour correlates with memory reclaim being active?
> > FWIW, how much RAM do you have in the system, and what does
> > 'echo 200 > /proc/sys/fs/xfs/xfssyncd_centisecs' do to the
> > behaviour?
>
> System has 24G of ram. I’m guessing a move to 96 or 192G would
> help a lot.. in the end the system will have 36 of these 10TB
> drives.
24G should be enough if you're not walking the entire data set
periodically. The working set of inodes should be pretty small,
so the amount of memory being used to cache the working set
shouldn't be huge nor cause much trouble in terms of turnover.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com