From: Ric Wheeler <ricwheeler@gmail.com>
To: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Cc: linux-mm@kvack.org,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Mel Gorman <mgorman@suse.de>, Andreas Dilger <adilger@dilger.ca>,
	sage@inktank.com
Subject: Re: Linux Plumbers IO & File System Micro-conference
Date: Fri, 19 Jul 2013 15:57:37 -0400
Message-ID: <51E99A31.2070208@gmail.com>
In-Reply-To: <51E998E0.10207@itwm.fraunhofer.de>

On 07/19/2013 03:52 PM, Bernd Schubert wrote:
> Hello Ric, hi all,
>
> On 07/12/2013 07:20 PM, Ric Wheeler wrote:
>>
>> If you have topics that you would like to add, wait until the
>> instructions get posted at the link above. If you are impatient, feel
>> free to email me directly (but probably best to drop the broad mailing
>> lists from the reply).
>
> Sorry, this will be a rather long introduction; the short conclusion is below.
>
>
> Introduction to the meta-cache issue:
> =====================================
> For quite a while we have been redesigning our FhGFS storage layout to work
> around meta-cache issues of the underlying file systems. However, there are
> constraints, as data and meta-data are distributed between several
> targets/servers. Other distributed file systems, such as Lustre and (I think)
> cephfs, should have similar issues.
>
> So the main issue we have is that streaming reads/writes evict meta-data
> pages from the page cache. This results in lots of directory-block reads when
> creating files. FhGFS, Lustre and (I believe) cephfs all use hash-directories
> to store object files. Access to files in these hash-directories is rather
> random, and with an increasing number of files, access to the hash-directory
> blocks/pages also becomes entirely random. Streaming IO easily evicts these
> pages, which results in high latencies when users perform file
> creates/deletes, as the corresponding directory blocks have to be re-read
> from disk again and again.
>
> Now one could argue that hash-directories are a poor choice, and indeed we
> are mostly solving that issue in FhGFS now (currently stable release on the
> meta side, upcoming release on the data/storage side). However, given the
> problem of distributed meta-data and distributed data, we have not yet found
> a way to entirely eliminate hash directories. For example, recently one of
> our users created 80 million directories with only one or two files each, and
> even with the new layout that would still be an issue. It is even an issue
> with direct access to the underlying file system. Of course, essentially
> empty directories should be avoided altogether, but users have their own ways
> of doing IO.
>
> Furthermore, the meta-cache vs. streaming-cache issue is not limited to
> directory blocks; any cached meta-data are affected. Mel recently wrote a few
> patches to improve meta-data caching ("Obey mark_page_accessed hint given by
> filesystems"), but at least for our directory-block issue that doesn't seem
> to help.
>
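
[Editorial note: to make the access pattern concrete, here is a minimal C
sketch of the kind of hash-directory layout described above. It is not FhGFS
or Lustre code; the directory count, path format and hash are assumptions,
chosen only to match the 16384 hash-dirs used in the example further down. The
point is that object IDs map roughly uniformly across many directories, so the
directory blocks touched by creates and deletes are effectively random.]

/* Hypothetical sketch, not FhGFS/Lustre code: map an object ID onto one of
 * NUM_HASH_DIRS hash directories.  Because object IDs are effectively
 * random, the directory blocks touched by creates/deletes are random too. */
#include <stdio.h>
#include <stdint.h>

#define NUM_HASH_DIRS 16384	/* matches the hash-dir count in the example */

static void object_path(uint64_t object_id, char *buf, size_t len)
{
	unsigned int dir = (unsigned int)(object_id % NUM_HASH_DIRS);

	snprintf(buf, len, "chunks/%04x/%llx", dir,
		 (unsigned long long)object_id);
}

int main(void)
{
	char path[64];

	object_path(123456789ULL, path, sizeof(path));
	printf("%s\n", path);	/* -> chunks/0d15/75bcd15 */
	return 0;
}
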
> Conclusion:
> ===========
> From my point of view, there should be a small, but configurable, number of
> pages reserved for meta-data only. If streaming IO were not able to evict
> these pages, our and other file systems' meta-cache issues would probably be
> solved entirely.
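
[Editorial note: as a toy illustration of that idea only, not kernel code and
not a proposed patch; all names and numbers below are made up. A tiny LRU
"cache" in which META_RESERVED slots are guaranteed to meta-data entries, so a
flood of streaming-data insertions can never evict the last meta-data pages.]

/* Toy model of the idea, not kernel code: a tiny LRU "page cache" where
 * META_RESERVED slots are guaranteed to meta-data entries, so streaming
 * data insertions can never evict the last meta-data pages. */
#include <stdio.h>
#include <stdbool.h>

#define SLOTS         8
#define META_RESERVED 2		/* hypothetical "reserved for meta-data" count */

struct slot {
	bool used;
	bool is_meta;
	int  id;
	unsigned long last_use;
};

static struct slot cache[SLOTS];
static unsigned long clock_tick;

static int count_meta(void)
{
	int i, n = 0;

	for (i = 0; i < SLOTS; i++)
		if (cache[i].used && cache[i].is_meta)
			n++;
	return n;
}

static void insert(int id, bool is_meta)
{
	int i, victim = -1;
	int meta = count_meta();

	clock_tick++;
	for (i = 0; i < SLOTS; i++) {
		if (!cache[i].used) {		/* free slot: no eviction */
			victim = i;
			break;
		}
		/* Meta entries inside the reserve are never eviction candidates. */
		if (cache[i].is_meta && meta <= META_RESERVED)
			continue;
		if (victim < 0 || cache[i].last_use < cache[victim].last_use)
			victim = i;
	}
	if (victim < 0)				/* only protected meta entries left */
		return;
	cache[victim] = (struct slot){ true, is_meta, id, clock_tick };
}

int main(void)
{
	int i;

	insert(1, true);			/* two meta-data "pages" */
	insert(2, true);
	for (i = 100; i < 200; i++)		/* lots of streaming data */
		insert(i, false);

	for (i = 0; i < SLOTS; i++)
		if (cache[i].used && cache[i].is_meta)
			printf("meta page %d survived the streaming IO\n",
			       cache[i].id);
	return 0;
}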
>
>
> Example:
> ========
>
> Just a very basic bonnie++ test with 60000 files on ext4, with inlined data
> to reduce block and bitmap lookups and writes.
>
> Entirely cached hash directories (16384 of them), populated with about 16
> million files, so roughly 1000 files per hash-dir.
>
>> Version  1.96       ------Sequential Create------ --------Random Create--------
>> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP
>>            60:32:32  1702  14  2025  12  1332   4  1873  16 2047  13  1266   3
>> Latency              3874ms    6645ms    8659ms     505ms 7257ms    9627ms
>> 1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms 
>>
>
>
> Now, after clients have done some streaming IO:
>
>> Version  1.96       ------Sequential Create------ --------Random Create--------
>> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP
>>            60:32:32   541   4  2343  16  2103   6   586   5 1947  13  1603   4
>> Latency               190ms     166ms    3459ms    6762ms 6518ms    9185ms
>
>
> With longer/more streaming IO, that can go down to 25 creates/s. iostat and
> btrace then show lots of meta-data reads, which correspond to directory-block
> reads.
>
> Now after running 'find' over these hash directories to re-read all blocks:
>
>> Version  1.96       ------Sequential Create------ --------Random Create--------
>> fslab3              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP  /sec %CP
>>            60:32:32  1878  16  2766  16  2464   7  1506  13 2054  13  1433   4
>> Latency               349ms     164ms    1594ms    7730ms 6204ms    8112ms
>
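
[Editorial note: for reference, the effect of that 'find' pass can also be
reproduced from a small program. A sketch using nftw(3), which simply walks
the hash directories so that their directory blocks and inodes are read back
into the page cache; "/data/chunks" is a made-up example path.]

/* Sketch: walk the hash directories so their directory blocks are read
 * back into the page cache, mimicking the 'find' run above. */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <sys/stat.h>
#include <stdio.h>

static unsigned long entries;

static int visit(const char *path, const struct stat *sb,
		 int typeflag, struct FTW *ftwbuf)
{
	(void)path; (void)sb; (void)typeflag; (void)ftwbuf;
	entries++;		/* the stat() done by nftw() is the whole point */
	return 0;		/* keep walking */
}

int main(void)
{
	if (nftw("/data/chunks", visit, 64, FTW_PHYS) == -1) {
		perror("nftw");
		return 1;
	}
	printf("touched %lu directory entries\n", entries);
	return 0;
}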
>
>
> Would a dedicated meta-cache be a topic for discussion?
>
>
> Thanks,
> Bernd
>

Hi Bernd,

I think that sounds like an interesting idea to discuss - can you add a proposal 
here:

http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/proposals

Thanks!

Ric



