From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ric Wheeler
Subject: Re: Linux Plumbers IO & File System Micro-conference
Date: Fri, 19 Jul 2013 15:57:37 -0400
Message-ID: <51E99A31.2070208@gmail.com>
References: <51E03AFB.1000000@gmail.com> <51E998E0.10207@itwm.fraunhofer.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: linux-mm@kvack.org, Linux FS Devel, Mel Gorman, Andreas Dilger, sage@inktank.com
To: Bernd Schubert
Return-path:
In-Reply-To: <51E998E0.10207@itwm.fraunhofer.de>
Sender: owner-linux-mm@kvack.org
List-Id: linux-fsdevel.vger.kernel.org

On 07/19/2013 03:52 PM, Bernd Schubert wrote:
> Hello Ric, hi all,
>
> On 07/12/2013 07:20 PM, Ric Wheeler wrote:
>>
>> If you have topics that you would like to add, wait until the
>> instructions get posted at the link above. If you are impatient, feel
>> free to email me directly (but probably best to drop the broad mailing
>> lists from the reply).
>
> Sorry, this will be a rather long introduction; the short conclusion is
> below.
>
>
> Introduction to the meta-cache issue:
> =====================================
> For quite a while we have been redesigning our FhGFS storage layout to
> work around meta-cache issues of the underlying file systems. However,
> there are constraints, as data and meta-data are distributed between
> several targets/servers. Other distributed file systems, such as Lustre
> and (I think) cephfs, should have similar issues.
>
> The main issue we have is that streaming reads/writes evict meta-data
> pages from the page-cache, which results in lots of directory-block reads
> when creating files. FhGFS, Lustre and (I believe) cephfs all use hash
> directories to store object files. Access to files in these hash
> directories is rather random, and as the number of files grows, access to
> the hash directory-blocks/pages also becomes entirely random. Streaming
> IO easily evicts these pages, which results in high latencies when users
> perform file creates/deletes, as the corresponding directory blocks have
> to be re-read from disk again and again.
> Now one could argue that hash directories are a poor choice, and indeed
> we are mostly solving that issue in FhGFS now (in the current stable
> release on the meta side, and in the upcoming release on the data/storage
> side).
> However, given the problem of distributed meta-data and distributed data,
> we have not yet found a way to entirely eliminate hash directories. For
> example, one of our users recently created 80 million directories with
> only one or two files in each of them, and even with the new layout that
> would still be an issue. It is even an issue with direct access on the
> underlying file system. Of course, mostly empty directories should be
> avoided altogether, but users have their own ways of doing IO.
> Furthermore, the meta-cache vs. streaming-cache issue is not limited to
> directory blocks; all cached meta-data are affected. Mel recently wrote a
> few patches to improve meta-caching ("Obey mark_page_accessed hint given
> by filesystems"), but at least for our directory-block issue they do not
> seem to help.
>
> Conclusion:
> ===========
> From my point of view, there should be a small, but configurable, number
> of pages reserved for meta-data only. If streaming IO could not evict
> these pages, our meta-cache issues, and those of other file systems,
> would probably be entirely solved.
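
For context on the conclusion above: the closest user-space mitigation
available today is to have the streaming application itself drop its pages
after use, rather than reserving meta-data pages in the kernel. A minimal
sketch in C, assuming the application knows which file descriptors carry
streaming IO (the helper name is hypothetical):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: after streaming through a file, advise the kernel
 * that its page-cache pages will not be needed again, so they stop
 * competing with cached directory blocks. This depends on every streaming
 * client cooperating, which is exactly what a reserved meta-data page
 * pool would avoid. */
static int drop_streamed_pages(int fd)
{
        /* Write out dirty pages first; DONTNEED drops only clean pages. */
        if (fdatasync(fd) != 0)
                return -1;
        /* offset 0 with len 0 covers the whole file. */
        return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}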
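
To make the random directory-block access pattern concrete before the
example below: a minimal sketch of the kind of hash-directory placement
described above. The directory count of 16384 matches the example, but the
hash function and path format are illustrative assumptions, not FhGFS's
actual on-disk scheme:

#include <stdio.h>
#include <stdint.h>

#define NUM_HASH_DIRS 16384     /* matches the example below */

/* Toy hash; real file systems use stronger functions. */
static uint32_t hash_object_id(uint64_t object_id)
{
        return (uint32_t)((object_id * 2654435761ULL) % NUM_HASH_DIRS);
}

/* With millions of objects spread over 16384 directories, consecutive
 * creates land in essentially random directories, so low create latency
 * requires all hash-directory blocks to stay cached. */
static void object_path(uint64_t object_id, char *buf, size_t len)
{
        snprintf(buf, len, "chunks/%04x/%llu",
                 (unsigned int)hash_object_id(object_id),
                 (unsigned long long)object_id);
}

int main(void)
{
        char path[64];

        object_path(123456789ULL, path, sizeof(path));
        printf("%s\n", path);   /* one of 16384 possible hash dirs */
        return 0;
}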
>
>
> Example:
> ========
>
> Just a very simple bonnie++ test with 60000 files, on ext4 with inlined
> data to reduce block and bitmap lookups and writes.
>
> Entirely cached hash directories (16384), which are populated with about
> 16 million files, so about 1000 files per hash-dir.
>
>> Version 1.96       ------Sequential Create------ --------Random Create--------
>> fslab3             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files:max:min       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>      60:32:32      1702  14  2025  12  1332   4  1873  16  2047  13  1266   3
>> Latency              3874ms    6645ms    8659ms     505ms    7257ms    9627ms
>> 1.96,1.96,fslab3,1,1374655110,,,,,,,,,,,,,,60,32,32,,,1702,14,2025,12,1332,4,1873,16,2047,13,1266,3,,,,,,,3874ms,6645ms,8659ms,505ms,7257ms,9627ms
>>
>
>
> Now after clients did some streaming IO:
>
>> Version 1.96       ------Sequential Create------ --------Random Create--------
>> fslab3             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files:max:min       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>      60:32:32       541   4  2343  16  2103   6   586   5  1947  13  1603   4
>> Latency               190ms     166ms    3459ms    6762ms    6518ms    9185ms
>
>
> With longer/more streaming, that can go down to 25 creates/s. iostat and
> btrace then show lots of meta-data reads, which correspond to
> directory-block reads.
>
> Now, after running 'find' over these hash directories to re-read all
> blocks:
>
>> Version 1.96       ------Sequential Create------ --------Random Create--------
>> fslab3             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>> files:max:min       /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>>      60:32:32      1878  16  2766  16  2464   7  1506  13  2054  13  1433   4
>> Latency               349ms     164ms    1594ms    7730ms    6204ms    8112ms
>
>
> Would a dedicated meta-cache be a topic for discussion?
>
>
> Thanks,
> Bernd
>

Hi Bernd,

I think that sounds like an interesting idea to discuss - can you add a
proposal here:

http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/proposals

Thanks!

Ric

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org