From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mark Nelson
Subject: Re: Initial performance cluster SimpleMessenger vs AsyncMessenger results
Date: Tue, 13 Oct 2015 10:52:51 -0500
Message-ID: <561D28D3.3060107@redhat.com>
References: <561BE4E3.7050404@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:47323 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752702AbbJMPwy (ORCPT ); Tue, 13 Oct 2015 11:52:54 -0400
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Gregory Farnum
Cc: ceph-devel , "ceph-users@lists.ceph.com"

On 10/12/2015 11:12 PM, Gregory Farnum wrote:
> On Mon, Oct 12, 2015 at 9:50 AM, Mark Nelson wrote:
>> Hi Guy,
>>
>> Given all of the recent data on how different memory allocator
>> configurations improve SimpleMessenger performance (and the effect of memory
>> allocators and transparent hugepages on RSS memory usage), I thought I'd run
>> some tests looking how AsyncMessenger does in comparison. We spoke about
>> these a bit at the last performance meeting but here's the full write up.
>> The rough conclusion as of right now appears to be:
>>
>> 1) AsyncMessenger performance is not dependent on the memory allocator like
>> with SimpleMessenger.
>>
>> 2) AsyncMessenger is faster than SimpleMessenger with TCMalloc + 32MB (ie
>> default) thread cache.
>>
>> 3) AsyncMessenger is consistently faster than SimpleMessenger for 128K
>> random reads.
>>
>> 4) AsyncMessenger is sometimes slower than SimpleMessenger when memory
>> allocator optimizations are used.
>>
>> 5) AsyncMessenger currently uses far more RSS memory than SimpleMessenger.
>>
>> Here's a link to the paper:
>>
>> https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view
>
> Can you clarify these tests a bit more? I can't make the number of
> nodes, OSDs, and SSDs work out properly.
> Were the FIO jobs 256 concurrent ops per job, or in aggregate? Is
> there any more info that might suggest why the 128KB rand-read (but
> not read nor write, and not 4k rand-read) was so asymmetrical?
>

Hi Greg,

Resending this to the list for posterity, as I realized I only sent it
to you earlier:

- 4 Nodes
- 4 P3700s per node
- 4 OSDs per P3700 (Similar to Intel's setup in Jiangang and Jian's paper)

Each node also acted as an fio client using the librbd engine:

- 4 Nodes
- 2 volumes per node
- 1 fio process per volume
- 32 concurrent IOs per fio process

The 128KB random read results are interesting. In memory allocator
tests I saw performance decrease with more threadcache or when TCMalloc
was used, and in the past I've seen odd performance characteristics
around this IO size. I think it must be a difficult case for the memory
allocator to handle consistently well, and AsyncMessenger maybe just
sidesteps the problem.

Mark
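
For anyone checking the arithmetic, the layout listed above can be multiplied out with a quick sketch. This assumes every node is configured identically, which the lists imply but do not state outright. The constant and function names are illustrative only and are not taken from the test harness.

```python
# Totals implied by the cluster and client layout described in the message.
# Assumption: all four nodes are configured identically.
# All names below are illustrative, not from any Ceph or fio tooling.

NODES = 4
SSDS_PER_NODE = 4          # Intel P3700s per node
OSDS_PER_SSD = 4

VOLUMES_PER_NODE = 2       # RBD volumes driven from each node
FIO_PROCS_PER_VOLUME = 1
IOS_PER_FIO_PROC = 32      # concurrent IOs per fio process


def cluster_totals():
    """Return the aggregate counts for the server and client sides."""
    ssds = NODES * SSDS_PER_NODE
    osds = ssds * OSDS_PER_SSD
    fio_procs = NODES * VOLUMES_PER_NODE * FIO_PROCS_PER_VOLUME
    concurrent_ios = fio_procs * IOS_PER_FIO_PROC
    return {
        "ssds": ssds,
        "osds": osds,
        "fio_procs": fio_procs,
        "concurrent_ios": concurrent_ios,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

Under that assumption this gives 16 SSDs, 64 OSDs, 8 fio processes, and 256 concurrent IOs, so the 256 figure in the question is the aggregate across all clients rather than a per-job queue depth.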